Tuesday, June 07, 2022

Power Bi - Filtering by a list of values

When you need to filter a Power Bi report by multiple values, and your users want to copy and paste the list of values to filter by, there isn't an out-of-the-box way to do it in Power Bi.

I recently stumbled upon an AppSource visual called "Mass Filter" and it does exactly that!

And a quick video that shows you how to use it: Power Bi - Filter by Multiple Comma Separated Values - YouTube

Friday, June 03, 2022

Understanding PowerBi Incremental Refresh with Detect Changes

Incremental refresh involves partitioning the data, a concept the documentation does not dwell on much. Because the field you use for partitioning can make or break incremental refresh, not understanding the underlying concepts can cause weird behavior later on. Most often the issue is duplicate records or ghost records. These issues typically surface months after incremental refresh is implemented, as the archive and incremental refresh periods are usually measured in months.

Here is some typical account data that one might encounter in a CRM system. I will be using this data to illustrate how Incremental Refresh works.

| Id | Name | Created Date | Last Modified Date |
|---|---|---|---|
| 1 | Microsoft | 2022-01-01 | 2022-01-02 |
| 2 | Sales Force | 2022-02-01 | 2022-02-02 |
| 3 | Hitachi | 2022-03-01 | 2022-03-01 |
| 4 | NCM | 2022-04-01 | 2022-04-05 |
| 5 | Tesla | 2022-05-01 | 2022-06-03 |

Picking the partitioning field

Incremental refresh works by storing the data in separate partitions. If you choose "months" then a partition is created for each month. These partitions are important for PowerBi as when data needs to be refreshed, the entire partition's data is reloaded. When a partition is older than the number of months you wish to archive, then that partition is dropped. The partitions are always created using a date column.

The first thing to remember is that one needs to pick a date-time column that does not change (invariant). This means that one cannot use the LastModifiedDate column, which many systems capture, as that will change all the time. Typically, one will end up using the CreatedDate field, as that should never change once the record has been created. 

Next you need to create 2 parameters named "RangeStart" and "RangeEnd". These need to be of type Date/Time. You will use these 2 parameters to filter the above table by its CreatedDate. While developing your dataset on your computer, you can use any values you like for RangeStart and RangeEnd; they will be overridden upon the first refresh that is performed in PowerBi.com.


This is what the formula looks like:

= Table.SelectRows(#"Reordered Columns", each [createddate] >= RangeStart and [createddate] < RangeEnd)
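To make the filter's boundary behavior concrete, here is a minimal Python sketch of the same half-open comparison (>= RangeStart and < RangeEnd). The dates are hypothetical January 2022 boundaries picked for illustration; this is not what PowerBi runs internally.

```python
from datetime import datetime

def in_range(created, range_start, range_end):
    """Half-open filter: range_start is included, range_end is excluded."""
    return range_start <= created < range_end

# Hypothetical boundaries for a January 2022 partition.
range_start = datetime(2022, 1, 1)
range_end = datetime(2022, 2, 1)

created_dates = [
    datetime(2022, 1, 1),           # exactly on RangeStart: kept
    datetime(2022, 1, 31, 23, 59),  # last minute of January: kept
    datetime(2022, 2, 1),           # exactly on RangeEnd: belongs to February
]
january_rows = [d for d in created_dates if in_range(d, range_start, range_end)]
print(january_rows)
```

Because one side is inclusive and the other exclusive, a row that sits exactly on a month boundary lands in exactly one partition, never in two.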

The next step is to set up the incremental refresh.





For this example, I am using "Archive data" set to 6 months and "Incrementally refresh data" starting 3 months before the refresh date. This makes it easier for me to illustrate the way incremental refresh works. You will likely use very different values in your production datasets.

Take note of the fact that I have set the "detect data changes" option. This allows you to fine-tune the incremental refresh even more (and it is a setting I think you should definitely use). Instead of refreshing all the partitions in the "Incremental Refresh" period, only those partitions that have changes are refreshed. But how does PowerBi know which partitions changed? You use a field that is updated any time a record is updated (an audit date column). In most systems this is a field called LastModifiedDate. PowerBi keeps track of the max value of this field for each Incremental Refresh partition. The next time the dataset is refreshed, only those partitions are reloaded where the source data returns a higher LastModifiedDate than the one PowerBi captured the last time it ran.

SalesForce note: you should use the SystemModstamp column instead of the LastModifiedDate column, as even backend process changes will update SystemModstamp, whereas LastModifiedDate is typically only updated by user-based changes.

With the settings as defined above (3 months of incremental refresh and another 3 months of archived data, for a total of 6 months of data), on the very first refresh in PowerBi.com, we will see 3 + (3*2) queries getting fired off (3 for the archived months and 6 for the incremental refresh months). The reason you see the additional set of queries for the incremental refresh is that PowerBi will send 2 queries per partition (in this case, per month). These queries will look like this:

select max("systemmodstamp") as "C1"
from (
    select "systemmodstamp"
    from account
    where "createddate" >= timestamp '2022-01-01 00:00:00'
      and "createddate" < timestamp '2022-02-01 00:00:00'
) as "ITBL"

select "id",
    "name",
    "isdeleted",
    "createddate",
    "systemmodstamp"
from account
where "createddate" >= timestamp '2022-01-01 00:00:00'
  and "createddate" < timestamp '2022-02-01 00:00:00'

The 1st query is on the "detect data changes" column (systemmodstamp) and gets the max value for that partition. The 2nd query gets the data for that partition. After this first refresh in PowerBi.com, all 6 months of data are loaded into the dataset.
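The query count above is just arithmetic, and a tiny Python sketch makes it easy to redo for your own settings. The function name and parameters are hypothetical, and the counting rule (one data query per partition plus one max-value query per incremental partition) is taken from the behavior described in this post, not from any official specification.

```python
def first_refresh_query_count(archive_only_months, incremental_months,
                              detect_changes=True):
    """Estimate the queries sent on the very first refresh.

    Every partition gets one data query. With "detect data changes" on,
    each incremental partition also gets one max(audit column) query.
    """
    data_queries = archive_only_months + incremental_months
    max_value_queries = incremental_months if detect_changes else 0
    return data_queries + max_value_queries

# 3 archived-only months + 3 incremental months, as in this example.
print(first_refresh_query_count(3, 3))  # 9
```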

Now on every subsequent refresh, PowerBi will first send queries similar to this one for each partition that is part of the "Incremental Refresh" period. In our case it will run the queries for the months of April, May and June.

select max("systemmodstamp") as "C1"
from (
    select "systemmodstamp"
    from account
    where "createddate" >= timestamp '2022-04-01 00:00:00'
      and "createddate" < timestamp '2022-05-01 00:00:00'
) as "ITBL"

PowerBi will then compare these max values to the previous max values it retrieved. PowerBi will only refresh those partitions for which the new max values are different from the previous max values on the detect-changes-column. By doing this, PowerBi can be even more efficient about which partitions it is reloading. If you had not used the detect-data-changes option, then for every subsequent refresh, PowerBi would reload all 3 "incremental refresh" partitions.
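Here is a rough Python sketch of that comparison step. The partition keys and the stored/current max values are hypothetical and the function name is mine; it only illustrates the "refresh the partition if its max value changed" idea, not PowerBi's actual implementation.

```python
def partitions_to_refresh(stored_max, current_max):
    """Return the partitions whose max audit value differs from the stored one."""
    return [
        partition
        for partition, new_max in current_max.items()
        if stored_max.get(partition) != new_max
    ]

# Max systemmodstamp per partition captured during the previous refresh
# (hypothetical values; June had no rows yet).
stored = {"2022-04": "2022-04-05", "2022-05": "2022-06-03", "2022-06": None}

# Max values returned by the source on this refresh (hypothetical values).
current = {"2022-04": "2022-06-04", "2022-05": "2022-06-03", "2022-06": "2022-06-04"}

print(partitions_to_refresh(stored, current))  # ['2022-04', '2022-06']
```

In this made-up run May is skipped, because its max value did not move, so only two of the three incremental partitions are reloaded.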

Now let's get back to the data we were loading. Initially the data looked like this (I have added the partition each record would be loaded into). Let's assume this load occurred on 6/3/2022.

| Id | Name | Created Date | Last Modified Date | Partition |
|---|---|---|---|---|
| 1 | Microsoft | 2022-01-01 | 2022-01-02 | January (archived) |
| 2 | Sales Force | 2022-02-01 | 2022-02-02 | February (archived) |
| 3 | Hitachi | 2022-03-01 | 2022-03-01 | March (archived) |
| 4 | NCM | 2022-04-01 | 2022-04-05 | Apr (Incr. Refresh) |
| 5 | Tesla | 2022-05-01 | 2022-06-03 | May (Incr. Refresh) |

Let's assume the following changes happened on the next day (6/4/2022):

| Id | Name | Created Date | Last Modified Date | Partition |
|---|---|---|---|---|
| 1 | Microsoft (deleted) | 2022-01-01 | 2022-06-04 | January (archived) |
| 2 | SalesForce (SFDC) | 2022-02-01 | 2022-02-02 | February (archived) |
| 3 | Hitachi | 2022-03-01 | 2022-03-01 | March (archived) |
| 4 | National CineMedia | 2022-04-01 | 2022-06-04 | Apr (Incr. Refresh) |
| 5 | Tesla/Twitter | 2022-05-01 | 2022-06-04 | May (Incr. Refresh) |
| 6 | Disney | 2022-06-04 | 2022-06-04 | June (Incr. Refresh) |

In the above case, because rows 1 and 2 are in archived partitions, they will not be reloaded. This can be a problem if you support soft-deletes, as your report now potentially has one more record than the source (row 1 was soft-deleted in the source system). The name change on row 2 will also not show up in your report. The only changes that will be loaded are the changes to rows 4, 5 and 6.
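The same reasoning as a small Python sketch, using the rows from the table above. The only rule applied is "a row can be reloaded only if its Created Date falls in an incremental refresh partition"; the cut-off constant and variable names are my own for illustration.

```python
from datetime import date

# First day of the oldest incremental partition (April) on 6/4/2022.
INCREMENTAL_START = date(2022, 4, 1)

# (id, name, created date) as of 6/4/2022, from the table above.
rows = [
    (1, "Microsoft (deleted)", date(2022, 1, 1)),
    (2, "SalesForce (SFDC)", date(2022, 2, 1)),
    (3, "Hitachi", date(2022, 3, 1)),
    (4, "National CineMedia", date(2022, 4, 1)),
    (5, "Tesla/Twitter", date(2022, 5, 1)),
    (6, "Disney", date(2022, 6, 4)),
]

# Only rows created inside the incremental window can pick up changes.
reloaded = [row_id for row_id, _, created in rows if created >= INCREMENTAL_START]
not_reloaded = [row_id for row_id, _, created in rows if created < INCREMENTAL_START]

print("reloaded:", reloaded)          # [4, 5, 6]
print("not reloaded:", not_reloaded)  # [1, 2, 3]
```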

Finally, when July rolls around, the January partition will be dropped entirely from the dataset and a July Partition will be added.

| Id | Name | Created Date | Last Modified Date | Partition | Note |
|---|---|---|---|---|---|
| 1 | Microsoft (deleted) | 2022-01-01 | 2022-06-04 | January (archived) | Partition is dropped |
| 2 | SalesForce (SFDC) | 2022-02-01 | 2022-02-02 | February (archived) | |
| 3 | Hitachi | 2022-03-01 | 2022-03-01 | March (archived) | |
| 4 | National CineMedia | 2022-04-01 | 2022-06-04 | Apr (archived) | Partition type changed |
| 5 | Tesla/Twitter | 2022-05-01 | 2022-06-04 | May (Incr. Refresh) | |
| 6 | Disney | 2022-06-04 | 2022-06-04 | June (Incr. Refresh) | |
| 7 | Paramount Studios | 2022-07-04 | 2022-07-04 | July (Incr. Refresh) | Partition is added |
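To see how the window rolls forward, here is a Python sketch that lists the monthly partitions for a given refresh date. It assumes, as in the tables in this post, that the current month counts as the newest incremental partition and that the window is 6 months in total with the newest 3 being incremental. The function name and the "YYYY-MM" labels are hypothetical.

```python
from datetime import date

def month_partitions(refresh_date, total_months=6, incremental_months=3):
    """Return {'YYYY-MM': 'archived' or 'incremental'} for the rolling window."""
    # Count months on a single axis so year boundaries are handled for free.
    index = refresh_date.year * 12 + (refresh_date.month - 1)
    window = {}
    for offset in range(total_months - 1, -1, -1):  # oldest month first
        year, month0 = divmod(index - offset, 12)
        kind = "incremental" if offset < incremental_months else "archived"
        window[f"{year}-{month0 + 1:02d}"] = kind
    return window

june = month_partitions(date(2022, 6, 3))
july = month_partitions(date(2022, 7, 4))

print(june)
print(july)
```

Comparing the two results shows the three effects from the table: January disappears, April flips from incremental to archived, and July is added.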

Incremental Refresh and CRM data:

When you use "CreatedDate" as your partition field, one thing you have to think about is what period of data you need refreshed. Imagine you set up your Incremental Refresh period to be 12 months and your Archive period to be 24 months. On an entity like "Account", accounts may get created and then not get sold to or touched for a long time. So, if the account was created more than 12 months ago, then when it finally gets used and changes occur on that entity, you would not see those changes. In the above example, Microsoft is no longer even in the dataset, as its created date is outside the archival range. If Microsoft had an order on 2022-06-04, you would see the order's details, but the order would not have the account details, as the account is no longer being loaded.

For this reason, one needs to know the maximum time period between the last-modified date on an entity and its created date. You would then have to set up your refresh settings based on the usage statistics of the entities at your company. This is also one of the biggest reasons that it is very hard to set up an incremental refresh with CRM data.
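One way to get that number is to measure the gap on your existing data. Below is a minimal Python sketch using the account rows from the 6/4/2022 table; in practice you would run the equivalent aggregate against your source system. The function name is hypothetical.

```python
from datetime import date

# (created date, last modified date) pairs from the 6/4/2022 table.
accounts = [
    (date(2022, 1, 1), date(2022, 6, 4)),  # Microsoft
    (date(2022, 2, 1), date(2022, 2, 2)),  # SalesForce
    (date(2022, 3, 1), date(2022, 3, 1)),  # Hitachi
    (date(2022, 4, 1), date(2022, 6, 4)),  # National CineMedia
    (date(2022, 5, 1), date(2022, 6, 4)),  # Tesla/Twitter
    (date(2022, 6, 4), date(2022, 6, 4)),  # Disney
]

def max_gap_days(records):
    """Largest number of days between creation and last modification."""
    return max((modified - created).days for created, modified in records)

print(max_gap_days(accounts))
```

If the largest gap is longer than your incremental refresh period, some changes will land in archived partitions and never show up in the report.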

When it comes to Order/Opportunity data, you have a little more leeway. Typically one should use a field like "Order-Date" and not something like "Shipped-Date", as the latter could change. Also, you could institute a rule that states that orders/opportunities that are older than x months should be closed and a new one created.

What would happen if you use a partition column that might change

Let's take the following data as an example:

| Id | Order Name | Order Date | Last Modified Date | Order Value | Partition |
|---|---|---|---|---|---|
| 1 | My Big Order | 2021-01-01 | 2022-01-02 | $1 million | January |


Suppose we have a refresh setting where the partition field is set to "Order Date", and where we incrementally refresh data in the last 12 months and archive another 12 months of data. The above order would be in an "archived" partition. Now suppose the Order Date were changed to 2022-06-01:

| Id | Order Name | Order Date | Last Modified Date | Order Value | Partition |
|---|---|---|---|---|---|
| 1 | My Big Order | 2022-06-01 | 2022-06-01 | $1.5 million | January |


This same record would now be loaded into an "Incremental Refresh" partition, while the old copy stays untouched in the archived partition. So you would have 2 copies of the same order in your dataset and your order value would be overstated by $1 million. One could try to de-dup this data, but PowerBi is very inefficient at de-duping large record sets. Also, your algorithm would have to be smart, as you would want to keep the latest version of the order.
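To show what "keep the latest version" would mean, here is a small Python sketch of the de-dup logic on the two copies of the order. It is only an illustration of the rule (keep the row with the newest Last Modified Date per Id), with hypothetical field names; it is not a recommendation to do this inside PowerBi.

```python
# Two copies of the same order: the stale archived one and the reloaded one.
copies = [
    {"id": 1, "order_date": "2021-01-01", "last_modified": "2022-01-02", "value": 1_000_000},
    {"id": 1, "order_date": "2022-06-01", "last_modified": "2022-06-01", "value": 1_500_000},
]

def keep_latest(rows):
    """Keep one row per id: the one with the newest last_modified value."""
    latest = {}
    for row in rows:
        current = latest.get(row["id"])
        # ISO dates (yyyy-mm-dd) sort correctly as plain strings.
        if current is None or row["last_modified"] > current["last_modified"]:
            latest[row["id"]] = row
    return list(latest.values())

deduped = keep_latest(copies)
overstated_by = sum(r["value"] for r in copies) - sum(r["value"] for r in deduped)

print(deduped[0]["value"])  # 1500000
print(overstated_by)        # 1000000
```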

Important things to remember:

  1. The incremental refresh column must be a date/time column. You can use an integer column, but its values need to be in yyyymmdd format (so you can't use any arbitrary integer column).
  2. Hard deletes cannot be handled by incremental refresh. You can use it only if your entity implements soft-deletes (i.e., uses a flag to denote a delete). So if your system performs hard-deletes, then you are SOL!
  3. Make sure that query folding is supported by your data-source and that it is occurring, otherwise, you will end up pulling multiple copies of the data from the data-source and end up considerably slowing down your refresh and possibly inviting the wrath of your data-source administrators. (Don't ask me how I know!)
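For the yyyymmdd integer case in point 1, the idea is that a date is turned into an integer that still sorts in date order. Here is a minimal Python sketch of that conversion; the function name is hypothetical, and in a real dataset you would write the equivalent conversion in your query rather than in Python.

```python
from datetime import date

def date_key(d):
    """Convert a date to an integer in yyyymmdd form, e.g. 2022-06-07 -> 20220607."""
    return d.year * 10000 + d.month * 100 + d.day

print(date_key(date(2022, 6, 7)))  # 20220607

# The keys keep the same ordering as the dates, so range filters still work.
print(date_key(date(2021, 12, 31)) < date_key(date(2022, 1, 1)))  # True
```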

Some Tips and Notes:

  1. If you implement incremental refresh in your dataset, then know that you will not be able to download your PBIX file from powerbi.com. Plan for this ahead of time. Another important consideration is that if you implement incremental refresh in your dataset and you need to make model changes, then the partitions will be deleted and the entire data reloaded. This is another reason to not implement incremental refresh directly in your dataset.
  2. If you do need to implement incremental refresh in your dataset, then consider:
    1. Storing your PBIX file in a shared location and having an agreed-upon method for your team to work on that file. I highly recommend using SharePoint for this.
    2. You can publish your changes using a tool like "ALM Toolkit". This will not cause your dataset to lose its partitions. You can also use the deployment pipeline to push your changes in PowerBi, and again this will not delete your existing partitions.
  3. Oftentimes you need to implement incremental refresh because your source is very slow. In this case, you could implement the incremental refresh as a dataflow. You would pull the dataflow into your dataset. In this case, you may not have to implement incremental refresh in your dataset, as the data may load fast enough from the dataflow into your dataset.
  4. PowerBi will set up partitions based on your incremental refresh settings. When PowerBi refreshes data, it is refreshing the entire partition (i.e., it will delete the partition and reload the data from the data-source). So you need to carefully consider the incremental refresh settings and figure out how many rows on average might fall in a partition. If you partition by year and each year has a billion rows, then you might want to consider partitioning by month.

Thursday, June 02, 2022

Error Correction using PID (Proportional Integral Derivative) - The Math

PID, or Proportional Integral Derivative, is often used when you need an automated system to self-correct. Think, for example, of a robot that has to drive straight, follow a line, or self-balance.

Driving or Flying Straight: In this case the robot needs to keep to a certain heading or angle. Various forces might make it deviate from that heading (wind, wheel slip or friction, etc). The difference between the target heading and the actual heading is the error and PID can be used to correct that heading.

Following a line: In line following the robot attempts to follow a dark line to its target. In this case, typically a sensor that senses reflected light is used. The robot attempts to keep the reflected light within a certain range and tries to correct its direction to keep tracking the line. The difference between the amount of reflected light sensed and the target reflected light is the error value, and it is used to self-correct the direction the robot is travelling.

Self Balancing Robot: Think about a Segway-like robot that attempts to self-balance using 2 motors. In this case the robot's target orientation is upright and the error is the angle from the vertical plane. The robot then attempts to self-correct by applying a force towards the vertical plane. PID again provides the amount of self-correcting needed.

PID is most important in the self balancing example, as one cannot use a constant amount of correction, as the robot would likely topple. This is because while the robot is attempting to balance, the amount of deviation from the vertical plane will vary randomly.

Quad Copter: This is probably one of the most intuitive applications to work with. Think of a quad copter on the ground. You wish to have it take off and hover at a certain height (e.g., 10 feet). When the quad copter first takes off, its altitude sensor will start at 0 feet and the error is 10 feet. The quad copter increases its propeller speed and starts its journey to 10 feet. It will invariably overshoot the target altitude (say 11 feet), so it will have to slow down its propellers to return to its target of 10 feet. Wind and other factors will keep moving the copter away from its target altitude, so as the sensor returns its altitude readings, the quad copter will have to continuously change the propeller speed to increase or decrease its altitude.


PID is again a better algorithm to use for a quad-copter trying to attain a target height, compared to a constant correction amount, as it will allow it to get to its target height and maintain it better. 

In a PID-based system, the math is trying to reduce the error at every calculation step (or loop) and bring the system to its target value. The system typically never settles exactly on its target value and continuously fluctuates around it.


 

Even though PID uses fancy words like integral and derivative, the math itself is very simple.

Let's go through the PID math by its various components:

Proportional

One can use only the proportional part of PID (with I = 0 and D = 0) in many simple systems (eg: line following robot). The job of this component of the PID math is to provide a correction towards the target value.

Kp: constant of proportion (usually called the "proportional gain constant")

SP: Set Point (or the target value we want on the sensor)

PV(t): The process value at time t (or the current sensor value)

E(t) = SP - PV(t) : This is the error at any given time t. (or the error in the graph above).

The proportional value P(t) is calculated as Kp * E(t) or Kp * (SP - PV(t)).

P(t)     = Kp * E(t) 

P(t)     = Kp * (SP - PV(t))

In other words, it's the error multiplied by a certain tuning factor (Kp). Kp can be used to magnify or dampen the effect of the error on the driving components.

In the diagram below, it's Kp * 9ft (where Kp could be any value that you pick, other than 0!). If Kp = 1, then the value would be 9.
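As a quick check of the proportional math, here is a minimal Python sketch. The set point of 10 ft comes from the quad-copter example; the process value of 1 ft is my assumption, chosen so that the error works out to the 9 ft used above.

```python
def proportional(kp, set_point, process_value):
    """P(t) = Kp * (SP - PV(t))"""
    return kp * (set_point - process_value)

# Target of 10 ft, assumed current altitude of 1 ft, so the error is 9 ft.
print(proportional(1, 10, 1))    # 9
print(proportional(0.5, 10, 1))  # 4.5, a smaller Kp dampens the correction
```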



Integral

The job of the integral is to make larger corrections if the Proportional component is not providing enough of a correction. It accumulates the errors and applies them to the correction. Again, an integral gain constant is used to dampen or magnify the effect of this component.

Ki: integral gain constant

I(t) = Ki * E(t) + I(t-1)

Where I(t-1) is the I(t) that was calculated in the previous cycle. If the terms (t-1) and (t) are confusing, you can also write this as:

I = Ki * E + Iprev




In the above diagram if Ki = 1, then at time = 1:

I = 1 * 9 + 0 = 9     (0, as I starts at 0)

at time = 2,  (now I = 9)

I = 1 * 7 + 9 = 16
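The same accumulation as a short Python sketch, using the two error values (9 and 7) from the example above. The function name is hypothetical.

```python
def integral_series(errors, ki):
    """Apply I = Ki * E + Iprev for each error and return every I value."""
    values = []
    i_prev = 0  # I starts at 0
    for error in errors:
        i_prev = ki * error + i_prev
        values.append(i_prev)
    return values

print(integral_series([9, 7], ki=1))  # [9, 16]
```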

Derivative

The job of the derivative is to slow down the effect of the Proportional and Integral components. It attempts to minimize the overshoot that can happen as the previous two components attempt to bring the system to the target value. From the graph above, it's the difference between the 2 error values shown at (t-1) and t.

Kd: derivative gain constant

D(t) = Kd * (E(t) - E(t-1))

or alternatively:

D = Kd * (E - Eprev)


In the above example, if Kd = 1, then at t = 1:

D = 1 * (9-0) = 9

at t=2, E(t-1) or Eprev = 9:

D = 1 * (7-9) = 1 * -2 = -2
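And the same two steps as a Python sketch, again with the error values 9 and 7 and with the previous error starting at 0, as in the example above. The function name is hypothetical.

```python
def derivative_series(errors, kd):
    """Apply D = Kd * (E - Eprev) for each error and return every D value."""
    values = []
    e_prev = 0  # no previous error on the first step
    for error in errors:
        values.append(kd * (error - e_prev))
        e_prev = error
    return values

print(derivative_series([9, 7], kd=1))  # [9, -2]
```

Notice that D turns negative as soon as the error starts shrinking, which is what pulls back on the correction and limits overshoot.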

Putting it all together:

C(t) = P(t) + I(t) + D(t)    where C is the correction.

C(t) = Kp*E(t) + (Ki*E(t) + I(t-1)) + Kd*(E(t) - E(t-1))

Going back to our quad-copter example, this is what its flight might have looked like



In fancy math terms this is the PID equation:

$$C(t) = K_p E(t) + K_i \int_0^t E(\tau)\, d\tau + K_d \frac{dE(t)}{dt}$$

The first term is the Proportional component, the 2nd is the Integral component and the 3rd is the Derivative component.

And as an Algorithm:

  1. Kp = 1, Ki = 1, Kd = 1    (where the values can be different and tuned for your system)
  2. Target = 100              (this is the set-point, i.e., the target value you wish your system to achieve)
  3. CurrentValue = 0       (this is the current sensor measurement or Process value - PV)
  4. ErrPrev = 0
  5. I = 0
  6. while true
    1. Err = Target - CurrentValue(from Sensor)
    2. P = Kp * Err
    3. I = Ki * Err + I
    4. D = Kd * (Err - ErrPrev)
    5. ErrPrev = Err
    6. C = P + I + D
    7. Do something with C    (turn your robot by C degrees, etc).
The above loop is typically continuously evaluated until a certain condition is met (quad-copter has been asked to return home, or robot has reached its target or robot has been running for x minutes, etc).
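
The algorithm above could be sketched in Python roughly as follows. The class and method names are my own, and the sensor readings in the usage lines (1 and then 3, against a target of 10) are hypothetical values chosen so that the errors are the 9 and 7 used in the earlier examples. Treat it as a starting point, not a tuned controller.

```python
class PID:
    """Minimal PID calculator following the steps listed above."""

    def __init__(self, kp, ki, kd, target):
        self.kp = kp
        self.ki = ki
        self.kd = kd
        self.target = target      # the set point
        self.integral = 0.0       # I starts at 0
        self.prev_error = 0.0     # ErrPrev starts at 0

    def update(self, current_value):
        """Take one sensor reading and return the correction C = P + I + D."""
        error = self.target - current_value
        p = self.kp * error
        self.integral = self.ki * error + self.integral
        d = self.kd * (error - self.prev_error)
        self.prev_error = error
        return p + self.integral + d


pid = PID(kp=1, ki=1, kd=1, target=10)
print(pid.update(1))  # error 9: P=9, I=9,  D=9  -> 27.0
print(pid.update(3))  # error 7: P=7, I=16, D=-2 -> 21.0
```

In a real robot you would call update() once per loop with a fresh sensor reading and feed the returned value to your motors or steering.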


Wednesday, June 01, 2022

Lego 51515 - Tricky - Understanding the line following code

One of the activities one can code with the Tricky bot is the line follower. But how does the line following code work?

Here is the line following code that tricky uses:


The first line controls the loop execution and runs the line following code until the sensor senses the green piece that is part of the basket.

But what is the code within the loop doing? In a word: Proportional Line Following Algorithm (ok, 4 words!)

Let's break down the code into its various bits:

The various bits in that one line of code. The order of operations is bottom-up.


  1. The code uses the sensor in "reflected light" mode. In this mode, the color sensor reports back the amount of light being reflected back to the sensor. Remember dark colors reflect less light compared to light colors.
  2. Next up is the subtract block. The sensor value is one of the operands and the other operand here (the 60) is the target value.

    The idea here is that we want the robot to try and get to the target reflected light value of 60. If it senses more, then it needs to reduce it, and if it's less, then it needs to increase it. One can call this the "error" value.

  3. The next block is a multiply block. It multiplies the "error" by a constant (in this case 2). One can consider this entire expression up to here as the "correction value". Here is the complete line:


  4. Next up is the "start moving with steering at power" block. It will move the robot at the specified power (35%) and at the angle defined in the first field. The value we will use for the angle is the "correction value" that we calculated previously. This is what the final line looks like:

How does it all work?

First let's look at how the start moving block works. The first input to the block is the direction in which the robot is supposed to move. The values it can take are -100 to 100 (negative values = Left, positive values = Right, 0 = Straight).

The sensor outputs a value between 0 and 100 (0 on black, 100 on white).



The line following algorithm tries to follow the edge of the line, as this is a lot more reliable, than trying to stay completely on the white or on the black part of the line. You should try this as an exercise with your robot (Set target to 0)

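Here are the different values for the output angle based on a target value of 60.

Target: 60
Kp: 2

| Sensor | Error (Target - Sensor) | Angle Calculation | Angle | Direction |
|---|---|---|---|---|
| 55 | 5 | 2 x (55 - 60) | -10 | Left |
| 65 | -5 | 2 x (65 - 60) | 10 | Right |
| 60 | 0 | 2 x (60 - 60) | 0 | Straight |
| 70 | -10 | 2 x (70 - 60) | 20 | Right |
| 80 | -20 | 2 x (80 - 60) | 40 | Right |
| 50 | 10 | 2 x (50 - 60) | -20 | Left |
| 40 | 20 | 2 x (40 - 60) | -40 | Left |
| 100 | -40 | 2 x (100 - 60) | 80 | Right |
| 0 | 60 | 2 x (0 - 60) | -120 | Left |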

One thing to note is that with a target value = 60 and Kp = 2, the robot will follow the line when placed on the left side of the line. This is because, values above 60 (more white), will make the robot turn right (towards the line) and values below 60 (more black), will make the robot turn left. If you wanted the robot to travel on the right side of the line what do you think you will have to change? (see answer at the bottom). 
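If you prefer code to a spreadsheet, the angle calculation can be sketched in Python like this. The function names are hypothetical; the formula is the Kp x (Sensor - Target) calculation from the table above, with Target = 60 and Kp = 2 as defaults. What the block does with values outside -100 to 100 is not something this sketch tries to model.

```python
def steering(sensor, target=60, kp=2):
    """Steering angle = Kp * (sensor reading - target)."""
    return kp * (sensor - target)

def direction(angle):
    """Negative angles steer left, positive steer right, 0 is straight."""
    if angle < 0:
        return "Left"
    if angle > 0:
        return "Right"
    return "Straight"

for sensor in [55, 65, 60, 70, 80, 50, 40, 100, 0]:
    angle = steering(sensor)
    print(sensor, angle, direction(angle))
```

Try changing target and kp to see how the steering reacts before you change the values on your robot.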

You can play with the above calculations using Google Sheets: https://docs.google.com/spreadsheets/d/1RHkM3w1KmZSh7D-QXvpsda7rRfRzInH4p_jKrHQJQg0/edit?usp=sharing

When you use this piece of code, you might have to change the value of the target (in this case 60) and the multiplier. 

Summary:

Why is this called a proportional line follower? As the robot drifts away from the line, the error value will become larger. We make this error value even larger by using a constant as a multiplier. This error value is then used as the angle to turn the robot. So the larger the difference, the bigger the angle that will be applied for steering. Hence the proportional line follower!

So where does the math for all of this come from? There is a concept called PID (Proportional-Integral-Derivative), and this concept is used in a lot of robotic controllers. In the line-follower, we used just the proportional component of the math. PID uses the following complicated-looking formula:

$$C(t) = K_p E(t) + K_i \int_0^t E(\tau)\, d\tau + K_d \frac{dE(t)}{dt}$$

We used just the first term of the above equation in our code:

$$C(t) = K_p E(t)$$

If you want to learn more about how PID is used and implemented and the math, read this post of mine: Error Correction using PID (Proportional Integral Derivative) - The Math

You can also learn more about PID here: PID controller - Wikipedia

Answer:
To make the robot travel on the right side of the line, you will have to reverse the output angles. The easiest way to do this is to change the sign of Kp. You would set Kp = -2.
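
Here is a small Python sketch of what flipping the sign does, using the same Kp x (Sensor - Target) calculation with a target of 60. The function name is hypothetical.

```python
def steering(sensor, target=60, kp=2):
    """Steering angle = Kp * (sensor reading - target)."""
    return kp * (sensor - target)

# Sensor sees more white than the target (reading of 70).
print(steering(70, kp=2))   # 20, turn right (robot on the left side of the line)
print(steering(70, kp=-2))  # -20, turn left (robot on the right side of the line)
```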