Friday, June 03, 2022

Understanding PowerBi Incremental Refresh with Detect Changes

Incremental refresh involves partitioning of the data and this concept isnt something that is dwelled upon a lot in the documentation. Because of what fields you use for partitioning can make or break incremental refresh, a lot of times, not understanding the underlying concepts can cause weird behavior later on. Most often the issue, is duplicate records or ghost records. These issues typically surface a months after the incremental refresh is implemented, as the time periods for the archived and incremental refresh periods are measured in multiple months.

Here is some typical account data that one might encounter in a CRM system. I will be using this data to illustrate how Incremental Refresh works.

Id

Name

Created Date

Last Modified Date

1

Microsoft

2022-01-01

2022-01-02

2

Sales Force

2022-02-01

2022-02-02

3

Hitachi

2022-03-01

2022-03-01

4

NCM

2022-04-01

2022-04-05

5

Tesla

2022-05-01

2022-06-03

Picking the partitioning field

Incremental refresh works by storing the data in separate partitions. If you choose "months" then a partition is created for each month. These partitions are important for PowerBi as when data needs to be refreshed, the entire partition's data is reloaded. When a partition is older than the number of months you wish to archive, then that partition is dropped. The partitions are always created using a date column.

The first thing to remember is that one needs to pick a date-time column that does not change (invariant). This means that one cannot use the LastModifiedDate column, which many systems capture, as that will change all the time. Typically, one will end up using the CreatedDate field, as that should never change once the record has been created. 

Next you need to create 2 parameters named "RangeStart" and "RangeEnd". These need to be of type Date or Date/Time. You will use these 2 parameters to filter the above table by its CreatedDate. You can use any values you like for the RangeStart and RangeEnd on your computer, while developing your dataset, they will be reset upon the first refresh that is performed in PowerBi.com.


This is what the formula looks like:

= Table.SelectRows(#"Reordered Columns", each [createddate] >= RangeStart and [createddate] < RangeEnd)

The next step is to setup the incremental refresh.





For this example, I am using Archive data set to 6 months and Incrementally refresh data starting 3 months before refresh date. This will make it easier for me to illustrate the way incremental refresh works. You will likely use very different values in your production datasets. Take note of the fact that I have set the "detect data changes" option. This allows you to fine tune the incremental refresh even more (and a setting I think you should definitely use). Instead of refreshing all the partitions in the "Incremental Refresh" period, only those partitions that have changes are refreshed. But how does PowerBi know which partitions changed? You use a field that is updated anytime a record is updated (an audit date column). Typically in most systems this is a field called LastModifiedDate. PowerBi, keeps track of the max value of this field for each Incremental Refresh partition. The next time the dataset is refreshed, only those partitions are reloaded where the source data returns a higher LastModifiedDate than the one PowerBi captured the last time it ran. SalesForce note: you should use the SysModStamp column instead of the LastModifiedDate column, as even backend process changes will update the SysModStamp column, where as LastModifedDate is typically only updated by user based changes.

With the settings as defined above (3 months of incremental refresh and another 3 months of archived data, for a total of 6 months of data), on the very first refresh in PowerBi.com, we will see 3 + (3*2) queries getting fired off (3 for the archived months and 6 for incremental refresh months). The reason you see the additional set of queries for the incremental refresh is that PowerBi will send 2 queries per partition (in this case its by month). These queries will look like this:

select max("systemmodstamp") as "C1" from  (     select "systemmodstamp"     from account     where "createddate" >= timestamp '2022-01-01 00:00:00' and "createddate" < timestamp '2022-02-01 00:00:00'       ) as "ITBL"

select "id",     "name",     "isdeleted",     "createddate",     "systemmodstamp" from account where "createddate" >= timestamp '2022-01-01 00:00:00' and "createddate" < timestamp '2022-02-01 00:00:00'

The 1st query is on the "detect data changes" column (sysmodstamp) and gets the max value for that partition. The 2nd query gets the data for that partition. After this first refresh in PowerBi.com, all 6 months of data are loaded into the dataset.

Now on every subsequent refresh, PowerBi will first send similar queries as this for each partition that is part of "Incremental Refresh". In our case it will run the queries for the months of April, May, June.

select max("systemmodstamp") as "C1" from  (     select "systemmodstamp"     from account     where "createddate" >= timestamp '2022-04-01 00:00:00' and "createddate" < timestamp '2022-05-01 00:00:00'       ) as "ITBL"

PowerBi will then compare these max values to the previous max values it retrieved. PowerBi will only refresh those partitions for which the new max values are different from the previous max values on the detect-changes-column. By doing this, PowerBi can be even more efficient about which partitions it is reloading. If you had not used the detect-data-changes option, then for every subsequent refresh, PowerBi would reload all 3 "incremental refresh" partitions.

Now lets get back to the data we were loading. Initially this data was (I have added the partition each record would be loaded into, in this table). Lets assume this occurred on 6/3/2022

Id

Name

Created Date

Last Modified Date

Partition

1

Microsoft

2022-01-01

2022-01-02

January (archived)

2

Sales Force

2022-02-01

2022-02-02

February (archived)

3

Hitachi

2022-03-01

2022-03-01

March (archived)

4

NCM

2022-04-01

2022-04-05

Apr (Incr. Refresh)

5

Tesla

2022-05-01

2022-06-03

May (Incr. Refresh)

Lets assume the following changes happened on the next day (6/4/2022)

Id

Name

Created Date

Last Modified Date

Partition

1

Microsoft (deleted)

2022-01-01

2022-06-04

January (archived)

2

SalesForce (SFDC)

2022-02-01

2022-02-02

February (archived)

3

Hitachi

2022-03-01

2022-03-01

March (archived)

4

National CineMedia

2022-04-01

2022-06-04

Apr (Incr. Refresh)

5

Tesla/Twitter

2022-05-01

2022-06-04

May (Incr. Refresh)

6

Disney

2022-06-04

2022-06-04

June (Incr. Refresh

In the above case, because rows 1 and 2 are in archived partitions they will not be reloaded. This can be a problem if you support soft-deletes, as now your report potentially has 1 additional record than the source (as row 1 was soft-deleted in the source system). The name change from row 2 will also not show up in your report. The only changes that will be loaded are the changes to the rows 4, 5 and 6.

Finally, when July rolls around, the January partition will be dropped entirely from the dataset and a July Partition will be added.

Id

Name

Created Date

Last Modified Date

Partition

1

Microsoft (deleted)

2022-01-01

2022-06-04

January (archived)

Partition is dropped

2

SalesForce (SFDC)

2022-02-01

2022-02-02

February (archived)

3

Hitachi

2022-03-01

2022-03-01

March (archived)

4

National CineMedia

2022-04-01

2022-06-04

Apr (archived)

Partition type changed

5

Tesla/Twitter

2022-05-01

2022-06-04

May (Incr. Refresh)

6

Disney

2022-06-04

2022-06-04

June (Incr. Refresh

7

Paramount Studios

2022-07-04

2022-07-04

July (Incr. Refresh)

Partition is added

Incremental Refresh and CRM data:

When you use "CreatedDate" as your partition field, one thing you have to think about is what period of data you need refreshed. Imagine if you set up your Incremental Refresh to be 12 months and Archive period to be 24 months. On an entity like "Account", sometimes accounts may get created and may not get sold to or touched for a long time. So, if the account was created more than 12 months ago, then when it gets finally used and if any changes were to occur on that entity, you would not see those changes. In the above example, Microsoft is no longer even in the dataset, as its created date is outside the archival range. If Microsoft had an order on 2022-06-04, you would see the order's details, but the order would not have the account details, as its no longer being loaded.

For this reason, one needs to know what is the maximum time period between last-modified on an entity and its created date. You would then have to setup your refresh settings based on the usage statistics of the entities at your company. This is also one of the biggest reasons that it is very hard to setup an incremental refresh with CRM data.

When it comes to Order/Opportunity data, you have a little more lee-way. Typically one should use a field like "Order-Date" and not something like "Shipped-Date", as the latter could change. Also, you could institute a rule that states that orders/opportunities that are older than x months should be closed and a new one created.

What would happen if you use a partition column that might change

Lets take the following data as an example:

Id

Order Name

Order Date

Last Modified Date

Order Value

Partition

1

My Big Order

2021-01-01

2022-01-02

$1 million

January


If we were to have a refresh setting where the partition field were set to "Order Date" and where we incrementally refresh data in the last 12 months and archive another 12 months of data. The above order would have be in an "archived" partition. Now if the Order Date were to be changed to 2022-06-01:

Id

Order Name

Order Date

Last Modified Date

Order Value

Partition

1

My Big Order

2022-06-01

2022-06-01

$1.5 million

January


This same record would be loaded into the "Incremental Refresh" partition. So you would have 2 copies of the same order in your dataset and your order value would be over stated by $1 million. One could try and de-dup this data, but PowerBi is very inefficient at de-duping large record sets. Also, your algorithm would have to be smart, as you would want to keep the latest version of the order.

Important things to remember:

  1. The incremental refresh column must be a date/time column. You can use an integer column but it needs to be of type yyyymmdd (so you cant use any arbitrary integer column).
  2. Hard deletes cannot be handled by incremental refresh. You can use it only if your entity implements soft-deletes (i.e., uses a flag to denote a delete). So if your system performs hard-deletes, then you are SOL!
  3. Make sure that query folding is supported by your data-source and that it is occurring, otherwise, you will end up pulling multiple copies of the data from the data-source and end up considerably slowing down your refresh and possibly inviting the wrath of your data-source administrators. (Don't ask me how I know!)

Some Tips and Notes:

  1. If you implement incremental refresh in your dataset, then know that you will not be able to download your PBIX file from powerbi.com. Plan for this ahead of time. Another important consideration is that if you implement incremental refresh in your dataset and you need to make model changes, then the partitions will be deleted and the entire data reloaded. This is another reason to not implement incremental refresh directly in your dataset.
  2. If you do need to implement incremental refresh in your dataset, then consider:
    1. Storing your PBIX file in a shared location and have an agreed upon method for your team to work on that file. I highly recommend using Sharepoint for this.
    2. You can publish your changes using a tool like "ALM Toolkit" This will not cause your dataset to loose its changes. You can also use the deployment pipeline to push your changes in PowerBi and again this will not delete your existing partitions.
  3. Often times you need to implement incremental refresh because your source is very slow. In this case, you could implement the incremental refresh as a dataflow. You would pull the dataflow into your dataset. In this case, you may not have to implement incremental refresh in your dataset, as the data may load fast enough from the data-flow into your dataset.
  4. PowerBi will setup a partitions based on your incremental refresh settings. When powerbi refreshes data, it is refreshing the entire partition (i.e., it will delete the partition and reload the data from the data-source). So you need to carefully consider the incremental refresh settings and figure out how many rows on average might fall in a partition. If you partition by year and each year has a billion rows, then you might want to consider partitioning by month.

No comments: