Aggregated Intelligence: 2022

Tuesday, November 01, 2022

Connecting to AWS Athena using JDBC (Dbeaver)

A quick post on setting up a connection to Athena with OKTA using JDBC. The SQL client I am using is DBeaver, but any other tool could be used.

Figure out your connection string.
replace the highlighted sections with values that are valid for your org.
jdbc:awsathena://AwsRegion=us-west-2;S3OutputLocation=s3://my-athena-results-bucket/results-folder/;AwsCredentialsProviderClass=com.simba.athena.iamsupport.plugin.OktaCredentialsProvider;user={user};password={password};idp_host=mycompany.okta.com;App_ID=0aosd2asadsdsdsd/123;SSL_Insecure=true;
In Dbeaver, click on "New Connection", which will bring up the "Connect to a database" dialog. Search for Athena and click next.
Enter your username & password. For my org, I just had to enter my WindowsNT id without the domain, but it might be different for your org.
Click on the button Driver Settings
Replace the "URL Template" with the value from step 1 on this page (connection string)

Click OK.
Test your connection.

That should make it work!

Creating an ODBC connection is very similar:

Wednesday, September 28, 2022

PowerBi - Row Level Security behavior - without table filter expressions

This is a quick post on RLS behavior in a powerbi dataset, where you dont define any table filter expressions.

Basically, when you setup a role and you dont define any table filter DAX expressions, it allows you to control who has access to your dataset. This is useful, when you have multiple datasets in a workspace and you need to reduce the access to a particular group.

Here is an example:

In the above case, people not assigned to the RLS role, will get the following message:

I have found this behavior useful many times, when a dataset gets published, and all of a sudden, someone comes up with a late breaking security requirement and needs the dataset locked down. I quickly add a role and publish the dataset and that allows me to control who can see the data. Also, as people can see the reports, just with an error, it allows them to ask for access to this dataset (instead of completely hiding the dataset from them).

Tuesday, August 23, 2022

PowerBi - Datasets - Create a report from a template

Have you wondered how to use the "Create a report" >> "From a template" option from a dataset in PowerBi Service?

If you look at the documentation: Dataset details page - Power BI | Microsoft Docs, it is rather unhelpful:

It turns out that, there is some additional information a little earlier in that page: Dataset details page - Power BI | Microsoft Docs

So it turns out that if you create a report from your dataset and post fix the report name as "(template)", it will use that report as the template for the new report. This is extremely helpful, as you can now create a report with instructions and some sample pages that can help users get started. I LOVE THIS FEATURE!

Note: (2022-8-23): Unfortunately, this does not seem to work when you try and create a report directly (from outside the dataset details page).

The "Auto-create report" option does not work! I really wish this did as currently we provide access to our datasets via PowerBi apps and so users dont have access to the dataset details page. Ideally, when the pick the "auto-create report" option with a dataset, it should use the template associated with the dataset.

Wednesday, July 27, 2022

Bash script and /bin/bash^M: bad interpreter: No such file or directory

If you get the error: "Bash script and /bin/bash^M: bad interpreter: No such file or directory", then you are likely trying to run a shell script (.sh) from your windows drive. Most likely, the issue is that your line-endings are CR+LF and not the Unix style LF.

If you need to fix this, the simplest way is to open the file in NotePad++ and change the line-endings (bottom right side of the status bar).

Another simple way to do this is to run the following script that runs regex to replace CR+LF with just LF.

sed -i -e 's/\r$//' myfolder/*.*

Another option, if you are using GIT is to add a .gitattributes file with line-endings set to LF.

* text eol=lf

Please note that if you do this, you may have to first delete all files in your repo (expect .git folder and .gitattributes and then run git reset to get the latest files again).

Finally, another option is to not download the files directly from your command prompt or windows explorer. Instead use your WSL terminal to download the files and they should come down with the correct line endings.

Wednesday, July 20, 2022

PowerBi PowerQuery - Convert Unix Timestamp to DateTime

The Unix timestamp represents the number of seconds since 1970-01-01. So to convert it to a DateTime in PowerQuery, use the following code:

= Table.AddColumn(Source, "unixTimeStamp to DateTime", each #datetimezone(1970,1,1,0,0,0,0,0) + #duration(0,0,0,[unixtimestamp]))

Tuesday, June 07, 2022

Power Bi - Filtering by a list of values

When you need to filter a Power Bi report by multiple values and your users want to copy & paste the list of values by which they wish the report filtered, there isnt an out of the box way to do it in Power Bi.

I recently stumbled upon an AppSource visual called "Mass Filter" and it does exactly that!

And a quick video that shows you how to use it: Power Bi - Filter by Multiple Comma Separated Values - YouTube

Friday, June 03, 2022

Understanding PowerBi Incremental Refresh with Detect Changes

Incremental refresh involves partitioning of the data and this concept isnt something that is dwelled upon a lot in the documentation. Because of what fields you use for partitioning can make or break incremental refresh, a lot of times, not understanding the underlying concepts can cause weird behavior later on. Most often the issue, is duplicate records or ghost records. These issues typically surface a months after the incremental refresh is implemented, as the time periods for the archived and incremental refresh periods are measured in multiple months.

Here is some typical account data that one might encounter in a CRM system. I will be using this data to illustrate how Incremental Refresh works.

Id	Name	Created Date	Last Modified Date
1	Microsoft	2022-01-01	2022-01-02
2	Sales Force	2022-02-01	2022-02-02
3	Hitachi	2022-03-01	2022-03-01
4	NCM	2022-04-01	2022-04-05
5	Tesla	2022-05-01	2022-06-03

Picking the partitioning field

Incremental refresh works by storing the data in separate partitions. If you choose "months" then a partition is created for each month. These partitions are important for PowerBi as when data needs to be refreshed, the entire partition's data is reloaded. When a partition is older than the number of months you wish to archive, then that partition is dropped. The partitions are always created using a date column.

The first thing to remember is that one needs to pick a date-time column that does not change (invariant). This means that one cannot use the LastModifiedDate column, which many systems capture, as that will change all the time. Typically, one will end up using the CreatedDate field, as that should never change once the record has been created.

Next you need to create 2 parameters named "RangeStart" and "RangeEnd". These need to be of type Date or Date/Time. You will use these 2 parameters to filter the above table by its CreatedDate. You can use any values you like for the RangeStart and RangeEnd on your computer, while developing your dataset, they will be reset upon the first refresh that is performed in PowerBi.com.

This is what the formula looks like:

= Table.SelectRows(#"Reordered Columns", each [createddate] >= RangeStart and [createddate] < RangeEnd)

The next step is to setup the incremental refresh.

For this example, I am using Archive data set to 6 months and Incrementally refresh data starting 3 months before refresh date. This will make it easier for me to illustrate the way incremental refresh works. You will likely use very different values in your production datasets. Take note of the fact that I have set the "detect data changes" option. This allows you to fine tune the incremental refresh even more (and a setting I think you should definitely use). Instead of refreshing all the partitions in the "Incremental Refresh" period, only those partitions that have changes are refreshed. But how does PowerBi know which partitions changed? You use a field that is updated anytime a record is updated (an audit date column). Typically in most systems this is a field called LastModifiedDate. PowerBi, keeps track of the max value of this field for each Incremental Refresh partition. The next time the dataset is refreshed, only those partitions are reloaded where the source data returns a higher LastModifiedDate than the one PowerBi captured the last time it ran. SalesForce note: you should use the SysModStamp column instead of the LastModifiedDate column, as even backend process changes will update the SysModStamp column, where as LastModifedDate is typically only updated by user based changes.

With the settings as defined above (3 months of incremental refresh and another 3 months of archived data, for a total of 6 months of data), on the very first refresh in PowerBi.com, we will see 3 + (3*2) queries getting fired off (3 for the archived months and 6 for incremental refresh months). The reason you see the additional set of queries for the incremental refresh is that PowerBi will send 2 queries per partition (in this case its by month). These queries will look like this:

select max("systemmodstamp") as "C1" from ( select "systemmodstamp" from account where "createddate" >= timestamp '2022-01-01 00:00:00' and "createddate" < timestamp '2022-02-01 00:00:00' ) as "ITBL"

select "id", "name", "isdeleted", "createddate", "systemmodstamp" from account where "createddate" >= timestamp '2022-01-01 00:00:00' and "createddate" < timestamp '2022-02-01 00:00:00'

The 1st query is on the "detect data changes" column (sysmodstamp) and gets the max value for that partition. The 2nd query gets the data for that partition. After this first refresh in PowerBi.com, all 6 months of data are loaded into the dataset.

Now on every subsequent refresh, PowerBi will first send similar queries as this for each partition that is part of "Incremental Refresh". In our case it will run the queries for the months of April, May, June.

select max("systemmodstamp") as "C1" from ( select "systemmodstamp" from account where "createddate" >= timestamp '2022-04-01 00:00:00' and "createddate" < timestamp '2022-05-01 00:00:00' ) as "ITBL"

PowerBi will then compare these max values to the previous max values it retrieved. PowerBi will only refresh those partitions for which the new max values are different from the previous max values on the detect-changes-column. By doing this, PowerBi can be even more efficient about which partitions it is reloading. If you had not used the detect-data-changes option, then for every subsequent refresh, PowerBi would reload all 3 "incremental refresh" partitions.

Now lets get back to the data we were loading. Initially this data was (I have added the partition each record would be loaded into, in this table). Lets assume this occurred on 6/3/2022

Id	Name	Created Date	Last Modified Date	Partition
1	Microsoft	2022-01-01	2022-01-02	January (archived)
2	Sales Force	2022-02-01	2022-02-02	February (archived)
3	Hitachi	2022-03-01	2022-03-01	March (archived)
4	NCM	2022-04-01	2022-04-05	Apr (Incr. Refresh)
5	Tesla	2022-05-01	2022-06-03	May (Incr. Refresh)

Lets assume the following changes happened on the next day (6/4/2022)

Id	Name	Created Date	Last Modified Date	Partition
1	Microsoft (deleted)	2022-01-01	2022-06-04	January (archived)
2	SalesForce (SFDC)	2022-02-01	2022-02-02	February (archived)
3	Hitachi	2022-03-01	2022-03-01	March (archived)
4	National CineMedia	2022-04-01	2022-06-04	Apr (Incr. Refresh)
5	Tesla/Twitter	2022-05-01	2022-06-04	May (Incr. Refresh)
6	Disney	2022-06-04	2022-06-04	June (Incr. Refresh

In the above case, because rows 1 and 2 are in archived partitions they will not be reloaded. This can be a problem if you support soft-deletes, as now your report potentially has 1 additional record than the source (as row 1 was soft-deleted in the source system). The name change from row 2 will also not show up in your report. The only changes that will be loaded are the changes to the rows 4, 5 and 6.

Finally, when July rolls around, the January partition will be dropped entirely from the dataset and a July Partition will be added.

Id	Name	Created Date	Last Modified Date	Partition
1	~~Microsoft (deleted)~~	~~2022-01-01~~	~~2022-06-04~~	~~January (archived)~~ Partition is dropped
2	SalesForce (SFDC)	2022-02-01	2022-02-02	February (archived)
3	Hitachi	2022-03-01	2022-03-01	March (archived)
4	National CineMedia	2022-04-01	2022-06-04	Apr (archived) Partition type changed
5	Tesla/Twitter	2022-05-01	2022-06-04	May (Incr. Refresh)
6	Disney	2022-06-04	2022-06-04	June (Incr. Refresh
7	Paramount Studios	2022-07-04	2022-07-04	July (Incr. Refresh) Partition is added

Incremental Refresh and CRM data:

When you use "CreatedDate" as your partition field, one thing you have to think about is what period of data you need refreshed. Imagine if you set up your Incremental Refresh to be 12 months and Archive period to be 24 months. On an entity like "Account", sometimes accounts may get created and may not get sold to or touched for a long time. So, if the account was created more than 12 months ago, then when it gets finally used and if any changes were to occur on that entity, you would not see those changes. In the above example, Microsoft is no longer even in the dataset, as its created date is outside the archival range. If Microsoft had an order on 2022-06-04, you would see the order's details, but the order would not have the account details, as its no longer being loaded.

For this reason, one needs to know what is the maximum time period between last-modified on an entity and its created date. You would then have to setup your refresh settings based on the usage statistics of the entities at your company. This is also one of the biggest reasons that it is very hard to setup an incremental refresh with CRM data.

When it comes to Order/Opportunity data, you have a little more lee-way. Typically one should use a field like "Order-Date" and not something like "Shipped-Date", as the latter could change. Also, you could institute a rule that states that orders/opportunities that are older than x months should be closed and a new one created.

What would happen if you use a partition column that might change

Lets take the following data as an example:

Id	Order Name	Order Date	Last Modified Date	Order Value	Partition
1	My Big Order	2021-01-01	2022-01-02	$1 million	January

If we were to have a refresh setting where the partition field were set to "Order Date" and where we incrementally refresh data in the last 12 months and archive another 12 months of data. The above order would have be in an "archived" partition. Now if the Order Date were to be changed to 2022-06-01:

Id	Order Name	Order Date	Last Modified Date	Order Value	Partition
1	My Big Order	2022-06-01	2022-06-01	$1.5 million	January

This same record would be loaded into the "Incremental Refresh" partition. So you would have 2 copies of the same order in your dataset and your order value would be over stated by $1 million. One could try and de-dup this data, but PowerBi is very inefficient at de-duping large record sets. Also, your algorithm would have to be smart, as you would want to keep the latest version of the order.

Important things to remember:

The incremental refresh column must be a date/time column. You can use an integer column but it needs to be of type yyyymmdd (so you cant use any arbitrary integer column).
Hard deletes cannot be handled by incremental refresh. You can use it only if your entity implements soft-deletes (i.e., uses a flag to denote a delete). So if your system performs hard-deletes, then you are SOL!
Make sure that query folding is supported by your data-source and that it is occurring, otherwise, you will end up pulling multiple copies of the data from the data-source and end up considerably slowing down your refresh and possibly inviting the wrath of your data-source administrators. (Don't ask me how I know!)

Some Tips and Notes:

If you implement incremental refresh in your dataset, then know that you will not be able to download your PBIX file from powerbi.com. Plan for this ahead of time. Another important consideration is that if you implement incremental refresh in your dataset and you need to make model changes, then the partitions will be deleted and the entire data reloaded. This is another reason to not implement incremental refresh directly in your dataset.
If you do need to implement incremental refresh in your dataset, then consider:

Storing your PBIX file in a shared location and have an agreed upon method for your team to work on that file. I highly recommend using Sharepoint for this.
You can publish your changes using a tool like "ALM Toolkit" This will not cause your dataset to loose its changes. You can also use the deployment pipeline to push your changes in PowerBi and again this will not delete your existing partitions.

Often times you need to implement incremental refresh because your source is very slow. In this case, you could implement the incremental refresh as a dataflow. You would pull the dataflow into your dataset. In this case, you may not have to implement incremental refresh in your dataset, as the data may load fast enough from the data-flow into your dataset.
PowerBi will setup a partitions based on your incremental refresh settings. When powerbi refreshes data, it is refreshing the entire partition (i.e., it will delete the partition and reload the data from the data-source). So you need to carefully consider the incremental refresh settings and figure out how many rows on average might fall in a partition. If you partition by year and each year has a billion rows, then you might want to consider partitioning by month.

Thursday, June 02, 2022

Error Correction using PID (Proportional Integral Derivative) - The Math

PID or Proportional Integral Derivate is used often times when you need an automated system to self correct. Think for example a robot that has to drive straight or follow a line, or self balance.

Driving or Flying Straight: In this case the robot needs to keep to a certain heading or angle. Various forces might make it deviate from that heading (wind, wheel slip or friction, etc). The difference between the target heading and the actual heading is the error and PID can be used to correct that heading.

Following a line: In line following the robot attempts to follow a dark line to its target. In this case, typically a sensor that senses reflected light is used. The robot attempts to keep the reflected light within a certain range and tries to correct its direction to keep tracking the line. The difference between the amount of reflected light sensed and target reflected light is the error value and used in self correcting the direction the robot is travelling.

Self Balancing Robot: Think about a segway like robot that attempts to self balance using 2 motors. In this case the robot's target orientation is upright and the error is the angle from the vertical plane. The robot then attempts to self correct by applying a force towards the vertical plane. PID again provides the amount of self correcting needed.

PID is most important in the self balancing example, as one cannot use a constant amount of correction, as the robot would likely topple. This is because while the robot is attempting to balance, the amount of deviation from the vertical plane will vary randomly.

Quad Copter: This is probably the most intuitive applications to work with. Think of a quad copter on the ground. You wish to have it take off and hover at a certain height (eg: 10 feet). When the quad copter first takes off, its altitude sensor will start at 0 feet and the error is 10 feet. The quad copter increases its propeller speed and starts its journey to 10 feet. It will invariably overshoot the target altitude (say 11 feet), so it will have to slow down its propeller to return back to its target of 10 feet. Wind and other factors will keep moving the copter away from its target altitude and so, as the sensor returns its altitude readings, the quad copter will have to continuously change the propeller speed to increase or decrease its altitude.

PID is again a better algorithm to use for a quad-copter trying to attain a target height, compared to a constant correction amount, as it will allow it to get to its target height and maintain it better.

In a PID based system, the math is trying to reduce the error at every calculation step (or loop) and bring it to its target value. The systems typically never ever reaches the their target value and continuously fluctuates around the target value.

Even though PID uses fancy words like integral and derivate, the math itself is very simple.

Lets go through PID math by its various components:

Proportional

One can use only the proportional part of PID (with I = 0 and D = 0) in many simple systems (eg: line following robot). The job of this component of the PID math is to provide a correction towards the target value.

Kp: constant of proportion (usually called the "proportional gain constant")

SP: Set Point (or the target value we want on the sensor)

PV(t): The process value at time t (or the current sensor value)

E(t) = SP - PV(t) : This is the error at any given time t. (or the error in the graph above).

The proportional value P(t) is calculated as Kp * E(t) or Kp * (SP - PV(t)).

P(t) = Kp * E(t)

P(t) = Kp * (SP - PV(t))

In other words its the error multiplied by a certain tuning factor (Kp). Kp can be used to magnify or dampen the effect of the error on the driving components.

In the diagram below, its Kp * 9ft (where Kp could be any value that you pick, other than 0!). If Kp = 1, then the value would be 9.

Integral

The job of the integral is make larger corrections if the Proportional component is not providing enough of a correction. It accumulates the errors and applies it to the correction. Again a Integral gain constant is used to dampen or magnify the effect of this component.

Ki: integral gain constant

I(t) = Ki * E(t) + I(t-1)

Where I(t-1) is the I(t) that was calculated in the previous cycle. If the terms (t-1) and (t) are confusing, you can also write this as:

I = Ki * E + Iprev

In the above diagram if Ki = 1, then at time = 1:

I = 1 * 9 + 0 = 9 (0, as I starts at 0)

at time = 2, (now I = 9)

I = 1 * 7 + 9 = 16

Derivate

The job of the derivate is to slow down the effect of the Proportional and Integral components. It attempts to minimize overshoot that can happen as the previous two components attempt to bring the system to the target value. From the graph above its the difference in value of the 2 error values shown above at (t-1) and t.

Kd: derivate gain constant

D(t) = Kd * (E(t) - E(t-1))

or alternatively:

D = Kd * (E - Eprev)

In the above example, if Kd = 1, then at t = 1:

D = 1 * (9-0) = 9

at t=2, E(t-1) or Eprev = 9:

D = 1 * (7-9) = 1 * -2 = -2

Putting it all together:

C(t) = P(t) + I(t) + D(t) where C is the correction.

C(t) = Kp*E(t) + (Ki*E(t) + Ki*I(t-1)) + Kd*(E(t) - E(t-1))

Going back to our quad-copter example, this is what its flight might have looked like

In fancy math terms this is the PID equation:

The first term is Proportional component, 2nd is Integral and 3rd term is Derivative component.

And as an Algorithm:

Kp = 1, Ki = 1, Kd = 1 (where the values can be different and tuned for your system)
Target = 100 (this is the set-point. This the target value you wish your system to achieve)
CurrentValue = 0 (this is the current sensor measurement or Process value - PV)
ErrPrev = 0
I = 0
while true

Err = Target - CurrentValue(from Sensor)
P = Kp * Err
I = Ki * Err + I
D = Kd * (Err - ErrPrev)
ErrPrev = Err
C = P + I + D
Do something with C (turn your robot by C degrees, etc).

The above loop is typically continuously evaluated until a certain condition is met (quad-copter has been asked to return home, or robot has reached its target or robot has been running for x minutes, etc).