Power BI Dataflows – Data Profiling Without Premium

Important: This post was written and published in 2018, and the content below no longer represents the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

If you’ve been reading this blog, you already know a few things:

  1. Power BI includes a capability for self-service data preparation called “dataflows”
  2. Dataflows include computed entities, which enable some powerful reuse scenarios, but which are available only in Power BI Premium
  3. You can use computed entities to make current data profiles available for your dataflow entities, so that this valuable metadata can be used wherever the data is used
  4. Dataflows enable scenarios for reuse that don’t require computed entities and Premium

This post presents a variation on the data profiling pattern that doesn’t require Premium capacity. Let’s jump right in.

This is the approach that I took last time: I created a single dataflow to contain the data profiles for entities in other dataflows. As you can see, my workspace is no longer backed by Premium capacity, so this approach isn’t going to work.

2018-11-24_12-09-47

Instead of having a dedicated “Data Profiles” dataflow, we’re going to have data profile entities in the same dataflows that contain the entities being profiled. Dataflows like this one.

2018-11-24_12-16-57

As you can see, this dataflow contains two entities. We want to profile each of them. The most intuitive approach would be to create new queries that reference the queries for these entities, and to put the profile in the dependent query…

2018-11-24_12-19-26.png

…but if you do this, Power BI thinks you’re trying to create a computed entity, which requires Premium.

2018-11-24_12-19-47

Please allow me to rephrase that last sentence. If you reference a query that is loaded into a dataflow entity, you are creating a computed entity, which requires Premium.

So let’s not do that.

Specifically, let’s use the same pattern we used in the “reuse without premium” post to address this specific scenario.

Let’s begin by disabling the data load for the two “starter” entities that reference the external data source.

2018-11-24_12-27-50.png

Once this is done, the Premium warning goes away, because we’re no longer trying to create computed entities.

2018-11-24_12-29-20.png

Let’s rename the queries, and look at the M code behind the new queries we’ve created.

2018-11-24_12-32-04

As you can see, the new queries don’t contain any real logic – all of the data acquisition and transformation takes place in the “source” queries. The new ones just reference them, and get loaded into the CDM Folder that’s backing the dataflow.

At this point we’re functionally right back where we started – we just have a more complex set of queries to achieve the same results. But we’re also now positioned to add in queries to profile these entities, without needing Premium.

To do this, we’ll simply add new queries that reference the “source” queries, and add a step that calls Table.Profile().[1]

2018-11-24_12-38-37

And that’s that.

When I save my dataflow and refresh it, the entities for the data, and the entities for the data profile, will load, and will be saved for reuse. When I connect to this dataflow from Power BI Desktop, I have available all four entities.

2018-11-24_12-46-22

At this point you may be wondering about what the difference is between this approach and the approach that uses computed entities. To help answer this question, let’s look at the refresh details in the CSV file that can be downloaded from the refresh history.[2]

2018-11-24_12-48-06

If you look at the start and end time for each of the four entities, you’ll see that each of them took roughly the same time to complete. This is because for each entity, the query extracted data from the data source and transformed it before loading into the CDM Folder. Even though the extract logic was defined in the shared “source” queries, when the dataflow is refreshed each entity is loaded by executing its query against the data source.

By comparison, in the data profiling pattern that relies on computed entities, the data source is not used to generate the profile. The computed entity uses the CDM Folder managed by Power BI as its source, and generates from profile from there. This means that the data source is placed under lighter load[3], and the profile generation itself should take less time.

For meaningfully large data sources, this different may be significant. For the trivial data sources used in this example, the difference is measured in seconds, not minutes or hours. You’ll probably want to explore these patterns and others – I’m eager to hear what you discover, and what you think…


[1] Yes, I literally copied the code from my other blog post.

[2] For more details on refresh, see Dataflows in Power BI: Overview Part 5 – Data Refresh.

[3] 50% lighter, if my math skills haven’t failed me.

 

One thought on “Power BI Dataflows – Data Profiling Without Premium

  1. Pingback: Dataflows in Power BI – BI Polar

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s