If you’ve ever built an enterprise data warehouse or large-scale business intelligence solution, you understand the effort that goes into building the ETL / data preparation subsystem. Tools like SQL Server Integration Services, Azure Data Factory, and third-party tools like Informatica are powerful, but using them effectively to build a data pipeline requires specialized technical knowledge. They’re not tools that you would hand to a business analyst with expectations for success.
This is where Power Query really shines. Power Query provides a powerful, intuitive, and interactive data preparation experience that can be employed by business users as well as data professionals. Dataflows in Power BI expand the reach of Power Query, as the other posts in this series demonstrate. But Dataflows do more than that. Dataflows can help dramatically simplify the real complexity of data preparation.
Building a single ETL package can seem complex – regardless of the tools and technologies you use – but the true complexity comes from orchestrating the execution of tens or hundreds of packages. ETL orchestration logic must take into account many considerations, including:
- Availability windows of source systems – each data source is already under load, and most enterprise sources have specific windows of availability where bulk processing is permitted
- Dependencies between packages or groups of packages – it is common to have groups of packages related to a data source, a data domain, or a stage in the data preparation processing, and to have dependencies between them, where a dependent downstream set of packages should not start executing until the upstream packages have completed successfully.
- Error handling – If one package fails to execute, the system as a whole should respond in a consistent and predictable manner that supports the needs of the system and of the business processes in which the data is used.
- Much, much more – It’s been 10 years since I built an enterprise ETL system myself, and my memory isn’t what it used to be, so I’m sure I’ve forgotten more considerations than I remember.
Most ETL tools include capabilities to define the orchestration logic within the scope of the tool, and some provide APIs so users can define that logic outside of the tool. But no matter where you define the orchestration logic, getting it right is difficult and complex.
But what if it wasn’t?
Consider if you will Microsoft Excel. Excel is not a data preparation tool in the context in which we’re discussing data preparation, but it is a data tool, and it has many characteristics that would be valuable in a data preparation tool.
Consider a formula in a cell in an Excel worksheet. The formula defines where the data in the cell comes from. The formula references other cells and ranges in the worksheet, or in other worksheets, and applies logic to the values in those cells and ranges to calculate the cell’s value. The cell’s data lineage is self-describing, because the only way that the data in the cell is updated is through the execution of the formula. And because all of the formulas are managed by Excel, Excel can understand the dependencies between them and ensure that when any value is updated, all formulas that reference it – directly or indirectly – are evaluated and their values automatically updated.
Wouldn’t it be awesome if data preparation tools were like Excel?
Power BI dataflows are like Excel.
Let’s break down that bold statement a little.
- An entity in a dataflow is like a cell in a worksheet. The value of the entity is the data set that is produced by the query that defines the entity. The query is the entity’s formula.
- Every dataflow entity is self-describing in its lineage. The query that defines the entity’s structure also authoritatively defines where the data in the entity comes from, and how it is transformed.
- The only way that the data in an entity is updated is through the execution of the entity’s query.
- All dataflows and entities are managed by the Power BI service, so the Power BI service understands all of the relationships and dependencies between them.
And because of this, with linked entities, when an upstream dataflow is refreshed, downstream dependent dataflows can automatically refresh without the need for any explicit orchestration logic to be defined.
If you’ve ever seen one of my presentations on Power BI dataflows, this is the part where I get excited. Let’s get you excited too. Let’s look at linked entities.
Linked entities are entities in one Power BI dataflow that reference entities in another dataflow. Before I can create linked entities, I need to have entities in a dataflow that I can reference. For this example, I’m pulling in two data sets from the UK NHS related to healthcare sites and food quality.
And once I’ve created my dataflow, I’ll manually refresh it so that it contains data.
I should point out two things in these screen shots that may not seem important at first glance:
- The workspace is backed by Power BI Premium dedicated capacity
- The workspace is a new “v2” workspace, not a “v1” workspace tied to an Office 365 group
Both of these criteria must be met for linked entities to be used.
Now that I have the source dataflow created, I can create the dataflow for the linked entities.
To get started with the dataflow, instead of selecting “Add new entities” like I’ve done before, I will select “Add linked entities” when prompted.
Linked entities are their own data source, so I’m prompted to authenticate, using my AAD organizational account to sign in to the Power BI service.
Once I’ve signed in, I can choose from the available dataflow entities. Although this may not be obvious from the screen shot below, I can only select from “v2” workspaces on Premium capacity. There are additional workspaces in my Power BI tenant that I have permission to access, and which contain dataflows, but they do not appear in this list.
Once I’ve selected the entities to which I want to link, I’m taken to the Power Query editor.
You may be asking why the “Applied steps pane” shows multiple query steps, even though no transformations are applied to linked queries. In case you aren’t asking this, I’ve thoughtfully drawn a red arrow. See?
If you look at the Advanced editor for a linked entity, you’ll see that the query uses PowerBI.Dataflows as its source, and then navigates to the workspace, the dataflow, and the entity in order to create the linked entity. The linked entity will be read-only in this workspace.
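While I won’t reproduce the exact generated query here, a linked entity’s query follows a recognizable pattern: it starts from `PowerBI.Dataflows` and navigates down to the entity. The workspace and dataflow IDs and the entity name below are hypothetical placeholders – this is a sketch of the shape, not the literal code from my workspace:

```powerquery-m
let
    // Connect to the dataflows in the Power BI tenant
    Source = PowerBI.Dataflows(null),
    // Navigate to the workspace that contains the source dataflow
    Workspace = Source{[workspaceId = "00000000-0000-0000-0000-000000000000"]}[Data],
    // Navigate to the dataflow within that workspace
    Dataflow = Workspace{[dataflowId = "11111111-1111-1111-1111-111111111111"]}[Data],
    // Navigate to the entity itself
    #"NHS Sites" = Dataflow{[entity = "NHS Sites"]}[Data]
in
    #"NHS Sites"
```

Note that there are no transformation steps after the navigation – the linked entity is just a pointer to the entity in the source dataflow.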
Now that I have two linked entities in my dataflow, I can use them as the starting point for additional data preparation and mashup. Keep in mind that this is a very simple example, with two linked entities coming from one dataflow. A more realistic scenario might involve linking to entities in multiple dataflows from multiple workspaces. It all depends on your scenario and requirements – Power BI will make the data easy to discover and navigate, so you can get started quickly and easily.
For this example, I’m going to merge these two linked entities to create a new computed entity.
Once the merge is complete and the new entity is created, I can edit it further. But first, there are a few things I should point out.
First, notice the icons in the query list.
The first two have a “link” icon superimposed on the query icon. This shows that these are linked entities, and that they are read-only “pointers” to entities defined and stored elsewhere. The data for linked entities remains in the CDM Folder backing the source dataflow; only the pointer exists in this dataflow, and no data is duplicated.
The third entity has a “lightning bolt” icon that shows that this is “the entity who lived” after surviving the dark lord’s killing curse. Wait, no, that’s not right. Wrong lightning bolt. This icon shows that this is a computed entity. Computed entities are entities that reference linked entities, and which rely on in-storage calculations for performance and scale. Unlike “normal” entities, which use the Power Query M engine to extract data from external data sources and load it into CDM Folders in Azure Data Lake Storage gen2, a computed entity’s source and destination are both CDM Folders managed by the Power BI service, which allows the compute to take place close to the storage.
Second, notice the warning at the bottom of the editor. Because I’ve just merged two queries, one of the columns in the resulting new query represents the right table for the merge. This complex type isn’t supported by CDM Folders, so I will need to expand the column before I proceed. This is great, because I wanted to do that anyway.
See? Now the warning is gone!
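Putting the merge and the expand together, the computed entity’s query might look something like the following sketch. The entity names, the key column, and the expanded field names are all assumptions for illustration – your own merge will use whatever columns your data actually shares:

```powerquery-m
let
    // The two linked entities created earlier; names are hypothetical
    Sites = #"NHS Sites",
    Quality = #"Food Quality",
    // Merge on an assumed shared key column, producing a nested table column
    Merged = Table.NestedJoin(Sites, {"PostCode"}, Quality, {"PostCode"}, "Food Quality", JoinKind.LeftOuter),
    // Expand the nested table column so the result contains only
    // simple column types supported by CDM Folders
    Expanded = Table.ExpandTableColumn(Merged, "Food Quality", {"Rating", "InspectionDate"})
in
    Expanded
```

Once the nested table column has been expanded into simple columns, the warning goes away, because every column in the entity now has a type that CDM Folders can store.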
Now I can click “Done” to save my changes and return to the entity list for the dataflow.
Notice how the “link” and “lightning” icons are retained here as well, and how the two linked entities are explicitly labeled and include a link to the source dataflow.
The last thing we need to do is save the dataflow that includes our linked and computed entities.
Now we just need to refresh, to populate the computed entity with data.
But we do not need to refresh the dataflow that contains the linked and computed entities. That would be silly, because dataflows are like Excel, remember?
In an Excel workbook, what do you change if you want the value of a cell with a formula to change? That’s right – you change a cell referenced by the formula, and the formula will automatically update the value of its cell. And that’s what we’re going to do here.
We’re going to refresh the dataflow that contains the “normal” entities that are referenced by the linked entities. If the queries behind linked and computed entities are like an Excel formula, “normal” entities are like the data entry cells in the Excel worksheet. Refreshing them is like entering new data into those cells – it is how new data enters the system.
So let’s refresh that source dataflow.
And as soon as its processing is complete, the Power BI service will automatically refresh the dependent dataflow.
How does it know? The Power BI service includes a calculation engine that understands the relationships between all entities in all dataflows in all workspaces in the tenant. Just like Excel understands the relationships between cells, Power BI understands the relationships between entities. And just as Excel will automatically run all formulas that use a value that has changed, Power BI will automatically refresh the dataflows that contain linked and computed entities when the dataflows that they reference have been updated.
Wow, this post ended up being longer than I’d expected. Remember how I mentioned that this is the part of Power BI dataflows that I’m most excited about? Do you believe me now?
I think. I’m pretty sure. Maybe.
For example, control flow in SSIS.
Were you rolling your eyes when you read this, wondering why I would go into so much detail on a topic that everyone has taken for granted for decades? Of course Excel does these things – that’s just how it works… right?
Awesome, game-changing, magical, etc.
Even more excited. I’m excited by default when I’m talking about dataflows.
Unless you’ve been paying really close attention to the screen shots in other posts.
We really need a name for entities that are neither linked nor computed, but lacking one, I’ll call them “normal” entities. Suggestions are welcome.
Not that these suggestions will make their way into the product, of course, but they’re still welcome. If nothing else, receiving suggestions will let me know that people read the footnotes.
Maybe you shouldn’t believe me just yet. Dataflows are big, really big, and there’s a lot coming that we’re not talking about yet. You’ll need to stay tuned.