Dataflows in Power BI: Overview Part 6 – Linked and Computed Entities

Important: This post was written and published in 2018, and the content below no longer represents the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

If you’ve ever built an enterprise data warehouse or large-scale business intelligence solution, you understand the effort that goes into building the ETL / data preparation subsystem. Tools like SQL Server Integration Services, Azure Data Factory, and 3rd party tools like Informatica are powerful, but using them effectively to build a data pipeline requires specialized technical knowledge. They’re not tools that you would hand to a business analyst with expectations for success.

This is where Power Query really shines. Power Query provides a powerful, intuitive, and interactive data preparation experience that can be employed by business users as well as data professionals. Dataflows in Power BI expand the reach of Power Query, as the other posts in this series demonstrate. But Dataflows do more than that. Dataflows can help dramatically simplify the real complexity of data preparation.

Building a single ETL package can seem complex – regardless of the tools and technologies you use – but the true complexity comes from orchestrating the execution of tens or hundreds of packages. ETL orchestration logic must take into account many considerations, including:

  • Availability windows of source systems – each data source is already under load, and most enterprise sources have specific windows of availability where bulk processing is permitted
  • Dependencies between packages or groups of packages – it is common to have groups of packages related to a data source, a data domain, or a stage in the data preparation processing, and to have dependencies between them, where a dependent downstream set of packages should not start executing until the upstream packages have completed successfully.
  • Error handling – If one package fails to execute, the system as a whole should respond in a consistent and predictable manner that supports the needs of the system and of the business processes in which the data is used.
  • Much, much more – It’s been 10 years since I built an enterprise ETL system myself, and my memory isn’t exactly what it used to be[1], so I’m sure I’ve forgotten more considerations than I remember.

Most ETL tools will include capabilities to define the orchestration logic within the scope of the tool[2], and some will provide APIs so users can define that logic outside of the tool. But no matter where you define the orchestration logic, getting it right is difficult and complex.

But what if it wasn’t?

Consider if you will Microsoft Excel. Excel is not a data preparation tool in the context in which we’re discussing data preparation, but it is a data tool, and it has many characteristics that would be valuable in a data preparation tool.

Consider a formula in a cell in an Excel worksheet. The formula defines where the data in the cell comes from. The formula references other cells and ranges in the worksheet, or in other worksheets, and applies logic to the values in those cells and ranges to calculate the cell’s value. The cell’s data lineage is self-describing, because the only way that the data in the cell is updated is through the execution of the formula. And because all of the formulas are managed by Excel, Excel can understand the dependencies between them and ensure that when any value is updated, all formulas that reference it – directly or indirectly – are evaluated and their values automatically updated.[3]

Wouldn’t it be awesome[4] if data preparation tools were like Excel?

Power BI dataflows are like Excel.

Let’s break down that bold statement a little.

  • An entity in a dataflow is like a cell in a worksheet. The value of the entity is the data set produced by the query that defines the entity. The query is the entity’s formula.
  • Every dataflow entity is self-describing in its lineage. The query that defines the entity’s structure also authoritatively defines where the data in the entity comes from, and how it is transformed.
  • The only way that the data in an entity is updated is through the execution of the entity’s query.
  • All dataflows and entities are managed by the Power BI service, so the Power BI service understands all of the relationships and dependencies between them.

And because of this, with linked entities, when an upstream dataflow is refreshed, downstream dependent dataflows can refresh automatically, without the need for any explicit orchestration logic to be defined.

If you’ve ever seen one of my presentations on Power BI dataflows, this is the part where I get excited[5]. Let’s get you excited too. Let’s look at linked entities.

Linked entities are entities in one Power BI dataflow that reference entities in another dataflow. Before I can create linked entities, I need to have entities in a dataflow that I can reference. For this example, I’m pulling in two data sets from the UK NHS related to healthcare sites and food quality.
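The post doesn’t show the source queries themselves, but each source entity is simply a Power Query M query. As a minimal sketch of what one of these might look like – the URL and CSV format here are purely hypothetical stand-ins, not the actual NHS data sets:

```m
let
    // Hypothetical source URL – a placeholder, not a real NHS endpoint
    Source = Csv.Document(Web.Contents("https://example.org/nhs/healthcare-sites.csv")),
    // Promote the first row of the CSV to column headers
    PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true])
in
    PromotedHeaders
```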

01 - Source Entities.png

02 - New Dataflow

And once I’ve created my dataflow, I’ll manually refresh it so that it contains data.

03 - Refreshing the dataflow

I should point out two things in these screen shots that may not seem important at first glance:

  • The workspace is backed by Power BI Premium dedicated capacity
  • The workspace is a new “v2” workspace, not a “v1” workspace tied to an Office 365 group

Both of these criteria must be met for linked entities to be used.

Now that I have the source dataflow created, I can create the dataflow for the linked entities.

04 - Create new dataflow

To get started with the dataflow, instead of selecting “Add new entities” like I’ve done before, I will select “Add linked entities” when prompted.

05 - Add linked entities

Linked entities are their own data source, so I’m prompted to authenticate, using my AAD organizational account to sign in to the Power BI service.

06 - Authenticate

Once I’ve signed in, I can choose from the available dataflow entities. Although this may not be obvious from the screen shot below[6], I can only select from “v2” workspaces on Premium capacity. There are additional workspaces in my Power BI tenant that I have permission to access, and which contain dataflows, but they do not appear in this list.

07 - Selecting entities.png

Once I’ve selected the entities to which I want to link, I’m taken to Power Query editor.

You may be asking why the “Applied steps” pane shows multiple query steps, even though no transformations are applied to linked queries. In case you aren’t asking this, I’ve thoughtfully drawn a red arrow. See?

08 - Linked Entities Created.png

If you look at the Advanced editor for a linked entity, you’ll see that the query uses PowerBI.Dataflows as its source, and then navigates to the workspace, the dataflow, and the entity in order to create the linked entity. The linked entity will be read-only in this workspace.

09 - Advanced Editor
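To make that concrete, here’s a sketch of the kind of query the editor generates for a linked entity. The GUIDs and the entity name below are placeholders, not values from a real tenant:

```m
let
    // Connect to the dataflows in the Power BI service
    Source = PowerBI.Dataflows(null),
    // Navigate: workspace -> dataflow -> entity (IDs are placeholders)
    Workspace = Source{[workspaceId = "00000000-0000-0000-0000-000000000001"]}[Data],
    Dataflow = Workspace{[dataflowId = "00000000-0000-0000-0000-000000000002"]}[Data],
    Entity = Dataflow{[entity = "Healthcare Sites"]}[Data]
in
    Entity
```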

Now that I have two linked entities in my dataflow, I can use them as the starting point for additional data preparation and mashup. Keep in mind that this is a very simple example, with two linked entities coming from one dataflow. A more realistic scenario might involve linking to entities in multiple dataflows from multiple workspaces. It all depends on your scenario and requirements – Power BI will make the data easy to discover and navigate, so you can get started quickly and easily.

For this example, I’m going to merge these two linked entities to create a new computed entity.

10 - Merge as new.png

11 - Merge

Once the merge is complete and the new entity is created, I can edit it further. But first, there are a few things I should point out.

12 - Merged query with warning.png

First, notice the icons in the query list.

The first two have a “link” icon superimposed on the query icon. This shows that these are linked entities, and that they are read-only “pointers” to entities defined and stored elsewhere. The data for linked entities remains in the CDM Folder backing the source dataflow; only the pointer exists in this dataflow, and no data is duplicated.

The third entity has a “lightning bolt” icon that shows that this is “the entity who lived” after surviving the dark lord’s killing curse. Wait, no, that’s not right. Wrong lightning bolt. This icon shows that this is a computed entity. Computed entities are entities that reference linked entities, and which rely on in-storage calculations for performance and scale. Unlike “normal” entities[7], which use the Power Query M engine to extract data from external data sources and load it into CDM Folders in Azure Data Lake Storage gen2, a computed entity has CDM Folders managed by the Power BI service as both its source and its destination, which allows the compute to take place close to the storage.

Second, notice the warning at the bottom of the editor. Because I’ve just merged two queries, one of the columns in the resulting new query represents the right table for the merge. This complex type isn’t supported by CDM Folders, so I will need to expand the column before I proceed. This is great, because I wanted to do that anyway.
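In M terms, the merge-and-expand looks something like the sketch below. The join key and the expanded column names are illustrative – the post doesn’t show the actual keys in these data sets:

```m
let
    // Merge the two linked entities on a shared key (column names are illustrative)
    Merged = Table.NestedJoin(
        #"Healthcare Sites", {"SiteCode"},
        #"Food Quality", {"SiteCode"},
        "Food Quality",
        JoinKind.LeftOuter),
    // Expand the nested table column – the complex column type
    // isn't supported by CDM Folders, so it must be expanded
    Expanded = Table.ExpandTableColumn(
        Merged, "Food Quality", {"Rating", "InspectionDate"})
in
    Expanded
```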

13 - Expand food quality.png

See? Now the warning is gone!

14 - Finished query, no warning.png

Now I can click “Done” to save my changes and return to the entity list for the dataflow.

15 - Linked and Computed Entities

Notice how the “link” and “lightning” icons are retained here as well, and how the two linked entities are explicitly labeled and include a link to the source dataflow.

The last thing we need to do is save the dataflow that includes our linked and computed entities.

16 - Saving

Now we just need to refresh, to populate the computed entity with data.

But we do not need to refresh the dataflow that contains the linked and computed entities. That would be silly, because dataflows are like Excel, remember?

In an Excel workbook, what do you change if you want the value of a cell with a formula to change? That’s right – you change a cell referenced by the formula, and the formula will automatically update the value of its cell. And that’s what we’re going to do here.

We’re going to refresh the dataflow that contains the “normal” entities referenced by the linked entities. If the queries behind linked and computed entities are like Excel formulas, “normal” entities are like the data entry cells in an Excel worksheet. Refreshing them is like entering new data into those cells – it is how new data enters the system.

So let’s refresh that source dataflow.

17 - Refreshing source dataflow

And as soon as its processing is complete, the Power BI service will automatically refresh the dependent dataflow.

18 - Magic

How does it know? The Power BI service includes a calculation engine that understands the relationships between all entities in all dataflows in all workspaces in the tenant. Just like Excel understands the relationships between cells, Power BI understands the relationships between entities. And just as Excel will automatically run all formulas that use a value that has changed, Power BI will automatically refresh the dataflows that contain linked and computed entities when the dataflows that they reference have been updated.

Wow, this post ended up being longer than I’d expected. Remember how I mentioned that this is the part of Power BI dataflows that I’m most excited about? Do you believe me now?[9]


[1] I  think. I’m pretty sure. Maybe.

[2] For example, control flow in SSIS.

[3] Were you rolling your eyes when you read this, wondering why I would go into so much detail on a topic that everyone has taken for granted for decades? Of course Excel does these things – that’s just how it works… right?

[4] Awesome, game-changing, magical, etc.

[5] Even more excited. I’m excited by default when I’m talking about dataflows.

[6] Unless you’ve been paying really close attention to the screen shots in other posts.

[7] We really need a name for entities that are neither linked nor computed, but lacking one, I’ll call them “normal” entities. Suggestions are welcome[8].

[8] Not that these suggestions will make their way into the product, of course, but they’re still welcome. If nothing else, receiving suggestions will let me know that people read the footnotes.

[9] Maybe you shouldn’t believe me just yet. Dataflows are big, really big, and there’s a lot coming that we’re not talking about yet. You’ll need to stay tuned.


21 thoughts on “Dataflows in Power BI: Overview Part 6 – Linked and Computed Entities”


  3. Dhawal

    Hi Matthew,

    I think instead of “Normal”, “Common” or “Regular” entity would be a more natural name. 🙂

    Dhawal Mehta


  4. Mark

    Super informative post, Matthew! Thanks a ton. I didn’t see Linked Entities in my private preview environment, but at least now I have a familiarity with them. What are some use cases for linked/computed entities? Is it pretty much an “extension” of the base entity with added transforms/logic?


    1. A) You should look again, because if you’re in the dataflows private preview, you should have access to this feature. Just be aware (as pointed out in the post) that this is available only in the new “v2” workspaces, so that may be why you haven’t seen them.

      B) The use cases that seem most exciting to me are based on common data warehousing patterns. If you have a set of entities that represent staging data, you can use them as sources for computed entities that cleanse, enrich, standardize, and otherwise transform the data to make it more suited for analysis… but this perspective is definitely colored by my personal history. I can’t wait to hear what you do with them!


  5. Mark

    I see them now; I had to create a new “v2” App Workspace.

    Looking forward to tinkering around more and spreading the word about dataflows. Thanks again for your response and continued work/blog posts! Cheers.

