This week’s Power BIte is the second in a series of videos that present different ways to create new Power BI dataflows, and the results of each approach.
When creating a dataflow using linked and computed entities, the final dataflow will have the following characteristics:
Data ingress path
Ingress via the mashup engine hosted in the Power BI service, using source data that is also managed by the Power BI service, taking advantage of locality of data.
Data for computed entities is stored in the CDM folder defined for the dataflow. Data for linked entities remains in the source dataflow’s CDM folder and is not moved or copied.
The dataflow is refreshed based on the schedule and policies defined in the workspace.
Let’s look at the dataflow’s model.json metadata to see some of the details.
At the top of the file we can see the mashup definition, including the query names and load settings on lines 11 through 35 and the Power Query code for all of the entities on line 37. This will look awfully familiar from the last Power BIte post.
Things start to get interesting and different when we look at the entity definitions:
On line 80 we can see that the Product entity is defined as a ReferenceEntity, which is how the CDM metadata format describes what Power BI calls linked entities. Rather than having its attribute metadata defined in the current dataflow’s model.json file, a linked entity identifies the source entity it references, and the CDM folder in which that source entity is stored, similar to what we saw in the last example. Each linked entity’s modelId value references an id value in the referenceModels section, as we’ll see below.
The Customers with Addresses entity, defined starting on line 93, is the computed entity built in the video demo. This entity is a LocalEntity, meaning that its data is stored in the current CDM folder, and its metadata includes both its location and its full list of attributes.
The end of the model.json file highlights the rest of the differences between local and linked entities.
At line 184 we can see the partitions for the Customers with Addresses entity, including the URL for the data file backing this entity. Because the other entities are linked entities, their partitions are not defined in the current model.json.
Instead, the CDM folders where their data does reside are identified in the referenceModels section starting at line 193. The id values in this section match the modelId values of the linked entities above, and the location values are the URLs of the model.json files that define the source CDM folders for the linked entities.
If this information doesn’t make sense yet, please hold on. We’ll have different values for the same attributes for other dataflow creation methods, and then we can compare and contrast them.
I guarantee it will make as much sense as anything on this blog.
This week’s Power BIte is the first in a series of videos that present different ways to create new Power BI dataflows, and the results of each approach.
When creating a dataflow by defining new entities in Power Query Online, the final dataflow will have the following characteristics:
Data ingress path
Ingress via the mashup engine hosted in the Power BI service
Data stored in the CDM folder defined for the newly created dataflow
The dataflow is refreshed based on the schedule and policies defined in the workspace
Let’s look at the dataflow’s model.json metadata to see some of the details.
At the top of the file we can see the mashup definition, including the query names and load settings on lines 11 through 19 and the Power Query code for all of the entities on line 22.
At the bottom of the file we can see information about the refresh and storage. Line 26 identifies the entity as a LocalEntity, which means that the entity’s data is physically stored in the current CDM folder.
Line 30 shows that the entity is fully refreshed rather than incrementally refreshed, and line 31 shows the file name where the entity data is stored. Lines 97 through 99 identify the single partition where the data for the current version of the entity is stored, including the full URI for the data file. If this entity used incremental refresh, there would be multiple partitions to match the incremental refresh policy.
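The relationship between entities and their partitions can be sketched in a few lines of Python. This is an illustration only, assuming the simplified model.json shape shown in the screenshots ($type, name, and partitions with location properties); the `entity_partitions` helper name is my own.

```python
def entity_partitions(model):
    """List the data file URLs backing each local entity in a parsed
    model.json. A fully refreshed entity typically has a single partition;
    with incremental refresh there is one partition per refresh window."""
    return {
        entity["name"]: [p["location"] for p in entity.get("partitions", [])]
        for entity in model.get("entities", [])
        if entity.get("$type") == "LocalEntity"
    }
```

You would feed it the result of `json.load()` on a dataflow’s model.json file, and get back a dictionary mapping each local entity to the URI(s) of its data files.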
If this information doesn’t all make sense just yet, please hold on. We’ll have different values for the same attributes for other dataflow creation methods, and then we can compare and contrast them.
I guarantee it will make as much sense as anything on this blog.
 New videos every Monday morning!
 The same information is also included starting on line 103 for the Promotions entity, but is not pictured here.
Last week I kicked off the new BI Polar YouTube channel with a video and blog post comparing and contrasting Power BI dataflows and datasets. In the days that followed, I continued to hear questions that led me to believe I hadn’t done a great job answering one vital question:
When would I use dataflows, and when would I use shared datasets?
This is still one of the most common dataflows questions: what’s the difference between Power BI dataflows and Power BI datasets?
For the last year I have resisted tackling this question head-on. This isn’t because it’s a bad or “dumb” question. Just the opposite – this is a very simple question, and the simpler a question is, the more complex and nuanced the answer is likely to be.
If you’re a regular reader of this blog, you probably already know the answer, because I’ve answered it already. Sort of. The existing answer is distributed across dozens of posts, and if you’ve read all of them, you’ve probably picked up the answer along the way. But I keep hearing this question, and I keep thinking that there must be a more direct answer I could share.
Here it is, in a single, simple table.
| | Power BI dataflows | Power BI datasets |
| --- | --- | --- |
| Metadata format | Common Data Model – model.json | Analysis Services tabular model |
| Development tool | Power Query Online | Power Query in Power BI Desktop |
| Reuse mechanism | Acts as data source in multiple datasets | Shared datasets across workspaces |
| Scope of reuse | Entity level reuse | Dataset level reuse |
| Mashup with other data sources | Yes | No – reused as-is |
| Used for reporting | Indirectly, through datasets | Yes |
| Reuse outside Power BI | Yes, through ADLSg2 | Yes, through XMLA |
| Data access methods | Import only | Import and/or DirectQuery |
| Certification and promotion | | |
What else am I missing?
Please let me know!
Seriously, you should let me know.
Update: I’ve added a few rows to the table after the post was originally published, to incorporate feedback from readers on differences I had missed. Thank you!
Each of the rows in this table could easily be an in-depth topic in and of itself, so if you’re looking at any of them and thinking “that’s not quite right” I might very well agree with you. There’s a lot of context and a lot of nuance here, and we’re trying to sum things up in a word or two… which is kind of the whole point.
Oh yeah, there’s a video too.
I can’t wait to hear what you think!
 A simple table with ten footnotes.
 The storage aspect of dataflows and datasets is one of the most significant differences between the two. Datasets use the Vertipaq column store to load data into a highly compressed in-memory representation that is optimized for analysis. Dataflows use text files in folders, which are optimized for interoperability.
 The Analysis Services Tabular engine uses the BI Semantic Model (BISM) to represent its metadata. This is a metadata model originally included in SQL Server 2012 Analysis Services, and used by the Tabular engine ever since.
 Saying “this is the primary purpose” of any complex tool is fraught with risk, because no matter what you say, there are other valid things that remain unsaid. With this said… the big gap that dataflows close is that of self-service data preparation for the purpose of data sharing and reuse. Power BI has always had self-service data preparation through Power Query, but before dataflows the data that was prepared was “locked” in a dataset, for analysis, and not for sharing or reuse.
 Once you have loaded data into dataflows, authorized users can reuse entities from multiple dataflows, and use them as the building blocks for new dataflows or new datasets. Once you have loaded data into a dataset (and published it to the Power BI service) you can enable users to connect to it.
 With dataflows, users can pick and choose the entities they want, but a dataset can only be reused as-is.
 Dataflow entities can be used as data sources in the same Power BI Desktop file as other data sources, and can serve as part of a mashup or composite model, but a dataset can only be reused as-is.
 Although you can obviously use dataflows for reporting, you do so by first importing the data from the dataflow into a dataset.
 It’s interesting to point out that using your own organizational ADLSg2 account does not require Power BI Premium, but using the XMLA endpoint to connect to Power BI datasets from non-Power BI clients does.
 You can only import data into your dataflow entities, but tables in your dataset can import data or use DirectQuery, and a dataset can use a combination of the two.
 You can only import data from a dataflow into a dataset. When connecting to a shared dataset you can only use Live Connections.
 I’ve been thinking of making videos to supplement this blog for almost as long as I’ve been hearing the question that inspired this post. Please take a moment to share your thoughts on the video. This is something of a “soft launch” and although I have plans for a few dozen more videos already, your feedback will be a main factor in how the video series evolves.
Power BI dataflows have included capabilities for data lineage since they were introduced in preview way back in 2018. The design of dataflows, where each entity is defined by the Power Query query that provides its data, enables a simple and easy view into each entity’s data lineage. The query is the authoritative statement on where the entity’s data comes from, and how it is transformed.
But what about everything else in a workspace? What about datasets, and reports, and dashboards? What about them?
Power BI has your back.
Late last month the Power BI team released a new preview capability that lets users view workspace content in a single end-to-end lineage view, in addition to the familiar list view.
Once the lineage view is selected, all workspace contents – data sources, dataflows, datasets, dashboards, and reports – are displayed, along with the relationships between them. Here’s a big-picture view of a workspace I’ve been working in lately:
There’s a lot to unpack here, so I’ll break down what feels to me like the important parts:
The primary data source is a set of text files in folders. The text files are produced by various web scraping processes, and each has a different format and contents.
The secondary data source is a set of reference and lookup data stored in Excel workbooks in SharePoint Online. These workbooks contain manually curated data that is used to cleanse, standardize, and/or enrich the data from the primary source.
The primary data is staged with minimal transformation in a “raw” dataflow. This data is then progressively processed by a series of downstream dataflows, including mashing it up with the secondary data from Excel and reshaping it into facts and dimensions.
There is one dataset based on the fact and dimension entities, and a report based on this dataset. There’s a second dataset that includes data quality metrics from entities in multiple dataflows, and a report based on that dataset. And there are two dashboards: one that includes only visuals for data quality metrics, and one that presents the main data along with a few tiles from the quality report.
That overview is so simplified as to be nearly worthless from a technical perspective, but it’s still a wall of text. Who wants to read that?
For a real-world workspace that implements a production BI application, there is likely to be more complexity, and less well defined boundaries between objects. How do you document the contents of a complex workspace, and the relationships between those components? How do you understand them well enough to identify and solve problems?
That’s where the lineage view comes in.
Let’s begin by looking at the data sources.
For data sources that use a gateway, I can easily see the gateway name. For other data sources I can see the data source location. We’re off to a good start, because I have a single place to look to see where my data is coming from.
Next, let’s look at the dataflows.
In addition to being able to see the dataflows and the dependencies between them, you can click on any dataflow to see the entities it contains, and can jump directly to edit the dataflow from this view.
This part of workspace lineage isn’t completely new – this is essentially what you could do with dataflows already. But now you can do it with datasets, reports, and dashboards as well.
Selecting a dataset shows me the tables it contains, and selecting a dashboard or report takes me directly to the visualization. But the real power of this view comes from the relationships between objects. The relationships are where data lineage comes to the fore.
The two primary questions asked in the context of data lineage concern upstream lineage (“where does this data come from?”) and downstream lineage (“where is this data used?”).
The first question is often asked in the context of “why am I not seeing what I expect to see?” and the resulting investigation looks at upstream logic and data sources to identify the root cause of problems.
The second question is often asked in the context of “what might break if I change this?” and the resulting investigation looks at downstream objects and processes.
The lineage view has a simple way to answer both questions: just click on the “double arrow” icon and the view will change to highlight all upstream and downstream objects. In a single click you can see where the data comes from, and where the data is used. Click again, and the view toggles back to the default view.
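Under the covers, both lineage questions boil down to reachability over a directed graph of workspace artifacts. As an illustration only (the artifact names and edges below are made up, loosely modeled on the workspace described above, and are not how the Power BI service actually represents lineage), a breadth-first traversal answers both questions:

```python
from collections import defaultdict, deque

def impact(edges, start):
    """Given directed lineage edges (source -> consumer), answer both
    lineage questions for one artifact: everything upstream of it
    ("where does this data come from?") and everything downstream
    ("what might break if I change this?")."""
    downstream = defaultdict(set)
    upstream = defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)

    def reach(graph):
        # Standard breadth-first traversal from the starting artifact
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    return reach(upstream), reach(downstream)
```

With edges like text files → raw dataflow → dimensional dataflow → dataset → report, asking about the dimensional dataflow returns the raw dataflow and files upstream, and the dataset and report downstream, which is exactly what the “double arrow” highlight shows visually.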
There’s more to lineage view than this, including support for shared and certified datasets, but this should be enough to get you excited. Be sure to check out the preview documentation as you check out the feature!
Update: We now have a video to supplement the blog post. Check it out!
Update: The Power BI blog now has the official announcement for this exciting feature. The blog post includes a look at where the lineage team is planning to invest to make this feature even better, and notes that all of the information in the lineage view is now available through the Power BI API. If you want to integrate lineage and impact analysis into your own tools, or if you want to build a cross-workspace lineage view, you now have the APIs you need to be successful!
 This is a pet project that may one day turn into a viable demo, assuming work and life let me devote a little more time to it…
 Different, annoying, and difficult to clean up.
 For example, the source web site allows any user to contribute, and although the contribution process is moderated there is no enforcement of content or quality. One artist may be credited for “guitar” on one album, “guitars” on another, and “lead guitar” on a third. This sounds pretty simple until you take into account that there were close to 50,000 different “artist roles” in the raw source data that needed to be standardized down to a few hundred values in the final data model.
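To illustrate the kind of standardization involved: in the actual workspace this was done with Power Query and curated lookup tables in Excel, so the Python below is only an illustrative sketch, and the mapping values are made up.

```python
def standardize_role(raw_role, role_map):
    """Map one free-text artist role to its curated standard value.
    Returns (value, matched); unmatched roles are flagged for manual
    review instead of being silently passed through."""
    key = raw_role.strip().lower()
    if key in role_map:
        return role_map[key], True
    return raw_role, False

# A tiny illustrative slice of a curated mapping; the real lookup data
# lived in Excel workbooks and covered close to 50,000 raw values.
ROLE_MAP = {
    "guitar": "Guitar",
    "guitars": "Guitar",
    "lead guitar": "Guitar",
}
```

The important design point is the flag on unmatched values: new raw roles surface for a human to curate, which is exactly the role the manually maintained Excel workbooks play in the dataflow pipeline.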
I post about Power BI dataflows a lot, but that’s mainly because I love them. My background in data preparation and ETL, combined with dataflows’ general awesomeness, makes them a natural fit for my blog. This means that people often think of me as “the dataflows guy” even though dataflows are actually a small part of my role on the Power BI CAT team. Most of what I do at work is to help large enterprise customers successfully adopt Power BI, and to help make Power BI a better tool for their scenarios.
As part of my ongoing conversations with senior stakeholders from these large global companies, I’ve noticed an interesting trend emerging: customers describing self-service BI as a two-edged sword. This trend is interesting for two main reasons:
It’s a work conversation involving swords
Someone other than me is bringing swords into the work conversation
As someone who has extensive experience with both self-service BI and two-edged swords, I found myself thinking about these comments more and more – and the more I reflected, the more I believed this simile holds up, but not necessarily in the way you might suspect.
The two sharp edges of a sword each serve distinct and complementary purposes.
A competent swordsperson knows how and when to use each, and how to use them effectively in combination.
Having two sharp edges is only dangerous to the wielder if they are ignorant of their tool.
A BI tool like Power BI, which can be used for both “pro” IT-driven BI and self-service business-driven BI, has the same characteristics, and to use it successfully at scale an organization needs to understand its capabilities and know how to use both “edges” effectively in combination.
As you can imagine, there’s more to it than this, so you should probably watch the session recording.
For those who are coming to the Microsoft Business Applications Summit next week, please consider joining the CAT team’s “Enterprise business intelligence with Power BI” full-day pre-conference session on Sunday. Much of the day will be deep technical content, but we’ll be wrapping up with a revised and refined version of this content, with a focus on building a center of excellence and a culture of data in your organization.
There is also a video of the final demo where Adam Saxton joined me to illustrate how business and IT can work together to effectively respond to unexpected challenges. If you ever wondered what trust looks like in a professional environment, you definitely want to watch this video.
 This may be even more exciting for me than Power BI dataflows are, but it’s not as obvious how to share this in blog-sized pieces.
 Without this second point, it probably wouldn’t be noteworthy. I have a tendency to bring up swords more often in work conversations than you might expect.
 And if you’ve been paying attention for very long, you’ll probably expect this to come up pretty often.
Last week I delivered two online sessions on the topic of integrating Power BI dataflows with an organizational Azure Data Lake Storage Gen2 storage account. I’ve blogged on this topic before (link | link | link | link) but sometimes a presentation and demo is worth a thousand words.
On April 30th I delivered a “Power BI dataflows 201 – beyond the basics” session for the PASS Business Intelligence Virtual Chapter. The session recording is online here, and you can download the slides here.
On May 4th I delivered an “Integrating Power BI and Azure Data Lake with dataflows and CDM Folders” session for the SQL Saturday community event in Stockholm, Sweden. I was originally planning to deliver the Stockholm session in person, but due to circumstances beyond my control I ended up presenting remotely, which meant that I could more easily record the session. The session recording is online here, and you can download the slides here.
Each of these sessions covers much of the same material. The Stockholm presentation got off to a bit of a rocky start, but it gets smoother after the first few minutes.
Please feel free to use these slides for your own presentations if they’re useful. And please let me know if you have any questions!
Update: I delivered an updated version of this session on September 27, at the Power Platform World Tour in Vancouver, BC. There is no recording of this session, but the updated slides can be downloaded here.
 I forgot to book flights. Seriously, I thought I had booked flights in February when I committed to speaking, and by the time I realized that I had not booked them, they were way out of my budget. This was not my finest hour.
 The presentation was scheduled to start at 6:00 AM, so I got up at 4:00 and came into the office to review and prepare. Instead I spent the 90 minutes before the session start time fighting with PC issues and got everything working less than a minute before 6:00. I can’t remember ever coming in quite this hot…