Last week I kicked off the new BI Polar YouTube channel with a video and blog post comparing and contrasting Power BI dataflows and datasets. In the days that followed, I continued to hear questions that led me to believe I hadn’t done a great job answering one vital question:
When would I use dataflows, and when would I use shared datasets?
You probably already know that Power BI dataflows store their data in CDM folders. But what does this actually mean?
This is a quick post to share information that I hope will answer some of the most common questions that I hear from time to time, and which I discuss when I present on Power BI dataflows integration with Azure. I don’t believe any of the information in this post is new or unique, but I do believe it is delivered in a more targeted manner that might help.
Point #1: CDM is a metadata system
The Common Data Model is a metadata system that simplifies data management and application development by unifying data into a known form and applying structural and semantic consistency across multiple apps and deployments. If you’re coming from a SQL Server background, it may help to think about CDM as the “system tables” for data that’s stored in multiple locations and formats. This analogy doesn’t hold up to particularly close inspection, but it’s a decent place to start.
Point #2: CDM includes standard entity schemas
In addition to being a metadata system, the Common Data Model includes a set of standardized, extensible data schemas that Microsoft and its partners have published. This collection of predefined schemas includes entities, attributes, semantic metadata, and relationships. The schemas represent commonly used concepts and activities, such as Account and Campaign, to simplify the creation, aggregation, and analysis of data.
Point #3: CDM folders are data storage that use CDM metadata
A CDM folder is a folder in a data lake that conforms to specific, well-defined, and standardized metadata structures and self-describing data. These folders facilitate metadata discovery and interoperability between data producers and data consumers.
CDM folders store metadata in a model.json file; this is what makes them self-describing. This metadata conforms to the CDM metadata format, and can be read by any client application or code that knows how to work with CDM.
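To make the "self-describing" part a little more concrete, here is a minimal Python sketch that pulls entity and attribute names out of a model.json document. The sample JSON is hand-written and heavily simplified (the dataflow, entity, and attribute names are made up), and the helper name `describe_entities` is hypothetical. A real model.json carries more properties than this, so treat it as an illustration rather than a reference.

```python
import json

# Hand-written, simplified stand-in for a CDM folder's model.json.
# The dataflow, entity, and attribute names are invented for illustration.
SAMPLE_MODEL_JSON = """
{
  "name": "SalesDataflow",
  "version": "1.0",
  "entities": [
    {
      "$type": "LocalEntity",
      "name": "Customer",
      "attributes": [
        {"name": "CustomerId", "dataType": "int64"},
        {"name": "CustomerName", "dataType": "string"}
      ]
    }
  ]
}
"""


def describe_entities(model_json_text):
    """Return {entity name: [(attribute name, data type), ...]}."""
    model = json.loads(model_json_text)
    return {
        entity["name"]: [
            (attr["name"], attr["dataType"])
            for attr in entity.get("attributes", [])
        ]
        for entity in model.get("entities", [])
    }


if __name__ == "__main__":
    for entity_name, attributes in describe_entities(SAMPLE_MODEL_JSON).items():
        print(entity_name, attributes)
```

Note that nothing in the sketch depends on the entity being a standard one: the code only needs the metadata that the folder itself provides.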
Point #4: You don’t need to use any standard entities
The most common misconception I hear about CDM and CDM folders is that you only use them when you’re storing “standard data.” This is not correct. The data in a CDM entity may map to a standard entity schema, but for 99% of the entities I have built or used, this is not the case. There is nothing in CDM or CDM folders that requires you to use a standard schema.
I hope this helps – please let me know if you have questions!
 Check out the documentation for CDM and CDM folders here and here, and here for more detail. You’ll probably notice that some chunks of text in this post were simply copied from that documentation.
I do a lot that’s not related to dataflows. In fact, dataflows take up a surprisingly small part of my day, which you might not guess if your only insight into my calendar came from this blog.
Despite this, I like to believe that I’m keeping my finger on the pulse of this feature, and when I learned today that the dataflows team had published best practice guidance almost a month ago, I was shocked and surprised.
These practices were produced by the dataflows team, and are based on questions and support tickets from customers around the world. Definitely check them out, and see how you can incorporate them into your Power BI solutions!
This is still one of the most common dataflows questions: what’s the difference between Power BI dataflows and Power BI datasets?
For the last year I have resisted tackling this question head-on. This isn’t because it’s a bad or “dumb” question. Just the opposite – this is a very simple question, and the simpler a question is, the more complex and nuanced the answer is likely to be.
If you’re a regular reader of this blog, you probably already know the answer, because I’ve answered it already. Sort of. The existing answer is distributed across dozens of posts, and if you’ve read all of them you’ve probably picked up the answer along the way. But I keep hearing this question, and I keep thinking that there must be a more direct answer I could share.
Here it is, in a single, simple table.
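| | Power BI dataflows | Power BI datasets |
| Implementation | CDM folder | Analysis Services tabular model |
| Storage | Text files in folders | Vertipaq |
| Metadata | Common Data Model – model.json | BISM |
| Development | Power Query Online | Power Query in Power BI Desktop |
| Primary purpose | Data sharing and reuse | Data analysis |
| Reuse | Acts as data source in multiple datasets | Shared datasets across workspaces |
| Scope of reuse | Entity level reuse | Dataset level reuse |
| Mashup with other data sources | Yes | No |
| Used for reporting | Not directly | Yes |
| Reuse outside Power BI | Yes, through ADLSg2 | Yes, through XMLA |
| Data access methods | Import | Import, DirectQuery |
| Connection methods | Import | Live Connection |
| Certification and promotion | | |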
What else am I missing?
Please let me know!
Seriously, you should let me know.
Update: I’ve added a few rows to the table after the post was originally published, to incorporate feedback from readers on differences I had missed. Thank you!
Each of the rows in this table could easily be an in-depth topic in and of itself, so if you’re looking at any of them and thinking “that’s not quite right” I might very well agree with you. There’s a lot of context and a lot of nuance here, and we’re trying to sum things up in a word or two… which is kind of the whole point.
Oh yeah, there’s a video too.
I can’t wait to hear what you think!
 A simple table with ten footnotes.
 The storage aspect of dataflows and datasets is one of the most significant differences between the two. Datasets use the Vertipaq column store to load data into a highly compressed in-memory representation that is optimized for analysis. Dataflows use text files in folders, which are optimized for interoperability.
 The Analysis Services Tabular engine uses the BI Semantic Model (BISM) to represent its metadata. This is a metadata model originally included in SQL Server 2012 Analysis Services, and used by the Tabular engine ever since.
 Saying “this is the primary purpose” of any complex tool is fraught with risk, because no matter what you say, there are other valid things that remain unsaid. With this said… the big gap that dataflows close is that of self-service data preparation for the purpose of data sharing and reuse. Power BI has always had self-service data preparation through Power Query, but before dataflows the data that was prepared was “locked” in a dataset, for analysis, and not for sharing or reuse.
 Once you have loaded data into dataflows, authorized users can reuse entities from multiple dataflows, and use them as the building blocks for new dataflows or new datasets. Once you have loaded data into a dataset (and published it to the Power BI service) you can enable users to connect to it.
 With dataflows, users can pick and choose the entities they want, but a dataset can only be reused as-is.
 Dataflow entities can be used as data sources in the same Power BI Desktop file as other data sources, and can serve as part of a mashup or composite model, but a dataset can only be reused as-is.
 Although you can obviously use dataflows for reporting, you do so by first importing the data from the dataflow into a dataset.
 It’s interesting to point out that using your own organizational ADLSg2 account does not require Power BI Premium, but using the XMLA endpoint to connect to Power BI datasets from non-Power BI clients does.
 You can only import data into your dataflow entities, but tables in your dataset can import data or use DirectQuery, and a dataset can use a combination of the two.
 You can only import data from a dataflow into a dataset. When connecting to a shared dataset you can only use Live Connections.
 I’ve been thinking of making videos to supplement this blog for almost as long as I’ve been hearing the question that inspired this post. Please take a moment to share your thoughts on the video. This is something of a “soft launch” and although I have plans for a few dozen more videos already, your feedback will be a main factor in how the video series evolves.
In a recent post I mentioned an approach for working around the import-only nature of Power BI dataflows as a data source in Power BI Desktop, and in an older post I shared information about the enhanced compute engine that’s currently available in preview.
Some recent conversations have led me to believe that I should summarize a few points about dataflows and query folding, because these existing posts don’t make them easy to find and understand.
When accessing dataflow entities from Power BI Desktop, no query folding takes place, even if the enhanced compute engine is enabled.
When accessing dataflow entities from other entities in the Power BI service, no query folding takes place unless the enhanced compute engine is enabled.
When accessing dataflow entities from other entities in the Power BI service, query folding will take place when the enhanced compute engine is enabled, because the linked entity’s query will be executed against the cached data in SQL, rather than the underlying CDM folder.
These three statements summarize how query folding works – or does not work – in Power BI dataflows today.
The Power BI team has discussed some of their plans for the enhanced compute engine, so this should change in the future, but as of today, the only dataflows scenario where query folding takes place is when a dataflow that is backed by the enhanced compute engine is referenced by a linked entity.
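If it helps to see the three statements above as a single rule, here is a tiny Python sketch. It doesn't call anything in Power BI; the function name and the `"desktop"` / `"linked_entity"` labels are hypothetical, and it only encodes the behavior described in this post as of today.

```python
def query_folding_takes_place(accessed_from, enhanced_compute_enabled):
    """Encode the three statements about dataflows and query folding.

    accessed_from: "desktop" for Power BI Desktop, or "linked_entity" for
    another entity in the Power BI service (both labels are made up here).
    """
    if accessed_from == "desktop":
        # No folding from Power BI Desktop, regardless of the engine setting.
        return False
    if accessed_from == "linked_entity":
        # Folding only happens when the query runs against the engine's
        # cached data in SQL, rather than the underlying CDM folder.
        return enhanced_compute_enabled
    raise ValueError(f"unknown consumer: {accessed_from!r}")


if __name__ == "__main__":
    for consumer in ("desktop", "linked_entity"):
        for engine_on in (False, True):
            print(consumer, engine_on, query_folding_takes_place(consumer, engine_on))
```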
I hope this helps clarify things, at least a little…
 I think this will be around the time the engine goes GA, but I don’t remember for sure, and I’m too lazy to re-watch the MBAS session to double check. If you watch it and let me know, I’ll gladly update this post with the details.
I received today what is becoming a common question:
I have a customer who is using dataflows to populate a 255M row fact table, refreshing the last 2 months of data (which is working) but experiencing issues when trying to connect to that Dataflow using Power BI Desktop. Suggestions?
Power BI dataflows are an import-only data source when being used from Power BI Desktop. This means that if you’re building a model in Power BI Desktop and are using dataflows as a source, all of the data from the entities you select will be downloaded from Azure and loaded into your data model.
All of it.
To exacerbate the challenge, dataflows – and the CDM folders on which they’re built – do not support query folding, so even if you filter the records in your query, that filter will be applied in Power BI Desktop, not at the sources.
If one of these entities includes very wide records and/or a very large number of records, this can result in a slow and frustrating experience in Power BI Desktop, and in some circumstances can cause the data load to fail. So what can you do?
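Here is a toy Python illustration of why filtering in your query doesn't help: without query folding, the client has to read every row before it can discard the ones it doesn't want. Nothing here talks to Power BI or Azure. `read_entity_rows` is a hypothetical stand-in for downloading an entity, and the row counts and the `year` column are made up.

```python
def read_entity_rows(total_rows, counter):
    """Pretend to download an entity, counting every row that is transferred."""
    for row_id in range(total_rows):
        counter["transferred"] += 1
        yield {"id": row_id, "year": 2018 + row_id % 2}


def load_filtered(total_rows, year):
    """Apply the filter on the client, after the rows have already been read."""
    counter = {"transferred": 0}
    kept = [
        row
        for row in read_entity_rows(total_rows, counter)
        if row["year"] == year
    ]
    return len(kept), counter["transferred"]


if __name__ == "__main__":
    kept, transferred = load_filtered(total_rows=1000, year=2019)
    print(f"kept {kept} rows, but transferred {transferred}")
```

The filter halves the rows that end up in the model, but the number of rows transferred doesn't change at all.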
My preferred approach is to use a pattern I’ve long used when working with other data sources that don’t support DirectQuery or query folding: work with a subset of the data in Power BI Desktop, and then work with the full data set after you publish to the Power BI service.
For dataflows, it looks like this:
Develop and test the dataflows in the Power BI service
Add filters to the too-large dataflow entities’ queries to reduce the number of records in the entities
Refresh the dataflow
Develop the model and reports in Power BI Desktop
Publish the model and reports to the Power BI service
Remove the filters from the filtered dataflow entities
Refresh the dataflow
Refresh the dataset
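The steps above can be sketched in a few lines of Python. This is only a simulation of the idea, under the assumption that the temporary filter is a simple date cutoff. `refresh_entity`, `only_since`, and the `order_date` column are hypothetical names, and in a real dataflow the filter would be a step in the entity's query.

```python
from datetime import date


def refresh_entity(source_rows, only_since=None):
    """Simulate a dataflow refresh.

    only_since is the temporary development filter; pass None to load
    the full entity once the model has been published.
    """
    if only_since is None:
        return list(source_rows)
    return [row for row in source_rows if row["order_date"] >= only_since]


if __name__ == "__main__":
    # Twelve made-up rows, one per month of 2019.
    source = [{"order_date": date(2019, month, 1)} for month in range(1, 13)]

    # While developing in Power BI Desktop: a small, filtered entity.
    dev_entity = refresh_entity(source, only_since=date(2019, 11, 1))

    # After publishing: remove the filter and refresh again.
    full_entity = refresh_entity(source)

    print(len(dev_entity), len(full_entity))
```

The point of the design is that the filter lives in the dataflow, not in Power BI Desktop, so the entity that Desktop has to download is small while you build the model.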
The dataflows team announced at MBAS in June that the enhanced compute engine currently in preview will enable DirectQuery access and query folding when using dataflows from Power BI Desktop at some point – but that point is not today. Until then, I hope this quick pattern will help…
 Like large folders full of large files.
 This is typically just the one or two largest entities, not all of them.
How would you decide how big or how small to make each artifact in the lineage, in terms of the amount of transformations taking place inside the artifact? In my case they would only be shared with 2-3 other users.
For instance I could go all out and have every step that would previously take place in a query editor result in a new link in the data lineage chain, but that would probably be overkill.
I agree that “one step per dataflow” would be overkill, but beyond that the answer is largely “it depends.”
The approach I generally take is to break the end to end data preparation down into blocks that look like this:
1. Staging – getting the source data into the system (in this case dataflow, but could be data mart, data warehouse, data lake, etc.) with zero or minimal transformations
2. Cleansing – correcting known data quality and format problems from the staged data
3. Transformation 1 – getting the cleansed data into the shape required for intended downstream purposes
4. Enrichment – adding data from other sources, which have ideally already gone through steps 1 through 3
5. Transformation 2 – getting the cleansed and enriched data into the shape required for analysis, typically as dimensions and facts
The final step may also be performed in the queries that are used to create the final tabular model when creating a dataset in Power BI Desktop. If a given dimension is likely to be used in multiple datasets, implement it as a dataflow entity. If it isn’t, implement it as a table in your dataset.
These guidelines tend to create a moderate number of easily maintainable entities, but they’re obviously the bare minimum – take what works for you, and discard the rest.
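As a rough sketch of what these blocks look like when each one is its own small, reusable piece, here is the same breakdown in plain Python. Every function, column name, and sample value is hypothetical. In Power BI each function would be a dataflow entity (or a query in the dataset), not Python code.

```python
def stage(raw_rows):
    """Staging: copy the source rows as-is, with no transformations."""
    return [dict(row) for row in raw_rows]


def cleanse(rows):
    """Cleansing: fix known problems (stray whitespace, missing amounts)."""
    return [
        {
            **row,
            "customer": row["customer"].strip(),
            "amount": float(row["amount"] or 0),
        }
        for row in rows
    ]


def transform(rows):
    """Transformation 1: aggregate to one row per customer."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return [{"customer": name, "total": total} for name, total in sorted(totals.items())]


def enrich(rows, regions):
    """Enrichment: add data from another, already-prepared source."""
    return [{**row, "region": regions.get(row["customer"], "Unknown")} for row in rows]


def shape_for_analysis(rows):
    """Transformation 2: split into a dimension table and a fact table."""
    dim_customer = [{"customer": r["customer"], "region": r["region"]} for r in rows]
    fact_sales = [{"customer": r["customer"], "total": r["total"]} for r in rows]
    return dim_customer, fact_sales


def run_pipeline(raw_rows, regions):
    """Chain the five blocks in order."""
    return shape_for_analysis(enrich(transform(cleanse(stage(raw_rows))), regions))


if __name__ == "__main__":
    raw = [
        {"customer": " Contoso ", "amount": "10.5"},
        {"customer": "Contoso", "amount": "4.5"},
        {"customer": "Fabrikam", "amount": ""},
    ]
    dim, fact = run_pipeline(raw, regions={"Contoso": "West"})
    print(dim)
    print(fact)
```

Because each block only depends on the output of the one before it, any intermediate result can be reused by a different downstream consumer, which is the same reason to split the work across multiple dataflow entities.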
I feel like I’m dating myself with this link, but I definitely recommend looking at the Kimball Group’s techniques for data warehousing and BI: resources link. Ralph Kimball and his amazing team know more about this stuff than I will ever forget (or something like that) and there’s a huge volume of guidance available. Do yourself a favor and check it out.
 I assume there are newer resources out there, but when I was your age it was the Kimball Method or the… synonym for highway that rhymes with method.