You probably already know that Power BI dataflows store their data in CDM folders. But what does this actually mean?
This is a quick post to share information that I hope will answer some of the most common questions that I hear from time to time, and which I discuss when I present on Power BI dataflows integration with Azure. I don’t believe any of the information in this post is new or unique, but I do believe it is delivered in a more targeted manner that might help.
Point #1: CDM is a metadata system
The Common Data Model is a metadata system that simplifies data management and application development by unifying data into a known form and applying structural and semantic consistency across multiple apps and deployments. If you’re coming from a SQL Server background, it may help to think about CDM as the “system tables” for data that’s stored in multiple locations and formats. This analogy doesn’t hold up to particularly close inspection, but it’s a decent place to start.
Point #2: CDM includes standard entity schemas
In addition to being a metadata system, the Common Data Model includes a set of standardized, extensible data schemas that Microsoft and its partners have published. This collection of predefined schemas includes entities, attributes, semantic metadata, and relationships. The schemas represent commonly used concepts and activities, such as Account and Campaign, to simplify the creation, aggregation, and analysis of data.
Point #3: CDM folders are data storage that uses CDM metadata
A CDM folder is a folder in a data lake that conforms to specific, well-defined, and standardized metadata structures and self-describing data. These folders facilitate metadata discovery and interoperability between data producers and data consumers.
CDM folders store metadata in a model.json file; this is what makes them self-describing. This metadata conforms to the CDM metadata format, and can be read by any client application or code that knows how to work with CDM.
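To make the "self-describing" part concrete, here's a minimal Power Query sketch that reads a model.json file and lists the entities it describes. The URL is a hypothetical placeholder (and you'd need the right permissions on the lake), and I'm only touching the top-level entities collection – the full CDM metadata format includes much more.

```
// Minimal sketch: list the entity names described by a CDM folder's model.json.
// The URL below is a placeholder – point it at your own ADLSg2 account and CDM folder.
let
    ModelJsonUrl = "https://yourlake.dfs.core.windows.net/powerbi/MyWorkspace/MyDataflow/model.json",
    // Read and parse the metadata file that makes the CDM folder self-describing.
    Model = Json.Document(Web.Contents(ModelJsonUrl)),
    // Each item in the entities list describes one entity: its name, attributes, and data partitions.
    Entities = Model[entities],
    EntityNames = List.Transform(Entities, each _[name])
in
    EntityNames
```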
Point #4: You don’t need to use any standard entities
The most common misconception I hear about CDM and CDM folders is that you only use them when you’re storing “standard data.” This is not correct. The data in a CDM entity may map to a standard entity schema, but for 99% of the entities I have built or used, this is not the case. There is nothing in CDM or CDM folders that requires you to use a standard schema.
I hope this helps – please let me know if you have questions!
Check out the documentation for CDM and CDM folders for more detail. You'll probably notice that some chunks of text in this post were simply copied from that documentation.
I do a lot that’s not related to dataflows. In fact, dataflows take up a surprisingly small part of my day, if your insight into my calendar came solely from this blog.
Despite this, I like to believe that I’m keeping my finger on the pulse of this feature, and when I learned today that the dataflows team had published best practice guidance almost a month ago, I was shocked and surprised.
These practices were produced by the dataflows team, and are based on questions and support tickets from customers around the world. Definitely check them out, and see how you can incorporate them into your Power BI solutions!
This is still one of the most common dataflows questions: what’s the difference between Power BI dataflows and Power BI datasets?
For the last year I have resisted tackling this question head-on. This isn’t because it’s a bad or “dumb” question. Just the opposite – this is a very simple question, and the simpler a question is, the more complex and nuanced the answer is likely to be.
If you’re a regular reader of this blog, you probably already know the answer, because I’ve answered it already. Sort of. The existing answer is distributed across dozens of posts, and if you’ve read all of them and picked up the answer along the way. But I keep hearing this question, and I keep thinking that there must be a more direct answer I could share.
Here it is, in a single, simple table.
| | Power BI dataflows | Power BI datasets |
|---|---|---|
| Storage | Text files in CDM folders | Vertipaq in-memory column store |
| Metadata format | Common Data Model – model.json | Analysis Services tabular model |
| Built with | Power Query Online | Power Query in Power BI Desktop |
| Mechanism of reuse | Acts as data source in multiple datasets | Shared datasets across workspaces |
| Scope of reuse | Entity level reuse | Dataset level reuse |
| Mashup with other data sources | Yes | No |
| Used for reporting | Indirectly, through a dataset | Yes |
| Reuse outside Power BI | Yes, through ADLSg2 | Yes, through XMLA |
| Data access methods | Import only | Import, DirectQuery, or both |
| Certification and promotion | | |
What else am I missing?
Please let me know!
Seriously, you should let me know.
Update: I’ve added a few rows to the table after the post was originally published, to incorporate feedback from readers on differences I had missed. Thank you!
Each of the rows in this table could easily be an in-depth topic in and of itself, so if you’re looking at any of them and thinking “that’s not quite right” I might very well agree with you. There’s a lot of context and a lot of nuance here, and we’re trying to sum things up in a word or two… which is kind of the whole point.
Oh yeah, there’s a video too.
I can’t wait to hear what you think!
 A simple table with ten footnotes.
 The storage aspect of dataflows and datasets is one of the most significant differences between the two. Datasets use the Vertipaq column store to load data into a highly compressed in-memory representation that is optimized for analysis. Dataflows use text files in folders, which are optimized for interoperability.
 The Analysis Services Tabular engine uses the BI Semantic Model (BISM) to represent its metadata. This is a metadata model originally included in SQL Server 2012 Analysis Services, and used by the Tabular engine ever since.
 Saying “this is the primary purpose” of any complex tool is fraught with risk, because no matter what you say, there are other valid things that remain unsaid. With this said… the big gap that dataflows close is that of self-service data preparation for the purpose of data sharing and reuse. Power BI has always had self-service data preparation through Power Query, but before dataflows the data that was prepared was “locked” in a dataset, for analysis, and not for sharing or reuse.
 Once you have loaded data into dataflows, authorized users can reuse entities from multiple dataflows, and use them as the building blocks for new dataflows or new datasets. Once you have loaded data into a dataset (and published it to the Power BI service) you can enable users to connect to it.
 With dataflows, users can pick and choose the entities they want, but a dataset can only be reused as-is.
 Dataflow entities can be used as data sources in the same Power BI Desktop file as other data sources, and can serve as part of a mashup or composite model, but a dataset can only be reused as-is.
 Although you can obviously use dataflows for reporting, you do so by first importing the data from the dataflow into a dataset.
 It’s interesting to point out that using your own organizational ADLSg2 account does not require Power BI Premium, but using the XMLA endpoint to connect to Power BI datasets from non-Power BI clients does.
 You can only import data into your dataflow entities, but tables in your dataset can import data or use DirectQuery, and a dataset can use a combination of the two.
 You can only import data from a dataflow into a dataset. When connecting to a shared dataset you can only use Live Connections.
 I’ve been thinking of making videos to supplement this blog for almost as long as I’ve been hearing the question that inspired this post. Please take a moment to share your thoughts on the video. This is something of a “soft launch” and although I have plans for a few dozen more videos already, your feedback will be a main factor in how the video series evolves.
In a recent post I mentioned an approach for working around the import-only nature of Power BI dataflows as a data source in Power BI Desktop, and in an older post I shared information about the enhanced compute engine that’s currently available in preview.
Some recent conversations have led me to believe that I should summarize a few points about dataflows and query folding, because these existing posts don’t make them easy to find and understand.
When accessing dataflow entities from Power BI Desktop, no query folding takes place, even if the enhanced compute engine is enabled.
When accessing dataflow entities from other entities in the Power BI service, no query folding takes place unless the enhanced compute engine is enabled.
When accessing dataflow entities from other entities in the Power BI service, query folding will take place when the enhanced compute engine is enabled, because the linked entity’s query will be executed against the cached data in SQL, rather than the underlying CDM folder.
These three statements summarize how query folding works – or does not work – in Power BI dataflows today.
The Power BI team has discussed some of their plans for the enhanced compute engine, so this should change in the future, but as of today the only dataflows scenario where query folding takes place is when a dataflow that is backed by the enhanced compute engine is referenced by a linked entity.
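To make the first statement concrete, here's roughly what a Power BI Desktop query against a dataflow entity looks like once a filter step is added. The workspace, dataflow, entity, and column names are hypothetical placeholders, and the navigation steps may differ slightly from what Power BI Desktop generates for you – the point is where the filter runs.

```
// Hypothetical Power BI Desktop query against a dataflow entity.
let
    Source = PowerBI.Dataflows(null),
    Workspace = Source{[workspaceId = "00000000-0000-0000-0000-000000000000"]}[Data],
    Dataflow = Workspace{[dataflowId = "00000000-0000-0000-0000-000000000000"]}[Data],
    Sales = Dataflow{[entity = "Sales"]}[Data],
    // This filter does not fold back to the dataflow or to the enhanced compute engine.
    // The entity's data is downloaded in full, and the filter runs in the local mashup
    // engine inside Power BI Desktop.
    Recent = Table.SelectRows(Sales, each [OrderDate] >= #date(2019, 1, 1))
in
    Recent
```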
I hope this helps clarify things, at least a little…
 I think this will be around the time the engine goes GA, but I don’t remember for sure, and I’m too lazy to re-watch the MBAS session to double check. If you watch it and let me know, I’ll gladly update this post with the details.
I received today what is becoming a common question:
I have a customer who is using dataflows to populate a 255M row fact table, refreshing the last 2 months of data (which is working) but experiencing issues when trying to connect to that Dataflow using Power BI Desktop. Suggestions?
Power BI dataflows are an import-only data source when being used from Power BI Desktop. This means that if you’re building a model in Power BI Desktop and are using dataflows as a source, all of the data from the entities you select will be downloaded from Azure and loaded into your data model.
All of it.
To exacerbate the challenge, dataflows – and the CDM folders on which they're built – do not support query folding, so even if you filter the records in your query, that filter will be applied in Power BI Desktop, not at the source.
If one of these entities includes very wide records and/or a very large number of records, this can result in a slow and frustrating experience in Power BI Desktop, and in some circumstances can cause the data load to fail. So what can you do?
My preferred approach is to use a pattern I’ve long used when working with other data sources that don’t support DirectQuery or query folding: work with a subset of the data in Power BI Desktop, and then work with the full data set after you publish to the Power BI service.
For dataflows, it looks like this:
1. Develop and test the dataflows in the Power BI service
2. Add filters to the too-large dataflow entities' queries to reduce the number of records in the entities (see the sketch after this list)
3. Refresh the dataflow
4. Develop the model and reports in Power BI Desktop
5. Publish the model and reports to the Power BI service
6. Remove the filters from the filtered dataflow entities
7. Refresh the dataflow
8. Refresh the dataset
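Here's a hedged sketch of what the "add filters" step might look like inside a too-large entity's query. The server, database, table, and column names are hypothetical – the idea is simply to add one easy-to-find filter step that you remove once development is done.

```
// Hypothetical dataflow entity query with a temporary development-time filter.
let
    Source = Sql.Database("myserver.database.windows.net", "SalesDb"),
    FactSales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
    // TEMPORARY: keep only the last two months of data while developing the model
    // and reports in Power BI Desktop. Remove this step (and refresh) when done.
    DevFilter = Table.SelectRows(FactSales,
        each [OrderDate] >= Date.AddMonths(Date.From(DateTime.LocalNow()), -2))
in
    DevFilter
```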
The dataflows team announced at MBAS in June that the enhanced compute engine currently in preview will enable DirectQuery access and query folding when using dataflows from Power BI Desktop at some point – but that point is not today. Until then, I hope this quick pattern will help…
 Like large folders full of large files.
 This is typically just the one or two largest entities, not all of them.
How would you decide how big or how small to make each artifact in the lineage, in terms of the amount of transformations taking place inside the artifact? In my case they would only be shared with 2-3 other users.
For instance I could go all out and have every step that would previously take place in a query editor result in a new link in the data lineage chain, but that would probably be overkill.
I agree that “one step per dataflow” would be overkill, but beyond that the answer is largely “it depends.”
The approach I generally take is to break the end-to-end data preparation down into blocks that look like this:
Staging – getting the source data into the system (in this case dataflow, but could be data mart, data warehouse, data lake, etc.) with zero or minimal transformations
Cleansing – correcting known data quality and format problems from the staged data
Transformation 1 – getting the cleansed data into the shape required for intended downstream purposes
Enrichment – adding data from other sources, which have ideally already gone through steps 1 through 3
Transformation 2 – getting the cleansed and enriched data into the shape required for analysis, typically as dimensions and facts
The final step may also be performed in the queries that are used to create the final tabular model when creating a dataset in Power BI Desktop. If a given dimension is likely to be used in multiple datasets, implement it as a dataflow entity. If it isn’t, implement it as a table in your dataset.
These guidelines tend to create a moderate number of easily maintainable entities, but they’re obviously the bare minimum – take what works for you, and discard the rest.
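As a concrete (and entirely hypothetical) illustration of the cleansing block, here's what a cleansing entity might look like when it builds on a staged entity. The #"Stg Artists" entity and the ArtistName column are made-up names; in a real solution the staged entity would typically live in its own staging dataflow and be referenced as a linked entity.

```
// Hypothetical "Artists Cleansed" entity that corrects known data quality problems
// in the staged data, without doing any reshaping – that comes in a later block.
let
    Source = #"Stg Artists",
    // Trim whitespace and standardize casing on the known problem column.
    Trimmed = Table.TransformColumns(Source, {{"ArtistName", Text.Trim, type text}}),
    Cased = Table.TransformColumns(Trimmed, {{"ArtistName", Text.Proper, type text}}),
    // Drop rows that can't be used downstream.
    NoBlanks = Table.SelectRows(Cased, each [ArtistName] <> null and [ArtistName] <> "")
in
    NoBlanks
```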
I feel like I’m dating myself with this link, but I definitely recommend looking at the Kimball Group’s techniques for data warehousing and BI: resources link. Ralph Kimball and his amazing team know more about this stuff than I will ever forget (or something like that) and there’s a huge volume of guidance available. Do yourself a favor and check it out.
 I assume there are newer resources out there, but when I was your age it was the Kimball Method or the… synonym for highway that rhymes with method.
Power BI dataflows have included capabilities for data lineage since they were introduced in preview way back in 2018. The design of dataflows, where each entity is defined by the Power Query that provides its data, enables a simple and easy view into its data lineage. The query is the authoritative statement on where the entity’s data comes from, and how it is transformed.
But what about everything else in a workspace? What about datasets, and reports, and dashboards? What about them?
Power BI has your back.
Late last month the Power BI team released a new preview capability that lets users view workspace content in a single end-to-end lineage view, in addition to the familiar list view.
Once the lineage view is selected, all workspace contents – data sources, dataflows, datasets, dashboards, and reports – are displayed, along with the relationships between them. Here’s a big-picture view of a workspace I’ve been working in lately:
There’s a lot to unpack here, so I’ll break down what feels to me like the important parts:
The primary data source is a set of text files in folders. The text files are produced by various web scraping processes, and each has a different format and contents.
The secondary data source is a set of reference and lookup data stored in Excel workbooks in SharePoint Online. These workbooks contain manually curated data that is used to cleanse, standardize, and/or enrich the data from the primary source.
The primary data is staged with minimal transformation in a "raw" dataflow. This data is then progressively processed by a series of downstream dataflows, including mashing it up with the secondary data from Excel and reshaping it into facts and dimensions.
There is one dataset based on the fact and dimension entities, and a report based on this dataset. There's a second dataset that includes data quality metrics from entities in multiple dataflows, and a report based on this dataset. And there are two dashboards, one that includes only visuals for data quality metrics, and one that presents the main data along with a few tiles from the quality report.
That overview is so simplified as to be worthless from a technical understanding perspective, but it's still a wall of text. Who wants to read that?
For a real-world workspace that implements a production BI application, there is likely to be more complexity, and less well defined boundaries between objects. How do you document the contents of a complex workspace, and the relationships between those components? How do you understand them well enough to identify and solve problems?
That’s where the lineage view comes in.
Let’s begin by looking at the data sources.
For data sources that use a gateway, I can easily see the gateway name. For other data sources I can see the data source location. We’re off to a good start, because I have a single place to look to see where my data is coming from.
Next, let’s look at the dataflows.
In addition to being able to see the dataflows and the dependencies between them, you can click on any dataflow to see the entities it contains, and can jump directly to edit the dataflow from this view.
This part of workspace lineage isn’t completely new – this is essentially what you could do with dataflows already. But now you can do it with datasets, reports, and dashboards as well.
Selecting a dataset shows me the tables it contains, and selecting a dashboard or report takes me directly to the visualization. But the real power of this view comes from the relationships between objects. The relationships are where data lineage comes to the fore.
The two primary questions asked in the context of data lineage are the upstream question – "where does this data come from?" – and the downstream question – "where is this data used?"
The first question is often asked in the context of "why am I not seeing what I expect to see?" and the resulting investigation looks at upstream logic and data sources to identify the root cause of problems.
The second question is often asked in the context of “what might break if I change this?” and the resulting investigation looks at downstream objects and processes.
The lineage view has a simple way to answer both questions: just click on the “double arrow” icon and the view will change to highlight all upstream and downstream objects. In a single click you can see where the data comes from, and where the data is used. Click again, and the view toggles back to the default view.
There’s more to lineage view than this, including support for shared and certified datasets, but this should be enough to get you excited. Be sure to check out the preview documentation as you check out the feature!
 This is a pet project that may one day turn into a viable demo, assuming work and life let me devote a little more time to it…
 Different, annoying, and difficult to clean up.
 For example, the source web site allows any user to contribute, and although the contribution process is moderated there is no enforcement of content or quality. One artist may be credited for "guitar" on one album, "guitars" on another, "lead guitar" on a third. This sounds pretty simple until you take into account that there were close to 50,000 different "artist roles" in the raw source data that needed to be standardized down to a few hundred values in the final data model.
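For what it's worth, the standardization itself is mostly a merge against the curated lookup data mentioned above. This is a hypothetical sketch – the entity, column, and mapping-table names are placeholders – but it shows the shape of the transformation: left-join the raw role text to a two-column mapping table and fall back to the raw value when no standardized role exists yet.

```
// Hypothetical "Credits Cleansed" entity: standardize free-text artist roles
// ("guitar", "guitars", "lead guitar", ...) using a manually curated mapping table.
let
    RawCredits = #"Stg Credits",        // staged credits with a free-text ArtistRole column
    RoleMap = #"Artist Role Mapping",   // curated table with RawRole and StandardRole columns
    // Left outer join so unmapped roles are kept and can be reviewed later.
    Merged = Table.NestedJoin(RawCredits, {"ArtistRole"}, RoleMap, {"RawRole"}, "Map", JoinKind.LeftOuter),
    Expanded = Table.ExpandTableColumn(Merged, "Map", {"StandardRole"}),
    // Use the standardized role where one exists, otherwise keep the raw value.
    Standardized = Table.AddColumn(Expanded, "ArtistRoleStandard",
        each if [StandardRole] <> null then [StandardRole] else [ArtistRole], type text),
    Cleaned = Table.RemoveColumns(Standardized, {"StandardRole"})
in
    Cleaned
```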