You probably already know that Power BI dataflows store their data in CDM folders. But what does this actually mean?
This is a quick post to share information that I hope will answer some of the most common questions that I hear from time to time, and which I discuss when I present on Power BI dataflows integration with Azure. I don’t believe any of the information in this post is new or unique, but I do believe it is delivered in a more targeted manner that might help.
Point #1: CDM is a metadata system
The Common Data Model is a metadata system that simplifies data management and application development by unifying data into a known form and applying structural and semantic consistency across multiple apps and deployments. If you’re coming from a SQL Server background, it may help to think about CDM as the “system tables” for data that’s stored in multiple locations and formats. This analogy doesn’t hold up to particularly close inspection, but it’s a decent place to start.
Point #2: CDM includes standard entity schemas
In addition to being a metadata system, the Common Data Model includes a set of standardized, extensible data schemas that Microsoft and its partners have published. This collection of predefined schemas includes entities, attributes, semantic metadata, and relationships. The schemas represent commonly used concepts and activities, such as Account and Campaign, to simplify the creation, aggregation, and analysis of data.
Point #3: CDM folders are data storage that use CDM metadata
A CDM folder is a folder in a data lake that conforms to specific, well-defined, and standardized metadata structures and self-describing data. These folders facilitate metadata discovery and interoperability between data producers and data consumers.
CDM folders store metadata in a model.json file; this is what makes them self-describing. This metadata conforms to the CDM metadata format, and can be read by any client application or code that knows how to work with CDM.
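To make this concrete, here’s what reading that metadata can look like. This is just a sketch – the model.json content below is a trimmed, invented example, and real files contain more required metadata (schema references, partitions, annotations, and so on) – but it shows how any code that understands the format can discover what a CDM folder contains:

```python
import json

# A trimmed, hypothetical model.json payload - real files contain more
# metadata than this (schema references, partitions, annotations, ...).
model_json = """
{
  "name": "SalesDataflow",
  "version": "1.0",
  "entities": [
    {
      "$type": "LocalEntity",
      "name": "Customer",
      "attributes": [
        {"name": "CustomerId", "dataType": "int64"},
        {"name": "CustomerName", "dataType": "string"}
      ]
    },
    {
      "$type": "LocalEntity",
      "name": "Order",
      "attributes": [
        {"name": "OrderId", "dataType": "int64"},
        {"name": "OrderDate", "dataType": "dateTime"}
      ]
    }
  ]
}
"""

def list_entities(model_text: str) -> dict:
    """Return a mapping of entity name -> list of attribute names."""
    model = json.loads(model_text)
    return {
        entity["name"]: [attr["name"] for attr in entity.get("attributes", [])]
        for entity in model.get("entities", [])
    }

print(list_entities(model_json))
```

This is the whole point of “self-describing”: the consumer never needs to ask the producer what’s in the folder.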
Point #4: You don’t need to use any standard entities
The most common misconception I hear about CDM and CDM folders is that you only use them when you’re storing “standard data.” This is not correct. The data in a CDM entity may map to a standard entity schema, but for 99% of the entities I have built or used, this is not the case. There is nothing in CDM or CDM folders that requires you to use a standard schema.
I hope this helps – please let me know if you have questions!
 Check out the documentation for CDM and CDM folders here and here, and here for more detail. You’ll probably notice that some chunks of text in this post were simply copied from that documentation.
Power BI dataflows have included capabilities for data lineage since they were introduced in preview way back in 2018. The design of dataflows, where each entity is defined by the Power Query that provides its data, enables a simple and easy view into its data lineage. The query is the authoritative statement on where the entity’s data comes from, and how it is transformed.
But what about everything else in a workspace? What about datasets, and reports, and dashboards? What about them?
Power BI has your back.
Late last month the Power BI team released a new preview capability that lets users view workspace content in a single end-to-end lineage view, in addition to the familiar list view.
Once the lineage view is selected, all workspace contents – data sources, dataflows, datasets, dashboards, and reports – are displayed, along with the relationships between them. Here’s a big-picture view of a workspace I’ve been working in lately:
There’s a lot to unpack here, so I’ll break down what feels to me like the important parts:
The primary data source is a set of text files in folders. The text files are produced by various web scraping processes, and each has a different format and contents.
The secondary data source is a set of reference and lookup data stored in Excel workbooks in SharePoint Online. These workbooks contain manually curated data that is used to cleanse, standardize, and/or enrich the primary data.
The primary data is staged with minimal transformation in a “raw” dataflow. This data is then progressively processed by a series of downstream dataflows, including mashing it up with the secondary data from Excel and reshaping it into facts and dimensions.
There is one dataset based on the fact and dimension entities, and a report based on this dataset. There’s a second dataset that includes data quality metrics from entities in multiple dataflows, and a report based on that dataset. And there are two dashboards: one that includes only visuals for data quality metrics, and one that presents the main data along with a few tiles from the quality report.
That overview is so simplified as to be nearly worthless for technical understanding, but it’s still a wall of text. Who wants to read that?
For a real-world workspace that implements a production BI application, there is likely to be more complexity, and less well-defined boundaries between objects. How do you document the contents of a complex workspace, and the relationships between those objects? How do you understand them well enough to identify and solve problems?
That’s where the lineage view comes in.
Let’s begin by looking at the data sources.
For data sources that use a gateway, I can easily see the gateway name. For other data sources I can see the data source location. We’re off to a good start, because I have a single place to look to see where my data is coming from.
Next, let’s look at the dataflows.
In addition to being able to see the dataflows and the dependencies between them, you can click on any dataflow to see the entities it contains, and can jump directly to edit the dataflow from this view.
This part of workspace lineage isn’t completely new – this is essentially what you could do with dataflows already. But now you can do it with datasets, reports, and dashboards as well.
Selecting a dataset shows me the tables it contains, and selecting a dashboard or report takes me directly to the visualization. But the real power of this view comes from the relationships between objects. The relationships are where data lineage comes to the fore.
The two primary questions asked in the context of data lineage are the upstream question (“where does this data come from?”) and the downstream question (“where is this data used?”).
The first question is often asked in the context of “why am I not seeing what I expect to see?” and the resulting investigation looks at upstream logic and data source to identify the root cause of problems.
The second question is often asked in the context of “what might break if I change this?” and the resulting investigation looks at downstream objects and processes.
The lineage view has a simple way to answer both questions: just click on the “double arrow” icon and the view will change to highlight all upstream and downstream objects. In a single click you can see where the data comes from, and where the data is used. Click again, and the view toggles back to the default view.
There’s more to lineage view than this, including support for shared and certified datasets, but this should be enough to get you excited. Be sure to check out the preview documentation as you check out the feature!
Update: We now have a video to supplement the blog post. Check it out!
Update: The Power BI blog now has the official announcement for this exciting feature. The blog post includes a look at where the lineage team is planning to invest to make this feature even better, and notes that all of the information in the lineage view is now available using the Power BI API. If you want to integrate lineage and impact analysis into your own tools, or if you want to build a cross-workspace lineage view, you now have the APIs you need to be successful!
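If you do build your own lineage or impact analysis tooling on top of those APIs, the heart of it is a simple graph traversal. Here’s a minimal sketch (the artifact names and edges are invented) that answers both of the classic lineage questions for a workspace modeled as a set of “source feeds target” edges:

```python
from collections import defaultdict, deque

# Hypothetical workspace contents: each edge means "source feeds target".
edges = [
    ("TextFiles", "RawDataflow"),
    ("RawDataflow", "CleanDataflow"),
    ("CleanDataflow", "SalesDataset"),
    ("SalesDataset", "SalesReport"),
    ("SalesReport", "SalesDashboard"),
]

downstream = defaultdict(set)  # artifact -> artifacts it feeds
upstream = defaultdict(set)    # artifact -> artifacts that feed it
for src, dst in edges:
    downstream[src].add(dst)
    upstream[dst].add(src)

def reachable(start: str, graph: dict) -> set:
    """Breadth-first search: every artifact reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# "Where does this data come from?" and "what might break if I change this?"
print(reachable("SalesDataset", upstream))     # all upstream sources
print(reachable("CleanDataflow", downstream))  # all downstream dependents
```

The same traversal over both directions of the graph is what the “double arrow” highlighting in the lineage view is doing for you visually.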
 This is a pet project that may one day turn into a viable demo, assuming work and life let me devote a little more time to it…
 Different, annoying, and difficult to clean up.
 For example, the source web site allows any user to contribute, and although the contribution process is moderated there is no enforcement of content or quality. One artist may be credited for “guitar” on one album, “guitars” on another, “lead guitar” on a third. This sounds pretty simple until you take into account that there were close to 50,000 different “artist roles” in the raw source data that needed to be standardized down to a few hundred values in the final data model.
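 In code terms, that standardization boils down to light normalization followed by a curated lookup table. A toy version (the mapping here is invented, not the real curated data) looks like this:

```python
# Hypothetical excerpt of the curated mapping: raw credited role -> standard role.
# The real mapping covered tens of thousands of raw values.
ROLE_MAP = {
    "guitar": "Guitar",
    "guitars": "Guitar",
    "lead guitar": "Guitar",
    "vox": "Vocals",
    "vocals": "Vocals",
}

def standardize_role(raw: str, default: str = "Unknown") -> str:
    """Normalize casing and whitespace, then look up the curated standard value."""
    return ROLE_MAP.get(raw.strip().lower(), default)

print(standardize_role("  Lead Guitar "))  # Guitar
print(standardize_role("theremin"))        # Unknown
```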
The reaction to this recent post on lineage and Power BI dataflows highlighted how important lineage is for Power BI. As that post shows, there are lineage user experiences in place for dataflows today, and experiences coming for all artifacts in a workspace.
Even with these new experiences, there will still be times when you want or need to use the Power BI API to get insight into all the workspaces in the Power BI tenant for which you are an administrator. This ability isn’t new, but some recent updates to the Power BI admin API have made it easier.
GetGroupsAsAdmin is an API available to Power BI administrators that returns the workspaces for the Power BI tenant. With the information it returns, an admin can then call additional APIs like GetDatasetsInGroupAsAdmin and GetReportsInGroupAsAdmin to list their contents – and better understand and manage the tenant. This is a relatively straightforward pattern… but you do need to call up to four APIs for each workspace.
Now that GetGroupsAsAdmin supports $expand, you can get the full list of users, reports, dashboards, datasets, and dataflows in the workspace without needing to call any additional APIs. Pretty sweet.
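As a sketch, the request looks something like this. I’m only constructing the URL here – actually calling it requires an Azure AD bearer token for a Power BI administrator, which I’ve left out:

```python
from urllib.parse import urlencode

ADMIN_API = "https://api.powerbi.com/v1.0/myorg/admin/groups"

def build_groups_url(
    top: int = 100,
    expand: tuple = ("users", "reports", "dashboards", "datasets", "dataflows"),
) -> str:
    """Build a GetGroupsAsAdmin request URL that expands workspace contents."""
    query = urlencode({"$top": top, "$expand": ",".join(expand)})
    return f"{ADMIN_API}?{query}"

print(build_groups_url())

# To execute it with the requests library (token acquisition omitted):
#   resp = requests.get(build_groups_url(),
#                       headers={"Authorization": f"Bearer {token}"})
#   workspaces = resp.json()["value"]
```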
With the information that’s returned, you can get a view of the contents of your Power BI tenant and start examining the relationships between the various objects, and now it’s simpler than ever. The API returns the workspace contents as JSON, which is easy enough to ingest and visualize using Power BI Desktop.
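For example, a response shaped like the one below (trimmed and invented for illustration) flattens into one row per artifact – exactly the kind of table that’s trivial to load and visualize in Power BI Desktop:

```python
# Hypothetical, trimmed response body from GetGroupsAsAdmin with $expand.
workspaces = [
    {
        "name": "Sales Analytics",
        "datasets": [{"name": "SalesDataset"}],
        "reports": [{"name": "SalesReport"}],
        "dataflows": [{"name": "RawDataflow"}, {"name": "CleanDataflow"}],
    }
]

def flatten(workspaces: list) -> list:
    """One (workspace, artifact type, artifact name) row per artifact."""
    rows = []
    for ws in workspaces:
        for kind in ("datasets", "reports", "dashboards", "dataflows"):
            for item in ws.get(kind, []):
                rows.append((ws["name"], kind, item["name"]))
    return rows

for row in flatten(workspaces):
    print(row)
```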
The Power BI team is continuing to add more features and experiences focused on governance and lineage, but the nature of oversight and governance is that most companies have specific tools and processes that require customization to one degree or another. Having a simple programmatic way to get workspace contents from your Power BI tenant will continue to be valuable even as these new experiences are delivered.
 As of when I’m writing, this lineage post has received more “first week views” than any other dataflows post I’ve made this year.
 When I wrote the first draft of this post in early July, dataflows were not yet included in the results of this API; they were added a few weeks later. I decided to wait to complete the post, and here it is, almost October. I really should know better by now.
When you build a Power BI dataflow, lineage is built in. It’s not an added feature – it’s a fundamental aspect of how dataflows work. Instead of having an ETL process that performs the data movement and compute in one place and the data storage in another, dataflow entities are defined by a Power Query “M” query. The ETL logic in the query and the data storage are defined as a single unit. The result is that not only does a dataflow contain the data, it contains the full lineage about where the data came from and how it was transformed.
This lineage information is what makes possible the diagram view that’s available today. As shown above, the diagram view provides a workspace-level lineage view of all dataflows in the workspace, their relationships, and the data sources from which they extract data – including data sources that are dataflows in other workspaces.
It’s worth emphasizing that this diagram view isn’t an editor per se – it’s not where you define lineage relationships; it’s where you can view, explore, and understand the lineage relationships that are automatically created when you build your dataflows. At the same time, the diagram view does let you edit and manage the dataflows in your workspace in the same way as the list view.
The dataflows API lets developers build similar experiences and automated processes. Specifically, the Get Dataflow API returns the model.json file that contains all the dataflow metadata, including the M scripts for the queries that define the dataflow entities.
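Here’s a rough sketch of that round trip. The route is the documented Get Dataflow endpoint, but the IDs and the trimmed model.json excerpt below are invented, and I’m assuming the M mashup document travels in the “pbi:mashup” annotation:

```python
import json

BASE = "https://api.powerbi.com/v1.0/myorg"

def dataflow_url(group_id: str, dataflow_id: str) -> str:
    """Build the Get Dataflow route; the response body is the model.json file."""
    return f"{BASE}/groups/{group_id}/dataflows/{dataflow_id}"

# Hypothetical, trimmed model.json response. I'm assuming the M document
# is carried in a "pbi:mashup" annotation alongside the entity metadata.
model = json.loads("""
{
  "name": "RawDataflow",
  "entities": [{"$type": "LocalEntity", "name": "Customer"}],
  "pbi:mashup": {
    "document": "section Section1; shared Customer = let Source = Csv.Document(file) in Source;"
  }
}
""")

entity_names = [e["name"] for e in model["entities"]]
mashup = model.get("pbi:mashup", {}).get("document", "")
print(entity_names)
print(mashup)
```

With the entity names and the M document in hand, a tool can reconstruct exactly the “where from, and how transformed” story for every entity.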
Of course, dataflows are just one part of a complete lineage and impact analysis story in Power BI. Ideally a Power BI workspace administrator would be able to see all data sources, dataflows, datasets, reports, and dashboards in a single view, and to easily navigate and manage the relationships and dependencies between them. Something like this:
At the Microsoft Business Applications Summit in June, Microsoft shared plans for lineage across all artifacts in a workspace. This represents a continuation of the current functionality available in dataflows and of the recently-announced shared and certified datasets.
With this upcoming functionality, users working with reports and dashboards can easily see where the data comes from, and how and where it has been transformed. I’m hesitant to call this the “holy grail” of business intelligence, but I’ve heard it described this way enough times that I wouldn’t argue too loudly if someone did. If I had a dollar for every meeting I’ve been in where the agenda was derailed by an argument about whether the data was right… In any event, this is a very common problem, with few good solutions.
How are you using the lineage capabilities in dataflows today? Do you have processes that rely on the dataflows lineage UI or API? I’d love to hear about what you’re doing now and what you’re planning – let me know.
 Such as having a batch job, stored procedure, or SSIS Data Flow that loads into a SQL Server table or HDFS folder.
Power BI is constantly evolving – there’s a new version of Power BI Desktop every month, and the Power BI service is updated every week. Many of the new capabilities in Power BI represent gradual refinements, but some are significant enough to make you rethink how you and your organization use Power BI.
The new app navigation capabilities introduced last month to Power BI probably fall into the former category. But even though they’re a refinement of what the Power BI service has always had, they can still make your apps significantly better. Specifically, these new capabilities can be used to add documentation and training materials directly to the app experience, while keeping that content in its current location.
It’s surprisingly simple and easy to do. When publishing your app, select the Navigation tab.
On the Navigation tab, select New and then Link.
And finally, provide the URL of the content, the name for the tab in the app navigation, and select where you want the link to open.
When you publish the app, there will be a tab that contains the content from the link you specified in addition to tabs for the reports and dashboards from the workspace.
This approach is very simple, but it’s also very important for at least two reasons:
Having self-documenting apps will help ensure that the people who need to use them will be able to do so. End-user training for Power BI apps, reports, and dashboards is easy for BI developers and authors to overlook, but its importance cannot be overstated. To reach the users who need it the most, your training and documentation needs to be discoverable where those users need it – and that is often in the app itself.
Your content can stay in the system or application where it belongs. Power BI is great at many things, but it’s not a content management system. If you create training content in your Power BI reports, you’re probably not going to be as happy as you could be. And you deserve to be happy.
If you’re interested in a different take on the same feature, check out this post from the fine cows over at FourMoo. I think this new capability is exciting more for what you can do with it and less for how it works, so I skimmed over a lot of the details. FourMoo goes into more of the how-to detail, so you may want to check it out.
Are all metadata attributes equally applicable to a given usage scenario?
What are the risks of using the data in ways that are not approved by the policy defined in the metadata?
What mechanisms exist for the enforcement of that policy? How are these mechanisms different from the mechanisms in earlier examples? What do these differences imply about the risks inherent in misuse?
In each example, how does the metadata format relate to the importance of the metadata, and the data usage scenarios covered by the metadata attributes?
Does a difference in metadata format convey information about the metadata itself?