Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.
Power BI dataflows have included capabilities for data lineage since they were introduced in preview way back in 2018. The design of dataflows, where each entity is defined by the Power Query that provides its data, enables a simple and easy view into its data lineage. The query is the authoritative statement on where the entity’s data comes from, and how it is transformed.
In early 2019 dataflows also introduced a graphical lineage view. Users could now easily see and understand the relationships between all dataflows in a workspace, which made it easier than ever to work with linked and computed entities and to take advantage of the Lego-like capabilities of dataflows as building blocks for BI.
But what about everything else in a workspace? What about datasets, and reports, and dashboards? What about them?
Power BI has your back.
Late last month the Power BI team released a new preview capability that lets users view workspace content in a single end-to-end lineage view, in addition to the familiar list view.
Once the lineage view is selected, all workspace contents – data sources, dataflows, datasets, dashboards, and reports – are displayed, along with the relationships between them. Here’s a big-picture view of a workspace I’ve been working in lately:
There’s a lot to unpack here, so I’ll break down what feels to me like the important parts:
- The primary data source is a set of text files in folders. The text files are produced by various web scraping processes, and each has a different format and contents.
- The secondary data source is a set of reference and lookup data stored in Excel workbooks in SharePoint Online. These workbooks contain manually curated data that is used to cleanse, standardize and/or enrich the data from the primary data.
- The primary data is staged with minimal transformation in a “raw” dataflow. This data is then progressively processed by a series of downstream dataflows, including mashing up with the secondary data from Excel, and reshaped into facts and dimensions.
- There is one dataset based on the fact and dimension entities, and report based on this dataset. There’s a second dataset that includes data quality metrics from entities in multiple dataflows, and a report based on this dataset. And there are two dashboards, one that includes only visuals for data quality metrics, and one that presents the main data along with a few tiles from the quality report.
That overview is simplified enough as to be worthless from a technical understanding perspective, but it’s still a wall of text. Who wants to read that?
For a real-world workspace that implements a production BI application, there is likely to be more complexity, and less well defined boundaries between objects. How do you document the contents of a complex workspace, and the relationships between those components? How do you understand them well enough to identify and solve problems?
That’s where the lineage view comes in.
Let’s begin by looking at the data sources.
For data sources that use a gateway, I can easily see the gateway name. For other data sources I can see the data source location. We’re off to a good start, because I have a single place to look to see where my data is coming from.
Next, let’s look at the dataflows.
In addition to being able to see the dataflows and the dependencies between them, you can click on any dataflow to see the entities it contains, and can jump directly to edit the dataflow from this view.
This part of workspace lineage isn’t completely new – this is essentially what you could do with dataflows already. But now you can do it with datasets, reports, and dashboards as well.
Selecting a dataset shows me the tables it contains, and selecting a dashboard or report takes me directly to the visualization. But the real power of this view comes from the relationships between objects. The relationships are where data lineage comes to the fore.
The two primary questions asked in the context of data linage are around upstream “where does this data come from?” and downstream “where is this data used?” lineage scenarios.
The first question is often asked in the context of “why am I not seeing what I expect to see?” and the resulting investigation looks at upstream logic and data source to identify the root cause of problems.
The second question is often asked in the context of “what might break if I change this?” and the resulting investigation looks at downstream objects and processes.
The lineage view has a simple way to answer both questions: just click on the “double arrow” icon and the view will change to highlight all upstream and downstream objects. In a single click you can see where the data comes from, and where the data is used. Click again, and the view toggles back to the default view.
There’s more to lineage view than this, including support for shared and certified datasets, but this should be enough to get you excited. Be sure to check out the preview documentation as you check out the feature!
Update: We now have a video to supplement the blog post. Check it out!
Update: The Power BI blog now has the official announcement for this exciting feature. The blog post includes a look at where the lineage team is planning to invest to make this feature even better, and that all of the information in the lineage view is now available using the Power BI API. If you want to integrate lineage and impact analysis into your own tools, or if you want to build a cross-workspace lineage view, you now have the APIs you need to be successful!
 This is a pet project that may one day turn into a viable demo, assuming work and life let me devote a little more time to it…
 Different, annoying, and difficult to clean up.
 For example, the source web site allows any user to contribute, and although the contribution process is moderated there is no enforcement of content or quality. One artist may be credited for “guitar” on one album, “guitars” on another, “lead guitar” on a third. This sounds pretty simple until you take into account there were close to 50,000 different “artist roles” in the raw source data, that needed to be standardized down to a few hundred values in the final data model.
 I sure hope this gif works!
 This is the most exciting part for me.
7 thoughts on “Power BI workspace lineage”
How would you decide how big or how small to make each artifact in the lineage, in terms of the amount of transformations taking place inside the artifact? In my case they would only be shared with 2-3 other users.
For instance I could go all out and have every step that would previously take place in a query editor result in a new link in the data lineage chain, but that would probably be overkill.
Thanks for the question!
I agree that “one step per dataflow” would be overkill, but beyond that the answer is largely “it depends.”
The approach I generally take is to break the end to end data preparation down into blocks that look like this:
1. Staging – getting the source data into the system (in this case dataflow, but could be data mart, data warehouse, data lake, etc.) with zero or minimal transformations
2. Cleansing – correcting known data quality and format problems from the staged data
3. Transformation 1 – getting the cleansed data into the shape required for intended downstream purposes
4. Enrichment – adding data from other sources, which have ideally already gone through steps 1 through 3
5. Transformation 2 – getting the cleansed and enriched data into the shape required for analysis, typically as dimensions and facts
These guidelines tend to create a moderate number of easily maintainable entities.
I feel like I’m dating myself with this link, but I definitely recommend looking at the Kimball Group’s techniques for data warehousing and BI: https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/. Ralph Kimball and his amazing team know more about this stuff than I will ever forget (or something like that) and there’s a huge volume of guidance available.
This feels like it has become a blog post…
What an amazing feature! Data lineage is so hot right now.
I just tested it and noted a few odd things:
(a) the view still renders unused data sources, e.g., if a dataflow against data source A is completely switched to data source B, then A will still appear. Also, we’re unable to remove credentials for the “Data source credentials” screen – I assume these two challenges are related; and
(b) the view does not render artifacts outside the current workspace.
Pingback: Quick Tip: Factoring your dataflow entities – BI Polar
Pingback: Dataflows in Power BI: Overview Part 9 – Lineage and impact analysis – BI Polar
All my data lineage tracker are on one excel file. Is there anyway to show all lineage connection?
Currently the best way to get all lineage is to use this API: https://powerbi.microsoft.com/en-us/blog/announcing-new-admin-apis-and-service-principal-authentication-to-make-for-better-tenant-metadata-scanning/