INTERIOR: pan over cubicles of happy, productive office workers
CLOSE-UP: office worker at desk
NARRATOR: Is that Susie I see, giving Power Platform dataflows a try?
SUSIE: That’s right! With dataflows I can have all of the data I need, right where I need it!!
NARRATOR: Dataflows. They’re not just for Power BI anymore.
OK, you may not remember that orange juice ad campaign from the late 1970s and early 80s[1], but I’ve had it stuck in my head since I started working on this post and video. I couldn’t figure out how to work it into the video itself, so here it is in written form.
Anyway, with that awkward moment behind us, you probably want to watch the video. Here it is:
As the video discusses, Power Apps now has a dataflows capability that is a natural complement to Power BI dataflows. Power Platform dataflows have been generally available since November 2019, after being in preview since the summer.
Power Platform dataflows use Power Query Online – and the same set of connectors, gateways, and transformation capabilities as Power BI dataflows. But there are a few key differences that are worth emphasizing.
Power Platform dataflows can load data into the Common Data Service, either into the standard Common Data Model entities used by Dynamics 365 apps, or into custom entities used by custom Power Apps. This is important – it makes dataflows behave more like a traditional ETL tool such as SSIS data flows, in that at the end of the dataflow creation process you can map the columns in your queries to the columns in these existing tables[2].
Power Platform dataflows can load data into ADLSg2 for analytical scenarios, but Power Apps doesn’t have the same concept of “built-in storage” that Power BI does. That means if you want to use Power Platform dataflows to create CDM folders, you must configure your Power Apps environment to use an ADLSg2 resource in your Azure subscription.
The “link to data lake” feature in Power Apps feels to me like a better integration experience than what’s currently available in Power BI. In Power Apps you define the link at the environment level, not the tenant level – this provides more flexibility, and enables non-tenant admins[3] to configure and use the data lake integration.
The first time you create a Power Platform dataflow and select the “analytical entities” option, you’ll be prompted – and required – to link the Power Apps environment to an Azure Data Lake Storage resource. You’ll need an Azure subscription to use, but the process itself is pretty straightforward.
I can’t wait to hear what you think of this new capability. Please let me know in the comments or via Twitter.
See you in the new year!
[1] I just realized that this was 40 years ago. Were you even born yet?
[2] CDS entities aren’t tables by the strictest definition, but it’s close enough for our purposes today.
[3] I honestly don’t know enough about Power Apps security to go into too much depth on this point, but I am not a Power Apps admin and I was able to create a trial environment and link it to my own ADLSg2 resource in my own Azure subscription without affecting other users.
The enhanced compute engine in Power BI dataflows has been in preview since June. It’s not really new, and I’ve posted about it before. But I still keep hearing questions about it, so I thought it might make sense to record a video[1].
This video.
I won’t go into too much more depth here – just watch the video, and if you want more details check out one of these existing posts:
[1] Also, I’m behind on my video schedule – this was a motivating factor as well. November was an unexpectedly busy month[2], and between work, life, and not really having the video editing skills I need to keep to a schedule… Yeah.
At this point I’ve said “Power BI dataflows enable reuse” enough times that I feel like a broken record[1]. What does this phrase actually mean, and how can you take advantage of dataflows to enable reuse in your Power BI applications?
This Power BIte video is a bit longer than its predecessors, and part of this is because it covers both the problem and the solution.
The problem is that self-service BI applications often start out as one-off efforts, but don’t stay that way. At least in theory, if the problem solved by the application was widespread and well understood, there would be an existing solution already developed and maintained by IT, and business users wouldn’t need to develop their own solutions.
Successful applications have a tendency to grow. For self-service BI, this could mean that more and more functionality gets added to the application, or it could mean that someone copies the relevant portions of the application and uses them as the starting point for a new, different-but-related, application.
Once this happens, there is a natural and gradual process of drift[2] that occurs, as each branch of the application tree grows in its own direction. A predictable consequence of this drift in Power BI applications is that query definitions that start off as common will gradually become out of sync, meaning that “the same data” in two datasets will actually contain different values.
Moving queries that need to be shared across multiple applications from multiple datasets into a single dataflow is a simple and effective solution to this problem. There’s no dedicated tooling for this solution in Power BI today, but the steps are still simple and straightforward.
P.S. This is the first Power BIte video recorded in my home office. After struggling unsuccessfully to get decent audio quality in my office at work, I’m trying out a new environment and some new tools. I know there’s still work to be done, but hopefully this is a step in the right direction. As always, I’d love to know what you think…
[1] For my younger readers, this phrase is a reference to when Spotify used to be called “records” and the most common service outage symptom was a repeat of the audio buffer until the user performed a hard reset of the client application. True story.
[2] Is there a better term for this? I feel like there should be an existing body of knowledge that I could reference, but my searching did not yield any promising results. The fact that “Logical Drift” is the name of a band probably isn’t helping.
This week’s Power BIte is the fourth and final entry in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.
When creating a dataflow by attaching an external CDM folder, the dataflow will have the following characteristics:
| Attribute | Value |
| --- | --- |
| Data ingress path | Ingress via Azure Data Factory, Databricks, or whatever Azure service or app has created the CDM folder |
| Data location | Data stored in ADLSg2 in the CDM folder created by the data ingress process |
| Data refresh | The data is refreshed based on the execution schedule and properties of the data ingress process, not by any setting in Power BI |
The key to this scenario is the CDM folder storage format. CDM folders provide a simple and open way to persist data in a data lake. Because CDM folders are implemented using CSV data files and JSON metadata, any application can read from and write to CDM folders. This includes multiple Azure services that have libraries for reading and writing CDM folders, as well as third-party data tools like Informatica that have implemented their own CDM folder connectors.
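To make that “any application” point concrete, here is a minimal sketch of reading a CDM folder from Python. Everything specific in it is an assumption for illustration: the folder path, the entity name, and the detail that partition CSVs carry no header row (column names come from the entity’s attribute metadata). Treat it as a starting point, not an official SDK.

```python
import json
import pandas as pd

# Hypothetical local copy of a CDM folder written by a dataflow (or any other tool).
# In ADLSg2 the partition locations are full URLs, so you'd also need storage credentials.
CDM_FOLDER = "./WideWorldImporters-Orders"

# model.json is the metadata "front door" of every CDM folder.
with open(f"{CDM_FOLDER}/model.json", encoding="utf-8") as f:
    model = json.load(f)

print(f"CDM folder: {model['name']}")

# Pick an entity and read its data partitions.
entity_name = "Order"  # hypothetical entity name
entity = next(e for e in model["entities"] if e["name"] == entity_name)

# Column names live in the metadata, not in the CSV files themselves.
columns = [a["name"] for a in entity["attributes"]]

# Each partition points at a CSV file; concatenate them to get the full entity.
frames = [
    pd.read_csv(p["location"], header=None, names=columns)
    for p in entity.get("partitions", [])
]
orders = pd.concat(frames, ignore_index=True)
print(orders.head())
```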
CDM folders enable scenarios like this one, which is implemented in a sample and tutorial published on GitHub by the Azure data team:
- Create a Power BI dataflow by ingesting order data from the Wide World Importers sample database and save it as a CDM folder
- Use an Azure Databricks notebook that prepares and cleanses the data in the CDM folder, and then writes the updated data to a new CDM folder in ADLS Gen2
- Attach the CDM folder created by Databricks as an external dataflow in Power BI[2]
- Use Azure Machine Learning to train and publish a model using data from the CDM folder
- Use an Azure Data Factory pipeline to load data from the CDM folder into staging tables in Azure SQL Data Warehouse and then invoke stored procedures that transform the data into a dimensional model
- Use Azure Data Factory to orchestrate the overall process and monitor execution
That’s it for this mini-series!
If all this information still doesn’t make sense, now is the time to ask questions.
[1] New videos every Monday morning!
[2] I added this bullet to the list because it fits in with the rest of the post – the other bullets are copied from the sample description.
This week’s Power BIte is the third in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.
When creating a dataflow by importing a model.json file previously exported from Power BI, the dataflow will have the following characteristics:
| Attribute | Value |
| --- | --- |
| Data ingress path | Ingress via the mashup engine hosted in the Power BI service |
| Data location | Data stored in the CDM folder defined for the newly created dataflow |
| Data refresh | The dataflow is refreshed based on the schedule and policies defined in the workspace |
Let’s look at the dataflow’s model.json metadata to see some of the details.
At the top of the file we can see the dataflow name on line 2… and that’s pretty much all that’s important here. The rest of the model.json file will exactly match what was exported from the Power BI portal, and will look a lot like this or this. Boom.
For a little more detail (and more pictures, in case you don’t want to watch a four minute video) check out this post from last month, when this capability was introduced.
If this information doesn’t make sense yet, please hold on. We still have one more incoming Power BIte in this series, and then we’ll have the big picture view.
I guarantee[3] it will make as much sense as anything on this blog.
[1] New videos every Monday morning![2]
[2] Did you notice that I just copied the previous post and made some small edits to it? That seemed very appropriate given the topic…
This week’s Power BIte is the second in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.
When creating a dataflow using linked and computed entities, the final dataflow will have the following characteristics:
| Attribute | Value |
| --- | --- |
| Data ingress path | Ingress via the mashup engine hosted in the Power BI service, using source data that is also managed by the Power BI service, taking advantage of data locality |
| Data location | Data for computed entities is stored in the CDM folder defined for the dataflow. Data for linked entities remains in the source dataflow and is not moved or copied |
| Data refresh | The dataflow is refreshed based on the schedule and policies defined in the workspace |
Let’s look at the dataflow’s model.json metadata to see some of the details.
At the top of the file we can see the mashup definition, including the query names and load settings on lines 11 through 35 and the Power Query code for all of the entities on line 37. This will look awfully familiar from the last Power BIte post.
Things start to get interesting and different when we look at the entity definitions:
On line 80 we can see that the Product entity is defined as a ReferenceEntity, which is how the CDM metadata format describes what Power BI calls linked entities. Rather than having its attribute metadata defined in the current dataflow’s model.json file, it instead identifies the source entity it references, and the CDM folder in which that source entity is stored, similar to what we saw in the last example. Each modelId value for a linked entity references an id value in the referenceModels section, as we’ll see below.
The Customers with Addresses entity, defined starting on line 93, is the computed entity built in the video demo. This entity is a LocalEntity, meaning that its data is stored in the current CDM folder, and its metadata includes both its location and its full list of attributes.
The end of the model.json file highlights the rest of the differences between local and linked entities.
At line 184 we can see the partitions for the Customers with Addresses entity, including the URL for the data file backing this entity. Because the other entities are linked entities, their partitions are not defined in the current model.json.
Instead, the CDM folders where their data does reside are identified in the referenceModels section starting at line 193. The id values in this section match the modelId values of the linked entities earlier in the file, and the location values are the URLs of the model.json files that define the source CDM folders for the linked entities.
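If you’d rather not scroll through the raw JSON, here is a small sketch that walks a dataflow’s model.json and summarizes this local-versus-linked split, resolving each linked entity back to its source CDM folder. The file path is a placeholder, and the property names are the ones described above (modelId, referenceModels), plus the $type field the format uses to distinguish LocalEntity from ReferenceEntity.

```python
import json

# Placeholder path to the model.json of the dataflow's CDM folder.
with open("model.json", encoding="utf-8") as f:
    model = json.load(f)

# Map each referenceModels id to the model.json URL of its source CDM folder.
reference_models = {rm["id"]: rm["location"] for rm in model.get("referenceModels", [])}

for entity in model["entities"]:
    if entity.get("$type") == "ReferenceEntity":
        # Linked entity: no attributes or partitions here, just a pointer
        # to the source entity in another CDM folder.
        source = reference_models.get(entity["modelId"], "<unknown>")
        print(f"{entity['name']}: linked entity, source model at {source}")
    else:
        # Local entity (including computed entities): attribute metadata is
        # defined here, and the data lives in this CDM folder.
        print(f"{entity['name']}: local entity with {len(entity.get('attributes', []))} attributes")
```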
If this information doesn’t make sense yet, please hold on. We’ll have different values for the same attributes for other dataflow creation methods, and then we can compare and contrast them.
I guarantee[2] it will make as much sense as anything on this blog.
This week’s Power BIte is the first in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.
When creating a dataflow by defining new entities in Power Query Online, the final dataflow will have the following characteristics:
| Attribute | Value |
| --- | --- |
| Data ingress path | Ingress via the mashup engine hosted in the Power BI service |
| Data location | Data stored in the CDM folder defined for the newly created dataflow |
| Data refresh | The dataflow is refreshed based on the schedule and policies defined in the workspace |
Let’s look at the dataflow’s model.json metadata to see some of the details.
At the top of the file we can see the mashup definition, including the query names and load settings on lines 11 through 19 and the Power Query code for all of the entities on line 22.
At the bottom of the file we can see information about the refresh and storage.[2] Line 26 identifies the entity as a LocalEntity, which means that the entity’s data is physically stored in the current CDM folder.
Line 30 shows that the entity is fully refreshed rather than incrementally refreshed, and line 31 shows the file name where the entity data is stored. Lines 97 through 99 identify the single partition where the data for the current version of the entity is stored, including the full URI for the data file. If this entity used incremental refresh, there would be multiple partitions to match the incremental refresh policy.
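If you want to poke at the partition metadata yourself, here is a quick sketch that lists each entity’s partitions and the files behind them. The path is a placeholder, and the properties it reads (partitions, name, location) are the ones called out above. An entity with an incremental refresh policy will show multiple partitions, while a fully refreshed entity has just one.

```python
import json

# Placeholder path to the dataflow's model.json.
with open("model.json", encoding="utf-8") as f:
    model = json.load(f)

for entity in model["entities"]:
    partitions = entity.get("partitions", [])
    if not partitions:
        continue  # linked entities have no partitions in this model.json
    # One partition means a full refresh; several usually means incremental
    # refresh, with one partition per refresh period.
    print(f"{entity['name']} ({len(partitions)} partition(s)):")
    for partition in partitions:
        print(f"  {partition['name']}: {partition['location']}")
```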
If this information doesn’t all make sense just yet, please hold on. We’ll have different values for the same attributes for other dataflow creation methods, and then we can compare and contrast them.
I guarantee[3] it will make as much sense as anything on this blog.
[1] New videos every Monday morning!
[2] The same information is also included starting on line 103 for the Promotions entity, but is not pictured here.
Last week I kicked off the new BI Polar YouTube channel with a video and blog post comparing and contrasting Power BI dataflows and datasets. In the days that followed, I continued to hear questions that led me to believe I hadn’t done a great job answering one vital question:
When would I use dataflows, and when would I use shared datasets?
The video focuses on – and demonstrates – sharing and reuse with both dataflows and datasets. It’s short and to the point[1] and focuses on this one question.
The blog post takes a more conceptual view, using Lego bricks as a metaphor for dataflows and datasets and the types of reuse they enable.
If you’ve watched the videos and read the posts and you still have questions, please let me know.
[1] As short and to the point as anything you’re likely to get from me, anyway. Brevity has never been my forte.
This is still one of the most common dataflows questions: what’s the difference between Power BI dataflows and Power BI datasets?
For the last year I have resisted tackling this question head-on. This isn’t because it’s a bad or “dumb” question. Just the opposite – this is a very simple question, and the simpler a question is, the more complex and nuanced the answer is likely to be.
A graphical representation of the answer’s likely complexity.
If you’re a regular reader of this blog, you probably already know the answer, because I’ve answered it already. Sort of. The existing answer is distributed across dozens of posts, and you’ll only have picked it up if you’ve read all of them along the way. But I keep hearing this question, and I keep thinking that there must be a more direct answer I could share.
Here it is, in a single, simple table[1].
| | Power BI dataflows | Power BI datasets |
| --- | --- | --- |
| Implementation | CDM folder | Analysis Services tabular model |
| Storage[2] | CSV files | Vertipaq |
| Metadata[3] | Common Data Model – model.json | BISM |
| Development | Power Query Online | Power Query in Power BI Desktop |
| Primary purpose[4] | Data reuse | Data analysis |
| Reuse[5] | Acts as data source in multiple datasets | Shared datasets across workspaces |
| Scope of reuse[6] | Entity level reuse | Dataset level reuse |
| Mashup with other data sources[7] | Yes | No |
| Used for reporting[8] | Not directly | Yes |
| Reuse outside Power BI[9] | Yes, through ADLSg2 | Yes, through XMLA |
| Data access methods[10] | Import | Import, DirectQuery |
| Connection methods[11] | Import | Live Connection |
| Row-level security | No | Yes |
| Certification and promotion | Not yet | Yes |
What else am I missing?
Please let me know!
Seriously, you should let me know.
Update: I’ve added a few rows to the table after the post was originally published, to incorporate feedback from readers on differences I had missed. Thank you!
Each of the rows in this table could easily be an in-depth topic in and of itself, so if you’re looking at any of them and thinking “that’s not quite right” I might very well agree with you. There’s a lot of context and a lot of nuance here, and we’re trying to sum things up in a word or two… which is kind of the whole point.
Oh yeah, there’s a video too.[12]
I can’t wait to hear what you think!
[1] A simple table with ten footnotes.
[2] The storage aspect of dataflows and datasets is one of the most significant differences between the two. Datasets use the Vertipaq column store to load data into an optimized and highly compressed in-memory representation that is optimized for analysis. Dataflows use text files in folders, which are optimized for interoperability.
[3] The Analysis Services Tabular engine uses the BI Semantic Model (BISM) to represent its metadata. This is a metadata model originally included in SQL Server 2012 Analysis Services, and used by the Tabular engine ever since.
[4] Saying “this is the primary purpose” of any complex tool is fraught with risk, because no matter what you say, there are other valid things that remain unsaid. With this said… the big gap that dataflows close is that of self-service data preparation for the purpose of data sharing and reuse. Power BI has always had self-service data preparation through Power Query, but before dataflows the data that was prepared was “locked” in a dataset, for analysis, and not for sharing or reuse.
[5] Once you have loaded data into dataflows, authorized users can reuse entities from multiple dataflows, and use them as the building blocks for new dataflows or new datasets. Once you have loaded data into a dataset (and published it to the Power BI service) you can enable users to connect to it.
[6] With dataflows, users can pick and choose the entities they want, but a dataset can only be reused as-is.
[7] Dataflow entities can be used as data sources in the same Power BI Desktop file as other data sources, and can serve as part of a mashup or composite model, but a dataset can only be reused as-is.
[8] Although you can obviously use dataflows for reporting, you do so by first importing the data from the dataflow into a dataset.
[9] It’s interesting to point out that using your own organizational ADLSg2 account does not require Power BI Premium, but using the XMLA endpoint to connect to Power BI datasets from non-Power BI clients does.
[10] You can only import data into your dataflow entities, but tables in your dataset can import data or use DirectQuery, and a dataset can use a combination of the two.
[11] You can only import data from a dataflow into a dataset. When connecting to a shared dataset you can only use Live Connections.
[12] I’ve been thinking of making videos to supplement this blog for almost as long as I’ve been hearing the question that inspired this post. Please take a moment to share your thoughts on the video. This is something of a “soft launch” and although I have plans for a few dozen more videos already, your feedback will be a main factor in how the video series evolves.
Power BI dataflows have included capabilities for data lineage since they were introduced in preview way back in 2018. The design of dataflows, where each entity is defined by the Power Query that provides its data, enables a simple and easy view into its data lineage. The query is the authoritative statement on where the entity’s data comes from, and how it is transformed.
But what about everything else in a workspace? What about datasets, and reports, and dashboards? What about them?
Power BI has your back.
Lineage – it’s not just for dataflows anymore!
Late last month the Power BI team released a new preview capability that lets users view workspace content in a single end-to-end lineage view, in addition to the familiar list view.
Once the lineage view is selected, all workspace contents – data sources, dataflows, datasets, dashboards, and reports – are displayed, along with the relationships between them. Here’s a big-picture view of a workspace I’ve been working in lately[1]:
There’s a lot to unpack here, so I’ll break down what feels to me like the important parts:
The primary data source is a set of text files in folders. The text files are produced by various web scraping processes, and each has a different[2] format and contents.
The secondary data source is a set of reference and lookup data stored in Excel workbooks in SharePoint Online. These workbooks contain manually curated data that is used to cleanse, standardize, and/or enrich the data from the primary source[3].
The primary data is staged with minimal transformation in a “raw” dataflow. This data is then progressively processed by a series of downstream dataflows, including mashing up with the secondary data from Excel, and reshaped into facts and dimensions.
There is one dataset based on the fact and dimension entities, and a report based on this dataset. There’s a second dataset that includes data quality metrics from entities in multiple dataflows, and a report based on this dataset. And there are two dashboards, one that includes only visuals for data quality metrics, and one that presents the main data along with a few tiles from the quality report.
That overview is so simplified as to be worthless from a technical understanding perspective, but it’s still a wall of text. Who wants to read that?
For a real-world workspace that implements a production BI application, there is likely to be more complexity, and less well-defined boundaries between objects. How do you document the components of a complex workspace, and the relationships between them? How do you understand them well enough to identify and solve problems?
That’s where the lineage view comes in.
Let’s begin by looking at the data sources.
For data sources that use a gateway, I can easily see the gateway name. For other data sources I can see the data source location. We’re off to a good start, because I have a single place to look to see where my data is coming from.
Next, let’s look at the dataflows.
In addition to being able to see the dataflows and the dependencies between them, you can click on any dataflow to see the entities it contains, and can jump directly to edit the dataflow from this view.
This part of workspace lineage isn’t completely new – this is essentially what you could do with dataflows already. But now you can do it with datasets, reports, and dashboards as well.
Selecting a dataset shows me the tables it contains, and selecting a dashboard or report takes me directly to the visualization. But the real power of this view comes from the relationships between objects. The relationships are where data lineage comes to the fore.
The two primary questions asked in the context of data lineage are the upstream question, “where does this data come from?”, and the downstream question, “where is this data used?”
The first question is often asked in the context of “why am I not seeing what I expect to see?” and the resulting investigation looks at upstream logic and data source to identify the root cause of problems.
The second question is often asked in the context of “what might break if I change this?” and the resulting investigation looks at downstream objects and processes.
The lineage view has a simple way to answer both questions: just click on the “double arrow” icon and the view will change to highlight all upstream and downstream objects. In a single click you can see where the data comes from, and where the data is used. Click again, and the view toggles back to the default view.
There’s more to lineage view than this, including support for shared and certified datasets, but this should be enough to get you excited. Be sure to check out the preview documentation as you check out the feature!
Update: We now have a video to supplement the blog post. Check it out!
Update: The Power BI blog now has the official announcement for this exciting feature. The blog post includes a look at where the lineage team is planning to invest to make this feature even better, and also mentions[5] that all of the information in the lineage view is now available using the Power BI API. If you want to integrate lineage and impact analysis into your own tools, or if you want to build a cross-workspace lineage view, you now have the APIs you need to be successful!
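To give a flavor of what “available using the API” can look like, here is a minimal sketch that walks dataflow-to-dataflow dependencies in a workspace using the Power BI REST API’s Get Dataflows In Group and Get Upstream Dataflows In Group operations. The access token and workspace id are placeholders, and you should double-check property names like objectId against the REST API reference; this is a starting point, not a finished lineage tool.

```python
import requests

# Assumptions: you've already acquired an Azure AD access token for the Power BI API,
# and you know the id of the workspace you want to inspect.
TOKEN = "<access token>"          # placeholder
WORKSPACE_ID = "<workspace id>"   # placeholder
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
BASE = "https://api.powerbi.com/v1.0/myorg"

# List the dataflows in the workspace...
dataflows = requests.get(
    f"{BASE}/groups/{WORKSPACE_ID}/dataflows", headers=HEADERS
).json()["value"]

# ...and for each one, ask which dataflows it depends on.
for dataflow in dataflows:
    upstream = requests.get(
        f"{BASE}/groups/{WORKSPACE_ID}/dataflows/{dataflow['objectId']}/upstreamDataflows",
        headers=HEADERS,
    ).json()
    print(dataflow["name"], "depends on:", upstream.get("value", []))
```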
[1] This is a pet project that may one day turn into a viable demo, assuming work and life let me devote a little more time to it…
[2] Different, annoying, and difficult to clean up.
[3] For example, the source web site allows any user to contribute, and although the contribution process is moderated there is no enforcement of content or quality. One artist may be credited for “guitar” on one album, “guitars” on another, “lead guitar” on a third. This sounds pretty simple until you take into account that there were close to 50,000 different “artist roles” in the raw source data that needed to be standardized down to a few hundred values in the final data model.