Power BIte: Power Platform dataflows

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

INTERIOR: pan over cubicles of happy, productive office workers

CLOSE-UP: office worker at desk

NARRATOR: Is that Susie I see, giving Power Platform dataflows a try?

SUSIE: That’s right! With dataflows I can have all of the data I need, right where I need it!!

NARRATOR: Dataflows. They’re not just for Power BI anymore.

OK, you may not remember that orange juice ad campaign from the late 1970s and early 80s[1], but I’ve had it stuck in my head since I started working on this post and video. I couldn’t figure out how to work it into the video itself, so here it is in written form.

Anyway, with that awkward moment behind us, you probably want to watch the video. Here is it:

As the video discusses, Power Apps now have a dataflows capability that is a natural complement to Power BI dataflows. Power Platform dataflows have been generally available since November 2019, and have been in preview since summer.

Power Platform dataflows use Power Query Online – and the same set of connectors, gateways, and transformation capabilities as Power BI dataflows. But there are a few key differences that are worth emphasizing.

Power Platform dataflows can load data into the Common Data Service, either into the standard Common Data Model entities used by Dynamics 365 apps, or into custom entities used by custom Power Apps. This is important – this makes dataflows more like a traditional ETL tool like SSIS data flows in that at the end of the dataflow creation process you can map the columns in your queries to the columns in these existing tables[2].

Power Platform dataflows can load data into ADLSg2 for analytical scenarios, but Power Apps doesn’t have the same concept of “built-in storage” that Power BI does. That means if you want to use Power Platform dataflows to create CDM folders, you must configure your Power Apps environment to use an ADLSg2 resource in your Azure subscription.

The “link to data lake” feature in Power Apps feels to me like a better integration experience than what’s currently available in Power BI. In Power Apps you define the link at the environment level, not the tenant level – this provides more flexibility, and enables non-tenant admins[3] to configure and use the data lake integration.

2019-12-23-11-07-30-619--ApplicationFrameHost

The first time you create a Power Platform dataflow and select the “analytical entities” option, you’ll be prompted – and required – to link the Power Apps environment to an Azure Data Lake Storage resource. You’ll need to have an Azure subscription to use, but the process itself if pretty straightforward.

2019-12-23-11-19-10-160--ApplicationFrameHost.png

I can’t wait to hear what you think of this new capability. Please let me know in the comments or via Twitter.

See you in the new year!


[1] I just realized that this was 40 years ago. Were you even born yet?

[2] CDS entities aren’t tables by the strictest definition, but it’s close enough for our purposes today.

[3] I honestly don’t know enough about Power Apps security to go into too much depth on this point, but I am not a Power Apps admin and I was able to create a trial environment and link it to my own ADLSg2 resource in my own Azure subscription without affecting other users.

Power BIte: Dataflows enhanced compute engine

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

The enhanced compute engine in Power BI dataflows has been in preview since June. It’s not really new, and I’ve posted about it before. But I still keep hearing questions about it, so I thought it might make sense to record a video[1].

This video.


I won’t go into too much more depth here – just watch the video, and if you want more details check out one of these existing posts:

Now to get back on schedule with that next video…


[1] Also, I’m behind on my video schedule – this was a motivating factor as well. November was an unexpectedly busy month[2], and between work, life, and not really having the video editing skills I need to keep to a schedule… Yeah.

[2] And I expected it to be very, very busy.

Power BIte: Turning datasets into dataflows

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

At this point I’ve said “Power BI dataflows enable reuse” enough times that I feel like a broken record[1]. What does this phrase actually mean, and how can you take advantage of dataflows to enable reuse in your Power BI applications?

This Power BIte video is a bit longer than its predecessors, and part of this is because it covers both the problem and the solution.

The problem is that self-service BI applications often start out as one-off efforts, but don’t stay that way. At least in theory, if the problem solved by the application was widespread and well understood, there would be an existing solution already developed and maintained by IT, and business users wouldn’t need to develop their own solutions.

Successful applications have a tendency to grow. For self-service BI, this could mean that more and more functionality gets added to the application, or it could mean that someone copies the relevant portions of the application and uses them as the starting point for a new, different-but-related, application.

Once this happens, there is a natural and gradual process of drift[2] that occurs, as each branch of the application tree grows in its own direction. A predictable consequence of this drift in Power BI applications is that query definitions that start off as common will gradually become out of sync, meaning that “the same data” in two datasets will actually contain different values.

Moving queries that need to be shared across multiple applications from multiple datasets into a single dataflow is a simple and effective solution to this problem. There’s no dedicated tooling for this solution in Power BI today, but the steps are still simple and straightforward.

P.S. This is the first Power BIte video recorded in my home office. After struggling unsuccessfully to get decent audio quality in my office at work, I’m trying out a new environment and some new tools. I know there’s still work to be done, but hopefully this is a step in the right direction. As always, I’d love to know what you think…


 

[1] For my younger readers, this phrase is a reference to when Spotify used to be called “records” and the most common service outage symptom was a repeat of the audio buffer until the user performed a hard reset of the client application. True story.

[2] Is there a better term for this? I feel like there should be an existing body of knowledge that I could reference, but my searching did not yield any promising results. The fact that “Logical Drift” is the name of a band probably isn’t helping.

Power BIte: Creating dataflows by attaching external CDM folders

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

This week’s Power BIte is the fourth and final entry in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.

When creating a dataflow by attaching an external CDM folder, the dataflow will have the following characteristics:

Attribute Value
Data ingress path Ingress via Azure Data Factory, Databricks, or whatever Azure service or app has created the CDM folder.
Data location Data stored in ADLSg2 in the CDM folder created by the data ingress process.
Data refresh The data is refreshed based on the execution schedule and properties of the data ingress process, not by any setting in Power BI.

The key to this scenario is the CDM folder storage format. CDM folders provide a simple and open way to persist data in a data lake. Because CDM folders are implemented using CSV data files and JSON metadata, any application can read from and write to CDM folders. This includes multiple Azure services that have libraries for reading and writing CDM folders and 3rd party data tools like Informatica that have implemented their own CDM folder connectors.

CDM folders enable scenarios like this one, which is implemented in a sample and tutorial published on GitHub by the Azure data team:

  • Create a Power BI dataflow by ingesting order data from the Wide World Importers sample database and save it as a CDM folder
  • Use an Azure Databricks notebook that prepares and cleanses the data in the CDM folder, and then writes the updated data to a new CDM folder in ADLS Gen2
  • Attach the CDM folder created by Databricks as an external dataflow in Power BI[2]
  • Use Azure Machine Learning to train and publish a model using data from the CDM folder
  • Use an Azure Data Factory pipeline to load data from the CDM folder into staging tables in Azure SQL Data Warehouse and then invoke stored procedures that transform the data into a dimensional model
  • Use Azure Data Factory to orchestrate the overall process and monitor execution

That’s it for this mini-series!

If all this information still doesn’t make sense yet, now is the time to ask questions.


[1] New videos every Monday morning!

[2] I added this bullet to the list because it fits in with the rest of the post – the other bullets are copied from the sample description.

Power BIte: Creating dataflows by importing model.json

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

This week’s Power BIte is the third in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.

When creating a dataflow by importing a model.json file previously exported from Power BI, the dataflow will have the following characteristics:

Attribute Value
Data ingress path Ingress via the mashup engine hosted in the Power BI service.
Data location Data stored in the CDM folder defined for the newly created dataflow
Data refresh The dataflow is refreshed based on the schedule and policies defined in the workspace.

Let’s look at the dataflow’s model.json metadata to see some of the details.

2019-11-06-14-50-53-981--Code

At the top of the file we can see the dataflow name on line 2…

…and that’s pretty much all that’s important here. The rest of the model.json file will exactly match what was exported from the Power BI portal, and will look a lot like this or this. Boom.

For a little more detail (and more pictures, in case you don’t want to watch a four minute video) check out this post from last month, when this capability was introduced.

If this information doesn’t make sense yet, please hold on. We still have one more incoming Power BIte in this series, and then we’ll have the big picture view.

I guarantee[3] it will make as much sense as anything on this blog.


[1] New videos every Monday morning![2]

[2] Did you notice that I just copied the previous post and made some small edits to it? That seemed very appropriate given the topic…

[3] Or your money back.

Power BIte: Creating dataflows with linked and computed entities

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

This week’s Power BIte is the second in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.

When creating a dataflow by defining new entities, the final dataflow will have the following characteristics:

Attribute Value
Data ingress path Ingress via the mashup engine hosted in the Power BI service, using source data that is also managed by the Power BI service, taking advantage of locality of data.
Data location Data stored in the CDM folder defined for the dataflow for computed entities. Data for linked entities remains in source dataflow and is not moved or copied.
Data refresh The dataflow is refreshed based on the schedule and policies defined in the workspace.

Let’s look at the dataflow’s model.json metadata to see some of the details.

2019-11-04-07-00-30-025--Code

At the top of the file we can see the mashup definition, including the query names and load settings on lines 11 through 35 and the Power Query code for all of the entities on line 37. This will look awfully familiar from the last Power BIte post.

Things start to get interesting and different when we look at the entity definitions:

2019-11-04-07-04-23-519--Code

On line 80 we can see that the Product entity is defined as a ReferenceEntity, which is how the CDM metadata format describes what Power BI calles linked entities. Rather than having its attribute metadata defined in the current dataflow’s model.json file, it instead identifies the source entity it references, and the CDM folder in which the source entity is stored, similar to what we saw in the last example. Each modelId value for a linked entity references the id value in the referenceModels section as we’ll see below.

The Customers with Addresses entity, defined starting on line 93, is the calculated entity built in the video demo. This entity is a LocalEntity, meaning that its data is stored in the current CDM folder, and its metadata includes both the location, and its full list of attributes.

The end of the model.json file highlights the rest of the differences between local and linked entities.

2019-11-04-07-16-41-335--Code

At line 184 we can see the partitions for the Customers with Addresses entity, including the URL for the data file backing this entity. Because the other entities are linked entities, their partitions are not defined in the current model.json.

Instead, the CDM folders where their data does reside are identified in the referenceModels section starting at line 193. The id values in this section match the modelId values for the model.json file, above, and the location values are the URLs to the model.json files that define the source CDM folders for the linked entities.

If this information doesn’t make sense yet, please hold on. We’ll have different values for the same attributes for other dataflow creation methods, and then we can compare and contrast them.

I guarantee[2] it will make as much sense as anything on this blog.


[1] New videos every Monday morning!

[2] Or your money back.

Power BIte: Creating dataflows with Power Query Online

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

This week’s Power BIte is the first in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.

When creating a dataflow by defining new entities in Power Query Online, the final dataflow will have the following characteristics:

Attribute Value
Data ingress path Ingress via the mashup engine hosted in the Power BI service
Data location Data stored in the CDM folder defined for the newly created dataflow
Data refresh The dataflow is refreshed based on the schedule and policies defined in the workspace

Let’s look at the dataflow’s model.json metadata to see some of the details.

2019-10-27-10-13-08-592--Code

At the top of the file we can see the mashup definition, including the query names and load settings on lines 11 through 19 and the Power Query code for all of the entities on line 22.

2019-10-27-10-23-37-698--Code.png

At the bottom of the file we can see information about the refresh and storage.[2] Line 26 identifies the entity as a LocalEntity, which means that the entity’s data is physically stored in the current CDM folder.

Line 30 shows that the entity is fully refreshed rather than incrementally refreshed, and line 31 shows the file name where the entity data is stored. Lines 97 through 99 identify the single partition where the data for the current version of the entity is stored, including the full URI for the data file. If this entity used incremental refresh, there would be multiple partitions to match the incremental refresh policy.

If this information doesn’t all make sense just yet, please hold on. We’ll have different values for the same attributes for other dataflow creation methods, and then we can compare and contrast them.

I guarantee[3] it will make as much sense as anything on this blog.


[1] New videos every Monday morning!

[2] The same information is also included starting on line 103 for the Promotions entity,  but is not pictured here.

[3] Or your money back.

Power BIte: Sharing and reuse with dataflows and datasets

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

Last week I kicked off the new BI Polar YouTube channel with a video and blog post comparing and contrasting Power BI dataflows and datasets. In the days that followed, I continued to hear questions that led me to believe I hadn’t done a great job answering one vital question:

When would I use dataflows, and when would I use shared datasets?

Here’s the short answer:

And here’s the long answer: Lego Bricks and the Spectrum of Data Enrichment and Reuse.

The video focuses on – and demonstrates – sharing and reuse with both dataflows and datasets. It’s short and to the point[1] and focuses on this one question.

The blog post takes a more conceptual view, using Lego bricks as a metaphor for dataflows and datasets and the types of reuse they enable.

If you’ve watched the videos and read the posts and you still have questions, please let me know.


[1] As short and to the point as anything you’re likely to get from me, anyway. Brevity has never been my forte.

Power BIte: Dataflows vs. datasets

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

This is still one of the most common dataflows questions: what’s the difference between Power BI dataflows and Power BI datasets?

For the last year I have resisted tackling this question head-on. This isn’t because it’s a bad or “dumb” question. Just the opposite – this is a very simple question, and the simpler a question is, the more complex and nuanced the answer is likely to be.

See how complex this is?
A graphical representation of the answer’s likely complexity.

If you’re a regular reader of this blog, you probably already know the answer, because I’ve answered it already. Sort of. The existing answer is distributed across dozens of posts, and if you’ve read all of them and picked up the answer along the way. But I keep hearing this question, and I keep thinking that there must be a more direct answer I could share.

Here it is, in a single, simple table[1].

Power BI dataflows Power BI datasets
Implementation CDM folder Analysis Services tabular model
Storage[2] CSV files Vertipaq
Metadata[3] Common Data Model – model.json BISM
Development Power Query Online Power Query in Power BI Desktop
Primary purpose[4] Data reuse Data analysis
Reuse[5] Acts as data source in multiple datasets Shared datasets across workspaces
Scope of reuse[6] Entity level reuse Dataset level reuse
Mashup with other data sources[7] Yes No
Used for reporting[8] Not directly Yes
Reuse outside Power BI[9] Yes, through ADLSg2 Yes, through XMLA
Data access methods[10] Import Import, DirectQuery
Connection methods[11] Import Live Connection
Row-level security No Yes
Certification and promotion Not yet Yes
What else am I missing? Please let me know! Seriously, you should let me know.

Update: I’ve added a few rows to the table after the post was originally published, to incorporate feedback from readers on differences I had missed. Thank you!

Each of the rows in this table could easily be an in-depth topic in and of itself, so if you’re looking at any of them and thinking “that’s not quite right” I might very well agree with you. There’s a lot of context and a lot of nuance here, and we’re trying to sum things up in a word or two… which is kind of the whole point.

Oh yeah, there’s a video too.[12]

I can’t wait to hear what you think!


[1] A simple table with ten footnotes.

[2] The storage aspect of dataflows and datasets is one of the most significant differences between the two. Datasets use the Vertipaq column store to load data into an optimized and highly compressed in-memory representation that is optimized for analysis. Dataflows use text files in folders, which are optimized for interoperability.

[3] The Analysis Services Tabular engine uses the BI Semantic Model (BISM) to represent its metadata. This is a metadata model originally included in SQL Server 2012 Analysis Services, and used by the Tabular engine ever since.

[4] Saying “this is the primary purpose” of any complex tool is fraught with risk, because no matter what you say, there are other valid things that remain unsaid. With this said… the big gap that dataflows close is that of self-service data preparation for the purpose of data sharing and reuse. Power BI has always had self-service data preparation through Power Query, but before dataflows the data that was prepared was “locked” in a dataset, for analysis, and not for sharing or reuse.

[5] Once you have loaded data into dataflows, authorized users can reuse entities from multiple dataflows, and use them as the building blocks for new dataflows or new datasets. Once you have loaded data into a dataset (and published it to the Power BI service) you can enable users to connect to it.

[6] With dataflows, users can pick and choose the entities they want, but a dataset can only be reused as-is.

[7] Dataflow entities can be used as data sources in the same Power BI Desktop file as other data sources, and can serve as part of a mashup or composite model, but a dataset can only be reused as-is.

[8] Although you can obviously use dataflows for reporting, you do so by first importing the data from the dataflow into a dataset.

[9] It’s interesting to point out that using your own organizational ADLSg2 account does not require Power BI Premium, but using the XMLA endpoint to connect to Power BI datasets from non-Power BI clients does.

[10] You can only import data into your dataflow entities, but tables in your dataset can import data or use DirectQuery, and a dataset can use a combination of the two.

[11] You can only import data from a dataflow into a dataset. When connecting to a shared dataset you can only use Live Connections.

[12] I’ve been thinking of making videos to supplement this blog for almost as long as I’ve been hearing the question that inspired this post. Please take a moment to share your thoughts on the video. This is something of a “soft launch” and although I have plans for a few dozen more videos already, your feedback will be a main factor in how the video series evolves.