Power BI dataflows November updates

If you head on over to the official Power BI blog today you’ll see this announcement, and if you’re like me there will be a few things that immediately jump out at you:

[Screenshot: the dataflows announcement on the Power BI blog]

  1. That’s almost two features per day
  2. Dataflows should have a lower-case D
  3. Miguel’s profile picture is even older than Matthew’s profile picture

All snark aside[1], Miguel and the whole dataflows team have been awfully busy, and it’s great to see their work available to Power BI authors. I won’t attempt to repeat what’s in the announcement, but I will highlight the new capabilities that have me most excited:

  • Support for data profiling in Power Query Online – we’ve had this in Power BI Desktop for a while, but it’s just as important for dataflows as it is for datasets.
  • Better support for files and folders – a lot of the data I play with[2] these days is in folders full of text files, and Power Query Online hasn’t had the best experience for working with this type of data.
  • Better support for query parameters – there are lots of scenarios[3] where parameterized queries make working with dataflows easier, and Power Query Online now makes those parameters easier to define and use.

Do yourself a favor and check out the whole list. Odds are there’s something you’ve been waiting for that will excite you as much as these new capabilities excite me.

And I can’t wait to hear what they are…


[1] No, I don’t believe that’s possible either, but it is nice to see that you’ve been paying attention.

[2] Very little of my actual work involves data prep these days, so I need to find data to play with to avoid getting too bored.

[3] Like this one.

Power BIte: Creating dataflows with linked and computed entities

This week’s Power BIte is the second in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.

When creating a dataflow by defining new entities, the final dataflow will have the following characteristics:

Attribute | Value
Data ingress path | Ingress via the mashup engine hosted in the Power BI service, using source data that is also managed by the Power BI service, taking advantage of locality of data.
Data location | Data for computed entities is stored in the CDM folder defined for the dataflow. Data for linked entities remains in the source dataflow and is not moved or copied.
Data refresh | The dataflow is refreshed based on the schedule and policies defined in the workspace.

Let’s look at the dataflow’s model.json metadata to see some of the details.

[Screenshot: the top of the dataflow’s model.json file]

At the top of the file we can see the mashup definition, including the query names and load settings on lines 11 through 35 and the Power Query code for all of the entities on line 37. This will look awfully familiar from the last Power BIte post.

Things start to get interesting and different when we look at the entity definitions:

[Screenshot: the entity definitions in the model.json file]

On line 80 we can see that the Product entity is defined as a ReferenceEntity, which is how the CDM metadata format describes what Power BI calls linked entities. Rather than having its attribute metadata defined in the current dataflow’s model.json file, it identifies the source entity it references and the CDM folder in which that source entity is stored, similar to what we saw in the last example. Each modelId value for a linked entity references an id value in the referenceModels section, as we’ll see below.

The Customers with Addresses entity, defined starting on line 93, is the computed entity built in the video demo. This entity is a LocalEntity, meaning that its data is stored in the current CDM folder, and its metadata includes both the data location and the full list of attributes.
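If you want to poke at this distinction yourself, here is a minimal Python sketch that separates linked entities from local ones by looking at each entity’s $type value. The sample document is a heavily trimmed, made-up stand-in for a real model.json (the IDs and attribute names are hypothetical), and split_entities is my own helper name, not part of any Power BI tooling.

```python
import json

# Hypothetical, heavily trimmed model.json. IDs and attributes are made up.
SAMPLE_MODEL = json.loads("""
{
  "name": "Sales",
  "entities": [
    {"$type": "ReferenceEntity", "name": "Product",
     "source": "Product", "modelId": "model-a"},
    {"$type": "LocalEntity", "name": "Customers with Addresses",
     "attributes": [{"name": "CustomerID", "dataType": "int64"},
                    {"name": "City", "dataType": "string"}]}
  ]
}
""")


def split_entities(model):
    """Return (linked, local) lists of entity names based on $type."""
    entities = model.get("entities", [])
    linked = [e["name"] for e in entities if e.get("$type") == "ReferenceEntity"]
    local = [e["name"] for e in entities if e.get("$type") == "LocalEntity"]
    return linked, local


linked, local = split_entities(SAMPLE_MODEL)
print("Linked entities:", linked)
print("Local entities:", local)
```

Note that the linked entity carries no attribute list of its own here; that metadata lives in the source dataflow’s model.json.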

The end of the model.json file highlights the rest of the differences between local and linked entities.

[Screenshot: the end of the model.json file]

At line 184 we can see the partitions for the Customers with Addresses entity, including the URL for the data file backing this entity. Because the other entities are linked entities, their partitions are not defined in the current model.json.

Instead, the CDM folders where their data does reside are identified in the referenceModels section starting at line 193. The id values in this section match the modelId values of the linked entities shown above, and the location values are the URLs to the model.json files that define the source CDM folders for the linked entities.
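The modelId-to-id lookup described above can be sketched in a few lines of Python. This is an illustration only: the sample document is made up (the example.invalid URLs and IDs are placeholders), and resolve_data_locations is a hypothetical helper. For a local entity it returns the partition locations; for a linked entity it returns the location of the source model.json, because the partitions themselves are not listed in the current file.

```python
import json

# Hypothetical, trimmed model.json. URLs and IDs are placeholders.
SAMPLE_MODEL = json.loads("""
{
  "entities": [
    {"$type": "ReferenceEntity", "name": "Product",
     "source": "Product", "modelId": "model-a"},
    {"$type": "LocalEntity", "name": "Customers with Addresses",
     "partitions": [
       {"name": "Part001",
        "location": "https://example.invalid/cdm/Customers.csv"}]}
  ],
  "referenceModels": [
    {"id": "model-a",
     "location": "https://example.invalid/source/model.json"}
  ]
}
""")


def resolve_data_locations(model):
    """Map each entity name to the place its data (or its source metadata) lives."""
    # Index the referenceModels section by id so modelId lookups are direct.
    ref_by_id = {r["id"]: r["location"] for r in model.get("referenceModels", [])}
    locations = {}
    for entity in model.get("entities", []):
        if entity.get("$type") == "ReferenceEntity":
            # Linked entity: follow modelId to the source CDM folder's model.json.
            locations[entity["name"]] = [ref_by_id.get(entity["modelId"])]
        else:
            # Local entity: the data files are listed as partitions right here.
            locations[entity["name"]] = [
                p["location"] for p in entity.get("partitions", [])
            ]
    return locations


for name, where in resolve_data_locations(SAMPLE_MODEL).items():
    print(name, "->", where)
```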

If this information doesn’t make sense yet, please hold on. We’ll have different values for the same attributes for other dataflow creation methods, and then we can compare and contrast them.

I guarantee[2] it will make as much sense as anything on this blog.


[1] New videos every Monday morning!

[2] Or your money back.

Power BI and ADLSg2 – but not dataflows

Back in July[1] the Power BI team announced the availability of a new connector for Azure Data Lake Storage Gen2.

It's a data lake. Get it?
When Matthew closes his eyes and pictures a data lake, this is what he sees.

In recent weeks I’ve been starting to hear questions that sound like these:

Question: Is this ADLSg2 connector how you get to the data behind dataflows?

Answer: No. Dataflows are how you get to the data behind dataflows.

Question: Is this how I can access dataflows if I don’t use Power BI Premium?

Answer: No. Dataflows are not a Premium-only feature.

Question: Can I use the ADLSg2 connector to work with CDM folder data?

Answer: Yes, but why would you?

If your data is already in CDM folders, using the ADLSg2 connector simply adds effort to consuming it in Power BI. You’ll be working with raw, untyped text files instead of working with strongly typed entities.

If your ADLSg2 data is already in CDM folders, strongly consider attaching the CDM folder as a dataflow. This means less up-front work for you, and less ongoing work for the users who need to get insights from the data.
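To give a feel for the extra effort, here is a small Python sketch of what “raw, untyped text files” means in practice: someone has to apply the attribute names and types from model.json to every row by hand. Everything here is an assumption for illustration: the attribute list, the sample rows, the idea that the data file has no header row, and the tiny CONVERTERS table, which covers only three data types.

```python
import csv
import io

# Hypothetical attribute metadata, shaped like an entity's attribute list.
ATTRIBUTES = [
    {"name": "CustomerID", "dataType": "int64"},
    {"name": "City", "dataType": "string"},
    {"name": "CreditLimit", "dataType": "double"},
]

# Hypothetical raw data file contents: no header row, everything is text.
RAW_CSV = "1,Seattle,2500.50\n2,Boston,1000\n"

# Deliberately incomplete mapping from metadata type names to Python casts.
CONVERTERS = {"int64": int, "double": float, "string": str}


def read_typed_rows(raw_text, attributes):
    """Apply attribute names and types from the metadata to untyped CSV rows."""
    names = [a["name"] for a in attributes]
    casts = [CONVERTERS.get(a["dataType"], str) for a in attributes]
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        rows.append({n: cast(v) for n, cast, v in zip(names, casts, record)})
    return rows


for row in read_typed_rows(RAW_CSV, ATTRIBUTES):
    print(row)
```

Attaching the CDM folder as a dataflow hands this work to Power BI, so users see typed entities instead of text columns.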

Question: Why do we need an ADLSg2 connector if we have dataflows?

Answer: Now that is a good question!

Power BI dataflows store their data in CDM folder format, and they can be configured to store those CDM folders in your organization’s ADLSg2 data lake. In addition to this, you can attach a CDM folder in ADLSg2 as an external dataflow, making its data available to Power BI users even though the data ingress is taking place through another tool like Azure Data Factory.

But ADLSg2 is much, much more[2] than a repository for dataflows or CDM folders. ADLSg2 supports all sorts of file and blob data, not just CDM folders. And sometimes you need to work with that data in Power BI.

The ADLSg2 connector exists for these scenarios, when your data is not stored in CDM folders. With this connector, users in Power BI Desktop can connect to ADLSg2 resources and work with the files they contain, similar to the existing HDFS and Folder connectors.


[1] Yes, this is another catch-up post that has been waiting to be finished. No, I do not have any reason to believe that 2020 will be any more forgiving than 2019 has been.

[2] I could have linked to the product documentation or the official product page, but I believe that Melissa’s blog does the best job summing up ADLSg2 in a single post.

Power BIte: Creating dataflows with Power Query Online

This week’s Power BIte is the first in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.

When creating a dataflow by defining new entities in Power Query Online, the final dataflow will have the following characteristics:

Attribute | Value
Data ingress path | Ingress via the mashup engine hosted in the Power BI service
Data location | Data stored in the CDM folder defined for the newly created dataflow
Data refresh | The dataflow is refreshed based on the schedule and policies defined in the workspace

Let’s look at the dataflow’s model.json metadata to see some of the details.

[Screenshot: the top of the dataflow’s model.json file]

At the top of the file we can see the mashup definition, including the query names and load settings on lines 11 through 19 and the Power Query code for all of the entities on line 22.

[Screenshot: the entity definition and partitions in the model.json file]

At the bottom of the file we can see information about the refresh and storage.[2] Line 26 identifies the entity as a LocalEntity, which means that the entity’s data is physically stored in the current CDM folder.

Line 30 shows that the entity is fully refreshed rather than incrementally refreshed, and line 31 shows the file name where the entity data is stored. Lines 97 through 99 identify the single partition where the data for the current version of the entity is stored, including the full URI for the data file. If this entity used incremental refresh, there would be multiple partitions to match the incremental refresh policy.
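As a rough sketch of the partition point, the Python below lists the partitions recorded for each local entity and reports whether there is one or several. The sample document is made up: the entity names, partition names, and example.invalid URLs are placeholders, and the three-partition entity is only my guess at what an incrementally refreshed entity might look like. partition_report is a hypothetical helper.

```python
import json

# Hypothetical, trimmed model.json. Names and URLs are placeholders.
SAMPLE_MODEL = json.loads("""
{
  "entities": [
    {"$type": "LocalEntity", "name": "FullRefreshEntity",
     "partitions": [
       {"name": "Part001",
        "location": "https://example.invalid/cdm/Full.csv"}]},
    {"$type": "LocalEntity", "name": "IncrementalEntity",
     "partitions": [
       {"name": "2019-08",
        "location": "https://example.invalid/cdm/Inc-2019-08.csv"},
       {"name": "2019-09",
        "location": "https://example.invalid/cdm/Inc-2019-09.csv"},
       {"name": "2019-10",
        "location": "https://example.invalid/cdm/Inc-2019-10.csv"}]}
  ]
}
""")


def partition_report(model):
    """Map each local entity name to the list of its partition locations."""
    report = {}
    for entity in model.get("entities", []):
        if entity.get("$type") != "LocalEntity":
            continue  # linked entities keep their partitions elsewhere
        report[entity["name"]] = [
            p["location"] for p in entity.get("partitions", [])
        ]
    return report


for name, locations in partition_report(SAMPLE_MODEL).items():
    kind = "single partition" if len(locations) == 1 else f"{len(locations)} partitions"
    print(f"{name}: {kind}")
```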

If this information doesn’t all make sense just yet, please hold on. We’ll have different values for the same attributes for other dataflow creation methods, and then we can compare and contrast them.

I guarantee[3] it will make as much sense as anything on this blog.


[1] New videos every Monday morning!

[2] The same information is also included starting on line 103 for the Promotions entity, but is not pictured here.

[3] Or your money back.

Using and reusing Power BI dataflows

I use this diagram a lot[1]:

[Diagram: Power BI dataflows in an end-to-end BI application]

This diagram neatly summarizes a canonical use case for Power BI dataflows, with source data being ingested and processed as part of an end-to-end BI application. It showcases the Lego-like composition that’s possible with dataflows. But it also has drawbacks – its simplicity omits common scenarios for using and reusing dataflows.

So, let’s look at what’s shown – and at what’s not shown – in my favorite diagram. Let’s look at some of the ways these dataflows and their entities can be used.

  1. Use the final entities as-is: This is the scenario implied by the diagram. The entities in the “Final Business View” dataflow represent a star schema, and are loaded as-is into a dataset.
  2. Use the final entities with modification: The entities in the “Final Business View” dataflow are loaded into a dataset, but with additional transformation or filtering applied in the dataset’s queries.
  3. Use the final entities with mashup: The entities in the “Final Business View” dataflow are loaded into a dataset, but with additional data from other sources added via the dataset’s queries.
  4. Use upstream entities: The entities in other dataflows are loaded into a dataset, likely with transformations and filtering applied, and with data from other sources added via the dataset’s queries.

Please understand that this list is not exhaustive. There are likely dozens of variations on these themes that I have not called out explicitly. Use this list as a starting point and see where dataflows will take you. I’ll keep the diagram simple, but you can build solutions as complex as you need them to be.


[1] This is my diagram. There are many like it, but this one is mine.

 

Power BIte: Sharing and reuse with dataflows and datasets

Last week I kicked off the new BI Polar YouTube channel with a video and blog post comparing and contrasting Power BI dataflows and datasets. In the days that followed, I continued to hear questions that led me to believe I hadn’t done a great job answering one vital question:

When would I use dataflows, and when would I use shared datasets?

Here’s the short answer: [embedded video]

And here’s the long answer: Lego Bricks and the Spectrum of Data Enrichment and Reuse.

The video focuses on – and demonstrates – sharing and reuse with both dataflows and datasets. It’s short and to the point[1] and focuses on this one question.

The blog post takes a more conceptual view, using Lego bricks as a metaphor for dataflows and datasets and the types of reuse they enable.

If you’ve watched the videos and read the posts and you still have questions, please let me know.


[1] As short and to the point as anything you’re likely to get from me, anyway. Brevity has never been my forte.