Quick Tip: Working with dataflow-created CDM folders in ADLSg2

If you’re using your own organizational Azure Data Lake Storage Gen2 account for Power BI dataflows, you can use the CDM folders that Power BI creates as a data source for other efforts, including data science with tools like Azure Machine Learning and Azure Databricks.

Image by Arek Socha from Pixabay
A world of possibilities appears before you…

This capability has been in preview since early this year, so it’s not really new, but there are enough pieces involved that it may not be obvious how to begin – and I continue to see enough questions about this topic that another blog post seemed warranted.

The key point is that because dataflows are writing data to ADLSg2 in CDM folder format, Azure Machine Learning and Azure Databricks can both read the data using the metadata in the model.json file.

This json file serves as the “endpoint” for the data in the CDM folder; it’s a single resource that you can connect to, and not have to worry about the complexities in the various subfolders and files that the CDM folder contains.

This tutorial is probably the best place to start if you want to know more[1]. It includes directions and sample code for creating and consuming CDM folders from a variety of different Azure services – and Power BI dataflows. If you’re one of the people who has recently asked about this, please go through this tutorial as your next step!


[1] It’s the best resource I’m aware of  – if you find a better one, please let me know!

Power BIte: Creating dataflows by importing model.json

This week’s Power BIte is the third in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.

When creating a dataflow by importing a model.json file previously exported from Power BI, the dataflow will have the following characteristics:

Attribute Value
Data ingress path Ingress via the mashup engine hosted in the Power BI service.
Data location Data stored in the CDM folder defined for the newly created dataflow
Data refresh The dataflow is refreshed based on the schedule and policies defined in the workspace.

Let’s look at the dataflow’s model.json metadata to see some of the details.

2019-11-06-14-50-53-981--Code

At the top of the file we can see the dataflow name on line 2…

…and that’s pretty much all that’s important here. The rest of the model.json file will exactly match what was exported from the Power BI portal, and will look a lot like this or this. Boom.

For a little more detail (and more pictures, in case you don’t want to watch a four minute video) check out this post from last month, when this capability was introduced.

If this information doesn’t make sense yet, please hold on. We still have one more incoming Power BIte in this series, and then we’ll have the big picture view.

I guarantee[3] it will make as much sense as anything on this blog.


[1] New videos every Monday morning![2]

[2] Did you notice that I just copied the previous post and made some small edits to it? That seemed very appropriate given the topic…

[3] Or your money back.

Power BI dataflows November updates

If you head on over to the official Power BI blog today you’ll see this announcement, and if you’re like me there will be a few things that immediately jump out at you:

2019-11-04-15-08-37-362--msedge

  1. That’s almost two features per day
  2. Dataflows should have a lower-case D
  3. Miguel’s profile picture is even older than Matthew’s profile picture

All snark aside[1], Miguel and the whole dataflows team have been awfully busy, and it’s great to see their work available to Power BI authors. I won’t attempt to repeat what’s in the announcement, but I will highlight the new capabilities that have me most excited:

  • Support for data profiling in Power Query Online – we’ve had this in Power BI Desktop for a while, but it’s just as important for dataflows as it is for datasets.
  • Better support for files and folders – a lot of the data I play with[2] these days is in folders full of text files, and Power Query Online hasn’t had the best experience for working with this type of data.
  • Better support for query parameters – there are lots of scenarios[3] where having parameterized queries makes working with dataflows easier, and now Power Query Online makes it easier to work with query parameters.

Do yourself a favor and check out the whole list. Odds are there’s something you’ve been waiting for that will excite you as much as these new capabilities excite me.

And I can’t wait to hear what they are…


[1] No, I don’t believe that’s possible either, but it is nice to see that you’ve been paying attention.

[2] Very little of my actual work involves data prep these days, so I need to find data to play with to avoid getting too bored.

[3] Like this one.

Power BIte: Creating dataflows with linked and computed entities

This week’s Power BIte is the second in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.

When creating a dataflow by defining new entities, the final dataflow will have the following characteristics:

Attribute Value
Data ingress path Ingress via the mashup engine hosted in the Power BI service, using source data that is also managed by the Power BI service, taking advantage of locality of data.
Data location Data stored in the CDM folder defined for the dataflow for computed entities. Data for linked entities remains in source dataflow and is not moved or copied.
Data refresh The dataflow is refreshed based on the schedule and policies defined in the workspace.

Let’s look at the dataflow’s model.json metadata to see some of the details.

2019-11-04-07-00-30-025--Code

At the top of the file we can see the mashup definition, including the query names and load settings on lines 11 through 35 and the Power Query code for all of the entities on line 37. This will look awfully familiar from the last Power BIte post.

Things start to get interesting and different when we look at the entity definitions:

2019-11-04-07-04-23-519--Code

On line 80 we can see that the Product entity is defined as a ReferenceEntity, which is how the CDM metadata format describes what Power BI calles linked entities. Rather than having its attribute metadata defined in the current dataflow’s model.json file, it instead identifies the source entity it references, and the CDM folder in which the source entity is stored, similar to what we saw in the last example. Each modelId value for a linked entity references the id value in the referenceModels section as we’ll see below.

The Customers with Addresses entity, defined starting on line 93, is the calculated entity built in the video demo. This entity is a LocalEntity, meaning that its data is stored in the current CDM folder, and its metadata includes both the location, and its full list of attributes.

The end of the model.json file highlights the rest of the differences between local and linked entities.

2019-11-04-07-16-41-335--Code

At line 184 we can see the partitions for the Customers with Addresses entity, including the URL for the data file backing this entity. Because the other entities are linked entities, their partitions are not defined in the current model.json.

Instead, the CDM folders where their data does reside are identified in the referenceModels section starting at line 193. The id values in this section match the modelId values for the model.json file, above, and the location values are the URLs to the model.json files that define the source CDM folders for the linked entities.

If this information doesn’t make sense yet, please hold on. We’ll have different values for the same attributes for other dataflow creation methods, and then we can compare and contrast them.

I guarantee[2] it will make as much sense as anything on this blog.


[1] New videos every Monday morning!

[2] Or your money back.

Power BI and ADLSg2 – but not dataflows

Back in July[1] the Power BI team announced the availability of a new connector for Azure Data Lake Storage Gen2.

It's a data lake. Get it?
When Matthew closes his eyes and pictures a data lake, this is what he sees.

In recent weeks I’ve been starting to hear questions that sound like these:

Question: Is this ADLSg2 connector how you get to the data behind dataflows?

Answer: No. Dataflows are how you get to the data behind dataflows.

Question: Is this how I can access dataflows if I don’t use Power BI Premium?

Answer: No. Dataflows are not a Premium-only feature.

Question: Can I use the ADLSg2 connector to work with CDM folder data?

Answer: Yes, but why would you?

If your data is already in CDM folders, using the ADLSg2 connector simply adds effort to consuming it in Power BI. You’ll be working with raw, untyped text files instead of working with strongly typed entities.

If your ADLSg2 data is already in CDM folders, strongly consider attaching the CDM folder as a dataflow. This means less up-front work for you, and less ongoing work for the users who need to get insights from the data.

Question: Why do we need an ADLSg2 connector if we have dataflows?

Answer: Now that is a good question!

Power BI dataflows store their data in CDM folder format, and they can be configured to store those CDM folders in your organization’s ADLSg2 data lake. In addition to this, you can attach a CDM folder in ADLSg2 as an external dataflow, making its data available to Power BI users even though the data ingress is taking place through another tool like Azure Data Factory.

But ADLSg2 is much, much more[2] than a repository for dataflows or CDM folders. ADLSg2 supports all sorts of file and blob data, not just CDM folders. And sometimes you need to work with that data in Power BI.

The ADLSg2 connector exists for these scenarios, when your data is not stored in CDM folders. With this connector, users in Power BI Desktop can connect to ADLSg2 resources and work with the files they contain, similar to the existing HDFS and Folder connectors.


[1] Yes, this is another catch-up post that has been waiting to be finished. No, I do not have any reason to believe that 2020 will be any more forgiving than 2019 has been.

[2] I could have linked to the product documentation or the official product page, but I believe that Melissa‘s blog does the best job summing up ADLSg2 in a single post.

Power BIte: Creating dataflows with Power Query Online

This week’s Power BIte is the first in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.

When creating a dataflow by defining new entities in Power Query Online, the final dataflow will have the following characteristics:

Attribute Value
Data ingress path Ingress via the mashup engine hosted in the Power BI service
Data location Data stored in the CDM folder defined for the newly created dataflow
Data refresh The dataflow is refreshed based on the schedule and policies defined in the workspace

Let’s look at the dataflow’s model.json metadata to see some of the details.

2019-10-27-10-13-08-592--Code

At the top of the file we can see the mashup definition, including the query names and load settings on lines 11 through 19 and the Power Query code for all of the entities on line 22.

2019-10-27-10-23-37-698--Code.png

At the bottom of the file we can see information about the refresh and storage.[2] Line 26 identifies the entity as a LocalEntity, which means that the entity’s data is physically stored in the current CDM folder.

Line 30 shows that the entity is fully refreshed rather than incrementally refreshed, and line 31 shows the file name where the entity data is stored. Lines 97 through 99 identify the single partition where the data for the current version of the entity is stored, including the full URI for the data file. If this entity used incremental refresh, there would be multiple partitions to match the incremental refresh policy.

If this information doesn’t all make sense just yet, please hold on. We’ll have different values for the same attributes for other dataflow creation methods, and then we can compare and contrast them.

I guarantee[3] it will make as much sense as anything on this blog.


[1] New videos every Monday morning!

[2] The same information is also included starting on line 103 for the Promotions entity,  but is not pictured here.

[3] Or your money back.

It all comes down to culture

I talk about data culture a lot, and in my presentations I often emphasize how the most important success factor when adopting a tool like Power BI[1] is the culture of the organization, not the tool itself.

I talk about this a lot, but I think Caitie McCaffrey may have just had the final word.[2]

2019-10-24-12-47-22-722--msedge

I don’t think that Caitie was talking about the enterprise adoption of self-service business intelligence, but she could have been.

In my day job I get to talk to leaders from large companies around the world, and to see how they’re adopting and using Power BI, Azure. Before today I didn’t think of Moby Dick – I thought of Leo Tolstoy’s classic Anna Karenina, which starts with this classic line:

All happy families are alike; each unhappy family is unhappy in its own way.

Although the details vary, large companies that have successfully adopted managed self-service BI at scale have cultures with important aspects in common:

  • Leaders empower business users to work with data
  • Leaders trust business users to use data to make better decisions
  • IT supports business users with platforms and tools and with curated data sources
  • Business users work with the tools from IT and the guidance from leaders, and work within the guardrails and guidelines given to them for this use
  • Business and IT collaborate to deliver responsive solutions and mature/stable solutions, with clearly defined responsibilities between them

Companies that are successful with managed self-service BI do these things. Companies that are not successful do not. The details vary, but the pattern holds up again and again.

How do these roles and responsibilities relate to culture?

In many ways a culture is defined by the behaviors it rewards, the behaviors it allows, and the behaviors it punishes. A culture isn’t what you say – it’s what you do.

In the context of BI, having a culture with shared goals that enable business and IT to work together with the support from the company leaders is the key. If you have this culture, you can be successful with any tool. Some tools may be more helpful than others, and the culture will enable the selection of better tools over time, but the tool is not the most important factor. The culture – not the tool – inevitably determines success.

This is not to say that BI tools should not improve to be a bigger part of the solution. But to paraphrase Caitie… maybe you should let that white whale swim past.

 


[1] But definitely not only Power BI.

[2] He says unironically, before writing many more words.