One key aspect of Power BI dataflows is that they store their data in CDM folders in Azure Data Lake Storage gen2. When a dataflow is refreshed, the queries that define the dataflow entities are executed, and their results are stored in the underlying CDM folders in the data lake storage that’s managed by the Power BI service.
By default the Power BI service hides the details of the underlying storage. Only the Power BI service can write to the CDM folders, and only the Power BI service can read from them.
But Matthew knew that there are other options beyond the default…
Please note: At the time this post is published, the capabilities it describes are being rolled out to Power BI customers around the world. If you do not yet see these capabilities in your Power BI tenant, please understand that the deployment process may take several days to reach all regions.
In addition to writing to the data lake storage that is included with Power BI, you can also configure Power BI to write to an Azure Data Lake Storage gen2 resource in your own Azure subscription. This configuration opens up powerful capabilities for using data created in Power BI as the source for other Azure services. This means that data produced by analysts in a low-code/no-code Power BI experience can be used by data scientists in Azure Machine Learning, or by data engineers in Azure Data Factory or Azure Databricks.
Let that sink in for a minute, because it’s more important than it seems when you first read it. Business data experts – the people who may not know professional data tools and advanced concepts in depth, but who are intimately involved with how the data is used to support business processes – can now use Power BI to produce data sets that can be easily used by data professionals in their tools of choice. This is a Big Deal. Not only does this capability deliver the power of Azure Data Lake Storage gen2 for scale and compute, it enables seamless collaboration between business and IT.
The challenge of operationalization and industrialization – one that has existed for as long as self-service BI has been around – has typically been solved by the business handing off its solution to IT. Ten years ago the artifact being handed off may have been an Excel workbook full of macros and VLOOKUPs; IT would then need to reverse-engineer the logic and re-implement it in a different tool and a different language. Power Query and dataflows have made this story simpler – an analyst can develop a query that IT can re-use directly. But now an analyst can easily produce data that can be used – directly and seamlessly – by IT projects. Bam.
Before I move on, let me add a quick sanity check here. You can’t build a production data integration process on non-production data sources and expect it to deliver a stable and reliable solution, and that last paragraph glossed over this fact. When IT starts using a business-developed CDM folder as a data source, this needs to happen in the context of a managed process that eventually includes the ownership of the data source transitioning to IT. The integration of Power BI dataflows and CDM folders in Azure Data Lake Storage gen2 will make this process much simpler, but the process will still be essential.
Now let’s take a look at how this works.
I’m not going to go into details about the data lake configuration requirements here – but there are specific steps that need to be taken on the Azure side of things before Power BI can write to the lake. For information on setting up Azure Data Lake Storage gen2 to work with Power BI, check the documentation.
The details are in the documentation, but once the setup is complete there will be a filesystem named powerbi, and the Power BI service will be authorized to read from it and write to it. As the Power BI service refreshes dataflows, it writes entity data in a folder structure that matches the content structure in Power BI. This approach – with folders named after workspaces and dataflows, and files named after entities – makes it easier for all parties to understand what data is stored where, and how the file storage in the data lake relates to the objects in Power BI.
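To make that layout easier to picture, here is a minimal sketch in Python. The workspace and dataflow names are hypothetical, and the helper functions are my own illustration rather than any official API – the one detail the documentation does confirm is that each CDM folder contains a model.json metadata file describing its entities.

```python
# Illustrative sketch of the CDM folder layout inside the "powerbi" filesystem.
# The names below are made up; consult the documentation for the exact layout.

def cdm_folder_path(workspace: str, dataflow: str) -> str:
    """Return the CDM folder path for a dataflow in the powerbi filesystem."""
    return f"powerbi/{workspace}/{dataflow}"

def model_json_path(workspace: str, dataflow: str) -> str:
    """Each CDM folder contains a model.json file describing its entities."""
    return f"{cdm_folder_path(workspace, dataflow)}/model.json"

print(model_json_path("Sales Analytics", "Customer Dataflow"))
# powerbi/Sales Analytics/Customer Dataflow/model.json
```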
To enable this feature, a Power BI administrator first needs to use the Power BI admin portal to connect Power BI to Azure Data Lake Storage gen2. This is a tenant-level setting. The administrator must enter the Subscription ID, the Resource Group ID, and the Storage Account name for the Azure Data Lake Storage gen2 resource that Power BI will use. The administrator must also enable the option in the admin portal labeled “Allow workspace admins to assign workspaces to this storage account.” Once this is turned on, we’re ready to go.
And of course, by “we” I mean “workspace admins” and by “go” I mean “configure our workspaces’ storage settings.”
When creating a new app workspace, in the “Advanced” portion of the UI, you can see the “Dataflow storage (Preview)” option. When this option is enabled, any dataflow in the workspace will be created in the ADLSg2 resource configured by the Power BI admin, rather than in the default internal ADLSg2 storage that is managed by the Power BI service.
There are a few things worth mentioning about this screen shot:
- This is not a Premium-only feature. Although the example above shows a workspace being created in dedicated Premium capacity, this is not required to use your own data lake storage account.
- If no Power BI administrator has configured an organizational data lake storage account, this option will not be visible.
- Apparently I need to go back and fix every blog post I’ve made up until now to replace “gen2” with “Gen2” because we’re using an upper-case G now.
There are a few limitations mentioned in the screen shot, and a few that aren’t, that are worth pointing out as well:
- Because linked and computed entities rely on in-lake compute, all of the workspaces involved need to use the same data lake storage account for these features to work.
- You can’t change this setting for a workspace that already contains dataflows. The option is available when creating a new workspace, and in existing workspaces that contain no dataflows, but once dataflows have been defined in a workspace its storage location is locked.
- Permissions… get a little complicated.
…so let’s look at permissions a little.
When you’re using the default Power BI storage, the Power BI service manages data access through workspace permissions. The Power BI service is the only reader and the only writer for the underlying CDM folders, and it controls all access to the data they contain.
When you’re using your organization’s data lake resource, ADLSg2 manages data access through the ACLs set on folders and files. The Power BI service grants permissions to the dataflow creator, but any additional permissions must be set manually on the files and folders in ADLSg2. This means that for a user to access the dataflow through Power BI, or the CDM folder through ADLSg2, they must be granted permissions on all of the relevant files and folders in ADLSg2.
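A quick sketch of why this gets complicated: ADLSg2 uses POSIX-style ACLs, so reading a single file requires read access on the file itself plus execute (traverse) access on every folder above it. The helper below is purely illustrative – the paths are hypothetical – but it shows how many individual grants one file can require.

```python
# Minimal illustration of POSIX-style ACL requirements in ADLSg2:
# reading one file needs "r" on the file and "x" on every ancestor folder.

def required_acl_grants(file_path: str) -> list[tuple[str, str]]:
    """Return (path, permission) pairs a reader needs for one file."""
    parts = file_path.strip("/").split("/")
    grants = []
    # execute ("--x") on each ancestor folder, so it can be traversed
    for i in range(1, len(parts)):
        grants.append(("/".join(parts[:i]), "--x"))
    # read ("r--") on the file itself
    grants.append((file_path.strip("/"), "r--"))
    return grants

for path, perm in required_acl_grants("powerbi/Sales/Customers/model.json"):
    print(perm, path)
```

Multiply this by every entity file in a dataflow, and it’s clear why granting access beyond the dataflow creator is a manual chore today.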
Between the ability to store dataflow data in your organization’s Azure Data Lake Storage gen2 resource, and the ability to attach external CDM folders as dataflows, Power BI now enables a wide range of collaboration scenarios.
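To make one of those scenarios concrete: a downstream tool discovers a CDM folder’s entities by reading its model.json. The fragment below is a deliberately stripped-down, hypothetical example – real model.json files produced by Power BI carry considerably more metadata – but the entities/partitions/location shape is the part a consumer cares about.

```python
import json

# Hypothetical, minimal model.json fragment; real files contain much more
# metadata (attribute schemas, annotations, refresh times, and so on).
model_json = """
{
  "name": "Customer Dataflow",
  "entities": [
    {
      "name": "Customers",
      "partitions": [
        {"name": "Part001",
         "location": "https://contoso.dfs.core.windows.net/powerbi/Sales/Customers/Part001.csv"}
      ]
    }
  ]
}
"""

def entity_locations(model_text: str) -> dict[str, list[str]]:
    """Map each entity name to the data file locations of its partitions."""
    model = json.loads(model_text)
    return {
        entity["name"]: [p["location"] for p in entity.get("partitions", [])]
        for entity in model.get("entities", [])
    }

print(entity_locations(model_json))
```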
This time I just copied the opening sentence from the last blog post. Since I was writing them at the same time, that was much easier.
Basically a root folder, says the guy who doesn’t really know much about Azure Data Lake Storage gen2.
I’m planning a post dedicated to dataflows security, but it’s not ready yet. Hopefully this will be useful in the interim.
This early experience will improve as the integration between Power BI and ADLSg2 continues to evolve.