Quick Tip: Working with dataflow-created CDM folders in ADLSg2

If you’re using your own organizational Azure Data Lake Storage Gen2 account for Power BI dataflows, you can use the CDM folders that Power BI creates as a data source for other efforts, including data science with tools like Azure Machine Learning and Azure Databricks.

Image by Arek Socha from Pixabay
A world of possibilities appears before you…

This capability has been in preview since early this year, so it’s not really new, but there are enough pieces involved that it may not be obvious how to begin – and I continue to see enough questions about this topic that another blog post seemed warranted.

The key point is that because dataflows are writing data to ADLSg2 in CDM folder format, Azure Machine Learning and Azure Databricks can both read the data using the metadata in the model.json file.

This JSON file serves as the “endpoint” for the data in the CDM folder; it’s a single resource that you can connect to, so you don’t have to worry about the complexities of the various subfolders and files that the CDM folder contains.
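To make the “single endpoint” idea a little more concrete, here is a minimal Python sketch of how a consumer might read model.json to discover the entities, their columns, and the data files behind them. The sample content is a hypothetical, heavily trimmed model.json (the dataflow name, entity, and storage path are made up for illustration), and the function name `summarize_model` is my own. It assumes the usual layout in which each entity lists its `attributes` and its `partitions`, with each partition carrying a `location`.

```python
import json

# Hypothetical, trimmed-down model.json content for illustration only.
# A real file written by a dataflow contains more metadata than this.
SAMPLE_MODEL_JSON = """
{
  "name": "SalesDataflow",
  "version": "1.0",
  "entities": [
    {
      "$type": "LocalEntity",
      "name": "Customers",
      "attributes": [
        {"name": "CustomerId", "dataType": "int64"},
        {"name": "CustomerName", "dataType": "string"}
      ],
      "partitions": [
        {
          "name": "Part001",
          "location": "https://example.dfs.core.windows.net/powerbi/SalesDataflow/Customers/Part001.csv"
        }
      ]
    }
  ]
}
"""


def summarize_model(model_text):
    """Return {entity name: {"columns": [...], "partition_locations": [...]}}."""
    model = json.loads(model_text)
    summary = {}
    for entity in model.get("entities", []):
        summary[entity["name"]] = {
            # Column names and declared types come from the entity's attributes.
            "columns": [
                (attr["name"], attr.get("dataType"))
                for attr in entity.get("attributes", [])
            ],
            # Each partition points at one data file in the CDM folder.
            "partition_locations": [
                part["location"] for part in entity.get("partitions", [])
            ],
        }
    return summary


summary = summarize_model(SAMPLE_MODEL_JSON)
for entity_name, details in summary.items():
    print(entity_name, details["columns"], details["partition_locations"])
```

In practice you would download model.json from your own storage account rather than embed it as a string. The point is only that everything a downstream tool needs to locate and interpret the data is described in this one file.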

This tutorial is probably the best place to start if you want to know more[1]. It includes directions and sample code for creating and consuming CDM folders from a variety of different Azure services – and Power BI dataflows. If you’re one of the people who has recently asked about this, please go through this tutorial as your next step!


[1] It’s the best resource I’m aware of – if you find a better one, please let me know!

One thought on “Quick Tip: Working with dataflow-created CDM folders in ADLSg2”

  1. Matt, I’m trying to access the data snapshots created by my ingestion dataflow (i.e. the one that retrieves data from an external source). I’m able to do so successfully from Power BI Desktop by using the DFS endpoint for my folder in ADLS Gen2 (i.e. BYOD as described in your post), but this source is not yet supported in PBI dataflows (i.e. I can’t reingest the snapshots from the ingestion dataflow back into a second dataflow).

    I tried the CDM folder method you described here (in Power BI Desktop for now) but that seems to expose only data from the latest snapshot, correct?

    I also tried accessing the DFS endpoint (https://xxxx.dfs.core.windows.net/) via Web.Contents but ran into authentication errors. I guess I should just wait for ADLSg2 to be officially supported as a source in PBI dataflows, or do it via ADF wrangling dataflows if I really want all the ETL to happen upstream from the PBI dataset.

    Anyway I was curious to hear your thoughts about accessing the ADLSg2 snapshots, as snapshots are a fairly common requirement and they’re not otherwise provided by the Power BI stack. So far I’ve built a proof of concept dataset that seems to work fairly well by using the modified date of the snapshots and reconstructing data tables by querying both the snapshots (headerless raw data) and model.json (though I was not able to automate column Type assignment in Power Query because the [Type] type is a weird animal).
