Dataflows in Power BI: Overview Part 4 – CDM Folders

Important: This post was written and published in 2018, and the content below no longer represents the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

One key aspect of Power BI dataflows is that they store their data in Azure Data Lake Storage Gen2. As mentioned in part 1, this underlying technology is not exposed to Power BI users. If you’re working in Power BI, a dataflow is just a collection of entities in a workspace, with data that can be reused. But if you’re trying to understand dataflows, it’s worth looking under the hood at some of the details.

Power BI stores dataflow data in a format known as CDM Folders. The “CDM” part stands for Common Data Model[1] and the “Folder” part… is because they’re folders, with files in them.

Each CDM folder is a simple and self-describing structure. The folder contains one or more[2] CSV files for each entity, plus a JSON metadata file. Having the data in a simple, standard format like CSV makes it easy for any application or service to read.[3] Having a JSON metadata file that describes the folder’s contents means that any consumer can read the JSON to understand those contents and their structure.
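To make that concrete, here’s a sketch of what a CDM folder for a dataflow with two entities might look like. The entity and file names are hypothetical and the exact layout is an implementation detail, but the shape – one or more CSV files per entity, plus a metadata file (model.json) at the root – matches the description above:

```
MyDataflow/
├── model.json             <- JSON metadata describing the folder's contents
├── Customers/
│   └── Customers.csv      <- entity data
└── Orders/
    ├── Orders.Part001.csv <- one file per partition
    └── Orders.Part002.csv
```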


The JSON metadata file contains:

  • The names and locations of all files in the folder.
  • Entity metadata, including names, descriptions, attributes, data types, last modified dates, and so on.
  • Lineage information for the entities – specifically, the Power Query “M” query that defines each entity.
  • Information about how each entity conforms (or does not conform) to Common Data Model standard entity schemas.

If you’re interested in seeing this for yourself, the JSON metadata for a dataflow can be exported from the Power BI portal. Just select “export JSON” from the menu in the dataflows list.

[Screenshot: the “export JSON” option in the dataflows list menu]
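Once you have the exported JSON, a few lines of code are enough to explore it. Here’s a minimal Python sketch that prints each entity, its attributes, and the data files behind it. The field names follow the CDM folder metadata format described above, but the schema shown here is simplified, so treat it as illustrative rather than definitive:

```python
import json

# Load the metadata exported from the Power BI portal
# (inside the CDM folder itself, this file is named model.json).
with open("model.json", encoding="utf-8") as f:
    model = json.load(f)

print("Model:", model.get("name"))

for entity in model.get("entities", []):
    print("\nEntity:", entity.get("name"))

    # Attribute metadata: names and data types.
    for attr in entity.get("attributes", []):
        print("  attribute:", attr.get("name"), "-", attr.get("dataType"))

    # Partitions point at the CSV files that hold the entity's data.
    for partition in entity.get("partitions", []):
        print("  data file:", partition.get("location"))
```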

You don’t need to know any of this to use dataflows in Power BI. But if you’re interested in getting the most from dataflows in your end-to-end data architecture, there’s no time like the present to see how things work.


[1] The Common Data Model is a bigger topic than we’ll cover here, but if you’re interested in an introduction, you can check out this excellent session from Microsoft Ignite 2018.

[2] For simple scenarios, each entity will be backed by a single text file. For more complex scenarios involving partitioned data or incremental refresh, there will be one file per partition.

[3] Please note that in the default configuration, Power BI manages the underlying storage, and it is not accessible to other applications or services – so this won’t do you much good out of the box. Power BI will, however, provide integration with Azure that will make this capability important and valuable.

21 thoughts on “Dataflows in Power BI: Overview Part 4 – CDM Folders”

  1. Neville de Sousa

    Hi Matthew,

    This is great and I want to dive right in. From a change perspective, how would you counter the argument, “let’s just stick to a DW”? I can see this coming up, especially when the final data files are in CSV format.

    Thank you.


    1. I would counter that argument by saying “go for it!!”

      Dataflows deliver capabilities for self-service data preparation, and because of their use of Azure Data Lake they enable a bunch of interesting integration scenarios for data science, big data, and the like. They are NOT designed to replace the data warehouse. If your scenario includes the ability to add a new dimension to a data warehouse, or to add new attributes to existing dimensions, that’s probably a good direction to choose. But without dataflows, this “data warehousing” task typically requires assistance from IT – it’s not a self-service task.


      1. Respectfully, I don’t agree.

        Tools like dataflows are designed to reduce latency in data delivery to analytics platforms, and the delay you see isn’t just in the actual physical delivery of the data, but also in the time it often takes IT to integrate new data into the classic data warehouse.

        More and more, it’s becoming evident that for many organizations, data warehouses are an anti-pattern relative to their BI needs.

        Also, when you consider that Microsoft is investing heavily in CDM alongside dataflows, it becomes clear that Azure Data Lake Storage Gen2 is poised to become the data warehouse of the future – especially given the opportunities that exist when you configure Power BI dataflows to use an organizational data lake in lieu of the Power BI-managed one.



  2. HT

    Hi Matthew, thanks for sharing these great posts. I am very new to dataflows, and I was wondering: since they use ADLS Gen2, do I have to specifically create an ADLS Gen2 resource in Azure, or is this bundled within dataflows?


    1. Thanks for the question!

      If you want to have non-Power BI apps and services work with the data produced by a dataflow, you need to use your own ADLSg2 resource. If you only want to work with the data in Power BI, all you need is a Pro license, and the Azure storage under the hood is taken care of by the Power BI service.
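      To illustrate the first case, here’s a minimal Python sketch that reads a dataflow’s metadata from your own ADLSg2 account using the azure-storage-file-datalake package. The account name, file system, and workspace/dataflow path are all hypothetical, and the exact layout Power BI writes may differ, so treat this as a sketch rather than a recipe:

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Connect to your own ADLSg2 account (hypothetical account name and key).
service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",
    credential="<storage-account-key>",
)

# Hypothetical file system and workspace/dataflow folder path.
fs = service.get_file_system_client("powerbi")
metadata = fs.get_file_client("MyWorkspace/MyDataflow/model.json")

# Download and print the CDM folder metadata.
print(metadata.download_file().readall().decode("utf-8"))
```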

