This morning I presented a new webinar for the Istanbul Power BI user group, covering one of my favorite subjects: common patterns for successfully using and adopting dataflows in Power BI.
This session represents an intersection of my data culture series in that it presents lessons learned from successful enterprise customers, and my dataflows series in that… in that it’s about dataflows. I probably didn’t need to point out that part.
The session recording is available for on-demand viewing. The presentation is around 50 minutes, with about 30 minutes of dataflows-centric Q&A at the end. Please check it out, and share it with your friends!
This is my personal blog – I try to be consistently explicit in reminding all y’all about this when I post about topics that are related to my day job as a program manager on the Power BI CAT team. This is one of those posts.
If I had to oversimplify what I do at work, I’d say that I represent the voice of enterprise Power BI customers. I work with key stakeholders from some of the largest companies in the world, and ensure that their needs are well-represented in the Power BI planning and prioritization process, and that we deliver the capabilities that these enterprise customers need[1].
Looking behind this somewhat grandiose summary[2], a lot of what I do is tell stories. Not my own stories, mind you – I tell the customers’ stories.
It was the best of clouds, it was the worst of clouds.
On an ongoing basis, I ask customers to tell me their stories, and I help them along by asking these questions:
What goals are you working to achieve?
How are you using Power BI to achieve these goals?
Where does Power BI make it hard for you to do what you need to do?
When they’re done, I have a pretty good idea what’s going on, and do a bunch of work[3] to make sure that all of these stories are heard by the folks responsible for shipping the features that will make these customers more successful.
Most of the time these stories are never shared outside the Power BI team, but on occasion there are customers who want to share their stories more broadly. My amazing teammate Lauren has been doing the heavy lifting[4] in getting them ready to publish for the world to see, and yesterday the fourth story from her efforts was published.
Update: Apparently the Cerner story was getting published while I was writing this post. Added to the list above.
I know that some people will look at these stories and discount them as marketing – there’s not a lot I can do to change that – but these are real stories that showcase how real customers are overcoming real challenges using Power BI and Azure. Being able to share these stories with the world is very exciting for me, because it’s an insight into the amazing work that these customers are doing, and how they’re using Power BI and Azure services to improve their businesses and to make people’s lives better. They’re demonstrating the art of the possible in a way that is concrete and real.
And for each public story, there are scores of stories that you’ll probably never hear. But the Power BI team is listening, and as long as they keep listening, I’ll keep helping the customers tell their stories…
[1] This makes me sound much more important than I actually am. I should ask for a raise.
[2] Seriously, if I do this, shouldn’t I be a VP or Partner or something?
[3] Mainly boring work that is not otherwise mentioned here.
[4] This is just one more reason why having a diverse team is so important – this is work that would be brutally difficult for me, and she makes it look so easy!
But how do you do this in as secure a manner as possible, so that the right users have the minimum necessary permissions on the right data?
The short answer is that you let the data source handle secure access to the data it manages. ADLSg2 has a robust security model, which supports both Azure role-based access control (RBAC) and POSIX-like access control lists (ACLs)[1].
The longer answer is that this robust security model may make it more difficult to know how to set up permissions in the data lake to meet your analytics and security requirements.
Earlier this week I received a question from a customer on how to get Power BI to work with data in ADLSg2 that is secured using ACLs. I didn’t know the answer, but I knew who would, so I looped in Ben Sack from the dataflows team. Ben answered the customer’s questions and unblocked their efforts, and he said that I could turn his answers into a blog post. Thank you, Ben![2]
Here’s what you should know:
1 – If you’re using ACLs, you must specify at least a filesystem name in the URL you load in the connector (the same applies if you access ADLS Gen2 via an API or any other client).
4 – Default ACLs are a great way to have ACLs propagate to child items, but they have to be set before creating subfolders and files; otherwise you need to explicitly set ACLs on each item.[3]
6 – If you get an error accessing a path that is deep in the filesystem, work your way down from the filesystem level, fixing ACL settings at each step.
For example, if you are having trouble accessing https://StorageAccountName.dfs.core.windows.net/FileSystemName/SubFolder1/SubFolder2/File(s), check the ACLs on FileSystemName first, then SubFolder1, then SubFolder2, and finally the file(s) themselves.
Update: James Baker, a Program Manager on the Azure Storage team, has published a PowerShell script on GitHub to recursively set ACLs. Thanks to Simon for commenting on this post to make me aware of it, to Josh from the Azure support team for pointing me to the GitHub repo, and of course to James for writing the actual script!
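If you’d rather use Python than PowerShell, here’s a minimal sketch of the same idea using the azure-storage-file-datalake and azure-identity packages: apply an ACL to a folder and then to every existing item beneath it, since default ACLs only affect items created after they’re set. The account URL, filesystem, folder, and the Azure AD object ID inside the ACL string are all placeholders – treat this as an illustration to adapt and test against a non-production filesystem, not a finished script.

```python
# pip install azure-storage-file-datalake azure-identity
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder values -- replace with your own account, filesystem, and folder.
ACCOUNT_URL = "https://StorageAccountName.dfs.core.windows.net"
FILESYSTEM = "FileSystemName"
ROOT_FOLDER = "SubFolder1"

# Grant read + execute to a (hypothetical) Azure AD object ID, alongside the
# standard owner/group/mask/other entries. A real script might apply different
# permissions to files than to directories; this keeps things simple.
NEW_ACL = (
    "user::rwx,"
    "user:00000000-0000-0000-0000-000000000000:r-x,"
    "group::r-x,mask::r-x,other::---"
)

service = DataLakeServiceClient(account_url=ACCOUNT_URL,
                                credential=DefaultAzureCredential())
filesystem = service.get_file_system_client(FILESYSTEM)

# Set the ACL on the starting folder...
filesystem.get_directory_client(ROOT_FOLDER).set_access_control(acl=NEW_ACL)

# ...then walk every existing child item and apply the same ACL, because
# default ACLs only affect items created *after* they are set.
for item in filesystem.get_paths(path=ROOT_FOLDER, recursive=True):
    if item.is_directory:
        filesystem.get_directory_client(item.name).set_access_control(acl=NEW_ACL)
    else:
        filesystem.get_file_client(item.name).set_access_control(acl=NEW_ACL)
```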
[1] This description is copied directly from the ADLSg2 documentation, which you should also read before acting on the information in this post.
[2] Disclaimer: This post is basically me using my blog as a way to get Ben’s assistance online so more people can get insights from it. If the information is helpful, all credit goes to Ben. If anything doesn’t work, it’s my fault. Ok, it may also be your fault, but it’s probably mine.
[3] This one is very important to know before you begin, even though it may be #3 on the list.
[4] This is a best practice pretty much everywhere, not just here.
Most of my blog posts that discuss the integration of Azure data services and Power BI dataflows via Common Data Model folders[1][2][3] include links to a tutorial and sample originally published in late 2018 by the Azure team. This has long been the best resource to explain in depth how CDM folders fit in with the bigger picture of Azure data.
Microsoft Solutions Architect Ted Malone has used the Azure sample as a starting point for a GitHub project of his own, and has extended it to make it suitable for more scenarios.
The thing that has me the most excited (beyond having Ted contributing to a GitHub repo, and having code that works with large datasets) is the plan to integrate with Apache Atlas for lineage and metadata. That’s the good stuff right there.
If you’re following my blog for more than just Power BI and recipes, this is a resource you need in your toolkit. Check it out, and be sure to let Ted know if it solves your problems.
This week’s Power BIte is the fourth and final entry in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.
When creating a dataflow by attaching an external CDM folder, the dataflow will have the following characteristics:
Data ingress path: ingress is performed by Azure Data Factory, Databricks, or whatever Azure service or app has created the CDM folder.
Data location: data is stored in ADLSg2, in the CDM folder created by the data ingress process.
Data refresh: the data is refreshed based on the execution schedule and properties of the data ingress process, not by any setting in Power BI.
The key to this scenario is the CDM folder storage format. CDM folders provide a simple and open way to persist data in a data lake. Because CDM folders are implemented using CSV data files and JSON metadata, any application can read from and write to them – this includes multiple Azure services that have libraries for reading and writing CDM folders, as well as third-party data tools like Informatica that have implemented their own CDM folder connectors.
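To make the “any application can read from and write to them” point concrete, here’s a minimal hand-rolled sketch that writes one CSV partition and a model.json describing it. A real pipeline would use the CDM libraries for Data Factory, Databricks, and friends instead; the folder, entity, attributes, and data below are invented for illustration, and the metadata is trimmed to the bare minimum rather than being a complete model.json.

```python
import csv
import json
from pathlib import Path

# Hypothetical output folder and entity -- a real pipeline (ADF, Databricks,
# etc.) would use the CDM libraries rather than writing this by hand.
folder = Path("./SampleCdmFolder")
(folder / "Orders").mkdir(parents=True, exist_ok=True)

# Data partitions are plain CSV files, typically written without a header row.
rows = [(1, "2019-10-01", 129.99), (2, "2019-10-02", 54.50)]
with open(folder / "Orders" / "Orders.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# A heavily trimmed model.json: just enough metadata to describe the entity,
# its attributes, and where its single partition lives.
model = {
    "name": "SampleCdmFolder",
    "version": "1.0",
    "entities": [{
        "$type": "LocalEntity",
        "name": "Orders",
        "attributes": [
            {"name": "OrderId", "dataType": "int64"},
            {"name": "OrderDate", "dataType": "dateTime"},
            {"name": "Amount", "dataType": "double"},
        ],
        "partitions": [{"name": "Orders", "location": "Orders/Orders.csv"}],
    }],
}
(folder / "model.json").write_text(json.dumps(model, indent=2))
```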
CDM folders enable scenarios like this one, which is implemented in a sample and tutorial published on GitHub by the Azure data team:
Create a Power BI dataflow by ingesting order data from the Wide World Importers sample database and save it as a CDM folder
Use an Azure Databricks notebook that prepares and cleanses the data in the CDM folder, and then writes the updated data to a new CDM folder in ADLS Gen2
Attach the CDM folder created by Databricks as an external dataflow in Power BI[2]
Use Azure Machine Learning to train and publish a model using data from the CDM folder
Use an Azure Data Factory pipeline to load data from the CDM folder into staging tables in Azure SQL Data Warehouse and then invoke stored procedures that transform the data into a dimensional model
Use Azure Data Factory to orchestrate the overall process and monitor execution
That’s it for this mini-series!
If all this information doesn’t make sense yet, now is the time to ask questions.
[1] New videos every Monday morning!
[2] I added this bullet to the list because it fits in with the rest of the post – the other bullets are copied from the sample description.
If you’re using your own organizational Azure Data Lake Storage Gen2 account for Power BI dataflows, you can use the CDM folders that Power BI creates as a data source for other efforts, including data science with tools like Azure Machine Learning and Azure Databricks.
A world of possibilities appears before you…
This capability has been in preview since early this year, so it’s not really new, but there are enough pieces involved that it may not be obvious how to begin – and I continue to see enough questions about this topic that another blog post seemed warranted.
The key point is that because dataflows are writing data to ADLSg2 in CDM folder format, Azure Machine Learning and Azure Databricks can both read the data using the metadata in the model.json file.
This JSON file serves as the “endpoint” for the data in the CDM folder: it’s a single resource you can connect to, without having to worry about the complexities of the various subfolders and files that the CDM folder contains.
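As a rough illustration of what that looks like in code, here’s a sketch of how a data scientist might pull a single dataflow entity into a pandas DataFrame by following the model.json metadata. The account URL, filesystem, CDM folder path, and entity name are placeholders, and the sketch assumes the partition files are header-less CSVs whose column names come from the entity’s attributes – which is how dataflow-generated CDM folders typically look – so treat it as a starting point rather than official sample code.

```python
# pip install azure-storage-file-datalake azure-identity pandas
import io
import json
from urllib.parse import unquote, urlparse

import pandas as pd
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders -- point these at your own storage account and CDM folder.
ACCOUNT_URL = "https://StorageAccountName.dfs.core.windows.net"
FILESYSTEM = "powerbi"
CDM_FOLDER = "MyWorkspace/MyDataflow"
ENTITY_NAME = "Orders"

filesystem = (
    DataLakeServiceClient(ACCOUNT_URL, credential=DefaultAzureCredential())
    .get_file_system_client(FILESYSTEM)
)

def read_bytes(path: str) -> bytes:
    """Download a single file from the filesystem."""
    return filesystem.get_file_client(path).download_file().readall()

# model.json describes every entity and where its CSV partitions live.
model = json.loads(read_bytes(f"{CDM_FOLDER}/model.json"))
entity = next(e for e in model["entities"] if e["name"] == ENTITY_NAME)

# Attribute names become the column names; the partitions themselves are
# assumed to be header-less CSV files.
columns = [attribute["name"] for attribute in entity["attributes"]]

frames = []
for partition in entity.get("partitions", []):
    # Partition locations are full URLs; this assumes they point into the
    # same filesystem, so each is converted to a filesystem-relative path.
    relative_path = unquote(urlparse(partition["location"]).path).split("/", 2)[2]
    frames.append(
        pd.read_csv(io.BytesIO(read_bytes(relative_path)), header=None, names=columns)
    )

df = pd.concat(frames, ignore_index=True)
print(df.shape)
```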
This tutorial is probably the best place to start if you want to know more[1]. It includes directions and sample code for creating and consuming CDM folders from a variety of different Azure services – and Power BI dataflows. If you’re one of the people who has recently asked about this, please go through this tutorial as your next step!
[1] It’s the best resource I’m aware of – if you find a better one, please let me know!
You probably already know that Power BI dataflows store their data in CDM folders. But what does this actually mean?
Matthew apparently thinks this is what CDM looks like inside of computers.
This is a quick post to share information that I hope will answer some of the most common questions that I hear from time to time, and which I discuss when I present on Power BI dataflows integration with Azure. I don’t believe any of the information in this post is new or unique[1], but I do believe it is delivered in a more targeted manner that might help.
Point #1: CDM is a metadata system
The Common Data Model is a metadata system that simplifies data management and application development by unifying data into a known form and applying structural and semantic consistency across multiple apps and deployments. If you’re coming from a SQL Server background, it may help to think about CDM as the “system tables” for data that’s stored in multiple locations and formats. This analogy doesn’t hold up to particularly close inspection, but it’s a decent place to start.
Point #2: CDM includes standard entity schemas
In addition to being a metadata system, the Common Data Model includes a set of standardized, extensible data schemas that Microsoft and its partners have published. This collection of predefined schemas includes entities, attributes, semantic metadata, and relationships. The schemas represent commonly used concepts and activities, such as Account and Campaign, to simplify the creation, aggregation, and analysis of data.
Point #3: CDM folders are data storage that use CDM metadata
A CDM folder is a folder in a data lake that conforms to specific, well-defined, and standardized metadata structures and self-describing data. These folders facilitate metadata discovery and interoperability between data producers and data consumers.
CDM folders store metadata in a model.json file; this is what makes them self-describing. This metadata conforms to the CDM metadata format, and can be read by any client application or code that knows how to work with CDM.
Point #4: You don’t need to use any standard entities
The most common misconception I hear about CDM and CDM folders is that you only use them when you’re storing “standard data.” This is not correct. The data in a CDM entity may map to a standard entity schema, but for 99% of the entities I have built or used, this is not the case. There is nothing in CDM or CDM folders that requires you to use a standard schema.
I hope this helps – please let me know if you have questions!
[1] Check out the documentation for CDM and CDM folders here and here, and here for more detail. You’ll probably notice that some chunks of text in this post were simply copied from that documentation.
I received a question today via Twitter, and although I know the information needed to answer it is available online, I don’t believe there’s a single concise answer anywhere[1]. This is the question, along with a brief elaboration following my initial response:
Here’s the short answer: When you use an organizational ADLSg2 account to store dataflow data, your Azure subscription will be billed for any storage and egress based on however Azure billing works[2].
Here’s the longer answer:
Power BI dataflows data counts against the same limits as Power BI datasets. Each Pro license grants 10 GB of storage, and a Premium capacity node includes 100 TB of storage.
Integrating Power BI dataflows with ADLSg2 is not limited to Power BI Premium.
When you’re using Power BI dataflows in their default configuration, dataflow data is stored in this Power BI storage and counts against the appropriate quota.
When dataflow data is saved to Power BI storage, it can only be accessed by Power BI – no other services or applications can read the data.
When you configure your dataflows to use an organizational ADLSg2 account, the dataflow data is saved to the Azure resource you specify rather than to Power BI storage, so it doesn’t count against the Pro or Premium storage quota. This is particularly significant when you’re not using Power BI Premium, because ADLSg2 storage will scale to support any scenario rather than being constrained by the 10 GB Pro storage limit.
When dataflow data is saved to ADLSg2, the CDM folders can be accessed by any authorized client via Azure APIs, and by Power BI as dataflow entities. This is particularly valuable for enabling collaboration between analysts and other Power BI users, and data scientists and other data professionals using Azure tools.
Hopefully this will help clear things up. If you have any questions, please let me know!
[1] Please note that I didn’t actually go looking to make sure, because I was feeling lazy and needed an excuse to blog about something vaguely technical.
[2] I add that final qualifier because I am not an authority on Azure or Power BI billing, or on licensing of any sort. For any specific information on licensing or billing, please look elsewhere for expert advice, because you won’t find it here.
In the last few weeks I’ve seen a spike in questions related to the integration of Power BI dataflows and Azure Data Lake Storage Gen2. Here’s a quick “status check” on the current state of this feature to get the answers out for as many people as possible.
Power BI dataflows are generally available (GA) for capabilities that use the built-in Power BI-managed storage.
Power BI dataflows integration with Azure is currently in preview – this includes the “Bring your own storage account” capabilities, where you can configure Power BI to use your ADLSg2 storage account for dataflows storage instead of the built-in Power BI-managed storage. The preview currently has a few limitations[1] you should know about:
Only a single ADLSg2 storage account can be configured for a Power BI tenant.
The storage account, once configured, cannot be changed.
The setup process to connect Power BI with ADLSg2 is somewhat lengthy and step-intensive.
To grant users other than the dataflow’s owner access to the dataflow in Power BI, you must grant them permissions on the workspace in Power BI and grant them access to the CDM folder in ADLSg2 (see the sketch below for the ADLSg2 half).
These limitations will be addressed when this capability hits GA, but you should definitely be aware of them in the meantime. (You may also want to take a look at this MBAS session for an overview of the roadmap for the rest of this calendar year.)
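To illustrate the ADLSg2 half of that last permissions point, here’s a minimal sketch (again using the azure-storage-file-datalake package) that grants a user’s Azure AD object ID read and execute on a workspace’s CDM folder, including default entries so that items created later inherit the same access. Every value is a placeholder, and because this sets the folder’s ACL outright, in a real tenant you’d want to preserve whatever entries the Power BI setup process has already added rather than overwriting them.

```python
# pip install azure-storage-file-datalake azure-identity
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders: your storage account, the filesystem configured for Power BI,
# the workspace's CDM folder, and the user's Azure AD object ID.
ACCOUNT_URL = "https://StorageAccountName.dfs.core.windows.net"
FILESYSTEM = "powerbi"
CDM_FOLDER = "MyWorkspace/MyDataflow"
USER_OBJECT_ID = "00000000-0000-0000-0000-000000000000"

folder = (
    DataLakeServiceClient(ACCOUNT_URL, credential=DefaultAzureCredential())
    .get_file_system_client(FILESYSTEM)
    .get_directory_client(CDM_FOLDER)
)

# This replaces the folder's ACL: base owner/group/mask/other entries, read +
# execute for the user, and matching "default:" entries so that new child
# items (future partitions, an updated model.json, etc.) inherit the access.
acl = ",".join([
    "user::rwx",
    f"user:{USER_OBJECT_ID}:r-x",
    "group::r-x",
    "mask::r-x",
    "other::---",
    "default:user::rwx",
    f"default:user:{USER_OBJECT_ID}:r-x",
    "default:group::r-x",
    "default:mask::r-x",
    "default:other::---",
])
folder.set_access_control(acl=acl)
```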
I’ve seen customers take different approaches:
Some customers delay their integration of Power BI and ADLSg2, and are waiting for these limitations to be removed before they move forward.
Some customers adopt within the constraints of the preview, and choose a workload or set of workloads where the current limitations are acceptable.
Some customers set up a demo tenant of Power BI and use it to test and validate the ADLSg2 integration while deciding between the first two approaches.
I hope this helps. If you or your customers have any questions on this topic that aren’t answered here, please let me know!
[1] And they’re all documented. Nothing in this blog post is new, but hopefully it will help to have this summary online and sharable.
Last week Microsoft held its annual Microsoft Business Applications Summit (MBAS) event in Atlanta. This two-day technical conference covers the whole Business Applications platform – including Dynamics, PowerApps, and Flow – and not just Power BI, but there was a ton of great Power BI content to be had. Now that the event is over, the session recordings and resources are available to everyone.
There’s a dedicated page on the Power BI community site with all of the sessions, but I wanted to call out a few sessions on dataflows and the Common Data Model that readers of this blog should probably watch[1].
This session is something of a “deep technical introduction” to dataflows in Power BI. If you’re already familiar with dataflows, a lot of this will be a repeat, but there are some gems as well.
This session is probably my favorite dataflows session from any conference. This is a deep dive into the dataflows architecture, including the brand-new-in-preview compute engine for performance and scale.
Common Data Model sessions
As you know, Power BI dataflows build on CDM and CDM folders. As you probably know, CDM isn’t just about Power BI – it’s a major area of investment across Azure data services as well. The session lineup at MBAS reflected this importance with three dedicated CDM sessions.
This ironically-named session[2] provides a comprehensive overview of CDM. It’s not really everything you need, but it’s the right place to begin if you’re new to CDM and want the big-picture view.
This session covers how CDM and CDM folders are used in Power BI and Azure data services. If you’ve been following dataflows and CDM closely over the past six months, much of this session might be review, but it’s an excellent “deep overview” nonetheless.
This session is probably the single best resource on CDM available today. The presenters are the key technical team behind CDM[3], and the session goes into details and concepts that aren’t available in any other presentation I’ve found. I’ve been following CDM pretty closely for the past year or more, and I learned a lot from this session. You probably will too.
[1] I have a list of a dozen or more sessions that I want to watch, and only a few of them are dataflows-centric. If you look through the catalog you’ll likely find some unexpected gems.
[2] If this is all you need to know, why do we have these other two sessions?
[3] Including Jeff Bernhardt, the architect behind CDM. Jeff doesn’t have the rock star reputation he deserves, but he’s been instrumental in the design and implementation of many of the products and services on which I’ve built my career. Any time Jeff is talking, I make a point to listen closely.