Real customers, real stories

This is my personal blog – I try to be consistently explicit in reminding all y’all about this when I post about topics that are related to my day job as a program manager on the Power BI CAT team. This is one of those posts.

If I had to oversimplify what I do at work, I’d say that I represent the voice of enterprise Power BI customers. I work with key stakeholders from some of the largest companies in the world, and ensure that their needs are well-represented in the Power BI planning and prioritization process, and that we deliver the capabilities that these enterprise customers need[1].

Looking behind this somewhat grandiose summary[2], a lot of what I do is tell stories. Not my own stories, mind you – I tell the customers’ stories.

Image by Daria Głodowska from Pixabay
It was the best of clouds, it was the worst of clouds.

On an ongoing basis, I ask customers to tell me their stories, and I help them along by asking these questions:

  • What goals are you working to achieve?
  • How are you using Power BI to achieve these goals?
  • Where does Power BI make it hard for you to do what you need to do?

When they’re done, I have a pretty good idea what’s going on, and do a bunch of work[3] to make sure that all of these stories are heard by the folks responsible for shipping the features that will make these customers more successful.

Most of the time these stories are never shared outside the Power BI team, but on occasion there are customers who want to share their stories more broadly. My amazing teammate Lauren has been doing the heavy lifting[4] in getting them ready to publish for the world to see, and yesterday the fourth story from her efforts was published.

You should check them out:

  1. Metro Bank: Metro Bank quickly delivers business efficiency gains without requiring involvement from IT
  2. Cummins: Cummins uses self-service BI to increase productivity and reduce unnecessary costs
  3. Veolia: Environmental services company builds sustainable, data-driven solutions with Power BI and Azure
  4. Avanade: Microsoft platform–focused IT consulting company innovates with Power BI and Azure to improve employee retention
  5. Cerner: Global healthcare solutions provider moves to the cloud for a single source of truth in asset and configuration management

Update: Apparently the Cerner story was getting published while I was writing this post. Added to the list above.

I know that some people will look at these stories and discount them as marketing – there’s not a lot I can do to change that – but these are real stories that showcase how real customers are overcoming real challenges using Power BI and Azure. Being able to share these stories with the world is very exciting for me, because it’s an insight into the amazing work that these customers are doing, and how they’re using Power BI and Azure services to improve their businesses and to make people’s lives better. They’re demonstrating the art of the possible in a way that is concrete and real.

And for each public story, there are scores of stories that you’ll probably never hear. But the Power BI team is listening, and as long as they keep listening, I’ll keep helping the customers tell their stories…


[1] This makes me sound much more important than I actually am. I should ask for a raise.

[2] Seriously, if I do this, shouldn’t I be a VP or Partner or something?

[3] Mainly boring work that is not otherwise mentioned here.

[4] This is just one more reason why having a diverse team is so important – this is work that would be brutally difficult for me, and she makes it look so easy!

 

Power BI and ACLs in ADLSg2

In addition to using Azure Data Lake Storage Gen2 as the location for Power BI dataflows data, Power BI can also use ADLSg2 as a data source. As organizations choose ADLSg2 as the storage location for more and more data, this capability is key to enabling analysts and self-service BI users to get value from the data in the lake.

Oh buoy, that is one big data lake!

But how do you do this in as secure a manner as possible, so that the right users have the minimum necessary permissions on the right data?

The short answer is that you let the data source handle secure access to the data it manages. ADLSg2 has a robust security model, which supports both Azure role-based access control (RBAC) and POSIX-like access control lists (ACLs)[1].

The longer answer is that this robust security model may make it more difficult to know how to set up permissions in the data lake to meet your analytics and security requirements.

Earlier this week I received a question from a customer on how to get Power BI to work with data in ADLSg2 that is secured using ACLs. I didn’t know the answer, but I knew who would know, and I looped in Ben Sack from the dataflows team. Ben answered the customer’s questions and unblocked their efforts, and he said that I could turn his answers into a blog post. Thank you, Ben![2]

Here’s what you should know:

1 – If you’re using ACLs, the URL you load in the connector must include at least a filesystem name (the same is true if you access ADLS Gen2 via an API or any other client).

i.e. Path in Power BI Connector must at least be: https://storageaccountname.dfs.core.windows.net/FileSystemName/

2 – To read the contents of a file, every parent folder and the filesystem must have the “x” (execute) ACL, and the file itself must have the “r” (read) ACL. (The sketch after this list shows these rules in practice.)

i.e. if you want to access the file: https://StorageAccountName.dfs.core.windows.net/FileSystemName/SubFolder1/File1.csv

3 – To list the files in a folder, every parent folder and the filesystem must have the “x” ACL, and the immediate parent folder must also have the “r” ACL.

i.e. if you want to view and access the files in this subfolder: https://StorageAccountName.dfs.core.windows.net/FileSystemName/SubFolder1/

4 – Default ACLs are a great way to have ACLs propagate to child items, but they have to be set before the subfolders and files are created – otherwise you need to explicitly set ACLs on each item.[3]

5 – If permission management is going to be dynamic, use groups as much as possible rather than assigning permissions to individual users[4]. Grant the groups ACLs on the folders and files first, and then manage access via membership in the group.

6 – If you get an error accessing a path that is deep in the filesystem, work your way down from the filesystem level, fixing the ACL settings at each step.

i.e. if you are having trouble accessing https://StorageAccountName.dfs.core.windows.net/FileSystemName/SubFolder1/SubFolder2/File(s)

First try: https://StorageAccountName.dfs.core.windows.net/FileSystemName

Then: https://StorageAccountName.dfs.core.windows.net/FileSystemName/SubFolder1

And so on.
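To make rules 2 through 5 a bit more concrete, here is a minimal sketch using the azure-storage-file-datalake Python SDK. This is just an illustration (it isn’t from Ben): the account, filesystem, folder, file, and group object ID are all placeholders.

```python
# A minimal sketch, assuming the azure-storage-file-datalake and azure-identity
# packages; the account, filesystem, folder, file, and group object ID below are
# placeholders. Note that set_access_control replaces the entire ACL on an item,
# so the base user/group/other/mask entries are spelled out explicitly.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    "https://storageaccountname.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("FileSystemName")
group = "00000000-0000-0000-0000-000000000000"  # AAD group object ID (rule 5)

# Rules 2 and 3: every parent folder (and the filesystem root) needs "x" so the
# path can be traversed; the immediate parent also needs "r" to list its files.
# Rule 4: the "default:" entries propagate to items created later in this folder.
folder = fs.get_directory_client("SubFolder1")
folder.set_access_control(
    acl=(
        "user::rwx,group::r-x,mask::r-x,other::---,"
        f"group:{group}:r-x,"
        "default:user::rwx,default:group::r-x,default:mask::r-x,default:other::---,"
        f"default:group:{group}:r-x"
    )
)

# Rule 2: the file itself only needs "r" for its contents to be readable.
file = fs.get_file_client("SubFolder1/File1.csv")
file.set_access_control(
    acl=f"user::rw-,group::r--,mask::r--,other::---,group:{group}:r--"
)

print(folder.get_access_control()["acl"])  # verify what was actually applied
```

The filesystem root needs the same “x” entry as the folders beneath it; you can set it the same way on the root directory, or through a tool like Azure Storage Explorer.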

Update: James Baker, a Program Manager on the Azure Storage team, has published a PowerShell script on GitHub to recursively set ACLs. Thanks to Simon for commenting on this post to make me aware of it, to Josh from the Azure support team for pointing me to the GitHub repo, and of course to James for writing the actual script!
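If you’d rather stay in Python, newer versions of the azure-storage-file-datalake SDK have since added recursive ACL operations that cover the same ground. Here’s a hedged sketch, again with placeholder names:

```python
# A sketch assuming azure-storage-file-datalake 12.4 or later, which added the
# recursive ACL methods; account, filesystem, folder, and group ID are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeDirectoryClient

folder = DataLakeDirectoryClient(
    account_url="https://storageaccountname.dfs.core.windows.net",
    file_system_name="FileSystemName",
    directory_name="SubFolder1",
    credential=DefaultAzureCredential(),
)

group = "00000000-0000-0000-0000-000000000000"
# update_access_control_recursive merges these entries into the existing ACLs of
# the folder and everything beneath it, instead of replacing them outright.
result = folder.update_access_control_recursive(
    acl=f"group:{group}:r-x,default:group:{group}:r-x"
)
print(result.counters.files_successful, "files updated")
```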


[1] This description is copied directly from the ADLSg2 documentation, which you should also read before acting on the information in this post.

[2] Disclaimer: This post is basically me using my blog as a way to get Ben’s assistance online so more people can get insights from it. If the information is helpful, all credit goes to Ben. If anything doesn’t work, it’s my fault. Ok, it may also be your fault, but it’s probably mine.

[3] This one is very important to know before you begin, even though it may be #3 on the list.

[4] This is a best practice pretty much everywhere, not just here.

New resource: Generating CDM folders from Azure Databricks

Most of my blog posts that discuss the integration of Azure data services and Power BI dataflows via Common Data Model folders[1][2][3] include links to a tutorial and sample originally published in late 2018 by the Azure team. This has long been the best resource to explain in depth how CDM folders fit in with the bigger picture of Azure data.

Now there’s something better.

Microsoft Solutions Architect Ted Malone has used the Azure sample as a starting point for a GitHub project of his own, and has extended this sample project to start making it suitable for more scenarios.


The thing that has me the most excited (beyond having Ted contributing to a GitHub repo, and having code that works with large datasets) is the plan to integrate with Apache Atlas for lineage and metadata. That’s the good stuff right there.

If you’re following my blog for more than just Power BI and recipes, this is a resource you need in your toolkit. Check it out, and be sure to let Ted know if it solves your problems.


[1] Power BIte: Creating dataflows by attaching external CDM folders

[2] Quick Tip: Working with dataflow-created CDM folders in ADLSg2

[3] Dataflows, CDM folders and the Common Data Model

Power BIte: Creating dataflows by attaching external CDM folders

This week’s Power BIte is the fourth and final entry in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.

When creating a dataflow by attaching an external CDM folder, the dataflow will have the following characteristics:

  • Data ingress path: Ingress via Azure Data Factory, Databricks, or whatever Azure service or app has created the CDM folder.
  • Data location: Data stored in ADLSg2 in the CDM folder created by the data ingress process.
  • Data refresh: The data is refreshed based on the execution schedule and properties of the data ingress process, not by any setting in Power BI.

The key to this scenario is the CDM folder storage format. CDM folders provide a simple and open way to persist data in a data lake. Because CDM folders are implemented using CSV data files and JSON metadata, any application can read from and write to CDM folders. This includes multiple Azure services that have libraries for reading and writing CDM folders, as well as third-party data tools like Informatica that have implemented their own CDM folder connectors.
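To show just how simple the format is, here’s an illustrative sketch that reads a CDM folder using nothing more than the Python standard library and pandas. The folder path and the “Orders” entity are made up, and a real model.json carries more metadata than this sketch uses.

```python
# Illustrative only: the CDM folder path and the "Orders" entity are placeholders.
import json
from pathlib import Path

import pandas as pd

cdm_folder = Path("/mnt/datalake/powerbi/SalesWorkspace/WideWorldImporters")
model = json.loads((cdm_folder / "model.json").read_text())

# The metadata file lists every entity, its attributes, and its data partitions.
for entity in model["entities"]:
    print(entity["name"], [a["name"] for a in entity.get("attributes", [])])

# An entity's data is one or more CSV partitions referenced from the metadata;
# the CSVs typically have no header row, so column names come from model.json.
orders = next(e for e in model["entities"] if e["name"] == "Orders")
frames = [pd.read_csv(p["location"], header=None) for p in orders.get("partitions", [])]
df = pd.concat(frames, ignore_index=True)
df.columns = [a["name"] for a in orders["attributes"]]
```

In practice the partition locations are absolute URLs into the data lake, so you would pass storage credentials (or translate them to a mounted path) before reading them.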

CDM folders enable scenarios like this one, which is implemented in a sample and tutorial published on GitHub by the Azure data team:

  • Create a Power BI dataflow by ingesting order data from the Wide World Importers sample database and save it as a CDM folder
  • Use an Azure Databricks notebook that prepares and cleanses the data in the CDM folder, and then writes the updated data to a new CDM folder in ADLS Gen2
  • Attach the CDM folder created by Databricks as an external dataflow in Power BI[2]
  • Use Azure Machine Learning to train and publish a model using data from the CDM folder
  • Use an Azure Data Factory pipeline to load data from the CDM folder into staging tables in Azure SQL Data Warehouse and then invoke stored procedures that transform the data into a dimensional model
  • Use Azure Data Factory to orchestrate the overall process and monitor execution

That’s it for this mini-series!

If all this information doesn’t make sense yet, now is the time to ask questions.


[1] New videos every Monday morning!

[2] I added this bullet to the list because it fits in with the rest of the post – the other bullets are copied from the sample description.

Quick Tip: Working with dataflow-created CDM folders in ADLSg2

If you’re using your own organizational Azure Data Lake Storage Gen2 account for Power BI dataflows, you can use the CDM folders that Power BI creates as a data source for other efforts, including data science with tools like Azure Machine Learning and Azure Databricks.

Image by Arek Socha from Pixabay
A world of possibilities appears before you…

This capability has been in preview since early this year, so it’s not really new, but there are enough pieces involved that it may not be obvious how to begin – and I continue to see enough questions about this topic that another blog post seemed warranted.

The key point is that because dataflows are writing data to ADLSg2 in CDM folder format, Azure Machine Learning and Azure Databricks can both read the data using the metadata in the model.json file.

This JSON file serves as the “endpoint” for the data in the CDM folder; it’s a single resource you can connect to without having to worry about the complexities of the various subfolders and files that the CDM folder contains.
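If you want to peek at what a dataflow has written before wiring up Azure Machine Learning or Azure Databricks, a few lines of Python against the storage account will do it. This is a sketch rather than anything from the tutorial: it assumes the azure-storage-file-datalake package, and the account, workspace, and dataflow names are placeholders.

```python
# Sketch: list the entities a dataflow has written by reading its model.json
# straight from ADLSg2. All names below are placeholders; dataflow CDM folders
# typically land in the "powerbi" filesystem under <workspace>/<dataflow>/.
import json

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    "https://storageaccountname.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("powerbi")

model_file = fs.get_file_client("MyWorkspace/MyDataflow/model.json")
model = json.loads(model_file.download_file().readall())

for entity in model["entities"]:
    partitions = entity.get("partitions", [])
    print(entity["name"], "-", len(partitions), "partition(s)")
```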

This tutorial is probably the best place to start if you want to know more[1]. It includes directions and sample code for creating and consuming CDM folders from a variety of different Azure services – and Power BI dataflows. If you’re one of the people who has recently asked about this, please go through this tutorial as your next step!


[1] It’s the best resource I’m aware of – if you find a better one, please let me know!

Power BI dataflows, Premium, and ADLSg2

I received a question today via Twitter, and although I know the information needed to answer it is available online, I don’t believe there’s a single concise answer anywhere[1]. This is the question, along with a brief elaboration following my initial response:

Screenshot: the Twitter exchange about dataflows billing

Here’s the short answer: When you use an organizational ADLSg2 account to store dataflow data, your Azure subscription will be billed for any storage and egress based on however Azure billing works[2].

Here’s the longer answer:

  • Power BI dataflows data counts against the same limits as Power BI datasets. Each Pro license grants 10 GB of storage, and a Premium capacity node includes 100 TB of storage.
  • Integrating Power BI dataflows with ADLSg2 is not limited to Power BI Premium.
  • When you’re using Power BI dataflows in their default configuration, dataflow data is stored in Power BI-managed storage, and counts against the appropriate quota.
  • When dataflow data is saved to Power BI storage, it can only be accessed by Power BI – no other services or applications can read the data.
  • When you configure your dataflows to use an organizational ADLSg2 account, the dataflow data is saved to the Azure resource you specify, and not to the Power BI storage, so it doesn’t count against the Pro or Premium storage quota. This is particularly significant when you’re not using Power BI Premium, as ADLSg2 storage will scale to support any scenario, and not be limited by the 10 GB Pro storage limit.
  • When dataflow data is saved to ADLSg2, the CDM folders can be accessed by any authorized client via Azure APIs, and by Power BI as dataflow entities. This is particularly valuable for enabling collaboration between analysts and other Power BI users, and data scientists and other data professionals using Azure tools.

Hopefully this will help clear things up. If you have any questions, please let me know!


[1] Please note that I didn’t actually go looking to make sure, because I was feeling lazy and needed an excuse to blog about something vaguely technical.

[2] I add that final qualifier because I am not an authority on Azure or Power BI billing, or on licensing of any sort. For any specific information on licensing or billing, please look elsewhere for expert advice, because you won’t find it here.

 

Status Check: Power BI dataflows and ADLSg2

In the last few weeks I’ve seen a spike in questions related to the integration of Power BI dataflows and Azure Data Lake Storage Gen2. Here’s a quick “status check” on the current state of this feature to get the answers out for as many people as possible.

  • Power BI dataflows are generally available (GA) for capabilities that use the built-in Power BI-managed storage.
  • Power BI dataflows integration with Azure is currently in preview – this includes the “Bring your own storage account” capabilities where you can configure Power BI to use your ADLSg2 storage account for dataflows storage, instead of using the built-in Power BI-managed storage.
  • During preview, there are several known limitations[1]:
    • Only a single ADLSg2 storage account can be configured for a Power BI tenant.
    • The storage account, once configured, cannot be changed.
    • The setup process to connect Power BI with ADLSg2 is somewhat lengthy and step-intensive.
    • To grant users other than the owner of the dataflow access to the dataflow in Power BI, you must grant them access both to the workspace in Power BI and to the CDM folder in ADLSg2.

These limitations will be addressed when this capability hits GA, but you should definitely be aware of them in the meantime. (You may also want to take a look at this MBAS session for an overview of the roadmap for the rest of this calendar year.)

I’ve seen customers take different approaches:

  1. Some customers delay their integration of Power BI and ADLSg2, and are waiting for these limitations to be removed before they move forward.
  2. Some customers adopt within the constraints of the preview, and choose a workload or set of workloads where the current limitations are acceptable.
  3. Some customers set up a demo tenant of Power BI and use it to test and validate the ADLSg2 integration while deciding on option 1 or option 2.

I hope this helps. If you or your customers have any questions on this topic that aren’t answered here, please let me know!


[1] And they’re all documented. Nothing in this blog post is new, but hopefully it will help to have this summary online and sharable.

Power BI dataflows and CDM Sessions from MBAS 2019

Last week Microsoft held its annual Microsoft Business Applications Summit (MBAS) event in Atlanta. This two-day technical conference covers the whole Business Applications platform – including Dynamics, PowerApps, and Flow – and not just Power BI, but there was a ton of great Power BI content to be had. Now that the event is over, the session recordings and resources are available to everyone.


There’s a dedicated page on the Power BI community site with all of the sessions, but I wanted to call out a few sessions on dataflows and the Common Data Model that readers of this blog should probably watch[1].

Power BI dataflows sessions

Microsoft Power BI: Democratizing self-service data prep with dataflows

This session is something of a “deep technical introduction” to dataflows in Power BI. If you’re already familiar with dataflows a lot of this will be a repeat, but there are some gems as well.

Microsoft Power BI: Enterprise-grade BI with Power BI dataflows

This session is probably my favorite dataflows session from any conference. This is a deep dive into the dataflows architecture, including the brand-new-in-preview compute engine for performance and scale.

Common Data Model sessions

As you know, Power BI dataflows build on CDM and CDM folders. As you probably know, CDM isn’t just about Power BI – it’s a major area of investment across Azure data services as well. The session lineup at MBAS reflected this importance with three dedicated CDM sessions.

Common Data Model: All you need to know

This ironically named session[2] provides a comprehensive overview of CDM. It’s not really everything you need, but it’s the right place to begin if you’re new to CDM and want the big-picture view.

Microsoft Power BI: Common Data Model and Azure Data Services

This session covers how CDM and CDM folders are used in Power BI and Azure data services. If you’ve been following dataflows and CDM closely over the past six months much of this session might be review, but it’s an excellent “deep overview” nonetheless.

Microsoft Power BI: Advanced concepts in the Common Data Model

This session is probably the single best resource on CDM available today. The presenters are the key technical team behind CDM[3], and the session goes into details and concepts that aren’t available in any other presentation I’ve found. I’ve been following CDM pretty closely for the past year or more, and I learned a lot from this session. You probably will too.

Once you’re done watching these sessions, remember that there’s a huge library of technical sessions you can watch on-demand. Also some less-technical sessions.


[1] I have a list of a dozen or more sessions that I want to watch, and only a few of them are dataflows-centric. If you look through the catalog you’ll likely find some unexpected gems.

[2] If this is all you need to know, why do we have these other two sessions?

[3] Including Jeff Bernhardt, the architect behind CDM. Jeff doesn’t have the rock star reputation he deserves, but he’s been instrumental in the design and implementation of many of the products and services on which I’ve built my career. Any time Jeff is talking, I make a point to listen closely.

Session resources: Power BI dataflows and Azure Data Lake integration

Last week I delivered two online sessions on the topic of integrating Power BI dataflows with an organizational Azure Data Lake Storage Gen2 storage account. I’ve blogged on this topic before (link | link | link | link) but sometimes a presentation and demo is worth a thousand words.

On April 30th I delivered a “Power BI dataflows 201 – beyond the basics” session for the PASS Business Intelligence Virtual Chapter. The session recording is online here, and you can download the slides here.

On May 4th I delivered an “Integrating Power BI and Azure Data Lake with dataflows and CDM Folders” session for the SQL Saturday community event in Stockholm, Sweden. I was originally planning to deliver the Stockholm session in person, but due to circumstances beyond my control[1] I ended up presenting remotely, which meant that I could more easily record the session. The session recording is online here, and you can download the slides here.

Each of these sessions covers much of the same material. The Stockholm presentation got off to a bit of a rocky start[2], but it gets smoother after the first few minutes.

Please feel free to use these slides for your own presentations if they’re useful. And please let me know if you have any questions!

Update: I delivered an updated version of this session on September 27, at the Power Platform World Tour in Vancouver, BC. There is no recording of this session, but the updated slides can be downloaded here.


[1] I forgot to book flights. Seriously, I thought I had booked flights in February when I committed to speaking, and by the time I realized that I had not booked them, they were way out of my budget. This was not my finest hour.

[2] The presentation was scheduled to start at 6:00 AM, so I got up at 4:00 and came into the office to review and prepare. Instead I spent the 90 minutes before the session start time fighting with PC issues and got everything working less than a minute before 6:00. I can’t remember ever coming in quite this hot…