Power BIte: Turning datasets into dataflows

At this point I’ve said “Power BI dataflows enable reuse” enough times that I feel like a broken record[1]. What does this phrase actually mean, and how can you take advantage of dataflows to enable reuse in your Power BI applications?

This Power BIte video is a bit longer than its predecessors, partly because it covers both the problem and the solution.

The problem is that self-service BI applications often start out as one-off efforts, but don’t stay that way. At least in theory, if the problem solved by the application was widespread and well understood, there would be an existing solution already developed and maintained by IT, and business users wouldn’t need to develop their own solutions.

Successful applications have a tendency to grow. For self-service BI, this could mean that more and more functionality gets added to the application, or it could mean that someone copies the relevant portions of the application and uses them as the starting point for a new, different-but-related, application.

Once this happens, a natural and gradual process of drift[2] sets in, as each branch of the application tree grows in its own direction. A predictable consequence of this drift in Power BI applications is that query definitions that start off as common copies gradually fall out of sync, meaning that “the same data” in two datasets will actually contain different values.

Moving the queries that need to be shared across multiple applications out of the individual datasets and into a single dataflow is a simple and effective solution to this problem. There’s no dedicated tooling for this in Power BI today, but the steps are straightforward.
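To make the end state concrete, here’s a sketch of what a dataset query typically looks like once the shared logic lives in a dataflow and the dataset simply consumes the resulting entity. The workspace and dataflow GUIDs and the “Customers” entity name below are placeholders – Power BI Desktop generates a query in roughly this shape when you connect to Power BI dataflows.

```
let
    // Generated by Get Data > Power BI dataflows in Power BI Desktop;
    // the GUIDs and entity name below are placeholders
    Source = PowerBI.Dataflows(null),
    Workspace = Source{[workspaceId = "00000000-0000-0000-0000-000000000000"]}[Data],
    Dataflow = Workspace{[dataflowId = "11111111-1111-1111-1111-111111111111"]}[Data],
    Customers = Dataflow{[entity = "Customers"]}[Data]
in
    Customers
```

Every dataset that needs the shared data consumes the same entity, so the transformation logic is defined and maintained in exactly one place.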

P.S. This is the first Power BIte video recorded in my home office. After struggling unsuccessfully to get decent audio quality in my office at work, I’m trying out a new environment and some new tools. I know there’s still work to be done, but hopefully this is a step in the right direction. As always, I’d love to know what you think…


 

[1] For my younger readers, this phrase is a reference to when Spotify used to be called “records” and the most common service outage symptom was a repeat of the audio buffer until the user performed a hard reset of the client application. True story.

[2] Is there a better term for this? I feel like there should be an existing body of knowledge that I could reference, but my searching did not yield any promising results. The fact that “Logical Drift” is the name of a band probably isn’t helping.

Power BI Premium Dedicated Capacity Load Assessment Tool on GitHub

Last month[1] at the Microsoft Business Applications Summit (MBAS), Power BI program managers David Magar and Swati Gupta showed off a new load assessment tool for Power BI Premium capacity workloads.


This new tool was included as part of the BRK2046 session on Power BI Premium at MBAS. The whole session is valuable, but the tool itself comes in around the 32-minute mark, and there’s a demo at the 37-minute mark. The tool is available today on GitHub.

This tool will help Power BI Premium customers better plan for how their specific workloads (reports, dashboards, datasets, dataflows, and patterns of access) will perform on a given Premium capacity.

The tool is built on top of a PBIE (Power BI Embedded) load generation tool that my teammate Sergei Gundorov developed to help ISVs better handle load on their PBIE capacities. It grabs a user’s token and uses it to render reports again and again, cycling through preset filter values and incrementing a “render counter”. The tool stops rendering when the authentication token expires, so the result is an empirical benchmark: “report X can run against capacity Y, Z times in 1 hour”.

The publicly available tool takes Sergei’s work as its starting point and uses PowerShell to wrap it in a simple menu-based UX that anybody can run. The tool enables users to:

  • choose multiple reports to run at once
  • choose the credentials used for each report
  • define filter values to cycle through between renders for each report
  • define how many users (browser windows) should request the report at once

Once all definitions are set, the tool launches multiple browser windows, each targeting a different report, and the user can watch the load test happening on screen.

The tool was an effective way for David and Swati to generate “interesting” load scenarios for their MBAS workshop. They used it to demonstrate how phenomena related to overloaded capacities (such as query wait time buildup and frequent evictions) are visible in the Power BI Premium metrics app. If you haven’t already watched the session recording, be sure to check it out.

The dedicated capacity load assessment tool is published on GitHub for anyone to use. There’s a post introducing it on the Power BI blog.

The folks at Artis Consulting have already taken the tool that Sergei developed and that was shown at MBAS, and have released a “Realistic” load test tool, also on GitHub. This tool builds on the original one and makes it easier to simulate users interacting with reports in a more realistic manner, such as selecting filters and bookmarks.

If you’re interested in understanding how your Power BI application will scale under load on your dedicated capacity[2], check out these tools and consider how to incorporate them into your development and deployment processes.


[1] Yes, MBAS took place in June, and this is getting posted in October. I originally wrote this post in early July, and I put it on hold to wait for the official blog post to be done so I could include a link. It’s been languishing in my drafts ever since. Life comes at you fast…

[2] It’s worth emphasizing that this tool and this post apply only to dedicated Power BI capacity. If you are using shared capacity, you should not use this tool.

Power BIte: Creating dataflows by attaching external CDM folders

This week’s Power BIte is the fourth and final entry in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.

When creating a dataflow by attaching an external CDM folder, the dataflow will have the following characteristics:

  • Data ingress path: Ingress via Azure Data Factory, Databricks, or whatever Azure service or app has created the CDM folder.
  • Data location: Data stored in ADLSg2 in the CDM folder created by the data ingress process.
  • Data refresh: The data is refreshed based on the execution schedule and properties of the data ingress process, not by any setting in Power BI.

The key to this scenario is the CDM folder storage format. CDM folders provide a simple and open way to persist data in a data lake. Because CDM folders are implemented using CSV data files and JSON metadata, any application can read from and write to CDM folders. This includes multiple Azure services that have libraries for reading and writing CDM folders and 3rd party data tools like Informatica that have implemented their own CDM folder connectors.
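As a hedged illustration of how open the format is, here’s a Power Query sketch that reads an entity’s CSV partitions straight out of a CDM folder in ADLSg2. The storage account, file system, and entity path are made up, and note that the column names for the data come from the entity definition in model.json rather than from the CSV files themselves.

```
let
    // Hypothetical ADLSg2 account, file system, and CDM folder entity path
    Files = AzureStorage.DataLake(
        "https://contosolake.dfs.core.windows.net/cdm/WideWorldImporters/Orders"),
    // An entity's data is stored as one or more CSV partition files
    CsvFiles = Table.SelectRows(Files, each Text.EndsWith([Name], ".csv")),
    // CDM folder CSVs have no header row; column names are defined in model.json
    Combined = Table.Combine(
        List.Transform(CsvFiles[Content], each Csv.Document(_, [Delimiter = ","])))
in
    Combined
```

Any other tool that can read CSV and JSON from a data lake can follow the same pattern, which is what makes the CDM folder format a useful interchange point between services.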

CDM folders enable scenarios like this one, which is implemented in a sample and tutorial published on GitHub by the Azure data team:

  • Create a Power BI dataflow by ingesting order data from the Wide World Importers sample database and save it as a CDM folder
  • Use an Azure Databricks notebook that prepares and cleanses the data in the CDM folder, and then writes the updated data to a new CDM folder in ADLS Gen2
  • Attach the CDM folder created by Databricks as an external dataflow in Power BI[2]
  • Use Azure Machine Learning to train and publish a model using data from the CDM folder
  • Use an Azure Data Factory pipeline to load data from the CDM folder into staging tables in Azure SQL Data Warehouse and then invoke stored procedures that transform the data into a dimensional model
  • Use Azure Data Factory to orchestrate the overall process and monitor execution

That’s it for this mini-series!

If all this information still doesn’t make sense, now is the time to ask questions.


[1] New videos every Monday morning!

[2] I added this bullet to the list because it fits in with the rest of the post – the other bullets are copied from the sample description.

Where has the time gone?

My last post was apparently my 100th post since BI Polar kicked off last October.


That’s an average of right around two posts per week, although my actual writing output has been much less even and predictable than this number might suggest.


After 100 posts and a little over one year, where should BI Polar go?

What are the topics you would like to see emphasized in the next year and the next hundred posts?

For the past month or so I’ve been sticking to a three-posts-per-week schedule, but I don’t know how sustainable that is. I’m thinking about switching to a Monday-Wednesday schedule, with one post each Monday to accompany the week’s new YouTube video, plus one additional post each week. This feels like a much more reasonable long-term plan.

So… what topics or themes are you interested in? What would you like to see more of, based on what you’ve seen over the last year? I can’t promise I’ll do what you want, but I can promise that I’ll read every comment, and I expect I’ll be inspired by whatever ideas you have…

Quick Tip: Working with dataflow-created CDM folders in ADLSg2

If you’re using your own organizational Azure Data Lake Storage Gen2 account for Power BI dataflows, you can use the CDM folders that Power BI creates as a data source for other efforts, including data science with tools like Azure Machine Learning and Azure Databricks.

Image by Arek Socha from Pixabay
A world of possibilities appears before you…

This capability has been in preview since early this year, so it’s not really new, but there are enough pieces involved that it may not be obvious how to begin – and I continue to see enough questions about this topic that another blog post seemed warranted.

The key point is that because dataflows are writing data to ADLSg2 in CDM folder format, Azure Machine Learning and Azure Databricks can both read the data using the metadata in the model.json file.

This JSON file serves as the “endpoint” for the data in the CDM folder; it’s a single resource that you can connect to without having to worry about the complexities of the various subfolders and files that the CDM folder contains.
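If you want to peek at that endpoint before wiring up Azure Machine Learning or Databricks, you can read it with any JSON-aware tool. Here’s a hedged Power Query sketch that lists the entities in a dataflow-created CDM folder along with the locations of their data partitions; the storage account, file system, and folder path are placeholders, and the sketch assumes each entity is a local entity with partitions.

```
let
    // Hypothetical path to a dataflow-created CDM folder in ADLSg2
    Files = AzureStorage.DataLake(
        "https://contosolake.dfs.core.windows.net/powerbi/Sales/SalesDataflow"),
    // The model.json file describes the entities and their data partitions
    ModelFile = Table.SelectRows(Files, each [Name] = "model.json"),
    Model = Json.Document(ModelFile{0}[Content]),
    // Assumes each entity is a local entity with data partitions
    Entities = Table.FromRecords(
        List.Transform(
            Model[entities],
            (e) => [
                Entity = e[name],
                Partitions = List.Transform(e[partitions], each _[location])
            ]))
in
    Entities
```

Azure Machine Learning and Azure Databricks do essentially the same thing through their CDM folder libraries: read the metadata from model.json, then read the partition files it points to.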

This tutorial is probably the best place to start if you want to know more[1]. It includes directions and sample code for creating and consuming CDM folders from a variety of different Azure services – and Power BI dataflows. If you’re one of the people who has recently asked about this, please go through this tutorial as your next step!


[1] It’s the best resource I’m aware of  – if you find a better one, please let me know!

Power BIte: Creating dataflows by importing model.json

This week’s Power BIte is the third in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.

When creating a dataflow by importing a model.json file previously exported from Power BI, the dataflow will have the following characteristics:

  • Data ingress path: Ingress via the mashup engine hosted in the Power BI service.
  • Data location: Data stored in the CDM folder defined for the newly created dataflow.
  • Data refresh: The dataflow is refreshed based on the schedule and policies defined in the workspace.

Let’s look at the dataflow’s model.json metadata to see some of the details.


At the top of the file we can see the dataflow name on line 2…

…and that’s pretty much all that’s important here. The rest of the model.json file will exactly match what was exported from the Power BI portal, and will look a lot like this or this. Boom.

For a little more detail (and more pictures, in case you don’t want to watch a four-minute video), check out this post from last month, when this capability was introduced.

If this information doesn’t make sense yet, please hold on. We still have one more incoming Power BIte in this series, and then we’ll have the big picture view.

I guarantee[3] it will make as much sense as anything on this blog.


[1] New videos every Monday morning![2]

[2] Did you notice that I just copied the previous post and made some small edits to it? That seemed very appropriate given the topic…

[3] Or your money back.

Diversity of Perspective

I blogged last October[0] about the challenges I faced when trying to use the new Html.Table function in Power Query. A key part of my challenge was the gap between my perspective, and the perspective of the team shipping and documenting this feature.

At first I chalked this up to an old dog[1] trying to learn a new trick[2], but I’ve been thinking about it since then, and I’m not sure that this is the case. I think it may have been the result of differing perspectives – and differing expectations resulting from those perspectives.

Image by S. Hermann & F. Richter from Pixabay
This may or may not be a photo of Matthew

My perspective is that, as a data integration tool, Power Query will work the way my ~20 years as a data professional have trained me to expect data integration tools to work. If a query language or formula language or expression language is required to access a specific source, I expect that language to be identified and documented in the tool.

The Power Query team, on the other hand, may have had a different perspective here. I haven’t explicitly asked them[3], but I suspect that their perspective is that it’s 2018, and anyone working with HTML data already knows what CSS selectors are, and either knows how to use them, or where to look to learn enough to use them.
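To make the gap concrete, here’s a minimal sketch of the kind of usage the documentation assumed was self-evident. The HTML fragment is made up; the important point is that the column definitions and the RowSelector option are all CSS selectors.

```
let
    // A made-up HTML fragment standing in for a real web page
    SampleHtml =
        "<ul>" &
        "<li class='post'><a href='/first-post'>First post</a></li>" &
        "<li class='post'><a href='/second-post'>Second post</a></li>" &
        "</ul>",
    // Each column is a {name, CSS selector} pair; the optional third element
    // extracts something other than the element's text (here, the href attribute)
    Posts = Html.Table(
        SampleHtml,
        {
            {"Title", "a"},
            {"Link", "a", each [Attributes][href]}
        },
        [RowSelector = "li.post"])
in
    Posts
```

If you already know CSS selectors, this is obvious; if you don’t, nothing in the function signature tells you what those strings are supposed to be.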

I don’t know which perspective is more valid. Part of me believes that mine is, and bemoans the time I spent struggling to achieve a simple goal, because the documentation didn’t connect the dots for me. But I also note that no one – not in comments here, not on Twitter – has said that they were similarly challenged.

But I can say this: A difference in perspective meant that what was delivered wasn’t what was needed, at least by one user.

Another example of this type of mismatch is one I see too often[4] at conferences: Microsoft presenters using Microsoft’s specialized vocabulary when speaking with non-Microsoft audiences. This typically takes the form of using internal code names and acronyms rather than official product names – if you’ve been to more than a handful of Microsoft events you’ve probably seen this yourself. I think the worst example I’ve seen was when a presenter mentioned that an upcoming feature was coming “in the scandium time frame.”[5]

Every culture – whether it’s centered in a geographical region, a profession, a religion, or a large corporation – has a specialized internal nomenclature. It enables members to communicate more efficiently. This isn’t a bad thing – it’s natural and good, and helps teams and groups deliver on their goals and priorities.

But problems can and do arise when one party doesn’t take the other party’s background into consideration. This is where having a diverse team can help.

When a team has a diverse makeup, it makes it more likely that potential problems will be prevented before they need to be identified, and identified before they need to be fixed.

If you want to be more efficient and to produce products and services (and documentation!) that deliver on your customers’ needs the first time, every time, by default, your team makeup should reflect the customers who use your product. If you look around and everyone on your team looks the same, this should be a warning sign that customers who don’t look like you probably don’t have the same experience that you do.

And if you don’t see that as a problem, you should probably look elsewhere for your problem. Try looking in the mirror.

Update: Two days after this post was published, David Heinemeier Hansson posted a blistering example of why diversity is so important, using his wife’s experience with Apple’s new credit card to drive the point home. I strongly recommend reading the whole thread.


[0] I started writing this post in November 2018, and it’s been languishing in my drafts folder ever since. I’m making an effort to clean up my drafts by the end of the year, so hopefully this one will actually see the light of day before it’s 2020. Fingers crossed…

[1] Me.

[2] CSS selectors.

[3] I feel like I’m enough of a problem child most days, so I try not to bother them unless it’s really necessary.

[4] Although thankfully not nearly as often as I used to.

[5] If you know what this means, you work on the Azure team[6]. Sadly, the people in the audience did not work on the Azure team. Thankfully, someone in the audience stood up and asked for clarification.

[6] When I worked on the Azure team I still didn’t know. I was constantly asking for clarification in meeting after meeting and email after email. Maybe I am just slow…