Analyzing Plex Media Server Data using Power BI

Around a year ago I left Spotify and started using my locally hosted Plex Media Server to feed my daily music consumption. Since I spent much of the last 30+ years building an extensive CD and MP3 collection, the transition was surprisingly smooth, and the excellent Plexamp client app feels like an upgrade over Spotify in many ways.

Now it’s time to start digging into the data that my year of music listening has produced.

I blogged last year about how I report on my Spotify listening history. It’s a bit of a hack because Spotify doesn’t give you a database or API for your full listening history, so you need to make a GDPR request and wait a few weeks to get a JSON archive… but I made it work. Plex uses a SQLite database to keep track of its data and metadata, and it’s relatively easy to start using it as a data source for your own Power BI reports. Since this is a largely undocumented transactional database[1], I’m still figuring out some important details and locating key data points, but I wanted to share what I’ve learned.

Here’s how to get started.

Installing Plex, setting up libraries, listening

For this post to be useful to you, you need to run a Plex media server and use it to listen to your music collection[2]. Head over to plex.tv, download the software, install it, set up your libraries, and start listening to the music you love.

This is all well-documented by the fine folks at Plex, so I won’t share any additional details here.

Download a database backup

As mentioned above, Plex uses SQLite as its database, and the Plex web UI provides an option for downloading a copy of the database file. Although it may be possible to report directly on the “live” database, I’m erring on the side of caution and using Power BI to connect to backup files.

Here’s how to download the database:

  1. Open the Plex web UI at https://app.plex.tv/ and sign in to your server
  2. In the upper right, click on the “wrench” icon to open the settings
  3. In the lower right, select “Troubleshooting” in the “Manage” category of the settings options
  4. On the troubleshooting settings page, click the “Download Database” button

After a few seconds your browser will download a zip file containing a SQLite database file with a long system-generated file name.

Copy this file to a folder and give it a more useful name. For my project, the database file is \Plex DB\plex.db.

SQLite driver and ODBC DSN

Power BI doesn’t have a native SQLite connector, but SQLite does have ODBC drivers. It looks like there are multiple options – I used this one mainly because it was the first one I found.

Once you’ve downloaded and installed the ODBC driver, create a DSN that points to the database file you downloaded earlier. The connection string is just the file path.

With the DSN created, you can start working in Power BI Desktop. Create a new PBIX, select ODBC from the Get Data dialog, and select the DSN.

Click OK and you’re ready to go.

Key Plex database objects

As mentioned earlier, the Plex database is largely undocumented. It has a somewhat normalized schema, which is slowing my exploration and discovery. When a table is mostly integer IDs and obscurely-named values, exploration can feel more like archaeology.

With that said, there are two tables that are probably going to be most useful: metadata_items and metadata_item_views.

The metadata_items table contains one row for each item (track, album, artist, episode, season, show, etc.) in your Plex media library. This table implements an implied three-tier hierarchy by including an id column and a parent_id column that define a relationship[3] between metadata items and the items that contain them.
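
To make that hierarchy concrete, here’s a minimal Power Query sketch of the self-join. It assumes an ODBC DSN named “Plex” (use whatever name you gave the DSN you created earlier) and a title column alongside the id and parent_id columns described above – the column names are my working assumptions from exploring the database, so adjust as needed.

let
    // "dsn=Plex" is a hypothetical connection string – point it at your own DSN
    // The title column name is an assumption based on my own exploration
    Items = Odbc.Query("dsn=Plex", "SELECT id, parent_id, title FROM metadata_items"),
    // Join each item to the item that contains it (track -> album, album -> artist, etc.)
    MergedParent = Table.NestedJoin(Items, {"parent_id"}, Items, {"id"}, "Parent", JoinKind.LeftOuter),
    ExpandedParent = Table.ExpandTableColumn(MergedParent, "Parent", {"title"}, {"Parent Title"})
in
    ExpandedParent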

The metadata_items table also includes vital data such as the title (track name, album name, artist name, etc.) of each item, as well as its index (track number on the album, episode number in the season, and so on). Not all fields are used for all metadata item types, so you’ll see a lot more nulls than you might like.

The metadata_item_views table contains one row for each time a user has watched a video, listened to a track, and so on. This table includes the titles for the viewed item, as well as its parent and grandparent items, so you can easily get the track, album, and artist without needing to join to other tables. You can join to the metadata_items table if you need additional details like the album release year or the record label.

The metadata_item_views table includes a viewed_at column to track when a given item was viewed. This column is stored as a Unix timestamp, so you’ll need to convert it to datetime before you can start working with it. Plex tracks times using UTC, so if you want to work with data in a different time zone you’ll need to handle that in your PBIX.
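
Here’s a minimal sketch of that conversion in Power Query, again assuming a hypothetical DSN named “Plex” and the viewed_at column name described above. The local-time step is just one option – it converts to the time zone of the machine performing the refresh, so anything fancier (fixed offsets, DST rules) would need more care.

let
    // "dsn=Plex" is a hypothetical connection string – point it at your own DSN
    Views = Odbc.Query("dsn=Plex", "SELECT * FROM metadata_item_views"),
    // viewed_at is a Unix timestamp: seconds since 1970-01-01 00:00:00 UTC
    AddViewedAtUtc = Table.AddColumn(
        Views,
        "Viewed At (UTC)",
        each #datetime(1970, 1, 1, 0, 0, 0) + #duration(0, 0, 0, [viewed_at]),
        type datetime
    ),
    // One option for local times: tag the value as UTC, then convert to the refresh machine's local zone
    AddViewedAtLocal = Table.AddColumn(
        AddViewedAtUtc,
        "Viewed At (Local)",
        each DateTimeZone.ToLocal(DateTime.AddZone([#"Viewed At (UTC)"], 0)),
        type datetimezone
    )
in
    AddViewedAtLocal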

Refreshing the data

Since your PBIX is connected to a database backup, it won’t automatically show your current library contents and listening history. If you want more current data, just download a new database backup and replace your plex.db (or whatever you named it) file with the new one. The next time you refresh your report, you’ll have new data.

I don’t have a use case for “real time” Plex reporting, so this approach works for me. It’s certainly easier and more timely than the hoops I had to jump through to “refresh” my Spotify data.

Next steps

I’m still just starting to get my hands dirty with this fun data source, so my next steps are mainly to keep exploring. Using my Spotify listening report as a target, I’m planning to duplicate that report and its insights with Plex data. Ideally I’ll have one report where I can see my legacy listening history in Spotify and my ongoing listening history in Plex… but that might be a few months away.

If you’re using Plex and have been thinking about exploring your data – or if you’re already doing this – I’d love to hear from you in the comments about your experiences.


 

[1] To the best of my knowledge after a year or more of occasionally googling to answer my questions, there is no official documentation for the database. The Plex community support forums have a few bits and pieces here and there, but most of what I’ve included in this post is the result of my own exploration. Honestly, it’s a real joy to have a side project with this type of work involved.

[2] Plex also has great features for managing and streaming TV shows, movies, and photos – it’s not just about music. This blog post is going to focus on the music side of things because that’s what’s most interesting to me today, but there’s a lot more to love in Plex.

[3] It’s worth noting that there are no “real” FK > PK relationships in the database, or if they are they’re not exposed through the client tools I’ve used. To figure out where the data relationships exist you need to explore the data.

Dataflows with benefits

Power BI datamarts are like dataflows with benefits.

In case you missed the announcements this week from Microsoft’s Build conference, datamarts are a major new capability coming to Power BI, now available in public preview. There are also preview docs available, but the best datamarts content I’ve seen so far is this fantastic video from Adam and Patrick at Guy in a Cube[1].

I’m going to assume that since you’re reading this blog, you didn’t miss the announcement. I’m also going to assume that some small part of you is asking “what the heck are datamarts, anyway?”

For me, datamarts are like dataflows with benefits[2].

It should come as no surprise to any regular reader of this blog that I’m a big fan of dataflows in Power BI. Dataflows let Power BI users build reusable data tables in a workspace using Power Query Online, and share them with other users for reuse in other workspaces. What’s not to love?[3]

If you’ve spent a lot of time working with dataflows, you can probably think of a few things you wished dataflows did differently, or better. These are some of the most common requests I’ve heard in customer conversations over the years:

  • “I wish I could define row-level security (RLS) on my dataflows so I could share them securely with more users.”
  • “I wish my users could connect to dataflows using SQL, because analysts in my org all know SQL.”
  • “I wish <operation that would benefit from a compute engine> performed better in dataflows.”[4]

You probably see where I’m going here. Datamarts in Power BI deliver solutions to these problems. Datamarts in Power BI build on the strength of dataflows, while enabling common scenarios where dataflows did not offer an obvious or simple solution.

Almost like datamarts were dataflows with benefits.

Datamarts, like dataflows, provide a Power Query Online experience to perform data preparation. Datamarts, like dataflows, allow users to create reusable data tables in a Power BI workspace.

But datamarts, unlike dataflows, store their data in a managed SQL database. Dataflows use CDM folders for their storage, which means CSV files with some extra metadata. Although this file-based approach provides some benefits for reuse in integration scenarios, it can also be a challenge for simply connecting to the data in tools other than Power BI.

With datamarts, the data produced by executing your Power Query queries[5] is loaded into tables in an Azure SQL database that’s managed by the Power BI service. Having data in a full-featured database, as opposed to folders full of text files, makes a lot of difference.

  • Datamarts support row-level security. Using a simple in-browser user experience, a datamart author can define RLS rules that restrict which users can see which data when connecting to the datamart.
  • Anyone with the right permissions can query[6] the datamart’s underlying SQL database using any SQL query tool. This means that authorized users can perform exploratory data analysis in Excel, SQL Server Management Studio, Azure Data Studio, or whatever tool they’re most comfortable using. It’s a database.
  • Merges and joins and other operations common to building a star schema perform much better, because these are things that SQL has been awesome at for decades.[7]

Is it just me, or is this sounding a lot like dataflows with benefits?

But wait, you might say, what about measures and that automatically created dataset thingie and all the other stuff they showed at Build, and which I don’t really understand yet? What about deciding when I should use a datamart over a dataflow? What about the expanded web authoring experience, and querying the datamart directly from within the browser??

Yeah, I’m not going to cover any of that in this post. The post is already too long, and I didn’t really have time to write this one as it is.[8] But I think it’s those things that make the product team scowl when I describe datamarts as “dataflows with benefits” because they’re really a lot more. But if you think about dataflows with benefits, you’re probably off to a good start, and heading in the right direction.

I’m going to end on this note: all of my personal Power BI projects going forward will be built using datamarts. Datamarts do everything I need dataflows to do, and for me they do it better. I’ll still always love dataflows, and there will likely still be places where dataflows make sense, but for me…


 

[1] And I’m not only saying that because Roche’s Maxim makes an awkward surprise appearance.

[2] In case the “with benefits” descriptor isn’t meaningful to you, I’m attempting to make a play on the phrase “friends with benefits.” You can check out the mildly NSFW Urban Dictionary definition if you really feel like you need to. Honestly, I wouldn’t recommend it, but you do you.

[3] I literally got goosebumps typing that “reusable data tables” sentence. Even after all these years, having the promise of self-service data preparation and reuse realized in Power BI still feels a little bit like magic.

[4] Yes, this typically leads into a conversation about the dataflows “enhanced compute engine” but since using that engine requires following specific design patterns, this isn’t always as straightforward a conversation as you might want it to be.

[5] Power Queries? I always struggle with what to use for the plural noun for the queries that you build in Power Query. Maybe I should ask Alex.

[6] I use the term “query” here to mean SELECT statements. Only read operations are permitted, so if you try to UPDATE or whatever, you’ll get an error. Use Power Query to transform the data as you load it, like you would with a traditional data mart or data warehouse.

[7] I don’t have enough hands-on with the datamarts preview at this point to say much more than “faster” but in my non-scientific testing queries that would take “I guess I’ll make another pot of coffee” in dataflows take “oh it’s done already” in datamarts.

[8] If you had any idea how mean my calendar is this year, you’d weep. I’m weeping right now.

Into the Date Dimension!

Late last year I was spending some vacation days playing with one of my favorite personal data sets[1], and I realized I needed a better date dimension. So I asked for one.

A bunch of folks from the Power BI community shared their favorite Power Query date dimension queries[2] – you can look at the Twitter thread for links if you’re interested. With their inspiration (and code!) I picked what I liked, discarded what I didn’t, and came up with this:

let
// To be turned into parameter
startDate = #date(2017,01,01) as date, 
// To be turned into parameter
endDate = #date(2021,12,31) as date, 
// To be turned into parameter
culture = "en-US" as text,

// Beginning of actual query 👇 
Source = List.Dates(startDate, Duration.Days(Duration.From(endDate - startDate)) + 1, #duration(1, 0, 0, 0)),

TableFromList = Table.FromList(Source, Splitter.SplitByNothing()),
ChangedType = Table.TransformColumnTypes(TableFromList,{{"Column1", type date}}),
RenamedColumns = Table.RenameColumns(ChangedType,{{"Column1", "Date"}}),
InsertYear = Table.AddColumn(RenamedColumns, "Year", each Date.Year([Date]), Int64.Type),
InsertQuarterNumber = Table.AddColumn(InsertYear, "Quarter Number", each Date.QuarterOfYear([Date])),
InsertQuarterName = Table.AddColumn(InsertQuarterNumber, "Quarter Name", each "Q" & Number.ToText([Quarter Number])),
InsertMonth = Table.AddColumn(InsertQuarterName, "Month Number", each Date.Month([Date]), Int64.Type),
InsertStartOfMonth = Table.AddColumn(InsertMonth, "Start Of Month", each Date.StartOfMonth([Date]), type date),
InsertEndOfMonth = Table.AddColumn(InsertStartOfMonth, "End Of Month", each Date.EndOfMonth([Date]), type date),
InsertDayOfMonth = Table.AddColumn(InsertEndOfMonth, "Day Of Month", each Date.Day([Date])),
InsertDayInt = Table.AddColumn(InsertDayOfMonth, "DateInt", each [Year]*10000 + [Month Number]*100 + [Day Of Month]),
InsertMonthName = Table.AddColumn(InsertDayInt, "Month Name", each Date.ToText([Date], "MMMM", culture), type text),
InsertShortMonthName = Table.AddColumn(InsertMonthName, "Month Name Short", each Date.ToText([Date], "MMM", culture), type text),
InsertShortMonthNameYear = Table.AddColumn(InsertShortMonthName, "Month Short and Year", each [Month Name Short] & " " & Number.ToText([Year]), type text),
InsertCalendarMonth = Table.AddColumn(InsertShortMonthNameYear, "Month and Year", each [Month Name]& " " & Number.ToText([Year]),type text),
InsertCalendarQtr = Table.AddColumn(InsertCalendarMonth, "Quarter and Year", each "Q" & Number.ToText([Quarter Number]) & " " & Number.ToText([Year]), type text),
InsertDayWeek = Table.AddColumn(InsertCalendarQtr, "Day of Week", each Date.DayOfWeek([Date]) + 1),
InsertDayName = Table.AddColumn(InsertDayWeek, "Day Name", each Date.ToText([Date], "dddd", culture), type text),
InsertShortDayName = Table.AddColumn(InsertDayName, "Day Name Short", each Date.ToText([Date], "ddd", culture), type text),
InsertEndOfWeek = Table.AddColumn(InsertShortDayName , "End Of Week", each Date.EndOfWeek([Date]), type date),
InsertedStartOfWeek = Table.AddColumn(InsertEndOfWeek, "Start of Week", each Date.StartOfWeek([Date]), type date),
InsertWeekNumber= Table.AddColumn(InsertedStartOfWeek, "Week of Year Number", each Date.WeekOfYear([Date])),
InsertMonthWeekNumber= Table.AddColumn(InsertWeekNumber, "Week of Month Number", each Date.WeekOfMonth([Date])),
InsertYearMonthNumber = Table.AddColumn(InsertMonthWeekNumber,"Year and Month", each [Year] * 100 + [Month Number]),
InsertYearAndQuarterNumber = Table.AddColumn(InsertYearMonthNumber, "Year and Quarter", each [Year] * 100 + [Quarter Number]),
InsertYearAndWeekNumber = Table.AddColumn(InsertYearAndQuarterNumber, "Year and Week", each [Year] * 100 + [Week of Year Number]),
InsertDecadeName = Table.AddColumn(InsertYearAndWeekNumber, "Decade Name", each Text.Range(Text.From([Year]), 0, 3) & "0s"),
InsertDecadeNumber = Table.AddColumn(InsertDecadeName, "Decade Number", each Text.Range(Text.From([Year]), 0, 3) & "0"),
InsertStartOfQuarter = Table.AddColumn(InsertDecadeNumber, "Start of Quarter", each Date.StartOfQuarter([Date]), type date),
InsertEndOfQuarter = Table.AddColumn(InsertStartOfQuarter, "End of Quarter", each Date.EndOfQuarter([Date]), type date),
InsertDayOfYear = Table.AddColumn(InsertEndOfQuarter, "Day of Year", each Date.DayOfYear([Date]), Int64.Type),
InsertIsWeekday = Table.AddColumn(InsertDayOfYear, "Is Weekday", each if [Day of Week] = 1 or [Day of Week] = 7 then 0 else 1),
InsertIsWeekend = Table.AddColumn(InsertIsWeekday, "Is Weekend", each if [Day of Week] = 1 or [Day of Week] = 7 then 1 else 0),
#"Reordered Columns" = Table.ReorderColumns(InsertIsWeekend,{"DateInt", "Date", "Year", "Year and Quarter", "Year and Month", "Year and Week", "Quarter Name", "Quarter Number", "Quarter and Year", "Start of Quarter", "End of Quarter", "Month Name", "Month Name Short", "Month Number", "Month and Year", "Month Short and Year", "Start Of Month", "End Of Month", "Week of Year Number", "Week of Month Number", "Start of Week", "End Of Week", "Day of Year", "Day Of Month", "Day of Week", "Day Name", "Day Name Short", "Decade Name", "Decade Number", "Is Weekday", "Is Weekend"}),
#"Changed Type" = Table.TransformColumnTypes(#"Reordered Columns",{{"DateInt", Int64.Type}, {"Year", Int64.Type}, {"Year and Quarter", Int64.Type}, {"Year and Month", Int64.Type}, {"Year and Week", Int64.Type}, {"Quarter Number", Int64.Type}, {"Month Number", Int64.Type}, {"Week of Year Number", Int64.Type}, {"Week of Month Number", Int64.Type}, {"Day of Year", Int64.Type}, {"Day Of Month", Int64.Type}, {"Day of Week", Int64.Type}, {"Decade Number", Int64.Type}, {"Is Weekday", Int64.Type}, {"Is Weekend", Int64.Type}, {"Quarter Name", type text}, {"Decade Name", type text}})
in
#"Changed Type"

In case you can’t visually parse that unfortunate wall of text, here’s what it looks like in my data model:

As I keep playing with my pet project I’ll probably keep tweaking the date dimension query to add, remove, or refine it as needed. When I’m done I may turn it into a dataflow, or I may make it into a function that takes my project’s min and max dates as input parameters. Who knows?
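
In case the function route sounds appealing, here’s a minimal sketch of what that might look like. It includes only a handful of the columns from the query above, but the remaining steps would slot in the same way:

(startDate as date, endDate as date, optional culture as nullable text) as table =>
let
    // Fall back to en-US when no culture is passed in
    cultureToUse = if culture = null then "en-US" else culture,
    dayCount = Duration.Days(endDate - startDate) + 1,
    Source = List.Dates(startDate, dayCount, #duration(1, 0, 0, 0)),
    TableFromList = Table.FromList(Source, Splitter.SplitByNothing(), {"Date"}),
    ChangedType = Table.TransformColumnTypes(TableFromList, {{"Date", type date}}),
    InsertYear = Table.AddColumn(ChangedType, "Year", each Date.Year([Date]), Int64.Type),
    InsertMonthName = Table.AddColumn(InsertYear, "Month Name", each Date.ToText([Date], "MMMM", cultureToUse), type text)
in
    InsertMonthName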

What I know for sure is that the next time I’m looking for a date dimension query, I’ll look here.

Update: I’ve published the pet project where this date dimension ended up – you can find the details and download the PBIT file if you’re interested.

This PBIT also includes a tweak to the query shared above to dynamically set the start and end dates included in the data source. If you’re interested in this pattern and don’t want to bother downloading the PBIT, here’s what’s different.

// Update as necessary to reference the appropriate date column in your data
startDate = Date.StartOfYear(Date.From(List.Min(#"Spotify Extended History"[ts]))) as date,
// Update as necessary to reference the appropriate date column in your data
endDate = Date.EndOfYear(Date.From(List.Max(#"Spotify Extended History"[ts]))) as date,
// To be turned into parameter as needed
culture = "en-US" as text,

 


[1] I made a GDPR request of Spotify, asking for all of the data they had related to my account. It took them about a month to prepare that data and make it available for me to download, but it was worth the wait. You can request your own data too: https://www.spotify.com/us/account/privacy/

[2] And a bunch of other folks shared DAX and SQL solutions.

Roche’s Maxim of Data Transformation

According to the internet, a maxim is a succinct formulation of a fundamental principle, general truth, or rule of conduct.[1] Maxims tend to relate to common situations and topics that are understandable by a broad range of people.

Topics like data transformation.

Roche’s Maxim of Data Transformation[2] states:

Data should be transformed as far upstream as possible, and as far downstream as necessary.

In this context “upstream” means closer to where the data is originally produced, and “downstream” means closer to where the data is consumed.

By transforming data closer to its ultimate source, costs can be reduced, and the value added through data transformation can be applied to a greater range of uses. The farther downstream a given transformation is applied, the more expensive it tends to be – often because operations are performed more frequently – and the smaller the scope of potential value through reuse.

I’ve been using this guideline for many years, but I only started recognizing it as a maxim in the past year or so. The more I work with enterprise Power BI customers the more I realize how true and how important it is – and how many common problems could be avoided if more people thought about it when building data solutions.

Please note that this maxim is generalizable to data solutions implemented using any tools or technology. The examples below focus on Power BI because that’s where I spend my days, but these principles apply to every data platform I have used or seen used.

In day-to-day Power BI conversations, perhaps the most common question to which Roche’s Maxim applies is about where to implement a given unit of logic: “Should I do this in DAX or in Power Query?”

Short answer: Do it in Power Query.

If you’re ever faced with this question, always default to Power Query if Power Query is capable of doing what you need – Power Query is farther upstream. Performing data transformation in Power Query ensures that when your dataset is refreshed the data is loaded into the data model in the shape it needs to be in. Your report logic will be simplified and thus easier to maintain, and will likely perform better[3] because the Vertipaq engine will need to do less work as users interact with the report.[4]
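
As a simple illustration, here’s a sketch of the Power Query version of a column you might otherwise be tempted to add as a DAX calculated column. The Sales query and its Quantity and Unit Price columns are hypothetical – the point is the pattern, not the names:

let
    // Hypothetical reference to another query named "Sales" in the same PBIX
    Source = Sales,
    // The column is computed once at refresh time, upstream of the data model
    AddSalesAmount = Table.AddColumn(Source, "Sales Amount", each [Quantity] * [#"Unit Price"], type number)
in
    AddSalesAmount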

But what if you need data transformation logic that depends on the context of the current user interacting with the report – things like slicers and cross-filtering? This is the perfect job for a DAX measure, because Power Query doesn’t have access to the report context. Implementing this logic farther downstream in DAX makes sense because it’s necessary.

Another common question to which Roche’s Maxim applies is also about where to implement a given unit of logic: “Should I do this in Power BI or in the data warehouse?”

Short answer: Do it in the data warehouse.

If you’re ever faced with this question, always default to transforming the data into its desired shape when loading it into the data warehouse – the data warehouse is farther upstream. Performing data transformation when loading the data warehouse ensures that any analytics solution that uses the data has ready access to what it needs – and that every solution downstream of the warehouse is using a consistent version of the data.

From a performance perspective, it is always better to perform a given data transformation as few times as possible, and it is best to not need to transform data at all.[5] Data transformation is a costly operation – transforming data once when loading into a common location like a data warehouse, data mart, or data lake, is inherently less costly than transforming it once for every report, app, or solution that uses that common location.

A much less common question to which Roche’s Maxim applies might be “What about that whole ‘not transforming at all’ pattern you mentioned a few paragraphs back – how exactly does that dark magic work?”

Short answer: Have the data already available in the format you need it to be in.

That short answer isn’t particularly useful, so here are two brief stories to illustrate what I mean.

Many years ago I was working with an oil & gas company in an engagement related to master data management. This company had a fundamental data problem: the equipment on their drilling platforms around the world was not standardized, and different meters reported the same production data differently. These differences in measurement meant that all downstream reporting and data processing could only take place using the least common denominator across their global set of meters… and this was no longer good enough. To solve the problem, they were standardizing on new meters everywhere, and updating their data estate to take advantage of the new hardware. My jaw dropped when I learned that the cost of upgrading was upwards of one hundred million dollars… which was a lot of money at the time.

Much more recently I was working with a retail company with over 5,000 locations across North America. They had similar challenges with similar root causes: their stores did not have consistent point of sale (POS) hardware[6], which meant that different stores produced different data and produced some common data at different grain, and analytics could only take place using the least common denominator data from all stores. Their solution was also similar: they upgraded all POS systems in all stores. I don’t have a dollar amount to put with this investment, but it was certainly significant – especially in an industry where margins are traditionally small and budgets traditionally very conservative.

Both of these stories illustrate organizations taking Roche’s Maxim to the extreme: they transformed their key data literally as far upstream as possible, by making the necessary changes to produce the data in its desired form.[7]

Each of these stories included both technical and non-technical factors. The technical factors revolve around data. The non-technical factors revolve around money. Each company looked at the cost and the benefit and decided that the benefit was greater. They implemented an upstream change that will benefit every downstream system and application, which will simplify their overall data estate, and which corrects a fundamental structural problem in their data supply chain that could only be mitigated, not corrected, by any downstream change.

There’s one additional part of Roche’s Maxim that’s worth elaborating on – what does “necessary” mean? This post has looked at multiple scenarios that emphasize the “as far upstream as possible” part of the maxim – what about the “as far downstream as necessary” part?

Some factors for pushing transformations downstream are technical, like the DAX context example above. Other technical factors might be the availability of data – you can’t produce a given output unless you have all necessary inputs. Others may be organizational – if data is produced by a 3rd party, your ability to apply transformations before a given point may be constrained more by a contract than by technology.

Still other factors may be situational and pragmatic – if team priorities and available resources prevent you from implementing a unit of data transformation logic in the data warehouse, it may be necessary to implement it in your Power BI solution in order to meet project deadlines and commitments.

These are probably the most frustrating types of “necessary” factors, but they’re also some of the most common. Sometimes you need to deliver a less-than-ideal solution and incur technical debt that you would prefer to avoid. The next time you find yourself in such a situation, keep this maxim in mind, and remember that even though it may be necessary to move that data transformation logic downstream today, tomorrow is another day, with different constraints and new opportunities.

Callout: This may be my maxim, but this isn’t the first blog post on the topic. Stuart Box from the UK data consultancy BurningSuit blogged back in March and got there first with this excellent article.


Update June 2021: The video from my June 11 DataMinutes presentation is now available, so if you prefer visual content, this video might be a good place to start.

 

Update May 2022: The fine folks at Greyskull Analytics have added a wonderful t-shirt to their online store. If you want to look cooler than I ever will, you might want to head on over to https://greyskullanalytics.com/ to order yours.

 

Update August 2022: There’s also a video from SQLBits, in case you want a slightly longer version presented to a live audience.

 

Update October 2022: I had the pleasure of visiting Patrick’s cube, and we recorded a video together. You should check it out.


[1] Or a men’s magazine. I really really wanted to use this more pop-culture meaning to make a “DQ” joke playing on the men’s magazine “GQ” but after watching this post languish in my drafts for many months and this joke not even beginning to cohere, I decided I should probably just let it go and move on.

But I did not let it go. Not really.

[2] If you think that sounds pretentious when you read it, imagine how it feels typing it in.

[3] The performance benefit here is not always obvious when working with smaller data volumes, but will become increasingly obvious as the data volume increases. And since the last thing you want to do in this situation is to retrofit your growing Power BI solution because you made poor decisions early on, why not refer to that maxim the next time you’re thinking about adding a calculated column?

[4] This post only spent a week or so in draft form, but during this week I watched an interesting work email conversation unfold. A Power BI customer was experiencing unexpected performance issues related to incremental refresh of a large dataset, and a DAX calculated column on a table with hundreds of millions of records was part of the scenario. The email thread was between members of the engineering and CAT teams, and a few points jumped out at me, including one CAT member observing “in my experience, calculated columns on large tables [can] increase processing times and also can greatly increase the time of doing a process recalc… it also depends on the complexity of the calculated column.”

I don’t have enough knowledge of the Vertipaq engine’s inner workings to jump into the conversation myself, but I did sip my coffee and smile to myself before moving on with my morning. I checked back in on the conversation later on, and saw that a Power BI group engineering manager (GEM) had shared this guidance, presented here with his approval:

“From a pure perf standpoint, its true that we can say:

  • The most efficient approach for a large fact table is to have all the columns be present in the source table (materialized views also might work), so that no extra processing is necessary during the import operation (either in Mashup or in DAX)
  • The next most efficient approach for a large fact table is usually going to be to have the computation be part of the M expression, so that it only needs to be evaluated for the rows in the partitions being processed
  • DAX calculated columns are a great option for flexibility and are particularly useful for dimension tables, but will be the least efficient compared to the above two options for large fact tables”

That sounds pretty familiar, doesn’t it? The GEM effectively summarized Roche’s Maxim, including specific guidance for the specific customer scenario. The details will differ from context to context, but I have never found a scenario to which the maxim did not apply.

Yes, this is a challenge for you to tell me where and how I’m wrong.

[5] Just as Sun Tzu said “To fight and conquer in all our battles is not supreme excellence; supreme excellence consists in breaking the enemy’s resistance without fighting,” supreme excellence in data transformation is not needing to transform the data at all.

[6] That’s “cash registers” to the less retail inclined readers.

[7] If you feel inclined to point out that in each of these stories there is additional data transformation taking place farther downstream, I won’t argue. You are almost certainly correct… but the Maxim still holds, as the key common transformations have been offloaded into the most upstream possible component in the data supply chain. Like a boss.[8]

[8] Like a supremely excellent[5] boss.

Deep thoughts on dataflows

As you may have noticed, life is complicated and keeps getting in the way of my plans to be a more active blogger and YouTuber[1]. I haven’t released a new dataflows video of my own in far too long[2], but my teammate Kasper is helping out by doing the hard work for me:

Last week I had an awesome conversation on “everything dataflows” with Kasper and the video is now available on his excellent Kasper On BI YouTube channel. In this hour-long video we talk about a lot of the “big picture” topics related to dataflows, including how dataflows fit into a larger data estate – and when you should use them, or avoid using them.

Please check it out, and let me know what you think!


[1] If this isn’t the name of a social media platform for potatoes, it should be.

[2] To add insult to the injury that is life in a global pandemic, my video editing PC died a few weeks ago. I now have it back up and running, but I lost all of my project templates and works in progress, which is likely to introduce more delays. FFS.

Will it fold? Look in Power Query Online!

Query folding is a key capability in Power Query. In short, query folding is the process by which the Power Query mashup engine takes the steps defined in the Power Query editor and translates them into processing to be performed by the data source from which the query is extracting data.

The use of query folding can produce significant performance improvements. Consider a query that includes a grouping step. Without query folding, all of the detailed data required must be extracted from the data source, and the aggregation performed by the mashup engine. With query folding, the mashup engine can add a GROUP BY clause[1] to the data source query it generates, and lets the data source perform the costly operation – only the aggregated data leaves the data source.
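
As a concrete sketch of what that looks like, here’s a Power Query query whose grouping step will typically fold back to the source as a GROUP BY. The server, database, table, and column names are placeholders:

let
    // Hypothetical SQL Server source – substitute your own server and database
    Source = Sql.Database("myserver.database.windows.net", "SalesDb"),
    FactSales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
    // This step folds: the source executes the GROUP BY and returns only the aggregated rows
    GroupedByProduct = Table.Group(
        FactSales,
        {"ProductKey"},
        {{"Total Quantity", each List.Sum([Quantity]), type number}}
    )
in
    GroupedByProduct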

Because of the performance benefit that query folding provides, experienced query authors are typically very careful to ensure that their queries take advantage of the capabilities of their data sources, and that they fold as many operations as possible. But for less experienced query authors, telling which steps will fold and which will not has not always been simple…

Until now.

This week the Power Query team announced the introduction of query folding indicators in Power Query Online. If you’re authoring a dataflow in Power BI or Power Apps, you will now see visual indicators to let you know which steps will fold, and which will not.

Each step will include an icon[2] that shows how that step will be handled. The announcement blog post goes into more details (because connectivity and data access are complicated topics, and because of the huge diversity in the data sources to which Power Query can connect) but the short version of the story is that life just got simpler for people who build dataflows.

After the last 12 months anything that makes my life simpler is a good thing.


[1] Or whatever the syntax is for the data source being used.

[2] Or is it a logo?

A hat full of dataflows knowledge

Life and work have been getting the best of me this month, and I haven’t found the time[1] to keep up on blogging and video now that my series on building a data culture has wrapped up. I’ve been watching the dataflows and Power Query teams releasing all sorts of exciting new capabilities, and realizing that I’m not going to be writing about them in a timely manner.

Thankfully Laura Graham-Brown is picking up the slack – and then some.

Laura is a Microsoft MVP whose “Hat Full of Data” blog has become one of my favorite morning reads, and whose YouTube channel seems to include all of the videos I’ve been thinking about making but never actually finding the time to make.

Like this one on the new Power Query Online diagram view that is now available in public preview for dataflows in Power BI:

If you’ve been waiting for new dataflows content, you should definitely head over to Laura’s blog today to check out the awesome work she’s been doing.

I hope to be writing more regularly in December after my work-related “crunch mode” has passed, but if 2020 has taught me anything[2] it’s that I have no idea what’s waiting around the corner. In the meantime, you should follow Laura, because she’s doing awesome work.


[1] Or the spare creative mental energies, which seem to be in sparser supply than spare minutes and hours.

[2] If 2020 has taught me anything, it’s that 2020 has taught me nothing.

Weekend reading: New dataflows docs

Important: This post was written and published in 2020, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be more of an historical record and less of a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

In case you didn’t have any plans for the weekend, the Power Query team has released a treasure trove of new documentation for dataflows in Power BI and Power Apps.

You can find the new docs here: https://docs.microsoft.com/en-us/power-query/dataflows/overview-dataflows-across-power-platform-dynamics-365

There are over 20 articles in this new resource, and the content they contain answers many of the questions I hear most frequently. For some reason WordPress isn’t letting me upload the screen shot I took of the content outline, so instead here is a picture of me[1] when I first learned about this new content:


Please check out this new information, and please submit feedback to the documentation team using the feedback form on each page. When you let the docs team know what you like, what you don’t like, or what is missing, they read every comment!


[1] Me in my avatar as a photogenic young woman.

Power BI and ACLs in ADLSg2

In addition to using Azure Data Lake Storage Gen2 as the location for Power BI dataflows data, Power BI can also use ADLSg2 as a data source. As organizations choose ADLSg2 as the storage location for more and more data, this capability is key to enabling analysts and self-service BI users to get value from the data in the lake.

Oh buoy, that is one big data lake!

But how do you do this in as secure a manner as possible, so that the right users have the minimum necessary permissions on the right data?

The short answer is that you let the data source handle secure access to the data it manages. ADLSg2 has a robust security model, which supports both Azure role-based access control (RBAC) and POSIX-like access control lists (ACLs)[1].

The longer answer is that this robust security model may make it more difficult to know how to set up permissions in the data lake to meet your analytics and security requirements.

Earlier this week I received a question from a customer on how to get Power BI to work with data in ADLSg2 that is secured using ACLs. I didn’t know the answer, but I knew who would know, and I looped in Ben Sack from the dataflows team. Ben answered the customer’s questions and unblocked their efforts, and he said that I could turn his answers into a blog post. Thank you, Ben![2]

Here’s what you should know:

1 – If you’re using ACLs, you must specify at least a filesystem name in the URL you load in the connector (or in any other client or API you use to access ADLS Gen2) – see the sketch after this list.

i.e. Path in Power BI Connector must at least be: https://storageaccountname.dfs.core.windows.net/FileSystemName/

2 – For every file whose contents you want to read, all parent folders and the filesystem must have the “x” ACL, and the file itself must have the “r” ACL.

i.e. if you want to access the file: https://StorageAccountName.dfs.core.windows.net/FileSystemName/SubFolder1/File1.csv

3 – For files you want to list, all parent folders and the filesystem must have the “x” ACL, and the immediate parent folder must also have the “r” ACL.

i.e. if you want to view and access the files in this subfolder: https://StorageAccountName.dfs.core.windows.net/FileSystemName/SubFolder1/

4 – Default ACLs are a great way to have ACLs propagate to child items, but they have to be set before creating subfolders and files – otherwise you need to explicitly set ACLs on each item.[3]

5 – If permission management is going to be dynamic, use groups as much as possible rather than assigning permissions to individual users[4]. First, ACL the groups to folders/files and then manage access via membership in the group.

6 – If you have an error accessing a path that is deep in the filesystem, work your way from the filesystem level downwards, fixing ACL settings in each step.

i.e. if you are having trouble accessing https://StorageAccountName.dfs.core.windows.net/FileSystemName/SubFolder1/SubFolder2/File(s)

First try: https://StorageAccountName.dfs.core.windows.net/FileSystemName

Then: https://StorageAccountName.dfs.core.windows.net/FileSystemName/SubFolder1

And so on.
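
To put rule 1 into practice, here’s a minimal sketch of what that path looks like in a Power Query query using the ADLSg2 connector, with the same placeholder storage account and filesystem names used above:

let
    // The path must include at least the filesystem name when ACLs are in use
    Source = AzureStorage.DataLake("https://storageaccountname.dfs.core.windows.net/FileSystemName/")
in
    Source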

Update: James Baker, a Program Manager on the Azure Storage team has published on GitHub a PowerShell script to recursively set ACLs. Thanks to Simon for commenting on this post to make me aware of it, Josh from the Azure support team for pointing me to the GitHub repo, and of course to James for writing the actual script!


[1] This description is copied directly from the ADLSg2 documentation, which you should also read before acting on the information in this post.

[2] Disclaimer: This post is basically me using my blog as a way to get Ben’s assistance online so more people can get insights from it. If the information is helpful, all credit goes to Ben. If anything doesn’t work, it’s my fault. Ok, it may also be your fault, but it’s probably mine.

[3] This one is very important to know before you begin, even though it may be #3 on the list.

[4] This is a best practice pretty much everywhere, not just here.

Power BIte: Power Platform dataflows

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

INTERIOR: pan over cubicles of happy, productive office workers

CLOSE-UP: office worker at desk

NARRATOR: Is that Susie I see, giving Power Platform dataflows a try?

SUSIE: That’s right! With dataflows I can have all of the data I need, right where I need it!!

NARRATOR: Dataflows. They’re not just for Power BI anymore.

OK, you may not remember that orange juice ad campaign from the late 1970s and early 80s[1], but I’ve had it stuck in my head since I started working on this post and video. I couldn’t figure out how to work it into the video itself, so here it is in written form.

Anyway, with that awkward moment behind us, you probably want to watch the video. Here it is:

As the video discusses, Power Apps now has a dataflows capability that is a natural complement to Power BI dataflows. Power Platform dataflows have been generally available since November 2019, after being in preview since the summer.

Power Platform dataflows use Power Query Online – and the same set of connectors, gateways, and transformation capabilities as Power BI dataflows. But there are a few key differences that are worth emphasizing.

Power Platform dataflows can load data into the Common Data Service, either into the standard Common Data Model entities used by Dynamics 365 apps, or into custom entities used by custom Power Apps. This is important – it makes dataflows more like a traditional ETL tool such as SSIS data flows, in that at the end of the dataflow creation process you can map the columns in your queries to the columns in these existing tables[2].

Power Platform dataflows can load data into ADLSg2 for analytical scenarios, but Power Apps doesn’t have the same concept of “built-in storage” that Power BI does. That means if you want to use Power Platform dataflows to create CDM folders, you must configure your Power Apps environment to use an ADLSg2 resource in your Azure subscription.

The “link to data lake” feature in Power Apps feels to me like a better integration experience than what’s currently available in Power BI. In Power Apps you define the link at the environment level, not the tenant level – this provides more flexibility, and enables non-tenant admins[3] to configure and use the data lake integration.


The first time you create a Power Platform dataflow and select the “analytical entities” option, you’ll be prompted – and required – to link the Power Apps environment to an Azure Data Lake Storage resource. You’ll need an Azure subscription to use, but the process itself is pretty straightforward.


I can’t wait to hear what you think of this new capability. Please let me know in the comments or via Twitter.

See you in the new year!


[1] I just realized that this was 40 years ago. Were you even born yet?

[2] CDS entities aren’t tables by the strictest definition, but it’s close enough for our purposes today.

[3] I honestly don’t know enough about Power Apps security to go into too much depth on this point, but I am not a Power Apps admin and I was able to create a trial environment and link it to my own ADLSg2 resource in my own Azure subscription without affecting other users.