Power BI Dataflows and Slow Data Sources

Important: This post was written and published in 2018, and the content below no longer represents the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

One advantage of data warehouses is the ability to have data from multiple data sources available for analysis without being constrained by the availability and performance characteristics of those data sources. Although Power BI dataflows are not a replacement for data warehouses in a general sense, they can provide same value to a self-service BI solution.

I’ve been using Power Query for over five years, and one consistently recurring pattern is the need to work with slow data sources such as web pages or APIs. “Slow” might mean “takes 10 minutes to load” but it could also mean “takes 10 hours” or “takes 10 days.” In each of these cases, my overall productivity ended up being closely tied to the refresh duration for the slowest data source in my solution, and this often introduced significant delay and frustration into my life. So I looked for solutions…

In the early days of Power BI, before Power BI dataflows, I would sometimes work around this problem by defining a set of Power Query queries in an Excel workbook, configuring those queries to load data into worksheets rather than into the data model. I would then use that Excel workbook as a data source for my primary queries. Essentially I was using an Excel file as a self-service data mart, since there wasn’t any similar capability in Power BI at the time. Here’s an example of the queries in a Power BI solution that I built almost five years ago[1]:

2018-11-23_15-33-06

These queries are the foundation of a solution I built to track my personal fitness, back when my exercise routine revolved around lifting weights. Back before I knew that people fought with longswords. I haven’t looked at it in years, but the organization shown in the screen shot above sure looks familiar.

There’s one folder for the queries extracting data from each source system. These queries are configured as “connection only” which means the data isn’t loaded directly into the data model – these queries serve as data sources for other queries. One of the folders for source queries references my Excel “staging area” workbook. As implied above, this other workbook contains data from sources that are too slow to include in the refresh of the primary solution workbook, as well as sources that update very infrequently, so that refreshing the data on a regular basis would be unwarranted and inefficient.

There’s also one folder for the queries that define the solution’s data model. These queries are loaded and represent a simple star schema that is friendly for analytics use cases.[2]

Does this pattern look familiar?

I hope it does. This is essentially the same pattern that I’ve included in almost every dataflows presentation I’ve delivered. This one:

2018-11-23_15-59-39

Dataflows make this pattern much easier to implement and maintain.

Were I to build a similar solution today[3], I would factor it in the same way. Each folder in the Excel-based solution would translate into a Power BI dataflow, and the queries in the folders would translate into entities. By having each source as its own dataflow, I can independently configure refresh schedules based on the performance and data freshness characteristics of each source. And by having the data in dataflow entities, I can easily reuse that data in multiple Power BI solutions – and enable other users in my organization to do the same[4].

If I were using Power BI Premium, I would probably build the whole solution using dataflows and computed entities, and only load the star schema into Power BI Desktop.

If I were not using Power BI Premium, I would probably build everything but the star schema using dataflows, and build the star schema using queries in Power BI Desktop that get their data from these dataflows.[5] As if the dataflows were logically a data warehouse.

Which is sort of where we came in…[6]

 


[1] The astute reader will realize that this is back before Power BI existed as we know it today. It feels strange to share the details of such an old solution, but this goes to show that although new tools are always being introduced, old problem patterns never seem to go away.

[2] In September 2014 I did a presentation about this solution for the PASS Data Warehousing and Business Intelligence Virtual Chapter, and the recording is still online on YouTube. I have not watched it, and don’t plan on watching it, and you should be warned before you watch it, because if I remember correctly it includes shirtless progress pictures of me.

[3] I will not do this, but only because I don’t have similarly rich sources of swordfighting-related data.

[4] This sounds like I’m implying that I have a group of analysts working on my personal fitness, doesn’t it? That’s just weird, even to me.

[5] This is another “reuse without Premium” scenario, independent of the one introduced in my earlier Power BI Dataflows – Reuse without Premium post.

[6] Imagine my disappointment when this screen shot just didn’t fit into this post. If only there was some way I could find a reason to include it somewhere…

2018-11-23_15-57-32.png

4 thoughts on “Power BI Dataflows and Slow Data Sources

  1. Pingback: Dataflows in Power BI – BI Polar

  2. Pingback: Positioning Power BI Dataflows (Part 2) – BI Polar

  3. Pingback: Power BI, Caching, Parallelism And Power Query Refresh Performance « Chris Webb's BI Blog

  4. Pingback: Dataflows and API data sources – BI Polar

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s