Lego Bricks and the Spectrum of Data Enrichment and Reuse

I love Lego bricks.

I love them because they were a favorite toy from my childhood. I love them because my children love them. I love them because they foster and encourage creativity.

I love them because they serve as an excellent metaphor for the enrichment and reuse of data in the enterprise.[1]

Consider this starting point.

[Image: a single Lego brick]

A single brick has almost unlimited potential. This brick could become the floor of a building, the chassis of a fire truck, the wing of an airplane or spaceship, or part of something that no Lego engineer had ever imagined. This potential comes with a cost – this single brick must be enriched with many other bricks before it can achieve any of these lofty goals.

Similarly, a single file, entity, or table has massive potential. The data it maintains could become part of many different applications in many different contexts. But as with the lone brick, it would need to first be enriched, to be combined with other data for that potential to be realized.

 

As each additional brick is combined with the first, the assembly’s complexity increases, but so does its value. These more complex components are closer to the ultimate end state, so less work is required to reach that goal.

But at the same time, each additional brick reduces the potential for reuse. After only a few additions we’ve already ruled out creating a floor or an airplane wing. We might still create a car or truck, or some types of spaceships, but the enrichment we’ve performed is already constraining our potential.

Similarly, each time we enrich a data source to move closer to our goal, we also limit the potential scenarios for which the data can be used. If we filter the data horizontally or vertically to eliminate records or fields we don’t currently need, we are eliminating data that may be needed for some other purpose. If we merge our starter data set with additional data, we may also be adding records or fields that aren’t needed for future purposes, while increasing complexity and adversely affecting performance.
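To make this concrete, here is a minimal Power Query “M” sketch of the kind of enrichment described above. The RawSales and CustomerDetails queries, the column names, and the filter date are all hypothetical; the point is only to show how each step narrows what the data can later be used for.

let
    // Hypothetical starting query: raw sales records
    Source = RawSales,

    // "Horizontal" filtering removes records we don't need right now...
    CurrentYear = Table.SelectRows(Source, each [OrderDate] >= #date(2018, 1, 1)),

    // ..."vertical" filtering removes fields, which also removes future options
    Trimmed = Table.SelectColumns(CurrentYear, {"OrderID", "CustomerID", "Amount"}),

    // Merging in another data set adds value for this scenario, but also adds
    // columns (and complexity) that other scenarios may never need
    Enriched = Table.NestedJoin(Trimmed, {"CustomerID"}, CustomerDetails, {"CustomerID"}, "Customer", JoinKind.LeftOuter)
in
    Enriched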

 

As we continue building, we see this pattern continue. We also see the pattern repeated at multiple levels, each contributing to the overall goal. At multiple points we combine a small number of individual bricks to build a component, and then add that component to the main model to make it more complex and more specialized. Each time – both for the component and for the whole – the process of enrichment adds value and complexity, and reduces the potential scope of reuse. When the model is finished, we have exactly what we needed[2]. The value is very high, but the opportunities for reuse are very small.

The parallel is clear: when we finish building a BI application, the value is very high, but the potential for reuse is very low. The dashboards and reports, the data model with its measures, dimensions, and KPIs, the data warehouse and data lake, and all of the upstream logic and components that make up a BI solution need to be combined in specific ways to achieve specific goals. The application as a whole is specialized for a given purpose…

…but what about the components that make up the application? Where can we find the sweet spot, the perfect balance between achieved value and potential value?

Like these Lego components:

[Image: partially assembled Lego components]

When you’re building your solution using Power BI, this is where dataflows come in.

The answer will of course differ for each context, but when designing an application, it’s important to take larger and longer-term requirements into account.

Consider this diagram[3]:

[Diagram: dataflows as building blocks in a layered data enrichment architecture]

In this simple architecture, each dataflow (shown as a lighter square) represents a stage in the overall data processing and enrichment flow. Each one adds value toward the finished application, and each serves as a building block that can be reused both in this application and in other applications with overlapping data requirements.

The nature of Power BI dataflows lends itself well to this problem – each dataflow is a collection of reusable data entities managed by the Power BI service, easily discoverable and usable by technical and business users in BI applications. The computed entities feature in particular makes this type of scenario easy to set up and manage.

At the same time, the nature of Power BI dataflows introduces challenges for this Lego-style reuse. Dataflows in Power BI are optimized for non-IT users: business users and analysts who are typically focused on solving their immediate data challenges without relying on support from IT. These users are less likely to be focused on bigger-picture requirements like broad reuse of the entities they create.

This is where a little process and collaboration can come in, aided by the fact that dataflows are managed by the Power BI service. Power BI administrators can monitor the service to understand what dataflows are being used most frequently and most widely, and in what contexts. With this as a starting point, they can then operationalize[4] dataflows and entities created by business users, so that they are managed and maintained by IT. Since each dataflow entity is defined by the Power Query “M” code in the entity definition, this operationalization process is likely to be simpler and easier than similar processes with other technologies.
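To illustrate why this hand-off can be relatively painless, here is a simplified sketch of the kind of “M” definition that backs a dataflow entity. The URL, column names, and types are hypothetical, but taking ownership of an entity largely means taking ownership of a query like this one.

let
    // Hypothetical entity definition created by a business user;
    // IT can take over this query as-is when operationalizing the dataflow
    Source = Csv.Document(Web.Contents("https://example.com/exports/orders.csv"), [Delimiter = ",", Encoding = 65001]),
    Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    Typed = Table.TransformColumnTypes(Promoted, {{"OrderID", Int64.Type}, {"Amount", type number}})
in
    Typed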

This approach also fits in well with how many larger enterprises implement Power BI. It is common[5] for larger organizations to use both shared capacity and dedicated Premium capacity for different purposes, and the applications deployed to Premium capacity are often those managed by a central IT/BI team. Since computed entities are only available when using Power BI Premium[6], this approach could lend itself well to the hand-off from business users to IT.

In any event, the next time you’re building dataflow entities, pause for a moment to think about Lego bricks, and what types of bricks or components your entities and dataflows represent. And then maybe take a break to go play with your kids.


[1] All images and examples in this post are taken from the building instructions for the Lego City Starter Set. I used to have a similar Lego set in my office that I would use to tell this story in person, but I gave it away during an office move shortly before I started this blog. The moral of the story: never get rid of your Lego sets.

[2] A fire truck!!

[3] This diagram is my reformatted version of a diagram included in the Power BI dataflows whitepaper. If you haven’t read this document, you really should.

[4] Or industrialize – which term do you use?

[5] Please note that this is not a blanket recommendation. I have the advantage of talking to many top Power BI customers around the world, so I can see this approach emerging as a common pattern, but the “right” approach in a given context will always depend on the context and the goals of the organization. I personally believe it is too early to start talking about best practices for Power BI dataflows (as I write this, dataflows have been in public preview for barely three weeks) but this is one of the areas where I am most excited to see best practices start to emerge.

[6] Even though Power BI dataflows do enable reuse in other ways that do not require Premium capacity.

Power BI Dataflows and Slow Data Sources

One advantage of data warehouses is the ability to have data from multiple data sources available for analysis without being constrained by the availability and performance characteristics of those data sources. Although Power BI dataflows are not a replacement for data warehouses in a general sense, they can provide the same value to a self-service BI solution.

I’ve been using Power Query for over five years, and one consistently recurring pattern is the need to work with slow data sources such as web pages or APIs. “Slow” might mean “takes 10 minutes to load” but it could also mean “takes 10 hours” or “takes 10 days.” In each of these cases, my overall productivity ended up being closely tied to the refresh duration for the slowest data source in my solution, and this often introduced significant delay and frustration into my life. So I looked for solutions…

In the early days of Power BI, before Power BI dataflows, I would sometimes work around this problem by defining a set of Power Query queries in an Excel workbook, configuring those queries to load data into worksheets rather than into the data model. I would then use that Excel workbook as a data source for my primary queries. Essentially I was using an Excel file as a self-service data mart, since there wasn’t any similar capability in Power BI at the time. Here’s an example of the queries in a Power BI solution that I built almost five years ago[1]:

[Screenshot: the queries in the Excel-based solution, organized into folders]

These queries are the foundation of a solution I built to track my personal fitness, back when my exercise routine revolved around lifting weights. Back before I knew that people fought with longswords. I haven’t looked at it in years, but the organization shown in the screen shot above sure looks familiar.

There’s one folder for the queries extracting data from each source system. These queries are configured as “connection only” which means the data isn’t loaded directly into the data model – these queries serve as data sources for other queries. One of the folders for source queries references my Excel “staging area” workbook. As implied above, this other workbook contains data from sources that are too slow to include in the refresh of the primary solution workbook, as well as sources that update very infrequently, so that refreshing the data on a regular basis would be unwarranted and inefficient.
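For reference, a “source” query like the ones in that folder might look roughly like this. The file path and sheet name are made up, but the shape is the standard Excel.Workbook navigation pattern.

let
    // Hypothetical "connection only" source query pointing at the staging workbook;
    // other queries reference this one instead of hitting the slow sources directly
    Source = Excel.Workbook(File.Contents("C:\Staging\FitnessStaging.xlsx"), null, true),
    WeightLogSheet = Source{[Item = "WeightLog", Kind = "Sheet"]}[Data],
    Promoted = Table.PromoteHeaders(WeightLogSheet, [PromoteAllScalars = true])
in
    Promoted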

There’s also one folder for the queries that define the solution’s data model. These queries are loaded into the data model and define a simple star schema that is friendly for analytics use cases.[2]

Does this pattern look familiar?

I hope it does. This is essentially the same pattern that I’ve included in almost every dataflows presentation I’ve delivered. This one:

[Diagram: the dataflow pattern I include in almost every dataflows presentation]

Dataflows make this pattern much easier to implement and maintain.

Were I to build a similar solution today[3], I would factor it in the same way. Each folder in the Excel-based solution would translate into a Power BI dataflow, and the queries in the folders would translate into entities. By having each source as its own dataflow, I can independently configure refresh schedules based on the performance and data freshness characteristics of each source. And by having the data in dataflow entities, I can easily reuse that data in multiple Power BI solutions – and enable other users in my organization to do the same[4].

If I were using Power BI Premium, I would probably build the whole solution using dataflows and computed entities, and only load the star schema into Power BI Desktop.

If I were not using Power BI Premium, I would probably build everything but the star schema using dataflows, and build the star schema using queries in Power BI Desktop that get their data from these dataflows.[5] As if the dataflows were logically a data warehouse.
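In that non-Premium setup, the Power BI Desktop queries that build the star schema would start from the dataflows connector, along these lines. The workspace, dataflow, and entity identifiers below are placeholders; this is roughly the navigation code the connector generates for you.

let
    // Connect to dataflows from Power BI Desktop; the IDs and names below are placeholders
    Source = PowerBI.Dataflows(null),
    Workspace = Source{[workspaceId = "00000000-0000-0000-0000-000000000000"]}[Data],
    Dataflow = Workspace{[dataflowId = "11111111-1111-1111-1111-111111111111"]}[Data],
    WeightLog = Dataflow{[entity = "WeightLog"]}[Data]
in
    WeightLog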

Which is sort of where we came in…[6]

 


[1] The astute reader will realize that this is back before Power BI existed as we know it today. It feels strange to share the details of such an old solution, but this goes to show that although new tools are always being introduced, old problem patterns never seem to go away.

[2] In September 2014 I did a presentation about this solution for the PASS Data Warehousing and Business Intelligence Virtual Chapter, and the recording is still online on YouTube. I have not watched it, and don’t plan on watching it, and you should be warned before you watch it, because if I remember correctly it includes shirtless progress pictures of me.

[3] I will not do this, but only because I don’t have similarly rich sources of swordfighting-related data.

[4] This sounds like I’m implying that I have a group of analysts working on my personal fitness, doesn’t it? That’s just weird, even to me.

[5] This is another “reuse without Premium” scenario, independent of the one introduced in my earlier Power BI Dataflows – Reuse without Premium post.

[6] Imagine my disappointment when this screen shot just didn’t fit into this post. If only there was some way I could find a reason to include it somewhere…

[Screenshot]

Power BI Dataflows – Data Profiling Without Premium

If you’ve been reading this blog, you already know a few things:

  1. Power BI includes a capability for self-service data preparation called “dataflows”
  2. Dataflows include computed entities, which enable some powerful reuse scenarios, but which are available only in Power BI Premium
  3. You can use computed entities to make current data profiles available for your dataflow entities, so that this valuable metadata can be used wherever the data is used
  4. Dataflows enable scenarios for reuse that don’t require computed entities and Premium

This post presents a variation on the data profiling pattern that doesn’t require Premium capacity. Let’s jump right in.

This is the approach that I took last time: I created a single dataflow to contain the data profiles for entities in other dataflows. As you can see, my workspace is no longer backed by Premium capacity, so this approach isn’t going to work.

[Screenshot: my workspace, no longer backed by Premium capacity]

Instead of having a dedicated “Data Profiles” dataflow, we’re going to have data profile entities in the same dataflows that contain the entities being profiled. Dataflows like this one.

[Screenshot: a dataflow containing two entities]

As you can see, this dataflow contains two entities. We want to profile each of them. The most intuitive approach would be to create new queries that reference the queries for these entities, and to put the profile in the dependent query…

[Screenshot: new queries referencing the entity queries]

…but if you do this, Power BI thinks you’re trying to create a computed entity, which requires Premium.

[Screenshot: the warning that computed entities require Premium]

Please allow me to rephrase that last sentence. If you reference a query that is loaded into a dataflow entity, you are creating a computed entity, which requires Premium.

So let’s not do that.

Instead, let’s use the same pattern from the “reuse without Premium” post to address this scenario.

Let’s begin by disabling the data load for the two “starter” entities that reference the external data source.

[Screenshot: disabling load for the two source entities]

Once this is done, the Premium warning goes away, because we’re no longer trying to create computed entities.

[Screenshot: the Premium warning is gone]

Let’s rename the queries, and look at the M code behind the new queries we’ve created.

[Screenshot: the M code behind the renamed queries]

As you can see, the new queries don’t contain any real logic – all of the data acquisition and transformation takes place in the “source” queries. The new ones just reference them, and get loaded into the CDM Folder that’s backing the dataflow.
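If the screen shot isn’t easy to make out, the pattern looks roughly like this. The query names and the data source are hypothetical; the structure is what matters.

// Hypothetical "source" query (load disabled) that does the real work
// Query name: Products Source
let
    Source = Csv.Document(Web.Contents("https://example.com/products.csv"), [Delimiter = ","]),
    Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true])
in
    Promoted

// The loaded entity simply references the source query
// Query name: Products
let
    Source = #"Products Source"
in
    Source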

At this point we’re functionally right back where we started – we just have a more complex set of queries to achieve the same results. But we’re also now positioned to add in queries to profile these entities, without needing Premium.

To do this, we’ll simply add new queries that reference the “source” queries, and add a step that calls Table.Profile().[1]

[Screenshot: the new profile queries calling Table.Profile]
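The profile query itself is tiny. Here’s a sketch using the same hypothetical names as above:

// Profile entity referencing the load-disabled "source" query
// Query name: Products Profile
let
    Source = #"Products Source",
    Profile = Table.Profile(Source)
in
    Profile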

And that’s that.

When I save my dataflow and refresh it, the entities for the data and the entities for the data profiles are loaded and saved for reuse. When I connect to this dataflow from Power BI Desktop, all four entities are available.

[Screenshot: all four entities available in Power BI Desktop]

At this point you may be wondering what the difference is between this approach and the one that uses computed entities. To help answer this question, let’s look at the refresh details in the CSV file that can be downloaded from the refresh history.[2]

[Screenshot: the refresh details CSV from the refresh history]

If you look at the start and end time for each of the four entities, you’ll see that each of them took roughly the same time to complete. This is because for each entity, the query extracted data from the data source and transformed it before loading into the CDM Folder. Even though the extract logic was defined in the shared “source” queries, when the dataflow is refreshed each entity is loaded by executing its query against the data source.

By comparison, in the data profiling pattern that relies on computed entities, the data source is not used to generate the profile. The computed entity uses the CDM Folder managed by Power BI as its source, and generates the profile from there. This means that the data source is placed under lighter load[3], and the profile generation itself should take less time.

For meaningfully large data sources, this difference may be significant. For the trivial data sources used in this example, the difference is measured in seconds, not minutes or hours. You’ll probably want to explore these patterns and others – I’m eager to hear what you discover, and what you think…


[1] Yes, I literally copied the code from my other blog post.

[2] For more details on refresh, see Dataflows in Power BI: Overview Part 5 – Data Refresh.

[3] 50% lighter, if my math skills haven’t failed me.

 

Power BI Dataflows and World Bank Climate Change Data

It is increasingly common to include external data in business intelligence and data science. If you’re working to understand retail sales patterns, you may want to include weather data, traffic data, event data, crime data, and other externally-sourced data sets to supplement your internal corporate data.

The World Bank publishes a number of freely available data sets, including this one, which provides per-country, per-year metrics related to global climate change. The data set is available in three different formats: CSV, XML, and XLS. Each version is accessible via HTTP.

Using Power Query in Power BI Desktop makes it easy to include this data in a PBIX file and dataset, but it doesn’t provide many opportunities for reuse. And because significant transformation is required before the World Bank data can be used, each dataset that incorporates this data needs to duplicate that transformation logic, which is far from ideal.

Power BI dataflows present an obvious and elegant solution to this problem. With dataflows, a user can use Power Query to define entities that are stored in and managed by the Power BI service. The queries and the transformation logic they contain can be defined once, and the data produced by the queries can be used by authorized users in as many workspaces and datasets as necessary.

But the nature of the World Bank API introduces a few challenges.

  • The Excel data is in the binary XLS format, not XLSX, which means that it requires a gateway to refresh. This isn’t a show-stopper, but I always prefer to avoid using an on-premises gateway to refresh online data. It just feels wrong.
  • The XML and CSV data isn’t actually exposed as XML and CSV files; it’s exposed as Zip files that contain XML and CSV files. And Power Query doesn’t support Zip files as a data source.

Or does it?

BI specialist Ken Russel, with a little help from MVP Chris Webb[1], figured out how to make this work. Ken published a Power Query “M” script that will decompress a Zip file, essentially exposing its content to the rest of the script as if it were a folder containing files.

Life is good.

With this part of the problem solved, the rest is pretty straightforward, especially since I already had a Power BI Desktop file that uses the World Bank Excel API and solves most of the remaining problem.

With a little text editing aided by the Power Query extension for Visual Studio Code, I ended up with two queries: one for per-country metadata, and one for the climate change data itself[2]. I can now get into the fun and easy part: dataflows.
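For illustration, here is a rough sketch of how the pieces of one of these queries fit together. The URL is a placeholder, fnUnzip stands in for Ken’s published function, and the real queries (linked in the footnotes) do considerably more cleanup.

let
    // Placeholder URL standing in for the World Bank CSV download (a Zip archive)
    ZipBinary = Web.Contents("https://example.org/world-bank-climate-change.zip"),

    // fnUnzip is a stand-in for Ken's zip-decompression function, which returns
    // a table with one row per file in the archive, including a Content column
    Files = fnUnzip(ZipBinary),

    // Pick the file that contains the data (selection logic simplified here)
    DataFile = Files{0}[Content],

    // Parse the CSV and promote headers; the real query also skips preamble rows,
    // unpivots the per-year columns, and sets data types
    Csv = Csv.Document(DataFile, [Delimiter = ",", Encoding = 65001]),
    Promoted = Table.PromoteHeaders(Csv, [PromoteAllScalars = true])
in
    Promoted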

I’ll start by opening my workspace, creating a new dataflow, and adding an entity.

[Screenshot: creating a new dataflow]

[Screenshot: adding a new entity]

Since I already have the query written, I’ll choose “Blank Query” from the available data sources, and I’ll just paste in the query text.

[Screenshot: choosing Blank Query from the available data sources]

[Screenshot: pasting in the query text]

At this point, all I need to do is give the query a name that will be meaningful and understandable to the users who will have access to the dataflow. As you can see in the screen shot below, even though the query was written largely by hand, all of the steps are maintained and can be interacted with in the Power Query editor.

[Screenshot: the query steps in the Power Query editor]

Once I finish with the first entity, I can follow the same basic steps with the second entity, and can save the dataflow.

[Screenshot: saving the dataflow]

Since the dataflow can’t be used until its entities contain data, I’ll refresh it immediately. The dataflow and its last refresh time will now be shown in the dataflows list in my workspace.[3]

[Screenshot: the dataflow refreshing in the workspace]

From this point on, no user in my organization who wants to include World Bank climate change data in her reports will need to connect to the World Bank API and transform the data until it is fit for purpose. All they will need to do is connect to Power BI dataflows from Power BI Desktop, where the data is always available, always current, and always consistent, managed by the Power BI service. Life is good.

[Screenshot: connecting to Power BI dataflows from Power BI Desktop]

Obviously, this pattern doesn’t apply only to World Bank data. It applies to any external data source where multiple users need to use the data in different solutions. It can also apply to internal/organizational data sources that need to be transformed to prepare them for analytics.


[1] See Ken’s awesome blog post here for the full details.

[2] You can download the queries from OneDrive: World Bank Metadata.pq, and World Bank Climate Change Data.pq.

[3] I could also set up a refresh schedule to ensure that the data in the entities is always current, but I’m saving this for a future blog post.