Dataflows, Datasets, and Models – Oh My!

How do Power BI datasets and dataflows relate to each other? Do you need one if you have the other?

Photo by Chris Liverani on Unsplash

I received this question as a comment on another post, and I think it warrants a full post as a reply:

Hi Matthew,  my organization is currently evaluating where to put BI data models for upcoming PBI projects. Central in the debates is the decision of whether to use PBI Datasets, SSAS or DataFlows. I know a lot of factors need considering. I’m interested in hearing your thoughts.

Rather than answering the question directly, I’m gong to rephrase and re-frame it in a slightly different context.

I’m currently evaluating how to best chop and prepare a butternut squash. Central in the debates is the decision of whether to use a 6″ chef’s knife, a 10″ chef’s knife, or a cutting board.

(I’ll pause for a moment to let that sink in.)

It doesn’t really make sense to compare two knives and a cutting board in this way, does it? You can probably get the job done with either knife, and the cutting board will certainly make the job easier… but it’s not like you’d need to choose one of the three, right? Right?


Your choice of knife will depend on multiple factors including the size of the squash, the size of your hand, and whether or not you already have one or the other or both.

Your choice of using a cutting board will come down to your workflow and priorities. Do you already have a cutting board? Is it more important to you to have a safe place to chop the squash and not damage the edge of your knife, or is it more important to not have one more thing to clean?

Both of these are valid decisions that need to be made – but they’re not dependent on each other.

Let’s get back to the original question by setting some context for dataflows and datasets in Power BI.


This image is from one of the standard slides in my dataflows presentation deck, and I stole it from the dataflows team[1]. It shows where datasets and dataflows fit in Power BI from a high-level conceptual perspective.

Here’s what seems most important in the context of the original question:

  • Power BI visualizations are built using datasets as their sources
  • Power BI includes datasets, which are tabular BI models hosted in the Power BI service
  • Dataflows are a data preparation capability in Power BI for loading data into Azure Data Lake Storage gen2
  • Dataflows can be used as a data source when building datasets in Power BI, but cannot currently be used as a data source for models outside of Power BI, including SSAS and AAS
  • Dataflows and datasets solve different problems and serve different purposes, and cannot be directly compared to each other as the original question tries to do – that’s like comparing chef’s knives and cutting boards

What’s not shown in this diagram is SQL Server Analysis Services (SSAS) or Azure Analysis Services (AAS) because the diagram is limited in scope to capabilities that are natively part of Power BI. SSAS and AAS are both analytics services that can host tabular BI models that are very similar to Power BI datasets, and which can be used as a data source for Power BI datasets. Each option – SSAS, AAS, or Power BI datasets – is implemented using the same underlying technology[2], but each has different characteristics that make it more or less desirable for specific scenarios.

This list isn’t exhaustive, and I make no claims to being an expert on this topic, but these are the factors that seem most significant when choosing between SSAS, AAS, or Power BI datasets as your analytics engine of choice:

  • Cost and pricing model – if you choose SSAS you’ll need to own and manage your own physical or virtual server. If you choose AAS or Power BI you’ll pay to use the managed cloud service. Dedicated Power BI Premium capacity and shared Power BI Pro capacity have different licensing models and costs tp target different usage patterns.
  • Model size – you can scale SSAS to pretty much any workload if you throw big enough hardware at it[3]. AAS can scale to models that are hundreds of gigabytes in size. Power BI Premium can support PBIX files up to 10GB[4], and Power BI Pro supports PBIX files up to 1GB.
  • Deployment and control scenarios – with SSAS and AAS, you have a huge range of application lifecycle management (ALM) and deployment capabilities that are enabled by the services’ XMLA endpoint and a robust tool ecosystem. Power BI Premium will support this before too long[5] as well.

I’m sure I’m missing many things, but this is what feels most important to me. Like I said, I’m far from being an expert on this aspect of Power BI and the Microsoft BI stack.

So let’s close by circling back to the original question, and that delicious analogy. You need a knife, but the knife you choose will depend on your requirements. Having a cutting board will probably also help, but it’s not truly required.

Now I’m hungry.


[1] If you want to watch a conference presentation or two that includes this slide, head on over to the Dataflows in Power BI: Resources post.

[2] This feels like an oversimplification, but it’s technically correct at the level of abstraction at which I’m writing it. If anyone is interested in arguing this point, please reply with a comment that links to your article or blog post where the salient differences are listed.

[3] Remember I’m not an expert on this, so feel free to correct me by pointing me to documentation. Thanks!

[4] This is not a direct size-to-size comparison. The services measure things differently.

[5] As announced at Microsoft Ignite a few months back, no firm dates shared yet.

Authoring Power BI Dataflows in Power BI Desktop

That title got your attention, didn’t it?

Before we move on, no, you cannot create and publish dataflow entities from Power BI Desktop today. Creating dataflows is a task you need to perform in the browser. But you can build your queries in Power BI Desktop if that is your preferred query authoring tool. Here’s how.[1]

  1. Create a query or set of queries in Power BI Desktop.
  2. Copy the query script(s) from the Power BI Desktop query editor, and paste it into a “Blank query” in Power Query Online.
  3. Rename the new queries to match your desired entity names, being careful to match the names of the source queries if there are any references between them.
  4. If necessary, disable the “load” option for queries that only implement shared logic and should not be persisted in the dataflow’s CDM folder.

That’s it.

Some of you may be asking “but why would I want to do this, when there’s already an awesome query authoring experience in the browser?”

Good question! There are three reasons why I will often use this technique:

  1. I prefer rich, non-browser-based editing tools[2], and Power BI Desktop has a polished and refined UX.
  2. The online editor doesn’t have all of the transformations in its UI compared to Power BI Desktop.
  3. The online editor doesn’t have all supported connectors exposed in the UI.

Each of these points relates to the the maturity of Power BI Desktop as a tool[3], as opposed to the relatively new Power Query Online. Power Query Online is part of the dataflows preview and is continuing to improve and expand in functionality, but Power BI Desktop has been generally available for years.

And Although I didn’t realize it until I started writing this post, Power BI Desktop actually has features that make this scenario easier than expected. Let’s look at this example. Here are the queries I’m starting with:


In the PBIX file I have three sets of queries, organized by folder:

  1. A query that references the one data source that is used in the other queries, so I can change the connection information in one place and have everything else update.
  2. Three queries that each contain minimally-transformed data from the data source, and which are not loaded into the data model.
  3. Two queries that are loaded into the data model and which are used directly for reporting.

This is a common pattern for my PBIX files. I don’t know if it counts as a best practice (especially now that Power BI has better support for parameterization than it did when I started doing things this way) but it works for me, and nicely illustrates the approach I’m following.

To move this query logic out of my PBIX file and into a dataflow definition, I first need to copy the query scripts. Power BI Desktop makes this incredibly easy. When I right-click on any folder and choose “copy”, Power BI Desktop places the scripts for all queries in the workbook – including the query names as comments – on the clipboard.


Now that I have all of the query scripts in a text editor, I can get started by creating a new dataflow, and selecting “Blank query” for my first entity.


After I paste in the query, I will un-select the “Enable load” option, and will paste in the query name as well.


Once this is done, I can repeat the process by selecting “Get data” option in the query editor, and choosing “Blank query” for each additional query.


After I repeat this process for each remaining query, my dataflow will look something like this.


And if I want to, I can even add groups to organize the queries.


This looks a lot like where we started, which is both a good thing and a bad thing. The good side is that it demonstrates how we can use a more mature and authoring experience for our interactive query development. The bad side is that it introduces additional steps into our workflow.

I expect the integration between dataflows and Desktop to only get better over time[4], but for today, there’s still an easy path if you want to use Power BI Desktop to author your dataflow entities.

As a closing word of caution, please be aware that not all data sources and functions that work in Power BI Desktop will work in dataflows. If you’re using a data source for the first time, or are using a new function[5], you’ll probably want to test things early to avoid a potentially unpleasant surprise later on.


[1] I’ve mentioned this technique in a few other posts, but I’ve heard a bunch of questions in recent days that makes me believe that the word isn’t getting out. Hopefully this post will help.

[2] I’m writing this post from the WordPress Windows app – even though the app offers nothing that the web editor does not, and actually appears to be the thinnest of wrappers around the web editor.

[3] And they all relate to the fact that Power BI Desktop is just so awesome, nothing compares to it, and although the Power Query Online editor is useful and good, it hasn’t had a team making it better every month, year after year.

[4] Please remember that this is my personal blog, and that even though I’m a member of the Power BI team, I’m not working on either dataflows or Desktop, so what I expect and what will actually happen aren’t always well aligned.

[5] Like Html.Table, which is not yet supported in dataflows. Darn it.

Lego Bricks and the Spectrum of Data Enrichment and Reuse

I love Lego bricks.

I love them because they were a favorite toy from my childhood. I love them because my children love them. I love them because they foster and encourage creativity.

I love them because they serve as an excellent metaphor for the enrichment and reuse of data in the enterprise.[1]

Consider this starting point.


A single brick has almost unlimited potential. This brick could become the floor of a building, the chassis of a fire truck, the wing of an airplane or spaceship, or part of something that no Lego engineer had ever imagined. This potential comes with a cost – this single brick must be enriched with many other bricks before it can achieve any of these lofty goals.

Similarly, a single file, entity, or table has massive potential. The data it maintains could become part of many different applications in many different contexts. But as with the lone brick, it would need to first be enriched, to be combined with other data for that potential to be realized.


As each additional brick is combined with the first brick, its complexity increases, but so does its value. These more complex components are closer to the ultimate end state, so less work will be required to achieve that goal.

But at the same time, each additional brick reduces the potential for reuse. After only a few additions we’ve already ruled out creating a floor or an airplane wing. We might still create a car or truck, or some types of spaceships, but the enrichment we’ve performed is already constraining our potential.

Similarly, each time we enrich a data source to move closer to our goal, we also limit the potential scenarios for which the data can be used. If we filter the data horizontally or vertically to eliminate records or fields we don’t currently need, we are eliminating data that may be needed for some other purpose. If we merge our starter data set with additional data, we may also be adding records or fields that aren’t needed for future purposes, while increasing complexity and adversely affecting performance.


As we continue building, we see this pattern continue. We also see the pattern repeated on multiple levels, while contributing to the overall goal. At multiple points we will combine a small number of individual bricks to build a component, and then add that component to the main model to make it more complex, and more specialized. Each time – both for the component and for the whole – the process of enrichment adds value and complexity, and reduces the potential scope of reuse. When the final model is finished we have the final model we needed[2]. The value is very high, but the opportunities for reuse are very small.

The parallel is clear: when we finish building a BI application, the value is very high, but the potential for reuse is very low. The dashboards and reports, the data model with its measures, dimensions, and KPIs, the data warehouse and data lake, and all of the upstream logic and components that make up a BI solution need to be combined in specific ways to achieve specific goals. The application as a whole is specialized for a given purpose…

…but what about the components that make up the application? Where can we find the sweet spot, the perfect balance between achieved value and potential value?

Like these Lego components:


When you’re building your solution using Power BI, this is where dataflows come in.

The answer will of course differ for each context, but when designing an application, it’s important to take larger and longer-term requirements into account.

Consider this diagram[3]:

New Excel Diagram

In this simple architecture, each dataflow (represented by the lighter squares) represents a stage in the overall data processing and enrichment flow. Each one adds value toward the application, and each serves as a building block that can be further reused both in this application and in other applications with overlapping data requirements.

The nature of Power BI dataflows lends itself well to this problem – each dataflow is a collection of reusable data entities managed by the Power BI service, easily discoverable and usable by technical and business users in BI applications. The computed entities feature in particular makes this type of scenario easy to set up and manage.

At the same time, the nature of Power BI dataflows introduces challenges for this Lego-style reuse. Dataflows in Power BI are optimized to enable non-IT users, business users and analysts who are typically focused on solving their immediate data challenges without relying on support from IT. These users are less likely to be focused on bigger-picture requirements like broad reuse of the entities they create.

This is where a little process and collaboration can come in, aided by the fact that dataflows are managed by the Power BI service. Power BI administrators can monitor the service to understand what dataflows are being used most frequently and most widely, and in what contexts. With this as a starting point, they can then operationalize[4] dataflows and entities created by business users, so that they are managed and maintained by IT. Since each dataflow entity is defined by the Power Query “M” code in the entity definition, this operationalization process is likely to be simpler and easier than similar processes with other technologies.

This approach also fits in well with how many larger enterprises implement Power BI. It is common[5] for larger organizations to use both shared capacity and dedicated Premium capacity for different purposes, and often those applications deployed to Premium capacity are those that are managed by a central IT/BI team. Since computed entities are only available when using Power BI Premium[6], this approach could lend itself well to the hand-off from business users to IT.

In any event, the next time you’re building dataflow entities, pause for a moment to think about Lego bricks, and what types of bricks or components your entities and dataflows represent. And then maybe take a break to go play with your kids.

[1] All images and examples in this post are taken from the building instructions for the Lego City Starter Set. I used to have a similar Lego set in my office that I would use to tell this story in person, but I gave it away during an office move shortly before I started this blog. The moral of the story: never get rid of your Lego sets.

[2] A fire truck!!

[3] This diagram is my reformatted version of a diagram included in the Power BI dataflows whitepaper. If you haven’t read this document, you really should.

[4] Or industrialize – which term do you use?

[5] Please note that this is not a blanket recommendation. I have the advantage of talking to many top Power BI customers around the world, so I can see this approach emerging as a common pattern, but the “right” approach in a given context will always depend on the context and the goals of the organization. I personally believe it is too early to start talking about best practices for Power BI dataflows (as I write this, dataflows have been in public preview for barely three weeks) but this is one of the areas where I am most excited to see best practices start to emerge.

[6] Even though Power BI dataflows do enable reuse in other ways that do not require Premium capacity.

Positioning Power BI Dataflows (Part 2)

I didn’t plan on writing a sequel to my Positioning Power BI Dataflows post, but a few comments I’ve seen recently have made me think that one might be useful. I also didn’t plan on this article ending up quite as long as it has, but this is the direction in which it ended up needing to go.

One was a comment on my October post related to CDM folders, that was part of a discussion[1] about whether it makes sense to have data warehouses now that we have dataflows. I’d finished replying by saying “If your scenario includes the ability to add a new dimension to a data warehouse, or to add new attributes to existing dimensions, that’s probably a good direction to choose.” Darryll respectfully disagreed.


The point in Darryll’s comment that stuck with me was related to data warehouses becoming an anti-pattern, a “common response to a recurring problem that is usually ineffective and risks being highly counterproductive.” Darryll and I will probably have to agree to disagree.

Update: Darryl was kind enough to comment on this post, so please scroll down for additional context. The rest of this post remains unedited.

Big data platforms like Azure Data Lake Storage gen2 are enabling “modern data warehouse” scenarios that were not previously possible, and they’re making them more and more accessible. I don’t think there’s any argument on that front. But just because there is a new cool hammer in the toolbox, this doesn’t mean that every problem needs to be a big data nail.[2] The need for “traditional” Kimball-style data warehouse hasn’t gone away, and in my opinion isn’t likely to go away any time soon.

The other comment that prompted this post was a from Nimrod on Twitter, in response to my recent blog post about using dataflows as a way to handle slow data sources in a self-service solution when you don’t have a data warehouse.


Before I proceed I should mention that the next few paragraphs are also informed by Nimrod’s excellent essay “The Self-Service BI Hoax“, which you are strongly encouraged to read. It’s not my goal to respond to this essay in general or in specific terms, but it provides significant context about the likely thinking behind the tweet pictured above.

I’m not sure where Nimrod was going with his “local” comment, since dataflows are built and executed and managed in the Power BI cloud service, but the rest of the post is worth considering carefully, both in the context of positioning and in the context of usage.

I’ve said this many times before, and I suspect I’ll say it many times again: dataflows are not a replacement for data warehouses. I said this in the first paragraph of the post to which Nimrod was responding, and in that post the phrase was a hyperlink back to my initial post on positioning. There will be people who claim that you don’t need a data warehouse if you have dataflows – this is false. This is as false as saying that you don’t need a curated and managed set of data models because you have a self-service BI tool.

Experience has shown time and time again that self-service BI succeeds at scale[3] when it is part of an organized and professional approach to data and analytics. Without structure and management, self-service BI is too often part of the problem, rather than part of the solution. To borrow from Nimrod’s essay, “With some governance, and with a realistic understanding of what the technology can do, the data anarchy can become a data democracy.” The converse, also holds true – without that governance, anarchy is likely, and its likelihood increases as the scope of the solution increases.

Despite this, I believe that Power BI dataflows have a better chance to be part of the solution because of how they’re implemented. This is why:

  1. Dataflows are defined and managed by the Power BI service. This means that they can be discovered and understood by Power BI administrators using the Power BI admin API and the dataflows API as well. Although the management experience is not yet complete while dataflows are in preview, the fact that dataflows are defined and executed in the context of a single cloud service means that they are inherently more manageable and less anarchic than other self-service alternatives.
  2. Dataflows are self-contained and self-describing in terms of the ETL logic they implement and their data lineage. Each dataflow entity is defined by a Power Query “M” query, and the data in the entity can only result from the execution of that query. This is fundamentally different from tools like Excel, where the logic that defines a dataset is difficult to parse and understand[4], and which would need to be reverse engineered and re-implemented by a developer in order to be included in a central data model. It is also fundamentally different from other self-service data preparation technologies that load data into unmanaged locations where they can be further manipulated with file system or database CRUD operations.
  3. Dataflows lend themselves to process-driven collaboration between business and IT. With a Power BI dataflow entity, an administrator can take the query that defines the entity and reuse it in another context that supports “M” queries such as a tabular model. They can also also be operationalized as-is; any dataflow or entity created by a business user can be added to the central IT-managed data lake. The technology behind dataflows lends itself better to the types of processes that successful BI centers of excellence put in place than do many other data preparation technologies.
  4. Business users are going to prepare and use the data they need regardless of the tools that are made available to them. In an ideal world, every data need that a business user has would be fulfilled by a central IT team in a timely and predictable manner. Sadly, we do not live in this world. In most situations it’s not a matter of choosing dataflows over a professionally-designed data warehouse. It’s a matter of choosing dataflows over an Excel workbook or other self-service solution.

This final point makes me think of one[5] of the key tenants of the Kimball Method:

It’s all about the business.
I say this many times during classes and consulting. It’s the most important characteristic of the Kimball Lifecycle method: practical, uncompromising focus on the business. It infuses everything we do, and it’s the single most important message to carry forward.

A mature IT organization will help the business it supports achieve its goals in the best way it can, where “best” is situational and dependent on the many complex factors that need to be balanced in each individual context. When done properly, BI has always been about the business and not about the technology – the technology is merely the means to the end of helping the business make better decisions with better information.

And in this context, dataflows can be part of the solution, or they can be part of the problem. Like other self-service technologies, dataflows present capabilities that can be misused, and which can introduce inconsistencies and duplication across an organization’s data estate, but their design helps mitigate the entropy that self-service approaches introduce into the system. When used as part of a managed approach to governed self-service, dataflows can help remove ad hoc ETL processes, or move them into a context where IT oversight and governance is easier.

Of course, this is a very optimistic conclusion for me to reach. What I’m describing above is what organizations can do if they use dataflows in a well thought out way. It’s not something that can be taken for granted. You need to work for it. And that’s probably the most important thing to keep in mind when evaluating dataflows or any self-service tool: no tool is a silver bullet.

In my mind[6] both of the comments that inspired this post have at their root an important point in the context of positioning Power BI dataflows: you need to choose the correct tool and implement it in the correct manner in order to be successful, and you need to evaluate tools against your requirements based on their capabilities, rather than based on any sales or marketing pitches.

The next time you see someone pitching dataflows as a silver bullet, please point them here. But at the same time, when you see organizations implementing dataflows as part of a managed and governed self-service BI… I’d like to hear about that too.

[1] I won’t repeat everything here, but you can go read the comments on the post yourself if you want to have the full context.

[2] I hope that translates well. In case it doesn’t, here’s a link:

[3] I include this qualification because SSBI can indeed be successful for individuals and teams without IT oversight and involvement.

[4] If you’ve ever had a business user or consulting client give you an Excel workbook with a dozen macros and/or hundreds of VLOOKUPs, you’ll know what I mean here.

[5] I recognize that I’m cherry-picking here, but I think this is an important point to make. The Kimball Group web site has 180 design tips, and they’re all worth reading.

[6] I emphasize here that this is my opinion, because I have asked neither Nimrod nor Darryll if this is what they actually meant, and I definitely do not want to falsely portray someone else’s intent. They can correct me as needed.

Power BI Dataflows and Slow Data Sources

One advantage of data warehouses is the ability to have data from multiple data sources available for analysis without being constrained by the availability and performance characteristics of those data sources. Although Power BI dataflows are not a replacement for data warehouses in a general sense, they can provide same value to a self-service BI solution.

I’ve been using Power Query for over five years, and one consistently recurring pattern is the need to work with slow data sources such as web pages or APIs. “Slow” might mean “takes 10 minutes to load” but it could also mean “takes 10 hours” or “takes 10 days.” In each of these cases, my overall productivity ended up being closely tied to the refresh duration for the slowest data source in my solution, and this often introduced significant delay and frustration into my life. So I looked for solutions…

In the early days of Power BI, before Power BI dataflows, I would sometimes work around this problem by defining a set of Power Query queries in an Excel workbook, configuring those queries to load data into worksheets rather than into the data model. I would then use that Excel workbook as a data source for my primary queries. Essentially I was using an Excel file as a self-service data mart, since there wasn’t any similar capability in Power BI at the time. Here’s an example of the queries in a Power BI solution that I built almost five years ago[1]:


These queries are the foundation of a solution I built to track my personal fitness, back when my exercise routine revolved around lifting weights. Back before I knew that people fought with longswords. I haven’t looked at it in years, but the organization shown in the screen shot above sure looks familiar.

There’s one folder for the queries extracting data from each source system. These queries are configured as “connection only” which means the data isn’t loaded directly into the data model – these queries serve as data sources for other queries. One of the folders for source queries references my Excel “staging area” workbook. As implied above, this other workbook contains data from sources that are too slow to include in the refresh of the primary solution workbook, as well as sources that update very infrequently, so that refreshing the data on a regular basis would be unwarranted and inefficient.

There’s also one folder for the queries that define the solution’s data model. These queries are loaded and represent a simple star schema that is friendly for analytics use cases.[2]

Does this pattern look familiar?

I hope it does. This is essentially the same pattern that I’ve included in almost every dataflows presentation I’ve delivered. This one:


Dataflows make this pattern much easier to implement and maintain.

Were I to build a similar solution today[3], I would factor it in the same way. Each folder in the Excel-based solution would translate into a Power BI dataflow, and the queries in the folders would translate into entities. By having each source as its own dataflow, I can independently configure refresh schedules based on the performance and data freshness characteristics of each source. And by having the data in dataflow entities, I can easily reuse that data in multiple Power BI solutions – and enable other users in my organization to do the same[4].

If I were using Power BI Premium, I would probably build the whole solution using dataflows and computed entities, and only load the star schema into Power BI Desktop.

If I were not using Power BI Premium, I would probably build everything but the star schema using dataflows, and build the star schema using queries in Power BI Desktop that get their data from these dataflows.[5] As if the dataflows were logically a data warehouse.

Which is sort of where we came in…[6]


[1] The astute reader will realize that this is back before Power BI existed as we know it today. It feels strange to share the details of such an old solution, but this goes to show that although new tools are always being introduced, old problem patterns never seem to go away.

[2] In September 2014 I did a presentation about this solution for the PASS Data Warehousing and Business Intelligence Virtual Chapter, and the recording is still online on YouTube. I have not watched it, and don’t plan on watching it, and you should be warned before you watch it, because if I remember correctly it includes shirtless progress pictures of me.

[3] I will not do this, but only because I don’t have similarly rich sources of swordfighting-related data.

[4] This sounds like I’m implying that I have a group of analysts working on my personal fitness, doesn’t it? That’s just weird, even to me.

[5] This is another “reuse without Premium” scenario, independent of the one introduced in my earlier Power BI Dataflows – Reuse without Premium post.

[6] Imagine my disappointment when this screen shot just didn’t fit into this post. If only there was some way I could find a reason to include it somewhere…