Power BIte: Dataflows vs. datasets

This is still one of the most common dataflows questions: what’s the difference between Power BI dataflows and Power BI datasets?

For the last year I have resisted tackling this question head-on. This isn’t because it’s a bad or “dumb” question. Just the opposite – this is a very simple question, and the simpler a question is, the more complex and nuanced the answer is likely to be.

[Image: see how complex this is? A graphical representation of the answer’s likely complexity.]

If you’re a regular reader of this blog, you probably already know the answer, because I’ve answered it already. Sort of. The existing answer is distributed across dozens of posts, and if you’ve read all of them, you’ve probably picked up the answer along the way. But I keep hearing this question, and I keep thinking that there must be a more direct answer I could share.

Here it is, in a single, simple table[1].

|                                   | Power BI dataflows                        | Power BI datasets                |
|-----------------------------------|-------------------------------------------|----------------------------------|
| Implementation                    | CDM folder                                | Analysis Services tabular model  |
| Storage[2]                        | CSV files                                 | Vertipaq                         |
| Metadata[3]                       | Common Data Model – model.json            | BISM                             |
| Development                       | Power Query Online                        | Power Query in Power BI Desktop  |
| Primary purpose[4]                | Data reuse                                | Data analysis                    |
| Reuse[5]                          | Acts as data source in multiple datasets  | Shared datasets across workspaces |
| Scope of reuse[6]                 | Entity-level reuse                        | Dataset-level reuse              |
| Mashup with other data sources[7] | Yes                                       | No                               |
| Used for reporting[8]             | Not directly                              | Yes                              |
| Reuse outside Power BI[9]         | Yes, through ADLSg2                       | Yes, through XMLA                |
| Data access methods[10]           | Import                                    | Import, DirectQuery              |
| Connection methods[11]            | Import                                    | Live Connection                  |
| Row-level security                | No                                        | Yes                              |
| Certification and promotion       | Not yet                                   | Yes                              |

What else am I missing? Please let me know! Seriously, you should let me know.

Update: I’ve added a few rows to the table after the post was originally published, to incorporate feedback from readers on differences I had missed. Thank you!

Each of the rows in this table could easily be an in-depth topic in and of itself, so if you’re looking at any of them and thinking “that’s not quite right” I might very well agree with you. There’s a lot of context and a lot of nuance here, and we’re trying to sum things up in a word or two… which is kind of the whole point.

Oh yeah, there’s a video too.[12]

I can’t wait to hear what you think!


[1] A simple table with ten footnotes.

[2] The storage aspect of dataflows and datasets is one of the most significant differences between the two. Datasets use the Vertipaq column store to load data into a highly compressed in-memory representation that is optimized for analysis. Dataflows use text files in folders, which are optimized for interoperability.

[3] The Analysis Services Tabular engine uses the BI Semantic Model (BISM) to represent its metadata. This is a metadata model originally included in SQL Server 2012 Analysis Services, and used by the Tabular engine ever since.

[4] Saying “this is the primary purpose” of any complex tool is fraught with risk, because no matter what you say, there are other valid things that remain unsaid. With this said… the big gap that dataflows close is that of self-service data preparation for the purpose of data sharing and reuse. Power BI has always had self-service data preparation through Power Query, but before dataflows the prepared data was “locked” in a dataset – available for analysis, but not for sharing or reuse.

[5] Once you have loaded data into dataflows, authorized users can reuse entities from multiple dataflows, and use them as the building blocks for new dataflows or new datasets. Once you have loaded data into a dataset (and published it to the Power BI service) you can enable users to connect to it.

[6] With dataflows, users can pick and choose the entities they want, but a dataset can only be reused as-is.

[7] Dataflow entities can be used as data sources in the same Power BI Desktop file as other data sources, and can serve as part of a mashup or composite model, but a dataset can only be reused as-is.

[8] Although you can obviously use dataflows for reporting, you do so by first importing the data from the dataflow into a dataset.

[9] It’s interesting to point out that using your own organizational ADLSg2 account does not require Power BI Premium, but using the XMLA endpoint to connect to Power BI datasets from non-Power BI clients does.

[10] You can only import data into your dataflow entities, but tables in your dataset can import data or use DirectQuery, and a dataset can use a combination of the two.

[11] You can only import data from a dataflow into a dataset. When connecting to a shared dataset you can only use Live Connections.

[12] I’ve been thinking of making videos to supplement this blog for almost as long as I’ve been hearing the question that inspired this post. Please take a moment to share your thoughts on the video. This is something of a “soft launch” and although I have plans for a few dozen more videos already, your feedback will be a main factor in how the video series evolves.

Power BI dataflows and query folding

In a recent post I mentioned an approach for working around the import-only nature of Power BI dataflows as a data source in Power BI Desktop, and in an older post I shared information about the enhanced compute engine that’s currently available in preview.

[Image by Chris Martin from Pixabay]

Some recent conversations have led me to believe that I should summarize a few points about dataflows and query folding, because these existing posts don’t make them easy to find and understand.

  1. When accessing dataflow entities from Power BI Desktop, no query folding takes place, even if the enhanced compute engine is enabled (see the sketch below).
  2. When accessing dataflow entities from other entities in the Power BI service, no query folding takes place unless the enhanced compute engine is enabled.
  3. When accessing dataflow entities from other entities in the Power BI service, query folding will take place when the enhanced compute engine is enabled, because the linked entity’s query will be executed against the cached data in SQL, rather than the underlying CDM folder.

These three statements summarize how query folding works – or does not work – in Power BI dataflows today.
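
To make the first point concrete, here’s a minimal sketch of what a query against a dataflow entity might look like in Power BI Desktop. The workspace and dataflow IDs, the entity name, and the column name are all placeholders, and the navigation steps are approximately what the dataflows connector generates for you – the part that matters is the comment on the final step.

```
let
    // Connect to the dataflows the current user can access
    Source = PowerBI.Dataflows(null),

    // Navigate to a workspace, dataflow, and entity (placeholder IDs and names)
    Workspace = Source{[workspaceId = "00000000-0000-0000-0000-000000000000"]}[Data],
    Dataflow  = Workspace{[dataflowId = "00000000-0000-0000-0000-000000000000"]}[Data],
    Sales     = Dataflow{[entity = "Sales"]}[Data],

    // This filter is NOT folded back to the dataflow: the full entity is downloaded
    // from the CDM folder, and the filter is applied locally by the mashup engine.
    RecentSales = Table.SelectRows(Sales, each [OrderDate] >= #date(2019, 1, 1))
in
    RecentSales
```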

The Power BI team has discussed some of their plans for the enhanced compute engine, so this should change in the future[1], but as of today the only dataflows scenario where query folding takes place is when a dataflow that is backed by the enhanced compute engine is referenced by a linked entity.

I hope this helps clarify things, at least a little…


[1] I think this will be around the time the engine goes GA, but I don’t remember for sure, and I’m too lazy to re-watch the MBAS session to double check. If you watch it and let me know, I’ll gladly update this post with the details.

Quick Tip: Developing with large dataflows

I received today what is becoming a common question:

I have a customer who is using dataflows to populate a 255M row fact table, refreshing the last 2 months of data (which is working) but experiencing issues when trying to connect to that Dataflow using Power BI Desktop. Suggestions?

Power BI dataflows are an import-only data source when being used from Power BI Desktop. This means that if you’re building a model in Power BI Desktop and are using dataflows as a source, all of the data from the entities you select will be downloaded from Azure and loaded into your data model.

All of it.

[Image by Daniel Kirsch from Pixabay. There’s no subtext – I just searched Pixabay for “large” and this was the result I liked best.]

To exacerbate the challenge, dataflows – and the CDM folders on which they’re built – do not support query folding, so even if you filter the records in your query, that filter will be applied in Power BI Desktop, not at the sources.

If one of these entities includes very wide records and/or a very large number of records, this can result in a slow and frustrating experience in Power BI Desktop, and in some circumstances can cause the data load to fail. So what can you do?

My preferred approach is to use a pattern I’ve long used when working with other data sources[1] that don’t support DirectQuery or query folding: work with a subset of the data in Power BI Desktop, and then work with the full data set after you publish to the Power BI service.

For dataflows, it looks like this:

  1. Develop and test the dataflows in the Power BI service
  2. Add filters to the too-large dataflow entities’ queries[2] to reduce the number of records in the entities (see the sketch after this list)
  3. Refresh the dataflow
  4. Develop the model and reports in Power BI Desktop
  5. Publish the model and reports to the Power BI service
  6. Remove the filters from the filtered dataflow entities
  7. Refresh the dataflow
  8. Refresh the dataset
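
Here’s a minimal sketch of what step 2 might look like inside the dataflow, assuming a SQL Server source and a FactSales entity with an OrderDate column. All of the names here are illustrative, not taken from the scenario in the question.

```
let
    // Placeholder source: a large fact table in a SQL Server data warehouse
    Source    = Sql.Database("sqlserver.example.com", "SalesDW"),
    FactSales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],

    // Temporary development filter: keep only the most recent month of data while
    // building the model and reports in Power BI Desktop (step 2 above).
    // Remove this step and refresh the dataflow after publishing (steps 6 and 7).
    DevFilter = Table.SelectRows(
        FactSales,
        each Date.From([OrderDate]) >= Date.AddMonths(Date.From(DateTime.LocalNow()), -1))
in
    DevFilter
```

Because the filter lives in the dataflow entity itself, it is applied when the dataflow refreshes, so Power BI Desktop only ever downloads the reduced data.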

The dataflows team announced at MBAS in June that the enhanced compute engine currently in preview will enable DirectQuery access and query folding when using dataflows from Power BI Desktop at some point – but that point is not today. Until then, I hope this quick pattern will help…


[1] Like large folders full of large files.

[2] This is typically just the one or two largest entities, not all of them.

Quick Tip: Factoring your dataflow entities

This post started as a response to this question from Mark, who was commenting on last week’s data lineage post:

How would you decide how big or how small to make each artifact in the lineage, in terms of the amount of transformations taking place inside the artifact? In my case they would only be shared with 2-3 other users.

For instance I could go all out and have every step that would previously take place in a query editor result in a new link in the data lineage chain, but that would probably be overkill.

I agree that “one step per dataflow” would be overkill, but beyond that the answer is largely “it depends.”

[Image by Esi Grünhagen from Pixabay]

The approach I generally take is to break the end-to-end data preparation down into blocks that look like this (a sketch in Power Query terms follows the list):

  1. Staging – getting the source data into the system (in this case a dataflow, but it could be a data mart, data warehouse, data lake, etc.) with zero or minimal transformations
  2. Cleansing – correcting known data quality and format problems from the staged data
  3. Transformation 1 – getting the cleansed data into the shape required for intended downstream purposes
  4. Enrichment – adding data from other sources, which have ideally already gone through steps 1 through 3
  5. Transformation 2 – getting the cleansed and enriched data into the shape required for analysis, typically as dimensions and facts
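
To make this a little more concrete, here’s a rough sketch of how the first two blocks might look as separate dataflow entities, with the cleansing entity referencing the staged data. The source location, column names, and transformations are all illustrative.

```
// Entity: Staging Sales - land the source data with zero or minimal transformation
let
    Source   = Csv.Document(Web.Contents("https://example.com/exports/sales.csv")),
    Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true])
in
    Promoted

// Entity: Cleansed Sales - reference the staged data and correct known quality problems
let
    Source  = #"Staging Sales",
    Typed   = Table.TransformColumnTypes(Source, {{"OrderDate", type date}, {"Amount", type number}}),
    NonNull = Table.SelectRows(Typed, each [OrderDate] <> null and [Amount] <> null)
in
    NonNull
```

In the Power BI service, an entity that references another entity in the same dataflow like this becomes a computed entity, and referencing an entity from a different dataflow creates a linked entity – both of which require Premium capacity at the time of writing.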

The final step may also be performed in the queries that are used to create the final tabular model when creating a dataset in Power BI Desktop. If a given dimension is likely to be used in multiple datasets, implement it as a dataflow entity. If it isn’t, implement it as a table in your dataset.

These guidelines tend to create a moderate number of easily maintainable entities, but they’re obviously the bare minimum – take what works for you, and discard the rest.

I feel like I’m dating myself with this link[1], but I definitely recommend looking at the Kimball Group’s techniques and resources for data warehousing and BI. Ralph Kimball and his amazing team know more about this stuff than I will ever forget (or something like that) and there’s a huge volume of guidance available. Do yourself a favor and check it out.


[1] I assume there are newer resources out there, but when I was your age it was the Kimball Method or the… synonym for highway that rhymes with method.

Choosing the right tool for the job

A few weeks back[1] I got this comment on my Power BI dataflows overview post:

[Screenshot: the reader’s comment]

This morning I started to reply to the comment, and by the time I was done I realized that it should be a blog post on its own. So here we go… but before I continue I feel compelled to share this blurb from my blog bio:

I’m a program manager on the Power BI team at Microsoft. This is my personal blog, and all posts and opinions are mine and mine alone, and if you believe my teenage children they are not shared by anyone else in the entire world, because I’m weird and embarrassing.

So… There are a few capabilities that make dataflows “pop” for me. This shouldn’t be taken as a comprehensive list, and isn’t intended to say that dataflows are better than any other data prep tool, but hopefully it will be useful:

  • Power BI dataflows build on the Power Query experience many users are already familiar with. If you know how to use Power Query in one tool, you know how to use it in another.
  • Power BI dataflows are integrated into the end-to-end Power BI service and experience. You’re not pushing data into a CSV file. You don’t need to provision and manage a database. You get strongly typed “tables” of data that can be reused by you or by other users, and it’s all in one seamless experience.
  • Power BI dataflows and CDM folders provide capabilities for bridging the low-code/no-code world of self-service BI with managed central corporate BI in Azure.
  • Power BI dataflows enable Excel-like composition of ETL processes with linked and computed entities.
  • Power BI dataflows can scale beyond the desktop and leverage the power of the cloud to become part of an end-to-end BI application.

But… This is just a list of features.

The right tool for a job depends largely on the context of that job. If you’re trying to say that one tool is better than another, you need to have project/selection/evaluation criteria that everyone[2] agrees on. Only then can you compare multiple tools against those criteria. If you can’t do that, you’re probably just having a popularity contest.

This is one of the reasons why analysts like Gartner and Forrester play the role that they do – they define and document their criteria, and then do exhaustive research to evaluate tools against those criteria. They take great pains to make sure that their criteria align well with the needs of the industry. They evolve the criteria as the market evolves, and they update their analyses as products evolve.

If you take this type of approach, you’ll probably end up choosing the tool that’s right for the job at hand – even if it’s not the tool you had in mind when you started. It’s not always easy to convince everyone to step back and look at the big picture before making what may feel like a small decision, but when choosing tools and platforms it’s often a good way to save time and effort in the long run.


[1] July has been kicking my butt, even more so than usual this year. Not only have I not been blogging consistently, I’ve basically put everything on hold that wasn’t vitally important and critically urgent. Ugh.

[2] For a given value of “everyone.” Identifying the necessary and appropriate stakeholders is a big enough problem on its own…

Quick Tip: Implementing “IsNumeric” in Power Query

I’ve had many people tell me that they have a blog so they will have a place to find the information they need, and answers to their own future questions. This isn’t typically the case for me… but this post is an exception.

Every few months I find myself searching for an IsNumeric function[1] in Power Query. Each time I end up spending a few frustrating hours before I find the solution I’m looking for, because the search terms I’m using don’t match the way people who have already solved the problem have described it.

Last night, when I was once again failing to find the solution I needed, I reached out to Twitter for help, and Imke Feldman delivered.

If you’re not sure how you’d use this solution, consider the problem I was working on last night. My source data includes a “release-date” column that can include a wide range of values, and because I do not have any influence over the source system, I need to work with what it contains. Specifically, I need to find records that contain a year, and I need to extract that year into a new numeric column.

[Screenshot: sample release-date values from the source data]

There are a few more pattern variations in the data including two-digit years, but this sample shows the basic problem I need to solve. And in order to solve it, I need to be able to determine if the two characters at the end of the text represent a number.

Using Imke’s approach, this is what I ended up doing:

  1. Extract the last two characters from the source column
  2. When the extracted characters contain a number, convert them to numeric
  3. Standardize the resulting value to represent the correct year

This last step is specific to my particular “date as year encoded in different text formats” problem, but I’m including it here in case someone else may find it useful.

It looks like this:

[Screenshot: the resulting query]
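
If the screenshot isn’t enough to go on, here’s a rough sketch of the same logic in M, built around a minimal IsNumeric-style check. The “release-date” column name comes from my data; the query name, the new column name, and the two-digit-year pivot are illustrative.

```
let
    // Minimal IsNumeric-style helper: true if the text value converts to a number
    IsNumeric = (value as text) as logical => not (try Number.FromText(value))[HasError],

    // Placeholder for the query containing the "release-date" column
    Source = ReleaseDates,

    AddReleaseYear = Table.AddColumn(
        Source,
        "ReleaseYear",
        each
            let
                // 1. Extract the last two characters from the source column
                LastTwo = Text.End(Text.Trim([#"release-date"]), 2),
                // 2. When they represent a number, convert them to numeric
                Value   = if IsNumeric(LastTwo) then Number.FromText(LastTwo) else null,
                // 3. Standardize the value to a four-digit year (the pivot at 50 is illustrative)
                Year    = if Value = null then null
                          else if Value >= 50 then 1900 + Value
                          else 2000 + Value
            in
                Year,
        Int64.Type)
in
    AddReleaseYear
```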

Thanks to Imke for her quick response. Hopefully there are people out there other than me who will find this useful… but I know that I will.


[1] This is how I think of a function that accepts a text value and returns true if that value contains a number, and false if it does not. This is probably because of my decades-long love affair with Transact-SQL, since T-SQL has an ISNUMERIC function built in.

Are you building a BI house of cards?

Every few weeks I see someone asking about using Analysis Services as a data source for Power BI dataflows. Every time I hear this, I cringe, and then include advice like this[1] in my response.

Using Analysis Services as a data source is an anti-pattern – a worst practice. It is not recommended, and any solution built using this pattern is likely to produce dissatisfied customers. Please strongly consider using other data sources, likely the data sources on which the AS model is built.

There are multiple reasons for this advice.

Some reasons are technical. Extraction of large volumes of data is not what an Analysis Services model is designed for. Performance for the ETL process is likely to be poor, and you’re likely to end up with memory/caching issues on the Analysis Services server. Beyond this, AS models typically don’t include the IDs/surrogate keys that you need for data warehousing, so joining the AS data to other data sources will be problematic.[2]

For some specific examples and technical deep dives into how and why this is a bad idea, check out this excellent blog post from Shabnam Watson. The focus of the post is on SSAS memory settings, but it’s very applicable to the current discussion.

Some reasons for this advice are less technical, but no less important. Using analytics models as data sources for ETL processing is a strong code smell[3] (“any characteristic in the source code of a program that possibly indicates a deeper problem”) for business intelligence solutions.

Let’s look at a simple and familiar diagram:

[Diagram 1 – good: a simple left-to-right BI flow]

There’s a reason this left-to-right flow is the standard representation of BI applications: it’s what works. Each component has specific roles and responsibilities that complement each other, and which are aligned with the technology used to implement the component. This diagram includes a set of logical “tiers” or “layers” that are common in analytics systems, and which mutually support each other to achieve the systems’ goals.

Although there are many successful variations on this theme, they all tend to have this general flow and these general layers. Consider this one, for example:

[Diagram 2 – OK: a more complex variation with multiple data warehouses, data marts, and analytics models]

This example has more complexity, but also has the same end-to-end flow as the simple one. This is pretty typical for  scenarios where a single data warehouse and analytics model won’t fulfill all requirements, so the individual data warehouses, data marts, and analytics models each contain a portion – often an overlapping portion – of the analytics data.

Let’s look at one more:

[Diagram 3 – trending badly: redundant, overlapping dependencies creeping in]

This design is starting to smell. The increased complexity and blurring of responsibilities will produce difficulties in data freshness and maintenance. The additional dependencies, and the redundant and overlapping nature of the dependencies means that any future changes will require additional investigation and care to ensure that there are no unintended side effects to the existing functionality.

As an aside, my decades of working in data and analytics suggest that this care will rarely actually be taken. Instead, this architecture will be fragile and prone to problems, and the teams that built it will not be the teams who solve those problems.

And then we have this one[4]:

[Diagram 4 – hard no: analytics models used as data sources for ETL]

This is what you get when you use Analysis Services as the data source for ETL processing, whether that ETL and downstream storage is implemented in Power BI dataflows or different technologies. And this is probably the best case you’re likely to get when you go down this path. Even with just two data warehouses and two analytics models in the diagram, the complex and unnatural dependencies are obvious, and are painful to consider.

What would be better here?[5] As mentioned at the top of the post, the logical alternative is to avoid using the analytics model and to instead use the same sources that the analytics model already uses. This may require some refactoring to ensure that the duplication of logic is minimized. It may require some political or cross-team effort to get buy-in from the owners of the upstream systems. It may not be simple, or easy. But it is almost always the right thing to do.

Don’t take shortcuts to save days or weeks today that will cause you or your successors months or years to undo and repair. Don’t build a house of cards, because with each new card you add, the house is more and more likely to fall.

Update: The post above focused mainly on technical aspects of the anti-pattern, and suggests alternative recommended patterns to follow instead. It does not focus on the reasons why so many projects are pushed into the anti-pattern in the first place. Those reasons are almost always based on human – not technical – factors.

You should read this post next: http://workingwithdevs.com/its-always-a-people-problem/. It presents a delightful and succinct approach to deal with the root causes, and will put the post you just read in a different context.


[1] Something a lot like this. I copied this from a response I sent a few days ago.

[2] Many thanks to Chris Webb for some of the information I’ve paraphrased here. If you want to hear more from Chris on this subject, check out this session recording from PASS Summit 2017. The whole session is excellent; the information most relevant to this subject begins around the 26 minute mark in the recording. Chris also gets credit for pointing me to Shabnam Watson’s blog.

[3] I learned about code smells last year when I attended a session by Felienne Hermans at Craft Conference in Budapest. You can watch the session here. And you really should, because it’s really good.

[4] My eyes are itching just looking at it. It took an effort of will to create this diagram, much less share it.

[5] Yes, just about anything would be better.