Are you building a BI house of cards?

Every few weeks I see someone asking about using Analysis Services as a data source for Power BI dataflows. Every time I hear this, I cringe, and then include advice something like this[1] in my response.

Using Analysis Services as a data source is an anti-pattern – a worst practice. It is not recommended, and any solution built using this pattern is likely to produce dissatisfied customers. Please strongly consider using other data sources, likely the data sources on which the AS model is built.


There are multiple reasons for this advice.


Some reasons are technical. Extraction of large volumes of data is not what an Analysis Services model is designed for. Performance for the ETL process is likely to be poor, and you’re likely end up with memory/caching issues on the Analysis Services server. Beyond this, AS models typically don’t include the IDs/surrogate keys that you need for data warehousing, so joining the AS data to other data sources will be problematic.[2]


For some specific examples and technical deep dives into how and why this is a bad idea, check out this excellent blog post from Shabnam Watson. The focus of the post is on SSAS memory settings, but it’s very applicable to the current discussion.


Some reasons for this advice are less technical, but no less important. Using analytics models as data sources for ETL processing are a strong code smell[3] (“any characteristic in the source code of a program that possibly indicates a deeper problem”) for business intelligence solutions.


Let’s look at a simple and familiar diagram:


01 good


There’s a reason this left-to-right flow is the standard representation of BI applications: it’s what works. Each component has specific roles and responsibilities that complement each other, and which are aligned with the technology used to implement the component. This diagram includes a set of logical “tiers” or “layers” that are common in analytics systems, and which mutually support each other to achieve the systems’ goals.
Although there are many successful variations on this theme, they all tend to have this general flow and these general layers. Consider this one, for example:


02 ok

This example has more complexity, but also has the same end-to-end flow as the simple one. This is pretty typical for  scenarios where a single data warehouse and analytics model won’t fulfill all requirements, so the individual data warehouses, data marts, and analytics models each contain a portion – often an overlapping portion – of the analytics data.

Let’s look at one more:

03 - trending badly

This design is starting to smell. The increased complexity and blurring of responsibilities will produce difficulties in data freshness and maintenance. The additional dependencies, and the redundant and overlapping nature of the dependencies means that any future changes will require additional investigation and care to ensure that there are no unintended side effects to the existing functionality.

As an aside, my decades of working in data and analytics suggest that this care will rarely actually be taken. Instead, this architecture will be fragile and prone to problems, and the teams that built it will not be the teams who solve those problems.

And then we have this one[4]:

04 - hard no

This is what you get when you use Analysis Services as the data source for ETL processing, whether that ETL and downstream storage is implemented in Power BI dataflows or different technologies. And this is probably the best case you’re likely to get when you go down this path. Even with just two data warehouses and two analytics models in the diagram, the complex and unnatural dependencies are obvious, and are painful to consider.

What would be better here?[5] As mentioned at the top of the post, the logical alternative is to avoid using the analytics model and to instead use the same sources that the analytics model already uses. This may require some refactoring to ensure that the duplication of logic is minimized. It may require some political or cross-team effort to get buy-in from the owners of the upstream systems. It may not be simple, or easy. But it is almost always the right thing to do.

Don’t take shortcuts to save days or weeks today that will cause you or your successors months or years to undo and repair. Don’t build a house of cards, because with each new card you add, the house is more and more likely to fall.

Update: The post above focused mainly on technical aspects of the anti-pattern, and suggests alternative recommended patterns to follow instead. It does not focus on the reasons why so many projects are pushed into the anti-pattern in the first place. Those reasons are almost always based on human – not technical – factors.

You should read this post next: It presents a delightful and succinct approach to deal with the root causes, and will put the post you just read in a different context.

[1] Something a lot like this. I copied this from a response I sent a few days ago.

[2] Many thanks to Chris Webb for some of the information I’ve paraphrased here. If you want to hear more from Chris on this subject, check out this session recording from PASS Summit 2017. The whole session is excellent; the information most relevant to this subject begins around the 26 minute mark in the recording. Chris also gets credit for pointing me to Shabnam Watson’s blog.

[3] I learned about code smells last year when I attended a session by Felienne Hermans at Craft Conference in Budapest. You can watch the session here. And you really should, because it’s really good.

[4] My eyes are itching just looking at it. It took an effort of will to create this diagram, much less share it.

[5] Yes, just about anything would be better.

Quick Tip: Power Query column by example

It seems like everyone knows about how Power Query in Power BI Desktop lets you use web pages as data sources, and lets you build queries by entering values from the web page. This is probably because so many people use the “from web page by example” feature for demos[1].

It also seems like no one knows about the “column from examples” feature in Power Query, even though this is just as exciting, and much more useful. They’re both built on the same underlying intelligence, but this one lets you work with data from any source.

Here’s the quick overview:

  1. In the Power Query editor in Power BI Desktop, choose “Column from Examples” from the “Add Column” tab.
  2. Enter the values that the new column should have for rows that are already in your data set.
  3. Review the values that Power Query is suggesting for the other rows, and when they are all correct, choose OK, and then say “Ooooooohhhhh” when Power Query does all the work for you.

Like this[2]:

column by example

In this example, I want to extract an ID value from a URL column where the URLs all follow the same pattern. I’ve copied the ID value from the first row, and when I paste it in to the new column, Power Query immediately generates the M code to extract the ID and adds a new step the query.

= Table.AddColumn(#”Renamed Columns”, “Text After Delimiter”, each Text.AfterDelimiter([Band Page URL], “/”, 4), type text)

This code isn’t too complex. You could have written this code. I could have written it, but odds are it would have taken me a few minutes to look up the syntax for the Text.AfterDelimiter function, and to iterate a few times getting it right. Using this awesome feature I get the same code in seconds. For more complex scenarios you may need to enter more than one desired value before Power Query has enough information to get the function just right.

The best part is that the step that is added is just another query step. You can delete it, you can edit it, you can add more steps after it. This is just a fast and simple way to add the step without needing to write any code or even know what functions to use.

There have been a few occasions where Power Query couldn’t quite figure out what I needed it to do, but it hasn’t happened often, or lately. This may be because I’ve been getting better, but it’s probably because the Power Query team has continued to improve the feature since it was introduced in April 2017. That’s right – this feature has been available for over two years, and you’re just learning about it now.

Or are you[3]?

Either way, check it out the next time you’re working in Power Query, and let me know what you think.

[1] To be fair, it does make for an awesome demo.

[2] There also used to be a GIF here, but apparently my GIF-making skills need some refinement before I start using GIFs in blog posts.

[3] When I demo Power Query in Power BI Desktop to customers, maybe 10% have already seen this feature already. I suspect that the high quality demographic that this blog is certain to attract will be more aware and more engaged than the general population, so I’m genuinely curious to know if you already knew about this gem.

Power BI dataflows enhanced compute engine

This month the Power BI dataflows team announced an exciting new capability available in preview for dataflows in workspaces backed by Power BI Premium dedicated capacity[1]. This compute engine builds on the Excel-like capabilities delivered through linked and computed entities to dramatically increase refresh performance for dataflow entities that get their data from other dataflow entities.

This post won’t go deep into the technical details – I’ll defer to the folks who built the feature to share that – but it will share important highlights to answer the most common questions I’ve seen since the preview was announced.

Let’s begin with a quick review. Consider this typical scenario:

excel white

  1. Data is loaded into entities in the “Ingest from Dynamics Sales” and “Product Telemetry in Azure” dataflows from data sources external to Power BI.
  2. The “Clean and enrich sales data” and “Add Telemetry Customer Attributes” reference these dataflows as their sources, and the “Final Business View” dataflow references them.
  3. The first two dataflows are configured for scheduled refresh. The other three dataflows will automatically refresh[2] to keep the data in their entities up to date.

But what is actually happening in this scenario?

The boxes and arrows in this type of diagram could represent any number of technologies, but for Power BI dataflows the key technology is the Mashup Engine. The Mashup Engine is the runtime engine behind Power Query – it takes the Power Query “M” script and performs the extract, transform, and load (ETL) processes that the script defines. The Mashup Engine runs in Excel, in Power BI Desktop, in the Power BI on-premises data gateway, and in the Power BI service, and it is a cornerstone for data connectivity in Power BI.

A key characteristic of Power Query and the Mashup Engine is query folding. Query folding is a process by which the Mashup Engine offloads the heavy lifting of the data transformation to the data source, as enabled by the capabilities of the data source. For example, if a Power Query user built a query that extracted three columns from one table in a SQL Server database, joined them with four columns from another table in the same database, and then groups the join results by one of the columns. The Mashup Engine will automatically send the appropriate JOIN and GROUP BY statements to the server, and the server will return the results. If the same query is executed against two folders full of CSV files[4], there is no “server” to perform the required join and grouping operations, so the Mashup Engine performs them in memory.

To summarize this example, the query is largely the same, and the Mashup Engine uses query folding to choose the best location[5] for the required compute operations to be performed.

This brings us back to dataflows, CDM folders, and why the new advanced compute engine is exciting. When a Power BI dataflow is refreshed, the data produced by the Mashup Engine’s execution of the dataflow entities’ queries is stored in Azure in the CDM folder format, which uses CSV files in folders for the storage. This means that when one entity references another entity, it is accessing CSV data and is using the Mashup Engine for its processing[6]. This can result in slow refresh performance for some linked entity scenarios when the data volume or transformations can’t be optimized by the Mashup Engine.

The new advanced compute engine addresses this problem by loading the CDM folder data into an Azure SQL DB-based cache, and using query folding against this cache. This can produce significant performance improvements for the downstream refresh of computed entities. If we consider again the scenario shown above, the new advanced compute engine will change the underlying processing without requiring any change to the dataflows or entities themselves.

  1. When data is loaded into entities in the dataflows, the data is also loaded into the SQL-based cache.
  2. This may increase the refresh times for the “Ingest from Dynamics Sales” and “Product Telemetry in Azure” dataflows, because the caching is an additional operation not otherwise required.
  3. When the “Clean and enrich sales data” and “Add Telemetry Customer Attributes” and “Final Business View” dataflows refresh, the Mashup Engine can use query folding against the SQL-based cache. This can result in dramatic improvements in performance for the refresh of computed entities.

If you want to know more, you should watch this session from the Microsoft Business Applications Summit and read this blog post after you watch the session. Both of these excellent resources are from the architect responsible for much of the technical underpinnings of Power BI, and both go into more technical detail than I’ve gone into here.

As mentioned above, this post is designed to address some questions I’ve heard since this new preview capability was announced. Please let me know if this post is helpful, or if you have any questions of your own that aren’t answered here.

[1] Yes – this is a Premium-only feature.

[2] In case you didn’t click through already, this is why we call this behavior “Excel-like.” The Power BI service maintains the dependencies between dataflow entities like Excel maintains dependencies between cells and ranges.

[3] If you want to know more about query folding, check out this awesome post from MVP Matt Allington.

[4] This is one of the aspects of Power Query that are most exciting for me, as only the “source” function in the query would need to be changed. This data source agnostic nature of Power Query is enabled by query folding.

[5] Or locations, because a single query can combine data from multiple data sources.

[6] This is also why dataflows are an import-only data source in Power BI Desktop.


Power BI dataflows and CDM Sessions from MBAS 2019

Last week Microsoft held its annual Microsoft Business Applications Summit (MBAS) event in Atlanta. This two-day technical conference covers the whole Business Applications platform – including Dynamics, PowerApps, and Flow – and not just Power BI, but there was a ton of great Power BI content to be had. Now that the event is over, the session recordings and resources are available to everyone.

MBAS 2019 Banner

There’s a dedicated page on the Power BI community site with all of the sessions, but I wanted to call out a few sessions on dataflows and the Common Data Model that readers of this blog should probably watch[1].

Power BI dataflows sessions

Microsoft Power BI: Democratizing self-service data prep with dataflows

This session is something of a “deep technical introduction” to dataflows in Power BI. If you’re already familiar with dataflows a lot of this will be a repeat, but there are some gems as well.

Microsoft Power BI: Enterprise-grade BI with Power BI dataflows

This session is probably my favorite dataflows session from any conference. This is a deep dive into the dataflows architecture, including the brand-new-in-preview compute engine for performance and scale.

Common Data Model sessions

As you know, Power BI dataflows build on CDM and CDM folders. As you probably know, CDM isn’t just about Power BI – it’s a major area of investment across Azure data services as well. The session lineup at MBAS reflected this importance with three dedicated CDM sessions.

Common Data Model: All you need to know

This ironically-named session[2] provides a comprehensive overview of CDM. It’s not really everything you need, but it’s the right place to begin if you’re new to CDM and want to the big-picture view.

Microsoft Power BI: Common Data Model and Azure Data Services

This session covers how CDM and CDM folders are used in Power BI and Azure data services. If you’ve been following dataflows and CDM closely over the past six months much of this session might be review, but it’s an excellent “deep overview” nonetheless.

Microsoft Power BI: Advanced concepts in the Common Data Model

This session is probably the single best resource on CDM available today. The presenters are the key technical team behind CDM, and goes into details and concepts that aren’t available in any other presentation I’ve found. I’ve been following CDM pretty closely for the past year or more, and I learned a lot from this session. You probably will too.

Once you’re done watching these sessions, remember that there’s a huge library of technical sessions you can watch on-demand. Also some less-technical sessions.

[1] I have a list of a dozen or more sessions that I want to watch, and only a few of them are dataflows-centric. If you look through the catalog you’ll likely find some unexpected gems.

[2] If this is all you need to know, why do we have these other two sessions?

[3] Including Jeff Bernhardt, the architect behind CDM. Jeff doesn’t have the rock star reputation he deserves, but he’s been instrumental in the design and implementation of many of the products and services on which I’ve built my career. Any time Jeff is talking, I make a point to listen closely.

Power BI Dataflows and Additional Data Sources

A few recent posts have touched on different aspects of how dataflows have capabilities that aren’t exposed in the current preview user experience in Power Query Online. This is true of Power Query “M” functions, and of using custom functions, and it is also true of data sources.

Not every data source that is supported in Power BI Desktop is supported today in dataflows, but these data sources should work[1]:

  • SAP Business Warehouse
  • Azure Analysis Services
  • Google Analytics
  • Adobe Analytics
  • ODBC
  • OLE DB
  • Folder
  • SharePoint Online folder
  • SharePoint folder
  • Hadoop HDFS
  • Azure HDInsight (HDFS)
  • Hadoop file HDFS
  • Informix (beta)


For any of these data sources, you should be able to build a query in Power BI Desktop, copy it and paste it into Power Query Online and have it work, even though the data sources are not yet listed in the UX.

Let’s look at an example using the Folder source, both because the Folder data source is not yet supported in the Power Query Online user experience, and because of how to relates to yesterday’s post on using custom functions[2]. We’ll begin by choosing a folder that contains our data files.



Once we’ve connected to the folder, we’ll edit the query to load the data from all CSV files that the folder contains, and then combine the contents of the files.



Once we’re done, Power Query in Power BI Desktop will give us this:


There are three points worth mentioning in this image:

  1. The data itself – this is probably what grabs your attention, because it’s big and data
  2. The set of queries built by Power BI Desktop includes multiple queries and a custom function
  3. The main query loaded into the PBIX data model references the other query and the custom function

Now let’s take it all from Power BI Desktop and use it in a dataflow. As covered in the post on authoring Power BI dataflows in Power BI Desktop, right-clicking on the query and choosing “Copy” will copy all of the script we need. Like this:

// Demo File Source
Source = Folder.Files("C:\Demo File Source"),
#"Filtered Rows" = Table.SelectRows(Source, each Text.EndsWith([Name], ".csv")),
#"Filtered Hidden Files1" = Table.SelectRows(#"Filtered Rows", each [Attributes]?[Hidden]? <> true),
#"Invoke Custom Function1" = Table.AddColumn(#"Filtered Hidden Files1", "Transform File from Demo File Source", each #"Transform File from Demo File Source"([Content])),
#"Renamed Columns1" = Table.RenameColumns(#"Invoke Custom Function1", {"Name", "Source.Name"}),
#"Removed Other Columns1" = Table.SelectColumns(#"Renamed Columns1", {"Source.Name", "Transform File from Demo File Source"}),
#"Expanded Table Column1" = Table.ExpandTableColumn(#"Removed Other Columns1", "Transform File from Demo File Source", Table.ColumnNames(#"Transform File from Demo File Source"(#"Sample File"))),
#"Changed Type" = Table.TransformColumnTypes(#"Expanded Table Column1",{{"Source.Name", type text}, {"Date received", type date}, {"Product", type text}, {"Sub-product", type text}, {"Issue", type text}, {"Sub-issue", type text}, {"Consumer complaint narrative", type text}, {"Company public response", type text}, {"Company", type text}, {"State", type text}, {"ZIP code", type text}, {"Tags", type text}, {"Consumer consent provided?", type text}, {"Submitted via", type text}, {"Date sent to company", type date}, {"Company response to consumer", type text}, {"Timely response?", type text}, {"Consumer disputed?", type text}, {"Complaint ID", Int64.Type}})
#"Changed Type"

// Transform File from Demo File Source
Source = (#"Sample File Parameter1") => let
Source = Csv.Document(#"Sample File Parameter1",[Delimiter=",", Columns=18, Encoding=1252, QuoteStyle=QuoteStyle.Csv]),
#"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true])
#"Promoted Headers"

// Sample File
Source = Folder.Files("C:\Demo File Source"),
#"Filtered Rows" = Table.SelectRows(Source, each Text.EndsWith([Name], ".csv")),
Navigation1 = #"Filtered Rows"{0}[Content]

Isn’t that sweet?

We can then paste each of these queries into the “Blank query” template in Power Query Online. Power Query Online is smart enough to automatically prompt for a gateway and credentials. The only edit we’ll need to make is to remove the one line that’s highlighted in red[3], and then… it just works.


There’s certainly a little irony involved in using the phrase “it just works” after going through all of these steps… but it does work.

While Power BI dataflows and Power Query Online are in preview, there are likely to be some rough edges. But because dataflows are built on the same Power Query function language and connectivity platform, which means that in many scenarios you can use dataflows for tasks that are not yet fully supported in the interface.

During preview there are still functions and data sources that aren’t yet supported in dataflows, but this technique can unlock data sources and capabilities that aren’t yet exposed through the UI. Check it out.

[1] As the rest of the blog makes clear, I have not yet tested all of these. If you try any of them and they don’t work, remember this: Getting Help for Power BI Dataflows.

[2] But mainly because I don’t have any of the other listed data sources handy to use.

[3] Because this is not yet supported in dataflows.

Using Custom Functions in Power BI Dataflows

You can use custom Power Query “M” functions in Power BI dataflows, even though they’re not exposed and supported in the preview Power Query Online editor to the same extent they are supported in Power BI Desktop.[1]

As mentioned in a recent post on Authoring Power BI Dataflows in Power BI Desktop, the Power Query “M” queries that define your dataflow entities can contain a lot more than what can be created in Power Query Online. One example of this is support for custom functions in a dataflow. Functions work the same way in dataflows as they work in Power Query Desktop – there’s just not the same UX support.

Let’s see how this works. Specifically, let’s build a dataflow that contains a custom function and which invokes it in one of the dataflow entities. Here’s what we’ll do:

  1. We’ll define a custom function that accepts start and end dates, and returns a table with one row for each day between these dates. Specifically, we’ll use the date dimension approach that Matt Masson published five years ago[2], when Power Query functions were new.
  2. We’ll pull in sales order data from the SalesOrderHeader table in the AdventureWorks sample database to be an “Order” entity in the dataflow.
  3. We’ll use the min and max of the various date columns in the SalesOrderHeader table to get the parameter values to pass into the custom function. We’ll then call the custom function to build a Date entity in the dataflow.
  4. We’ll close our eyes and imagine doing the rest of the work to load other entities in the dataflow to make what we’d need to build a full star schema in a Power BI dataset, but we won’t actually do the work.

Let’s go. Since we’re just copying the code from Matt‘s blog, we’ll skip the code here, but the result in Power Query Online is worth looking at.


Even though Power Query Online doesn’t have a dedicated “create function” option, it does recognize when a query is a function, and does include a familiar UX for working with a function. You will, however, need to clear the “Enable load” option for the query, since a function can’t be loaded directly.

The Order entity is super simple – we’re just pulling in a table and removing the complex columns that Power Query adds to represent related tables in the database. Here’s the script:

Source = Sql.Database("", "adventureworks"),
SalesLT_SalesOrderHeader = Source{[Schema = "SalesLT", Item = "SalesOrderHeader"]}[Data],
#"Removed columns" = Table.RemoveColumns(SalesLT_SalesOrderHeader, {"SalesLT.Address(BillToAddressID)", "SalesLT.Address(ShipToAddressID)", "SalesLT.Customer", "SalesLT.SalesOrderDetail"})
#"Removed columns"

Now we need to put the two of them together. Let’s begin by duplicating the Order entity. If we referenced the Order entity instead of duplicating it, we would end up with a computed entity, which would require Power BI Premium capacity to refresh.

This is what the query looks like before we invoke the custom function. With all of the awesome options on the “Add Column” tab in Power BI Desktop, implementing this logic was surprisingly easy.

Source = Sql.Database("", "adventureworks"),
SalesLT_SalesOrderHeader = Source{[Schema="SalesLT",Item="SalesOrderHeader"]}[Data],
#"Removed Other Columns" = Table.SelectColumns(SalesLT_SalesOrderHeader,{"OrderDate", "DueDate", "ShipDate"}),
#"Added Custom" = Table.AddColumn(#"Removed Other Columns", "Custom", each "Group by all"),
#"Grouped Rows" = Table.Group(#"Added Custom", {"Custom"}, {{"Min Order Date", each List.Min([OrderDate]), type datetime}, {"Max Order Date", each List.Max([OrderDate]), type datetime}, {"Min Due Date", each List.Min([DueDate]), type datetime}, {"Max Due Date", each List.Max([DueDate]), type datetime}, {"Min Ship Date", each List.Min([ShipDate]), type datetime}, {"Max Ship Date", each List.Max([ShipDate]), type datetime}}),
#"Changed Type" = Table.TransformColumnTypes(#"Grouped Rows",{{"Min Order Date", type date}, {"Max Order Date", type date}, {"Min Due Date", type date}, {"Max Due Date", type date}, {"Min Ship Date", type date}, {"Max Ship Date", type date}}),
#"Inserted Earliest" = Table.AddColumn(#"Changed Type", "Min Date", each List.Min({[Min Order Date], [Min Due Date], [Min Ship Date]}), type date),
#"Inserted Latest" = Table.AddColumn(#"Inserted Earliest", "Max Date", each List.Max({[Max Order Date], [Max Due Date], [Max Ship Date]}), type date),
#"Removed Other Columns1" = Table.SelectColumns(#"Inserted Latest",{"Min Date", "Max Date"})
#"Removed Other Columns1"

At the end of this interim query, we have two columns to pass in to the custom function. And once we do, it looks like this:


And here’s the final script used to define the Date entity.

Source = Sql.Database("", "adventureworks"),
SalesLT_SalesOrderHeader = Source{[Schema="SalesLT",Item="SalesOrderHeader"]}[Data],
#"Removed Other Columns" = Table.SelectColumns(SalesLT_SalesOrderHeader,{"OrderDate", "DueDate", "ShipDate"}),
#"Added Custom" = Table.AddColumn(#"Removed Other Columns", "Custom", each "Group by all"),
#"Grouped Rows" = Table.Group(#"Added Custom", {"Custom"}, {{"Min Order Date", each List.Min([OrderDate]), type datetime}, {"Max Order Date", each List.Max([OrderDate]), type datetime}, {"Min Due Date", each List.Min([DueDate]), type datetime}, {"Max Due Date", each List.Max([DueDate]), type datetime}, {"Min Ship Date", each List.Min([ShipDate]), type datetime}, {"Max Ship Date", each List.Max([ShipDate]), type datetime}}),
#"Changed Type" = Table.TransformColumnTypes(#"Grouped Rows",{{"Min Order Date", type date}, {"Max Order Date", type date}, {"Min Due Date", type date}, {"Max Due Date", type date}, {"Min Ship Date", type date}, {"Max Ship Date", type date}}),
#"Inserted Earliest" = Table.AddColumn(#"Changed Type", "Min Date", each List.Min({[Min Order Date], [Min Due Date], [Min Ship Date]}), type date),
#"Inserted Latest" = Table.AddColumn(#"Inserted Earliest", "Max Date", each List.Max({[Max Order Date], [Max Due Date], [Max Ship Date]}), type date),
#"Removed Other Columns1" = Table.SelectColumns(#"Inserted Latest",{"Min Date", "Max Date"}),
#"Invoked Custom Function" = Table.AddColumn(#"Removed Other Columns1", "fn_DateTable", each fn_DateTable([Min Date], [Max Date], null)),
fn_DateTable1 = #"Invoked Custom Function"{0}[fn_DateTable],
#"Changed Type1" = Table.TransformColumnTypes(fn_DateTable1,{{"Year", Int64.Type}, {"QuarterOfYear", Int64.Type}, {"MonthOfYear", Int64.Type}, {"DayOfMonth", Int64.Type}, {"DateInt", Int64.Type}, {"DayInWeek", Int64.Type}, {"MonthName", type text}, {"MonthInCalendar", type text}, {"QuarterInCalendar", type text}, {"DayOfWeekName", type text}})
#"Changed Type1"


Most of the complexity in this approach is in the work required to get min and max values from three columns in a single table. The topic of the post – calling a custom function inside a dataflow entity definition – is trivial.

When we’re done, the list of entities only shows Order and Date, because these are the only two queries that are being loaded into the dataflow’s CDM folder storage. But the definition of the Date query includes the use of a custom function, which allows us to have rich and possibly complex functionality included in the dataflow code, and referenced by one or more entities as necessary.

[1] I was inspired to write this post when I saw this idea on If this capability is obscure enough to get its own feature request and over a dozen votes, it probably justifies a blog post.

[2] And the same code, copied and pasted, like real developers do.

Authoring Power BI Dataflows in Power BI Desktop

That title got your attention, didn’t it?

Before we move on, no, you cannot create and publish dataflow entities from Power BI Desktop today. Creating dataflows is a task you need to perform in the browser. But you can build your queries in Power BI Desktop if that is your preferred query authoring tool. Here’s how.[1]

  1. Create a query or set of queries in Power BI Desktop.
  2. Copy the query script(s) from the Power BI Desktop query editor, and paste it into a “Blank query” in Power Query Online.
  3. Rename the new queries to match your desired entity names, being careful to match the names of the source queries if there are any references between them.
  4. If necessary, disable the “load” option for queries that only implement shared logic and should not be persisted in the dataflow’s CDM folder.

That’s it.

Some of you may be asking “but why would I want to do this, when there’s already an awesome query authoring experience in the browser?”

Good question! There are three reasons why I will often use this technique:

  1. I prefer rich, non-browser-based editing tools[2], and Power BI Desktop has a polished and refined UX.
  2. The online editor doesn’t have all of the transformations in its UI compared to Power BI Desktop.
  3. The online editor doesn’t have all supported connectors exposed in the UI.

Each of these points relates to the the maturity of Power BI Desktop as a tool[3], as opposed to the relatively new Power Query Online. Power Query Online is part of the dataflows preview and is continuing to improve and expand in functionality, but Power BI Desktop has been generally available for years.

And Although I didn’t realize it until I started writing this post, Power BI Desktop actually has features that make this scenario easier than expected. Let’s look at this example. Here are the queries I’m starting with:


In the PBIX file I have three sets of queries, organized by folder:

  1. A query that references the one data source that is used in the other queries, so I can change the connection information in one place and have everything else update.
  2. Three queries that each contain minimally-transformed data from the data source, and which are not loaded into the data model.
  3. Two queries that are loaded into the data model and which are used directly for reporting.

This is a common pattern for my PBIX files. I don’t know if it counts as a best practice (especially now that Power BI has better support for parameterization than it did when I started doing things this way) but it works for me, and nicely illustrates the approach I’m following.

To move this query logic out of my PBIX file and into a dataflow definition, I first need to copy the query scripts. Power BI Desktop makes this incredibly easy. When I right-click on any folder and choose “copy”, Power BI Desktop places the scripts for all queries in the workbook – including the query names as comments – on the clipboard.


Now that I have all of the query scripts in a text editor, I can get started by creating a new dataflow, and selecting “Blank query” for my first entity.


After I paste in the query, I will un-select the “Enable load” option, and will paste in the query name as well.


Once this is done, I can repeat the process by selecting “Get data” option in the query editor, and choosing “Blank query” for each additional query.


After I repeat this process for each remaining query, my dataflow will look something like this.


And if I want to, I can even add groups to organize the queries.


This looks a lot like where we started, which is both a good thing and a bad thing. The good side is that it demonstrates how we can use a more mature and authoring experience for our interactive query development. The bad side is that it introduces additional steps into our workflow.

I expect the integration between dataflows and Desktop to only get better over time[4], but for today, there’s still an easy path if you want to use Power BI Desktop to author your dataflow entities.

As a closing word of caution, please be aware that not all data sources and functions that work in Power BI Desktop will work in dataflows. If you’re using a data source for the first time, or are using a new function[5], you’ll probably want to test things early to avoid a potentially unpleasant surprise later on.


[1] I’ve mentioned this technique in a few other posts, but I’ve heard a bunch of questions in recent days that makes me believe that the word isn’t getting out. Hopefully this post will help.

[2] I’m writing this post from the WordPress Windows app – even though the app offers nothing that the web editor does not, and actually appears to be the thinnest of wrappers around the web editor.

[3] And they all relate to the fact that Power BI Desktop is just so awesome, nothing compares to it, and although the Power Query Online editor is useful and good, it hasn’t had a team making it better every month, year after year.

[4] Please remember that this is my personal blog, and that even though I’m a member of the Power BI team, I’m not working on either dataflows or Desktop, so what I expect and what will actually happen aren’t always well aligned.

[5] Like Html.Table, which is not yet supported in dataflows. Darn it.

Power BI Dataflows and Slow Data Sources

One advantage of data warehouses is the ability to have data from multiple data sources available for analysis without being constrained by the availability and performance characteristics of those data sources. Although Power BI dataflows are not a replacement for data warehouses in a general sense, they can provide same value to a self-service BI solution.

I’ve been using Power Query for over five years, and one consistently recurring pattern is the need to work with slow data sources such as web pages or APIs. “Slow” might mean “takes 10 minutes to load” but it could also mean “takes 10 hours” or “takes 10 days.” In each of these cases, my overall productivity ended up being closely tied to the refresh duration for the slowest data source in my solution, and this often introduced significant delay and frustration into my life. So I looked for solutions…

In the early days of Power BI, before Power BI dataflows, I would sometimes work around this problem by defining a set of Power Query queries in an Excel workbook, configuring those queries to load data into worksheets rather than into the data model. I would then use that Excel workbook as a data source for my primary queries. Essentially I was using an Excel file as a self-service data mart, since there wasn’t any similar capability in Power BI at the time. Here’s an example of the queries in a Power BI solution that I built almost five years ago[1]:


These queries are the foundation of a solution I built to track my personal fitness, back when my exercise routine revolved around lifting weights. Back before I knew that people fought with longswords. I haven’t looked at it in years, but the organization shown in the screen shot above sure looks familiar.

There’s one folder for the queries extracting data from each source system. These queries are configured as “connection only” which means the data isn’t loaded directly into the data model – these queries serve as data sources for other queries. One of the folders for source queries references my Excel “staging area” workbook. As implied above, this other workbook contains data from sources that are too slow to include in the refresh of the primary solution workbook, as well as sources that update very infrequently, so that refreshing the data on a regular basis would be unwarranted and inefficient.

There’s also one folder for the queries that define the solution’s data model. These queries are loaded and represent a simple star schema that is friendly for analytics use cases.[2]

Does this pattern look familiar?

I hope it does. This is essentially the same pattern that I’ve included in almost every dataflows presentation I’ve delivered. This one:


Dataflows make this pattern much easier to implement and maintain.

Were I to build a similar solution today[3], I would factor it in the same way. Each folder in the Excel-based solution would translate into a Power BI dataflow, and the queries in the folders would translate into entities. By having each source as its own dataflow, I can independently configure refresh schedules based on the performance and data freshness characteristics of each source. And by having the data in dataflow entities, I can easily reuse that data in multiple Power BI solutions – and enable other users in my organization to do the same[4].

If I were using Power BI Premium, I would probably build the whole solution using dataflows and computed entities, and only load the star schema into Power BI Desktop.

If I were not using Power BI Premium, I would probably build everything but the star schema using dataflows, and build the star schema using queries in Power BI Desktop that get their data from these dataflows.[5] As if the dataflows were logically a data warehouse.

Which is sort of where we came in…[6]


[1] The astute reader will realize that this is back before Power BI existed as we know it today. It feels strange to share the details of such an old solution, but this goes to show that although new tools are always being introduced, old problem patterns never seem to go away.

[2] In September 2014 I did a presentation about this solution for the PASS Data Warehousing and Business Intelligence Virtual Chapter, and the recording is still online on YouTube. I have not watched it, and don’t plan on watching it, and you should be warned before you watch it, because if I remember correctly it includes shirtless progress pictures of me.

[3] I will not do this, but only because I don’t have similarly rich sources of swordfighting-related data.

[4] This sounds like I’m implying that I have a group of analysts working on my personal fitness, doesn’t it? That’s just weird, even to me.

[5] This is another “reuse without Premium” scenario, independent of the one introduced in my earlier Power BI Dataflows – Reuse without Premium post.

[6] Imagine my disappointment when this screen shot just didn’t fit into this post. If only there was some way I could find a reason to include it somewhere…


Power BI Dataflows FAQ

Photo by Matthew Brodeur on Unsplash

Q: What are Power BI dataflows?

A: Dataflows are a capability in Power BI for self-service ETL and data preparation that enable analysts and business users to define and share reusable data entities. Each dataflow is created in a Power BI workspace and can contain one or more entities. Each entity is defined by a Power Query “M” query. When the dataflow is refreshed, the queries are executed, and the entities are populated with data.

Q: Where is the data stored?

A: Data is stored in Azure storage  in the CDM folder format. Each dataflow is saved in a folder in the data lake. The folder contains one or more files per entity. If an entity does not use incremental refresh, there will be one file for the entity’s data. For entities that do use incremental refresh, there will be multiple files based on the refresh settings. The folder also contains a model.json file that has all of the metadata for the dataflow and the entities.

Q: Do I need to pay for Power BI dataflows?

A: Yes, but you don’t need to pay extra for them. Dataflows are available to Power BI Pro and Premium users.

Q: Do I need Power BI Premium to use dataflows?

A: No. Although some specific features (incremental refresh of dataflow entities, linked/computed entities, the enhanced compute engine) do require Premium, dataflows are not a Premium-only capability.

Q: Do dataflows support incremental refresh?

A: Yes. Incremental refresh can be configured on a per-entity basis. Incremental refresh is supported only in Power BI Premium.

Q: Can I use on-premises data sources with dataflows?

A: Yes. Dataflows use the same gateways used by Power BI datasets to access on-premises data sources.

Q: How do I do X with dataflows? I can do it in a query in Power BI Desktop, but I don’t see it in the dataflows query editor UI!

A: Most Power Query functionality is available in dataflows, even if it isn’t exposed through the query editor in the browser. If you have a query that works in Power BI Desktop, copy the query and paste it into Power Query Online. In most cases it will work.

Q: Do I still need a data warehouse if I use dataflows?

A: If you needed a data warehouse before Power BI dataflows, you probably still need a data warehouse. Although dataflows serve a similar logical function as a data warehouse or data mart, modern data warehouse platforms provide capabilities that dataflows do not.

Q: Do I need dataflows if I already have a data warehouse?

A: Dataflows fill a gap in data warehousing and BI tools by allowing business users and analysts to prepare and share data without needing help from IT. With dataflows, users can build a “self service data mart” in Power BI that can be used in their solutions. Because each dataflow entity is defined by a Power Query “M” query, handing off the definitions to an IT team for operationalization/industrialization is more straightforward.

Q: Do dataflows replace Azure Data Factory?

A: No. Azure Data Factory (ADF) is a hybrid data integration platform designed to support enterprise-scale ETL and data integration needs. ADF is designed for use by professional data engineers. Power BI dataflows are designed for use by analysts and business users – people familiar with the Power Query experience from Power BI Desktop and Excel – to load data into ADLSg2.

Q: With the Wrangling Data Flow in Azure Data Factory do we still need Dataflows in Power BI?

A: Probably. Power BI dataflows are a self-service data preparation tool that enable analysts and other business users who may not be comfortable using SSIS or ADF to solve data prep problems without IT involvement. This is still relevant even after ADF now includes Power Query with Wrangling Data Flows.

Q: Can I use dataflows for realtime / streaming data?

A: No. Dataflows are for batch data, not streaming data.

Q: Do dataflows replace Power BI datasets?

A: No. Power BI datasets are tabular analytic models that contain data from various sources. Power BI dataflows can be some or all of the sources used by a dataset. You cannot build a Power BI report directly against a dataflow – you need to build reports against datasets.

Q: How can I use the data in a dataflow?

A: No. Oh, wait, that doesn’t make sense – this wasn’t even a yes or no question, but I was on a roll… Anyway, you use the data in a dataflow by connecting to the Power BI dataflows connector in Power Query. This will give you a list of all workspaces, dataflows, and entities that you have permission to access, and you can use them like any other data source.

Q: Can I connect to dataflows via Direct Query?

A: Yes. When using the dataflows enhanced compute engine, you can connect to dataflows using import or DirectQuery. Otherwise, dataflows are an import-only data source.

Q: Can I use the data in dataflows in one workspace from other workspaces?

A: Yes! You can import entities from any combination of workspaces and dataflows in your PBIX file and publish it to any workspace where you have the necessary permissions.

Q: Can I use the data in a dataflow from tools other than Power BI?

A: Yes. You can configure your Power BI workspace to store dataflow data in an Azure Data Lake Storage gen2 resource that is part of your Azure subscription. Once this is done, refreshing the dataflow will create files in the CDM Folder in the location you specify. The files can then be consumed by other Azure services and applications.

Q: Why do dataflows use CSV files? Why not a cooler file format like Parquet or Avro?

A: The dataflows whitepaper answers this one, but it’s still a frequently asked question. From the whitepaper: “CSV format is the most ubiquitously supported format in Azure Data Lake and data lake tools in general, and CSV is generally the fastest and simplest to write for data producers.” You should probably read the whole thing, and not just this excerpt, because later on it says that Avro and Parquet will also be supported.

Q: Is this an official blog or official FAQ?

A: No, no. Absolutely not. Oh my goodness no. This is my personal blog, and I always suspect the dataflows team cringes when they read it. Also I don’t update it quite as often as I should.

Power Query: Connecting to Excel Workbooks in SharePoint Online

Update September 2020: This post gets so many hits every day I decided to make a video. You should watch it. It is short.

Original post:

I try to avoid using Excel as a data source when I can. Although Excel has many benefits[1], its limitations[2] make it less than ideal for enterprise BI solutions. Despite this, it feels like I end up needing to analyze data in Excel at least once a month. And every time I’m connecting from Power Query to an Excel workbook located in SharePoint Online or OneDrive for Business, I find myself exclaiming “it shouldn’t be this hard!”[3]

What’s the problem? It’s not working with the data – Power Query makes this as painless as you’d expect. The problem is referencing the XSLX file. What path do I use?

In Excel and in SharePoint, there is a “copy link” option, which will copy a link to the workbook so you can share it with others.

doesn't work

It always feels like this should work, that I should be able to paste this URL into the Get Data experience in Power Query, and if it’s been more than a few months since the last time I tried, I expect it to work. But it never does.

This works:

this works

In the Excel ribbon, click on the File tab, and on the Info screen that appears click on the label under the file name. There’s no visual indicator that this label is a menu button until you hover over it, but when you click on it, a menu appears with a “Copy path to clipboard” option.

Update May 2020: Since this post was originally published, the Excel team has made this option a little easier to find. Your Excel options may look something like this:


Clicking “Copy path” will give you a path in this format:

With this starting point, all you need to do is remove the ?web=1 query string parameter at the end of the URL, and you have the path to the workbook that Power Query needs.

Choose “Web” as the data source in the Get Data menu, and paste this URL into the Dialog:

From web

Power Query is smart enough to understand that this URL is pointing to an Excel workbook, so from this point on, you’ll get an experience where you can select tables and worksheets, and can start manipulating and analyzing your Excel data.


Life is good.

Full disclosure: Were it not for Power Query program manager Miguel Llopis, I never would have found this. During one of our dataflows sessions at Microsoft Ignite, Miguel spent a few minutes walking through the these steps at the beginning of a demo. My initial reaction was along the lines of “Why are you showing this?  This isn’t a dataflows topic!” but when I realized what he was showing it quickly changed to “Holy crap – I didn’t know you could do that!”

In any event, I’m working with data in Excel again today, and spent a few minutes trying to find this technique documented anywhere. When I couldn’t find any, this post was born… but all credit should go to Miguel.


[1] None of which will I list here. It’s not that kind of blog.

[2] Neither will the limitations be enumerated. Feel free to take my word for it, or go build a data warehouse or two that extract data from Excel, and then come back and let me know what you think.

[3] To be fair, this may be another of those “get off my lawn” moments. It’s hard to tell sometimes.