This month the Power BI dataflows team announced an exciting new capability available in preview for dataflows in workspaces backed by Power BI Premium dedicated capacity. This compute engine builds on the Excel-like capabilities delivered through linked and computed entities to dramatically increase refresh performance for dataflow entities that get their data from other dataflow entities.
This post won’t go deep into the technical details – I’ll defer to the folks who built the feature to share that – but it will share important highlights to answer the most common questions I’ve seen since the preview was announced.
Let’s begin with a quick review. Consider this typical scenario:
- Data is loaded into entities in the “Ingest from Dynamics Sales” and “Product Telemetry in Azure” dataflows from data sources external to Power BI.
- The “Clean and enrich sales data” and “Add Telemetry Customer Attributes” reference these dataflows as their sources, and the “Final Business View” dataflow references them.
- The first two dataflows are configured for scheduled refresh. The other three dataflows will automatically refresh to keep the data in their entities up to date.
But what is actually happening in this scenario?
The boxes and arrows in this type of diagram could represent any number of technologies, but for Power BI dataflows the key technology is the Mashup Engine. The Mashup Engine is the runtime engine behind Power Query – it takes the Power Query “M” script and performs the extract, transform, and load (ETL) processes that the script defines. The Mashup Engine runs in Excel, in Power BI Desktop, in the Power BI on-premises data gateway, and in the Power BI service, and it is a cornerstone for data connectivity in Power BI.
A key characteristic of Power Query and the Mashup Engine is query folding. Query folding is a process by which the Mashup Engine offloads the heavy lifting of the data transformation to the data source, as enabled by the capabilities of the data source. For example, if a Power Query user built a query that extracted three columns from one table in a SQL Server database, joined them with four columns from another table in the same database, and then groups the join results by one of the columns. The Mashup Engine will automatically send the appropriate JOIN and GROUP BY statements to the server, and the server will return the results. If the same query is executed against two folders full of CSV files, there is no “server” to perform the required join and grouping operations, so the Mashup Engine performs them in memory.
To summarize this example, the query is largely the same, and the Mashup Engine uses query folding to choose the best location for the required compute operations to be performed.
This brings us back to dataflows, CDM folders, and why the new advanced compute engine is exciting. When a Power BI dataflow is refreshed, the data produced by the Mashup Engine’s execution of the dataflow entities’ queries is stored in Azure in the CDM folder format, which uses CSV files in folders for the storage. This means that when one entity references another entity, it is accessing CSV data and is using the Mashup Engine for its processing. This can result in slow refresh performance for some linked entity scenarios when the data volume or transformations can’t be optimized by the Mashup Engine.
The new advanced compute engine addresses this problem by loading the CDM folder data into an Azure SQL DB-based cache, and using query folding against this cache. This can produce significant performance improvements for the downstream refresh of computed entities. If we consider again the scenario shown above, the new advanced compute engine will change the underlying processing without requiring any change to the dataflows or entities themselves.
- When data is loaded into entities in the dataflows, the data is also loaded into the SQL-based cache.
- This may increase the refresh times for the “Ingest from Dynamics Sales” and “Product Telemetry in Azure” dataflows, because the caching is an additional operation not otherwise required.
- When the “Clean and enrich sales data” and “Add Telemetry Customer Attributes” and “Final Business View” dataflows refresh, the Mashup Engine can use query folding against the SQL-based cache. This can result in dramatic improvements in performance for the refresh of computed entities.
If you want to know more, you should watch this session from the Microsoft Business Applications Summit and read this blog post after you watch the session. Both of these excellent resources are from the architect responsible for much of the technical underpinnings of Power BI, and both go into more technical detail than I’ve gone into here.
As mentioned above, this post is designed to address some questions I’ve heard since this new preview capability was announced. Please let me know if this post is helpful, or if you have any questions of your own that aren’t answered here.
 Yes – this is a Premium-only feature.
 In case you didn’t click through already, this is why we call this behavior “Excel-like.” The Power BI service maintains the dependencies between dataflow entities like Excel maintains dependencies between cells and ranges.
 If you want to know more about query folding, check out this awesome post from MVP Matt Allington.
 This is one of the aspects of Power Query that are most exciting for me, as only the “source” function in the query would need to be changed. This data source agnostic nature of Power Query is enabled by query folding.
 Or locations, because a single query can combine data from multiple data sources.
 This is also why dataflows are an import-only data source in Power BI Desktop.