Dataflows enhanced compute engine – will it fold?

I’ve been seeing more questions lately about the dataflows enhanced compute engine in Power BI. Although I published a video overview and although there is feature coverage in the Power BI documentation, there are a few questions that I haven’t seen readily answered online.

A lot of these questions can be phrased as “what happens when I turn on the enhanced compute engine for my Power BI Premium capacity?”

Most of my responses start with the phrase “it depends” – so let’s look at some of the factors the answer depends on.

First, let’s look at a set of connected dataflows in a Premium workspace:

This workspace has three dataflows we’re interested in:

  1. A “staging” dataflow that contains un- or minimally-transformed data from the external data source or sources
  2. A “cleansed” dataflow that uses linked and computed entities to apply data cleansing logic to the data in the staging dataflow
  3. A final dataflow that uses linked and computed entities to transform the cleansed data into a star schema for analysis

Next, let’s look at the settings for one of these dataflows:

For a given dataflow, the owner can configure the enhanced compute engine to be Off/Disabled, On/Enabled, or “Optimized” which means that the Power BI service will turn it on or off for entities in the dataflow depending on how each entity is used by other dataflows.

This dataflow-level setting, combined with how the dataflows are used, determines whether the enhanced compute engine is enabled on entities in the dataflow:

  • Disabled: The enhanced compute engine is not enabled for any entities in the dataflow
  • Optimized: The enhanced compute engine is enabled for any entity in the dataflow that is referenced by a linked entity in another dataflow
  • On: The enhanced compute engine is enabled for all entities in the dataflow

Important point: To connect to a dataflow using DirectQuery you must explicitly set the enhanced compute engine to “On” even if entities in the dataflow are being referenced by linked entities.

Now let’s put these two factors together, considering that all three dataflows in the first image are using the default setting: Optimized.

  1. The “staging” dataflow uses the enhanced compute engine for any entities that are referenced by linked entities in another dataflow
  2. The “cleansed” dataflow uses the enhanced compute engine for any entities that are referenced by linked entities in another dataflow
  3. The final dataflow does not use the enhanced compute engine because none of its entities is referenced by linked entities in another dataflow[2]

Now we’re ready to revisit the actual question we’re trying to answer[3]: “what happens when I turn on the enhanced compute engine for my Power BI Premium capacity?”

Once the enhanced compute engine is enabled in the Power BI Premium capacity settings, and the dataflow settings and configuration (as illustrated above) dictate that the engine is used for a given dataflow, this is what happens:

  • When the dataflow is refreshed, the Power Query for each entity is executed, and the output of the query is persisted in the dataflow’s CDM folder as CSV data and JSON metadata
  • For any entity for which the enhanced compute engine is enabled, the output of the entity’s Power Query is also loaded into a table in a SQL database instance managed by the Power BI service

This second bullet is where the magic happens. Because the data is now in a storage format that includes a compute engine, supported queries can use the SQL cache instead of the underlying CSV files, and get the increased performance that comes with query folding.

Having the data in SQL also means that the dataflow can server as a DirectQuery data source – without the enhanced compute engine a dataflow can only be used as an Import mode data source.

The next logical question is “what exactly do you mean by supported queries?”

These queries are supported, meaning that they will use the SQL data if the enhanced compute engine is enabled for a dataflow:

  • Dataflow refresh against a dataflow with the enhanced compute engine enabled – for example, the “Cleansed” dataflow or “Final” dataflow in the image above
  • Authoring in Power BI Desktop, when using DirectQuery mode
  • User activity in the BI service, when using DirectQuery mode

These queries are not supported, and will always use the CSV data even if the enhanced compute engine is enabled for a dataflow:

  • Editing a dataflow in Power Query Online
  • Authoring in Power BI Desktop, when using Import mode
  • Dataset refresh in the Power BI service, when using Import mode

The final question[4] is “what Power Query operations will actually fold against the SQL data and take full advantage of the compute capabilities?”

The answer is in this blog post from Cristian Petculescu, the architect of dataflows and much of Power BI. Cristian enumerates well over 100 M constructs and the SQL to which they fold so I’m not going to try to summarize them all here. Go take a look for yourself if you need more information than what’s in this post.

Was this helpful? Did you learn anything new?

If you have any more questions about the enhanced compute engine, please let me know!


[1] This pattern of separating dataflows based on the logical role of the data preparation logic they implement is a general best practice, in addition to aligning well with dataflows best practices.

[2] If we wanted the enhanced compute engine to be used for this final dataflow even though it is not referenced by any linked entities, we would need to change the setting from “Optimized” to “On.”

[3] Did you still remember what the question was? I copied it here because I’d forgotten and was concerned you might have too. It’s September 2020 and the world is on fire, which can make it surprisingly difficult to think…

[4] Which almost no one ever actually asks me, but which fits into the theme of the post, so I’m including it here for completeness.

5 thoughts on “Dataflows enhanced compute engine – will it fold?

  1. Pingback: Query Folding and the Power BI Dataflows Enhanced Compute Engine – Curated SQL

  2. Hananto

    Hi Matthew,

    thank you for this insightful information. Last time I read your article about DataFlows vs. Shared Dataset: https://ssbipolar.com/2019/10/14/power-bite-dataflows-vs-datasets/.

    Since my data source (AWS Redshift) has million of records and sadly does not support incremental refresh on the Dataset directly, I used an incremental refresh on the DataFlow. It worked fine as the query folding took place. Then I tried to set up an incremental refresh on the Dataset based on the DataFlow and the query is not folded because the data from the DataFlow is stored as flat file.

    I thought with DataFlow via DirectQuery (with enhanced compute engine) might solve this issue. However – as you mentioned in the article – “Dataset refresh in the Power BI service, when using Import mode” can’t utilize Query Folding.

    Is it somehow still possible to find a solution / workaround for this issue? or should I consider moving the data to a different data source which supports query folding ( + incremental refresh ) natively like SQL Server / SQL Synapse?

    Thanks a lot in advance

    Best regards,

    Hananto

    Like

    1. Hi Hananto – As of today there is no timeline I can share for when dataset refresh might be able to take advantage of the dataflows compute engine, you may want to look at a solution like Azure Synapse Analytics.

      Like

  3. Pingback: Power BI Dataflows FAQ – BI Polar

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s