Quick Tip: Developing with large dataflows

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

I received today what is becoming a common question:

I have a customer who is using dataflows to populate a 255M row fact table, refreshing the last 2 months of data (which is working) but experiencing issues when trying to connect to that Dataflow using Power BI Desktop. Suggestions?

Power BI dataflows are an import-only data source when being used from Power BI Desktop. This means that if you’re building a model in Power BI Desktop and are using dataflows as a source, all of the data from the entities you select will be downloaded from Azure and loaded into your data model.

All of it.

Image by Daniel Kirsch from Pixabay
There’s no subtext – I just searched Pixabay for “large” and this was the result I liked best.

To exacerbate the challenge, dataflows – and the CDM folders on which they’re built – do not support query folding, so even if you filter the records in your query, that filter will be applied in Power BI Desktop, not at the source.

If one of these entities includes very wide records and/or a very large number of records, this can result in a slow and frustrating experience in Power BI Desktop, and in some circumstances can cause the data load to fail. So what can you do?

My preferred approach is to use a pattern I’ve long used when working with other data sources[1] that don’t support DirectQuery or query folding: work with a subset of the data in Power BI Desktop, and then work with the full data set after you publish to the Power BI service.

For dataflows, it looks like this:

  1. Develop and test the dataflows in the Power BI service
  2. Add filters to the too-large dataflow entities’ queries[2] to reduce the number of records in the entities (see the sketch after this list)
  3. Refresh the dataflow
  4. Develop the model and reports in Power BI Desktop
  5. Publish the model and reports to the Power BI service
  6. Remove the filters from the filtered dataflow entities
  7. Refresh the dataflow
  8. Refresh the dataset
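
For illustration, here’s a minimal sketch of what the temporary filter in step 2 might look like in an entity’s query. The source, table, and column names here (the SQL server, FactSales, OrderDate) are placeholders I’ve made up for the example, and the filter shown is just one way to keep roughly the last two months of data:

let
    // Hypothetical source and table names; substitute your own.
    Source = Sql.Database("yourserver.database.windows.net", "SalesDb"),
    FactSales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
    // Temporary dev-only filter: keep roughly the last two months of rows.
    // Remove this step (step 6 above) before the full production refresh.
    DevFilter = Table.SelectRows(
        FactSales,
        each Date.From([OrderDate]) >= Date.AddMonths(Date.From(DateTime.LocalNow()), -2)
    )
in
    DevFilter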

The dataflows team announced at MBAS in June that the enhanced compute engine currently in preview will at some point enable DirectQuery access and query folding when using dataflows from Power BI Desktop – but that point is not today. Until then, I hope this quick pattern will help…


[1] Like large folders full of large files.

[2] This is typically just the one or two largest entities, not all of them.

17 thoughts on “Quick Tip: Developing with large dataflows”

  1. Pingback: Dataflows in Power BI – BI Polar

  2. Craig Bryden

    Thanks Matthew. This won’t work if the dataflow in question is already being used by other datasets that are now “Production”. Any ideas for that scenario?


    1. That’s definitely more complicated. For that you would probably need to build a dataflow with linked/computed entities that include the filters, develop in Desktop using those filtered entities as the source, and then manually edit the M script for the queries in Desktop before publishing to the service.

      The general approach is similar, but it’s definitely more step-intensive…
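
      As a rough sketch (with placeholder GUIDs and entity names, not the exact M the dataflows connector will generate for you), the hand edit before publishing might look something like this:

      let
          Source = PowerBI.Dataflows(null),
          Workspace = Source{[workspaceId = "00000000-0000-0000-0000-000000000000"]}[Data],
          Dataflow = Workspace{[dataflowId = "00000000-0000-0000-0000-000000000000"]}[Data],
          // During development this pointed at the filtered linked/computed entity:
          // FactTable = Dataflow{[entity = "FactTable Filtered"]}[Data],
          // Before publishing, switch it to the full entity:
          FactTable = Dataflow{[entity = "FactTable"]}[Data]
      in
          FactTable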


      1. Alex Dupler

        This seems like a great place to employ parameters. The M script to reference a dataflow is mostly GUIDs, but the last step is something like this:
        FactTable = #"{GUID}"{[entity="FactTable"]}[Data]

        If you replace “FactTable” with a parameter, you can switch between two different entities within the same Dataflow. You could probably even switch between workspaces and dataflows by parameterizing the GUIDs, but I’m not an expert in that.

        FactTable = #"{GUID}"{[entity=Parameter]}[Data]

        So what you’d do is have your Fact Table Dataflow contain both a production and a filtered Dev Fact Table, with a parameter that holds the name of either table. This makes switching the .pbix back and forth between the dev and the full fact table even easier.

        I feel like you should even be able to configure this parameter AFTER publication in the Power BI service so you never have to do a full local refresh. However, when I tested this, the parameter was greyed out to prevent editing in the service. Apparently you can only “edit some parameter settings in the Power BI service — not parameters that restrict the available data, but parameters that define and describe acceptable values.” https://docs.microsoft.com/en-us/power-bi/service-parameters#review-and-edit-parameters-in-power-bi-service
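
        As a sketch of what that parameterized query might look like (EntityName stands for a text parameter defined in the .pbix, and the GUIDs are placeholders):

        let
            Source = PowerBI.Dataflows(null),
            Workspace = Source{[workspaceId = "00000000-0000-0000-0000-000000000000"]}[Data],
            Dataflow = Workspace{[dataflowId = "00000000-0000-0000-0000-000000000000"]}[Data],
            // EntityName is a text parameter holding either "FactTable" or "FactTable Dev".
            FactTable = Dataflow{[entity = EntityName]}[Data]
        in
            FactTable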


  3. jeffshieldsdev

    I’m just getting started with Dataflows in Power BI Premium. In addition to my “SourceData_Ingestion” Dataflows, I create a non-linked Dataflow with simply:

    let
        Source = #"SourceData_Ingestion",
        #"Kept top rows" = Table.FirstN(Source, 10000)
    in
        #"Kept top rows"

    And save as “SourceData_Ingestion_Sample” to enable easier modeling in Power BI Desktop.

    Users use this Dataflow when exploring or modeling, and then simply remove “_Sample” from the Power Query M when they’re ready to leverage the whole set.


  4. René Imthorn

    Matthew,

    If you have the output of dataflows in a dedicated area in ADLSgen2, a model.json file with metadata is created to link to the latest snapshot of your data.

    Is it possible to access this latest data via the JSON metadata file with Azure Databricks or Data Factory, to do some ML or data science with it?

    René Imthorn


  5. Rick

    Matthew, this is a great approach. But sometimes when you edit a dataflow to remove the filters, an error message appears saying that the data refresh took more than 2 hours, and the refresh fails. Is there any workaround for that too?
    Thank you in advance!


  6. Pingback: Power BI dataflows and query folding – BI Polar

  7. I didn’t catch, when listening to the MBAS video, that PBI Desktop was also going to be supported, but this was good news. It then looks like PBI Dataflows are to be backed by Spark clusters the same way as Azure Data Factory – any confirmation available on this?


  8. Pingback: Power BI dataflows November updates – BI Polar
