Last week I saw this scenario cross my screen:
I’m trying to use dataflows with DirectQuery. I’ve pushed all the data into a dataflow, around 1.5 billion rows, and am trying to DirectQuery all the data. Performance is quite slow.
There’s a lot here to unpack, but I want to start with this:
As covered in this recent post on the dataflows enhanced compute engine, it is possible to use dataflows in Power BI as a DirectQuery data source. But… it’s probably not going to deliver awesome performance when you’re querying billions of rows, over and over, every time a user interacts with visuals on published reports.
Many of the conversations I have that involve DirectQuery expose a lack of awareness of what DirectQuery actually does. Here’s a quote from the Power BI docs (emphasis is mine):
DirectQuery mode is an alternative to Import mode. Models developed in DirectQuery mode don’t import data. Instead, they consist only of metadata defining the model structure. When the model is queried, native queries are used to retrieve data from the underlying data source.
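The distinction in that quote can be sketched in a few lines of code. This is purely illustrative pseudo-architecture, not anything from the Power BI implementation; the class names, the `total_sales` measure, and the `run_native_query` callable are all hypothetical:

```python
# Illustrative sketch only -- not Power BI internals.

class ImportModel:
    """Import mode: data is copied into the in-memory engine at refresh time."""
    def __init__(self, source_rows):
        self.cache = list(source_rows)  # refresh copies the data into the model

    def total_sales(self, region):
        # Queries are answered from the in-memory cache; the source isn't touched.
        return sum(r["amount"] for r in self.cache if r["region"] == region)


class DirectQueryModel:
    """DirectQuery mode: only metadata is stored; every query hits the source."""
    def __init__(self, run_native_query):
        self.run_native_query = run_native_query  # callable into the source database

    def total_sales(self, region):
        # Each visual interaction becomes a native query at the source.
        # (Naive string building -- fine for a sketch, never for real code.)
        sql = f"SELECT SUM(amount) FROM sales WHERE region = '{region}'"
        return self.run_native_query(sql)
```

The point of the sketch: in import mode the cost is paid once at refresh, while in DirectQuery mode the source database pays it again on every interaction, which is why the source's own performance characteristics matter so much.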
Here’s something that isn’t in the article, but which I believe probably should be:
DirectQuery often seems like an attractive option because it leaves the data in the database where it already is, and doesn’t require importing the data into the Tabular engine in the Power BI service. It can feel like you’re avoiding the effort and complexity that can come with designing a high-performance tabular model, but what you’re actually doing is moving the complexity and effort somewhere else.
When you’re designing a tabular model that will have a small data volume and/or a small number of users, you can probably get away without paying too much attention to best practices. But as your data and user base grow, you need to invest effort and expertise into designing and implementing a model that will scale and perform as needed.
The same factors are true for scaling with DirectQuery – it’s just a different set of effort and expertise that needs to be acquired and applied. My colleague Kasper de Jonge wrote at length about this a few years ago, and his post links to other, deeper, sources as well. At the risk of oversimplifying, there are two key factors that you need to take into consideration when using DirectQuery:
- Database design – the schema accessed via DirectQuery needs to be optimized for the types of queries it will process
- Index design – the database needs appropriate indexes to ensure that most common queries are fulfilled by data that’s already in memory, without the IO cost of hitting a disk
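The second bullet is easy to see in miniature. Here's a small SQLite example (SQLite only for convenience; the real tuning work is specific to whatever database actually backs your DirectQuery source, and the table and index names are made up). Without an index matched to the common filter, the engine scans the whole table; with one, it can seek:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("west", 1.0), ("east", 2.0)] * 1000)

query = "SELECT SUM(amount) FROM sales WHERE region = ?"

# Before: the plan detail typically reports a full table scan.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query, ("west",)).fetchall()
print(plan_before)

# An index matched to the common filter lets the engine seek instead of scan.
conn.execute("CREATE INDEX ix_sales_region ON sales (region)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query, ("west",)).fetchall()
print(plan_after)
```

Trivial here, but on billions of rows the difference between a scan and a seek is the difference between a report that renders and one that times out.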
These factors fall broadly into the category of performance tuning and optimization, and the specific steps and details will depend on the data source. SQL Server, Oracle, SAP HANA… each database platform has its own nuances. There are lots of resources available online, like this awesome post from Brent Ozar… but if you pay attention you’ll see this 3-page post is mainly just a collection of links to other more detailed articles, and that it probably represents hundreds of pages of content. And this is just one resource for one database.
Using DirectQuery at scale means you need to be an expert on the database where your data lives, and you need to put that expertise to work tuning and optimizing the database. If you don’t do this, you cannot expect DirectQuery to perform well as your data volume and number of users grow.
At this point you may be asking what any of this has to do with dataflows in Power BI. The answer, of course, is everything.
The dataflows enhanced compute engine improves performance for multiple scenarios by loading dataflow entity data into a SQL-based cache. Queries against the dataflow can then be fulfilled by reading from the cache instead of reading from CDM folders in Azure storage – this enables both query folding and DirectQuery.
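"Query folding" here means that transformation steps are translated into a single native query the cache can execute, rather than being applied row by row after fetching everything. A rough sketch of the idea, with an entirely hypothetical `fold_query` helper (this is not how the compute engine actually builds queries):

```python
# Hypothetical sketch of query folding: a list of transformation steps is
# collapsed ("folded") into one native SQL statement for the source to run.

def fold_query(table, steps):
    """Fold (op, arg) steps into a single SQL statement -- illustrative only."""
    where, select = [], "*"
    for op, arg in steps:
        if op == "filter":
            where.append(arg)           # e.g. "region = 'west'" folds into WHERE
        elif op == "sum":
            select = f"SUM({arg})"      # aggregation folds into the SELECT list
    sql = f"SELECT {select} FROM {table}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql
```

Folding a filter and an aggregation this way means the source returns one number instead of every row, which is exactly why a SQL-based cache makes DirectQuery over dataflows feasible at all.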
One thing that the enhanced compute engine does not provide is direct access to the SQL database that’s doing the caching. Not only can you not define your own indexes, you also can’t use any of the SQL Server tools or techniques that Brent’s performance tuning post calls out. The implementation of the compute engine is a black box, and it does not provide any mechanism for monitoring, tuning, or optimization.
This means that while the enhanced compute engine is a simple solution to common problems, expecting dataflows DirectQuery over billions of rows to perform well is likely to end in the same disappointment as expecting DirectQuery to perform well at scale over any other un-optimized database.
None of this is to say that DirectQuery – including DirectQuery over dataflows – can’t be part of a scalable BI solution. Power BI and Azure provide capabilities like composite models and aggregations, as well as the recently-announced Power BI performance accelerator for Azure Synapse Analytics, each of which is a valuable addition to your modern BI toolkit.
But if you expect to flip a switch and have awesome performance with DirectQuery against dataflows with billions of records…
1. I know there’s a typo in the meme. It should be driving me crazy, but for some reason it fits.
2. Yes, I just linked to the SQLBI home page, rather than to a specific article. This is the point – it’s complicated, and there is a body of knowledge that needs to be acquired and applied. If you and your team don’t have that expertise, you either need to develop it or bring in experts who have.
3. It’s worth pointing out that the compute engine does create columnstore indexes automatically. This will provide improved performance for many queries, but does not reach the level of targeted performance tuning for specific workloads.
4. I hope this gets named PBIPA4ASA, but so far no one on the branding team is returning my calls.