The best Power BI roadmap

The Power BI team at Microsoft publishes a “release plan,” which is essentially the public product roadmap. Anyone can use it to understand what new capabilities and improvements are planned, and when they’re expected to be released[1].

One challenge with the official release plan is that it’s a set of online documents, with a new set published for each “release wave[2]” – it’s not always clear where to look for the latest information on a given feature.

It’s clear now. You look here: https://aka.ms/pbireleaseplan

This link will take you to an interactive Power BI report built by Microsoft Technical Specialist and community superhero Alex Powers.

Using this report you can browse and explore by release month, or you can use keyword search to find specific features you’re interested in. The report spans multiple release waves (as I write this post it includes features from October 2019 to March 2021) and for each feature there’s a link to the documentation that includes all publicly available detail.

I use this report almost every day. It’s not actually new, but it is awesome. You should use it, and share it with everyone you know… or at least the people you know who care about Power BI.


[1] Ohhh…. Now that I type that sentence, I understand why we call it a release plan. Suddenly that makes sense.

[2] There are two release waves per year, and they correspond to the planning and execution semesters used by the product teams.

DirectQuery and dataflows in Power BI

Last week I saw this scenario cross my screen:

I’m trying to use dataflows with DirectQuery. I’ve pushed all the data into a dataflow, around 1.5 billion rows, and am trying to DirectQuery all the data. Performance is quite slow.

There’s a lot here to unpack, but I want to start with this[1]:

As covered in this recent post on the dataflows enhanced compute engine, it is possible to use dataflows in Power BI as a DirectQuery data source. But… it’s probably not going to deliver awesome performance when you’re querying billions of rows, over and over, every time a user interacts with visuals on published reports.

Many of the conversations I have that involve DirectQuery expose a lack of awareness of what DirectQuery actually does. Here’s a quote from the Power BI docs (emphasis is mine):

DirectQuery mode is an alternative to Import mode. Models developed in DirectQuery mode don’t import data. Instead, they consist only of metadata defining the model structure. When the model is queried, native queries are used to retrieve data from the underlying data source.

Here’s something that isn’t in the article, but which I believe probably should be:

DirectQuery often seems like an attractive option because it leaves the data in the database where it already is, and doesn’t require importing the data into the Tabular engine in the Power BI service. It can feel like you’re avoiding the effort and complexity that can come with designing a high-performance tabular model, but what you’re actually doing is moving the complexity and effort somewhere else.

When you’re designing a tabular model that will have a small data volume and/or a small number of users, you can probably get away without paying too much attention to best practices. But as your data and user base grow, you need to invest effort and expertise into designing and implementing a model[2] that will scale and perform as needed.

The same factors are true for scaling with DirectQuery – it’s just a different set of effort and expertise that needs to be acquired and applied. My colleague Kasper de Jonge wrote at length about this a few years ago, and his post links to other, deeper, sources as well. At the risk of oversimplifying, there are two key factors that you need to take into consideration when using DirectQuery:

  1. Database design – the schema accessed via DirectQuery needs to be optimized for the types of queries it will process
  2. Index design – the database needs appropriate indexes to ensure that the most common queries are fulfilled by data that’s already in memory, without the I/O cost of hitting disk (see the sketch below)
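To make that second factor a little more concrete, here’s a minimal sketch of what one small piece of this tuning work might look like against a SQL Server source, using the Invoke-Sqlcmd cmdlet from the SqlServer PowerShell module. The server, database, table, and column names are all hypothetical, and the right indexes for your workload depend entirely on the queries your reports will actually generate – treat this as an illustration of the kind of work involved, not a prescription.

```powershell
# Minimal sketch: add a columnstore index to a hypothetical fact table so that
# the aggregate-style queries DirectQuery generates can be served from
# compressed column segments instead of scanning the full rowstore table.
# Requires the SqlServer PowerShell module; all names below are placeholders.

Import-Module SqlServer

$indexDdl = @"
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactSales
ON dbo.FactSales (OrderDateKey, CustomerKey, ProductKey, SalesAmount, OrderQuantity);
"@

Invoke-Sqlcmd -ServerInstance "sql01.contoso.local" `
              -Database "SalesDW" `
              -Query $indexDdl
```

A columnstore index like this often suits the aggregate-heavy queries that Power BI visuals send to a DirectQuery source, but it’s a starting point for investigation – measuring the actual query workload comes first.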

These factors fall broadly into the category of performance tuning and optimization, and the specific steps and details will depend on the data source. SQL Server, Oracle, SAP HANA… each database platform has its own nuances. There are lots of resources available online, like this awesome post from Brent Ozar… but if you pay attention you’ll see this 3-page post is mainly just a collection of links to other more detailed articles, and that it probably represents hundreds of pages of content. And this is just one resource for one database.

Using DirectQuery at scale means you need to be an expert on the database where your data lives, and you need to put that expertise to work tuning and optimizing the database. If you don’t do this, you cannot expect DirectQuery to perform well as your data volume and number of users grow.

At this point you may be asking what any of this has to do with dataflows in Power BI. The answer, of course, is everything.

The dataflows enhanced compute engine improves performance for multiple scenarios by loading dataflow entity data into a SQL-based cache. Queries against the dataflow can then be fulfilled by reading from the cache instead of reading from CDM folders in Azure storage – this enables both query folding and DirectQuery.

One thing that the enhanced compute engine does not provide is direct access to the SQL database that’s doing the caching. Not only can you not define indexes[3], you cannot use any of the SQL Server tools or techniques that Brent’s performance tuning post calls out. The implementation of the compute engine is a black box, and it does not provide any mechanism for monitoring, tuning, or optimization.

This means that while the enhanced compute engine is a simple solution to common problems, expecting DirectQuery over dataflows with billions of rows to perform well is likely to end in the same disappointment as expecting DirectQuery to perform well at scale over any unoptimized database.

None of this is to say that DirectQuery – including DirectQuery over dataflows – can’t be part of a scalable BI solution. Power BI and Azure provide capabilities like composite models and aggregations, as well as the recently announced Power BI performance accelerator for Azure Synapse Analytics[4], each of which is a valuable addition to your modern BI toolkit.

But if you expect to flip a switch and have awesome performance with DirectQuery against dataflows with billions of records…


[1] I know there’s a typo in the meme. It should be driving me crazy, but for some reason it fits.

[2] Yes, I just linked to the SQLBI home page, rather than to a specific article. This is the point – it’s complicated, and there is a body of knowledge that needs to be acquired and applied. If you and your team don’t have that expertise, you either need to develop it or bring in experts who do.

[3] It’s worth pointing out that the compute engine does create columnstore indexes automatically. This will provide improved performance for many queries, but does not reach the level of targeted performance tuning for specific workloads.

[4] I hope this gets named PBIPA4ASA, but so far no one on the branding team is returning my calls.

Data Culture: Community champions

What would an epic battle be without champions?

Lost. The epic battle would be lost without champions.

Don’t let this happen to your battle to build a data culture. Instead, find your champions, recognize and thank them, and give them the tools they need to rally their forces and lead them to victory.

Let’s do this!!

Despite what the nice short video[1] may lead you to believe, it’s not absolutely necessary to provide your data culture champions with literal swords[2]. But it is vital that you arm[3] them with the resources and connections they need to be successful.

In any community there will be people who step up to go the extra mile, to learn more than they need to know, and to do more than they are asked. These people are your champions, but they can’t do it all on their own. In the long term champions will succeed or fail based on the support they get from the center of excellence.

With support from the BI COE, champions can help a small central team scale their reach and impact. Champions typically become the primary point of contact for their teams and business groups, sharing information and answering questions. They demonstrate the art of the possible, and put technical concepts into the context and language that their business peers understand.

This is just what they do – this is what makes them champions.

An organization that’s actively working to build a data culture will recognize and support these activities. And if an organization does not…


[1] This video is about 1/3 as long as the last video in the series. You’re very welcome.

[2] But why take chances, am I right?

[3] See what I did there? I shouldn’t be allowed to write blog posts this close to bedtime.

Automatically refresh dataset when dataflow refresh completes

Back in August I highlighted the new dataflows PowerShell script repo on GitHub. These scripts provide an accelerated starting point for working with the dataflows REST APIs and automating common dataflows tasks.

This week the dataflows team has released two new REST APIs for managing dataflows transactions (think “refresh”) and a new parameterized PowerShell script to make using the APIs easier.

You can find the APIs documented here and the PowerShell script here.

The most exciting thing about these APIs and scripts[1] is that they enable one of the most frequently requested capabilities: you can now trigger a dataset refresh (or the refresh of a dataflow in another workspace) when a dataflow refresh completes.

Boom.
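If you’re curious what this looks like in practice, here’s a minimal sketch of the pattern using the MicrosoftPowerBIMgmt PowerShell module – not the official parameterized script linked above, just an illustration of the flow: poll the dataflow’s refresh transactions, and once the latest one completes successfully, call the dataset refresh API. The workspace, dataflow, and dataset IDs are placeholders, and the transaction status strings are my assumptions – check the API documentation for the exact values.

```powershell
# Minimal sketch: wait for the latest dataflow refresh transaction to finish,
# then trigger a refresh of a downstream dataset in the same workspace.
# Requires the MicrosoftPowerBIMgmt module; all IDs below are placeholders.

Import-Module MicrosoftPowerBIMgmt

$workspaceId = "00000000-0000-0000-0000-000000000000"   # hypothetical workspace ID
$dataflowId  = "11111111-1111-1111-1111-111111111111"   # hypothetical dataflow ID
$datasetId   = "22222222-2222-2222-2222-222222222222"   # hypothetical dataset ID

Connect-PowerBIServiceAccount | Out-Null

do {
    # Get Dataflow Transactions returns the dataflow's refresh history
    $transactions = Invoke-PowerBIRestMethod -Method Get `
        -Url "groups/$workspaceId/dataflows/$dataflowId/transactions" |
        ConvertFrom-Json

    $latest = $transactions.value | Select-Object -First 1
    Write-Host "Latest dataflow transaction status: $($latest.status)"

    # Status strings are assumptions - confirm against the API documentation
    if ($latest.status -eq "InProgress") { Start-Sleep -Seconds 60 }
} while ($latest.status -eq "InProgress")

if ($latest.status -eq "Success") {
    # Kick off the downstream dataset refresh
    Invoke-PowerBIRestMethod -Method Post `
        -Url "groups/$workspaceId/datasets/$datasetId/refreshes" `
        -Body '{"notifyOption":"NoNotification"}'
    Write-Host "Dataset refresh triggered."
}
else {
    Write-Warning "Dataflow refresh did not succeed; skipping the dataset refresh."
}
```

Wrap something like this in a scheduled task or an Azure Automation runbook and you have a rough-and-ready form of refresh orchestration until the code-free experiences arrive.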

If you’re reading this and are thinking “that’s all well and good, but I don’t want to use an API or PowerShell[2]” please don’t fret. The Power BI team is working on non-API-based experiences to make the end-to-end refresh experience simpler and more complete. I can’t share dates or details here before they’re publicly announced, but I did want to share proactively before anyone asked about a code-free UX for refresh orchestration.

Update December 2020: There’s now a Power Automate connector that will let you do this without writing any code.


[1] And the best thing to happen in 2020 so far, if James Ward is to be believed.

[2] Or even “this should be ‘clicky-clicky-draggy-droppy’ and not ‘typey-typey-scripty-codey’.”

Data Culture: All BI apps are not created equal

Every app is a unique snowflake.

From a distance many of them look the same, and even up close they tend to look similar, but each one is unique – and you cannot treat them all the same.

This post and video take a closer[1] look at the topic introduced when we looked at picking your battles for a successful data culture. Where that post and video looked at more general concepts, this video looks at specific techniques and examples used by successful enterprise Power BI customers around the world.

I won’t attempt to reproduce here everything that’s in the video, but I do want to share two diagrams[2] that represent how one organization has structured the community of practice that uses Power BI, and how their Power BI COE supports and enables it. I chose this example because it hews closely to the standard successful approach I see with dozens of large organizations building their data culture around Power BI, but also puts the generic approach into a specific and real context.

This first diagram shows the “rings” of BI within the organization, with personal BI at the outside and enterprise BI on the inside. Each ring represents a specific point on the more control / less control spectrum introduced in earlier videos, and demonstrates how one large organization thinks about the consistent and well-defined criteria and responsibilities represented by points on that spectrum.

This second diagram “explodes” the inner ring to show how a given application may mature. This organization has well-defined entry points for self-service BI authors to engage with the central BI team to promote and operationalize reports and data that originate with business authors, and a well-defined path for each app to follow… but they also understand that not every app will follow the path to the end. Some apps don’t need to be fully IT-supported solutions, because their usage, impact, and value don’t justify the up-front and ongoing work this would require. Some do, because they’re more important.

It depends.

And the key thing that this organization – and other successful organizations like it – realizes is that they can put processes in place, like the ones illustrated above, to examine the factors it depends on and take appropriate action.

On a case by case, app by app basis.

Because one size will never fit all.


[1] And longer – at nearly 23 minutes, this is by far the longest video in the series.

[2] If you’re ever feeling impostor syndrome please remember that I created these diagrams using SmartArt in PowerPoint, looked at them, and exclaimed “that looks great!” before publishing them publicly where thousands of people would likely see them.

Session resources: Patterns for adopting dataflows in Power BI

This morning I presented a new webinar for the Istanbul Power BI user group, covering one of my favorite subjects: common patterns for successfully using and adopting dataflows in Power BI.

This session represents an intersection of my data culture series in that it presents lessons learned from successful enterprise customers, and my dataflows series in that… in that it’s about dataflows. I probably didn’t need to point out that part.

The session slides can be downloaded here: 2020-09-23 – Power BI Istanbul – Patterns for adopting dataflows in Power BI

The session recording is available for on-demand viewing. The presentation is around 50 minutes, with about 30 minutes of dataflows-centric Q&A at the end. Please check it out, and share it with your friends!

 

Refreshing only upstream dataflows

tl;dr: If you want to refresh a dataflow without refreshing any downstream dataflows that depend on it, just clear the “Enable Load” setting from any linked entities that reference it. This will remove the dependencies that the Power BI service looks for, without breaking any of the downstream dataflow logic. Win!

At the end of a long day I got an email asking for help[1]. The email included a screen shot from the lineage view of a Power BI workspace, some context about working to troubleshoot a problem, and the question “We want to refresh this dataflow, and not have it refresh the downstream dataflows. Is this possible?”

I almost said no, and then I remembered this post and realized the answer was “yes, sort of.”

Composable ETL with linked and computed entities ensures that when an upstream dataflow is refreshed, any downstream dataflows will automatically refresh as well – all without any explicit configuration required. The dataflows engine in the Power BI service just looks at the dependencies and everything just works.[2]

This is great, until you need it to work differently, and the “no explicit configuration” also means “no real way to turn it off.” So long as there are linked entities referencing entities in the current dataflow, refreshing the current dataflow will cause the dataflows containing those linked entities to refresh as well.

Fortunately, that 2018-era blog post illustrates that clearing the “enable load” setting for a linked entity also clears the metadata that the dataflows engine looks at to build the dependencies graph used for composable ETL.

So I sent off a quick reply, “Try this – I haven’t tested it end to end, but it should work,” and then gave it a quick test because at this point I was curious.

This was my starting point: three dataflows, all related.

When I refresh the first dataflow in the chain, the next two refresh as expected.

To break that refresh chain I can just edit the second dataflow.

The only change I need to make is this one: clearing that “Enable load” check box.

Once this is done, the lineage view looks a little more like this, and I can refresh the upstream dataflow in isolation.

Once the troubleshooting is done, all I need to do to get back to work is to re-enable load on those temporarily disabled linked entities. Boom!

I doubt this technique will get used all that often, but it looks like it worked today. As I was finishing up this post I got an email confirmation that this solved the problem it needed to solve, and the person who had emailed asking for help is now unblocked.

Life is good.

Look, he’s emailing again, probably offering to buy me a coffee or a nice new sword to say thank you. Oh, he wants to show me other problems he’s run into, not offering to buy me anything at all.

Oh.


[1] Typically I respond to emails asking for help with a perfunctory redirection to another location, with a statement about how my schedule and responsibilities don’t allow me to scale by replying to one-off email support requests. The person sending this particular mail got a pass because they have consistently gone out of their way to help the Power BI community, and because they asked an interesting question with all of the information I needed to respond. It also helped that the email came in at the end of the day when my brain was too burned to start the next tasks on my list. I would definitely not recommend emailing me to ask for help. Definitely not.

[2] Because of cloud magic, I think. Maybe metadata too, but isn’t metadata a kind of magic as well?

Data Culture: Getting stakeholder buy-in

Have you ever heard the term “Excel hell”?

Odds are, if you’re reading this blog you’ve heard it, and possibly lived it once or twice. If not, you may want to take a minute to search online and discover what the internet has to say about it.

Managed self-service BI with Power BI is one way to escape or avoid Excel hell, but any self-service data tool brings with it some of the same risks that Excel brings. Power BI introduces guardrails to make it easier to manage the work that self-service authors produce, but wouldn’t it be nice to address the problem at its root, and prevent unwanted content from being shared in the first place?

The video introduces a set of techniques to gain explicit stakeholder buy-in. For the enterprise Power BI customers I work with, these steps are usually prerequisites to getting a Pro license and permission to publish to the Power BI service, but they may be required for other activities as well.

  1. Ask for it – rather than automatically issuing licenses in bulk, issue licenses only to users who explicitly request them
  2. Sign a data covenant – require users to sign “terms of use” for working with data in ways that align with the goals of the organization
  3. Take a test – require users to take and pass a simple[1] test
  4. Complete training – require users to attend Dashboard in a Day or similar introductory training

None of these barriers is designed to keep anyone from getting access to the tools and data they need. They’re designed to make people work for it.

As mentioned in earlier posts and videos, humans are complicated and tricky, but most people value what they earn more than they value what they’re given. And if someone works to earn the ability to publish and share data and reports, they’re more likely to think twice before they publish something and forget it.

This is a small but effective step that can reduce the ongoing effort required to manage and govern the content published to the Power BI service. And if you put the entry point[2] to requesting access in your central portal, you’ll be helping reinforce the behaviors that will make your data culture grow, right from the beginning.


[1] Emphasis here on “simple” – every company I talked to who used this approach designed the test so that anyone could pass it.

[2] Power App, Customer Voice form, ServiceNow ticket, whatever fits your processes and requirements.

Dataflows enhanced compute engine – will it fold?

I’ve been seeing more questions lately about the dataflows enhanced compute engine in Power BI. Although I published a video overview and the feature is covered in the Power BI documentation, there are a few questions I haven’t seen readily answered online.

A lot of these questions can be phrased as “what happens when I turn on the enhanced compute engine for my Power BI Premium capacity?”

Most of my responses start with the phrase “it depends” – so let’s look at some of the factors the answer depends on.

First, let’s look at a set of connected dataflows in a Premium workspace:

This workspace has three dataflows we’re interested in:

  1. A “staging” dataflow that contains un- or minimally-transformed data from the external data source or sources
  2. A “cleansed” dataflow that uses linked and computed entities to apply data cleansing logic to the data in the staging dataflow
  3. A final dataflow that uses linked and computed entities to transform the cleansed data into a star schema for analysis[1]

Next, let’s look at the settings for one of these dataflows:

For a given dataflow, the owner can configure the enhanced compute engine to be Off/Disabled, On/Enabled, or “Optimized,” which means that the Power BI service will turn it on or off for entities in the dataflow depending on how each entity is used by other dataflows.

This dataflow-level setting, combined with how the dataflows are used, determines whether the enhanced compute engine is enabled on entities in the dataflow:

  • Disabled: The enhanced compute engine is not enabled for any entities in the dataflow
  • Optimized: The enhanced compute engine is enabled for any entity in the dataflow that is referenced by a linked entity in another dataflow
  • On: The enhanced compute engine is enabled for all entities in the dataflow

Important point: To connect to a dataflow using DirectQuery you must explicitly set the enhanced compute engine to “On” even if entities in the dataflow are being referenced by linked entities.

Now let’s put these two factors together, considering that all three dataflows in the first image are using the default setting: Optimized.

  1. The “staging” dataflow uses the enhanced compute engine for any entities that are referenced by linked entities in another dataflow
  2. The “cleansed” dataflow uses the enhanced compute engine for any entities that are referenced by linked entities in another dataflow
  3. The final dataflow does not use the enhanced compute engine because none of its entities is referenced by linked entities in another dataflow[2]

Now we’re ready to revisit the actual question we’re trying to answer[3]: “what happens when I turn on the enhanced compute engine for my Power BI Premium capacity?”

Once the enhanced compute engine is enabled in the Power BI Premium capacity settings, and the dataflow settings and configuration (as illustrated above) dictate that the engine is used for a given dataflow, this is what happens:

  • When the dataflow is refreshed, the Power Query for each entity is executed, and the output of the query is persisted in the dataflow’s CDM folder as CSV data and JSON metadata
  • For any entity for which the enhanced compute engine is enabled, the output of the entity’s Power Query is also loaded into a table in a SQL database instance managed by the Power BI service

This second bullet is where the magic happens. Because the data is now in a storage format that includes a compute engine, supported queries can use the SQL cache instead of the underlying CSV files, and get the increased performance that comes with query folding.

Having the data in SQL also means that the dataflow can serve as a DirectQuery data source – without the enhanced compute engine a dataflow can only be used as an Import mode data source.

The next logical question is “what exactly do you mean by supported queries?”

These queries are supported, meaning that they will use the SQL data if the enhanced compute engine is enabled for a dataflow:

  • Dataflow refresh against a dataflow with the enhanced compute engine enabled – for example, the “Cleansed” dataflow or “Final” dataflow in the image above
  • Authoring in Power BI Desktop, when using DirectQuery mode
  • User activity in the Power BI service, when using DirectQuery mode

These queries are not supported, and will always use the CSV data even if the enhanced compute engine is enabled for a dataflow:

  • Editing a dataflow in Power Query Online
  • Authoring in Power BI Desktop, when using Import mode
  • Dataset refresh in the Power BI service, when using Import mode

The final question[4] is “what Power Query operations will actually fold against the SQL data and take full advantage of the compute capabilities?”

The answer is in this blog post from Cristian Petculescu, the architect of dataflows and much of Power BI. Cristian enumerates well over 100 Power Query M constructs and the SQL to which they fold, so I’m not going to try to summarize them all here. Go take a look for yourself if you need more information than what’s in this post.

Was this helpful? Did you learn anything new?

If you have any more questions about the enhanced compute engine, please let me know!


[1] This pattern of separating dataflows based on the logical role of the data preparation logic they implement is a general best practice, in addition to aligning well with dataflows best practices.

[2] If we wanted the enhanced compute engine to be used for this final dataflow even though it is not referenced by any linked entities, we would need to change the setting from “Optimized” to “On.”

[3] Did you still remember what the question was? I copied it here because I’d forgotten and was concerned you might have too. It’s September 2020 and the world is on fire, which can make it surprisingly difficult to think…

[4] Which almost no one ever actually asks me, but which fits into the theme of the post, so I’m including it here for completeness.

Data Culture: Now you’re thinking with portal

In an ideal world, everyone knows where to find the resources and tools they need to be successful.

We don’t live in that world.

I’m not even sure we can see that world from here. But if we could see it, we’d be seeing it through a portal[1].

One of the most common themes from my conversations with enterprise Power BI customers is that organizations that are successfully building and growing their data cultures have implemented portals where they share the resources, tools, and information that their users need. These mature companies also treat their portal as a true priority – the portal is a key part of their strategy, not an afterthought.

This is why:

In every organization of non-trivial size there are obstacles that keep people from finding and using the resources, information, and data they need.

Much of the time people don’t know what they need, nor do they know what’s available. They don’t know what questions to ask[2], much less know where to go to get the answers. This isn’t their fault – it’s a natural consequence of working in a complex environment that changes over time on many different dimensions.

As I try to do in these accompanying-the-video blog posts, I will let the video speak for itself, but there are a few key points I want to emphasize here as well.

  1. You need a place where people can go for all of the resources created and curated by your center of excellence
  2. You need to engage with your community of practice to ensure that you’re providing the resources they need, and not just the resources you think they need
  3. You need to keep directing users to the portal, again and again and again, until it becomes habit and they start to refer their peers

The last point is worth emphasizing and explaining. If community members don’t use the portal, it won’t do what you need it to do, and you won’t get the return you need on your investments.

Users will continue to use traditional “known good” channels to get information – such as sending you an email or IM – if you let them. You need to not let them.


[1] See what I did there?

[2] Even though they will often argue vehemently against this fact.