Data Culture: Measuring Success

Building a data culture is hard. It involves technology and people, each of which is complicated enough on its own. When you combine them[1] they get even harder together. Building a data culture takes time, effort, and money – and because it takes time, you don’t always know if the effort and money you’re investing will get you to where you need to go.

Measuring the success of your efforts can be as hard as the efforts themselves.

Very often the work involved in building a data culture doesn’t start neatly and cleanly with a clearly defined end state. It often starts messily, with organic bottom-up efforts combining with top-down efforts over a period of change that’s driven as much by external forces as by any single decision to begin. This means that measuring success – and defining what “success” means – happens while the work is being done.

Measuring usage is the easiest approach, but it’s not really measuring success. Does having more reports or more users actually produce more value?

For many organizations[2], success is measured in the bottom line –  is the investment in building a data culture delivering the expected return from a financial perspective?

Having a data culture can make financial sense in two ways: it can reduce costs, and it can increase revenue.

Organizations often reduce costs by simplifying their data estate. This could involve standardizing on a single BI tool, or at least minimizing the number of tools used, migrating from older tools before they’re retired and decommissioned. This reduces costs directly by eliminating licensing expenses, and reduces costs indirectly by reducing the effort required for training, support, and related tasks. Measuring cost reduction can be straightforward – odds are someone is already tracking the IT budget – and measuring the reduction in the utilization of legacy tools can also take advantage of existing usage reporting.

Organizations can increase revenue by building more efficient, data-driven business processes. This is harder to measure. Typically this involves instrumenting the business processes in question, and proactively building the processes to correlate data culture efforts to business process outcomes.

In the video I mention the work of several enterprise Power BI customers who have build Power BI apps for their information workers and salespeople. These apps provide up-to-date data and insights for employees who would otherwise need to rely on days- or weeks-old batch data delivered via email or printout. By tracking which employees are using what aspects of the Power BI apps, the organizations can correlate this usage with the business outcomes of the employees’ work[3]. If a person or team’s efficiency increases as data usage increases, it’s hard to argue with that sort of success.

But.. this post and video assume that you have actually set explicit goals. Have you? If you haven’t defined that strategy, you definitely want to check out next week’s video…

[1] Especially since you usually have organizational politics thrown into the mix for good measure, and that never makes things any simpler.

[2] I originally typed “most organizations” but I don’t have the data to support that assertion. This is true of most of the mature enterprise organizations that I’ve worked with, but I suspect that for a broader population, most organizations don’t actually measure – they just cross their fingers and do what they can.

[3] Odds are someone is already tracking things like sales, so the “business outcomes” part of this approach might be simpler than you might otherwise assume. Getting access to the data and incorporating it in a reporting solution may not be straightforward, but it’s likely the data itself already exists for key business processes.

Data Culture: Experts and Expertise

Power BI lets business users solve more and more problems without requiring deep BI and data expertise. This is what self-service business intelligence is all about, as we saw when we looked at a brief history of business intelligence.

At other points in this series we also looked at how each app needs to be treated like the unique snowflake that it is, that successful data cultures have well-defined roles and responsibilities, and that sometimes you need to pick your battles and realize that some apps and data don’t need the management and care that others do.

But some apps do.

Some BI solutions are too important to let grow organically through self-service development. Sometimes you need true BI experts who can design, implement, and support applications that will scale to significant data volumes and number of concurrent users.

In this video we look at a specific approach taken by the BI team at Microsoft that developed the analytic platform used by Microsoft finance[1].

This is one specific approach, but it demonstrates a few fundamental facts that can be overlooked too easily:

  • Building an enterprise BI solution is building enterprise software, and it requires the rigor and discipline that building enterprise software demands
  • Each delivery team has dedicated teams of experts responsible for their part of the whole
  • Each business group with data and BI functionality included in the solution pays for what they get, with both money and personnel

Organizations that choose to ignore the need for experts tend to build sub-optimal solutions that fail to deliver on stakeholder expectations. These solutions are often replaced much sooner than planned, and the people responsible for their implementation are often replaced at the same time[2].

This isn’t the right place to go into the details of what sort of expertise you’ll need, because there’s too much to cover, and because the details will depend on your goals and your starting point. In my opinion the best place to go for more information is the Power BI whitepaper on Planning a Power BI Enterprise Deployment. This free resources delivers 250+ pages of wisdom from authors Melissa Coates and Chris Webb. You probably don’t need to read all of it, but odds are you will probably want to once you get started…

After this video and post were completed but before they were published, this story hit the global news wire: Botched Excel import may have caused loss of 15,841 UK COVID-19 cases | Ars Technica (

Wow. Although I am generally a big fan of Ars Technica’s journalism, I need to object to the sub-headline: “Lost data was reportedly the result of Excel’s limit of 1,048,576 rows.”

Yeah, no. The lost data was not the result of a  capability that has been well-known and documented for over a decade. The lost data was a result of using non-experts to do a job that experts should have done.

Choosing the wrong tool for a given job is often a symptom of not including experts and their hard-earned knowledge at the phases of a project where that expertise could have set everything up for success. This is just one example of many. Don’t let this happen to you.

[1] If you’re interested in a closer look at the Microsoft Finance COE approach, please check out this article in the Power BI guidance documentation.

[2] If you’ve been a consultant for very long, you’ve probably seen this pattern more than once. A client calls you in to replace or repair a system that never really worked, and all of the people who built it are no longer around.

The best Power BI roadmap

The Power BI team at Microsoft publishes a “release plan,” which is essentially the public product roadmap. Anyone can use it to understand what new capabilities and improvements are planned, and when they’re expected to be released[1].

One challenge with the official release plan comes from the fact that it is a set of online documents, and that for each “release wave[2]” there is a new set of docs – it’s not always clear where to look for the latest information on a given feature.

It’s clear now. You look here:

This link will take you to an interactive Power BI report built by Microsoft Technical Specialist and community superhero Alex Powers.

Using this report you can browse and explore by release month, or you can use keyword search to find specific features you’re interested in. The report spans multiple release waves (as I write this post it includes features from October 2019 to March 2021) and for each feature there’s a link to the documentation that includes all publicly available detail.

I use this report almost every day. It’s not actually new, but it is awesome. You should use it, and share it with everyone you know… or at least the people you know who care about Power BI.

[1] Ohhh…. Now that i type that sentence, I understand why we call it a release plan. Suddenly that makes sense.

[2] There are two release waves per year, and they correspond to the planning and execution semesters used by the product teams.

DirectQuery and dataflows in Power BI

Last week I saw this scenario cross my screen:

I’m trying to use dataflows with DirectQuery. I’ve pushed all the data into a dataflow, around 1.5 billion rows, and am trying to DirectQuery all the data. Performance is quite slow.

There’s a lot here to unpack, but I want to start with this[1]:

As covered in this recent post on the dataflows enhanced compute engine, it is possible to use dataflows in Power BI as a DirectQuery data source. But… it’s probably not going to deliver awesome performance when you’re querying billions of rows, over and over, every time a user interacts with visuals on published reports.

Many of the conversations I have that involve DirectQuery expose a lack of awareness of what DirectQuery actually does. Here’s a quote from the Power BI docs (emphasis is mine):

DirectQuery mode is an alternative to Import mode. Models developed in DirectQuery mode don’t import data. Instead, they consist only of metadata defining the model structure. When the model is queried, native queries are used to retrieve data from the underlying data source.

Here’s something that isn’t in the article, but which I believe probably should be:

DirectQuery often seems like an attractive option because it leaves the data in the database where it already is, and doesn’t require importing the data into the Tabular engine in the Power BI service. It can feel like you’re avoiding the effort and complexity that can come with designing a high-performance tabular model, but what you’re actually doing is moving the complexity and effort somewhere else.

When you’re designing a tabular model that will have a small data volume and/or a small number of users, you can probably get away without paying too much attention to best practices. But as your data and user base grow, you you need to invest effort and expertise into designing and implementing a model[2] that will scale and perform as needed.

The same factors are true for scaling with DirectQuery – it’s just a different set of effort and expertise that needs to be acquired and applied. My colleague Kasper de Jonge wrote at length about this a few years ago, and his post links to other, deeper, sources as well. At the risk of oversimplifying, there are two key factors that you need to take into consideration when using DirectQuery:

  1. Database design – the schema accessed via DirectQuery needs to be optimized for the types of queries it will process
  2. Index design – the database needs appropriate indexes to ensure that most common queries are fulfilled by data that’s already in memory, without the IO cost of hitting a disk

These factors fall broadly into the category of performance tuning and optimization, and the specific steps and details will depend on the data source. SQL Server, Oracle, SAP HANA… each database platform has its own nuances. There are lots of resources available online, like this awesome post from Brent Ozar… but if you pay attention you’ll see this 3-page post is mainly just a collection of links to other more detailed articles, and that it probably represents hundreds of pages of content. And this is just one resource for one database.

Using DirectQuery at scale means you need to be an expert on the database where your data lives, and you need to put that expertise to work tuning and optimizing the database. If you don’t do this, you cannot expect DirectQuery to perform well as your data volume and number of users grow.

At this point you may be asking what any of this has to do with dataflows in Power BI. The answer, of course, is everything.

The dataflows enhanced compute engine improves performance for multiple scenarios by loading dataflow entity data into a SQL-based cache. Queries against the dataflow can then be fulfilled by reading from the cache instead of reading from CDM folders in Azure storage – this enables both query folding and DirectQuery.

One thing that the enhanced compute engine does not provide is direct access to the SQL database that’s doing the caching. Not only can you not define indexes[3], you cannot use any of the SQL Server tools or techniques that Brent’s performance tuning post calls out. The implementation of the compute engine is a black box, and it does not provide any mechanism for monitoring, tuning, or optimization.

This means that while the enhanced compute engine is a simple solution to common problems, expecting dataflows DirectQuery over billions of rows to perform well is likely to result in the same disappointment as when expecting DirectQuery to perform well at scale over any un-optimized database.

None of this to say that DirectQuery – including DirectQuery over dataflows – can’t be part of a scalable BI solution. Power BI and Azure provide capabilities like composite models and aggregations, as well as the recently-announced Power BI performance accelerator for Azure Synapse Analytics[4], each of which is a valuable addition to your modern BI toolkit.

But if you expect to flip a switch and have awesome performance with DirectQuery against dataflows with billions of records…

[1] I know there’s a typo in the meme. It should be driving me crazy, but for some reason it fits.

[2] Yes, I just linked to the SQLBI home page, rather than to a specific article. This is the point – it’s complicated, and there is a body of knowledge that needs to be acquired and applied. If you and your team don’t have that expertise, you either need to develop it or bring in experts who have.

[3] It’s worth pointing out that the compute engine does create columnstore indexes automatically. This will provide improved performance for many queries, but does not reach the level of targeted performance tuning for specific workloads.

[4] I hope this gets named PBIPA4ASA, but so far no one on the branding team is returning my calls.

Data Culture: Community champions

What would an epic battle be without champions?

Lost. The epic battle would be lost without champions.

Don’t let this happen to you battle to build a data culture. Instead, find your champions, recognize and thank them, and give them the tools they need to rally their forces and lead them to victory.

Let’s do this!!

Despite what the nice short video[1] may lead you to believe, it’s not absolutely necessary to provide your data culture champions with literal swords[2]. But it is vital that you arm[3] them with the resources and connections they need to be successful.

In any community there will be people who step up to go the extra mile, to learn more than they need to know, and to do more than they are asked. These people are your champions, but they can’t do it all on their own. In the long term champions will succeed or fail based on the support they get from the center of excellence.

With support from the BI COE, champions can help a small central team scale their reach and impact. Champions typically become the primary point of contact for their teams and business groups, sharing information and answering questions. They demonstrate the art of the possible, and put technical concepts into the context and language that their business peers understand.

This is just what they do – this is what makes them champions.

An organization that’s actively working to build a data culture will recognize and support these activities. And if an organization does not…

[1] This video is about 1/3 as long as the last video in the series. You’re very welcome.

[2] But why take chances, am I right?

[3] See what I did there? I shouldn’t be allowed to write blog posts this close to bedtime.

Automatically refresh dataset when dataflow refresh completes

Back in August I highlighted the new dataflows PowerShell script repo on GitHub. These scripts provide an accelerated starting point to working with the dataflows REST APIs and to automate common dataflows tasks.

This week the dataflows team has released two new REST APIs for managing dataflows transactions (think “refresh”) and a new parameterized PowerShell script to make using the APIs easier.

You can find the APIs documented here and the PowerShell script here.

The most exciting thing about these APIs and scripts[1] is that they enable one of the most frequently requested capabilities: you can now trigger a dataset refresh (or the refresh of a dataflow in another workspace) when a dataflow refresh completes.


If you’re reading this and are thinking “that’s all well and good, but I don’t want to use an API or PowerShell[2]” please don’t fret. The Power BI team is working on non-API-based experiences to make the end-to-end refresh experience simpler and more complete. I can’t share dates or details here before they’re publicly announced, but I did want to share proactively before anyone asked about a code-free UX for refresh orchestration.

[1] And the best thing to happen in 2020 so far, if James Ward is to be believed.

[2] Or even “this should be ‘clicky-clicky-draggy-droppy’ and not ‘typey-typey-scripty-codey’.”

Data Culture: All BI apps are not created equal

Every app is a unique snowflake.

From a distance many of them look the same, and even up close they tend to look similar, but each one is unique – and you cannot treat them all the same.

This post and video take a closer[1] look at the topic introduced when we looked at picking your battles for a successful data culture. Where that post and video looked at more general concepts, this video looks at specific techniques and examples used by successful enterprise Power BI customers around the world.

I won’t attempt to reproduce here everything that’s in the video, but I do want to share two diagrams[2] that represent how one organization has structured their community of practice uses Power BI, and how their Power BI COE supports and enables it. I chose this example because it hews closely to the standard successful approach I see with dozens of large organizations building their data culture around Power BI, but also puts the generic approach into a specific and real context.

This first diagram shows the “rings” of BI within the organization, with personal BI at the outside and enterprise BI on the inside. Each ring represents a specific point on the more control / less control spectrum introduced in earlier videos, and demonstrates how one large organization thinks about the consistent and well-defined criteria and responsibilities represented by points on that spectrum.

This second diagram “explodes” the inner ring to show how a given application may mature. This organization has well-defined entry points for self-service BI authors to engage with the central BI team to promote and operationalize reports and data that originate with business authors, and a well-defined path for each app to follow… but they also understand that not every app will follow the path to the end. Some apps don’t need to be fully IT-supported solutions, because their usage, impact, and value doesn’t justify the up-front and ongoing work this would require. Some do, because they’re more important.

It depends.

And the key factor that this organization – and other successful organizations like them – realizes, is that they can put in place processes like the ones illustrated above that examine the factors on which it depends for them, and take appropriate action.

On a case by case, app by app basis.

Because one size will never fit all.

[1] And longer – at nearly 23 minutes, this is by far the longest video in the series.

[2] If you’re ever feeling impostor syndrome please remember that I created these diagrams using SmartArt in PowerPoint, looked at them, and exclaimed “that looks great!” before publishing them publicly where thousands of people would likely see them.

Session resources: Patterns for adopting dataflows in Power BI

This morning I presented a new webinar for the Istanbul Power BI user group, covering one of my favorite subjects: common patterns for successfully using and adopting dataflows in Power BI.

This session represents an intersection of my data culture series in that it presents lessons learned from successful enterprise customers, and my dataflows series in that… in that it’s about dataflows. I probably didn’t need to point out that part.

The session slides can be downloaded here: 2020-09-23 – Power BI Istanbul – Patterns for adopting dataflows in Power BI

The session recording is available for on-demand viewing. The presentation is around 50 minutes, with about 30 minutes of dataflows-centric Q&A at the end. Please check it out, and share it with your friends!


Refreshing only upstream dataflows

tl;dr: If you want to refresh a dataflow without refreshing any downstream dataflows that depend on it, just clear the “Enable Load” setting from any linked entities that reference it. This will remove the dependencies that the Power BI service looks for, without breaking any of the downstream dataflow logic. Win!

At the end of a long day I got an email asking for help[1]. The email included a screen shot from the lineage view of a Power BI workspace, some context about working to troubleshoot a problem, and the question “We want to refresh this dataflow, and not have it refresh the downstream dataflows. Is this possible?”

I almost said no, and then I remembered this post and realized the answer was “yes, sort of.”

Composable ETL with linked and computed entities ensure that when an upstream dataflow is refreshed, any downstream dataflow will automatically refresh as well – all without any explicit configuration required. The dataflows engine in the Power BI service just looks at the dependencies and everything just works.[2]

This is great, until you need it to work differently, and the “no explicit configuration” also means “no real way to turn it off.” So long as there are linked entities referencing entities in the current dataflow, refreshing the current dataflow will cause the dataflows containing those linked entities to refresh as well.

Fortunately, that 2018-era blog post illustrates that clearing the “enable load” setting for a linked entity also clears the metadata that the dataflows engine looks at to build the dependencies graph used for composable ETL.

So I send off a quick reply, “Try this – I haven’t tested it end to end, but it should work,” and gave it a quick test because at this point I was curious.

This was my starting point: three dataflows, all related.

When I refresh the first dataflow in the chain, the next two refresh as expected.

To break that refresh chain I can just edit the second dataflow.

The only change I need to make is this one: clearing that “Enable load” check box.

Once this is done, the lineage view looks a little more like this, and I can refresh the upstream dataflow in isolation.

Once the troubleshooting is done, all I need to do to get back to work is to re-enable load on those temporarily disabled linked entities. Boom!

I doubt this technique will get used all that often, but it looks like it worked today. As I was finishing up this post I got an email confirmation that this solved the problem it needed to solve, and the person who had emailed asking for help is now unblocked.

Life is good.

Look, he’s emailing again, probably offering to buy me a coffee or a nice new sword to say thank you. Oh, he wants to show me other problems he’s run into, not offering to buy me anything at all.


[1] Typically I respond to emails asking for help with a perfunctory redirection to another location, with a statement about how my schedule and responsibilities don’t allow me to scale by replying to one-off email support requests. The person sending this particular mail got a pass because they have consistently gone out of their way to help the Power BI community, and because they asked an interesting question with all of the information I needed to respond. It also helped that the email came in at the end of the day when my brain was too burned to start the next tasks on my list. I would definitely not recommend trying to use email me asking for help. Definitely not.

[2] Because of cloud magic, I think. Maybe metadata too, but isn’t metadata a kind of magic as well?

Dataflows enhanced compute engine – will it fold?

I’ve been seeing more questions lately about the dataflows enhanced compute engine in Power BI. Although I published a video overview and although there is feature coverage in the Power BI documentation, there are a few questions that I haven’t seen readily answered online.

A lot of these questions can be phrased as “what happens when I turn on the enhanced compute engine for my Power BI Premium capacity?”

Most of my responses start with the phrase “it depends” – so let’s look at some of the factors the answer depends on.

First, let’s look at a set of connected dataflows in a Premium workspace:

This workspace has three dataflows we’re interested in:

  1. A “staging” dataflow that contains un- or minimally-transformed data from the external data source or sources
  2. A “cleansed” dataflow that uses linked and computed entities to apply data cleansing logic to the data in the staging dataflow
  3. A final dataflow that uses linked and computed entities to transform the cleansed data into a star schema for analysis

Next, let’s look at the settings for one of these dataflows:

For a given dataflow, the owner can configure the enhanced compute engine to be Off/Disabled, On/Enabled, or “Optimized” which means that the Power BI service will turn it on or off for entities in the dataflow depending on how each entity is used by other dataflows.

This dataflow-level setting, combined with how the dataflows are used, determines whether the enhanced compute engine is enabled on entities in the dataflow:

  • Disabled: The enhanced compute engine is not enabled for any entities in the dataflow
  • Optimized: The enhanced compute engine is enabled for any entity in the dataflow that is referenced by a linked entity in another dataflow
  • On: The enhanced compute engine is enabled for all entities in the dataflow

Important point: To connect to a dataflow using DirectQuery you must explicitly set the enhanced compute engine to “On” even if entities in the dataflow are being referenced by linked entities.

Now let’s put these two factors together, considering that all three dataflows in the first image are using the default setting: Optimized.

  1. The “staging” dataflow uses the enhanced compute engine for any entities that are referenced by linked entities in another dataflow
  2. The “cleansed” dataflow uses the enhanced compute engine for any entities that are referenced by linked entities in another dataflow
  3. The final dataflow does not use the enhanced compute engine because none of its entities is referenced by linked entities in another dataflow[2]

Now we’re ready to revisit the actual question we’re trying to answer[3]: “what happens when I turn on the enhanced compute engine for my Power BI Premium capacity?”

Once the enhanced compute engine is enabled in the Power BI Premium capacity settings, and the dataflow settings and configuration (as illustrated above) dictate that the engine is used for a given dataflow, this is what happens:

  • When the dataflow is refreshed, the Power Query for each entity is executed, and the output of the query is persisted in the dataflow’s CDM folder as CSV data and JSON metadata
  • For any entity for which the enhanced compute engine is enabled, the output of the entity’s Power Query is also loaded into a table in a SQL database instance managed by the Power BI service

This second bullet is where the magic happens. Because the data is now in a storage format that includes a compute engine, supported queries can use the SQL cache instead of the underlying CSV files, and get the increased performance that comes with query folding.

Having the data in SQL also means that the dataflow can server as a DirectQuery data source – without the enhanced compute engine a dataflow can only be used as an Import mode data source.

The next logical question is “what exactly do you mean by supported queries?”

These queries are supported, meaning that they will use the SQL data if the enhanced compute engine is enabled for a dataflow:

  • Dataflow refresh against a dataflow with the enhanced compute engine enabled – for example, the “Cleansed” dataflow or “Final” dataflow in the image above
  • Authoring in Power BI Desktop, when using DirectQuery mode
  • User activity in the BI service, when using DirectQuery mode

These queries are not supported, and will always use the CSV data even if the enhanced compute engine is enabled for a dataflow:

  • Editing a dataflow in Power Query Online
  • Authoring in Power BI Desktop, when using Import mode
  • Dataset refresh in the Power BI service, when using Import mode

The final question[4] is “what Power Query operations will actually fold against the SQL data and take full advantage of the compute capabilities?”

The answer is in this blog post from Cristian Petculescu, the architect of dataflows and much of Power BI. Cristian enumerates well over 100 M constructs and the SQL to which they fold so I’m not going to try to summarize them all here. Go take a look for yourself if you need more information than what’s in this post.

Was this helpful? Did you learn anything new?

If you have any more questions about the enhanced compute engine, please let me know!

[1] This pattern of separating dataflows based on the logical role of the data preparation logic they implement is a general best practice, in addition to aligning well with dataflows best practices.

[2] If we wanted the enhanced compute engine to be used for this final dataflow even though it is not referenced by any linked entities, we would need to change the setting from “Optimized” to “On.”

[3] Did you still remember what the question was? I copied it here because I’d forgotten and was concerned you might have too. It’s September 2020 and the world is on fire, which can make it surprisingly difficult to think…

[4] Which almost no one ever actually asks me, but which fits into the theme of the post, so I’m including it here for completeness.