Roche’s Maxim of Data Transformation

According to the internet, a maxim is a succinct formulation of a fundamental principle, general truth, or rule of conduct.[1] Maxims tend to relate to common situations and topics that are understandable by a broad range of people.

Topics like data transformation.

Update June 16, 2021: The video from my June 11 DataMinutes presentation is now available, so if you prefer visual content, the video might be a good place to start.

Roche’s Maxim of Data Transformation[2] states:

Data should be transformed as far upstream as possible, and as far downstream as necessary.

In this context “upstream” means closer to where the data is originally produced, and “downstream” means closer to where the data is consumed.

Transforming data closer to its ultimate source reduces costs, and the value added through that transformation can be applied to a greater range of uses. The farther downstream a given transformation is applied, the more expensive it tends to be – often because the operation is performed more frequently – and the smaller the scope of potential value through reuse.

I’ve been using this guideline for many years, but I only started recognizing it as a maxim in the past year or so. The more I work with enterprise Power BI customers the more I realize how true and how important it is – and how many common problems could be avoided if more people thought about it when building data solutions.

Please note that this maxim generalizes to data solutions implemented using any tools or technologies. The examples below focus on Power BI because that’s where I spend my days, but these principles apply to every data platform I have used or seen used.

In day-to-day Power BI conversations, perhaps the most common question to which Roche’s Maxim applies is about where to implement a given unit of logic: “Should I do this in DAX or in Power Query?”

Short answer: Do it in Power Query.

If you’re ever faced with this question, always default to Power Query if Power Query is capable of doing what you need – Power Query is farther upstream. Performing data transformation in Power Query ensures that when your dataset is refreshed the data is loaded into the data model in the shape it needs to be in. Your report logic will be simplified and thus easier to maintain, and will likely perform better[3] because the Vertipaq engine will need to do less work as users interact with the report.[4]
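To make that concrete, here’s a minimal sketch in Power Query (M). The server, database, and Sales table names are hypothetical – the point is that the LineTotal column is computed once per refresh, before the data ever reaches the data model:

  let
      // Hypothetical source database – adjust to match your environment
      Source = Sql.Database("myserver", "SalesDB"),
      Sales = Source{[Schema = "dbo", Item = "Sales"]}[Data],
      // Computed at refresh time, so it lands in the model as an ordinary
      // imported column rather than a DAX calculated column
      AddLineTotal = Table.AddColumn(
          Sales,
          "LineTotal",
          each [UnitPrice] * [Quantity],
          type number
      )
  in
      AddLineTotal

With a relational source like this, a simple step such as Table.AddColumn will typically fold back into the source query, pushing the work even farther upstream.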

But what if you need data transformation logic that depends on the context of the current user interacting with the report – things like slicers and cross-filtering? This is the perfect job for a DAX measure, because Power Query doesn’t have access to the report context. Implementing this logic farther downstream in DAX makes sense because it’s necessary.
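For example, a measure like this one (using the same hypothetical Sales table) is evaluated in the filter context of the report, so it responds to slicers and cross-filtering in a way that no Power Query step could:

  Selected Share % =
  -- Evaluated in the report's current filter context; ALLSELECTED keeps
  -- the user's explicit selections, so the measure returns the share of
  -- whatever the user is currently looking at
  DIVIDE (
      SUMX ( Sales, Sales[UnitPrice] * Sales[Quantity] ),
      CALCULATE (
          SUMX ( Sales, Sales[UnitPrice] * Sales[Quantity] ),
          ALLSELECTED ( Sales )
      )
  )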

Another common question to which Roche’s Maxim applies is also about where to implement a given unit of logic: “Should I do this in Power BI or in the data warehouse?”

Short answer: Do it in the data warehouse.

If you’re ever faced with this question, always default to transforming the data into its desired shape when loading it into the data warehouse – the data warehouse is farther upstream. Performing data transformation when loading the data warehouse ensures that any analytics solution that uses the data has ready access to what it needs – and that every solution downstream of the warehouse is using a consistent version of the data.

From a performance perspective, it is always better to perform a given data transformation as few times as possible, and it is best to not need to transform data at all.[5] Data transformation is a costly operation – transforming data once when loading into a common location like a data warehouse, data mart, or data lake, is inherently less costly than transforming it once for every report, app, or solution that uses that common location.
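As a simplified illustration, suppose the warehouse exposes a hypothetical view dbo.vw_SalesWithLineTotal that already contains the transformed columns. Every Power BI dataset that consumes it then reduces to pure navigation in Power Query – the transformation ran once, upstream, instead of once per report:

  let
      // Hypothetical warehouse connection – the transformation logic lives
      // upstream in the view, not in this query
      Source = Sql.Database("mywarehouse", "EDW"),
      SalesWithLineTotal = Source{[Schema = "dbo", Item = "vw_SalesWithLineTotal"]}[Data]
  in
      SalesWithLineTotal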

A much less common question to which Roche’s Maxim applies might be “What about that whole ‘not transforming at all’ pattern you mentioned a few paragraphs back – how exactly does that dark magic work?”

Short answer: Have the data already available in the format you need it to be in.

That short answer isn’t particularly useful, so here are two brief stories to illustrate what I mean.

Many years ago I was working with an oil & gas company in an engagement related to master data management. This company had a fundamental data problem: the equipment on their drilling platforms around the world was not standardized, and different meters reported the same production data differently. These differences in measurement meant that all downstream reporting and data processing could only take place using the least common denominator across their global set of meters… and this was no longer good enough. To solve the problem, they were standardizing on new meters everywhere, and updating their data estate to take advantage of the new hardware. My jaw dropped when I learned that the cost of upgrading was upwards of one hundred million dollars… which was a lot of money at the time.

Much more recently I was working with a retail company with over 5,000 locations across North America. They had similar challenges with similar root causes: their stores did not have consistent point of sale (POS) hardware[6], which meant that different stores produced different data and produced some common data at different grain, and analytics could only take place using the least common denominator data from all stores. Their solution was also similar: they upgraded all POS systems in all stores. I don’t have a dollar amount to put with this investment, but it was certainly significant – especially in an industry where margins are traditionally small and budgets traditionally very conservative.

Both of these stories illustrate organizations taking Roche’s Maxim to the extreme: they transformed their key data literally as far upstream as possible, by making the necessary changes to produce the data in its desired form.[7]

Each of these stories included both technical and non-technical factors. The technical factors revolve around data. The non-technical factors revolve around money. Each company looked at the cost and the benefit and decided that the benefit was greater. They implemented an upstream change that will benefit every downstream system and application, simplify their overall data estate, and correct a fundamental structural problem in their data supply chain – a problem that could only be mitigated, not corrected, by any downstream change.

There’s one additional part of Roche’s Maxim that’s worth elaborating on – what does “necessary” mean? This post has looked at multiple scenarios that emphasize the “as far upstream as possible” part of the maxim – what about the “as far downstream as necessary” part?

Some factors for pushing transformations downstream are technical, like the DAX context example above. Another technical factor is the availability of data – you can’t produce a given output unless you have all of the necessary inputs. Other factors may be organizational – if data is produced by a third party, your ability to apply transformations before a given point may be constrained more by contract than by technology.

Still other factors may be situational and pragmatic – if team priorities and available resources prevent you from implementing a unit of data transformation logic in the data warehouse, it may be necessary to implement it in your Power BI solution in order to meet project deadlines and commitments.

These are probably the most frustrating types of “necessary” factors, but they’re also some of the most common. Sometimes you need to deliver a less-than-ideal solution and incur technical debt that you would prefer to avoid. The next time you find yourself in such a situation, keep this maxim in mind, and remember that even though it may be necessary to move that data transformation logic downstream today, tomorrow is another day, with different constraints and new opportunities.

Callout: This may be my maxim, but this isn’t the first blog post on the topic. Stuart Box from the UK data consultancy BurningSuit blogged back in March and got there first with this excellent article.


[1] Or a men’s magazine. I really really wanted to use this more pop-culture meaning to make a “DQ” joke playing on the men’s magazine “GQ” but after watching this post languish in my drafts for many months and this joke not even beginning to cohere, I decided I should probably just let it go and move on.

But I did not let it go. Not really.

[2] If you think that sounds pretentious when you read it, imagine how it feels typing it in.

[3] The performance benefit here is not always obvious when working with smaller data volumes, but will become increasingly obvious as the data volume increases. And since the last thing you want to do in this situation is to retrofit your growing Power BI solution because you made poor decisions early on, why not refer to that maxim the next time you’re thinking about adding a calculated column?

[4] This post only spent a week or so in draft form, but during this week I watched an interesting work email conversation unfold. A Power BI customer was experiencing unexpected performance issues related to incremental refresh of a large dataset, and a DAX calculated column on a table with hundreds of millions of records was part of the scenario. The email thread was between members of the engineering and CAT teams, and a few points jumped out at me, including one CAT member observing “in my experience, calculated columns on large tables [can] increase processing times and also can greatly increase the time of doing a process recalc… it also depends on the complexity of the calculated column.”

I don’t have enough knowledge of the Vertipaq engine’s inner workings to jump into the conversation myself, but I did sip my coffee and smile to myself before moving on with my morning. I checked back in on the conversation later on, and saw that a Power BI group engineering manager (GEM) had shared this guidance, presented here with his approval:

“From a pure perf standpoint, its true that we can say:

  • The most efficient approach for a large fact table is to have all the columns be present in the source table (materialized views also might work), so that no extra processing is necessary during the import operation (either in Mashup or in DAX)
  • The next most efficient approach for a large fact table is usually going to be to have the computation be part of the M expression, so that it only needs to be evaluated for the rows in the partitions being processed
  • DAX calculated columns are a great option for flexibility and are particularly useful for dimension tables, but will be the least efficient compared to the above two options for large fact tables”
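For contrast, the least efficient of those three options would look something like this hypothetical calculated column on a large Sales fact table:

  LineTotal =
  -- Evaluated row by row by the formula engine during processing (and again
  -- during a process recalc), which is why the guidance above ranks this
  -- option last for large fact tables
  Sales[UnitPrice] * Sales[Quantity]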

That sounds pretty familiar, doesn’t it? The GEM effectively summarized Roche’s Maxim, including specific guidance for the specific customer scenario. The details will differ from context to context, but I have never found a scenario to which the maxim did not apply.

Yes, this is a challenge for you to tell me where and how I’m wrong.

[5] Just as Sun Tzu said “To fight and conquer in all your battles is not supreme excellence; supreme excellence consists in breaking the enemy’s resistance without fighting,” supreme excellence in data transformation is not needing to transform the data at all.

[6] That’s “cash registers” to the less retail inclined readers.

[7] If you feel inclined to point out that in each of these stories there is additional data transformation taking place farther downstream, I won’t argue. You are almost certainly correct… but the Maxim still holds, as the key common transformations have been offloaded into the most upstream possible component in the data supply chain. Like a boss.[8]

[8] Like a supremely excellent[5] boss.

Data Governance and Self-Service Business Intelligence

When you hear someone say that governance and self-service BI don’t go together, or some variation on the tired old “Power BI doesn’t do data governance” trope, you should immediately be skeptical.

During a recent Guy in a Cube live stream there was a great discussion about self-service BI and data governance, and about how in most larger organizations Power BI is used for self-service and non-self-service BI workloads. The discussion starts around the 27:46 mark in the recording if you’re interested.

As is often the case, this discussion sparked my writing muse and I decided to follow up with a brief Twitter thread to share a few thoughts that didn’t fit into the stream chat. That brief thread turned out to be much larger and quite different than what I expected… big enough to warrant its own blog post. This post.

Please consider this well-known quote: “No plan survives contact with the enemy.”

Please also feel encouraged to read this fascinating history of the quote and its attributions on the Quote Investigator web site.

In his 1871 essay Helmuth von Moltke called out an obvious truth: battle is inherently unpredictable, and once enemy contact is made a successful commander must respond to actual conditions on the ground – not follow a plan that grows more outdated with every passing minute.

At the same time, that commander must have and must adhere to strategic goals for the engagement. Without these goals, how could they react and respond and plan as the reality of the conflict changes constantly and unpredictably?

Implementing managed self-service business intelligence – self-service BI hand-in-hand with data governance – exhibits many of the same characteristics.

Consider a battlefield, where one force has overwhelming superiority: More soldiers, more artillery, more tanks, and a commanding position of the terrain. The commander of that force knows that any enemy who faces him on this field will fail. The enemy knows this too.

And because the enemy knows this, they will not enter the field to face that superior force. They will fade away, withdraw from direct conflict, and strike unexpectedly, seeking out weaknesses and vulnerabilities. This is the nature of asymmetric warfare.

The commander of the more powerful force probably knows this too, and will act accordingly. The smart commander will present opportunities that their enemies will perceive as easily exploitable weaknesses, to draw them in and thus to bring that overwhelming force to bear.

And this brings us naturally back to the topic of data governance, self-service business intelligence, and dead Prussian field marshals.

Seriously.

In many large organizations, the goal of the data governance group is to ensure that data is never used improperly, and to mitigate (often proactively and aggressively mitigate) the risk of improper use.

In many large organizations, the data governance group has an overwhelming battlefield advantage. They make the rules. They define the processes. They grant or deny access to the data. No one gets in without their say-so, and woe unto any business user who enters that field of battle, and tries to get access to data that is under the protection of this superior force.

Of course, the business users know this. They’re outgunned and outmanned, and they know the dire fate that awaits them if they try to run the gauntlet that the data governance team has established. Everyone they know who has ever tried has failed.

So they go around it. They rely on the tried and true asymmetric tactics of self-service BI. The CSV export. The snapshot. The Excel files and SharePoint lists with manually-entered data.

Rather than facing the data governance group and their overwhelming advantages, they build a shadow BI solution.

These veteran business users choose not to join a battle they’re doomed to lose.

They instead seek and find the weak spots. They achieve their goals despite all of the advantages and resources that the data governance group has at their disposal.

Every time. Business users always find a way.

This is where a savvy data governance leader can learn from the battlefield. Just as a military commander can draw in their opponents and then bring their superior forces to bear, the data governance group can present an attractive and irresistible target to draw in business users seeking data.

This is the path to managed self-service business intelligence… and where the whole military analogy starts to break down. Even though data governance and self-service BI have different priorities and goals, these groups should not and must not be enemies. They need to be partners for either to succeed.

Managed self-service BI succeeds when it is easier for business users to get access to the data they need by working within the processes and systems established by the data governance group, rather than circumventing them.[1]

Managed self-service BI succeeds when the data governance group enables processes and systems to give business users the access they need to the data they need, while still maintaining the oversight and control required for effective governance.

Managed self-service BI succeeds when the data governance group stops saying “no” by default, and instead says “yes, and” by default.

  • Yes you can get access to this data, and these are the prerequisites you must meet.
  • Yes you can get access to this data, and these are the allowed scenarios for proper use.
  • Yes you can get access to this data, and these are the resources to make it easy for you to succeed.

What business user would choose to build their own shadow BI solution that requires manual processes and maintenance just to have an incomplete and outdated copy when they could instead have access to the real data they need – the complete, trusted, authoritative, current data they need – just by following a few simple rules?[2]

Managed self-service BI succeeds when the data governance group provides business users with the access they need to the data they need to do their jobs, while retaining the oversight and control the data governance group needs to keep their jobs.

This is a difficult balancing act, but there are well-known patterns to help organizations of any size succeed.

At this point you may be asking yourself what this has to do with plans not surviving contact with the enemy. Everything. It has everything to do with this pithy quote.

The successful data governance group will have a plan, and that plan will be informed by well-understood strategic goals. The plan is the plan, but the plan is made to change as the battle ebbs and flows. The strategy does not change moment to moment or day to day.

So as more business users engage, and as the initial governance plan shows its gaps and inadequacies, the data governance group changes the plan, keeping it aligned with the strategy and informed by the reality of the business.

This is a difficult balancing act, but it is being successfully performed by scores of enterprise organizations around the world using Power BI. Each organization finds the approach and the balance that best achieves their goals.

Although this post has used a martial metaphor to help engage the reader, this is not the best mental model to take away. Data governance and self-service business intelligence are not at war, even though they are often in a state of conflict or friction.

The right mental model is of a lasting peace, with shared goals and ongoing tradeoffs and compromises as each side gives and takes, and contributes to those shared goals.

This is what a successful data culture looks like: a lasting peace.

Multiple people replied to the original Twitter thread citing various challenges to succeeding with managed self-service business intelligence, balancing SSBI with effective data governance. Each of those challenges highlights the importance of an effective partnership between the parties, and of aligning business and IT priorities into shared strategic goals and principles that allow everyone to succeed together.

If you want to explore these concepts further and go beyond the highlights in this post, please feel encouraged to check out the full “Building a Data Culture with Power BI” series of posts and videos. Acknowledging the fact that data governance and self-service BI go beautifully together is just the beginning.


[1] This is really important.

[2] Yes, yes, we all know that guy. Sometimes the data governance team needs the old stick for people who don’t find the new carrot attractive enough… but those people tend to be in the minority if you use the right carrot.

 

Webcast: Unleashing Your Personal Superpower

Last week I delivered a presentation for the Data Platform Women In Tech’s Mental Health and Wellness Day event.

The recording for my “Unleashing Your Personal Superpower” session is now online.

I hope you’ll watch the recording[1], but here’s a summary just in case:

  • Growth often results from challenge
  • Mental health issues like anxiety and depression present real challenges that can produce “superpowers” – skills that most people don’t have, and which can grow from the day-to-day experience of living with constant challenge
  • Recognizing and using these “superpowers” isn’t always easy – you need to be honest with yourself and the people around you, which in turn depends on being in a place of trust and safety to do so

In the presentation I mainly use an X-Men metaphor, and suggest that my personal superpowers are:

  1. Fear: Most social interactions[2] are deeply stressful for me, so to compensate I over-prepare and take effective notes for things I need to remember or actions I need to take
  2. Confusion: I don’t really understand how other people’s brains work, or the relationship between my actions and their reactions – to compensate I have developed techniques for effective written and verbal communication to eliminate ambiguity and drive clarity
  3. Chaos: My mind is made of chaos[3], which causes all sorts of challenges – to compensate I have developed a “process reflex” to understand complex problems and implement processes to address or mitigate them

I wrap up the session with a quick mention of the little-known years before Superman joined the Justice League, which he spent as a Kryptonite delivery guy, absolutely hating his life. Once he found a team where he could use his strengths and didn’t need to constantly fight to overcome his weaknesses, he was much happier and more effective.

In related news, if I could only get these Swedes to return my calls, I’m thinking of forming a new superhero team…


[1] And the rest of the session recordings, because it was a great event.

[2] Think “work meetings” for starters and “work social events” for an absolute horror show.

[3] I have a draft blog post from two years ago that tries to express this, but I doubt I will ever actually finish and publish it…