Roche’s Maxim of Data Transformation

According to the internet, a maxim is a succinct formulation of a fundamental principle, general truth, or rule of conduct.[1] Maxims tend to relate to common situations and topics that are understandable by a broad range of people.

Topics like data transformation.

Roche’s Maxim of Data Transformation[2] states:

Data should be transformed as far upstream as possible, and as far downstream as necessary.

In this context “upstream” means closer to where the data is originally produced, and “downstream” means closer to where the data is consumed.

By transforming data closer to its ultimate source costs can be reduced, and the value added through data transformation can be applied to a greater range of uses. The farther downstream a given transformation is applied, the more expensive it tends to be – often because operations are performed more frequently – and the smaller the scope of potential value through reuse.

I’ve been using this guideline for many years, but I only started recognizing it as a maxim in the past year or so. The more I work with enterprise Power BI customers the more I realize how true and how important it is – and how many common problems could be avoided if more people thought about it when building data solutions.

Please note that this maxim is generalizable to data solutions implementing using any tools or technology. The examples below focus on Power BI because that’s where I spend my days, but these principles apply to every data platform I have used or seen used.

In day-to-day Power BI conversations, perhaps the most common question to which Roche’s Maxim applies is about where to implement a given unit of logic: “Should I do this in DAX or in Power Query?”

Short answer: Do it in Power Query.

If you’re ever faced with this question, always default to Power Query if Power Query is capable of doing what you need – Power Query is farther upstream. Performing data transformation in Power Query ensures that when your dataset is refreshed the data is loaded into the data model in the shape it needs to be in. Your report logic will be simplified and thus easier to maintain, and will likely perform better[3] because the Vertipaq engine will need to do less work as users interact with the report.[4]

But what if you need data transformation logic that depends on the context of the current user interacting with the report – things like slicers and cross-filtering? This is the perfect job for a DAX measure, because Power Query doesn’t have access to the report context. Implementing this logic farther downstream in DAX makes sense because it’s necessary.

Another common question to which Roche’s Maxim applies is also about where to implement a given unit of logic: “Should I do this in Power BI or in the data warehouse?”

Short answer: Do it in the data warehouse.

If you’re ever faced with this question, always default to transforming the data into its desired shape when loading it into the data warehouse – the data warehouse is farther upstream. Performing data transformation when loading the data warehouse ensures that any analytics solution that uses the data has ready access to what it needs – and that every solution downstream of the warehouse is using a consistent version of the data.

From a performance perspective, it is always better to perform a given data transformation as few times as possible, and it is best to not need to transform data at all.[5] Data transformation is a costly operation – transforming data once when loading into a common location like a data warehouse, data mart, or data lake, is inherently less costly than transforming it once for every report, app, or solution that uses that common location.

A much less common question to which Roche’s Maxim applies might be “What about that whole ‘not transforming at all’ pattern you mentioned a few paragraphs back – how exactly does that dark magic work?”

Short answer: Have the data already available in the format you need it to be in.

That short answer isn’t particularly useful, so here are two brief stories to illustrate what I mean.

Many years ago I was working with an oil & gas company in an engagement related to master data management. This company had a fundamental data problem: the equipment on their drilling platforms around the world was not standardized, and different meters reported the same production data differently. These differences in measurement meant that all downstream reporting and data processing could only take place using the least common denominator across their global set of meters… and this was no longer good enough. To solve the problem, they were standardizing on new meters everywhere, and updating their data estate to take advantage of the new hardware. My jaw dropped when I learned that the cost of upgrading was upwards of one hundred million dollars… which was a lot of money at the time.

Much more recently I was working with a retail company with over 5,000 locations across North America. They had similar challenges with similar root causes: their stores did not have consistent point of sale (POS) hardware[6], which meant that different stores produced different data and produced some common data at different grain, and analytics could only take place using the least common denominator data from all stores. Their solution was also similar: they upgraded all POS systems in all stores. I don’t have a dollar amount to put with this investment, but it was certainly significant – especially in an industry where margins are traditionally small and budgets traditionally very conservative.

Both of these stories illustrate organizations taking Roche’s Maxim to the extreme: they transformed their key data literally as far upstream as possible, by making the necessary changes to produce the data in its desired form.[7]

Each of these stories included both technical and non-technical factors. The technical factors revolve around data. The non-technical factors revolve around money. Each company looked at the cost and the benefit and decided that the benefit was greater. They implemented an upstream change that will benefit every downstream system and application, which will simplify their overall data estate, and which corrects a fundamental structural problem in their data supply chain that could only be mitigated, not corrected, by any downstream change.

There’s one additional part of Roche’s Maxim that’s worth elaborating on – what does “necessary” mean? This post has looked at multiple scenarios that emphasize the “as far upstream as possible” part of the maxim – what about the “as far downstream as necessary” part?

Some factors for pushing transformations downstream are technical, like the DAX context example above. Other technical factors might be the availability of data – you can’t produce a given output unless you have all necessary inputs. Others may be organizational – if data is produced by a 3rd party, your ability to apply transformations before a given point may be constrained more by a contract than by technology.

Still other factors may be situational and pragmatic – if team priorities and available resources prevent you from implementing a unit of data transformation logic in the data warehouse, it may be necessary to implement it in your Power BI solution in order to meet project deadlines and commitments.

These are probably the most frustrating types of “necessary” factors, but they’re also some of the most common. Sometimes you need to deliver a less-than-ideal solution and incur technical debt that you would prefer to avoid. The next time you find yourself in such a situation, keep this maxim in mind, and remember that even though it may be necessary to move that data transformation logic downstream today, tomorrow is another day, with different constraints and new opportunities.

Callout: This may be my maxim, but this isn’t the first blog post on the topic. Stuart Box from the UK data consultancy BurningSuit blogged back in March and was in first with this excellent article.


Update June 2021: The video from my June 11 DataMinutes presentation is now available, so if you prefer visual content, this video might be a good place to start.

 

Update May 2022: The fine folks at Greyskull Analytics have added a wonderful t-shirt to their online store. If you want to look cooler than I ever will, you might want to head on over to https://greyskullanalytics.com/ to order yours.

 

Update August 2022: There also a video from SQLBits, in case you want a slightly longer version presented to a live audience.

 

Update October 2022: I had the pleasure of visiting Patrick’s cube, and we recorded a video together. You should check it out.


[1] Or a men’s magazine. I really really wanted to use this more pop-culture meaning to make a “DQ” joke playing on the men’s magazine “GQ” but after watching this post languish in my drafts for many months and this joke not even beginning to cohere, I decided I should probably just let it go and move on.

But I did not let it go. Not really.

[2] If you think that sounds pretentious when you read it, imagine how it feels typing it in.

[3] The performance benefit here is not always obvious when working with smaller data volumes, but will become increasingly obvious as the data volume increases. And since the last thing you want to do in this situation is to retrofit your growing Power BI solution because you made poor decisions early on, why not refer to that maxim the next time you’re thinking about adding a calculated column?

[4] This post only spent a week or so in draft form, but during this week I watched an interesting work email conversation unfold. A Power BI customer was experiencing unexpected performance issues related to incremental refresh of a large dataset, and a DAX calculated column on a table with hundreds of millions of records was part of the scenario. The email thread was between members of the engineering and CAT teams, and a few points jumped out at me, including one CAT member observing “in my experience, calculated columns on large tables [can] increase processing times and also can greatly increase the time of doing a process recalc… it also depends on the complexity of the calculated column.”

I don’t have enough knowledge of the Veripaq engine’s inner workings to jump into the conversation myself, but I did sip my coffee and smile to myself before moving on with my morning. I checked back in on the conversation later on, and saw that a Power BI  group engineering manager (GEM) had shared this guidance, presented here with his approval:

“From a pure perf standpoint, its true that we can say:

  • The most efficient approach for a large fact table is to have all the columns be present in the source table (materialized views also might work), so that no extra processing is necessary during the import operation (either in Mashup or in DAX)
  • The next most efficient approach for a large fact table is usually going to be to have the computation be part of the M expression, so that it only needs to be evaluated for the rows in the partitions being processed
  • DAX calculated columns are a great option for flexibility and are particularly useful for dimension tables, but will be the least efficient compared to the above two options for large fact tables”

That sounds pretty familiar, doesn’t it? The GEM effectively summarized Roche’s Maxim, including specific guidance for the specific customer scenario. The details will differ from context to context, but I have never found a scenario to which the maxim did not apply.

Yes, this is a challenge for you to tell me where and how I’m wrong.

[5] Just as Sun Tzu said “To fight and conquer in all our battles is not supreme excellence; supreme excellence consists in breaking the enemy’s resistance without fighting,” supreme excellence in data transformation is not needing to transform the data at all.

[6] That’s “cash registers” to the less retail inclined readers.

[7] If you feel inclined to point out that in each of these stories there is additional data transformation taking place farther downstream, I won’t argue. You are almost certainly correct… but the Maxim still holds, as the key common transformations have been offloaded into the most upstream possible component in the data supply chain. Like a boss.[8]

[8] Like a supremely excellent[5] boss.

Session resources: Patterns for adopting dataflows in Power BI

Important: This post was written and published in 2020, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be more of an historical record and less of a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

This morning I presented a new webinar for the Istanbul Power BI user group, covering one of my favorite subjects: common patterns for successfully using and adopting dataflows in Power BI.

This session represents an intersection of my data culture series in that it presents lessons learned from successful enterprise customers, and my dataflows series in that… in that it’s about dataflows. I probably didn’t need to point out that part.

The session slides can be downloaded here: 2020-09-23 – Power BI Istanbul – Patterns for adopting dataflows in Power BI

The session recording is available for on-demand viewing. The presentation is around 50 minutes, with about 30 minutes of dataflows-centric Q&A at the end. Please check it out, and share it with your friends!

 

Dataflows, or Data Flows?

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

I don’t hear this question as often as I used to[1], but I still hear it: What’s the difference between dataflows in Power BI and Data Flows in Azure Data Factory?

I’ve already written extensively on the Power BI side of things, and now the awesome Merrill Aldrich from BlueGranite has published an excellent overview of the ADF feature on the BlueGranite blog. You should check it out here: https://www.blue-granite.com/blog/ssis-inspired-visual-data-transformation-comes-to-azure-data-factory

I’m not going to summarize the similarities and differences in this post, but after you’ve read Merrill’s article, I’d love to hear your specific questions.


[1] This may or may not be because I’ve been on vacation for the past three weeks.