Roche’s Maxim of Data Transformation

According to the internet, a maxim is a succinct formulation of a fundamental principle, general truth, or rule of conduct.[1] Maxims tend to relate to common situations and topics that are understandable by a broad range of people.

Topics like data transformation.

Roche’s Maxim of Data Transformation[2] states:

Data should be transformed as far upstream as possible, and as far downstream as necessary.

In this context “upstream” means closer to where the data is originally produced, and “downstream” means closer to where the data is consumed.

Transforming data closer to its ultimate source reduces costs, and the value added through that transformation can be applied to a greater range of uses. The farther downstream a given transformation is applied, the more expensive it tends to be – often because operations are performed more frequently – and the smaller the scope of potential value through reuse.

I’ve been using this guideline for many years, but I only started recognizing it as a maxim in the past year or so. The more I work with enterprise Power BI customers the more I realize how true and how important it is – and how many common problems could be avoided if more people thought about it when building data solutions.

Please note that this maxim generalizes to data solutions implemented using any tools or technologies. The examples below focus on Power BI because that’s where I spend my days, but these principles apply to every data platform I have used or seen used.

In day-to-day Power BI conversations, perhaps the most common question to which Roche’s Maxim applies is about where to implement a given unit of logic: “Should I do this in DAX or in Power Query?”

Short answer: Do it in Power Query.

If you’re ever faced with this question, always default to Power Query if Power Query is capable of doing what you need – Power Query is farther upstream. Performing data transformation in Power Query ensures that when your dataset is refreshed the data is loaded into the data model in the shape it needs to be in. Your report logic will be simplified and thus easier to maintain, and will likely perform better[3] because the Vertipaq engine will need to do less work as users interact with the report.[4]
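To make that concrete, here is a minimal sketch in Power Query’s M language – the server, table, and column names are hypothetical, and your own query will look different:

    // Hypothetical example: materialize a SalesAmount column during refresh,
    // so the data lands in the model already in the shape the report needs.
    let
        Source = Sql.Database("warehouse-server", "SalesDB"),          // assumed source
        Sales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
        AddSalesAmount = Table.AddColumn(
            Sales,
            "SalesAmount",
            each [Quantity] * [UnitPrice],
            type number
        )
    in
        AddSalesAmount

Because the column is computed once per refresh, your report measures can simply aggregate it.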

But what if you need data transformation logic that depends on the context of the current user interacting with the report – things like slicers and cross-filtering? This is the perfect job for a DAX measure, because Power Query doesn’t have access to the report context. Implementing this logic farther downstream in DAX makes sense because it’s necessary.
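Here is a hedged sketch of what that might look like as a DAX measure – the Sales and Product tables and the business logic are purely illustrative:

    -- Hypothetical measure: its result depends on the slicers and cross-filtering
    -- the user has applied, so it must be evaluated downstream, at query time.
    Selected Sales % of All Products =
    DIVIDE (
        SUM ( Sales[SalesAmount] ),                   -- respects the current filter context
        CALCULATE (
            SUM ( Sales[SalesAmount] ),
            REMOVEFILTERS ( 'Product' )               -- same total, ignoring product filters
        )
    )

Power Query never sees the user’s filter context, so this is a case where “as far downstream as necessary” genuinely means DAX.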

Another common question to which Roche’s Maxim applies is also about where to implement a given unit of logic: “Should I do this in Power BI or in the data warehouse?”

Short answer: Do it in the data warehouse.

If you’re ever faced with this question, always default to transforming the data into its desired shape when loading it into the data warehouse – the data warehouse is farther upstream. Performing data transformation when loading the data warehouse ensures that any analytics solution that uses the data has ready access to what it needs – and that every solution downstream of the warehouse is using a consistent version of the data.
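As a sketch of what this can look like – assuming a SQL-based warehouse, with hypothetical table and column names – the logic might live in a view, or in the load process that populates a conformed table:

    -- Hypothetical T-SQL: apply the transformation once, in the warehouse,
    -- so every downstream report and dataset sees the same conformed shape.
    CREATE VIEW dbo.SalesConformed AS
    SELECT
        s.OrderDate,
        s.StoreKey,
        s.Quantity * s.UnitPrice              AS SalesAmount,    -- computed here, not in every report
        UPPER(LTRIM(RTRIM(s.CurrencyCode)))   AS CurrencyCode    -- one standard cleanup for everyone
    FROM dbo.FactSales AS s;

Every report, app, or dataset that reads from the warehouse then gets the same answer without repeating the logic.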

From a performance perspective, it is always better to perform a given data transformation as few times as possible, and best not to need to transform the data at all.[5] Data transformation is a costly operation – transforming data once when loading it into a common location like a data warehouse, data mart, or data lake is inherently less costly than transforming it again in every report, app, or solution that uses that location.

A much less common question to which Roche’s Maxim applies might be “What about that whole ‘not transforming at all’ pattern you mentioned a few paragraphs back – how exactly does that dark magic work?”

Short answer: Have the data already available in the format you need it to be in.

That short answer isn’t particularly useful, so here are two brief stories to illustrate what I mean.

Many years ago I was working with an oil & gas company in an engagement related to master data management. This company had a fundamental data problem: the equipment on their drilling platforms around the world was not standardized, and different meters reported the same production data differently. These differences in measurement meant that all downstream reporting and data processing could only take place using the least common denominator across their global set of meters… and this was no longer good enough. To solve the problem, they were standardizing on new meters everywhere, and updating their data estate to take advantage of the new hardware. My jaw dropped when I learned that the cost of upgrading was upwards of one hundred million dollars… which was a lot of money at the time.

Much more recently I was working with a retail company with over 5,000 locations across North America. They had similar challenges with similar root causes: their stores did not have consistent point of sale (POS) hardware[6], which meant that different stores produced different data and produced some common data at different grain, and analytics could only take place using the least common denominator data from all stores. Their solution was also similar: they upgraded all POS systems in all stores. I don’t have a dollar amount to put with this investment, but it was certainly significant – especially in an industry where margins are traditionally small and budgets traditionally very conservative.

Both of these stories illustrate organizations taking Roche’s Maxim to the extreme: they transformed their key data literally as far upstream as possible, by making the necessary changes to produce the data in its desired form.[7]

Each of these stories included both technical and non-technical factors. The technical factors revolve around data. The non-technical factors revolve around money. Each company looked at the cost and the benefit and decided that the benefit was greater. They implemented an upstream change that will benefit every downstream system and application, which will simplify their overall data estate, and which corrects a fundamental structural problem in their data supply chain that could only be mitigated, not corrected, by any downstream change.

There’s one additional part of Roche’s Maxim that’s worth elaborating on – what does “necessary” mean? This post has looked at multiple scenarios that emphasize the “as far upstream as possible” part of the maxim – what about the “as far downstream as necessary” part?

Some factors for pushing transformations downstream are technical, like the DAX context example above. Another technical factor might be the availability of data – you can’t produce a given output unless you have all the necessary inputs. Other factors may be organizational – if data is produced by a third party, your ability to apply transformations before a given point may be constrained more by a contract than by technology.

Still other factors may be situational and pragmatic – if team priorities and available resources prevent you from implementing a unit of data transformation logic in the data warehouse, it may be necessary to implement it in your Power BI solution in order to meet project deadlines and commitments.

These are probably the most frustrating types of “necessary” factors, but they’re also some of the most common. Sometimes you need to deliver a less-than-ideal solution and incur technical debt that you would prefer to avoid. The next time you find yourself in such a situation, keep this maxim in mind, and remember that even though it may be necessary to move that data transformation logic downstream today, tomorrow is another day, with different constraints and new opportunities.

Callout: This may be my maxim, but this isn’t the first blog post on the topic. Stuart Box from the UK data consultancy BurningSuit blogged back in March and got there first with this excellent article.


Update June 2021: The video from my June 11 DataMinutes presentation is now available, so if you prefer visual content, this video might be a good place to start.

 

Update May 2022: The fine folks at Greyskull Analytics have added a wonderful t-shirt to their online store. If you want to look cooler than I ever will, you might want to head on over to https://greyskullanalytics.com/ to order yours.

 

Update August 2022: There’s also a video from SQLBits, in case you want a slightly longer version presented to a live audience.

 

Update October 2022: I had the pleasure of visiting Patrick’s cube, and we recorded a video together. You should check it out.


[1] Or a men’s magazine. I really really wanted to use this more pop-culture meaning to make a “DQ” joke playing on the men’s magazine “GQ” but after watching this post languish in my drafts for many months and this joke not even beginning to cohere, I decided I should probably just let it go and move on.

But I did not let it go. Not really.

[2] If you think that sounds pretentious when you read it, imagine how it feels typing it in.

[3] The performance benefit here is not always obvious when working with smaller data volumes, but will become increasingly obvious as the data volume increases. And since the last thing you want to do in this situation is to retrofit your growing Power BI solution because you made poor decisions early on, why not refer to that maxim the next time you’re thinking about adding a calculated column?

[4] This post only spent a week or so in draft form, but during this week I watched an interesting work email conversation unfold. A Power BI customer was experiencing unexpected performance issues related to incremental refresh of a large dataset, and a DAX calculated column on a table with hundreds of millions of records was part of the scenario. The email thread was between members of the engineering and CAT teams, and a few points jumped out at me, including one CAT member observing “in my experience, calculated columns on large tables [can] increase processing times and also can greatly increase the time of doing a process recalc… it also depends on the complexity of the calculated column.”

I don’t have enough knowledge of the Vertipaq engine’s inner workings to jump into the conversation myself, but I did sip my coffee and smile to myself before moving on with my morning. I checked back in on the conversation later on, and saw that a Power BI group engineering manager (GEM) had shared this guidance, presented here with his approval:

“From a pure perf standpoint, it’s true that we can say:

  • The most efficient approach for a large fact table is to have all the columns be present in the source table (materialized views also might work), so that no extra processing is necessary during the import operation (either in Mashup or in DAX)
  • The next most efficient approach for a large fact table is usually going to be to have the computation be part of the M expression, so that it only needs to be evaluated for the rows in the partitions being processed
  • DAX calculated columns are a great option for flexibility and are particularly useful for dimension tables, but will be the least efficient compared to the above two options for large fact tables”

That sounds pretty familiar, doesn’t it? The GEM effectively summarized Roche’s Maxim, including specific guidance for the specific customer scenario. The details will differ from context to context, but I have never found a scenario to which the maxim did not apply.

Yes, this is a challenge for you to tell me where and how I’m wrong.
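To put the GEM’s second option in concrete terms, here is a minimal sketch in Power Query’s M language – the source, table, and column names are hypothetical, and RangeStart/RangeEnd are the standard incremental refresh parameters:

    // Hypothetical sketch: compute the column in the M expression for the fact table,
    // so it only needs to be evaluated for rows in the partitions being processed.
    let
        Source = Sql.Database("warehouse-server", "SalesDB"),        // assumed source
        Fact = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
        Filtered = Table.SelectRows(
            Fact,
            each [OrderDate] >= RangeStart and [OrderDate] < RangeEnd
        ),
        AddMargin = Table.AddColumn(Filtered, "Margin", each [SalesAmount] - [Cost], type number)
    in
        AddMargin

The same Margin column defined as a DAX calculated column would typically be recalculated across the whole table during processing – the “process recalc” cost the CAT member described above.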

[5] Just as Sun Tzu said “To fight and conquer in all our battles is not supreme excellence; supreme excellence consists in breaking the enemy’s resistance without fighting,” supreme excellence in data transformation is not needing to transform the data at all.

[6] That’s “cash registers” to the less retail-inclined readers.

[7] If you feel inclined to point out that in each of these stories there is additional data transformation taking place farther downstream, I won’t argue. You are almost certainly correct… but the Maxim still holds, as the key common transformations have been offloaded into the most upstream possible component in the data supply chain. Like a boss.[8]

[8] Like a supremely excellent[5] boss.

26 thoughts on “Roche’s Maxim of Data Transformation”

  1. I’m going to add “Roche’s Maxim of Data Transformation” to quotes I display in my workspaces…

    The other two are:
    1. The most dangerous phrase in the language is “we’ve always done it this way”
    2. “Autonomy, Mastery and Purpose” (Daniel Pink)


    1. One of these days I will dig up the My Little Pony: Friendship Is Magic episode where the Mane Six take on the organizational stasis implied by your first quote, and then implement a series of process improvements at a local hospitality business and turn it into a blog post.

      I swear I’m not joking.

      (6 to 8 Wikipedia pages later)

      Apparently it’s this one:

      S06E10: “Applejack’s “Day” Off”

      Rarity wishes to spend time with Applejack at the spa, but Applejack is too busy to make good on her promises. Twilight and Spike agree to do Applejack’s chores by her exact instructions, freeing up an hour for Applejack to spend with Rarity. To Rarity’s annoyance, Applejack spends most of this time fixing the spa’s plumbing, which has been accidentally causing larger problems around the spa. Applejack returns to the farm to find that Twilight and Spike have not yet completed their task due to too many unnecessary steps in Applejack’s instructions. Twilight, Spike and Rarity help Applejack streamline her workload, finally allowing her to take the rest of the day off.

      https://en.wikipedia.org/wiki/My_Little_Pony:_Friendship_Is_Magic_(season_6)


  2. Very nice Maxim (and nice to have a phrase for it), Matthew.
    Transforming data as far upstream as possible is really the way to go.

    In some cases, it even makes sense for the transformation to take place in the source system, because that is where it matters. An example of that is a daily financial result that an office signs off on; that result is the basis for a daily report with data from all our offices.

    In some cases, it is not possible to complete the transformation in the data warehouse, so you need a step in the Power BI dataset or the tabular model (or whatever you use to expose data for reporting). An example is when you want to calculate the velocity of something, i.e. the amount produced per unit of time: you need to calculate it in the tabular model or Power BI dataset, because you need to divide the sum of the amount produced by the sum of the time.

    I would recommend almost never creating transformations or measures in Power BI reports, and instead keeping the measures in the data model. In a few cases – where you are exploring new possibilities for transforms, or where you really only need the transforms in this one report – it may be OK.

    Maxim to the max.

