Microsoft Fabric has only been in preview for a week, and I’ve already written one post that covers data governance – do we really need another one already?
Dave’s excellent question and comment[1] got me thinking about why OneLake feels so important to him (and to me) even though Fabric is so much more than any one part – even a part as central as OneLake. The more I thought about it, the more the pieces fell into place in my mind, and the more I found myself thinking about one of my favorite quotes[2]:
A complex system that works is invariably found to have evolved from a simple system that works. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work.
Please take a minute to reflect on this quote. Ask yourself: if Fabric is a complex system that works, what is the simple system that works? We’ll come back to that.
One of the most underappreciated benefits of Power BI as a managed SaaS data platform has been the “managed” part. When you create a report, dataset, dataflow, or other item in Power BI, the Power BI service knows everything about it. Power BI is the authoritative system for all items it contains, which means that Power BI can answer questions related to lineage (where does the data used by this report come from?) and impact analysis (where is the data in this dataset used?) and compliance (who has permissions to access this report?) and more.
If you’ve ever tried to authoritatively answer questions like these for a system of any non-trivial scope, you know how hard it is. Power BI has made this information increasingly available to administrators, through logs and APIs, and the community has built a wide range of free and paid solutions to help admins turn this information into insights and action. Even more excitingly, Power BI keeps getting better and better even as the newer parts of Fabric seem to be getting all of the attention.
What does all this have to do with Fabric and OneLake and simple systems?
For data governance and enablement, Power BI is the simple system that works. OneLake is the mechanism through which the additional complexity of Fabric builds on the success of Power BI. Before the introduction of Fabric, the scope of Power BI was typically limited to the “final mile” of the data supply chain. There is a universe of upstream complexity that includes transactional systems, ETL/ELT/data preparation systems, data warehouses, lakes, and lakehouses, and any number of related building blocks. Having accessible insights into the Power BI tenant is great, but its value is constrained by the scope of the tenant and its contents.
All Fabric workloads use OneLake as their default data location. OneLake represents the biggest single step forward in moving from simpler to more complex, because it is a major expansion of the SaaS foundation shared by all Fabric workloads, new and old. Because of Fabric, and because OneLake is the heart of Fabric, governance teams can now get more of the things they love about Power BI for more parts of the data estate.
Why should your governance team be excited about Microsoft Fabric? They should be excited because Fabric has the potential to make their lives much easier. Just as Fabric can help eliminate the complexity of integration, it can also help reduce the complexity of governance.
[1] Yes, we have Dave to thank and/or blame for this post.
[2] This massive pearl of wisdom is from The Systems Bible by John Gall. I first encountered it in the early 90s in the introduction to an OOP textbook, and have been inspired by it ever since. This quote should be familiar to anyone who has ever heard me talk about systems and/or data culture.
DALL-E prompt “power bi tenant settings administrator” because I couldn’t think of a better image to use
Until now, there hasn’t been a way to programmatically monitor tenant settings. Administrators needed to manually review and document settings to produce user documentation or complete audits. Now the GetTenantSettings API enables administrators to get a JSON response with all tenant settings and their values. With this information you can more easily and reliably share visibility into tenant settings across all of the processes where you need it.
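If you want to see what this looks like in practice, here’s a minimal sketch in Python. It assumes you’ve already acquired an Azure AD access token with Power BI admin permissions, and that the endpoint and response field names shown below match the current API – treat it as a starting point and check the official documentation for the authoritative details.

```python
import json
import requests

# Assumption: an Azure AD access token for an identity with
# Power BI / Fabric administrator permissions, acquired separately.
ACCESS_TOKEN = "<your-access-token>"

# Admin endpoint that returns every tenant setting and its value.
url = "https://api.powerbi.com/v1.0/myorg/admin/tenantsettings"

response = requests.get(url, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
response.raise_for_status()

# The response body is JSON; the exact shape may evolve, so read it defensively.
settings = response.json().get("tenantSettings", [])

# Print each setting's name, whether it's enabled, and any security groups it's scoped to.
for setting in settings:
    groups = [g.get("name") for g in setting.get("enabledSecurityGroups", [])]
    print(setting.get("settingName"), setting.get("enabled"), groups)

# Keep a snapshot on disk so you can diff tenant settings between audits.
with open("tenant-settings-snapshot.json", "w") as f:
    json.dump(settings, f, indent=2)
```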
If you’re a visual learner, check out this excellent video from Robert Hawker at Meloro Analytics that walks through using and understanding the API.
That’s it. That’s the post. I almost missed this important announcement with all of the other news this week – and I wanted to make sure you didn’t miss it too.
[1] If you haven’t attended one of our past events, we’re both going to be in Dublin in less than two weeks, and I will be in Copenhagen in September. Given the way our schedules are looking, we don’t expect to have any more in-person appearances before the end of the year. If you’ve been waiting for an event closer to you, you’ll probably be waiting until 2024 or later.
The data internet this week is awash with news and information about Microsoft Fabric. My Introducing Microsoft Fabric post on Tuesday got just under ten thousand views in the first 24 hours, which I believe is a record for this blog.
Even more exciting than the numbers are the comments. Bike4thewin replied with this excellent comment and request:
I would love to hear your thought on how to adopt this on Enterprise level and what could be the best practices to govern the content that goes into OneLake. In real life, I’m not sure you want everyone in the organisation to be able to do all of this without compromising Data Governance and Data Quality.
There’s a lot to unpack here, so please understand that this post isn’t a comprehensive answer to all of these topics – it’s just my thoughts as requested.
In the context of enterprise adoption, all of the guidance in the Power BI adoption roadmap and my video series on building a data culture applies to Fabric and OneLake. This guidance has always been general best practices presented through the lens of Power BI, and most of it is equally applicable to the adoption of other self-service data tools. Start there, knowing that although some of the details will be different, this guidance is about the big picture more than it is about the details.
In the context of governance, let’s look at the Power BI adoption roadmap again, this time focusing on the governance article. To paraphrase this article[1], the goal of successful governance is not to prevent people from working with data. The goal should be to make it as easy as possible for people to work with data while aligning that work with the goals and culture of the organization.
Since I don’t know anything about the goals or culture that inform Bike4thewin’s question, I can’t respond to them directly… but reading between the lines I think I see an “old school” perspective on data governance rearing its head. I think that part of this question is really “how do I keep specific users from working with specific data, beyond using security controls on the data sources?”
The short answer is you probably shouldn’t, even if you could. Although saying “no” used to work sometimes, saying “yes, and” is almost always the better approach – no matter what your technology stack is. This post on data governance and self-service BI[2] provides the longer answer.
As you’re changing the focus of your governance efforts to be more about enabling the proper use of data, Fabric and OneLake can help.
Data in OneLake can be audited and monitored using the same tools and techniques you use today for other items in your Power BI tenant. This is a key capability of Fabric as a SaaS data platform – the data in Fabric can be more reliably understood than data in general, because of the SaaS foundation.
The more you think about the “OneDrive for data” tagline for OneLake, the more it makes sense. Before OneDrive[3], people would store their documents anywhere and everywhere. Important files would be stored on users’ hard drives, or on any number of file servers that proliferated wildly. Discovering a given document was typically a combination of tribal knowledge and luck, and there were no reliable mechanisms to manage or govern the silos and the sprawl. Today, organizations that have adopted OneDrive have largely eliminated this problem – documents get saved in OneDrive, where they can be centrally managed, governed, and secured.
To make things even more exciting, the user experience is greatly improved. People can choose to save their documents in other locations, but every Office application saves to OneDrive by default. Documents in OneDrive can be easily discovered, accessed, and shared by the people who need to work with them, and easily monitored and governed by the organization. People still create and use the documents they need, and there are still consistent security controls in place, but the use of a central managed SaaS service makes things better.
Using OneLake has the potential to deliver the same type of benefits for data that OneDrive delivers for documents. I believe that when we’re thinking about what users do with OneLake we shouldn’t be asking “what additional risk is involved in letting users do the things they’re already doing, but in a new tool?” Instead, we should ask “how do we enable users to do the things they’re already doing, using a platform that provides greater visibility to administrators?”
In addition to providing administrator capabilities for auditing and monitoring, OneLake also includes capabilities for data professionals who need to discover and understand data. The Power BI data hub[4] has been renamed the OneLake data hub in Fabric, and allows users to discover data in the lake for which they already have permissions, or which the owners have marked as discoverable.
The combination of OneLake and the OneLake data hub provides compelling benefits for data governance: it’s easier for users to discover and use trusted data without creating duplicates, and it’s easier for administrators to understand who is doing what with what data.
I’ll close with two quick additional points:
Right before we announced Fabric, the Power BI team announced the preview of new admin monitoring capabilities for tenant administrators. I haven’t had the chance to explore these new capabilities, but they’re designed to make administrative oversight easier than ever.
I haven’t mentioned data quality, even though it’s part of the comment to which this post is responding. Data quality is a big and complicated topic, and I don’t think I can do it justice in a timely manner… so I’m going to take a pass on this one for now.
Thanks so much for the awesome comments and questions!
[1] And any number of posts (1 | 2 | 3 | 4 | 5 | 6 | 7 | …) on this site as well.
[2] The linked post is from exactly two years ago, as I write this new post. What are the odds?
[3] In this context I’m thinking specifically about OneDrive for Business, not the consumer OneDrive service.
[4] The data hub was originally released in preview in late 2020, and has been improving since then. It’s one of the hidden gems in Power BI, and is a powerful tool for data discovery… but since I haven’t blogged about it before now, I guess I can’t complain too loudly when people don’t know it exists.
Power BI includes capabilities to enable users to understand the content they own, and how different items relate to each other. Sometimes you may need a custom “big picture” view that built-in features don’t deliver, and this is where the Scanner API comes in.
No, not this kind of scanner
The Power BI Scanner API is a subset of the broader Power BI Admin API. It’s designed to be a scalable, asynchronous tool for administrators to extract metadata for the contents of their Power BI tenant[1]. For an introduction to the Scanner API, check out this blog post from when it was introduced in December 2020.
The Power BI team has been updating the Scanner API since it was released. This week they announced some significant new capabilities added to the API, so administrators can get richer and more complete metadata, including:
Scheduled refresh settings for datasets, dataflows, and datamarts – this will make it easier for administrators to review their refresh schedules and identify problems and hotspots that may have undesired effects.
Additional RDL data source properties – this will make it easier for administrators to understand paginated reports and the data sources they use.
Additional “sub-artifact” metadata for datasets – this will make it easier for administrators to understand table- and query-level configuration including row-level security and parameters.
The Scanner API is a vital tool for any organization that wants to deeply understand how Power BI is being used, with a goal of enabling and guiding adoption and building a data culture. These updates represent an incremental but meaningful evolution of the tool. If you’re already using the Scanner API, you may want to look at how to include this new metadata in your scenario. If you’re not yet using the Scanner API, maybe now is the time to begin…
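If you’re wondering what “scalable, asynchronous” looks like in practice, here’s a rough Python sketch of the scan workflow: request a scan for a batch of workspaces, poll until it completes, then download the result. It assumes an admin access token acquired elsewhere, and the endpoint and parameter names reflect my understanding of the getInfo / scanStatus / scanResult APIs – confirm them against the documentation before building anything on top of this.

```python
import time
import requests

ACCESS_TOKEN = "<admin-access-token>"  # assumption: acquired separately
BASE = "https://api.powerbi.com/v1.0/myorg/admin/workspaces"
HEADERS = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

# 1. Request a scan for a batch of workspaces, asking for the richer metadata
#    (dataset schemas and expressions, data source details, lineage).
workspace_ids = ["<workspace-guid-1>", "<workspace-guid-2>"]
scan = requests.post(
    f"{BASE}/getInfo",
    params={
        "datasetSchema": "true",
        "datasetExpressions": "true",
        "datasourceDetails": "true",
        "lineage": "true",
    },
    headers=HEADERS,
    json={"workspaces": workspace_ids},
)
scan.raise_for_status()
scan_id = scan.json()["id"]

# 2. The scan runs asynchronously - poll until it reports success.
while True:
    status = requests.get(f"{BASE}/scanStatus/{scan_id}", headers=HEADERS).json()
    if status.get("status") == "Succeeded":
        break
    time.sleep(10)

# 3. Download the result: one JSON document describing the workspaces, their items,
#    and the newly added metadata like refresh schedules and sub-artifact details.
result = requests.get(f"{BASE}/scanResult/{scan_id}", headers=HEADERS).json()
for workspace in result.get("workspaces", []):
    print(workspace.get("name"), "-", len(workspace.get("datasets", [])), "datasets")
```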
[1] One of the key scenarios enabled by the Scanner API is integration with Microsoft Purview and third party data catalog tools like Collibra. When these tools “scan” Power BI to populate their catalogs, they’re calling this API.
At the 2022 PASS Community Data Summit this November, I’m thrilled to be co-presenting a full-day pre-conference session with the one and only Melissa Coates [blog | Twitter | LinkedIn]. We’ll be presenting our all-day session live and in-person in Seattle on Tuesday, November 15, 2022.
What’s the Session?
The Hitchhiker’s Guide to Adopting Power BI in Your Organization
What’s the Session About?
The Power BI Adoption Roadmap is a collection of best practices and suggestions for getting more value from your data and your investment in Power BI. The Power BI Adoption Roadmap is freely available to everyone — but not everyone is really ready to start their journey without a guide. Melissa and I will be your guides…while you’re hitchhiking…on the road…to reach the right destination…using the roadmap. (You get it now, right?!?)
We’ll do an end-to-end tour of the Power BI Adoption Roadmap. During the session we’ll certainly talk about all of the key areas (like data culture, executive sponsorship, content ownership and management, content delivery scope, center of excellence, mentoring and user enablement, community of practice, user support, data governance, and system oversight).
Smart Power BI architecture decisions are important – but there’s so much more to a successful Power BI implementation than just the tools and technology. It’s the non-technical barriers, related to people and processes, that are often the most challenging. Self-service BI also presents constant challenges related to balancing control and oversight with freedom and flexibility. Implementing Power BI is a journey, and it takes time. Our goal is to give you plenty of ideas for how you can get more value from your data by using Power BI in the best ways.
We promise this won’t be a boring day merely regurgitating what you can read online. We’ll share lessons learned from customers, what works, what to watch out for, and why. There will be ample opportunity for Q&A, so you can get your questions answered and hear what challenges other organizations are facing. This will be a highly informative and enjoyable day for you to attend either in-person or virtually.
Who is the Target Audience?
To get the most from this pre-conference session: You need to be familiar with the Power BI Adoption Roadmap and the Power BI Implementation Planning guidance. You should have professional experience working with Power BI (or other modern self-service BI tools), preferably at a scope larger than a specific team. Deep technical knowledge about Power BI itself isn’t required, but the more you know about Power BI and its use, the more you’ll walk away with from this session.
She wrote it and emailed it to me and I shamelessly[1] stole it, which may be why there haven’t been any footnotes[2]. I even stole the banner image[3].
[1] With her permission, of course.
[2] Until these ones.
[3] Yes, Jeff. Stealing from Melissa is a Principal-level behavior.
To successfully implement managed self-service business intelligence at any non-trivial scale, you need data governance. Data governance is also an essential part of building and nurturing a successful data culture.
Despite this fact, and despite the obvious value it can provide, data governance has a bad reputation. Many people – likely including the leaders you need as allies if you’re working to build a data culture in your organization – have had negative experiences with data governance in the past, and now react negatively whenever the topic is raised.
Stakeholders’ past experiences can make your job much more difficult as you attempt to work with them to enable managed self-service within an organization. Governance is what they need. Governance is what you want to help them achieve. When you say “governance” you’re thinking about erecting guardrails rather than roadblocks, about making it easier for people to do the right things when working with the right data… but that’s not what they hear.
The label shouldn’t really matter – but it does.
Data governance, and building a data culture in general, is as much about people as it is about processes and technology, and that means effective communication is key. Effective communication requires a shared vocabulary, and a shared understanding of the meaning of key words.
My idea when starting to write this post was to propose “Data Culture Enablement” as the more friendly label, but as I was searching around for related content[1], I found that Dan Sutherland of EY proposed something simpler: Data Enablement.
Dan’s post is coming at things from a different angle, but it’s clear he has a similar goal in mind: “emphasiz[ing] empowerment, innovation and instant business value consumption.”
While I was searching, I found a few more interesting articles out there – they’re all well worth your time to read if you’ve made it this far:
R. Danes at SiliconANGLE highlights the value of data governance beyond the classic “thou shalt not” use cases. In this article she quotes the group chief data officer at ING Bank N.V. who says “Governance is just a horrible word,” and highlights how “People have really negative connotations associated with it.” ING may be the most mature organization I’ve ever engaged with in the context of data governance and metadata, so this is a fascinating quote from a very well-informed source.
Randy Bean and Thomas Davenport from Harvard Business Review[2] have researched how companies are failing in their efforts to become data-driven, and cite business leaders who recommend “trying to implement agile methods in key programs, while avoiding terms like ‘data governance’ that have a negative connotation for many executives.”
People think governance is about somebody with a big stick but it’s not. It’s about getting people to communicate and talk about their data and being in a position to ask for what they need with their data. The people on the other end need to understand they have a responsibility to meet that requirement if possible.
But when asked explicitly about coming up with a new term, she replied “For a while, I toyed with the idea of starting a campaign to re-name it but I didn’t think it was worth adding to the confusion surrounding the term by coming up with another title.”
She’s probably right.
But because of the negative connotations that the term “data governance” carries today, we should all exercise care when we use it. We should be careful to ensure that the meaning we’re trying to convey is clearly received – regardless of the terms we’re using.
That feels like an ending, but I’m not done. I want to close with a story.
Five years ago I was working on a data governance product, and as part of that work I talked with lots of customers and with Microsoft customer-facing employees. In those conversations I frequently heard people say things to the effect of “don’t use the ‘G word’ when you’re meeting with this leader – if you say ‘governance’ he’s going to stop listening and the meeting will end.” This didn’t happen in every conversation, but it happened in many.
Last month I hosted a multi-day NDA event with senior technical and business stakeholders responsible for adopting and implementing Power BI and Azure Synapse at some of the biggest companies in the world. The event was focused on Power BI and Synapse, but the customer representatives kept bringing up the importance of data governance in session after session, and conversation after conversation[3]. It was like night and day compared to the conversations I had when I was trying to get people to care about governance.
Has the world changed that much, or am I just talking to different people now?
I think it’s probably a bit of both. The world has definitely changed – more and more organizations are recognizing the value and importance of data and are increasingly treating data as an asset that needs to be managed and curated. But these days I also engage mostly with organizations that are atypically mature, and are further along on their data culture journey than most. With this selection bias, it’s probably not surprising that I’m having a different experience.
I’ll close with a question and a thought: Who are you talking to about data governance?
[1] I’m always terrified that I will post something that someone else has already posted. On more than one occasion I’ve completed and published a new post, only to find out that I had written the same thing months or years earlier on this blog. Sigh.
[2] I’m pretty sure I’ve referenced this post before. It’s good enough to reference again. You should read it again.
[3] Thankfully we had included some key folks from the Azure Purview team as well.
One of the key success factors for organizations to thrive today is adopting modern self-service business intelligence tools and transforming their businesses to become more agile, more automated, and more data driven. For years technology vendors and industry analysts have thrown around the term “digital transformation” to broadly describe this phenomenon, and technology has matured to catch up with the hype.
I use the term “hype” here deliberately. In my experience the term “digital transformation” has been thrown around in the same way as the terms “cloud” and “big data” were thrown around, just a few years later. The cynical part of my brain initially categorized it as “marketing bullshit,” but the intervening years have shown me that this wasn’t actually the case. Digital transformation is real, and it’s a key driver for a successful data culture with real executive support.
Over the past few years I’ve had hundreds of conversations with executives, decision-makers, and business and technical leaders from hundreds of enterprise Power BI customer organizations. These are real people working to solve real problems, and they come from a broad range of industries, geographies, and levels of maturity. I learned a lot from these conversations, and have done my best to help Power BI improve based on what I learned[1], but when I step back and look at the bigger picture there’s a significant trend that emerges.
Stakeholders from organizations that adopt Power BI[2] as part of a digital transformation describe more mature data cultures, and a greater return on their investments in data and analytics.
As you can probably imagine, once I saw this correlation[3], I kept seeing it again and again. And I started looking more closely at digital transformation as part of my ongoing work around data culture. Two of the most interesting resources I’ve found are articles from the Harvard Business Review, which may not be the first place you think to look when you’re thinking about Power BI and data culture topics… but these two articles provide important food for thought.
The first article is almost six years old – it’s from 2015, and focuses on The Company Cultures That Help (or Hinder) Digital Transformation. In the article, author Jane McConnell describes five of the most difficult obstacles that prevent organizations from adopting the changes required by a digital transformation:
(Please feel strongly encouraged to click through and read the whole article – it’s well worth your time, and goes into topics I won’t attempt to summarize here.)
I suspect these challenges sound as depressingly familiar to you as they do to me. These obstacles weren’t new in 2015, and they’re not new now – but they’re also not going away.
Jane McConnell goes on to identify what characteristics are shared by organizations that have overcome these obstacles and are succeeding with their digital transformations. The alignment between her conclusions and this blog’s guidance for Building a data culture with Power BI is striking[4]:
A strong, shared sense of purpose alleviates many obstacles, especially those of internal politics. When an organization has a clearly defined strategy, it is easier for everyone to align their work towards those strategic goals, and to justify that work in the face of opposition.
Distributed decision-making gives people at the edges of organizations a voice in digital transformation. Although there is a need for centralized decision-making and control for some aspects of a data culture (data sources, applications, policies, etc.) the real power of managed self-service BI comes from letting IT do what IT does best, and letting business experts make informed business decisions without undue governance.
Organizations that are responsive to the influence of the external world are more likely to understand the value digital can bring. My customer engagements don’t provide any insight to share on this specific point, but I suspect this is significant too. Organizations that are not responsive to external factors are unlikely to make it onto my calendar for those strategic conversations.
In her conclusion, Jane McConnell suggests that readers who see these obstacles in their way should “find ways to transform your work culture using digital as a lever.” In the context of the Harvard Business Review’s target audience, this advice makes a lot of sense.[5] If you are a senior business leader, shaping the work culture is something you are empowered and expected to do. If you’re not, this is where having an engaged and committed executive sponsor will come in handy. If you don’t already have that sponsor, framing your conversations using the vocabulary of digital transformation may help in ways that talking about data culture might not.
(As with the article discussed above, please feel encouraged to click through and read this one too. Both articles are written by professionals with significant experience, and an informed strategic perspective.)
This article starts off with an excellent statement of fact:
Whether their larger goal is to achieve digital transformation, “compete on analytics,” or become “AI-first,” embracing and successfully managing data in all its forms is an essential prerequisite.
It then goes on to inventory some of the ways that organizations are failing to deliver on this essential prerequisite, including “72% of survey participants report that they have yet to forge a data culture.”[6]
I’ll let you read the source article for more numbers and details, but there is one more quote I want to share:
93% of respondents identify people and process issues as the obstacle.
If you’ve attended any of my “Building a data culture with Power BI” presentations, you’ll know that I break it down into two main sections: the easy part, and the hard part. Spoiler alert: The easy part is technology. The hard part is people.
The article by Bean and Davenport includes a lot of insights and ideas, but not a lot of hope. They’ve talked to senior data leaders who are trying various approaches to build data cultures within their enterprise organizations, but they all see a long march ahead, with hard work and few quick wins. Technology is a vital part of the transformation, but people and culture are necessary as well.
Building a successful data culture requires top-down and bottom-up change. If you’re in a position of authority where you can directly influence your organization’s culture, it’s time to roll up your sleeves and get to work. If you’re not, it’s time to start thinking about the changes you can make yourself – but it’s also time to start thinking about how using the vocabulary of digital transformation might help you reach the senior leaders whose support you need.
[1] This is your periodic reminder that although I am a member of the Power BI Customer Advisory Team at Microsoft, and although I regularly blog about topics related to Power BI, this is my personal blog and everything I write is my personal perspective and does not necessarily represent the views of my employer or anyone other than me.
[4] The bolded text in this list is taken from the HBR article; the rest of the text is from me.
[5] Even if the use of “digital” is sooooo 2015.
[6] Now I wish that I had found this article before I started my data culture series, because I definitely would have used “forge” instead of “build” as the verb everywhere.
According to the internet, a maxim is a succinct formulation of a fundamental principle, general truth, or rule of conduct.[1] Maxims tend to relate to common situations and topics that are understandable by a broad range of people.
Topics like data transformation.
Roche’s Maxim of Data Transformation[2] states:
Data should be transformed as far upstream as possible, and as far downstream as necessary.
In this context “upstream” means closer to where the data is originally produced, and “downstream” means closer to where the data is consumed.
By transforming data closer to its ultimate source, costs can be reduced, and the value added through data transformation can be applied to a greater range of uses. The farther downstream a given transformation is applied, the more expensive it tends to be – often because operations are performed more frequently – and the smaller the scope of potential value through reuse.
I’ve been using this guideline for many years, but I only started recognizing it as a maxim in the past year or so. The more I work with enterprise Power BI customers the more I realize how true and how important it is – and how many common problems could be avoided if more people thought about it when building data solutions.
Please note that this maxim is generalizable to data solutions implemented using any tools or technology. The examples below focus on Power BI because that’s where I spend my days, but these principles apply to every data platform I have used or seen used.
In day-to-day Power BI conversations, perhaps the most common question to which Roche’s Maxim applies is about where to implement a given unit of logic: “Should I do this in DAX or in Power Query?”
Short answer: Do it in Power Query.
If you’re ever faced with this question, always default to Power Query if Power Query is capable of doing what you need – Power Query is farther upstream. Performing data transformation in Power Query ensures that when your dataset is refreshed the data is loaded into the data model in the shape it needs to be in. Your report logic will be simplified and thus easier to maintain, and will likely perform better[3] because the Vertipaq engine will need to do less work as users interact with the report.[4]
But what if you need data transformation logic that depends on the context of the current user interacting with the report – things like slicers and cross-filtering? This is the perfect job for a DAX measure, because Power Query doesn’t have access to the report context. Implementing this logic farther downstream in DAX makes sense because it’s necessary.
Another common question to which Roche’s Maxim applies is also about where to implement a given unit of logic: “Should I do this in Power BI or in the data warehouse?”
Short answer: Do it in the data warehouse.
If you’re ever faced with this question, always default to transforming the data into its desired shape when loading it into the data warehouse – the data warehouse is farther upstream. Performing data transformation when loading the data warehouse ensures that any analytics solution that uses the data has ready access to what it needs – and that every solution downstream of the warehouse is using a consistent version of the data.
From a performance perspective, it is always better to perform a given data transformation as few times as possible, and it is best to not need to transform data at all.[5] Data transformation is a costly operation – transforming data once when loading into a common location like a data warehouse, data mart, or data lake, is inherently less costly than transforming it once for every report, app, or solution that uses that common location.
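To make that cost argument concrete, here’s a deliberately simplified Python sketch using hypothetical file and column names. The same cleansing logic is either applied once, when the shared table is loaded, or re-run by every report that reads the raw data – paying the transformation cost again for each consumer and risking drift between implementations.

```python
import pandas as pd

def cleanse(raw: pd.DataFrame) -> pd.DataFrame:
    """The shared transformation: standardize codes, convert currency, drop bad rows."""
    out = raw.copy()
    out["product_code"] = out["product_code"].str.strip().str.upper()
    out["amount_usd"] = out["amount_local"] * out["fx_rate"]
    return out.dropna(subset=["product_code", "amount_usd"])

# Upstream: transform once when loading the shared (warehouse/lake) table...
curated = cleanse(pd.read_csv("raw_sales.csv"))  # hypothetical raw extract
curated.to_parquet("curated_sales.parquet")

# ...so every downstream consumer reads data that is already consistent.
report_a = pd.read_parquet("curated_sales.parquet").groupby("product_code")["amount_usd"].sum()

# Downstream (the costlier pattern): each report re-runs the same cleansing logic
# against the raw extract, multiplying the cost by the number of consumers.
report_b = cleanse(pd.read_csv("raw_sales.csv")).groupby("product_code")["amount_usd"].mean()
```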
A much less common question to which Roche’s Maxim applies might be “What about that whole ‘not transforming at all’ pattern you mentioned a few paragraphs back – how exactly does that dark magic work?”
Short answer: Have the data already available in the format you need it to be in.
That short answer isn’t particularly useful, so here are two brief stories to illustrate what I mean.
Many years ago I was working with an oil & gas company in an engagement related to master data management. This company had a fundamental data problem: the equipment on their drilling platforms around the world was not standardized, and different meters reported the same production data differently. These differences in measurement meant that all downstream reporting and data processing could only take place using the least common denominator across their global set of meters… and this was no longer good enough. To solve the problem, they were standardizing on new meters everywhere, and updating their data estate to take advantage of the new hardware. My jaw dropped when I learned that the cost of upgrading was upwards of one hundred million dollars… which was a lot of money at the time.
Much more recently I was working with a retail company with over 5,000 locations across North America. They had similar challenges with similar root causes: their stores did not have consistent point of sale (POS) hardware[6], which meant that different stores produced different data and produced some common data at different grain, and analytics could only take place using the least common denominator data from all stores. Their solution was also similar: they upgraded all POS systems in all stores. I don’t have a dollar amount to put with this investment, but it was certainly significant – especially in an industry where margins are traditionally small and budgets traditionally very conservative.
Both of these stories illustrate organizations taking Roche’s Maxim to the extreme: they transformed their key data literally as far upstream as possible, by making the necessary changes to produce the data in its desired form.[7]
Each of these stories included both technical and non-technical factors. The technical factors revolve around data. The non-technical factors revolve around money. Each company looked at the cost and the benefit and decided that the benefit was greater. They implemented an upstream change that will benefit every downstream system and application, which will simplify their overall data estate, and which corrects a fundamental structural problem in their data supply chain that could only be mitigated, not corrected, by any downstream change.
There’s one additional part of Roche’s Maxim that’s worth elaborating on – what does “necessary” mean? This post has looked at multiple scenarios that emphasize the “as far upstream as possible” part of the maxim – what about the “as far downstream as necessary” part?
Some factors for pushing transformations downstream are technical, like the DAX context example above. Other technical factors might be the availability of data – you can’t produce a given output unless you have all necessary inputs. Others may be organizational – if data is produced by a 3rd party, your ability to apply transformations before a given point may be constrained more by a contract than by technology.
Still other factors may be situational and pragmatic – if team priorities and available resources prevent you from implementing a unit of data transformation logic in the data warehouse, it may be necessary to implement it in your Power BI solution in order to meet project deadlines and commitments.
These are probably the most frustrating types of “necessary” factors, but they’re also some of the most common. Sometimes you need to deliver a less-than-ideal solution and incur technical debt that you would prefer to avoid. The next time you find yourself in such a situation, keep this maxim in mind, and remember that even though it may be necessary to move that data transformation logic downstream today, tomorrow is another day, with different constraints and new opportunities.
Update June 2021: The video from my June 11 DataMinutes presentation is now available, so if you prefer visual content, this video might be a good place to start.
Update May 2022: The fine folks at Greyskull Analytics have added a wonderful t-shirt to their online store. If you want to look cooler than I ever will, you might want to head on over to https://greyskullanalytics.com/ to order yours.
Update August 2022: There’s also a video from SQLBits, in case you want a slightly longer version presented to a live audience.
Update October 2022: I had the pleasure of visiting Patrick’s cube, and we recorded a video together. You should check it out.
[1] Or a men’s magazine. I really really wanted to use this more pop-culture meaning to make a “DQ” joke playing on the men’s magazine “GQ” but after watching this post languish in my drafts for many months and this joke not even beginning to cohere, I decided I should probably just let it go and move on.
But I did not let it go. Not really.
[2] If you think that sounds pretentious when you read it, imagine how it feels typing it in.
[3] The performance benefit here is not always obvious when working with smaller data volumes, but will become increasingly obvious as the data volume increases. And since the last thing you want to do in this situation is to retrofit your growing Power BI solution because you made poor decisions early on, why not refer to that maxim the next time you’re thinking about adding a calculated column?
[4] This post only spent a week or so in draft form, but during this week I watched an interesting work email conversation unfold. A Power BI customer was experiencing unexpected performance issues related to incremental refresh of a large dataset, and a DAX calculated column on a table with hundreds of millions of records was part of the scenario. The email thread was between members of the engineering and CAT teams, and a few points jumped out at me, including one CAT member observing “in my experience, calculated columns on large tables [can] increase processing times and also can greatly increase the time of doing a process recalc… it also depends on the complexity of the calculated column.”
I don’t have enough knowledge of the Vertipaq engine’s inner workings to jump into the conversation myself, but I did sip my coffee and smile to myself before moving on with my morning. I checked back in on the conversation later on, and saw that a Power BI group engineering manager (GEM) had shared this guidance, presented here with his approval:
“From a pure perf standpoint, its true that we can say:
The most efficient approach for a large fact table is to have all the columns be present in the source table (materialized views also might work), so that no extra processing is necessary during the import operation (either in Mashup or in DAX)
The next most efficient approach for a large fact table is usually going to be to have the computation be part of the M expression, so that it only needs to be evaluated for the rows in the partitions being processed
DAX calculated columns are a great option for flexibility and are particularly useful for dimension tables, but will be the least efficient compared to the above two options for large fact tables”
That sounds pretty familiar, doesn’t it? The GEM effectively summarized Roche’s Maxim, including specific guidance for the specific customer scenario. The details will differ from context to context, but I have never found a scenario to which the maxim did not apply.
Yes, this is a challenge for you to tell me where and how I’m wrong.
[5] Just as Sun Tzu said “To fight and conquer in all our battles is not supreme excellence; supreme excellence consists in breaking the enemy’s resistance without fighting,” supreme excellence in data transformation is not needing to transform the data at all.
[6] That’s “cash registers” to the less retail-inclined readers.
[7] If you feel inclined to point out that in each of these stories there is additional data transformation taking place farther downstream, I won’t argue. You are almost certainly correct… but the Maxim still holds, as the key common transformations have been offloaded into the most upstream possible component in the data supply chain. Like a boss.[8]
When you hear someone say that governance and self-service BI don’t go together, or some variation on the tired old “Power BI doesn’t do data governance” trope, you should immediately be skeptical.
During a recent Guy in a Cube live stream there was a great discussion about self-service BI and data governance, and about how in most larger organizations Power BI is used for self-service and non-self-service BI workloads. The discussion starts around the 27:46 mark in the recording if you’re interested.
As is often the case, this discussion sparked my writing muse and I decided to follow up with a brief Twitter thread to share a few thoughts that didn’t fit into the stream chat. That brief thread turned out to be much larger and quite different than what I expected… big enough to warrant its own blog post. This post.
Please consider this well-known quote: “No plan survives contact with the enemy.”
In his 1871 essay Helmuth von Moltke called out an obvious truth: battle is inherently unpredictable, and once enemy contact is made a successful commander must respond to actual conditions on the ground – not follow a plan that is more outdated with every passing minute.
At the same time, that commander must have and must adhere to strategic goals for the engagement. Without these goals, how could they react and respond and plan as the reality of the conflict changes constantly and unpredictably?
Implementing managed self-service business intelligence – self-service BI hand-in-hand with data governance – exhibits many of the same characteristics.
Consider a battlefield, where one force has overwhelming superiority: More soldiers, more artillery, more tanks, and a commanding position of the terrain. The commander of that force knows that any enemy who faces him on this field will fail. The enemy knows this too.
And because the enemy knows this, they will not enter the field to face that superior force. They will fade away, withdraw from direct conflict, and strike unexpectedly, seeking out weaknesses and vulnerabilities. This is the nature of asymmetric warfare.
The commander of the more powerful force probably knows this too, and will act accordingly. The smart commander will present opportunities that their enemies will perceive as easily exploitable weaknesses, to draw them in and thus to bring that overwhelming force to bear.
And this brings us naturally back to the topic of data governance, self-service business intelligence, and dead Prussian field marshals.
Seriously.
In many large organizations, the goal of the data governance group is to ensure that data is never used improperly, and to mitigate (often proactively and aggressively mitigate) the risk of improper use.
In many large organizations, the data governance group has an overwhelming battlefield advantage. They make the rules. They define the processes. They grant or deny access to the data. No one gets in without their say-so, and woe unto any business user who enters that field of battle, and tries to get access to data that is under the protection of this superior force.
Of course, the business users know this. They’re outgunned and outmanned, and they know the dire fate that awaits them if they try to run the gauntlet that the data governance team has established. Everyone they know who has ever tried has failed.
So they go around it. They rely on the tried and true asymmetric tactics of self-service BI. The CSV export. The snapshot. The Excel files and SharePoint lists with manually-entered data.
Rather than facing the data governance group and their overwhelming advantages, they build a shadow BI solution.
These veteran business users choose not to join a battle they’re doomed to lose.
They instead seek and find the weak spots. They achieve their goals despite all of the advantages and resources that the data governance group has at their disposal.
Every time. Business users always find a way.
This is where a savvy data governance leader can learn from the battlefield. Just as a military commander can draw in their opponents and then bring their superior forces to bear, the data governance group can present an attractive and irresistible target to draw in business users seeking data.
This is the path to managed self-service business intelligence… and where the whole military analogy starts to break down. Even though data governance and self-service BI have different priorities and goals, these groups should not and must not be enemies. They need to be partners for either to succeed.
Managed self-service BI succeeds when it is easier for business users to get access to the data they need by working within the processes and systems established by the data governance group, rather than circumventing them.[1]
Managed self-service BI succeeds when the data governance group enables processes and systems to give business users the access they need to the data they need, while still maintaining the oversight and control required for effective governance.
Managed self-service BI succeeds when the data governance group stops saying “no” by default, and instead says “yes, and” by default.
Yes you can get access to this data, and these are the prerequisites you must meet.
Yes you can get access to this data, and these are the allowed scenarios for proper use.
Yes you can get access to this data, and these are the resources to make it easy for you to succeed.
What business user would choose to build their own shadow BI solution that requires manual processes and maintenance just to have an incomplete and outdated copy when they could instead have access to the real data they need – the complete, trusted, authoritative, current data they need – just by following a few simple rules?[2]
Managed self-service BI succeeds when the data governance group provides business users with the access they need to the data they need to do their jobs, while retaining the oversight and control the data governance group needs to keep their jobs.
This is a difficult balancing act, but there are well-known patterns to help organizations of any size succeed.
At this point you may be asking yourself what this has to do with plans not surviving contact with the enemy. Everything. It has everything to do with this pithy quote.
The successful data governance group will have a plan, and that plan will be informed by well-understood strategic goals. The plan is the plan, but the plan is made to change as the battle ebbs and flows. The strategy does not change moment to moment or day to day.
So as more business users engage, and as the initial governance plan shows its gaps and inadequacies, the data governance group changes the plan, keeping it aligned with the strategy and informed by the reality of the business.
Although this post has used a martial metaphor to help engage the reader, this is not the best mental model to take away. Data governance and self-service business intelligence are not at war, even though they are often in a state of conflict or friction.
The right mental model is of a lasting peace, with shared goals and ongoing tradeoffs and compromises as each side gives and takes, and contributes to those shared goals.
This is what a successful data culture looks like: a lasting peace.
Multiple people replied to the original Twitter thread citing various challenges to succeeding with managed self-service business intelligence, balancing SSBI with effective data governance. Each of those challenges highlights the importance of an effective partnership between parties, and the alignment of business and IT priorities into shared strategic goals and principles that allow everyone to succeed together.
If you want to explore these concepts further and go beyond the highlights in this post, please feel encouraged to check out the full “Building a Data Culture with Power BI” series of posts and videos. Acknowledging the fact that data governance and self-service BI go beautifully together is just the beginning.
[1] This is really important
[2] Yes, yes, we all know that guy. Sometimes the data governance team needs the old stick for people who don’t find the new carrot attractive enough… but those people tend to be in the minority if you use the right carrot.
No, not that one. Imagine walking into a nicer restaurant than the one you thought of at first. A lot nicer.
Even nicer than this.
Imagine walking into a 3-star Michelin-rated best-in-the-world restaurant, the kind of place where you plan international travel around reservations, the kind of place where the chef’s name is whispered in a kind of hushed awe by other chefs around the world.
Now imagine being seated and then insisting that the chef cook a specific dish in a specific way, because that’s what you’re used to eating, because you know what you like and what you want.
I’ll just leave this here for no particular reason.
In this situation, one of three things is likely to happen:
The chef will give you what you ask for, and your dining experience will be diminished because your request was granted.
The chef will ask you to leave.
The chef will instruct someone else to ask you to leave.[1]
Let’s step back from the culinary context of this imaginary scenario, and put it into the context of software development and BI.
Imagine a user emailing a developer or software team[2] and insisting that they need a feature developed that works in some specific way. “Just make it do this!” or maybe “It should be exactly like <legacy software feature> but <implemented in new software>!!”
I can’t really imagine the restaurant scene playing out – who would spend all that money on a meal just to get what they could get anywhere? But I don’t need to imagine the software scene playing out, because I’ve seen it day after day, month after month for decades, despite the fact that even trivial software customization can be more expensive than a world-class meal. I’ve also been on both sides of the conversation – and I probably will be again.
When you have a problem, you are the expert on the problem. You know it inside and out, because it’s your problem. You’ve probably tried to solve it – maybe you’ve tried multiple solutions before you asked for help. And while you were trying those ineffective solution approaches, you probably thought of what a “great” solution might look like.
So when you ask for help, you ask for the solution you thought of.
This is bad. Really bad.
“Give me this solution” or “give me this feature” is the worst thing to ask for. Because while you may be the expert on your problem, you’re not an expert on the solution. If you were, you wouldn’t be asking for help in the first place.
And to make matters worse, most of the people on the receiving end aren’t the IT equivalents of 3-star Michelin-rated chefs. They’re line cooks, and they give you what you asked for because they don’t know any better. And because the customer is always right, right?
Yeah, nah.
As a software professional, it’s your job to solve your customers’ problems, and to do so within constraints your customers probably know nothing about, and within an often-complex context your customers do not understand[3]. If you simply deliver what the customer asks for, you’ve missed the point, and missed an opportunity to truly solve the fundamental problem that needs to be solved.
If you’re a BI professional, every project and every feature request brings with it an opportunity. It’s the opportunity to ask questions.
Why do you need this?
When will you use it?
What are you doing today without the thing you’re asking for?
When will this be useful?
Who else will use it?[4]
As a software or BI professional, you’re the expert on the solution, just as your customer is the expert on the problem. You know where logic can be implemented, and the pros and cons of each option. You know where the right data will come from, and how it will need to be transformed. You know what’s a quick fix and what will require a lot of work – and might introduce undesirable side-effects or regressions in other parts of the solution.
With this expertise, you’re in the perfect position to ask the right questions to help you understand the problem that needs to be solved. You’re in the perfect position to take the answers to your questions and to turn them into what your customer really needs… which is often very different from what they’re asking for.
You don’t need to ask these questions every time. You may not even need to ask questions of your customers most of the time[5]. But if you’re asking these questions of yourself each time you’re beginning new work – and asking questions of your customers as necessary – the solutions you deliver will be better for it.
And when you find yourself on the requesting side (for example, when you find yourself typing into ideas.powerbi.com) you’re in the perfect position to provide information about the problem you need solved – not just the solution you think you need. Why not give it a try?
This is a complex topic. I started writing this post almost 100 years ago, way back in February 2020[6]. I have a lot more that I want to say, but instead of waiting another hundred years I’ll wrap up now and save more thoughts for another post or two.
If you’ve made it this far and you’re interested in more actual best practices, please read Lean Customer Development by Cindy Alvarez. This book is very accessible, and although it is targeted more at startups and commercial software teams it contains guidance and practices that can be invaluable for anyone who needs to deliver solutions to someone else’s problems.
[1] This seems like the most likely outcome to me.
[2] This could be a commercial software team or “the report guy” in your IT department. Imagine what works for you.
[3] If you’re interested in a fun and accessible look at how the Power BI team decides what features to build, check out this 2019 presentation from Power BI PM Will Thompson. It’s only indirectly related to this post, but it’s a candid look at some of the “often-complex context” in which Power BI is developed.
[4] Please don’t focus too much on these specific questions. They might be a good starting point, but they’re just what leaped to mind as I was typing, not a well-researched list of best practice questions or anything of the sort.
[5] If you’re a BI developer maintaining a Power BI application for your organization, you may have already realized that asking a ton of questions all the time may not be appreciated by the people paying your salary, so please use your own best judgment here.
[6] This probably explains why I so casually mentioned the idea of walking into a restaurant. I literally can’t remember the last time I was in a restaurant. Do restaurants actually exist? Did they ever?