Positioning Power BI Dataflows (Part 2)

I didn’t plan on writing a sequel to my Positioning Power BI Dataflows post, but a few comments I’ve seen recently have made me think that one might be useful. I also didn’t plan on this article ending up quite as long as it has, but this is the direction it needed to take.

One was a comment on my October post about CDM folders, which was part of a discussion[1] about whether it makes sense to have data warehouses now that we have dataflows. I ended my reply by saying “If your scenario includes the ability to add a new dimension to a data warehouse, or to add new attributes to existing dimensions, that’s probably a good direction to choose.” Darryll respectfully disagreed.

[Screenshot: Darryll’s comment]

The point in Darryll’s comment that stuck with me was the suggestion that data warehouses are becoming an anti-pattern: a “common response to a recurring problem that is usually ineffective and risks being highly counterproductive.” Darryll and I will probably have to agree to disagree.

Update: Darryll was kind enough to comment on this post, so please scroll down for additional context. The rest of this post remains unedited.

Big data platforms like Azure Data Lake Storage Gen2 are enabling “modern data warehouse” scenarios that were not previously possible, and they’re making those scenarios more and more accessible. I don’t think there’s any argument on that front. But just because there’s a cool new hammer in the toolbox doesn’t mean that every problem needs to be a big data nail.[2] The need for a “traditional” Kimball-style data warehouse hasn’t gone away, and in my opinion isn’t likely to go away any time soon.

The other comment that prompted this post was from Nimrod on Twitter, in response to my recent blog post about using dataflows to handle slow data sources in a self-service solution when you don’t have a data warehouse.

[Screenshot: Nimrod’s tweet]

Before I proceed, I should mention that the next few paragraphs are also informed by Nimrod’s excellent essay “The Self-Service BI Hoax,” which you are strongly encouraged to read. It’s not my goal to respond to that essay in general or in specific terms, but it provides significant context about the likely thinking behind the tweet pictured above.

I’m not sure where Nimrod was going with his “local” comment, since dataflows are built, executed, and managed in the Power BI cloud service, but the rest of his comment is worth considering carefully, both in the context of positioning and in the context of usage.

I’ve said this many times before, and I suspect I’ll say it many times again: dataflows are not a replacement for data warehouses. I said this in the first paragraph of the post to which Nimrod was responding, and in that post the phrase was a hyperlink back to my initial post on positioning. There will be people who claim that you don’t need a data warehouse if you have dataflows – this is false. This is as false as saying that you don’t need a curated and managed set of data models because you have a self-service BI tool.

Experience has shown time and time again that self-service BI succeeds at scale[3] when it is part of an organized and professional approach to data and analytics. Without structure and management, self-service BI is too often part of the problem, rather than part of the solution. To borrow from Nimrod’s essay, “With some governance, and with a realistic understanding of what the technology can do, the data anarchy can become a data democracy.” The converse also holds true – without that governance, anarchy is likely, and its likelihood increases as the scope of the solution increases.

Despite this, I believe that Power BI dataflows have a better chance to be part of the solution because of how they’re implemented. This is why:

  1. Dataflows are defined and managed by the Power BI service. This means that they can be discovered and understood by Power BI administrators using the Power BI admin API and the dataflows API (see the first sketch following this list). Although the management experience is not yet complete while dataflows are in preview, the fact that dataflows are defined and executed in the context of a single cloud service means that they are inherently more manageable and less anarchic than other self-service alternatives.
  2. Dataflows are self-contained and self-describing in terms of the ETL logic they implement and their data lineage. Each dataflow entity is defined by a Power Query “M” query, and the data in the entity can only result from the execution of that query (the second sketch following this list shows what such a query might look like). This is fundamentally different from tools like Excel, where the logic that defines a dataset is difficult to parse and understand[4], and must be reverse engineered and re-implemented by a developer before it can be included in a central data model. It is also fundamentally different from other self-service data preparation technologies that load data into unmanaged locations, where the data can be further manipulated with file system or database CRUD operations.
  3. Dataflows lend themselves to process-driven collaboration between business and IT. With a Power BI dataflow entity, an administrator can take the query that defines the entity and reuse it in another context that supports “M” queries, such as a tabular model. Entities can also be operationalized as-is; any dataflow or entity created by a business user can be added to the central IT-managed data lake. The technology behind dataflows lends itself better to the types of processes that successful BI centers of excellence put in place than do many other data preparation technologies.
  4. Business users are going to prepare and use the data they need regardless of the tools that are made available to them. In an ideal world, every data need that a business user has would be fulfilled by a central IT team in a timely and predictable manner. Sadly, we do not live in this world. In most situations it’s not a matter of choosing dataflows over a professionally-designed data warehouse. It’s a matter of choosing dataflows over an Excel workbook or other self-service solution.
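
To make the first point a bit more concrete, here is a minimal sketch of what discovering dataflows programmatically might look like. It is written as an “M” query to match the example that follows, though an administrator would more likely use PowerShell or a custom application. The endpoint shown is the admin API’s dataflows endpoint, AccessToken is a placeholder for a valid Azure AD token, and the response shape is an assumption on my part – treat this as an illustration rather than a recipe, especially while dataflows are in preview.

    // A minimal sketch, not production code. Assumes the admin dataflows
    // endpoint and a valid Azure AD bearer token in an AccessToken
    // parameter (acquiring the token is out of scope here).
    let
        RawResponse = Web.Contents(
            "https://api.powerbi.com/v1.0/myorg/admin/dataflows",
            [Headers = [Authorization = "Bearer " & AccessToken]]
        ),
        Parsed = Json.Document(RawResponse),
        // The API wraps its results in a "value" array; turn it into a
        // table so that each row describes one dataflow in the tenant.
        Dataflows = Table.FromRecords(Parsed[value])
    in
        Dataflows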
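
And to illustrate the second point, here is what the definition of a dataflow entity might look like: a single “M” query whose output is the entity’s data. The source URL, column names, and filter rule are all hypothetical, but the shape is representative – everything the entity contains, and every rule applied to it, is visible in the query itself.

    // A hypothetical entity definition: a CSV file of sales data, cleaned
    // up into a typed, filtered table. The entity's data can only be
    // produced by evaluating this one query.
    let
        Source = Csv.Document(
            Web.Contents("https://example.com/sales.csv"),
            [Delimiter = ",", Encoding = 65001]
        ),
        PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
        ChangedTypes = Table.TransformColumnTypes(
            PromotedHeaders,
            {{"OrderDate", type date}, {"Amount", type number}}
        ),
        // Business rule: keep only rows with a positive amount. The rule
        // travels with the entity, visible to anyone who inspects it.
        FilteredRows = Table.SelectRows(ChangedTypes, each [Amount] > 0)
    in
        FilteredRows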

The final point in the list above makes me think of one[5] of the key tenets of the Kimball Method:

It’s all about the business.
I say this many times during classes and consulting. It’s the most important characteristic of the Kimball Lifecycle method: practical, uncompromising focus on the business. It infuses everything we do, and it’s the single most important message to carry forward.

A mature IT organization will help the business it supports achieve its goals in the best way it can, where “best” is situational and dependent on the many complex factors that need to be balanced in each individual context. When done properly, BI has always been about the business and not about the technology – the technology is merely the means to the end of helping the business make better decisions with better information.

And in this context, dataflows can be part of the solution, or they can be part of the problem. Like other self-service technologies, dataflows present capabilities that can be misused and that can introduce inconsistencies and duplication across an organization’s data estate, but their design helps mitigate the entropy that self-service approaches introduce into the system. When used as part of a managed approach to governed self-service, dataflows can help eliminate ad hoc ETL processes, or move them into a context where IT oversight and governance are easier.

Of course, this is a very optimistic conclusion for me to reach. What I’m describing above is what organizations can do if they use dataflows in a well-thought-out way. It’s not something that can be taken for granted. You need to work for it. And that’s probably the most important thing to keep in mind when evaluating dataflows or any self-service tool: no tool is a silver bullet.

In my mind,[6] both of the comments that inspired this post have at their root an important point about positioning Power BI dataflows: you need to choose the correct tool and implement it in the correct manner in order to be successful, and you need to evaluate tools against your requirements based on their capabilities, rather than on any sales or marketing pitches.

The next time you see someone pitching dataflows as a silver bullet, please point them here. But at the same time, when you see organizations implementing dataflows as part of a managed and governed approach to self-service BI… I’d like to hear about that too.


[1] I won’t repeat everything here, but you can go read the comments on the post yourself if you want to have the full context.

[2] I hope that translates well. In case it doesn’t, here’s a link: https://en.wikipedia.org/wiki/Law_of_the_instrument

[3] I include this qualification because SSBI can indeed be successful for individuals and teams without IT oversight and involvement.

[4] If you’ve ever had a business user or consulting client give you an Excel workbook with a dozen macros and/or hundreds of VLOOKUPs, you’ll know what I mean here.

[5] I recognize that I’m cherry-picking here, but I think this is an important point to make. The Kimball Group web site has 180 design tips, and they’re all worth reading.

[6] I emphasize here that this is my opinion, because I have asked neither Nimrod nor Darryll if this is what they actually meant, and I definitely do not want to falsely portray someone else’s intent. They can correct me as needed.

Are Power BI Dataflows a Master Data Management Tool?

Are Power BI dataflows a master data management tool?

This guy really wants to know.

[Image: close-up photograph of a man holding a pen]
Image from https://www.pexels.com/photo/close-up-photography-of-a-man-holding-ppen-1076801/

Spoiler alert: No. They are not.

When Microsoft first announced dataflows[1] were coming to Power BI earlier this year, I started hearing a surprising question[2]:

Are dataflows for Master Data Management in the cloud?

The first few times I heard the question, it felt like an anomaly, a non sequitur. The answer[3] seemed so obvious to me that I wasn’t sure how to respond.[4]

But after I’d heard the question more frequently, I started asking questions in return, trying to understand what was motivating it. A common theme emerged: people seemed to be confusing the Common Data Service for Apps – used by PowerApps, Microsoft Flow, and Dynamics 365 – with dataflows, which were initially called the Common Data Service for Analytics.

The Common Data Service for Apps (CDS) is a cloud-based data service that provides secure data storage and management capabilities for business data entities. Most relevant to this article, CDS provides a common storage location, which “enables you to build apps using PowerApps and the Common Data Service for Apps directly against your core business data already used within Dynamics 365 without the need for integration.”[5] CDS provides a common location for storing data that can be used by multiple applications and processes, and it also defines business logic and rules that are enforced for any application or user manipulating data stored in CDS entities.[6]

And that is starting to sound more like master data management.

When I think about Master Data Management (MDM) systems, I think of systems that:

  • Serve as a central repository for critical organizational data, to provide a single source of truth for transactional and analytical purposes.
  • Provide mechanisms to define and enforce data validation rules to ensure that the master data is consistent, complete, and compliant with the needs of the business.
  • Provide capabilities for matching and de-duplication, as well as cleansing and standardization for the master data they contain.
  • Include interfaces and tools to integrate with related systems in multiple ways, to help ensure that the master data is used (and used appropriately) throughout the enterprise.
  • (yawn)
    And all the other things they do, I guess.[7]

Power BI dataflows do not do these things.

While CDS has many of these characteristics, dataflows fit in here primarily in the context of integration. Dataflows can consume data from CDS and other data sources to make that data available for analysis, but their design does not provide any capabilities for the curation of source data, or for transaction processing in general.

Hopefully it is now obvious that Power BI dataflows are not an MDM tool. Dataflows do provide complementary capabilities for self-service data preparation and reuse, and this can include data that comes from MDM systems. But are dataflows themselves for MDM? No, they are not.


[1] At the time, they weren’t called dataflows. Originally they were called the Common Data Service for Analytics, which may well have been part of the problem.

[2] There were many variations on how the question was phrased – this is perhaps the simplest and most common version.

[3] “No.”

[4] Other than by saying “no.”

[5] Taken directly from the documentation.

[6] Please understand that the Common Data Service for Apps is much more than just this. I’m keeping the scope deliberately narrow because this post isn’t actually about CDS.

[7] MDM is a pretty complex topic, and it’s not my intent to go into too much depth. If you’re really interested, you probably want to seek out a more focused source of information. MDM Geek may be a good place to start.