Positioning Power BI Dataflows (Part 2)

I didn’t plan on writing a sequel to my Positioning Power BI Dataflows post, but a few comments I’ve seen recently have made me think that one might be useful. I also didn’t plan on this article ending up quite as long as it has, but this is the direction in which it ended up needing to go.

One was a comment on my October post related to CDM folders, that was part of a discussion[1] about whether it makes sense to have data warehouses now that we have dataflows. I’d finished replying by saying “If your scenario includes the ability to add a new dimension to a data warehouse, or to add new attributes to existing dimensions, that’s probably a good direction to choose.” Darryll respectfully disagreed.

2018-12-01_13-43-32

The point in Darryll’s comment that stuck with me was related to data warehouses becoming an anti-pattern, a “common response to a recurring problem that is usually ineffective and risks being highly counterproductive.” Darryll and I will probably have to agree to disagree.

Update: Darryl was kind enough to comment on this post, so please scroll down for additional context. The rest of this post remains unedited.

Big data platforms like Azure Data Lake Storage gen2 are enabling “modern data warehouse” scenarios that were not previously possible, and they’re making them more and more accessible. I don’t think there’s any argument on that front. But just because there is a new cool hammer in the toolbox, this doesn’t mean that every problem needs to be a big data nail.[2] The need for “traditional” Kimball-style data warehouse hasn’t gone away, and in my opinion isn’t likely to go away any time soon.

The other comment that prompted this post was a from Nimrod on Twitter, in response to my recent blog post about using dataflows as a way to handle slow data sources in a self-service solution when you don’t have a data warehouse.

2018-12-01_13-41-00

Before I proceed I should mention that the next few paragraphs are also informed by Nimrod’s excellent essay “The Self-Service BI Hoax“, which you are strongly encouraged to read. It’s not my goal to respond to this essay in general or in specific terms, but it provides significant context about the likely thinking behind the tweet pictured above.

I’m not sure where Nimrod was going with his “local” comment, since dataflows are built and executed and managed in the Power BI cloud service, but the rest of the post is worth considering carefully, both in the context of positioning and in the context of usage.

I’ve said this many times before, and I suspect I’ll say it many times again: dataflows are not a replacement for data warehouses. I said this in the first paragraph of the post to which Nimrod was responding, and in that post the phrase was a hyperlink back to my initial post on positioning. There will be people who claim that you don’t need a data warehouse if you have dataflows – this is false. This is as false as saying that you don’t need a curated and managed set of data models because you have a self-service BI tool.

Experience has shown time and time again that self-service BI succeeds at scale[3] when it is part of an organized and professional approach to data and analytics. Without structure and management, self-service BI is too often part of the problem, rather than part of the solution. To borrow from Nimrod’s essay, “With some governance, and with a realistic understanding of what the technology can do, the data anarchy can become a data democracy.” The converse, also holds true – without that governance, anarchy is likely, and its likelihood increases as the scope of the solution increases.

Despite this, I believe that Power BI dataflows have a better chance to be part of the solution because of how they’re implemented. This is why:

  1. Dataflows are defined and managed by the Power BI service. This means that they can be discovered and understood by Power BI administrators using the Power BI admin API and the dataflows API as well. Although the management experience is not yet complete while dataflows are in preview, the fact that dataflows are defined and executed in the context of a single cloud service means that they are inherently more manageable and less anarchic than other self-service alternatives.
  2. Dataflows are self-contained and self-describing in terms of the ETL logic they implement and their data lineage. Each dataflow entity is defined by a Power Query “M” query, and the data in the entity can only result from the execution of that query. This is fundamentally different from tools like Excel, where the logic that defines a dataset is difficult to parse and understand[4], and which would need to be reverse engineered and re-implemented by a developer in order to be included in a central data model. It is also fundamentally different from other self-service data preparation technologies that load data into unmanaged locations where they can be further manipulated with file system or database CRUD operations.
  3. Dataflows lend themselves to process-driven collaboration between business and IT. With a Power BI dataflow entity, an administrator can take the query that defines the entity and reuse it in another context that supports “M” queries such as a tabular model. They can also also be operationalized as-is; any dataflow or entity created by a business user can be added to the central IT-managed data lake. The technology behind dataflows lends itself better to the types of processes that successful BI centers of excellence put in place than do many other data preparation technologies.
  4. Business users are going to prepare and use the data they need regardless of the tools that are made available to them. In an ideal world, every data need that a business user has would be fulfilled by a central IT team in a timely and predictable manner. Sadly, we do not live in this world. In most situations it’s not a matter of choosing dataflows over a professionally-designed data warehouse. It’s a matter of choosing dataflows over an Excel workbook or other self-service solution.

This final point makes me think of one[5] of the key tenants of the Kimball Method:

It’s all about the business.
I say this many times during classes and consulting. It’s the most important characteristic of the Kimball Lifecycle method: practical, uncompromising focus on the business. It infuses everything we do, and it’s the single most important message to carry forward.

A mature IT organization will help the business it supports achieve its goals in the best way it can, where “best” is situational and dependent on the many complex factors that need to be balanced in each individual context. When done properly, BI has always been about the business and not about the technology – the technology is merely the means to the end of helping the business make better decisions with better information.

And in this context, dataflows can be part of the solution, or they can be part of the problem. Like other self-service technologies, dataflows present capabilities that can be misused, and which can introduce inconsistencies and duplication across an organization’s data estate, but their design helps mitigate the entropy that self-service approaches introduce into the system. When used as part of a managed approach to governed self-service, dataflows can help remove ad hoc ETL processes, or move them into a context where IT oversight and governance is easier.

Of course, this is a very optimistic conclusion for me to reach. What I’m describing above is what organizations can do if they use dataflows in a well thought out way. It’s not something that can be taken for granted. You need to work for it. And that’s probably the most important thing to keep in mind when evaluating dataflows or any self-service tool: no tool is a silver bullet.

In my mind[6] both of the comments that inspired this post have at their root an important point in the context of positioning Power BI dataflows: you need to choose the correct tool and implement it in the correct manner in order to be successful, and you need to evaluate tools against your requirements based on their capabilities, rather than based on any sales or marketing pitches.

The next time you see someone pitching dataflows as a silver bullet, please point them here. But at the same time, when you see organizations implementing dataflows as part of a managed and governed self-service BI… I’d like to hear about that too.


[1] I won’t repeat everything here, but you can go read the comments on the post yourself if you want to have the full context.

[2] I hope that translates well. In case it doesn’t, here’s a link: https://en.wikipedia.org/wiki/Law_of_the_instrument

[3] I include this qualification because SSBI can indeed be successful for individuals and teams without IT oversight and involvement.

[4] If you’ve ever had a business user or consulting client give you an Excel workbook with a dozen macros and/or hundreds of VLOOKUPs, you’ll know what I mean here.

[5] I recognize that I’m cherry-picking here, but I think this is an important point to make. The Kimball Group web site has 180 design tips, and they’re all worth reading.

[6] I emphasize here that this is my opinion, because I have asked neither Nimrod nor Darryll if this is what they actually meant, and I definitely do not want to falsely portray someone else’s intent. They can correct me as needed.

3 thoughts on “Positioning Power BI Dataflows (Part 2)

  1. Pingback: Dataflows in Power BI – BI Polar

  2. Darryll Petrancuri

    Matthew:

    I want to clarify my position regarding data warehouses as an anti-pattern for many.

    I believe the traditional data warehouse stored in a star-schema (Kimball) in a relational database such as SQL Server may be an anti-pattern for many given latency, schema volatility, etc. I say this because through the use of Power BI / Analysis Services, a star-schema tabular model may be able to be implemented directly in Power BI using dataflows, common data model, etc., if the size of the dataset is viable and manageable within such. Certainly, there are technologies available, such as schema-on-read through Polybase (HDFS, etc) connectors available within SQL Server, Azure SQL DB / DW, that help to reduce the impact of such on data ingestion and availability. But in and of themselves they do not rationalize the use of a relational data warehouse.

    In summary, I am simply suggesting that technologies such as data flows, especially as they mature, offer compelling choices for architectures which may move away from the venerable norm.

    Respectfully,

    Darry;;

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s