Quick Tip: Creating “data workspaces” for dataflows and shared datasets

Power BI is constantly evolving – there’s a new version of Power BI Desktop every month, and the Power BI service is updated every week. Many of the new capabilities in Power BI represent gradual refinements, but some are significant enough to make you rethink how you and your organization use Power BI.

Power BI dataflows and the new shared and certified datasets[1] fall into the latter category. Both of these capabilities enable sharing data across workspace boundaries. When building a data model in Power BI Desktop you can connect to entities from dataflows in multiple workspaces, and publish the dataset you create into a different workspace altogether. With shared datasets you can create reports and dashboards in one workspace using a dataset in another[2].
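
To make this concrete, here’s a minimal sketch of the kind of Power Query “M” query the dataflows connector generates when you connect to a dataflow entity from Power BI Desktop. The GUIDs and entity name below are placeholders; the point is that the workspace containing the dataflow is completely independent of the workspace you eventually publish your dataset to.

```
let
    // Enumerate the dataflows the current user has access to
    Source = PowerBI.Dataflows(null),
    // Navigate to a specific workspace, dataflow, and entity
    // (the IDs and entity name below are placeholders)
    Workspace = Source{[workspaceId = "00000000-0000-0000-0000-000000000000"]}[Data],
    Dataflow = Workspace{[dataflowId = "00000000-0000-0000-0000-000000000000"]}[Data],
    Entity = Dataflow{[entity = "Customers"]}[Data]
in
    Entity
```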

The ability to have a single data resource – dataflow or dataset – shared across workspaces is a significant change in how the Power BI service has traditionally worked. Before these new capabilities, each workspace was largely self-contained. Dashboards could only get data from a dataset in the same workspace, and each table in the dataset contained the queries that extracted, transformed, and loaded its data. This workspace-centric design encouraged[3] approaches where assets were grouped into workspaces because of the platform’s constraints, and not because that grouping was the best way to meet business requirements.

Now that we’re no longer bound by these constraints, it’s time to start thinking about having workspaces in Power BI whose function is to contain data artifacts (dataflows and/or datasets) that are used by visualization artifacts (dashboards and reports) in other workspaces. It’s time to start thinking about approaches that may look something like this:

[Diagram: data-centric workspaces]

Please keep in mind these two things when looking at the diagram:

  1. This is an arbitrary collection of boxes and arrows that illustrates a concept, not a reference architecture.
  2. I do not have any formal art training.

Partitioning workspaces in this way encourages reuse and can reduce redundancy. It can also help enable greater separation of duties during development and maintenance of Power BI solutions. If you have one team that is responsible for making data available, and another team that is responsible for visualizing and presenting that data to solve business problems[4], this approach can give each team a natural space for its work. Work space. Workspace. Yeah.

Many of the large enterprise customers I work with are already evaluating or adopting this approach. As with any big change, it’s safer to approach the effort incrementally. The customers I’ve spoken to are planning to apply this pattern to new solutions before they think about retrofitting any existing solutions.

Once you’ve had a chance to see how these new capabilities can change how your teams work with Power BI, I’d love to hear what you think.

Edit 2019-06-26: Adam Saxton from Guy In A Cube has published a video on Shared and Certified datasets. If you want another perspective on how this works, you should watch it.


[1] Currently in preview: blog | docs.

[2] If you’re wondering how these capabilities for data reuse relate to each other, you may want to check out this older post, as the post you’re currently reading won’t go into this topic: Lego Bricks and the Spectrum of Data Enrichment and Reuse.

[3] And in some cases, required.

[4] If you don’t, you probably want to think about it. This isn’t the only pattern for successful adoption of Power BI at scale, but it is a very common and tested pattern.

Self-Service BI: Asleep at the wheel?

I’ve long been a fan of the tech news site Ars Technica. They have consistently good writing, and they cover interesting topics that sit at the intersection of technology and life, including art, politics[1], and more.

When Ars published this article earlier this week, it caught my eye – but not necessarily for the reason you might think.

[Screenshot: Ars Technica article about a driver asleep at the wheel of a Tesla]

This story immediately got me thinking about how falling asleep at the wheel is a surprisingly good analogy[2] for self-service BI, and for shadow data in general. The parallels are highlighted in the screen shot above.

  1. Initial reaction: People are using a specific tool in a way we do not want them to use it, and this is definitely not ideal.
  2. Upon deeper inspection: People are already using many tools in this bad way, and were it not for the capabilities of this particular tool, the consequences would likely be much worse.

If you’re falling asleep at the wheel, it’s good to have a car that will prevent you from injuring or killing yourself or others. It’s best to simply not fall asleep at the wheel at all, but that has been sadly shown to be an unattainable goal.

If you’re building a business intelligence solution without involvement from your central analytics or data team, it’s good to have a tool[3] that will help prevent you from misusing organizational data assets and harming your business. It’s best to simply not “go rogue” and build data solutions without the awareness of your central team at all, but that has been sadly shown to be an unattainable goal.

Although this analogy probably doesn’t hold up to close inspection as well as the two-edged sword analogy, it’s still worth emphasizing. I talk with a lot of enterprise Power BI customers, and I’ve had many conversations where someone from IT talks about their desire to “lock down” some key self-service feature or set of features, not fully realizing the unintended consequences that this approach might have.

I don’t want to suggest that this is inherently bad – administrative controls are necessary, and each organization needs to choose the balance that works best for their goals, priorities, and resources. But turning off self-service features can be like turning off Autopilot in a Tesla. Keeping users from using a feature is not going to prevent them from achieving the goal that the feature enables. Instead, it will drive[4] users into using other features and other tools, often with even more damaging consequences.

Here’s a key quote from that Ars Technica article:

We should be crystal clear about one point here: the problem of drivers falling asleep isn’t limited to Tesla vehicles. To the contrary, government statistics show that drowsy driving leads to hundreds—perhaps even thousands—of deaths every year. Indeed, this kind of thing is so common that it isn’t considered national news—which is why most of us seldom hear about these incidents.

In an ideal world, everyone would always be awake and alert when driving, but that isn’t the world we live in. In an ideal world, every organization would have all of the data professionals necessary to engage with every business user in need. We don’t live in that world either.

There’s always room for improvement. Tools like Power BI[5] are getting better with each release. Organizations keep maturing and building more successful data cultures to use those tools. But until we live in an ideal world, we each need to understand the direct and indirect consequences of our choices…


[1] For example, any time I see stories in the non-technical press related to hacking or electronic voting, I visit Ars Technica for a deeper and more informed perspective. Like this one.

[2] Please let me explicitly state that I am in no way minimizing or downplaying the risks of distracted, intoxicated, or impaired driving. I have zero tolerance for these behaviors, and recognize the very real dangers they present. But I also couldn’t let this keep me from sharing the analogy…

[3] As well as the processes and culture that enable the tool to be used to greatest effect, as covered in a recent post: Is self-service business intelligence a two-edged sword?

[4] Pun not intended, believe it or not.

[5] As a member of the Power BI CAT team I would obviously be delighted if everyone used Power BI, but we also don’t live in that world. No matter what self-service BI tool you’ve chosen, these lessons will still apply – only the details will differ.

Is self-service business intelligence a two-edged sword?

I post about Power BI dataflows a lot, but that’s mainly because I love them. My background in data preparation and ETL, combined with dataflows’ general awesomeness, makes them a natural fit for my blog. This means that people often think of me as “the dataflows guy” even though dataflows are actually a small part of my role on the Power BI CAT team. Most of what I do at work is help large enterprise customers successfully adopt Power BI, and help make Power BI a better tool for their scenarios[1].

As part of my ongoing conversations with senior stakeholders from these large global companies, I’ve noticed an interesting trend emerging: customers describing self-service BI as a two-edged sword. This trend is interesting for two main reasons:

  1. It’s a work conversation involving swords
  2. Someone other than me is bringing swords into the work conversation[2]

As someone who has extensive experience with both self-service BI and with two-edged swords, I found myself thinking about these comments more and more – and the more I reflected, the more I believed this simile holds up, but not necessarily in the way you might suspect.

This week in London I delivered a new presentation for the London Power BI User Group – Lessons from the Enterprise: Managed Self-Service BI at Global Scale. In this hour-long presentation I explored the relationship between self-service BI and two-edged swords, and encouraged my audience to consider the following points[4]:

  • The two sharp edges of a sword each serve distinct and complementary purposes.
  • A competent swordsperson knows how and when to use each, and how to use them effectively in combination.
  • Having two sharp edges is only dangerous to the wielder if they are ignorant of their tool.
  • A BI tool like Power BI, which can be used for both “pro” IT-driven BI and self-service business-driven BI, has the same characteristics, and to use it successfully at scale an organization needs to understand its capabilities and know how to use both “edges” effectively in combination.

As you can imagine, there’s more to it than this, so you should probably watch the session recording.

[Image: self-service BI and swords]

If you’re interested in the slides, please download them here: London PUG – 2019-06-03 – Lessons from the Enterprise.

If you’re interested in the videos shown during the presentation, they’re included in the PowerPoint slides, and you can view them on YouTube here.

For those who are coming to the Microsoft Business Applications Summit next week, please consider joining the CAT team’s “Enterprise business intelligence with Power BI” full-day pre-conference session on Sunday. Much of the day will be deep technical content, but we’ll be wrapping up with a revised and refined version of this content, with a focus on building a center of excellence and a culture of data in your organization.

Update 2019-06-10: The slides from the MBAS pre-conference session can be found here: PRE08 – Enterprise business intelligence with Power BI – Building a CoE.

There is also a video of the final demo where Adam Saxton joined me to illustrate how business and IT can work together to effectively respond to unexpected challenges. If you ever wondered what trust looks like in a professional[5] environment, you definitely want to watch this video.



[1] This may be even more exciting for me than Power BI dataflows are, but it’s not as obvious how to share this in blog-sized pieces.

[2] Without this second point, it probably wouldn’t be noteworthy. I have a tendency to bring up swords more often in work conversations than you might expect[3].

[3] And if you’ve been paying attention for very long, you’ll probably expect this to come up pretty often.

[4] Pun intended. Obviously.

[5] For a given value of “professional.”

Positioning Power BI Dataflows (Part 2)

I didn’t plan on writing a sequel to my Positioning Power BI Dataflows post, but a few comments I’ve seen recently have made me think that one might be useful. I also didn’t plan on this article ending up quite as long as it has, but this is the direction in which it ended up needing to go.

One was a comment on my October post about CDM folders, part of a discussion[1] about whether it makes sense to have data warehouses now that we have dataflows. I’d finished my reply by saying “If your scenario includes the ability to add a new dimension to a data warehouse, or to add new attributes to existing dimensions, that’s probably a good direction to choose.” Darryll respectfully disagreed.

[Screenshot: Darryll’s comment]

The point in Darryll’s comment that stuck with me was related to data warehouses becoming an anti-pattern, a “common response to a recurring problem that is usually ineffective and risks being highly counterproductive.” Darryll and I will probably have to agree to disagree.

Update: Darryll was kind enough to comment on this post, so please scroll down for additional context. The rest of this post remains unedited.

Big data platforms like Azure Data Lake Storage Gen2 are enabling “modern data warehouse” scenarios that were not previously possible, and they’re making them more and more accessible. I don’t think there’s any argument on that front. But just because there’s a cool new hammer in the toolbox doesn’t mean that every problem needs to be a big data nail.[2] The need for “traditional” Kimball-style data warehouses hasn’t gone away, and in my opinion isn’t likely to go away any time soon.

The other comment that prompted this post was a tweet from Nimrod on Twitter, in response to my recent blog post about using dataflows as a way to handle slow data sources in a self-service solution when you don’t have a data warehouse.

[Screenshot: Nimrod’s tweet]

Before I proceed I should mention that the next few paragraphs are also informed by Nimrod’s excellent essay “The Self-Service BI Hoax”, which you are strongly encouraged to read. It’s not my goal to respond to this essay in general or in specific terms, but it provides significant context about the likely thinking behind the tweet pictured above.

I’m not sure where Nimrod was going with his “local” comment, since dataflows are built and executed and managed in the Power BI cloud service, but the rest of the post is worth considering carefully, both in the context of positioning and in the context of usage.

I’ve said this many times before, and I suspect I’ll say it many times again: dataflows are not a replacement for data warehouses. I said this in the first paragraph of the post to which Nimrod was responding, and in that post the phrase was a hyperlink back to my initial post on positioning. There will be people who claim that you don’t need a data warehouse if you have dataflows – this is false. This is as false as saying that you don’t need a curated and managed set of data models because you have a self-service BI tool.

Experience has shown time and time again that self-service BI succeeds at scale[3] when it is part of an organized and professional approach to data and analytics. Without structure and management, self-service BI is too often part of the problem, rather than part of the solution. To borrow from Nimrod’s essay, “With some governance, and with a realistic understanding of what the technology can do, the data anarchy can become a data democracy.” The converse also holds true – without that governance, anarchy is likely, and its likelihood increases as the scope of the solution increases.

Despite this, I believe that Power BI dataflows have a better chance to be part of the solution because of how they’re implemented. This is why:

  1. Dataflows are defined and managed by the Power BI service. This means that they can be discovered and understood by Power BI administrators using the Power BI admin API and the dataflows API (see the first sketch after this list). Although the management experience is not yet complete while dataflows are in preview, the fact that dataflows are defined and executed in the context of a single cloud service means that they are inherently more manageable and less anarchic than other self-service alternatives.
  2. Dataflows are self-contained and self-describing in terms of the ETL logic they implement and their data lineage. Each dataflow entity is defined by a Power Query “M” query, and the data in the entity can only result from the execution of that query. This is fundamentally different from tools like Excel, where the logic that defines a dataset is difficult to parse and understand[4], and which would need to be reverse engineered and re-implemented by a developer in order to be included in a central data model. It is also fundamentally different from other self-service data preparation technologies that load data into unmanaged locations where they can be further manipulated with file system or database CRUD operations.
  3. Dataflows lend themselves to process-driven collaboration between business and IT. With a Power BI dataflow entity, an administrator can take the query that defines the entity and reuse it in another context that supports “M” queries, such as a tabular model (see the second sketch after this list). Entities can also be operationalized as-is; any dataflow or entity created by a business user can be added to the central IT-managed data lake. The technology behind dataflows lends itself better to the types of processes that successful BI centers of excellence put in place than do many other data preparation technologies.
  4. Business users are going to prepare and use the data they need regardless of the tools that are made available to them. In an ideal world, every data need that a business user has would be fulfilled by a central IT team in a timely and predictable manner. Sadly, we do not live in this world. In most situations it’s not a matter of choosing dataflows over a professionally-designed data warehouse. It’s a matter of choosing dataflows over an Excel workbook or other self-service solution.
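
To illustrate the first point above, here’s a minimal sketch of what programmatic discovery might look like, expressed in “M” itself. This is an assumption-laden sketch, not official guidance: it assumes you’ve already acquired an Azure AD access token with the appropriate permissions (how to obtain one is out of scope here), and it calls the admin dataflows endpoint of the Power BI REST API.

```
let
    // Assumption: an Azure AD access token acquired separately (placeholder value)
    accessToken = "<access-token>",
    // Call the admin dataflows endpoint of the Power BI REST API
    response = Web.Contents(
        "https://api.powerbi.com/v1.0/myorg/admin/dataflows",
        [Headers = [Authorization = "Bearer " & accessToken]]
    ),
    json = Json.Document(response),
    // The response body contains a "value" array of dataflow metadata records
    dataflows = Table.FromRecords(json[value])
in
    dataflows
```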
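And to illustrate the second and third points, here’s a minimal sketch of what a dataflow entity definition might look like. The source URL and column names are hypothetical, but the shape is representative: the entity’s complete extract-transform-load logic is captured in one readable “M” query that an administrator could reuse as-is in any other “M”-capable host.

```
let
    // Hypothetical source: a CSV file of raw sales data
    Source = Csv.Document(
        Web.Contents("https://example.com/data/sales.csv"),
        [Delimiter = ",", Encoding = 65001]
    ),
    // Promote the first row to column headers
    PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    // Apply explicit types to each column
    TypedColumns = Table.TransformColumnTypes(
        PromotedHeaders,
        {{"OrderDate", type date}, {"Amount", type number}, {"Region", type text}}
    ),
    // Keep only valid rows – the filter logic is visible and auditable
    ValidRows = Table.SelectRows(TypedColumns, each [Amount] > 0)
in
    ValidRows
```

Compare this to the same logic buried in a spreadsheet: here, every step is named, ordered, and inspectable.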

That final point – business users will get the data they need one way or another – makes me think of one[5] of the key tenets of the Kimball Method:

It’s all about the business.
I say this many times during classes and consulting. It’s the most important characteristic of the Kimball Lifecycle method: practical, uncompromising focus on the business. It infuses everything we do, and it’s the single most important message to carry forward.

A mature IT organization will help the business it supports achieve its goals in the best way it can, where “best” is situational and dependent on the many complex factors that need to be balanced in each individual context. When done properly, BI has always been about the business and not about the technology – the technology is merely the means to the end of helping the business make better decisions with better information.

And in this context, dataflows can be part of the solution, or they can be part of the problem. Like other self-service technologies, dataflows present capabilities that can be misused, and which can introduce inconsistencies and duplication across an organization’s data estate, but their design helps mitigate the entropy that self-service approaches introduce into the system. When used as part of a managed approach to governed self-service, dataflows can help remove ad hoc ETL processes, or move them into a context where IT oversight and governance is easier.

Of course, this is a very optimistic conclusion for me to reach. What I’m describing above is what organizations can do if they use dataflows in a well thought out way. It’s not something that can be taken for granted. You need to work for it. And that’s probably the most important thing to keep in mind when evaluating dataflows or any self-service tool: no tool is a silver bullet.

In my mind[6] both of the comments that inspired this post have at their root an important point in the context of positioning Power BI dataflows: you need to choose the correct tool and implement it in the correct manner in order to be successful, and you need to evaluate tools against your requirements based on their capabilities, rather than based on any sales or marketing pitches.

The next time you see someone pitching dataflows as a silver bullet, please point them here. But at the same time, when you see organizations implementing dataflows as part of a managed and governed approach to self-service BI… I’d like to hear about that too.


[1] I won’t repeat everything here, but you can go read the comments on the post yourself if you want to have the full context.

[2] I hope that translates well. In case it doesn’t, here’s a link: https://en.wikipedia.org/wiki/Law_of_the_instrument

[3] I include this qualification because SSBI can indeed be successful for individuals and teams without IT oversight and involvement.

[4] If you’ve ever had a business user or consulting client give you an Excel workbook with a dozen macros and/or hundreds of VLOOKUPs, you’ll know what I mean here.

[5] I recognize that I’m cherry-picking here, but I think this is an important point to make. The Kimball Group web site has 180 design tips, and they’re all worth reading.

[6] I emphasize here that this is my opinion, because I have asked neither Nimrod nor Darryll if this is what they actually meant, and I definitely do not want to falsely portray someone else’s intent. They can correct me as needed.

Are Power BI Dataflows a Master Data Management Tool?

Are Power BI dataflows a master data management tool?

This guy really wants to know.

[Image: man holding pen – from https://www.pexels.com/photo/close-up-photography-of-a-man-holding-ppen-1076801/]

Spoiler alert: No. They are not.

When Microsoft first announced dataflows[1] were coming to Power BI earlier this year, I started hearing a surprising question[2]:

Are dataflows for Master Data Management in the cloud?

The first few times I heard the question, it felt like an anomaly, a non sequitur. The answer[3] seemed so obvious to me that I wasn’t sure how to respond.[4]

But after I’d heard this more frequently, I started asking questions in return, trying to understand what was motivating the question. A common theme emerged: people seemed to be confusing the Common Data Service for Apps used by PowerApps, Microsoft Flow, and Dynamics 365, with dataflows – which were initially called the Common Data Service for Analytics.

The Common Data Service for Apps (CDS) is a cloud-based data service that provides secure data storage and management capabilities for business data entities. Perhaps most relevant to this article, CDS provides a common storage location, which “enables you to build apps using PowerApps and the Common Data Service for Apps directly against your core business data already used within Dynamics 365 without the need for integration.”[5] This common location can be used by multiple applications and processes, and CDS also defines and applies business logic and rules to any application or user manipulating data stored in CDS entities.[6]

And that is starting to sound more like master data management.

When I think about Master Data Management (MDM) systems, I think of systems that:

  • Serve as a central repository for critical organizational data, to provide a single source of truth for transactional and analytical purposes.
  • Provide mechanisms to define and enforce data validation rules to ensure that the master data is consistent, complete, and compliant with the needs of the business.
  • Provide capabilities for matching and de-duplication, as well as cleansing and standardization for the master data they contain.
  • Include interfaces and tools to integrate with related systems in multiple ways, to help ensure that the master data is used (and used appropriately) throughout the enterprise.
  • (yawn)
    And all the other things they do, I guess.[7]

Power BI dataflows do not do these things.

While CDS has many of these characteristics, dataflows fit in here primarily in the context of integration. Dataflows can consume data from CDS and other data sources to make them available for analysis, but their design does not provide any capabilities for the curation of source data, or for transaction processing in general.

Hopefully it is now obvious that Power BI dataflows are not an MDM tool. Dataflows do provide complementary capabilities for self-service data preparation and reuse, and this can include data that comes from MDM systems. But are dataflows themselves for MDM? No, they are not.


[1] At the time, they weren’t called dataflows. Originally they were called the Common Data Service for Analytics, which may well have been part of the problem.

[2] There were many variations on how the question was phrased – this is perhaps the simplest and most common version.

[3] “No.”

[4] Other than by saying “no.”

[5] Taken directly from the documentation.

[6] Please understand that the Common Data Service for Apps is much more than just this. I’m keeping the scope deliberately narrow because this post isn’t actually about CDS.

[7] MDM is a pretty complex topic, and it’s not my intent to go into too much depth. If you’re really interested, you probably want to seek out a more focused source of information. MDM Geek may be a good place to start.