Are you building a BI house of cards?

Every few weeks I see someone asking about using Analysis Services as a data source for Power BI dataflows. Every time I hear this, I cringe, and then include advice like this[1] in my response.

Using Analysis Services as a data source is an anti-pattern – a worst practice. It is not recommended, and any solution built using this pattern is likely to produce dissatisfied customers. Please strongly consider using other data sources, likely the data sources on which the AS model is built.


There are multiple reasons for this advice.


Some reasons are technical. Extraction of large volumes of data is not what an Analysis Services model is designed for. Performance for the ETL process is likely to be poor, and you’re likely to end up with memory/caching issues on the Analysis Services server. Beyond this, AS models typically don’t include the IDs/surrogate keys that you need for data warehousing, so joining the AS data to other data sources will be problematic.[2]
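
To make the anti-pattern concrete, here is a minimal Power Query (M) sketch of the kind of extraction this advice argues against; the server, model, and table names are invented for illustration.

```
// Hypothetical extraction from an Analysis Services model - the approach
// this post advises against. A large EVALUATE over a fact table forces the
// server to materialize and stream the full result, and the columns returned
// are the model's business-friendly columns, not the surrogate keys that
// downstream ETL would need for joins.
let
    Source = AnalysisServices.Database(
        "asserver.contoso.com",
        "SalesModel",
        [Query = "EVALUATE 'Internet Sales'"]
    )
in
    Source
```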


For some specific examples and technical deep dives into how and why this is a bad idea, check out this excellent blog post from Shabnam Watson. The focus of the post is on SSAS memory settings, but it’s very applicable to the current discussion.


Some reasons for this advice are less technical, but no less important. Using analytics models as data sources for ETL processing is a strong code smell[3] (“any characteristic in the source code of a program that possibly indicates a deeper problem”) for business intelligence solutions.


Let’s look at a simple and familiar diagram:


01 good


There’s a reason this left-to-right flow is the standard representation of BI applications: it’s what works. Each component has specific roles and responsibilities that complement each other, and which are aligned with the technology used to implement the component. This diagram includes a set of logical “tiers” or “layers” that are common in analytics systems, and which mutually support each other to achieve the systems’ goals.
Although there are many successful variations on this theme, they all tend to have this general flow and these general layers. Consider this one, for example:


02 ok

This example has more complexity, but also has the same end-to-end flow as the simple one. This is pretty typical for scenarios where a single data warehouse and analytics model won’t fulfill all requirements, so the individual data warehouses, data marts, and analytics models each contain a portion – often an overlapping portion – of the analytics data.

Let’s look at one more:

03 - trending badly

This design is starting to smell. The increased complexity and blurring of responsibilities will produce difficulties in data freshness and maintenance. The additional dependencies, and the redundant and overlapping nature of those dependencies, mean that any future changes will require additional investigation and care to ensure that there are no unintended side effects on the existing functionality.

As an aside, my decades of working in data and analytics suggest that this care will rarely actually be taken. Instead, this architecture will be fragile and prone to problems, and the teams that built it will not be the teams who solve those problems.

And then we have this one[4]:

04 - hard no

This is what you get when you use Analysis Services as the data source for ETL processing, whether that ETL and downstream storage is implemented in Power BI dataflows or in other technologies. And this is probably the best case you’re likely to get when you go down this path. Even with just two data warehouses and two analytics models in the diagram, the complex and unnatural dependencies are obvious, and are painful to consider.

What would be better here?[5] As mentioned at the top of the post, the logical alternative is to avoid using the analytics model and to instead use the same sources that the analytics model already uses. This may require some refactoring to ensure that the duplication of logic is minimized. It may require some political or cross-team effort to get buy-in from the owners of the upstream systems. It may not be simple, or easy. But it is almost always the right thing to do.
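
In contrast, here is a hedged sketch of what “using the same sources the AS model uses” might look like in Power Query (M), assuming a hypothetical SQL Server data warehouse underneath the model; all names are illustrative.

```
// Hypothetical query against the warehouse the AS model is built on.
// The relational source is designed for bulk extraction, and the surrogate
// keys remain available for joining to other data downstream.
let
    Source = Sql.Database("sqldw.contoso.com", "AdventureWorksDW"),
    FactSales = Source{[Schema = "dbo", Item = "FactInternetSales"]}[Data],
    KeepColumns = Table.SelectColumns(
        FactSales,
        {"SalesOrderNumber", "ProductKey", "CustomerKey", "OrderDateKey", "SalesAmount"}
    )
in
    KeepColumns
```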

Don’t take shortcuts to save days or weeks today that will cause you or your successors months or years to undo and repair. Don’t build a house of cards, because with each new card you add, the house is more and more likely to fall.


[1] Something a lot like this. I copied this from a response I sent a few days ago.

[2] Many thanks to Chris Webb for some of the information I’ve paraphrased here. If you want to hear more from Chris on this subject, check out this session recording from PASS Summit 2017. The whole session is excellent; the information most relevant to this subject begins around the 26 minute mark in the recording. Chris also gets credit for pointing me to Shabnam Watson’s blog.

[3] I learned about code smells last year when I attended a session by Felienne Hermans at Craft Conference in Budapest. You can watch the session here. And you really should, because it’s really good.

[4] My eyes are itching just looking at it. It took an effort of will to create this diagram, much less share it.

[5] Yes, just about anything would be better.

Quick Tip: Creating “data workspaces” for dataflows and shared datasets

Power BI is constantly evolving – there’s a new version of Power BI Desktop every month, and the Power BI service is updated every week. Many of the new capabilities in Power BI represent gradual refinements, but some are significant enough to make you rethink how you and your organization use Power BI.

Power BI dataflows and the new shared and certified datasets[1] fall into the latter category. Both of these capabilities enable sharing data across workspace boundaries. When building a data model in Power BI Desktop you can connect to entities from dataflows in multiple workspaces, and publish the dataset you create into a different workspace altogether. With shared datasets you can create reports and dashboards in one workspace using a dataset in another[2].

The ability to have a single data resource – dataflow or dataset – shared across workspaces is a significant change in how the Power BI service has traditionally worked. Before these new capabilities, each workspace was largely self-contained. Dashboards could only get data from a dataset in the same workspace, and the tables in the dataset each contained the queries that extracted, transformed, and loaded their data. This workspace-centric design encouraged[3] approaches where assets were grouped into workspaces because of the constraints of the platform, and not because that grouping was the best way to meet the business requirements.

Now that we’re no longer bound by these constraints, it’s time to start thinking about having workspaces in Power BI whose function is to contain data artifacts (dataflows and/or datasets) that are used by visualization artifacts (dashboards and reports) in other workspaces. It’s time to start thinking about approaches that may look something like this:

data centric workspaces

Please keep in mind these two things when looking at the diagram:

  1. This is an arbitrary collection of boxes and arrows that illustrate a concept, and not a reference architecture.
  2. I do not have any formal art training.

Partitioning workspaces in this way encourages reuse and can reduce redundancy. It can also help enable greater separation of duties during development and maintenance of Power BI solutions. If you have one team that is responsible for making data available, and another team that is responsible for visualizing and presenting that data to solve business problems[4], this approach can give each team a natural space for its work. Work space. Workspace. Yeah.

Many of the large enterprise customers I work with are already evaluating or adopting this approach. Like any big change, it’s safer to approach this effort incrementally. The customers I’ve spoken to are planning to apply this pattern to new solutions before they think about retrofitting any existing solutions.

Once you’ve had a chance to see how these new capabilities can change how your teams work with Power BI, I’d love to hear what you think.

Edit 2019-06-26: Adam Saxton from Guy In A Cube has published a video on Shared and Certified datasets. If you want another perspective on how this works, you should watch it.


[1] Currently in preview: blog | docs.

[2] If you’re wondering how these capabilities for data reuse relate to each other, you may want to check out this older post, as the post you’re currently reading won’t go into this topic: Lego Bricks and the Spectrum of Data Enrichment and Reuse.

[3] And in some cases, required.

[4] If you don’t, you probably want to think about it. This isn’t the only pattern for successful adoption of Power BI at scale, but it is a very common and tested pattern.

Self-Service BI: Asleep at the wheel?

I’ve long been a fan of the tech news site Ars Technica. They have consistently good writing, and they cover interesting topics that sit at the intersection of technology and life, including art, politics[1], and more.

When Ars published this article earlier this week, it caught my eye – but not necessarily for the reason you might think.

sleeping tesla

This story immediately got me thinking about how falling asleep at the wheel is a surprisingly good analogy[2] for self-service BI, and for shadow data in general. The parallels are highlighted in the screen shot above.

  1. Initial reaction: People are using a specific tool in a way we do not want them to use it, and this is definitely not ideal.
  2. Upon deeper inspection: People are already using many tools in this bad way, and were it not for the capabilities of this particular tool the consequences would likely be much worse.

If you’re falling asleep at the wheel, it’s good to have a car that will prevent you from injuring or killing yourself or others. It’s best to simply not fall asleep at the wheel at all, but that has been sadly shown to be an unattainable goal.

If you’re building a business intelligence solution without involvement from your central analytics or data team, it’s good to have a tool[3] that will help prevent you from misusing organizational data assets and harming your business. It’s best to simply not “go rogue” and build data solutions without the awareness of your central team at all, but that has been sadly shown to be an unattainable goal.

Although this analogy probably doesn’t hold up to close inspection as well as the two-edged sword analogy, it’s still worth emphasizing. I talk with a lot of enterprise Power BI customers, and I’ve had many conversations where someone from IT talks about their desire to “lock down” some key self-service feature or set of features, not fully realizing the unintended consequences that this approach might have.

I don’t want to suggest that this is inherently bad – administrative controls are necessary, and each organization needs to choose the balance that works best for their goals, priorities, and resources. But turning off self-service features can be like turning off Autopilot in a Tesla. Keeping users from using a feature is not going to prevent them from achieving the goal that the feature enables. Instead, it will drive[4] users into using other features and other tools, often with even more damaging consequences.

Here’s a key quote from that Ars Technica article:

We should be crystal clear about one point here: the problem of drivers falling asleep isn’t limited to Tesla vehicles. To the contrary, government statistics show that drowsy driving leads to hundreds—perhaps even thousands—of deaths every year. Indeed, this kind of thing is so common that it isn’t considered national news—which is why most of us seldom hear about these incidents.

In an ideal world, everyone would always be awake and alert when driving, but that isn’t the world we live in. In an ideal world, every organization would have all of the data professionals necessary to engage with every business user in need. We don’t live in that world either.

There’s always room for improvement. Tools like Power BI[5] are getting better with each release. Organizations keep maturing and building more successful data cultures to use those tools. But until we live in an ideal world, we each need to understand the direct and indirect consequences of our choices…


[1] For example, any time I see stories in the non-technical press related to hacking or electronic voting, I visit Ars Technica for a deeper and more informed perspective. Like this one.

[2] Please let me explicitly state that I am in no way minimizing or downplaying the risks of distracted, intoxicated, or impaired driving. I have zero tolerance for these behaviors, and recognize the very real dangers they present. But I also couldn’t let this keep me from sharing the analogy…

[3] As well as the processes and culture that enable the tool to be used to greatest effect, as covered in a recent post: Is self-service business intelligence a two-edged sword?

[4] Pun not intended, believe it or not.

[5] As a member of the Power BI CAT team I would obviously be delighted if everyone used Power BI, but we also don’t live in that world. No matter what self-service BI tool you’ve chosen, these lessons will still apply – only the details will differ.

Customer Stories sessions from MBAS 2019

In addition to many excellent technical sessions, last week’s Microsoft Business Applications Summit (MBAS) event also included a series of “Customer Stories” sessions that may be even more valuable.

My recent “Is self-service business intelligence a two-edged sword?” post has gotten more buzz than any of my recent technical posts. Some of this might be due to the awesome use of swords[1] and the inclusion of a presentation recording and slides, but much of it is due to how it presents guidance for successfully implementing managed self-service BI at the scale needed by a global enterprise.

Well, if you liked that post, you’re going to love these session recordings from MBAS. These “Customer Stories” sessions are hosted by the Power BI CAT team’s leader Marc Reguera, and are presented by key technical and business stakeholders from Power BI customers around the world. Unlike my presentation that focused on general patterns and success factors, these presentations each tell a real-world story about a specific enterprise-scale Power BI implementation.

Why should you care about these customer stories? I think Sun Tzu summed it up best[2]:

Strategy without tactics is the slowest route to victory. Tactics without strategy is the noise before defeat.

Understanding how to use technology and features is a tactical necessity, but unless you have a strategic plan for using them, it’s very likely you won’t succeed in the long term. And just as any military leader will study past battles, today’s technical leaders can get great value from those who have gone before them.

If you’re part of your organization’s Power BI team, add these sessions to your schedule. You’ll thank me later.


[1] Just humor me here, please.

[2] I was also considering going with Julius Caesar’s famous quote about his empire-wide adoption of Power BI, “Veni, vidi, vizi,” but I think the Sun Tzu quote works a little better. Do you agree?

Is self-service business intelligence a two-edged sword?

I post about Power BI dataflows a lot, but that’s mainly because I love them. My background in data preparation and ETL, combined with dataflows’ general awesomeness, makes them a natural fit for my blog. This means that people often think of me as “the dataflows guy” even though dataflows are actually a small part of my role on the Power BI CAT team. Most of what I do at work is to help large enterprise customers successfully adopt Power BI, and to help make Power BI a better tool for their scenarios[1].

As part of my ongoing conversations with senior stakeholders from these large global companies, I’ve noticed an interesting trend emerging: customers describing self-service BI as a two-edged sword. This trend is interesting for two main reasons:

  1. It’s a work conversation involving swords
  2. Someone other than me is bringing swords into the work conversation[2]

As someone who has extensive experience with both self-service BI and with two-edged swords, I found myself thinking about these comments more and more – and the more I reflected, the more I believed this simile holds up, but not necessarily in the way you might suspect.

This week in London I delivered a new presentation for the London Power BI User Group – Lessons from the Enterprise: Managed Self-Service BI at Global Scale. In this hour-long presentation I explored the relationship between self-service BI and two-edged swords, and encouraged my audience to consider the following points[4]:

  • The two sharp edges of a sword each serve distinct and complementary purposes.
  • A competent swordsperson knows how and when to use each, and how to use them effectively in combination.
  • Having two sharp edges is only dangerous to the wielder if they are ignorant of their tool.
  • A BI tool like Power BI, which can be used for both “pro” IT-driven BI and self-service business-driven BI has the same characteristics, and to use it successfully at scale an organization needs to understand its capabilities and know how to use both “edges” effectively in combination.

As you can imagine, there’s more to it than this, so you should probably watch the session recording.

ssbi and swords

If you’re interested in the slides, please download them here: London PUG – 2019-06-03 – Lessons from the Enterprise.

If you’re interested in the videos shown during the presentation, they’re included in the PowerPoint slides, and you can view them on YouTube here.

For those who are coming to the Microsoft Business Applications Summit next week, please consider joining the CAT team’s “Enterprise business intelligence with Power BI” full-day pre-conference session on Sunday. Much of the day will be deep technical content, but we’ll be wrapping up with a revised and refined version of this content, with a focus on building a center of excellence and a culture of data in your organization.

Update 2019-06-10: The slides from the MBAS pre-conference session can be found here: PRE08 – Enterprise business intelligence with Power BI – Building a CoE.

There is also a video of the final demo where Adam Saxton joined me to illustrate how business and IT can work together to effectively respond to unexpected challenges. If you ever wondered what trust looks like in a professional[5] environment, you definitely want to watch this video.

 


[1] This may be even more exciting for me than Power BI dataflows are, but it’s not as obvious how to share this in blog-sized pieces.

[2] Without this second point, it probably wouldn’t be noteworthy. I have a tendency to bring up swords more often in work conversations than you might expect[3].

[3] And if you’ve been paying attention for very long, you’ll probably expect this to come up pretty often.

[4] Pun intended. Obviously.

[5] For a given value of “professional.”

Lego Bricks and the Spectrum of Data Enrichment and Reuse

I love Lego bricks.

I love them because they were a favorite toy from my childhood. I love them because my children love them. I love them because they foster and encourage creativity.

I love them because they serve as an excellent metaphor for the enrichment and reuse of data in the enterprise.[1]

Consider this starting point.

1

A single brick has almost unlimited potential. This brick could become the floor of a building, the chassis of a fire truck, the wing of an airplane or spaceship, or part of something that no Lego engineer had ever imagined. This potential comes with a cost – this single brick must be enriched with many other bricks before it can achieve any of these lofty goals.

Similarly, a single file, entity, or table has massive potential. The data it maintains could become part of many different applications in many different contexts. But as with the lone brick, it would need to first be enriched, to be combined with other data for that potential to be realized.

 

As each additional brick is combined with the first brick, its complexity increases, but so does its value. These more complex components are closer to the ultimate end state, so less work will be required to achieve that goal.

But at the same time, each additional brick reduces the potential for reuse. After only a few additions we’ve already ruled out creating a floor or an airplane wing. We might still create a car or truck, or some types of spaceships, but the enrichment we’ve performed is already constraining our potential.

Similarly, each time we enrich a data source to move closer to our goal, we also limit the potential scenarios for which the data can be used. If we filter the data horizontally or vertically to eliminate records or fields we don’t currently need, we are eliminating data that may be needed for some other purpose. If we merge our starter data set with additional data, we may also be adding records or fields that aren’t needed for future purposes, while increasing complexity and adversely affecting performance.
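
For readers who think of this in Power Query terms, the horizontal and vertical filtering described above might look something like this minimal M sketch; the source, table, and column names are invented for illustration.

```
let
    Source = Sql.Database("sqldw.contoso.com", "SalesDW"),
    Orders = Source{[Schema = "dbo", Item = "Orders"]}[Data],
    // Vertical filter: dropping columns we don't need today also drops
    // columns that some future solution might have needed.
    SelectedColumns = Table.SelectColumns(Orders, {"OrderID", "OrderDate", "SalesAmount"}),
    // Horizontal filter: keeping only recent rows rules out any future
    // scenario that needs the full history.
    RecentOrders = Table.SelectRows(
        SelectedColumns,
        each [OrderDate] >= #date(2018, 1, 1)
    )
in
    RecentOrders
```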

 

As we continue building, we see this pattern continue. We also see the pattern repeated on multiple levels, while contributing to the overall goal. At multiple points we will combine a small number of individual bricks to build a component, and then add that component to the main model to make it more complex, and more specialized. Each time – both for the component and for the whole – the process of enrichment adds value and complexity, and reduces the potential scope of reuse. When the final model is finished, we have exactly the model we needed[2]. The value is very high, but the opportunities for reuse are very small.

The parallel is clear: when we finish building a BI application, the value is very high, but the potential for reuse is very low. The dashboards and reports, the data model with its measures, dimensions, and KPIs, the data warehouse and data lake, and all of the upstream logic and components that make up a BI solution need to be combined in specific ways to achieve specific goals. The application as a whole is specialized for a given purpose…

…but what about the components that make up the application? Where can we find the sweet spot, the perfect balance between achieved value and potential value?

Like these Lego components:

 

When you’re building your solution using Power BI, this is where dataflows come in.

The answer will of course differ for each context, but when designing an application, it’s important to take larger and longer-term requirements into account.

Consider this diagram[3]:

New Excel Diagram

In this simple architecture, each dataflow (represented by the lighter squares) represents a stage in the overall data processing and enrichment flow. Each one adds value toward the application, and each serves as a building block that can be further reused both in this application and in other applications with overlapping data requirements.

The nature of Power BI dataflows lends itself well to solving this problem – each dataflow is a collection of reusable data entities managed by the Power BI service, easily discoverable and usable by technical and business users in BI applications. The computed entities feature in particular makes this type of scenario easy to set up and manage.
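
As a rough illustration, a computed entity is simply an entity whose query references another entity in the same dataflow; the entity and column names below are hypothetical.

```
// Hypothetical computed entity. It builds on the "Raw Orders" entity in the
// same dataflow rather than going back to the original source, so the
// upstream extraction logic is defined once and reused.
// (Computed entities require Power BI Premium.)
let
    Source = #"Raw Orders",
    SalesByCustomer = Table.Group(
        Source,
        {"CustomerKey"},
        {{"TotalSales", each List.Sum([SalesAmount]), type number}}
    )
in
    SalesByCustomer
```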

At the same time, the nature of Power BI dataflows introduces challenges for this Lego-style reuse. Dataflows in Power BI are optimized for non-IT users – business users and analysts who are typically focused on solving their immediate data challenges without relying on support from IT. These users are less likely to be focused on bigger-picture requirements like broad reuse of the entities they create.

This is where a little process and collaboration can come in, aided by the fact that dataflows are managed by the Power BI service. Power BI administrators can monitor the service to understand what dataflows are being used most frequently and most widely, and in what contexts. With this as a starting point, they can then operationalize[4] dataflows and entities created by business users, so that they are managed and maintained by IT. Since each dataflow entity is defined by the Power Query “M” code in the entity definition, this operationalization process is likely to be simpler and easier than similar processes with other technologies.
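
To picture what that hand-off involves, here is a hypothetical entity definition as a business user might author it; because the entity is nothing more than its M query, an IT team can adopt it largely as-is. All names here are made up.

```
// Hypothetical "Active Customers" entity created by an analyst.
// Operationalizing it largely means moving this query into a centrally
// managed dataflow and wrapping it with IT's naming, documentation,
// and refresh standards.
let
    Source = Sql.Database("crm-server", "CRM"),
    Customers = Source{[Schema = "dbo", Item = "Customer"]}[Data],
    Renamed = Table.RenameColumns(Customers, {{"CustFullName", "Customer Name"}}),
    ActiveOnly = Table.SelectRows(Renamed, each [IsActive] = true)
in
    ActiveOnly
```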

This approach also fits in well with how many larger enterprises implement Power BI. It is common[5] for larger organizations to use both shared capacity and dedicated Premium capacity for different purposes, and often those applications deployed to Premium capacity are those that are managed by a central IT/BI team. Since computed entities are only available when using Power BI Premium[6], this approach could lend itself well to the hand-off from business users to IT.

In any event, the next time you’re building dataflow entities, pause for a moment to think about Lego bricks, and what types of bricks or components your entities and dataflows represent. And then maybe take a break to go play with your kids.


[1] All images and examples in this post are taken from the building instructions for the Lego City Starter Set. I used to have a similar Lego set in my office that I would use to tell this story in person, but I gave it away during an office move shortly before I started this blog. The moral of the story: never get rid of your Lego sets.

[2] A fire truck!!

[3] This diagram is my reformatted version of a diagram included in the Power BI dataflows whitepaper. If you haven’t read this document, you really should.

[4] Or industrialize – which term do you use?

[5] Please note that this is not a blanket recommendation. I have the advantage of talking to many top Power BI customers around the world, so I can see this approach emerging as a common pattern, but the “right” approach in a given context will always depend on the context and the goals of the organization. I personally believe it is too early to start talking about best practices for Power BI dataflows (as I write this, dataflows have been in public preview for barely three weeks) but this is one of the areas where I am most excited to see best practices start to emerge.

[6] Even though Power BI dataflows do enable reuse in other ways that do not require Premium capacity.

Power BI Dataflows and Slow Data Sources

One advantage of data warehouses is the ability to have data from multiple data sources available for analysis without being constrained by the availability and performance characteristics of those data sources. Although Power BI dataflows are not a replacement for data warehouses in a general sense, they can provide the same value to a self-service BI solution.

I’ve been using Power Query for over five years, and one consistently recurring pattern is the need to work with slow data sources such as web pages or APIs. “Slow” might mean “takes 10 minutes to load” but it could also mean “takes 10 hours” or “takes 10 days.” In each of these cases, my overall productivity ended up being closely tied to the refresh duration for the slowest data source in my solution, and this often introduced significant delay and frustration into my life. So I looked for solutions…

In the early days of Power BI, before Power BI dataflows, I would sometimes work around this problem by defining a set of Power Query queries in an Excel workbook, configuring those queries to load data into worksheets rather than into the data model. I would then use that Excel workbook as a data source for my primary queries. Essentially I was using an Excel file as a self-service data mart, since there wasn’t any similar capability in Power BI at the time. Here’s an example of the queries in a Power BI solution that I built almost five years ago[1]:

2018-11-23_15-33-06

These queries are the foundation of a solution I built to track my personal fitness, back when my exercise routine revolved around lifting weights. Back before I knew that people fought with longswords. I haven’t looked at it in years, but the organization shown in the screen shot above sure looks familiar.

There’s one folder for the queries extracting data from each source system. These queries are configured as “connection only” which means the data isn’t loaded directly into the data model – these queries serve as data sources for other queries. One of the folders for source queries references my Excel “staging area” workbook. As implied above, this other workbook contains data from sources that are too slow to include in the refresh of the primary solution workbook, as well as sources that update very infrequently, so that refreshing the data on a regular basis would be unwarranted and inefficient.
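
For anyone who wants to picture the mechanics, a “connection only” source query reading from the staging workbook might look something like this; the file path and sheet name are placeholders, not the actual solution.

```
// Hypothetical source query that reads pre-extracted data from the staging
// workbook instead of hitting the slow source directly. Configured as
// "connection only" so it feeds other queries without loading into the
// data model itself.
let
    Source = Excel.Workbook(File.Contents("C:\Data\StagingArea.xlsx"), null, true),
    WorkoutSheet = Source{[Item = "WorkoutLog", Kind = "Sheet"]}[Data],
    Promoted = Table.PromoteHeaders(WorkoutSheet, [PromoteAllScalars = true])
in
    Promoted
```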

There’s also one folder for the queries that define the solution’s data model. These queries are loaded and represent a simple star schema that is friendly for analytics use cases.[2]

Does this pattern look familiar?

I hope it does. This is essentially the same pattern that I’ve included in almost every dataflows presentation I’ve delivered. This one:

2018-11-23_15-59-39

Dataflows make this pattern much easier to implement and maintain.

Were I to build a similar solution today[3], I would factor it in the same way. Each folder in the Excel-based solution would translate into a Power BI dataflow, and the queries in the folders would translate into entities. By having each source as its own dataflow, I can independently configure refresh schedules based on the performance and data freshness characteristics of each source. And by having the data in dataflow entities, I can easily reuse that data in multiple Power BI solutions – and enable other users in my organization to do the same[4].

If I were using Power BI Premium, I would probably build the whole solution using dataflows and computed entities, and only load the star schema into Power BI Desktop.

If I were not using Power BI Premium, I would probably build everything but the star schema using dataflows, and build the star schema using queries in Power BI Desktop that get their data from these dataflows.[5] As if the dataflows were logically a data warehouse.
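
As a sketch of that non-Premium approach, a star schema query in Power BI Desktop might pull from a dataflow entity like this; the workspace and dataflow IDs, entity name, and columns are all placeholders, following the navigation pattern the dataflows connector generates.

```
// Hypothetical dimension query built in Power BI Desktop on top of a
// dataflow entity. The GUIDs below are placeholders for real workspace
// and dataflow IDs.
let
    Source = PowerBI.Dataflows(null),
    Workspace = Source{[workspaceId = "00000000-0000-0000-0000-000000000001"]}[Data],
    Dataflow = Workspace{[dataflowId = "00000000-0000-0000-0000-000000000002"]}[Data],
    Exercises = Dataflow{[entity = "Exercises"]}[Data],
    DimExercise = Table.Distinct(
        Table.SelectColumns(Exercises, {"ExerciseID", "ExerciseName", "MuscleGroup"})
    )
in
    DimExercise
```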

Which is sort of where we came in…[6]

 


[1] The astute reader will realize that this is back before Power BI existed as we know it today. It feels strange to share the details of such an old solution, but this goes to show that although new tools are always being introduced, old problem patterns never seem to go away.

[2] In September 2014 I did a presentation about this solution for the PASS Data Warehousing and Business Intelligence Virtual Chapter, and the recording is still online on YouTube. I have not watched it, and don’t plan on watching it, and you should be warned before you watch it, because if I remember correctly it includes shirtless progress pictures of me.

[3] I will not do this, but only because I don’t have similarly rich sources of swordfighting-related data.

[4] This sounds like I’m implying that I have a group of analysts working on my personal fitness, doesn’t it? That’s just weird, even to me.

[5] This is another “reuse without Premium” scenario, independent of the one introduced in my earlier Power BI Dataflows – Reuse without Premium post.

[6] Imagine my disappointment when this screen shot just didn’t fit into this post. If only there was some way I could find a reason to include it somewhere…

2018-11-23_15-57-32.png