Building a data culture

tl;dr – to kick off 2020 we’re starting a new BI Polar video series focusing on building a data culture, and the first video introduces the series. You should watch it and share it.

Succeeding with a tool like Power BI is easy – self-service BI tools let more users do more things with data more easily, and can help reduce the reporting burden on IT teams.

Succeeding at scale with a tool like Power BI is not easy. It’s very difficult, not because of the technology, but because of the context in which the technology is used. Organizations adopt self-service BI tools because their existing approaches to working with data are no longer successful – and because the cost and pain[1] of change has become outweighed by the cost and pain of maintaining course.

Tool adoption may be top-down, encouraged or mandated by senior management as a broad organization-wide effort. Adoption may be bottom-up, growing organically and virally in the teams and departments least well served by the existing tools and processes in place.

Both of these approaches[2] can be successful, and both of these approaches can fail. The most important success factor is a data culture in which the proper use of self-service BI tools can deliver the greatest value for the organization.

The most important success factor is a data culture

books-1655783_640
There must be a data culture on the other side of this door.

Without an organizational culture that values, encourages, recognizes, and rewards users and teams for their use of data, no tool and no amount of effort and skill is enough to achieve the full potential of the tools – or of the data.

In this new video series we’ll be covering practices that will help build a data culture. More specifically, we’ll introduce common practices that are exhibited by large organizations that have mature and successful data cultures. Each culture is unique, but there are enough commonalities to identify patterns and anti-patterns.

The content in this series will be informed by my work with enterprise Power BI customers as part of my day job[3], and will complement nicely[4] the content and guidance in the Power BI Adoption Framework.

Back in November when the 100th BI Polar blog post was published, I asked what everyone wanted to read about in the next 100 posts. There were lots of different ideas and suggestions, but the most common theme was around guidance like this. Hopefully you’ll enjoy the result – and hopefully you’ll let me know either way.


[1] I strongly believe that pain is a fundamental precursor to significant change. If there is no pain, there is no motivation to change. Only when the pain of not changing exceeds the perceived pain of going through the change will most people and organizations consider giving up the status quo. There are occasional exceptions, but in my experience these are very rare.

[2] Including any number of variations – these approaches are common points on a wide spectrum, but should not be interpreted as the only ways to adopt Power BI or other self-service BI tools.

[3] By day I’m a masked crime-fighter. Or a member of the Power BI customer advisory team. Or both. It varies from day to day.

[4] Hopefully this will be true. I’m at least as interested in seeing where this ends up as you are.

Video: A most delicious analogy

Every time I cook or bake something, I think about how the tasks and patterns present in making food have strong and significant parallels with building BI[1] solutions. At some point in the future I’m likely to write a “data mis en place” blog post, but for today I decided to take a more visual approach, starting with one of my favorite holiday recipes[2].

Check it out:

(Please forgive my clickbaitey title and thumbnail image. I was struggling to think of a meaningful title and image, and decided to have a little fun with this one.)

I won’t repeat all of the information from the video here, but I will share a view of what’s involved in making this self-service BI treat.

2019-12-17-12-52-36-894--VISIO

When visualized like this, the parallels between data development and reuse are probably a bit more obvious. Please take a look at the video, and see what others jump out at you.

And please let me know what you think. Seriously.


[1] And other types of software, but mainly BI these days.

[2] I published this recipe almost exactly a year ago. The timing isn’t intentional, but it’s interesting to me to see this pattern emerging as well…

The Power BI Adoption Framework – it’s Power BI AF

You may have seen things that make you say “that’s Power BI AF” but none of them have come close to this. It’s literally the Power BI AF[1].

That’s right – this week Microsoft published the Power BI Adoption Framework on GitHub and YouTube. If you’re impatient, here’s the first video – you can jump right in. It serves as an introduction to the framework, its content, and its goals.

Without attempting to summarize the entire framework, this content provides a set of guidance, practices, and resources to help organizations build a data culture, establish a Power BI center of excellence, and manage Power BI at any scale.

Even though I blog a lot about Power BI dataflows, most of my job involves working with enterprise Power BI customers – global organizations with thousands of users across the business who are building, deploying, and consuming BI solutions built using Power BI.

Each of these large customers takes their own approach to adopting Power BI, at least when it comes to the details. But with very few exceptions[2], each successful customer will align with the patterns and practices presented in the Power BI Adoption Framework – and when I work with a customer that is struggling with their global Power BI rollout, their challenges are often rooted in a failure to adopt these practices.

There’s no single “right way” to be successful with Power BI, so don’t expect a silver bullet. Instead, the Power BI Adoption Framework presents a set of roles, responsibilities, and behaviors that have been developed after working with customers in real-world Power BI deployments.

If you look on GitHub today, you’ll find a set of PowerPoint decks broken down into five topics, plus a few templates.

2019-12-12-12-09-19-166--msedge

These slide decks are still a little rough. They were originally built for use by partners who could customize and deliver them as training content for their customers[3], rather than for direct use by the general public, and as of today they’re still a work in progress. But if you can get past the rough edges, there’s definitely gold to be found. This is the same content I used when I put together my “Is self-service business intelligence a two-edged sword?” presentation earlier this year, and for the most part I just tweaked the slide template and added a bunch of sword pictures.

And if the slides aren’t quite ready for you today, you can head over to the official Power BI YouTube channel where this growing playlist contains bite-size training content to supplement the slides. As of today there are two videos published – expect much more to come in the days and weeks ahead.

The real heroes of this story[4] are Manu Kanwarpal and Paul Henwood.  They’re both cloud solution architects working for Microsoft in the UK. They’ve put the Power BI AF together, delivered its content to partners around the world, and are now working to make it available to everyone.

What do you think?

To me, this is one of the biggest announcements of the year, but I really want to hear from you after you’ve checked out the Power BI AF. What questions are still unanswered? What does the AF not do today that you want or need it to do tomorrow?

Please let me know in the comments below – this is just a starting point, and there’s a lot that we can do with it from here…


[1] If you had any idea how long I’ve been waiting to make this joke…

[2] I can’t think of a single exception at the moment, but I’m sure there must be one or two. Maybe.

[3] Partners can still do this, of course.

[4] Other than you, of course. You’re always a hero too – never stop doing what you do.

Using and reusing Power BI dataflows

I use this diagram a lot[1]:

excel white

This diagram neatly summarizes a canonical use case for Power BI dataflows, with source data being ingested and processed as part of an end-to-end BI application. It showcases the Lego-like composition that’s possible with dataflows. But it also has drawbacks – its simplicity omits common scenarios for using and reusing dataflows.

So, let’s look at what’s shown – and at what’s not shown – in my favorite diagram. Let’s look at some of the ways these dataflows and their entities can be used.

  1. Use the final entities as-is: This is the scenario implied by the diagram. The entities in the “Final Business View” dataflow represent a star schema, and are loaded as-is into a dataset.
  2. Use the final entities with modification: The entities in the “Final Business View” dataflow are loaded into a dataset, but with additional transformation or filtering applied in the dataset’s queries.
  3. Use the final entities with mashup: The entities in the “Final Business View” dataflow are loaded into a dataset, but with additional data from other sources added via the dataset’s queries.
  4. Use upstream entities: The entities in other dataflows are loaded into a dataset, likely with transformations and filtering applied, and with data from other sources added via the dataset’s queries.

Please understand that this list is not exhaustive. There are likely dozens of variations on these themes that I have not called out explicitly. Use this list as a starting point and see where dataflows will take you. I’ll keep the diagram simple, but you can build solutions as complex as you need them to be.


[1] This is my diagram. There are many like it, but this one is mine.

 

Quick Tip: Factoring your dataflow entities

This post started as a response to this question from Mark, who was commenting on last week’s data lineage post:

How would you decide how big or how small to make each artifact in the lineage, in terms of the amount of transformations taking place inside the artifact? In my case they would only be shared with 2-3 other users.

For instance I could go all out and have every step that would previously take place in a query editor result in a new link in the data lineage chain, but that would probably be overkill.

I agree that “one step per dataflow” would be overkill, but beyond that the answer is largely “it depends.”

Image by Esi Grünhagen from Pixabay

The approach I generally take is to break the end to end data preparation down into blocks that look like this:

  1. Staging – getting the source data into the system (in this case dataflow, but could be data mart, data warehouse, data lake, etc.) with zero or minimal transformations
  2. Cleansing – correcting known data quality and format problems from the staged data
  3. Transformation 1 – getting the cleansed data into the shape required for intended downstream purposes
  4. Enrichment – adding data from other sources, which have ideally already gone through steps 1 through 3
  5. Transformation 2 – getting the cleansed and enriched data into the shape required for analysis, typically as dimensions and facts

The final step may also be performed in the queries that are used to create the final tabular model when creating a dataset in Power BI Desktop. If a given dimension is likely to be used in multiple datasets, implement it as a dataflow entity. If it isn’t, implement it as a table in your dataset.

These guidelines tend to create a moderate number of easily maintainable entities, but they’re obviously the bare minimum – take what works for you, and discard the rest.

I feel like I’m dating myself with this link[1], but I definitely recommend looking at the Kimball Group’s techniques for data warehousing and BI: resources link. Ralph Kimball and his amazing team know more about this stuff than I will ever forget (or something like that) and there’s a huge volume of guidance available. Do yourself a favor and check it out.


[1] I assume there are newer resources out there, but when I was your age it was the Kimball Method or the… synonym for highway that rhymes with method.

Are you building a BI house of cards?

Every few weeks I see someone asking about using Analysis Services as a data source for Power BI dataflows. Every time I hear this, I cringe, and then include advice something like this[1] in my response.

Using Analysis Services as a data source is an anti-pattern – a worst practice. It is not recommended, and any solution built using this pattern is likely to produce dissatisfied customers. Please strongly consider using other data sources, likely the data sources on which the AS model is built.

 

There are multiple reasons for this advice.

 

Some reasons are technical. Extraction of large volumes of data is not what an Analysis Services model is designed for. Performance for the ETL process is likely to be poor, and you’re likely end up with memory/caching issues on the Analysis Services server. Beyond this, AS models typically don’t include the IDs/surrogate keys that you need for data warehousing, so joining the AS data to other data sources will be problematic.[2]

 

For some specific examples and technical deep dives into how and why this is a bad idea, check out this excellent blog post from Shabnam Watson. The focus of the post is on SSAS memory settings, but it’s very applicable to the current discussion.

 

Some reasons for this advice are less technical, but no less important. Using analytics models as data sources for ETL processing are a strong code smell[3] (“any characteristic in the source code of a program that possibly indicates a deeper problem”) for business intelligence solutions.

 

Let’s look at a simple and familiar diagram:

 

01 good

 

There’s a reason this left-to-right flow is the standard representation of BI applications: it’s what works. Each component has specific roles and responsibilities that complement each other, and which are aligned with the technology used to implement the component. This diagram includes a set of logical “tiers” or “layers” that are common in analytics systems, and which mutually support each other to achieve the systems’ goals.
Although there are many successful variations on this theme, they all tend to have this general flow and these general layers. Consider this one, for example:

 

02 ok

This example has more complexity, but also has the same end-to-end flow as the simple one. This is pretty typical for  scenarios where a single data warehouse and analytics model won’t fulfill all requirements, so the individual data warehouses, data marts, and analytics models each contain a portion – often an overlapping portion – of the analytics data.

Let’s look at one more:

03 - trending badly

This design is starting to smell. The increased complexity and blurring of responsibilities will produce difficulties in data freshness and maintenance. The additional dependencies, and the redundant and overlapping nature of the dependencies means that any future changes will require additional investigation and care to ensure that there are no unintended side effects to the existing functionality.

As an aside, my decades of working in data and analytics suggest that this care will rarely actually be taken. Instead, this architecture will be fragile and prone to problems, and the teams that built it will not be the teams who solve those problems.

And then we have this one[4]:

04 - hard no

This is what you get when you use Analysis Services as the data source for ETL processing, whether that ETL and downstream storage is implemented in Power BI dataflows or different technologies. And this is probably the best case you’re likely to get when you go down this path. Even with just two data warehouses and two analytics models in the diagram, the complex and unnatural dependencies are obvious, and are painful to consider.

What would be better here?[5] As mentioned at the top of the post, the logical alternative is to avoid using the analytics model and to instead use the same sources that the analytics model already uses. This may require some refactoring to ensure that the duplication of logic is minimized. It may require some political or cross-team effort to get buy-in from the owners of the upstream systems. It may not be simple, or easy. But it is almost always the right thing to do.

Don’t take shortcuts to save days or weeks today that will cause you or your successors months or years to undo and repair. Don’t build a house of cards, because with each new card you add, the house is more and more likely to fall.

Update: The post above focused mainly on technical aspects of the anti-pattern, and suggests alternative recommended patterns to follow instead. It does not focus on the reasons why so many projects are pushed into the anti-pattern in the first place. Those reasons are almost always based on human – not technical – factors.

You should read this post next: http://workingwithdevs.com/its-always-a-people-problem/. It presents a delightful and succinct approach to deal with the root causes, and will put the post you just read in a different context.


[1] Something a lot like this. I copied this from a response I sent a few days ago.

[2] Many thanks to Chris Webb for some of the information I’ve paraphrased here. If you want to hear more from Chris on this subject, check out this session recording from PASS Summit 2017. The whole session is excellent; the information most relevant to this subject begins around the 26 minute mark in the recording. Chris also gets credit for pointing me to Shabnam Watson’s blog.

[3] I learned about code smells last year when I attended a session by Felienne Hermans at Craft Conference in Budapest. You can watch the session here. And you really should, because it’s really good.

[4] My eyes are itching just looking at it. It took an effort of will to create this diagram, much less share it.

[5] Yes, just about anything would be better.

Quick Tip: Creating “data workspaces” for dataflows and shared datasets

Power BI is constantly evolving – there’s a new version of Power BI Desktop every month, and the Power BI service is updated every week. Many of the new capabilities in Power BI represent gradual refinements, but some are significant enough to make you rethink how you your organization uses Power BI.

Power BI dataflows and the new shared and certified datasets[1] fall into the latter category. Both of these capabilities enable sharing data across workspace boundaries. When building a data model in Power BI Desktop you can connect to entities from dataflows in multiple workspaces, and publish the dataset you create into a different workspace altogether. With shared datasets you can create reports and dashboards in one workspace using a dataset in another[2].

The ability to have a single data resource – dataflow or dataset – shared across workspaces is a significant change in how the Power BI service has traditionally worked. Before these new capabilities, each workspace was largely self-contained. Dashboards could only get data from a dataset in the same workspace, and the tables in the dataset each contained the queries that extracted, transformed, and loaded their data. This workspace-centric design encouraged[3] approaches where assets were grouped into workspaces because of the platform, and not because it was the best way to meet the business requirements.

Now that we’re no longer bound by these constraints, it’s time to start thinking about having workspaces in Power BI whose function is to contain data artifacts (dataflows and/or datasets) that are used by visualization artifacts (dashboards and reports) in other workspaces. It’s time to start thinking about approaches that may look something like this:

data centric workspaces

Please keep in mind these two things when looking at the diagram:

  1. This is an arbitrary collection of boxes and arrows that illustrate a concept, and not a reference architecture.
  2. I do not have any formal art training.

Partitioning workspaces in this way encourages reuse and can reduce redundancy. It can also help enable greater separation of duties during development and maintenance of Power BI solutions. If you have one team that is responsible for making data available, and another team that is responsible for visualizing and presenting that data to solve business problems[4], this approach can given each team a natural space for its work. Work space. Workspace. Yeah.

Many of the large enterprise customers I work with are already evaluating or adopting this approach. Like any big change it’s safer to approach this effort incrementally. The customers I’ve spoken to are planning to apply this pattern to new solutions before they think about retrofitting any existing solutions.

Once you’ve had a chance to see how these new capabilities can change how your teams work with Power BI, I’d love to hear what you think.

Edit 2019-06-26: Adam Saxton from Guy In A Cube has published a video on Shared and Certified datasets. If you want another perspective on how this works, you should watch it.


[1] Currently in preview: blog | docs.

[2] If you’re wondering how these capabilities for data reuse relate to each other, you may want to check out this older post, as the post you’re currently reading won’t go into this topic: Lego Bricks and the Spectrum of Data Enrichment and Reuse.

[3] And in some cases, required.

[4] If you don’t, you probably want to think about it. This isn’t the only pattern for successful adoption of Power BI at scale, but it is a very common and tested pattern.