Quick Tip: Power Query column by example

It seems like everyone knows about how Power Query in Power BI Desktop lets you use web pages as data sources, and lets you build queries by entering values from the web page. This is probably because so many people use the “from web page by example” feature for demos[1].

It also seems like no one knows about the “column from examples” feature in Power Query, even though this is just as exciting, and much more useful. They’re both built on the same underlying intelligence, but this one lets you work with data from any source.

Here’s the quick overview:

  1. In the Power Query editor in Power BI Desktop, choose “Column from Examples” from the “Add Column” tab.
  2. Enter the values that the new column should have for rows that are already in your data set.
  3. Review the values that Power Query is suggesting for the other rows, and when they are all correct, choose OK, and then say “Ooooooohhhhh” when Power Query does all the work for you.

Like this[2]:

column by example

In this example, I want to extract an ID value from a URL column where the URLs all follow the same pattern. I’ve copied the ID value from the first row, and when I paste it into the new column, Power Query immediately generates the M code to extract the ID and adds a new step to the query.

= Table.AddColumn(#"Renamed Columns", "Text After Delimiter", each Text.AfterDelimiter([Band Page URL], "/", 4), type text)

This code isn’t too complex. You could have written this code. I could have written it, but odds are it would have taken me a few minutes to look up the syntax for the Text.AfterDelimiter function, and to iterate a few times before getting it right. Using this awesome feature, I get the same code in seconds. For more complex scenarios you may need to enter more than one desired value before Power Query has enough information to get the function just right.

The best part is that the step that is added is just another query step. You can delete it, you can edit it, you can add more steps after it. This is just a fast and simple way to add the step without needing to write any code or even know what functions to use.
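For context, here’s a minimal sketch of what the full query might look like in the Advanced Editor once the generated step is in place. The sample rows, the URL pattern, and the pass-through “Renamed Columns” step are all hypothetical stand-ins for my real query:

let
    // Hypothetical sample rows standing in for the real data source
    Source = Table.FromRecords({
        [#"Band Page URL" = "https://www.example.com/metal/bands/1234"],
        [#"Band Page URL" = "https://www.example.com/metal/bands/5678"]
    }),
    // Placeholder for the renaming step that precedes the generated one
    #"Renamed Columns" = Source,
    // The step generated by Column from Examples
    #"Text After Delimiter" = Table.AddColumn(#"Renamed Columns", "Text After Delimiter",
        each Text.AfterDelimiter([Band Page URL], "/", 4), type text)
in
    #"Text After Delimiter"

With these made-up URLs, Text.AfterDelimiter returns everything after the fifth “/” – the values 1234 and 5678 – which is the same logic the generated step applies to the real data.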

There have been a few occasions where Power Query couldn’t quite figure out what I needed it to do, but it hasn’t happened often, or lately. This may be because I’ve been getting better, but it’s probably because the Power Query team has continued to improve the feature since it was introduced in April 2017. That’s right – this feature has been available for over two years, and you’re just learning about it now.

Or are you[3]?

Either way, check it out the next time you’re working in Power Query, and let me know what you think.


[1] To be fair, it does make for an awesome demo.

[2] There also used to be a GIF here, but apparently my GIF-making skills need some refinement before I start using GIFs in blog posts.

[3] When I demo Power Query in Power BI Desktop to customers, maybe 10% have already seen this feature. I suspect that the high-quality demographic that this blog is certain to attract will be more aware and more engaged than the general population, so I’m genuinely curious to know if you already knew about this gem.

Power BI dataflows enhanced compute engine

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

This month the Power BI dataflows team announced an exciting new capability available in preview for dataflows in workspaces backed by Power BI Premium dedicated capacity[1]. This compute engine builds on the Excel-like capabilities delivered through linked and computed entities to dramatically increase refresh performance for dataflow entities that get their data from other dataflow entities.

This post won’t go deep into the technical details – I’ll defer to the folks who built the feature to share that – but it will share important highlights to answer the most common questions I’ve seen since the preview was announced.

Let’s begin with a quick review. Consider this typical scenario:

dataflow dependency diagram

  1. Data is loaded into entities in the “Ingest from Dynamics Sales” and “Product Telemetry in Azure” dataflows from data sources external to Power BI.
  2. The “Clean and enrich sales data” and “Add Telemetry Customer Attributes” dataflows reference these dataflows as their sources, and the “Final Business View” dataflow references them in turn.
  3. The first two dataflows are configured for scheduled refresh. The other three dataflows will automatically refresh[2] to keep the data in their entities up to date.

But what is actually happening in this scenario?

The boxes and arrows in this type of diagram could represent any number of technologies, but for Power BI dataflows the key technology is the Mashup Engine. The Mashup Engine is the runtime engine behind Power Query – it takes the Power Query “M” script and performs the extract, transform, and load (ETL) processes that the script defines. The Mashup Engine runs in Excel, in Power BI Desktop, in the Power BI on-premises data gateway, and in the Power BI service, and it is a cornerstone for data connectivity in Power BI.

A key characteristic of Power Query and the Mashup Engine is query folding[3]. Query folding is a process by which the Mashup Engine offloads the heavy lifting of the data transformation to the data source, to the extent that the data source’s capabilities allow. For example, imagine a Power Query user builds a query that extracts three columns from one table in a SQL Server database, joins them with four columns from another table in the same database, and then groups the join results by one of the columns. The Mashup Engine will automatically send the appropriate JOIN and GROUP BY statements to the server, and the server will return the results. If the same query is executed against two folders full of CSV files[4], there is no “server” to perform the required join and grouping operations, so the Mashup Engine performs them in memory.
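To make that example concrete, here’s a minimal M sketch of the kind of query described above, using hypothetical server, database, table, and column names:

let
    // Hypothetical SQL Server source
    Source = Sql.Database("sql.example.com", "SalesDb"),
    // Three columns from one table
    Orders = Table.SelectColumns(
        Source{[Schema = "dbo", Item = "Orders"]}[Data],
        {"OrderId", "CustomerId", "Amount"}),
    // Four columns from another table in the same database
    Customers = Table.SelectColumns(
        Source{[Schema = "dbo", Item = "Customers"]}[Data],
        {"CustomerId", "Name", "Region", "Segment"}),
    // Join the two tables on CustomerId
    Joined = Table.NestedJoin(Orders, {"CustomerId"}, Customers, {"CustomerId"}, "Customer", JoinKind.Inner),
    Expanded = Table.ExpandTableColumn(Joined, "Customer", {"Name", "Region", "Segment"}),
    // Group the join results by one of the columns
    Grouped = Table.Group(Expanded, {"Region"}, {{"Total Amount", each List.Sum([Amount]), type number}})
in
    Grouped

Against SQL Server, the Mashup Engine can typically fold all of these steps into a single SELECT statement with the appropriate JOIN and GROUP BY clauses.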

To summarize this example, the query is largely the same, and the Mashup Engine uses query folding to choose the best location[5] for the required compute operations to be performed.

This brings us back to dataflows, CDM folders, and why the new advanced compute engine is exciting. When a Power BI dataflow is refreshed, the data produced by the Mashup Engine’s execution of the dataflow entities’ queries is stored in Azure in the CDM folder format, which uses CSV files in folders for the storage. This means that when one entity references another entity, it is accessing CSV data and is using the Mashup Engine for its processing[6]. This can result in slow refresh performance for some linked entity scenarios when the data volume or transformations can’t be optimized by the Mashup Engine.

The new advanced compute engine addresses this problem by loading the CDM folder data into an Azure SQL DB-based cache, and using query folding against this cache. This can produce significant performance improvements for the downstream refresh of computed entities. If we consider again the scenario shown above, the new advanced compute engine will change the underlying processing without requiring any change to the dataflows or entities themselves.

  1. When data is loaded into entities in the dataflows, the data is also loaded into the SQL-based cache.
  2. This may increase the refresh times for the “Ingest from Dynamics Sales” and “Product Telemetry in Azure” dataflows, because the caching is an additional operation not otherwise required.
  3. When the “Clean and enrich sales data” and “Add Telemetry Customer Attributes” and “Final Business View” dataflows refresh, the Mashup Engine can use query folding against the SQL-based cache. This can result in dramatic improvements in performance for the refresh of computed entities.

If you want to know more, you should watch this session from the Microsoft Business Applications Summit and read this blog post after you watch the session. Both of these excellent resources are from the architect responsible for much of the technical underpinnings of Power BI, and both go into more technical detail than I’ve gone into here.

As mentioned above, this post is designed to address some questions I’ve heard since this new preview capability was announced. Please let me know if this post is helpful, or if you have any questions of your own that aren’t answered here.


[1] Yes – this is a Premium-only feature.

[2] In case you didn’t click through already, this is why we call this behavior “Excel-like.” The Power BI service maintains the dependencies between dataflow entities like Excel maintains dependencies between cells and ranges.

[3] If you want to know more about query folding, check out this awesome post from MVP Matt Allington.

[4] This is one of the aspects of Power Query that are most exciting for me, as only the “source” function in the query would need to be changed. This data source agnostic nature of Power Query is enabled by query folding.

[5] Or locations, because a single query can combine data from multiple data sources.

[6] This is also why dataflows are an import-only data source in Power BI Desktop.

 

Ignorance as a finite resource (or “Data visualization resources”)

I’m not a data visualization professional, nor do I play one on TV. Even though I’ve been working with data professionally for over 20 years[1], I’ve generally been on the back end of systems, working on data platforms and not data visualization. I’ve been known to say on occasion[2] that I’ll make your data sing, but you probably don’t want me responsible for making it pretty. If I’m your report designer, your reports will take five times as long to build, and they will suck ten times as much.

Before I joined Microsoft, I worked for many years as a consultant. When working with potential clients, I would often present my ignorance as an asset. This may sound strange, but think about the truth in this simple pitch: “You and your team know too much. There are some questions you will never think to ask, because they’re just too obvious. One of the strengths I will bring to this engagement is my ignorance – I will ask those questions and together we will find answers and build solutions.”

Sometimes ignorance doesn’t feel like an asset – sometimes it just gets in the way. In the past few months this has been the case as I’ve been trying to build Power BI reports for some of the data I work with day to day. To be more specific, I’ve been trying to build reports that don’t suck. I already have functional reports, but I believe that there are insights waiting in the data that could be discovered and shared more easily if I had stronger data visualization skills. I read documentation, followed examples, and reached out to multiple friends and colleagues for advice… and often found that not only did I not have the knowledge to effectively visualize my data, I often struggled to effectively communicate my goals. I was both mega-ignorant and meta-ignorant.

Which brings me to the point of this post: I’m hoping to use this monumental ignorance as an asset.

Earlier this week I reached out on Twitter asking for help. I wanted to find resources that could help me build my data visualization skills, and to build my vocabulary of terms and concepts so I could ask better questions. I wanted to address both the mega-ignorance and the meta-ignorance. So I asked both Alberto Cairo[3] and Alyssa Fowers[4] by name, and asked for help in general, and boy did I get it.

I learned one immediate lesson from the replies to this post: I did not do an effective job in communicating my goals. Previously I had struggled when asking for tactical data visualization help; here I was struggling when asking for strategic help, and most of the responses weren’t even close to what I was asking for[5]. This was frustrating, but it validated my understanding of the problem, and reinforced my belief that I need a deeper understanding of the concepts and nomenclature of data visualization if I’m going to improve.

Of the many resources that were shared with me, this one looks the most promising: Andy Kirk’s “Data Visualization: A Handbook for Data Driven Design.” Based on the thoughtful description[6] and glowing reviews, I think this is where I will begin. I’ve pre-ordered the second edition and when it arrives next month I’ll begin my journey…

book

…which is sort of where I was going with this whole post[7]. Ignorance as a finite resource, remember?

I’ve long observed that when one person is struggling with something, he’s typically not alone. When one person asks a question, a dozen other people breathe a silent sigh of relief, because they had the same question but didn’t dare ask it. Working under the assumption that I’m not alone in this starting point, I’m hoping to use my current state of ignorance, and the upcoming process of destroying that ignorance, to start a “data visualization as a second language” series of posts[8].

If you’re looking for excellent tools that you can use today to build on your existing foundation of data visualization knowledge and skills, click on the Twitter link above because there are over a dozen web sites, books, and other resources – and I’m too lazy to copy them all here. But if you’re interested in joining me on my learning journey, stay tuned until August.

Do you think this type of content has an audience? Are you that audience? I’d love to hear from you.


[1] How young is too young to play the “old man” card so consistently? Asking for a friend.

[2] Typically these are occasions when someone is looking at one of my ugly reports, and I’m feeling self-conscious.

[3] You may know Mr. Cairo as a leading expert and author on data visualization, as the Knight Chair at the University of Miami, or as the guy who is too busy to replace the lorem ipsum in his bio page, but I will always think of him as the other person who regularly presents on the intersection of data and heavy metal. \m/

[4] If you don’t follow Ms. Fowers’ “Data and Dragons” blog you definitely should, because she already has two parts in her “Sword Graphs” series.

[5] Yes, I know this is how Twitter works. I’m trying to be generous here – work with me.

[6] Click on the “Who is this book for” link and tell me that you don’t wish every technical book had this information so readily available.

[7] Brevity will never be my forte.

[8] I was originally thinking of “Data Viz 101” but that sounded too advanced, and if I went with the second idea of  “Data Viz 001”  I know I would forever be making “James Bond’s analyst colleague” jokes. DVSL wasn’t my first choice, but I think I like it.

Status Check: Power BI dataflows and ADLSg2

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

In the last few weeks I’ve seen a spike in questions related to the integration of Power BI dataflows and Azure Data Lake Storage Gen2. Here’s a quick “status check” on the current state of this feature to get the answers out for as many people as possible.

  • Power BI dataflows are generally available (GA) for capabilities that use the built-in Power BI-managed storage.
  • Power BI dataflows integration with Azure is currently in preview – this includes the “Bring your own storage account” capabilities where you can configure Power BI to use your ADLSg2 storage account for dataflows storage, instead of using the built-in Power BI-managed storage.
  • During preview, there are several known limitations[1]:
    • Only a single ADLSg2 storage account can be configured for a Power BI tenant.
    • The storage account, once configured, cannot be changed.
    • The setup process to connect Power BI with ADLSg2 is somewhat lengthy and step-intensive.
    • To grant users other than the owner of the dataflow access to the dataflow in Power BI, you must grant them permissions to access the workspace in Power BI and grant them access to the CDM folder in ADLSg2.

These limitations will be addressed when this capability hits GA, but you should definitely be aware of them in the meantime. (You may also want to take a look at this MBAS session for an overview of the roadmap for the rest of this calendar year.)

I’ve seen customers take different approaches:

  1. Some customers delay their integration of Power BI and ADLSg2, and are waiting for these limitations to be removed before they move forward.
  2. Some customers adopt within the constraints of the preview, and choose a workload or set of workloads where the current limitations are acceptable.
  3. Some customers set up a demo tenant of Power BI and use it to test and validate the ADLSg2 integration while deciding on option 1 or option 2.

I hope this helps. If you or your customers have any questions on this topic that aren’t answered here, please let me know!


[1] And they’re all documented. Nothing in this blog post is new, but hopefully it will help to have this summary online and sharable.

Writing effective problem reports

If you build software or data solutions[1] you have probably encountered one or both of these situations:

  1. You’re trying to report a bug, but the developer doesn’t believe that there’s a problem.
  2. Someone is trying to report a bug to you, and you can’t tell what the problem is supposed to be.

The problem report has itself become a problem[2].

Fortunately, there’s a simple approach, and simple template, that can make reporting problems easier. This is the template I typically use when I’m reporting a problem and asking someone else to fix it:

Problem: Concise description of problem behavior

Steps to reproduce problem:

  1. First step
  2. Second step
  3. Third and subsequent steps, as necessary

Desired or expected behavior:

4. What I wanted to happen

Observed behavior:

4. What actually happened, including the full details of any error messages

That’s it – simple and easy.
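Here’s what a filled-in report might look like. The application, steps, and error message are entirely made up for illustration:

Problem: Saving a document fails when the file name is longer than 100 characters

Steps to reproduce problem:

  1. Open ContosoEditor (a hypothetical app) and create a new document
  2. Enter any text
  3. Choose Save As and enter a file name longer than 100 characters

Desired or expected behavior:

4. The document is saved with the specified name

Observed behavior:

4. An error dialog appears with the message “The file could not be saved” and no file is created

Even for a trivial example like this, the person receiving the report can reproduce the problem without asking a single follow-up question – which is the whole point.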

It can also often be helpful to include screen shots, recordings, or other visual resources to supplement the text descriptions. If you use TechSmith Camtasia (commercial, paid) or ShareX (open source, free) or other screen recording software, it can be trivial to record and attach a video – but remember that the video does not replace the written problem report, it supplements it.

I should mention that if you’re a software developer working on a software development team, you probably have a heavier-weight process already[3]. Follow that process. This approach is intended for more casual problem reporting – the sort of thing that you might send to someone in an email asking for help. The sort of email where if you don’t communicate clearly and effectively, the recipient ends up spending more time asking for information than he spends actually answering the question or solving the problem.

Yes, this is why I wrote this post. I hope I never need to link to it…


[1] Or work in a technical field, or use software…

[2] Because I never metaproblem I didn’t like. Yes, this sounded funnier in my head.

[3] If you search for “how to write good bugs” you’ll find a huge number of excellent resources that go into much more depth than this post.

Power BI dataflows – New features

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

The Power BI dataflows team has just posted a blog update on new dataflows capabilities added this month[1]. You should check it out here.

Once you’re done reading the blog post and are asking “where can I learn more about that new compute engine?” you should head over to this post and watch the Microsoft Power BI: Enterprise-grade BI with Power BI dataflows session recording from this month’s Microsoft Business Applications Summit.

Go. Go now!


[1] And thank goodness they did, because I didn’t know what I was going to blog about today!

Managing email and work-life balance

I’ll probably never be the most consistent blogger, but WordPress recently made me aware of something: I only blog regularly when I’m taking time off from work.

Streak

I’m back in the office today after a week of part-time work from home[1], and I realized that without making a conscious decision to do so I ended up blogging every day that I was away from the office.

This insight got me thinking. Specifically, it got me thinking about how I spend my work days, and about how in recent months[2] I’ve been letting my inbox push me around. Although playing a defensive game can work in some contexts[3], I believe I need to adopt a more aggressive posture in this fight.

Starting today I’m trying this approach to Inbox Zero from MVP Luise Freese. I’m hoping that by managing my email more proactively and strategically I can not only be more productive at work, but also have more mental energy and time remaining for blogging.

My teammate Adam Saxton is doing the same thing; he’s a few days ahead of me and is pleased with his progress so far. I’ll check in with him – and with you – next week to see how things are going. Now back to Outlook; I have more items marked for follow-up today…


[1] Part time work from home and full-time feeling old: My older son graduated from high school last week. Where did the years go?

[2] Years.

[3] Please don’t tell Johannes Liechtenauer I said this.

Quick Tip: Creating “data workspaces” for dataflows and shared datasets

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

Power BI is constantly evolving – there’s a new version of Power BI Desktop every month, and the Power BI service is updated every week. Many of the new capabilities in Power BI represent gradual refinements, but some are significant enough to make you rethink how you and your organization use Power BI.

Power BI dataflows and the new shared and certified datasets[1] fall into the latter category. Both of these capabilities enable sharing data across workspace boundaries. When building a data model in Power BI Desktop you can connect to entities from dataflows in multiple workspaces, and publish the dataset you create into a different workspace altogether. With shared datasets you can create reports and dashboards in one workspace using a dataset in another[2].
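If you haven’t seen what this looks like in M, here’s a minimal sketch of a query that reads an entity from a dataflow in another workspace using the Power BI dataflows connector. The workspace IDs and the entity name are hypothetical placeholders; Power BI Desktop generates this navigation for you, and the exact navigation fields it emits may differ slightly, but the shape is the same:

let
    // Dataflows the current user has access to, across workspaces
    Source = PowerBI.Dataflows(null),
    // Placeholder GUIDs – Desktop fills in the real workspace and dataflow IDs for you
    Workspace = Source{[workspaceId = "00000000-0000-0000-0000-000000000000"]}[Data],
    Dataflow = Workspace{[dataflowId = "11111111-1111-1111-1111-111111111111"]}[Data],
    // Hypothetical entity name
    Customers = Dataflow{[entity = "Customers"]}[Data]
in
    Customers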

The ability to have a single data resource – dataflow or dataset – shared across workspaces is a significant change in how the Power BI service has traditionally worked. Before these new capabilities, each workspace was largely self-contained. Dashboards could only get data from a dataset in the same workspace, and the tables in the dataset each contained the queries that extracted, transformed, and loaded their data. This workspace-centric design encouraged[3] approaches where assets were grouped into workspaces because of the platform’s constraints, and not because it was the best way to meet the business requirements.

Now that we’re no longer bound by these constraints, it’s time to start thinking about having workspaces in Power BI whose function is to contain data artifacts (dataflows and/or datasets) that are used by visualization artifacts (dashboards and reports) in other workspaces. It’s time to start thinking about approaches that may look something like this:

data centric workspaces

Please keep in mind these two things when looking at the diagram:

  1. This is an arbitrary collection of boxes and arrows that illustrate a concept, and not a reference architecture.
  2. I do not have any formal art training.

Partitioning workspaces in this way encourages reuse and can reduce redundancy. It can also help enable greater separation of duties during development and maintenance of Power BI solutions. If you have one team that is responsible for making data available, and another team that is responsible for visualizing and presenting that data to solve business problems[4], this approach can give each team a natural space for its work. Work space. Workspace. Yeah.

Many of the large enterprise customers I work with are already evaluating or adopting this approach. Like any big change it’s safer to approach this effort incrementally. The customers I’ve spoken to are planning to apply this pattern to new solutions before they think about retrofitting any existing solutions.

Once you’ve had a chance to see how these new capabilities can change how your teams work with Power BI, I’d love to hear what you think.

Edit 2019-06-26: Adam Saxton from Guy In A Cube has published a video on Shared and Certified datasets. If you want another perspective on how this works, you should watch it.


[1] Currently in preview: blog | docs.

[2] If you’re wondering how these capabilities for data reuse relate to each other, you may want to check out this older post, as the post you’re currently reading won’t go into this topic: Lego Bricks and the Spectrum of Data Enrichment and Reuse.

[3] And in some cases, required.

[4] If you don’t, you probably want to think about it. This isn’t the only pattern for successful adoption of Power BI at scale, but it is a very common and tested pattern.

Quick Tip: Restricting access to linked entities in Power BI dataflows

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

If you use dataflows with Power BI Premium, you probably use linked and computed entities. There’s an overview post here, and an example of how to use these tools for data profiling here, but in case you don’t want to click through[1], here’s a quick summary:

  • When adding entities to a dataflow, you use another dataflow as a data source
  • This adds linked entities to your new dataflow, which are basically pointers to the entities in the source dataflow
  • You then use these linked entities as building blocks for new entities, using union or merge or similar approaches
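To make the pattern concrete, here’s a minimal sketch of what a computed entity’s query might look like when it unions two linked entities. The entity names are hypothetical; inside the dataflow, linked entities can be referenced by name just like any other query:

let
    // #"Sales 2018" and #"Sales 2019" are hypothetical linked entities
    // that point back to entities in the source dataflow
    Combined = Table.Combine({#"Sales 2018", #"Sales 2019"})
in
    Combined

In the example I’ll walk through below, the computed entities use Table.Profile instead of a union, but the structure is the same: a query that starts from one or more linked entities and adds its own steps.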

This approach is simple and powerful, but[2] it may not always give you exactly what you want. For example, what if you don’t want the users who have access to your new computed entities to also have access to the linked entities your new dataflow references?

Let’s take a look at what this looks like. I’m using the dataflow I built for that older post on data profiling as the starting point[3], so if you’re a regular reader this may look familiar.

01 dataflow before

This is a simple dataflow that contains three linked entities and three computed entities. The computed entities use Table.Profile to generate profiles for the data in the linked entities. When you connect to the dataflow using Power BI Desktop, it looks like this:

02 - consumption before

As you can see, all six entities are available to load into Power BI Desktop.

What if you only wanted users to be able to read the profiles, without also granting them access to the entities being profiled? Why do dataflows give access to both?

The answer is equally simple, and obvious once you see it:

03 - load is enabled by default

As with other dataflow entities[4], the linked entities are enabled for load by default. Removing these entities from the dataflow is as simple as clearing this setting.

04 - load disabled

Once this option is cleared for the linked entities, the dataflow will look like this, with only the three computed entities being included:

05 - dataflow after

And as desired, only these entities are accessible to users in Power BI Desktop:

06 - Consumption after

Hopefully this quick tip is helpful. If this is something that has been making you wonder, please realize you’re in excellent company – you’re not the only one. And if you have other questions about using dataflows in Power BI, please don’t hesitate to ask!


[1] Don’t feel bad – I didn’t want to click through either, and wrote this summary mainly so I didn’t need to read through those older posts to see what I said last year.

[2] As I’ve recently learned by having multiple people ask me about this behavior.

[3] Because I’m lazy.

[4] And Power Query queries in general.