Authoring Power BI Dataflows in Power BI Desktop

That title got your attention, didn’t it?

Before we move on, no, you cannot create and publish dataflow entities from Power BI Desktop today. Creating dataflows is a task you need to perform in the browser. But you can build your queries in Power BI Desktop if that is your preferred query authoring tool. Here’s how.[1]

  1. Create a query or set of queries in Power BI Desktop.
  2. Copy the query scripts from the Power BI Desktop query editor, and paste them into a “Blank query” in Power Query Online.
  3. Rename the new queries to match your desired entity names, being careful to match the names of the source queries if there are any references between them.
  4. If necessary, disable the “load” option for queries that only implement shared logic and should not be persisted in the dataflow’s CDM folder.

That’s it.
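For example, a simple query like the following – the URL and column name here are just placeholders – can be copied from Power BI Desktop and pasted into a “Blank query” online, and it becomes an entity when the dataflow is saved and refreshed:

    let
        // Placeholder URL – swap in any source that dataflows support
        Source = Csv.Document(Web.Contents("https://example.com/data/sales.csv"), [Delimiter = ",", Encoding = 65001]),
        PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
        ChangedTypes = Table.TransformColumnTypes(PromotedHeaders, {{"Amount", type number}})
    in
        ChangedTypes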

Some of you may be asking “but why would I want to do this, when there’s already an awesome query authoring experience in the browser?”

Good question! There are three reasons why I will often use this technique:

  1. I prefer rich, non-browser-based editing tools[2], and Power BI Desktop has a polished and refined UX.
  2. The online editor doesn’t expose all of the transformations that Power BI Desktop offers in its UI.
  3. The online editor doesn’t expose all supported data source connectors in its UI.

Each of these points relates to the maturity of Power BI Desktop as a tool[3], as opposed to the relatively new Power Query Online. Power Query Online is part of the dataflows preview and is continuing to improve and expand in functionality, but Power BI Desktop has been generally available for years.

Although I didn’t realize it until I started writing this post, Power BI Desktop actually has features that make this scenario easier than expected. Let’s look at an example. Here are the queries I’m starting with:

[screenshot]

In the PBIX file I have three sets of queries, organized by folder:

  1. A query that references the one data source that is used in the other queries, so I can change the connection information in one place and have everything else update.
  2. Three queries that each contain minimally-transformed data from the data source, and which are not loaded into the data model.
  3. Two queries that are loaded into the data model and which are used directly for reporting.

This is a common pattern for my PBIX files. I don’t know if it counts as a best practice (especially now that Power BI has better support for parameterization than it did when I started doing things this way) but it works for me, and nicely illustrates the approach I’m following.

To move this query logic out of my PBIX file and into a dataflow definition, I first need to copy the query scripts. Power BI Desktop makes this incredibly easy. When I right-click on any folder and choose “copy”, Power BI Desktop places the scripts for all queries in the file – including the query names as comments – on the clipboard.

[screenshot]
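To give a sense of what ends up on the clipboard, here is a rough sketch using two hypothetical queries – a “DataSource” query holding the connection and a “Customers” query that references it. The server, database, and table names are made up; the important part is that each script is preceded by a comment containing the query name:

    // DataSource
    let
        Source = Sql.Database("myserver", "SalesDb")
    in
        Source

    // Customers
    let
        Source = DataSource,
        CustomersTable = Source{[Schema = "dbo", Item = "Customers"]}[Data]
    in
        CustomersTable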

Now that I have all of the query scripts in a text editor, I can get started by creating a new dataflow, and selecting “Blank query” for my first entity.

[screenshot]

After I paste in the query script, I deselect the “Enable load” option and paste in the query name as well.

[screenshot]

Once this is done, I can repeat the process by selecting the “Get data” option in the query editor, and choosing “Blank query” for each additional query.

[screenshot]

After I repeat this process for each remaining query, my dataflow will look something like this.

[screenshot]

And if I want to, I can even add groups to organize the queries.

[screenshot]

This looks a lot like where we started, which is both a good thing and a bad thing. The good side is that it demonstrates how we can use a more mature authoring experience for our interactive query development. The bad side is that it introduces additional steps into our workflow.

I expect the integration between dataflows and Desktop to only get better over time[4], but for today, there’s still an easy path if you want to use Power BI Desktop to author your dataflow entities.

As a closing word of caution, please be aware that not all data sources and functions that work in Power BI Desktop will work in dataflows. If you’re using a data source for the first time, or are using a new function[5], you’ll probably want to test things early to avoid a potentially unpleasant surprise later on.

 


[1] I’ve mentioned this technique in a few other posts, but I’ve heard a bunch of questions in recent days that make me believe that the word isn’t getting out. Hopefully this post will help.

[2] I’m writing this post from the WordPress Windows app – even though the app offers nothing that the web editor does not, and actually appears to be the thinnest of wrappers around the web editor.

[3] And they all relate to the fact that Power BI Desktop is just so awesome, nothing compares to it, and although the Power Query Online editor is useful and good, it hasn’t had a team making it better every month, year after year.

[4] Please remember that this is my personal blog, and that even though I’m a member of the Power BI team, I’m not working on either dataflows or Desktop, so what I expect and what will actually happen aren’t always well aligned.

[5] Like Html.Table, which is not yet supported in dataflows. Darn it.

Power BI Dataflows and Slow Data Sources

One advantage of data warehouses is the ability to have data from multiple data sources available for analysis without being constrained by the availability and performance characteristics of those data sources. Although Power BI dataflows are not a replacement for data warehouses in a general sense, they can provide the same value to a self-service BI solution.

I’ve been using Power Query for over five years, and one consistently recurring pattern is the need to work with slow data sources such as web pages or APIs. “Slow” might mean “takes 10 minutes to load” but it could also mean “takes 10 hours” or “takes 10 days.” In each of these cases, my overall productivity ended up being closely tied to the refresh duration for the slowest data source in my solution, and this often introduced significant delay and frustration into my life. So I looked for solutions…

In the early days of Power BI, before Power BI dataflows, I would sometimes work around this problem by defining a set of Power Query queries in an Excel workbook, configuring those queries to load data into worksheets rather than into the data model. I would then use that Excel workbook as a data source for my primary queries. Essentially I was using an Excel file as a self-service data mart, since there wasn’t any similar capability in Power BI at the time. Here’s an example of the queries in a Power BI solution that I built almost five years ago[1]:

[screenshot]

These queries are the foundation of a solution I built to track my personal fitness, back when my exercise routine revolved around lifting weights. Back before I knew that people fought with longswords. I haven’t looked at it in years, but the organization shown in the screen shot above sure looks familiar.

There’s one folder for the queries extracting data from each source system. These queries are configured as “connection only” which means the data isn’t loaded directly into the data model – these queries serve as data sources for other queries. One of the folders for source queries references my Excel “staging area” workbook. As implied above, this other workbook contains data from sources that are too slow to include in the refresh of the primary solution workbook, as well as sources that update very infrequently, so that refreshing the data on a regular basis would be unwarranted and inefficient.
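As a rough sketch of what this looked like (the file path, sheet name, and query names are hypothetical), the staging workbook was wrapped in a connection-only query, and the other source queries referenced it:

    // StagingWorkbook – connection-only query pointing at the staging file
    let
        Source = Excel.Workbook(File.Contents("C:\Data\FitnessStaging.xlsx"), null, true)
    in
        Source

    // BodyMeasurements – reads one sheet of pre-loaded data from the staging workbook
    let
        Source = StagingWorkbook,
        MeasurementsSheet = Source{[Item = "BodyMeasurements", Kind = "Sheet"]}[Data],
        PromotedHeaders = Table.PromoteHeaders(MeasurementsSheet, [PromoteAllScalars = true])
    in
        PromotedHeaders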

There’s also one folder for the queries that define the solution’s data model. These queries are loaded and represent a simple star schema that is friendly for analytics use cases.[2]

Does this pattern look familiar?

I hope it does. This is essentially the same pattern that I’ve included in almost every dataflows presentation I’ve delivered. This one:

[screenshot]

Dataflows make this pattern much easier to implement and maintain.

Were I to build a similar solution today[3], I would factor it in the same way. Each folder in the Excel-based solution would translate into a Power BI dataflow, and the queries in the folders would translate into entities. By having each source as its own dataflow, I can independently configure refresh schedules based on the performance and data freshness characteristics of each source. And by having the data in dataflow entities, I can easily reuse that data in multiple Power BI solutions – and enable other users in my organization to do the same[4].

If I were using Power BI Premium, I would probably build the whole solution using dataflows and computed entities, and only load the star schema into Power BI Desktop.

If I were not using Power BI Premium, I would probably build everything but the star schema using dataflows, and build the star schema using queries in Power BI Desktop that get their data from these dataflows.[5] As if the dataflows were logically a data warehouse.
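To make that concrete, here is a minimal sketch of what one of those Desktop-side star schema queries might look like. The Workouts and Exercises queries are assumed to already exist in the PBIX file, each created with the Power BI dataflows connector and pointing at one dataflow entity; the entity and column names are made up:

    // FactWorkouts – joins two dataflow-sourced queries into a fact table
    let
        Joined = Table.NestedJoin(Workouts, {"ExerciseId"}, Exercises, {"ExerciseId"}, "Exercise", JoinKind.LeftOuter),
        Expanded = Table.ExpandTableColumn(Joined, "Exercise", {"ExerciseName"}),
        ChangedTypes = Table.TransformColumnTypes(Expanded, {{"Reps", Int64.Type}, {"Weight", type number}})
    in
        ChangedTypes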

Which is sort of where we came in…[6]

 


[1] The astute reader will realize that this is back before Power BI existed as we know it today. It feels strange to share the details of such an old solution, but this goes to show that although new tools are always being introduced, old problem patterns never seem to go away.

[2] In September 2014 I did a presentation about this solution for the PASS Data Warehousing and Business Intelligence Virtual Chapter, and the recording is still online on YouTube. I have not watched it, and don’t plan on watching it, and you should be warned before you watch it, because if I remember correctly it includes shirtless progress pictures of me.

[3] I will not do this, but only because I don’t have similarly rich sources of swordfighting-related data.

[4] This sounds like I’m implying that I have a group of analysts working on my personal fitness, doesn’t it? That’s just weird, even to me.

[5] This is another “reuse without Premium” scenario, independent of the one introduced in my earlier Power BI Dataflows – Reuse without Premium post.

[6] Imagine my disappointment when this screen shot just didn’t fit into this post. If only there was some way I could find a reason to include it somewhere…

[screenshot]

Power BI Dataflows FAQ


Q: What are Power BI dataflows?

A: Dataflows are a self-service ETL and data preparation capability in Power BI that enables analysts and business users to define and share reusable data entities. Each dataflow is created in a Power BI workspace and can contain one or more entities. Each entity is defined by a Power Query “M” query. When the dataflow is refreshed, the queries are executed, and the entities are populated with data.

Q: Where is the data stored?

A: Data is stored in Azure Data Lake Storage gen2 (ADLSg2) in the CDM folder format. Each dataflow is saved in a folder in the data lake. The folder contains one or more files per entity. If an entity does not use incremental refresh, there will be one file for the entity’s data. For entities that do use incremental refresh, there will be multiple files based on the refresh settings. The folder also contains a model.json file that has all of the metadata for the dataflow and the entities.

Q: Do I need to pay for Power BI dataflows?

A: Yes, but you don’t need to pay extra for them. Dataflows are available to Power BI Pro and Premium users.

Q: Do I need Power BI Premium to use dataflows?

A: No. Although some specific features (incremental refresh of dataflow entities and linked/computed entities) do require Premium, dataflows are not a Premium-only capability.

Q: How much data storage do I get?

A: Storage for dataflow entities counts against the existing limits for Power BI. Each user with a Power BI Pro license has a limit of 10GB, and each Premium capacity node has a limit of 100TB.

Q: Do dataflows support incremental refresh?

A: Yes. Incremental refresh can be configured on a per-entity basis. Incremental refresh is supported only in Power BI Premium.

Q: Can I use on-premises data sources with dataflows?

A: Yes. Dataflows use the same gateways used by Power BI datasets to access on-premises data sources.

Q: How do I do X with dataflows? I can do it in a query in Power BI Desktop, but I don’t see it in the dataflows query editor UI!

A: Most Power Query functionality is available in dataflows, even if it isn’t exposed through the query editor in the browser. If you have a query that works in Power BI Desktop, copy the “M” script into a “blank” query to create a new dataflow entity. In most cases it will work.

Q: Do I still need a data warehouse if I use dataflows?

A: If you needed a data warehouse before Power BI dataflows, you probably still need a data warehouse. Although dataflows serve a similar logical function as a data warehouse or data mart, modern data warehouse platforms provide capabilities that dataflows do not.

Q: Do I need dataflows if I already have a data warehouse?

A: Dataflows fill a gap in data warehousing and BI tools by allowing business users and analysts to prepare and share data without needing help from IT. With dataflows, users can build a “self service data mart” in Power BI that can be used in their solutions. Because each dataflow entity is defined by a Power Query “M” query, handing off the definitions to an IT team for operationalization/industrialization is more straightforward.

Q: Do dataflows replace Azure Data Factory?

A: No. Azure Data Factory (ADF) is a hybrid data integration platform designed to support enterprise-scale ETL and data integration needs. ADF is designed for use by professional data engineers. Power BI dataflows are designed for use by analysts and business users – people familiar with the Power Query experience from Power BI Desktop and Excel – to load data into ADLSg2.

Q: With Wrangling Data Flows in Azure Data Factory, do we still need dataflows in Power BI?

A: Probably. Power BI dataflows are a self-service data preparation tool that enables analysts and other business users who may not be comfortable using SSIS or ADF to solve data prep problems without IT involvement. This remains relevant now that ADF includes Power Query via Wrangling Data Flows.

Q: Can I use dataflows for realtime / streaming data?

A: No. Dataflows are for batch data, not streaming data.

Q: Do dataflows replace Power BI datasets?

A: No. Power BI datasets are tabular analytic models that contain data from various sources. Power BI dataflows can be some or all of the sources used by a dataset. You cannot build a Power BI report directly against a dataflow – you need to build reports against datasets.

Q: How can I use the data in a dataflow?

A: No. Oh, wait, that doesn’t make sense – this wasn’t even a yes or no question, but I was on a roll… Anyway, you use the data in a dataflow by connecting with the Power BI dataflows connector in Power Query. This will give you a list of all workspaces, dataflows, and entities that you have permission to access, and you can use them like any other data source.
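In Power BI Desktop, the script that the dataflows connector generates looks something like the sketch below. The GUIDs and entity name are placeholders, and the exact navigation steps may differ slightly from what the connector produces for you:

    let
        Source = PowerBI.Dataflows(null),
        Workspace = Source{[workspaceId = "00000000-0000-0000-0000-000000000000"]}[Data],
        Dataflow = Workspace{[dataflowId = "00000000-0000-0000-0000-000000000000"]}[Data],
        // "Customers" is a placeholder entity name
        Entity = Dataflow{[entity = "Customers"]}[Data]
    in
        Entity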

Q: Can I connect to dataflows via Direct Query?

A: No. Dataflows are an import-only data source.

Q: Can I use the data in dataflows in one workspace from other workspaces?

A: Yes! You can import entities from any combination of workspaces and dataflows into your PBIX file and publish it to any workspace where you have the necessary permissions.

Q: Can I use the data in a dataflow from tools other than Power BI?

A: Yes. You can configure your Power BI workspace to store dataflow data in an Azure Data Lake Storage gen2 resource that is part of your Azure subscription. Once this is done, refreshing the dataflow will create files in the CDM Folder in the location you specify. The files can then be consumed by other Azure services and applications.

Q: Why do dataflows use CSV files? Why not a cooler file format like Parquet or Avro?

A: The dataflows whitepaper answers this one, but it’s still a frequently asked question. From the whitepaper: “CSV format is the most ubiquitously supported format in Azure Data Lake and data lake tools in general, and CSV is generally the fastest and simplest to write for data producers.” You should probably read the whole thing, and not just this excerpt, because later on it says that Avro and Parquet will also be supported.

Q: Is this an official blog or official FAQ?

A: No, no. Absolutely not. Oh my goodness no. This is my personal blog, and I always suspect the dataflows team cringes when they read it.

Power Query: Connecting to Excel Workbooks in SharePoint Online

I try to avoid using Excel as a data source when I can. Although Excel has many benefits[1], its limitations[2] make it less than ideal for enterprise BI solutions. Despite this, it feels like I end up needing to analyze data in Excel at least once a month. And every time I’m connecting from Power Query to an Excel workbook located in SharePoint Online or OneDrive for Business, I find myself exclaiming “it shouldn’t be this hard!”[3]

What’s the problem? It’s not working with the data – Power Query makes this as painless as you’d expect. The problem is referencing the XLSX file. What path do I use?

In Excel and in SharePoint, there is a “copy link” option, which will copy a link to the workbook so you can share it with others.

[screenshot: doesn't work]

It always feels like this should work, that I should be able to paste this URL into the Get Data experience in Power Query, and if it’s been more than a few months since the last time I tried, I expect it to work. But it never does.

This works:

[screenshot: this works]

In the Excel ribbon, click on the File tab, and on the Info screen that appears click on the label under the file name. There’s no visual indicator that this label is a menu button until you hover over it, but when you click on it, a menu appears with a “Copy path to clipboard” option.

This will give you a path in this format:

https://organization.sharepoint.com/path/path/path/NicelyEscapedFileName.xlsx?web=1

With this starting point, all you need to do is remove the ?web=1 query string parameter at the end of the URL, and you have the path to the workbook that Power Query needs.

Choose “Web” as the data source in the Get Data menu, and paste this URL into the dialog:

[screenshot: From web]

Power Query is smart enough to understand that this URL is pointing to an Excel workbook, so from this point on, you’ll get an experience where you can select tables and worksheets, and can start manipulating and analyzing your Excel data.

[screenshot: navigator]
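If you look at the script behind the resulting query, it follows the familiar Excel.Workbook pattern, just wrapped around Web.Contents. Here’s a sketch using the placeholder URL from above – the table name is hypothetical:

    let
        Source = Excel.Workbook(Web.Contents("https://organization.sharepoint.com/path/path/path/NicelyEscapedFileName.xlsx"), null, true),
        SalesTable = Source{[Item = "Sales", Kind = "Table"]}[Data]
    in
        SalesTable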

Life is good.

Full disclosure: Were it not for Power Query program manager Miguel Llopis, I never would have found this. During one of our dataflows sessions at Microsoft Ignite, Miguel spent a few minutes walking through these steps at the beginning of a demo. My initial reaction was along the lines of “Why are you showing this? This isn’t a dataflows topic!” but when I realized what he was showing it quickly changed to “Holy crap – I didn’t know you could do that!”

In any event, I’m working with data in Excel again today, and spent a few minutes trying to find this technique documented anywhere. When I couldn’t find anything, this post was born… but all credit should go to Miguel.

 


[1] None of which will I list here. It’s not that kind of blog.

[2] Neither will the limitations be enumerated. Feel free to take my word for it, or go build a data warehouse or two that extract data from Excel, and then come back and let me know what you think.

[3] To be fair, this may be another of those “get off my lawn” moments. It’s hard to tell sometimes.