Using Custom Functions in Power BI Dataflows

You can use custom Power Query “M” functions in Power BI dataflows, even though the preview Power Query Online editor doesn’t expose or support them to the same extent that Power BI Desktop does.[1]

As mentioned in a recent post on Authoring Power BI Dataflows in Power BI Desktop, the Power Query “M” queries that define your dataflow entities can contain a lot more than what can be created in Power Query Online. One example of this is support for custom functions in a dataflow. Functions work the same way in dataflows as they do in Power BI Desktop; there’s just not the same UX support.

Let’s see how this works. Specifically, let’s build a dataflow that contains a custom function and which invokes it in one of the dataflow entities. Here’s what we’ll do:

  1. We’ll define a custom function that accepts start and end dates, and returns a table with one row for each day between these dates. Specifically, we’ll use the date dimension approach that Matt Masson published five years ago[2], when Power Query functions were new.
  2. We’ll pull in sales order data from the SalesOrderHeader table in the AdventureWorks sample database to be an “Order” entity in the dataflow.
  3. We’ll use the min and max of the various date columns in the SalesOrderHeader table to get the parameter values to pass into the custom function. We’ll then call the custom function to build a Date entity in the dataflow.
  4. We’ll close our eyes and imagine doing the rest of the work to load other entities in the dataflow to make what we’d need to build a full star schema in a Power BI dataset, but we won’t actually do the work.

Let’s go. Since we’re just copying the code from Matt’s blog, we’ll skip the code here, but the result in Power Query Online is worth looking at.

[Screenshot: the custom function as shown in Power Query Online]

Even though Power Query Online doesn’t have a dedicated “create function” option, it does recognize when a query is a function, and does include a familiar UX for working with a function. You will, however, need to clear the “Enable load” option for the query, since a function can’t be loaded directly.
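For readers who don’t click through, a date function of this kind might look roughly like the sketch below. This is not Matt’s code: the name fn_DateTable and the column names are taken from the Date entity script later in this post, while the optional Culture parameter and the way each column is calculated are assumptions made for illustration.

```powerquery
// fn_DateTable: illustrative sketch only, not the original published function.
// Returns one row per day from StartDate through EndDate (inclusive).
(StartDate as date, EndDate as date, optional Culture as nullable text) as table =>
let
    // Number of days in the inclusive range
    DayCount = Duration.Days(EndDate - StartDate) + 1,
    // One list item per day, stepping one day at a time
    Dates = List.Dates(StartDate, DayCount, #duration(1, 0, 0, 0)),
    AsTable = Table.FromList(Dates, Splitter.SplitByNothing(), {"Date"}),
    Typed = Table.TransformColumnTypes(AsTable, {{"Date", type date}}),
    // Numeric calendar attributes
    AddYear = Table.AddColumn(Typed, "Year", each Date.Year([Date]), Int64.Type),
    AddQuarter = Table.AddColumn(AddYear, "QuarterOfYear", each Date.QuarterOfYear([Date]), Int64.Type),
    AddMonth = Table.AddColumn(AddQuarter, "MonthOfYear", each Date.Month([Date]), Int64.Type),
    AddDay = Table.AddColumn(AddMonth, "DayOfMonth", each Date.Day([Date]), Int64.Type),
    // Integer key in yyyymmdd form
    AddDateInt = Table.AddColumn(AddDay, "DateInt", each [Year] * 10000 + [MonthOfYear] * 100 + [DayOfMonth], Int64.Type),
    // Text attributes; a null Culture falls back to the environment default
    AddMonthName = Table.AddColumn(AddDateInt, "MonthName", each Date.ToText([Date], "MMMM", Culture), type text),
    AddMonthInCalendar = Table.AddColumn(AddMonthName, "MonthInCalendar", each Text.Start([MonthName], 3) & " " & Text.From([Year]), type text),
    AddQuarterInCalendar = Table.AddColumn(AddMonthInCalendar, "QuarterInCalendar", each "Q" & Text.From([QuarterOfYear]) & " " & Text.From([Year]), type text),
    // Day-of-week number (zero-based) and name
    AddDayInWeek = Table.AddColumn(AddQuarterInCalendar, "DayInWeek", each Date.DayOfWeek([Date]), Int64.Type),
    AddDayOfWeekName = Table.AddColumn(AddDayInWeek, "DayOfWeekName", each Date.ToText([Date], "dddd", Culture), type text)
in
    AddDayOfWeekName
```

Pasting a script like this into a blank query and naming the query fn_DateTable should be enough for the editor to treat it as a function, since the script evaluates to a function value rather than a table.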

The Order entity is super simple – we’re just pulling in a table and removing the complex columns that Power Query adds to represent related tables in the database. Here’s the script:

let
    // Connect to the AdventureWorks sample database
    Source = Sql.Database("myserver.database.windows.net", "adventureworks"),
    // Navigate to the SalesLT.SalesOrderHeader table
    SalesLT_SalesOrderHeader = Source{[Schema = "SalesLT", Item = "SalesOrderHeader"]}[Data],
    // Remove the complex columns that represent related tables
    #"Removed columns" = Table.RemoveColumns(SalesLT_SalesOrderHeader, {"SalesLT.Address(BillToAddressID)", "SalesLT.Address(ShipToAddressID)", "SalesLT.Customer", "SalesLT.SalesOrderDetail"})
in
    #"Removed columns"

Now we need to put the two of them together. Let’s begin by duplicating the Order entity. If we referenced the Order entity instead of duplicating it, we would end up with a computed entity, which would require Power BI Premium capacity to refresh.
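To make the distinction concrete, here is a minimal sketch of what a referencing query looks like. It assumes the entity is named Order, as above, and the step name DateColumns is made up for illustration. A query like this, which starts from a loaded entity, is the kind that would become a computed entity.

```powerquery
let
    // "Order" is the name of another loaded entity in the same dataflow
    Source = Order,
    // Later steps build on the Order entity rather than on the database
    DateColumns = Table.SelectColumns(Source, {"OrderDate", "DueDate", "ShipDate"})
in
    DateColumns
```

A duplicate, by contrast, repeats the Sql.Database source steps in its own script, so it does not depend on the other entity.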

This is what the query looks like before we invoke the custom function. With all of the awesome options on the “Add Column” tab in Power BI Desktop, implementing this logic was surprisingly easy.

let
    Source = Sql.Database("myserver.database.windows.net", "adventureworks"),
    SalesLT_SalesOrderHeader = Source{[Schema = "SalesLT", Item = "SalesOrderHeader"]}[Data],
    // Keep only the three date columns
    #"Removed Other Columns" = Table.SelectColumns(SalesLT_SalesOrderHeader, {"OrderDate", "DueDate", "ShipDate"}),
    // Add a constant column so that every row falls into a single group
    #"Added Custom" = Table.AddColumn(#"Removed Other Columns", "Custom", each "Group by all"),
    // Compute the min and max of each date column across the whole table
    #"Grouped Rows" = Table.Group(#"Added Custom", {"Custom"}, {
        {"Min Order Date", each List.Min([OrderDate]), type datetime},
        {"Max Order Date", each List.Max([OrderDate]), type datetime},
        {"Min Due Date", each List.Min([DueDate]), type datetime},
        {"Max Due Date", each List.Max([DueDate]), type datetime},
        {"Min Ship Date", each List.Min([ShipDate]), type datetime},
        {"Max Ship Date", each List.Max([ShipDate]), type datetime}
    }),
    #"Changed Type" = Table.TransformColumnTypes(#"Grouped Rows", {{"Min Order Date", type date}, {"Max Order Date", type date}, {"Min Due Date", type date}, {"Max Due Date", type date}, {"Min Ship Date", type date}, {"Max Ship Date", type date}}),
    // Reduce the six values to one overall min and one overall max
    #"Inserted Earliest" = Table.AddColumn(#"Changed Type", "Min Date", each List.Min({[Min Order Date], [Min Due Date], [Min Ship Date]}), type date),
    #"Inserted Latest" = Table.AddColumn(#"Inserted Earliest", "Max Date", each List.Max({[Max Order Date], [Max Due Date], [Max Ship Date]}), type date),
    #"Removed Other Columns1" = Table.SelectColumns(#"Inserted Latest", {"Min Date", "Max Date"})
in
    #"Removed Other Columns1"

At the end of this interim query, we have two columns to pass in to the custom function. And once we do, it looks like this:

[Screenshot: the query after invoking the custom function]

And here’s the final script used to define the Date entity.

let
    Source = Sql.Database("myserver.database.windows.net", "adventureworks"),
    SalesLT_SalesOrderHeader = Source{[Schema = "SalesLT", Item = "SalesOrderHeader"]}[Data],
    #"Removed Other Columns" = Table.SelectColumns(SalesLT_SalesOrderHeader, {"OrderDate", "DueDate", "ShipDate"}),
    #"Added Custom" = Table.AddColumn(#"Removed Other Columns", "Custom", each "Group by all"),
    #"Grouped Rows" = Table.Group(#"Added Custom", {"Custom"}, {
        {"Min Order Date", each List.Min([OrderDate]), type datetime},
        {"Max Order Date", each List.Max([OrderDate]), type datetime},
        {"Min Due Date", each List.Min([DueDate]), type datetime},
        {"Max Due Date", each List.Max([DueDate]), type datetime},
        {"Min Ship Date", each List.Min([ShipDate]), type datetime},
        {"Max Ship Date", each List.Max([ShipDate]), type datetime}
    }),
    #"Changed Type" = Table.TransformColumnTypes(#"Grouped Rows", {{"Min Order Date", type date}, {"Max Order Date", type date}, {"Min Due Date", type date}, {"Max Due Date", type date}, {"Min Ship Date", type date}, {"Max Ship Date", type date}}),
    #"Inserted Earliest" = Table.AddColumn(#"Changed Type", "Min Date", each List.Min({[Min Order Date], [Min Due Date], [Min Ship Date]}), type date),
    #"Inserted Latest" = Table.AddColumn(#"Inserted Earliest", "Max Date", each List.Max({[Max Order Date], [Max Due Date], [Max Ship Date]}), type date),
    #"Removed Other Columns1" = Table.SelectColumns(#"Inserted Latest", {"Min Date", "Max Date"}),
    // Call the custom function once, for the single row holding the overall min and max
    #"Invoked Custom Function" = Table.AddColumn(#"Removed Other Columns1", "fn_DateTable", each fn_DateTable([Min Date], [Max Date], null)),
    // Drill into the table returned by the function
    fn_DateTable1 = #"Invoked Custom Function"{0}[fn_DateTable],
    // Set explicit types on the date dimension columns
    #"Changed Type1" = Table.TransformColumnTypes(fn_DateTable1, {
        {"Year", Int64.Type}, {"QuarterOfYear", Int64.Type}, {"MonthOfYear", Int64.Type}, {"DayOfMonth", Int64.Type}, {"DateInt", Int64.Type}, {"DayInWeek", Int64.Type},
        {"MonthName", type text}, {"MonthInCalendar", type text}, {"QuarterInCalendar", type text}, {"DayOfWeekName", type text}
    })
in
    #"Changed Type1"

Most of the complexity in this approach is in the work required to get min and max values from three columns in a single table. The topic of the post – calling a custom function inside a dataflow entity definition – is trivial.
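One possible way to shorten the min/max work is to pool the three date columns into a single list before taking the minimum and maximum. The sketch below assumes the same source and the same fn_DateTable function as above; the step names are made up for illustration, and it has not been checked for whether it folds back to the database as well as the grouped version does.

```powerquery
let
    Source = Sql.Database("myserver.database.windows.net", "adventureworks"),
    SalesOrderHeader = Source{[Schema = "SalesLT", Item = "SalesOrderHeader"]}[Data],
    // Pool the values of all three date columns into one list
    AllDates = List.Combine({
        SalesOrderHeader[OrderDate],
        SalesOrderHeader[DueDate],
        SalesOrderHeader[ShipDate]
    }),
    // Convert the datetime results to dates, which is what the function expects
    MinDate = Date.From(List.Min(AllDates)),
    MaxDate = Date.From(List.Max(AllDates)),
    // Call the custom function directly instead of adding it as a column
    DateTable = fn_DateTable(MinDate, MaxDate, null)
in
    DateTable
```

The type-conversion step from the full script could be appended to this version in the same way.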

When we’re done, the list of entities only shows Order and Date, because these are the only two queries that are being loaded into the dataflow’s CDM folder storage. But the definition of the Date query includes the use of a custom function, which allows us to have rich and possibly complex functionality included in the dataflow code, and referenced by one or more entities as necessary.


[1] I was inspired to write this post when I saw this idea on ideas.powerbi.com. If this capability is obscure enough to get its own feature request and over a dozen votes, it probably justifies a blog post.

[2] And the same code, copied and pasted, like real developers do.

Dataflows in Power BI: Overview Part 8 – Using an Organizational Azure Data Lake Resource

One key aspect of Power BI dataflows is that they store their data in CDM folders in Azure Data Lake Storage gen2.[1] When a dataflow is refreshed, the queries that define the dataflow entities are executed, and their results are stored in the underlying CDM folders in the data lake storage that’s managed by the Power BI service.

By default the Power BI service hides the details of the underlying storage. Only the Power BI service can write to the CDM folders, and only the Power BI service can read from them.

NARRATOR:

But Matthew knew that there are other options beyond the default…

Please note: At the time this post is published, the capabilities it describes are being rolled out to Power BI customers around the world. If you do not yet see these capabilities in your Power BI tenant, please understand that the deployment process may take several days to reach all regions.

In addition to writing to the data lake storage that is included with Power BI, you can also configure Power BI to write to an Azure Data Lake Storage gen2 resource in your own Azure subscription. This configuration opens up powerful capabilities for using data created in Power BI as the source for other Azure services. This means that data produced by analysts in a low-code/no-code Power BI experience can be used by data scientists in Azure Machine Learning, or by data engineers in Azure Data Factory or Azure Databricks.

Let that sink in for a minute, because it’s more important than it seemed when you just read it. Business data experts – the people who may not know professional data tools and advanced concepts in depth, but who are intimately involved with how the data is used to support business processes – can now use Power BI to produce data sets that can be easily used by data professionals in their tools of choice. This is a Big Deal. Not only does this capability deliver the power of Azure Data Lake Storage gen2 for scale and computing capability, it enables seamless collaboration between business and IT.

The challenge of operationalization/industrialization that has been part of self-service BI since self-service BI has been around has typically been solved by business handing off to IT the solution that they created. Ten years ago the artifact being handed off may have been an Excel workbook full of macros and VLOOKUP. IT would then need to reverse-engineer and re-implement the logic to reproduce it in a different tool and different language. Power Query and dataflows have made this story simpler – an analyst can develop a query that can be re-used directly by IT. But now an analyst can easily produce data that can be used – directly and seamlessly – by IT projects. Bam.

Before I move on, let me add a quick sanity check here. You can’t build a production data integration process on non-production data sources and expect it to deliver a stable and reliable solution, and that last paragraph glossed over this fact. When IT starts using a business-developed CDM folder as a data source, this needs to happen in the context of a managed process that eventually includes the ownership of the data source transitioning to IT. The integration of Power BI dataflows and CDM folders in Azure Data Lake Storage gen2 will make this process much simpler, but the process will still be essential.

Now let’s take a look at how this works.

I’m not going to go into details about the data lake configuration requirements here – but there are specific steps that need to be taken on the Azure side of things before Power BI can write to the lake. For information on setting up Azure Data Lake Storage gen2 to work with Power BI, check the documentation.

The details are in the documentation, but once the setup is complete, there will be a filesystem[2] named powerbi, and the Power BI service will be authorized to read it and write to it. As the Power BI service refreshes dataflows, it writes entity data in a folder structure that matches the content structure in Power BI. This approach, which has folders named after workspaces, dataflows, and entities, and files named after entities, makes it easier for all parties to understand what data is stored where, and how the file storage in the data lake relates to the objects in Power BI.
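As a rough illustration of that layout, the sketch below lists the entity names recorded in a dataflow’s model.json metadata file. The storage account, workspace, and dataflow names are hypothetical, and the exact folder path is an assumption based on the description above. It also assumes that the AzureStorage.DataLake connector is available in your version of Power Query and that you have been granted read access to the files.

```powerquery
let
    // Hypothetical path: <account>/powerbi/<workspace>/<dataflow>
    Files = AzureStorage.DataLake("https://myaccount.dfs.core.windows.net/powerbi/MyWorkspace/MyDataflow"),
    // Pick the metadata file out of the file listing
    ModelFile = Table.SelectRows(Files, each [Name] = "model.json"){0}[Content],
    // Parse the JSON document
    Model = Json.Document(ModelFile),
    // Each item in the "entities" array describes one entity in the dataflow
    EntityNames = List.Transform(Model[entities], each [name])
in
    EntityNames
```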

To enable this feature, a Power BI administrator first needs to use the Power BI admin portal to connect Power BI to Azure Data Lake Storage gen2. This is a tenant-level setting. The administrator must enter the Subscription ID, the Resource Group ID, and the Storage Account name for the Azure Data Lake Storage gen2 resource that Power BI will use. The administrator then needs to turn on the admin portal option labeled “Allow workspace admins to assign workspaces to this storage account.” Once this is turned on, we’re ready to go.

And of course, by “we” I mean “workspace admins” and by “go” I mean “configure our workspaces’ storage settings.”

When creating a new app workspace, in the “Advanced” portion of the UI, you can see the “Dataflow storage (Preview)” option. When this option is enabled, any dataflow in the workspace will be created in the ADLSg2 resource configured by the Power BI admin, rather than in the default internal ADLSg2 storage that is managed by the Power BI service.

[Screenshot: workspace settings]

There are a few things worth mentioning about this screen shot:

  1. This is not a Premium-only feature. Although the example above shows a workspace being created in dedicated Premium capacity, this is not required to use your own data lake storage account.
  2. If no Power BI administrator has configured an organizational data lake storage account, this option will not be visible.
  3. Apparently I need to go back and fix every blog post I’ve made up until now to replace “gen2” with “Gen2” because we’re using an upper-case G now.

There are a few limitations mentioned in the screen shot, and a few that aren’t, that are worth pointing out as well:

  1. Because linked and computed entities use in-lake compute, you need to be using the same lake for them to work.
  2. You can’t change this setting for a workspace that already has dataflows in it. This option is always available when creating a new workspace, and will also be available in existing workspaces without dataflows, but if you have defined dataflows in a workspace you cannot change its storage location.
  3. Permissions… get a little complicated.

…so let’s look at permissions a little[3].

When you’re using the default Power BI storage, the Power BI service manages data access through the workspace permissions. Power BI service is the only reader and the only writer for the underlying CDM folders, and the Power BI service controls any access to the data the CDM folders contain.

When you’re using your organization’s data lake resource, ADLSg2 manages data access through the ACLs set on the folders and files. The Power BI service will grant permissions to the dataflow creator, but any additional permissions must be manually set on the files and folders in ADLSg2[4]. This means that for any user to access the dataflow through Power BI or the CDM folder through ADLSg2, they need to be granted permissions on all files and folders in ADLSg2.

Between the ability to store dataflow data in your organization’s Azure Data Lake Storage gen2 resource, and the ability to attach external CDM folders as dataflows, Power BI now enables a wide range of collaboration scenarios.


[1] This time I just copied the opening sentence from the last blog post. Since I was writing them at the same time, that was much easier.

[2] Basically a root folder, says the guy who doesn’t really know much about Azure Data Lake Storage gen2.

[3] I’m planning a post dedicated to dataflows security, but it’s not ready yet. Hopefully this will be useful in the interim.

[4] This early experience will improve as the integration between Power BI and ADLSg2 continues to evolve.

More Resources: Power BI Dataflows and Azure Data

I’m not the only one who’s been busy sharing news and content this weekend about the integration of Power BI dataflows and Azure data services. Check out these additional resources and share the news.

  • Power BI Blog: This is the main Power BI announcement for the availability of Power BI dataflows integration with Azure Data Lake Storage Gen2.
  • Azure SQL Data Warehouse Blog: This is the main Azure announcement for the new integration capabilities, with lots of links to additional information for data professionals.
  • End-to-end CDM Tutorial on GitHub: This is the big one! Microsoft has published an end to end tutorial that includes Azure Data Factory, Azure Databricks, Azure SQL Data Warehouse, Azure SQL Database, and Azure Machine Learning.
  • CDM Documentation for ADLSg2: This is the official documentation for the Common Data Model including the model.json metadata file created for Power BI dataflows.

If you’re as excited as I am about today’s announcements, you’ll want to take the time to read all of these posts and to work through the tutorial as well. And probably do a happy dance of some sort.

Dataflows in Power BI: Overview Part 7 – External CDM Folders

One key aspect of Power BI dataflows is that they store their data in CDM Folders in Azure Data Lake Storage gen2.[1] When a dataflow is refreshed, the queries that define the dataflow entities are executed, and their results are stored in the underlying CDM Folders in the data lake.

By default the Power BI service hides the details of the underlying storage. Only the Power BI service can write to the CDM folders, and only the Power BI service can read from them.

NARRATOR:

But Matthew knew that there are other options beyond the default…

Because the CDM folder format is an open standard, any service or application can create them. A CDM folder can be produced by Azure Data Factory, Azure Databricks, or any other service that can output text and JSON files. Once the CDM folder exists, we just need to let Power BI know that it’s there.

Like this.

When creating a new dataflow, select the “Attach an external CDM folder” option. If you don’t see the “Attach an external CDM folder” and “Link entities from other dataflows” options, the most likely reason is that you’re not using a new “v2” workspace. These capabilities are available only in the new workspaces, which are currently also in preview.

[Screenshot: the “Attach an external CDM folder” option when creating a new dataflow]

You’ll then be prompted to provide the same metadata you would enter when saving a standard Power BI dataflow (required name and optional description) and also to enter the path to the CDM folder in Azure Data Lake Storage gen2.

Just as you need permissions to access your data sources when building a dataflow in Power BI, you also need permission on the CDM folder in Azure Data Lake in order to attach the CDM folder as an external dataflow.

[Screenshot: entering the name, description, and CDM folder path]

And that’s it!

The other steps that would normally be required to build a new dataflow are not required when attaching an external CDM folder. You aren’t building queries to define the entities, because a service other than Power BI will be writing the data in the CDM folder.

Once this is done, users can work with this external CDM folder as if it were a standard Power BI dataflow. An analyst working with this data in Power BI Desktop will likely never know (or care) that the data came from somewhere outside of Power BI. All that they will notice is that the data source is easy to discover and use, because it is a dataflow.

One potential complication[2] is that Power BI Desktop users must be granted permissions both in Power BI and in Azure Data Lake in order to successfully consume the data. In Power BI, the user must be a member of the workspace that contains the dataflow. If this is not the case, the user will not see the workspace in the list of workspaces when connecting to Power BI dataflows in Power BI Desktop. In Azure Data Lake, the user must be granted read permissions on the CDM folder and the files it contains. If this is not the case, the user will receive an error when attempting to connect to the dataflow.

One additional consideration to keep in mind is that linked entities are not supported when referencing dataflows created from external CDM folders. This shouldn’t be a surprise given how linked entities work, but it’s important to mention nonetheless.

Now that we’ve seen how to set up external folders, let’s look at why we should care. What scenarios does this feature enable? The biggest scenario for me is the ability to seamlessly bridge the worlds of self-service and centralized data, at the asset level.

Enabling business users to work with IT-created data obviously isn’t a new thing – this is the heart of many “managed self-service” approaches to BI. But typically this involves a major development effort, and it involves the sharing of complete models. IT builds data warehouses and cubes, and then educates business users on how to find the data and connect to it. But with external CDM folders, any data set created by a data professional in Azure can be exposed in Power BI without any additional IT effort. The fact that the data is in CDM folder format is enough. Once the CDM folder is attached in Power BI, any authorized user can easily discover and consume the data from directly within Power BI Desktop. And rather than sharing complete models, this approach enables the sharing of more granular reusable building blocks that can be used in multiple models and scenarios.

There doesn’t even need to be a multi-team or multi-persona data sharing aspect to the scenario. If a data engineer or data scientist is creating CDM folders in Azure, she may need to visualize that data, and Power BI is an obvious choice. Although data science tools typically have their own visualization capabilities, their options for distributing insights based on those visuals tend to fall short of what Power BI delivers. For data that is in CDM folders in Azure Data Lake Storage gen2, any data producer in Azure has a seamless way to expose and share their data with Power BI.

And of course, there are certainly many possibilities that I haven’t even thought of. I can’t wait to hear what you imagine!

Please also check out the blog post from Ben Sack on the dataflows team, because he goes into some details I do not.


[1] If you click through to read the CDM folders post you’ll see that I used almost exactly the same opening sentence, even though I hadn’t read that post since I wrote it over a month ago. That’s just weird.

[2] At least during the preview. I plan on going into greater depth on dataflows security in a future post, and you should expect to see things get simpler and easier while this feature is in preview.

Choosing Between Power BI Premium and Azure Analysis Services

Yesterday I posted an article comparing Power BI dataflows, Power BI datasets, and Azure Analysis Services. Although I’d like to believe that the article was useful, I used the disclaimer “I’m not an expert” in multiple places where I was talking about the differences between models in Power BI and AAS. I may not be an expert, but I do know quite a few people who are.

Specifically I know Gabi Münster and Oliver Engels from oh22data AG in Germany, and Paul Turley from Intelligent Business LLC in the United States.

Gabi and Oliver presented last month at the PASS Summit conference in Seattle on this very topic. Their session “Azure Analysis Services or Power BI – Which service fits you best?” looked at the history of the two services, their current capabilities and strengths, and future capabilities that Microsoft has announced in its business intelligence release notes. They even included a decision flowchart!

If you weren’t able to attend their session in November, I have good news and I have bad news and I have more good news.

The good news is that this session is included in the PASS Summit session recordings, which you can purchase and download today.

The bad news is that the session recordings cost $699, which may be difficult to justify if this is the only session you’re interested in[1].

The good news is that Oliver and Gabi were kind enough to share the slide deck and let me share it with you. You can download it here: AAS or PBI – Which service fits – from PASS Summit 2018.

And I’m very happy to see that their conclusions line up pretty well with my previous post.

[Image: PBI AAS]

Paul has also presented a conference session[2] related to this topic, and has also recently blogged with an excellent feature comparison table between the different options in SQL Server Analysis Services, Azure Analysis Services, and Power BI.

If you’re in a position where you need to select a BI platform, I highly recommend checking out these resources, as they include both valuable information and a framework for using that information in different common scenarios.

Update: Check out this new post from James Fancke at selfservedbi.com: Why Azure Analysis Services is a great value proposition. This article provides a great counterpoint to my post, drills down specifically into Azure Analysis Services, and is well worth a read.

And if after reading these posts and this slide deck you still have unanswered questions, please seek professional help. Specifically, please find a Microsoft Partner who specializes in business intelligence, or a similar expert consultant who can help evaluate your specific needs and how the different available technical options can be applied to address them.


[1] The conference had many awesome sessions, so this should not be the only one you’re interested in.

[2] A delightfully themed conference session, at that.

Dataflows, Datasets, and Models – Oh My!

How do Power BI datasets and dataflows relate to each other? Do you need one if you have the other?

Photo by Chris Liverani on Unsplash

I received this question as a comment on another post, and I think it warrants a full post as a reply:

Hi Matthew,  my organization is currently evaluating where to put BI data models for upcoming PBI projects. Central in the debates is the decision of whether to use PBI Datasets, SSAS or DataFlows. I know a lot of factors need considering. I’m interested in hearing your thoughts.

Rather than answering the question directly, I’m going to rephrase and re-frame it in a slightly different context.

I’m currently evaluating how to best chop and prepare a butternut squash. Central in the debates is the decision of whether to use a 6″ chef’s knife, a 10″ chef’s knife, or a cutting board.

(I’ll pause for a moment to let that sink in.)

It doesn’t really make sense to compare two knives and a cutting board in this way, does it? You can probably get the job done with either knife, and the cutting board will certainly make the job easier… but it’s not like you’d need to choose one of the three, right? Right?

Right!

Your choice of knife will depend on multiple factors including the size of the squash, the size of your hand, and whether or not you already have one or the other or both.

Your choice of using a cutting board will come down to your workflow and priorities. Do you already have a cutting board? Is it more important to you to have a safe place to chop the squash and not damage the edge of your knife, or is it more important to not have one more thing to clean?

Both of these are valid decisions that need to be made – but they’re not dependent on each other.

Let’s get back to the original question by setting some context for dataflows and datasets in Power BI.

[Image: where datasets and dataflows fit in Power BI]

This image is from one of the standard slides in my dataflows presentation deck, and I stole it from the dataflows team[1]. It shows where datasets and dataflows fit in Power BI from a high-level conceptual perspective.

Here’s what seems most important in the context of the original question:

  • Power BI visualizations are built using datasets as their sources
  • Power BI includes datasets, which are tabular BI models hosted in the Power BI service
  • Dataflows are a data preparation capability in Power BI for loading data into Azure Data Lake Storage gen2
  • Dataflows can be used as a data source when building datasets in Power BI, but cannot currently be used as a data source for models outside of Power BI, including SSAS and AAS
  • Dataflows and datasets solve different problems and serve different purposes, and cannot be directly compared to each other as the original question tries to do – that’s like comparing chef’s knives and cutting boards

What’s not shown in this diagram is SQL Server Analysis Services (SSAS) or Azure Analysis Services (AAS) because the diagram is limited in scope to capabilities that are natively part of Power BI. SSAS and AAS are both analytics services that can host tabular BI models that are very similar to Power BI datasets, and which can be used as a data source for Power BI datasets. Each option – SSAS, AAS, or Power BI datasets – is implemented using the same underlying technology[2], but each has different characteristics that make it more or less desirable for specific scenarios.

This list isn’t exhaustive, and I make no claims to being an expert on this topic, but these are the factors that seem most significant when choosing between SSAS, AAS, or Power BI datasets as your analytics engine of choice:

  • Cost and pricing model – if you choose SSAS you’ll need to own and manage your own physical or virtual server. If you choose AAS or Power BI you’ll pay to use the managed cloud service. Dedicated Power BI Premium capacity and shared Power BI Pro capacity have different licensing models and costs to target different usage patterns.
  • Model size – you can scale SSAS to pretty much any workload if you throw big enough hardware at it[3]. AAS can scale to models that are hundreds of gigabytes in size. Power BI Premium can support PBIX files up to 10GB[4], and Power BI Pro supports PBIX files up to 1GB.
  • Deployment and control scenarios – with SSAS and AAS, you have a huge range of application lifecycle management (ALM) and deployment capabilities that are enabled by the services’ XMLA endpoint and a robust tool ecosystem. Power BI Premium will support this before too long[5] as well.

I’m sure I’m missing many things, but this is what feels most important to me. Like I said, I’m far from being an expert on this aspect of Power BI and the Microsoft BI stack.

So let’s close by circling back to the original question, and that delicious analogy. You need a knife, but the knife you choose will depend on your requirements. Having a cutting board will probably also help, but it’s not truly required.

Now I’m hungry.


[1] If you want to watch a conference presentation or two that includes this slide, head on over to the Dataflows in Power BI: Resources post.

[2] This feels like an oversimplification, but it’s technically correct at the level of abstraction at which I’m writing it. If anyone is interested in arguing this point, please reply with a comment that links to your article or blog post where the salient differences are listed.

[3] Remember I’m not an expert on this, so feel free to correct me by pointing me to documentation. Thanks!

[4] This is not a direct size-to-size comparison. The services measure things differently.

[5] As announced at Microsoft Ignite a few months back, no firm dates shared yet.

Authoring Power BI Dataflows in Power BI Desktop

That title got your attention, didn’t it?

Before we move on, no, you cannot create and publish dataflow entities from Power BI Desktop today. Creating dataflows is a task you need to perform in the browser. But you can build your queries in Power BI Desktop if that is your preferred query authoring tool. Here’s how.[1]

  1. Create a query or set of queries in Power BI Desktop.
  2. Copy the query script(s) from the Power BI Desktop query editor, and paste each one into a “Blank query” in Power Query Online.
  3. Rename the new queries to match your desired entity names, being careful to match the names of the source queries if there are any references between them.
  4. If necessary, disable the “load” option for queries that only implement shared logic and should not be persisted in the dataflow’s CDM folder.

That’s it.
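
As a minimal sketch of why steps 3 and 4 matter, here are two hypothetical query scripts. The names “Staging Customers” and “Customer” and the inline sample rows are made up for illustration and don’t come from any real PBIX file. Each script would be pasted into its own blank query. The second query refers to the first by name, so the first blank query has to be renamed to exactly “Staging Customers” or the reference won’t resolve.

```powerquery
// Staging Customers   (hypothetical name; shared logic, so "Enable load" would be cleared)
let
    // Inline sample rows stand in for a real data source in this sketch
    Source = #table(
        type table [CustomerID = Int64.Type, Name = text],
        {{1, " Alice "}, {2, "Bob"}}
    )
in
    Source

// Customer   (hypothetical name; this one would be loaded as a dataflow entity)
let
    // This reference resolves only if the query above is named exactly "Staging Customers"
    Source = #"Staging Customers",
    // Trim stray whitespace from the Name column
    Trimmed = Table.TransformColumns(Source, {{"Name", Text.Trim, type text}})
in
    Trimmed
```

Because the staging query only exists to feed the Customer query, it’s the kind of query you would exclude from loading in step 4.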

Some of you may be asking “but why would I want to do this, when there’s already an awesome query authoring experience in the browser?”

Good question! There are three reasons why I will often use this technique:

  1. I prefer rich, non-browser-based editing tools[2], and Power BI Desktop has a polished and refined UX.
  2. The online editor’s UI doesn’t expose all of the transformations that Power BI Desktop does.
  3. The online editor doesn’t have all supported connectors exposed in the UI.

Each of these points relates to the maturity of Power BI Desktop as a tool[3], as opposed to the relatively new Power Query Online. Power Query Online is part of the dataflows preview and is continuing to improve and expand in functionality, but Power BI Desktop has been generally available for years.

And although I didn’t realize it until I started writing this post, Power BI Desktop actually has features that make this scenario easier than expected. Let’s look at this example. Here are the queries I’m starting with:

2018-12-05_13-06-44

In the PBIX file I have three sets of queries, organized by folder:

  1. A query that references the one data source that is used in the other queries, so I can change the connection information in one place and have everything else update.
  2. Three queries that each contain minimally-transformed data from the data source, and which are not loaded into the data model.
  3. Two queries that are loaded into the data model and which are used directly for reporting.

This is a common pattern for my PBIX files. I don’t know if it counts as a best practice (especially now that Power BI has better support for parameterization than it did when I started doing things this way) but it works for me, and nicely illustrates the approach I’m following.

To move this query logic out of my PBIX file and into a dataflow definition, I first need to copy the query scripts. Power BI Desktop makes this incredibly easy. When I right-click on any folder and choose “copy”, Power BI Desktop places the scripts for all queries in the PBIX file – including the query names as comments – on the clipboard.
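
For illustration, the copied text for a three-layer set of queries might look roughly like the sketch below. The server name, database name, query names, and column choices are all placeholders made up for this example, and the exact clipboard layout may differ between Power BI Desktop versions. The point is that each script is preceded by a comment line carrying the query name.

```powerquery
// Source Database   (placeholder name; the single connection query)
let
    // "localhost" and "AdventureWorks" are placeholder server and database names
    Source = Sql.Database("localhost", "AdventureWorks")
in
    Source

// Staging Orders   (placeholder name; minimally transformed, not loaded)
let
    Source = #"Source Database",
    // Navigate to the Sales.SalesOrderHeader table
    SalesOrderHeader = Source{[Schema = "Sales", Item = "SalesOrderHeader"]}[Data]
in
    SalesOrderHeader

// Orders   (placeholder name; loaded and used for reporting)
let
    Source = #"Staging Orders",
    // Keep only the columns needed for reporting
    Selected = Table.SelectColumns(Source, {"SalesOrderID", "OrderDate", "CustomerID", "TotalDue"})
in
    Selected
```

With the text laid out this way, changing the connection details means editing only the first script, and the comment lines give you the names to use when renaming each blank query.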

2018-12-05_13-25-15

Now that I have all of the query scripts in a text editor, I can get started by creating a new dataflow, and selecting “Blank query” for my first entity.

2018-12-05_13-34-42

After I paste in the query, I will un-select the “Enable load” option, and will paste in the query name as well.

2018-12-05_13-44-11

Once this is done, I can repeat the process by selecting the “Get data” option in the query editor, and choosing “Blank query” for each additional query.

2018-12-05_14-00-28

After I repeat this process for each remaining query, my dataflow will look something like this.

2018-12-05_14-12-05

And if I want to, I can even add groups to organize the queries.

2018-12-05_14-14-26

This looks a lot like where we started, which is both a good thing and a bad thing. The good side is that it demonstrates how we can use a more mature authoring experience for our interactive query development. The bad side is that it introduces additional steps into our workflow.

I expect the integration between dataflows and Desktop to only get better over time[4], but for today, there’s still an easy path if you want to use Power BI Desktop to author your dataflow entities.

As a closing word of caution, please be aware that not all data sources and functions that work in Power BI Desktop will work in dataflows. If you’re using a data source for the first time, or are using a new function[5], you’ll probably want to test things early to avoid a potentially unpleasant surprise later on.

 


[1] I’ve mentioned this technique in a few other posts, but I’ve heard a bunch of questions in recent days that make me believe that the word isn’t getting out. Hopefully this post will help.

[2] I’m writing this post from the WordPress Windows app – even though the app offers nothing that the web editor does not, and actually appears to be the thinnest of wrappers around the web editor.

[3] And they all relate to the fact that Power BI Desktop is just so awesome, nothing compares to it, and although the Power Query Online editor is useful and good, it hasn’t had a team making it better every month, year after year.

[4] Please remember that this is my personal blog, and that even though I’m a member of the Power BI team, I’m not working on either dataflows or Desktop, so what I expect and what will actually happen aren’t always well aligned.

[5] Like Html.Table, which is not yet supported in dataflows. Darn it.