It is increasingly common to include external data in business intelligence and data science. If you’re working to understand retail sales patterns, you may want to include weather data, traffic data, event data, crime data, and other externally sourced data sets to supplement your internal corporate data.
The World Bank publishes a number of freely available data sets, including this one, which provides per-country, per-year metrics related to global climate change. The data set is available in three different formats: CSV, XML, and XLS. Each version is accessible via HTTP.
Using Power Query in Power BI Desktop makes it easy to include this data in a PBIX file and dataset, but doesn’t provide easy opportunities for reuse. And because significant transformation is required before the World Bank data can be used, each dataset that incorporates this data needs to duplicate this transformation logic, which is far from ideal.
Power BI dataflows present an obvious and elegant solution to this problem. Using dataflows, a user can use Power Query to define entities that are stored in and managed by the Power BI service. The queries and the transformation logic they contain can be defined once, and the data produced by the queries can be used by authorized users in as many workspaces and datasets as necessary.
But the nature of the World Bank API introduces a few challenges.
- The Excel data is in the binary XLS format, not XLSX, which means that it requires a gateway to refresh. This isn’t a show-stopper, but I always prefer to avoid using an on-premises gateway to refresh online data. It just feels wrong.
- The XML and CSV data isn’t actually exposed as XML and CSV files. It’s exposed as Zip files that contain XML and CSV files, and Power Query doesn’t support Zip files as a data source.
Or does it?
BI specialist Ken Russel, with a little help from MVP Chris Webb, figured out how to make this work. Ken published a Power Query “M” script that will decompress a Zip file, essentially exposing its content to the rest of the script as if it were a folder containing files.
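I won’t reproduce Ken’s script here (see his post for the real thing), but the consumption pattern looks roughly like the sketch below. `UnzipContents` stands in for the decompression function from his post, and the URL and file name are placeholders I made up, not the actual API endpoint:

```
let
    // Download the Zip archive over HTTP (placeholder URL, not the real endpoint)
    ZipBinary = Web.Contents("https://example.org/climate-data.zip"),

    // UnzipContents is assumed to be the decompression function from Ken's
    // post: it takes the Zip binary and returns a table of [FileName, Content]
    // rows, much like Folder.Files does for a local folder
    Files = UnzipContents(ZipBinary),

    // Select the CSV we care about from the archive (hypothetical file name)
    DataFile = Files{[FileName = "climate_change.csv"]}[Content],

    // Parse the binary content as a UTF-8 CSV document
    Parsed = Csv.Document(DataFile, [Delimiter = ",", Encoding = 65001])
in
    Parsed
```

The nice part of this pattern is that everything downstream of `UnzipContents` is ordinary Power Query: once the archive looks like a folder of files, the rest of the script doesn’t care that a Zip file was ever involved.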
Life is good.
With this part of the problem solved, the rest is pretty straightforward, especially since I already had a Power BI Desktop file that uses the World Bank Excel API and solves most of the rest of the problem.
With a little text editing aided by the Power Query extension for Visual Studio Code, I ended up with two queries: one for per-country metadata, and one for the climate change data itself. I can now get into the fun and easy part: dataflows.
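The transformation details depend on the file layout, which can change over time, but the climate data query’s shaping logic follows a familiar pattern: skip the metadata rows that World Bank CSVs typically carry above the header, promote the headers, and unpivot the one-column-per-year layout into rows. Here’s a hedged sketch as a standalone function; the skip count and column names are assumptions, not taken from the actual file:

```
// A shaping function for the parsed climate data CSV: takes the raw
// table and returns tidy Country/Indicator/Year/Value rows
(rawCsv as table) as table =>
let
    // World Bank CSVs typically have a few metadata rows above the
    // real column headers; the count here is an assumption
    Skipped = Table.Skip(rawCsv, 4),
    Promoted = Table.PromoteHeaders(Skipped, [PromoteAllScalars = true]),

    // Unpivot the per-year columns into Year/Value rows, which is far
    // easier to filter and aggregate in downstream datasets
    Unpivoted = Table.UnpivotOtherColumns(
        Promoted,
        {"Country Name", "Country Code", "Indicator Name", "Indicator Code"},
        "Year",
        "Value"
    ),

    // Fix up the types on the new columns
    Typed = Table.TransformColumnTypes(
        Unpivoted,
        {{"Year", Int64.Type}, {"Value", type number}}
    )
in
    Typed
```

Keeping the shaping logic in one function like this also makes it easy to apply the same cleanup to both the data query and the metadata query where they share structure.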
I’ll start by opening my workspace, creating a new dataflow, and adding an entity.
Since I already have the query written, I’ll choose “Blank Query” from the available data sources, and I’ll just paste in the query text.
At this point, all I need to do is give the query a name that will be meaningful and understandable to the users who will have access to the dataflow. As you can see in the screenshot below, even though the query was written largely by hand, all of the steps are maintained and can be interacted with in the Power Query editor.
Once I finish with the first entity, I can follow the same basic steps with the second entity, and can save the dataflow.
Since the dataflow can’t be used until its entities contain data, I’ll refresh it immediately. The dataflow and its last refresh time will now be shown in the dataflows list in my workspace.
From this point on, no user in my organization who wants to include World Bank climate change data in her reports will need to connect to the World Bank API and transform it until it is fit for purpose. All users will need to do is connect to Power BI dataflows from Power BI Desktop, where the data is always available, always current, and always consistent, managed by the Power BI service. Life is good.
Obviously, this pattern doesn’t apply only to World Bank data. It applies to any external data source where multiple users need to use the data in different solutions. It can also apply to internal/organizational data sources that need to be transformed to prepare them for analytics.
See Ken’s awesome blog post here for the full details.
I could also set up a refresh schedule to ensure that the data in the entities is always current, but I’m saving this for a future blog post.