Dataflows and API data sources

These days more and more data lives behind an API, rather than in a more traditional data source like databases or files. These APIs are often designed and optimized for small, chatty[1] interactions that don’t lend themselves well to use as a source for business intelligence.

beehive-337695_1280
I’d explain the choice of image for APIs, but it doesn’t take a genus to figure it out

These APIs are often slower[3] than a database, which can increase load/refresh times. Sometimes the load time is so great that a refresh may not fit within the minimum window based on an application’s functional requirements.

These APIs may also be throttled. SaaS application vendors often have a billing model that doesn’t directly support frequent bulk operations, so to avoid customer behaviors that affect their COGS and their bottom line, their APIs may be limited to a certain number of calls[4] for a given period.

The bottom line is that when you’re using APIs as a data source in Power BI, you need to take the APIs’ limitations into consideration, and often dataflows can help deliver a solution that accommodates those limitations while delivering the functionality your application needs.

I’ve covered some of the fundamental concepts of this approach in past blog posts , specifically this one from 2018 on Power BI Dataflows and Slow Data Sources, and this one from 2019 on Creating “data workspaces” for dataflows and shared datasets.  I hadn’t planned on having a dedicated new blog post, but after having this pattern come up in 4 or 5 different contexts in the past week or so, I thought maybe a post was warranted.

In late July I met with a customer to discuss their Power BI dataflows architecture. They showed me a “before” picture that looked something like this:

2020-08-01-13-40-14-887--POWERPNT

One of their core data sources was the API for a commercial software application. Almost every Power BI application used some data from this API, because the application supports almost every part of their business. This introduced a bunch of familiar challenges:

  • Training requirements and a steeper-then-necessary learning curve for SSBI authors due to the complex and unintuitive API design
  • Long refresh times for large data extracts
  • Complex, redundant queries in many applications
  • Technical debt and maintenance due to the duplicate logic

Then they showed me an “after” picture that looked something like this:

2020-08-01-13-40-32-810--POWERPNT

They had implemented a set of dataflows in a dedicated workspace. These dataflows have entities that pull data from the source APIs, and make them available for consumption by IT and business Power BI authors. All data transformation logic is implemented exactly once, and each author can easily connect to trusted tabular data without needing to worry about technical details like connection strings, API parameters, authentication, or paging. The dataflows are organized by the functional areas represented in the data, mapping roughly to the APIs in the source system as viewed through the lens of their audience.

The diagram above simplifies things a bit. For their actual implementation they used linked and computed entities to stage, prepare, and present the dataflows, but for the general pattern the diagram above is probably the right level of abstraction. Each IT-developed or business-developed application uses the subset of the dataflows that it needs, and the owners of the dataflows keep them up to date and current.

Life is good[5].


[1]There’s a lot of information available online covering chatty and chunky patterns in API design, so if you’re not sure what I’m talking about here, you might want to poke around and take a peek[2] at what’s out there.

[2] Please let me know if you found the joke in footnote[1].

[3] Possibly by multiple orders of magnitude.

[4] Or a given data volume, etc. There are probably as many variations in licensing as there are SaaS vendors with APIs.

[5] If you’re wondering about the beehive image and the obtuse joke it represents, the genus for honey bees is “apis.” I get all of my stock photography from the wonderful web site Pixabay. I will typically search on a random term related to my blog post to find a thumbnail image, and when the context is completely different like it was today I will pick one that I like. For this post I searched on “API” and got a page full of bees, which sounds like something Bob Ross would do…

5 thoughts on “Dataflows and API data sources

  1. Pingback: API Consumption with Power BI Dataflows – Curated SQL

  2. Jordan

    Not envious of an enterprise that’s using APIs as their primary data source – if it were me I’d slap in a middle tier to call the APIs on a set basis and dump the returns to something more enterprise grade like a data warehouse that would enable DirectQuery, query folding, incremental refresh, etc.

    APIs get really tricky with 1) gateways and 2) paging.

    PBI Desktop accepts M that can’t be used by the gateway. Chris Webb covered that extensively here: https://blog.crossjoin.co.uk/2016/08/23/web-contents-m-functions-and-dataset-refresh-errors-in-power-bi/ but its frustrating to get everything working Power BI Desktop then hit a brick wall with the gateway.

    Paging also requires a whole bunch of M that average PBI user is not going to know how to implement and isn’t very efficient since the calls cannot be made in parallel.

    If a middle tier isn’t an option I’d build a custom connector to help translate the API calls to a more business-friendly interface, similar to what we’ve done with the Power BI REST API custom connector: https://github.com/migueesc123/PowerBIRESTAPI

    Like

    1. For this particular customer and their particular API, dataflows are effectively serving as that middle tier. Definitely agreed that this may not be the best option for every use case… but it’s certainly better than having those complex M queries copied and pasted into 10s of different datasets. 😉

      Like

    2. jerry deng

      Dataflow and PBI provide an Agile way to implement business data solutions, without waiting for the longer IT cycle of datawarehoues implementation. Also considering the proliferation of SaaS platforms used today in every business, there will always be critical business data from SaaS (hence API) that an EDW cannot cover or catch up on.

      On the other hand, it’s important to selectively migrate some of these dataflows into EDW when appropriate.

      Like

  3. Pingback: Power BI dataflows PowerShell scripts on GitHub – BI Polar

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s