Power BI and ADLSg2 – but not dataflows

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

Back in July[1] the Power BI team announced the availability of a new connector for Azure Data Lake Storage Gen2.

It's a data lake. Get it?
When Matthew closes his eyes and pictures a data lake, this is what he sees.

In recent weeks I’ve been starting to hear questions that sound like these:

Question: Is this ADLSg2 connector how you get to the data behind dataflows?

Answer: No. Dataflows are how you get to the data behind dataflows.

Question: Is this how I can access dataflows if I don’t use Power BI Premium?

Answer: No. Dataflows are not a Premium-only feature.

Question: Can I use the ADLSg2 connector to work with CDM folder data?

Answer: Yes, but why would you?

If your data is already in CDM folders, using the ADLSg2 connector simply adds effort to consuming it in Power BI. You’ll be working with raw, untyped text files instead of working with strongly typed entities.

If your ADLSg2 data is already in CDM folders, strongly consider attaching the CDM folder as a dataflow. This means less up-front work for you, and less ongoing work for the users who need to get insights from the data.

Question: Why do we need an ADLSg2 connector if we have dataflows?

Answer: Now that is a good question!

Power BI dataflows store their data in CDM folder format, and they can be configured to store those CDM folders in your organization’s ADLSg2 data lake. In addition to this, you can attach a CDM folder in ADLSg2 as an external dataflow, making its data available to Power BI users even though the data ingress is taking place through another tool like Azure Data Factory.

But ADLSg2 is much, much more[2] than a repository for dataflows or CDM folders. ADLSg2 supports all sorts of file and blob data, not just CDM folders. And sometimes you need to work with that data in Power BI.

The ADLSg2 connector exists for these scenarios, when your data is not stored in CDM folders. With this connector, users in Power BI Desktop can connect to ADLSg2 resources and work with the files they contain, similar to the existing HDFS and Folder connectors.
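The "raw files" experience this connector provides can be sketched in plain Python. This is an illustrative sketch, not the connector's actual implementation – the file names and contents below are invented – but it shows the kind of manual parse-and-combine work you take on when your lake data is loose CSV files rather than typed CDM entities:

```python
import csv
import io

# The ADLSg2 connector surfaces a lake's contents much like the Folder
# connector: a list of files that you then parse and combine yourself.
# File names and contents here are made up for illustration.
raw_files = {
    "sales-2019-01.csv": "region,amount\nWest,100\nEast,200\n",
    "sales-2019-02.csv": "region,amount\nWest,150\nEast,250\n",
}

# Parse each file and append its rows into one combined table -
# the manual effort you avoid when the data is exposed as entities.
combined = []
for name, content in raw_files.items():
    for row in csv.DictReader(io.StringIO(content)):
        combined.append(row)

print(len(combined))  # prints 4
```

Every value comes back as untyped text ("100", not 100), which is exactly the point made above: the connector gives you the files, and the typing and shaping is up to you.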


[1] Yes, this is another catch-up post that has been waiting to be finished. No, I do not have any reason to believe that 2020 will be any more forgiving than 2019 has been.

[2] I could have linked to the product documentation or the official product page, but I believe that Melissa's blog does the best job summing up ADLSg2 in a single post.

Power BIte: Creating dataflows with Power Query Online


This week’s Power BIte is the first in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.

When creating a dataflow by defining new entities in Power Query Online, the final dataflow will have the following characteristics:

| Attribute | Value |
| --- | --- |
| Data ingress path | Ingress via the mashup engine hosted in the Power BI service |
| Data location | Data stored in the CDM folder defined for the newly created dataflow |
| Data refresh | The dataflow is refreshed based on the schedule and policies defined in the workspace |

Let’s look at the dataflow’s model.json metadata to see some of the details.

[Screenshot: the top of the dataflow's model.json file]

At the top of the file we can see the mashup definition, including the query names and load settings on lines 11 through 19 and the Power Query code for all of the entities on line 22.

[Screenshot: the bottom of the dataflow's model.json file]

At the bottom of the file we can see information about the refresh and storage.[2] Line 26 identifies the entity as a LocalEntity, which means that the entity’s data is physically stored in the current CDM folder.

Line 30 shows that the entity is fully refreshed rather than incrementally refreshed, and line 31 shows the file name where the entity data is stored. Lines 97 through 99 identify the single partition where the data for the current version of the entity is stored, including the full URI for the data file. If this entity used incremental refresh, there would be multiple partitions to match the incremental refresh policy.
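The metadata fields discussed above are straightforward to inspect programmatically. Here's a sketch that parses a minimal, hypothetical fragment of a model.json file – the field names follow the CDM folder metadata format described in this post, but the entity and URI values are invented, and this is far from a complete schema:

```python
import json

# A minimal, illustrative fragment of a dataflow's model.json.
# "$type": "LocalEntity" means the entity's data is physically stored
# in this CDM folder; each partition records where its data file lives.
model_json = json.loads("""
{
  "name": "Sample Dataflow",
  "entities": [
    {
      "$type": "LocalEntity",
      "name": "Customers",
      "partitions": [
        {
          "name": "Part001",
          "location": "https://example.dfs.core.windows.net/powerbi/Sample/Customers.csv"
        }
      ]
    }
  ]
}
""")

# Summarize each entity: its name, whether it is local, and the full
# URIs of the partitions holding the current version of its data.
summary = []
for entity in model_json["entities"]:
    summary.append({
        "name": entity["name"],
        "is_local": entity["$type"] == "LocalEntity",
        "partitions": [p["location"] for p in entity.get("partitions", [])],
    })

print(summary)
```

An entity using incremental refresh would simply show more than one entry in its partitions list, matching the refresh policy.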

If this information doesn’t all make sense just yet, please hold on. We’ll have different values for the same attributes for other dataflow creation methods, and then we can compare and contrast them.

I guarantee[3] it will make as much sense as anything on this blog.


[1] New videos every Monday morning!

[2] The same information is also included starting on line 103 for the Promotions entity, but is not pictured here.

[3] Or your money back.

It all comes down to culture

I talk about data culture a lot, and in my presentations I often emphasize how the most important success factor when adopting a tool like Power BI[1] is the culture of the organization, not the tool itself.

I talk about this a lot, but I think Caitie McCaffrey may have just had the final word.[2]

[Screenshot: Caitie McCaffrey's tweet]

I don’t think that Caitie was talking about the enterprise adoption of self-service business intelligence, but she could have been.

In my day job I get to talk to leaders from large companies around the world, and to see how they’re adopting and using Power BI and Azure. Before today, when I thought about these companies I didn’t think of Moby Dick – I thought of Leo Tolstoy’s classic Anna Karenina, which opens with this famous line:

All happy families are alike; each unhappy family is unhappy in its own way.

Although the details vary, large companies that have successfully adopted managed self-service BI at scale have cultures with important aspects in common:

  • Leaders empower business users to work with data
  • Leaders trust business users to use data to make better decisions
  • IT supports business users with platforms and tools and with curated data sources
  • Business users work with the tools from IT and the guidance from leaders, and work within the guardrails and guidelines given to them for this use
  • Business and IT collaborate to deliver responsive solutions and mature/stable solutions, with clearly defined responsibilities between them

Companies that are successful with managed self-service BI do these things. Companies that are not successful do not. The details vary, but the pattern holds up again and again.

How do these roles and responsibilities relate to culture?

In many ways a culture is defined by the behaviors it rewards, the behaviors it allows, and the behaviors it punishes. A culture isn’t what you say – it’s what you do.

In the context of BI, having a culture with shared goals that enable business and IT to work together with the support from the company leaders is the key. If you have this culture, you can be successful with any tool. Some tools may be more helpful than others, and the culture will enable the selection of better tools over time, but the tool is not the most important factor. The culture – not the tool – inevitably determines success.

This is not to say that BI tools should not improve to be a bigger part of the solution. But to paraphrase Caitie… maybe you should let that white whale swim past.

 


[1] But definitely not only Power BI.

[2] He says unironically, before writing many more words.

Using and reusing Power BI dataflows


I use this diagram a lot[1]:

[Diagram: source data being ingested and processed by dataflows in an end-to-end BI application]

This diagram neatly summarizes a canonical use case for Power BI dataflows, with source data being ingested and processed as part of an end-to-end BI application. It showcases the Lego-like composition that’s possible with dataflows. But it also has drawbacks – its simplicity omits common scenarios for using and reusing dataflows.

So, let’s look at what’s shown – and at what’s not shown – in my favorite diagram. Let’s look at some of the ways these dataflows and their entities can be used.

  1. Use the final entities as-is: This is the scenario implied by the diagram. The entities in the “Final Business View” dataflow represent a star schema, and are loaded as-is into a dataset.
  2. Use the final entities with modification: The entities in the “Final Business View” dataflow are loaded into a dataset, but with additional transformation or filtering applied in the dataset’s queries.
  3. Use the final entities with mashup: The entities in the “Final Business View” dataflow are loaded into a dataset, but with additional data from other sources added via the dataset’s queries.
  4. Use upstream entities: The entities in other dataflows are loaded into a dataset, likely with transformations and filtering applied, and with data from other sources added via the dataset’s queries.

Please understand that this list is not exhaustive. There are likely dozens of variations on these themes that I have not called out explicitly. Use this list as a starting point and see where dataflows will take you. I’ll keep the diagram simple, but you can build solutions as complex as you need them to be.


[1] This is my diagram. There are many like it, but this one is mine.

 

Power BIte: Sharing and reuse with dataflows and datasets


Last week I kicked off the new BI Polar YouTube channel with a video and blog post comparing and contrasting Power BI dataflows and datasets. In the days that followed, I continued to hear questions that led me to believe I hadn’t done a great job answering one vital question:

When would I use dataflows, and when would I use shared datasets?

Here’s the short answer:

[Video: Sharing and reuse with dataflows and datasets]

And here’s the long answer: Lego Bricks and the Spectrum of Data Enrichment and Reuse.

The video focuses on – and demonstrates – sharing and reuse with both dataflows and datasets. It’s short and to the point[1] and focuses on this one question.

The blog post takes a more conceptual view, using Lego bricks as a metaphor for dataflows and datasets and the types of reuse they enable.

If you’ve watched the videos and read the posts and you still have questions, please let me know.


[1] As short and to the point as anything you’re likely to get from me, anyway. Brevity has never been my forte.

Fiore’s Virtues of Business Intelligence

In the late 1300s and early 1400s, Fiore de’i Liberi was a knight, a diplomat, and a fencing master. He also wrote one of the most comprehensive treatises on medieval combat, his Flower of Battle, of which four copies survive in museums and private collections today. Fiore started – or was a significant evolutionary step in – one of the most important and long-lasting traditions in armed and unarmed combat.

In addition to detailed instruction on fighting with dagger, longsword, spear, and other weapons, Fiore’s manuscript included a preface with information about the virtues that any fencer[1] would need to be successful in combat.

[Image: the seven blows of the sword and the four virtues, from MS Ludwig XV 13, folio 32r]

In the image above, Fiore pictures the seven blows of the sword, and his four virtues, each represented by a different animal[2][3]:

This Master with these swords signifies the seven blows of the sword. And the four animals signify four virtues, that is prudence, celerity, fortitude, and audacity. And whoever wants to be good in this art should have part in these virtues.

Fiore then goes on to describe each virtue in turn:

Prudence
No creature sees better than me, the Lynx.
And I always set things in order with compass and measure.

Celerity
I, the tiger, am so swift to run and to wheel
That even the bolt from the sky cannot overtake me.

Audacity
None carries a more ardent heart than me, the lion,
But to everyone I make an invitation to battle.

Fortitude
I am the elephant and I carry a castle as cargo,
And I do not kneel nor lose my footing.[4]

Step back and read this again: “And whoever wants to be good in this art should have part in these virtues.”

That’s right – Fiore was documenting best practices, 600+ years ago. And although I suspect that Fiore wasn’t thinking about business intelligence projects at the time, I do believe that these virtues are just as relevant to the slicing and dicing[5] we’re still doing today. Let me explain.

Prudence – “…I always set things in order with compass and measure”: A successful BI practitioner knows what needs to be done before a project can begin, and when additional work is required before they can get started. Initiating a project requires careful setup and planning, and moving before the prerequisites for success are in place can be disastrous.[6]

Celerity – “I… am so swift to run and to wheel that even the bolt from the sky cannot overtake me”: Business requirements change day to day and hour to hour. To succeed, a BI practitioner must be prepared to move quickly and decisively, engaging without delay when an opportunity presents itself – and also be prepared to change direction as the needs of the project change.

Audacity – “…to everyone I make an invitation to battle”: Any project declined presents an opening for another practitioner, another team, another tool, and this is likely to reduce opportunities over time. Saying yes to difficult projects – and succeeding in their execution – is necessary to ensure that future projects don’t pass you by.

Fortitude – “And I do not kneel nor lose my footing”: When Fiore speaks of fortitude, he does not speak of the strength that comes from big muscles. He speaks of the strength that comes from structure and balance. His “elephant with a castle on its back” is a perfect metaphor for a BI solution delivered quickly and confidently because of the solid and stable platform on which it is built. Success doesn’t come from the extra effort put in when delivering a solution – it comes from the care and planning that went into the overall data estate.

You may look at these virtues and see contradiction – how can you have prudence and audacity and celerity? The answer for BI is the same answer that it is for the sword: practice, training, and preparation. In both situations, whether you’re battling with an armed foe or battling with a difficult client, you need to apply the right virtues at the right times, and to understand both the big picture and the day to day steps that produce larger successes. In both situations you’re also facing complex and dynamic challenges where you need to quickly take advantage of opportunities as they arise, and create opportunities when they don’t appear on their own[7]. Fortunately, as BI practitioners we can rely on the strengths of our teams – it’s not always a solo battle.

You may also look at these virtues and see Matthew stretching to make the most tenuous of analogies work, just because he loves swords as much as he loves BI. While this may be true, I do honestly believe that these virtues do apply here. Over the past 20-25 years I have seen many projects succeed because these virtues were embodied by the people and teams involved, and I’ve seen many projects fail where these virtues were absent. This isn’t the only way to look at success factors… but at the moment it’s my favorite.

In closing, I’d like to mention that this post marks one year since I started this blog. In the past year I’ve published almost 90 posts, and have had roughly 50,000 visitors and 100,000 page views. Here’s hoping that by applying Fiore’s virtues I’ll be able to make the next year even more productive and more successful than the year that has passed.

Thanks to all of you who read what I write, and who provide feedback here and on Twitter – I couldn’t do it without you.


[1] Fencer in this context meaning someone who fights with swords or other edged weapons, not the Olympic-style sport of fencing that a modern reader might picture when reading the word.

[2] As translated by Michael Chidester and Colin Hatcher.

[3] Although it may not be obvious to the modern reader, the animal at the bottom is an elephant with a tower or castle on its back. I suspect that Fiore never actually saw an elephant.

[4] In case these terms don’t immediately have meaning, prudence == wisdom, celerity == speed, audacity == daring, and fortitude == strength.

[5] See what I did there?

[6] I assume that Fiore’s use of the term “measure” here is pure coincidence.

[7] If you’ve worked on a high-stakes, high-visibility BI project where requirements changed during implementation, or where not all stakeholders were fully committed to the project goals, this will probably feel very familiar.

Power BI dataflows best practices


I do a lot that’s not related to dataflows. In fact, dataflows take up a surprisingly small part of my day – which might surprise you if your insight into my calendar came solely from this blog.

Despite this, I like to believe that I’m keeping my finger on the pulse of this feature, and when I learned today that the dataflows team had published best practice guidance almost a month ago, I was shocked and surprised.

Here are those best practices: https://docs.microsoft.com/en-us/power-bi/service-dataflows-best-practices

[Image by rawpixel from Pixabay: this is apparently where Matthew thinks documentation comes from]

In my defense, the blog post where this guidance was announced was the overall September update for dataflows, and it was the last link at the bottom of the post… but I still should have noticed.

These practices were produced by the dataflows team, and are based on questions and support tickets from customers around the world. Definitely check them out, and see how you can incorporate them into your Power BI solutions!

Power BIte: Dataflows vs. datasets


This is still one of the most common dataflows questions: what’s the difference between Power BI dataflows and Power BI datasets?

For the last year I have resisted tackling this question head-on. This isn’t because it’s a bad or “dumb” question. Just the opposite – this is a very simple question, and the simpler a question is, the more complex and nuanced the answer is likely to be.

See how complex this is?
A graphical representation of the answer’s likely complexity.

If you’re a regular reader of this blog, you probably already know the answer, because I’ve answered it already. Sort of. The existing answer is distributed across dozens of posts, and if you’ve read all of them you’ve probably picked up the answer along the way. But I keep hearing this question, and I keep thinking that there must be a more direct answer I could share.

Here it is, in a single, simple table[1].

| | Power BI dataflows | Power BI datasets |
| --- | --- | --- |
| Implementation | CDM folder | Analysis Services tabular model |
| Storage[2] | CSV files | Vertipaq |
| Metadata[3] | Common Data Model – model.json | BISM |
| Development | Power Query Online | Power Query in Power BI Desktop |
| Primary purpose[4] | Data reuse | Data analysis |
| Reuse[5] | Acts as data source in multiple datasets | Shared datasets across workspaces |
| Scope of reuse[6] | Entity level reuse | Dataset level reuse |
| Mashup with other data sources[7] | Yes | No |
| Used for reporting[8] | Not directly | Yes |
| Reuse outside Power BI[9] | Yes, through ADLSg2 | Yes, through XMLA |
| Data access methods[10] | Import | Import, DirectQuery |
| Connection methods[11] | Import | Live Connection |
| Row-level security | No | Yes |
| Certification and promotion | Not yet | Yes |

What else am I missing? Please let me know! Seriously, you should let me know.

Update: I’ve added a few rows to the table after the post was originally published, to incorporate feedback from readers on differences I had missed. Thank you!

Each of the rows in this table could easily be an in-depth topic in and of itself, so if you’re looking at any of them and thinking “that’s not quite right” I might very well agree with you. There’s a lot of context and a lot of nuance here, and we’re trying to sum things up in a word or two… which is kind of the whole point.

Oh yeah, there’s a video too.[12]

I can’t wait to hear what you think!


[1] A simple table with ten footnotes.

[2] The storage aspect of dataflows and datasets is one of the most significant differences between the two. Datasets use the Vertipaq column store to load data into an optimized and highly compressed in-memory representation that is optimized for analysis. Dataflows use text files in folders, which are optimized for interoperability.

[3] The Analysis Services Tabular engine uses the BI Semantic Model (BISM) to represent its metadata. This is a metadata model originally included in SQL Server 2012 Analysis Services, and used by the Tabular engine ever since.

[4] Saying “this is the primary purpose” of any complex tool is fraught with risk, because no matter what you say, there are other valid things that remain unsaid. With this said… the big gap that dataflows close is that of self-service data preparation for the purpose of data sharing and reuse. Power BI has always had self-service data preparation through Power Query, but before dataflows the data that was prepared was “locked” in a dataset, for analysis, and not for sharing or reuse.

[5] Once you have loaded data into dataflows, authorized users can reuse entities from multiple dataflows, and use them as the building blocks for new dataflows or new datasets. Once you have loaded data into a dataset (and published it to the Power BI service) you can enable users to connect to it.

[6] With dataflows, users can pick and choose the entities they want, but a dataset can only be reused as-is.

[7] Dataflow entities can be used as data sources in the same Power BI Desktop file as other data sources, and can serve as part of a mashup or composite model, but a dataset can only be reused as-is.

[8] Although you can obviously use dataflows for reporting, you do so by first importing the data from the dataflow into a dataset.

[9] It’s interesting to point out that using your own organizational ADLSg2 account does not require Power BI Premium, but using the XMLA endpoint to connect to Power BI datasets from non-Power BI clients does.

[10] You can only import data into your dataflow entities, but tables in your dataset can import data or use DirectQuery, and a dataset can use a combination of the two.

[11] You can only import data from a dataflow into a dataset. When connecting to a shared dataset you can only use Live Connections.

[12] I’ve been thinking of making videos to supplement this blog for almost as long as I’ve been hearing the question that inspired this post. Please take a moment to share your thoughts on the video. This is something of a “soft launch” and although I have plans for a few dozen more videos already, your feedback will be a main factor in how the video series evolves.

Power BI dataflows and query folding


In a recent post I mentioned an approach for working around the import-only nature of Power BI dataflows as a data source in Power BI Desktop, and in an older post I shared information about the enhanced compute engine that’s currently available in preview.


Some recent conversations have led me to believe that I should summarize a few points about dataflows and query folding, because these existing posts don’t make them easy to find and understand.

  1. When accessing dataflow entities from Power BI Desktop, no query folding takes place, even if the enhanced compute engine is enabled.
  2. When accessing dataflow entities from other entities in the Power BI service, no query folding takes place unless the enhanced compute engine is enabled.
  3. When accessing dataflow entities from other entities in the Power BI service, query folding will take place when the enhanced compute engine is enabled, because the linked entity’s query will be executed against the cached data in SQL, rather than the underlying CDM folder.

These three statements summarize how query folding works – or does not work – in Power BI dataflows today.
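The practical difference the compute engine makes can be sketched with an in-memory SQL table standing in for its cache. This is a conceptual illustration only – the table and values are invented, and the real engine is not SQLite – but it shows why pushing a filter into the source (folding) beats filtering after transfer:

```python
import sqlite3

# An in-memory table stands in for the enhanced compute engine's
# SQL cache of a dataflow entity. Names and values are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 50), (2, 150), (3, 250)])

# Without folding (e.g. Power BI Desktop reading a CDM folder):
# every row is transferred, and the filter runs in the client.
all_rows = conn.execute("SELECT id, amount FROM orders").fetchall()
filtered_locally = [row for row in all_rows if row[1] > 100]

# With folding (e.g. a linked entity over the enhanced compute engine):
# the filter is pushed into the source query, so only matching rows move.
filtered_at_source = conn.execute(
    "SELECT id, amount FROM orders WHERE amount > 100"
).fetchall()

print(len(all_rows), filtered_locally == filtered_at_source)  # prints: 3 True
```

The results are identical either way – folding changes where the work happens and how much data moves, not what you get back.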

The Power BI team has discussed some of their plans for the enhanced compute engine, so this should change in the future[1] – but as of today, the only dataflows scenario where query folding takes place is when a dataflow that is backed by the enhanced compute engine is referenced by a linked entity.

I hope this helps clarify things, at least a little…


[1] I think this will be around the time the engine goes GA, but I don’t remember for sure, and I’m too lazy to re-watch the MBAS session to double check. If you watch it and let me know, I’ll gladly update this post with the details.