Quick Tip: Working with dataflow-created CDM folders in ADLSg2

If you’re using your own organizational Azure Data Lake Storage Gen2 account for Power BI dataflows, you can use the CDM folders that Power BI creates as a data source for other efforts, including data science with tools like Azure Machine Learning and Azure Databricks.

Image by Arek Socha from Pixabay
A world of possibilities appears before you…

This capability has been in preview since early this year, so it’s not really new, but there are enough pieces involved that it may not be obvious how to begin – and I continue to see enough questions about this topic that another blog post seemed warranted.

The key point is that because dataflows write their data to ADLSg2 in CDM folder format, Azure Machine Learning and Azure Databricks can both read that data using the metadata in the model.json file.

This JSON file serves as the “endpoint” for the data in the CDM folder: it’s a single resource you can connect to, without having to worry about the complexities of the various subfolders and files that the CDM folder contains.
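To make this concrete, here’s a minimal sketch of what reading an entity via model.json can look like in Python. The entity name and file names below are hypothetical placeholders, not anything the Power BI service guarantees – adjust them to match your own CDM folder.

```python
import json

import pandas as pd

# Minimal sketch: use a CDM folder's model.json to read one entity's data.
# Assumes model.json and one data partition are downloaded locally, and that
# "Customers" is the (hypothetical) entity we care about.
with open("model.json", encoding="utf-8") as f:
    model = json.load(f)

# Find the entity definition and pull column names from its attributes.
entity = next(e for e in model["entities"] if e["name"] == "Customers")
columns = [a["name"] for a in entity["attributes"]]

# Dataflow-created CDM folders store entity data as headerless CSV
# partitions, so the attribute metadata supplies the column names.
df = pd.read_csv("Customers-part-000.csv", header=None, names=columns)
print(df.head())
```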

This tutorial is probably the best place to start if you want to know more[1]. It includes directions and sample code for creating and consuming CDM folders from a variety of different Azure services – and Power BI dataflows. If you’re one of the people who has recently asked about this, please go through this tutorial as your next step!


[1] It’s the best resource I’m aware of – if you find a better one, please let me know!

Power BIte: Creating dataflows by importing model.json

This week’s Power BIte is the third in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.

When creating a dataflow by importing a model.json file previously exported from Power BI, the dataflow will have the following characteristics:

  • Data ingress path: Ingress via the mashup engine hosted in the Power BI service.
  • Data location: Data stored in the CDM folder defined for the newly created dataflow.
  • Data refresh: The dataflow is refreshed based on the schedule and policies defined in the workspace.

Let’s look at the dataflow’s model.json metadata to see some of the details.

[Screenshot: the dataflow’s model.json, open in a code editor]

At the top of the file we can see the dataflow name on line 2…

…and that’s pretty much all that’s important here. The rest of the model.json file will exactly match what was exported from the Power BI portal, and will look a lot like this or this. Boom.
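If you’d rather poke at the file programmatically than squint at a screenshot, a quick peek at the top-level properties looks something like this – a sketch that assumes you’ve downloaded a local copy of the exported model.json:

```python
import json

# Quick peek at the top of an exported model.json (local copy assumed).
with open("model.json", encoding="utf-8") as f:
    model = json.load(f)

print(model["name"])                                   # the dataflow name
print([e["name"] for e in model.get("entities", [])])  # the entity names
```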

For a little more detail (and more pictures, in case you don’t want to watch a four-minute video), check out this post from last month, when this capability was introduced.

If this information doesn’t make sense yet, please hold on. We still have one more incoming Power BIte in this series, and then we’ll have the big picture view.

I guarantee[3] it will make as much sense as anything on this blog.


[1] New videos every Monday morning![2]

[2] Did you notice that I just copied the previous post and made some small edits to it? That seemed very appropriate given the topic…

[3] Or your money back.

Diversity of Perspective

I blogged last October[0] about the challenges I faced when trying to use the new Html.Table function in Power Query. A key part of my challenge was the gap between my perspective and the perspective of the team shipping and documenting this feature.

At first I chalked this up to an old dog[1] trying to learn a new trick[2], but I’ve been thinking about it since then, and I’m not sure that’s the case. I think it may have been the result of differing perspectives – and differing expectations resulting from those perspectives.

Image by S. Hermann & F. Richter from Pixabay
This may or may not be a photo of Matthew

My perspective is that, as a data integration tool, Power Query will work the way my ~20 years as a data professional have trained me to expect data integration tools to work. If there’s a query language, formula language, or expression language required to access a specific source, I expect that language to be identified and documented in the tool.

The Power Query team, on the other hand, may have had a different perspective here. I haven’t explicitly asked them[3], but I suspect that their perspective is that it’s 2018, and anyone working with HTML data already knows what CSS selectors are, and either knows how to use them, or where to look to learn enough to use them.

I don’t know which perspective is more valid. Part of me believes that mine is, and bemoans the time I spent struggling to achieve a simple goal, because the documentation didn’t connect the dots for me. But I also note that no one – not in comments here, not on Twitter – has said that they were similarly challenged.

But I can say this: A difference in perspective meant that what was delivered wasn’t what was needed, at least by one user.

Another example of this type of mismatch is one I see too often[4] at conferences: Microsoft presenters using Microsoft’s specialized vocabulary when speaking with non-Microsoft audiences. This typically takes the form of using internal code names and acronyms, rather than official product names – if you’ve been to more than a handful of Microsoft events you’ve probably seen this yourself. I think the worst example I’ve seen was when a presenter mentioned that an upcoming feature was coming “in the scandium time frame.”[5]

Every culture – whether it’s centered in a geographical region, a profession, a religion, or a large corporation – has a specialized internal nomenclature. It enables members to communicate more efficiently. This isn’t a bad thing – it’s natural and good, and helps teams and groups deliver on their goals and priorities.

But problems can and do arise when one party doesn’t take the other party’s background into consideration. This is where having a diverse team can help.

When a team has a diverse makeup, it makes it more likely that potential problems will be prevented before they need to be identified, and identified before they need to be fixed.

If you want to be more efficient and to produce products and services (and documentation!) that deliver on your customers’ needs the first time, every time, by default, your team’s makeup should reflect the customers who use your product. If you look around and everyone on your team looks the same, this should be a warning sign that customers who don’t look like you probably don’t have the same experience that you do.

And if you don’t see that as a problem, you should probably look elsewhere for your problem. Try looking in the mirror.

Update: Two days after this post was published, David Heinemeier Hansson posted a blistering example of why diversity is so important, using his wife’s experience with Apple’s new credit card to drive the point home. I strongly recommend reading the whole thread.


[0] I started writing this post in November 2018, and it’s been languishing in my drafts folder ever since. I’m making an effort to clean up my drafts by the end of the year, so hopefully this one will actually see the light of day before it’s 2020. Fingers crossed…

[1] Me.

[2] CSS selectors.

[3] I feel like I’m enough of a problem child most days, so I try not to bother them unless it’s really necessary.

[4] Although thankfully not nearly as often as I used to.

[5] If you know what this means, you work on the Azure team[6]. Sadly, the people in the audience did not work on the Azure team. Thankfully, someone in the audience stood up and asked for clarification.

[6] When I worked on the Azure team I still didn’t know. I was constantly asking for clarification in meeting after meeting and email after email. Maybe I am just slow…

Power BI dataflows November updates

If you head on over to the official Power BI blog today you’ll see this announcement, and if you’re like me there will be a few things that immediately jump out at you:

[Screenshot: the header of the Power BI dataflows November updates announcement]

  1. That’s almost two features per day
  2. Dataflows should have a lower-case D
  3. Miguel’s profile picture is even older than Matthew’s profile picture

All snark aside[1], Miguel and the whole dataflows team have been awfully busy, and it’s great to see their work available to Power BI authors. I won’t attempt to repeat what’s in the announcement, but I will highlight the new capabilities that have me most excited:

  • Support for data profiling in Power Query Online – we’ve had this in Power BI Desktop for a while, but it’s just as important for dataflows as it is for datasets.
  • Better support for files and folders – a lot of the data I play with[2] these days is in folders full of text files, and Power Query Online hasn’t had the best experience for working with this type of data.
  • Better support for query parameters – there are lots of scenarios[3] where parameterized queries make working with dataflows easier, and now Power Query Online makes query parameters easier to define and use.

Do yourself a favor and check out the whole list. Odds are there’s something you’ve been waiting for that will excite you as much as these new capabilities excite me.

And I can’t wait to hear what they are…


[1] No, I don’t believe that’s possible either, but it is nice to see that you’ve been paying attention.

[2] Very little of my actual work involves data prep these days, so I need to find data to play with to avoid getting too bored.

[3] Like this one.

Power BIte: Creating dataflows with linked and computed entities

This week’s Power BIte is the second in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.

When creating a dataflow with linked and computed entities, the dataflow will have the following characteristics:

  • Data ingress path: Ingress via the mashup engine hosted in the Power BI service, using source data that is also managed by the Power BI service and taking advantage of data locality.
  • Data location: Data for computed entities is stored in the CDM folder defined for the new dataflow. Data for linked entities remains in the source dataflow’s CDM folder and is not moved or copied.
  • Data refresh: The dataflow is refreshed based on the schedule and policies defined in the workspace.

Let’s look at the dataflow’s model.json metadata to see some of the details.

[Screenshot: the top of the dataflow’s model.json, showing the mashup definition]

At the top of the file we can see the mashup definition, including the query names and load settings on lines 11 through 35 and the Power Query code for all of the entities on line 37. This will look awfully familiar from the last Power BIte post.
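If you want to pull that Power Query code out of the file yourself, something like the following should work. Fair warning: the “pbi:mashup” annotation name and its embedded “document” property reflect my reading of dataflow-created files rather than a documented contract, so verify them against your own model.json:

```python
import json

# Sketch: extract the M (Power Query) document from a dataflow's model.json.
# Assumption: the mashup is stored in a model-level annotation named
# "pbi:mashup" whose value is a JSON string containing a "document" property.
with open("model.json", encoding="utf-8") as f:
    model = json.load(f)

annotation = next(a for a in model.get("annotations", [])
                  if a["name"] == "pbi:mashup")
mashup = json.loads(annotation["value"])
print(mashup["document"])  # the section document containing all the queries
```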

Things start to get interesting and different when we look at the entity definitions:

[Screenshot: the entity definitions in the dataflow’s model.json]

On line 80 we can see that the Product entity is defined as a ReferenceEntity, which is how the CDM metadata format describes what Power BI calls linked entities. Rather than having its attribute metadata defined in the current dataflow’s model.json file, it instead identifies the source entity it references, and the CDM folder in which the source entity is stored, similar to what we saw in the last example. Each modelId value for a linked entity references an id value in the referenceModels section, as we’ll see below.

The Customers with Addresses entity, defined starting on line 93, is the computed entity built in the video demo. This entity is a LocalEntity, meaning that its data is stored in the current CDM folder, and its metadata includes both its location and its full list of attributes.

The end of the model.json file highlights the rest of the differences between local and linked entities.

[Screenshot: the end of the dataflow’s model.json, showing partitions and referenceModels]

At line 184 we can see the partitions for the Customers with Addresses entity, including the URL for the data file backing this entity. Because the other entities are linked entities, their partitions are not defined in the current model.json.

Instead, the CDM folders where their data does reside are identified in the referenceModels section starting at line 193. The id values in this section match the modelId values on the linked entities above, and the location values are the URLs of the model.json files that define the source CDM folders for the linked entities.
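Here’s a sketch of how that resolution works in code. ReferenceEntity, modelId, referenceModels, and partitions are the CDM metadata format’s own property names; the local file handling is my own shortcut:

```python
import json

# Sketch: resolve each entity in a dataflow's model.json to where its data
# actually lives - a local partition URL, or a source CDM folder's model.json.
with open("model.json", encoding="utf-8") as f:
    model = json.load(f)

# Map each referenced model's id to its location (a model.json URL).
reference_models = {m["id"]: m["location"]
                    for m in model.get("referenceModels", [])}

for entity in model["entities"]:
    if entity.get("$type") == "ReferenceEntity":
        # Linked entity: data stays in the source dataflow's CDM folder.
        print(entity["name"], "->", reference_models[entity["modelId"]])
    else:
        # Local entity: its partitions are defined right here.
        for partition in entity.get("partitions", []):
            print(entity["name"], "->", partition["location"])
```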

If this information doesn’t make sense yet, please hold on. We’ll have different values for the same attributes for other dataflow creation methods, and then we can compare and contrast them.

I guarantee[2] it will make as much sense as anything on this blog.


[1] New videos every Monday morning!

[2] Or your money back.

Never stop asking “stupid” questions

When I joined Microsoft, the team running my NEO[1] session shared a piece of advice with my “class” of fresh-faced new hires:

“Think of your first year as a grace period where you can ask any question you want, without anyone thinking it’s a stupid question for which you should already know the answer.”

This sounded like empowering wisdom at the time, but the unspoken side of it was damaging. Between the lines, I heard this message as well, and it was this part that stuck with me:

“You’ve got one year to figure things out, and after that you’d better have your act together and know everything – because if you keep asking stupid questions we’ll know that we made a mistake hiring you.”

I hope it’s obvious that this wasn’t the intent of the advice, but I’ve spoken to enough people over the years to know that I’m far from the only one to take it this way.

[Image: a bored audience, from Pixabay]
How Matthew pictures everyone reacting when he asks questions.

In retrospect, I believe that I should have known better, but I let this unspoken message find a home in my brain, and I listened to it. I remained quiet – and remained ignorant – when I should have been asking questions.

Over the past few years[2], I have finally broken this self-limiting habit. Day after day and meeting after meeting, I’m the guy asking the questions that others are thinking and wish they could ask, but don’t feel comfortable or confident enough to voice. I’m the guy asking “why” again and again until I actually understand. And people are noticing.

How do I know that others want to ask the same questions?

I know because they tell me. Sometimes they say thank you in the meetings, and sometimes they ask their own follow-up questions. Sometimes they stop me in the hall after the meeting to say thank you. And on a few occasions people have set up 1:1 meetings with me to ask about what I do, and how they can learn to do the same.

How do I know that people are noticing?

A few of the people I’ve interrupted to ask “why” have also set up time with me to discuss how they can better communicate and prepare to have more useful and productive meetings.[3] These are generally more experienced team members who appreciate that my questions highlight unstated assumptions, or help them identify areas where they need a clearer story to communicate complex topics.

I’ve been with Microsoft since October 2008 – a little over 11 years. I’ve been working in the industry since the mid 90s. At this point in my career it’s easy for me to ask “stupid” questions because the people I work with know that I’m not stupid.[4][5]

This isn’t true for everyone. If you’re younger, new to your role or your career, or from an underrepresented group[6], you may not have the position of relative safety that I have today. My simple strategy of “asking lots of questions in large groups” may not work for you, and I don’t have tested advice for what will.

My suggestion is to ask those questions in small groups or 1:1 situations where the risk is likely to be lower, and use this experience to better understand your team culture… but I would love to hear your experience and advice no matter what you do. Just as your questions will be different from mine while still being useful, your experience and advice will be different, and will be useful and helpful in different ways.

Whatever approach you take, don’t be quiet. Don’t stop asking questions.

Especially the stupid questions.

Those are the best ones.

 

P.S. While this post has been scheduled and waiting to go live, two new articles have shown up on my radar that feel very relevant to this topic:

  1. The Harvard Business Review posted on Cracking the Code of Sustained Collaboration and how leaders can build and encourage cultures where these behaviors are rewarded.
  2. Ex-Microsoftie James Whittaker posted on Speaking Truth to Power, which presents critical observations of three eras of Microsoft, with examples of how different generations of leaders have affected the corporate culture.

These are very different articles, but they’re both fascinating, and well worth the time to read.


[1] New Employee Orientation. Sadly not anything to do with Keanu Reeves.

[2] I wish I could say that this was a deliberate strategic move on my part, but it was more reactive. Credit goes to the rapid pace of change and my inability to even pretend I could keep up.

[3] Yes, that means meetings where fewer people interrupt to ask “why.”

[4] I’m waiting for the flood of snarky comments from my teammates on this one…

[5] Or in any event, my asking questions is unlikely to change any minds on this particular point.

[6] If you’re not a middle-aged, white, cisgender man…

Power BI and ADLSg2 – but not dataflows

Back in July[1] the Power BI team announced the availability of a new connector for Azure Data Lake Storage Gen2.

It's a data lake. Get it?
When Matthew closes his eyes and pictures a data lake, this is what he sees.

In recent weeks I’ve been starting to hear questions that sound like these:

Question: Is this ADLSg2 connector how you get to the data behind dataflows?

Answer: No. Dataflows are how you get to the data behind dataflows.

Question: Is this how I can access dataflows if I don’t use Power BI Premium?

Answer: No. Dataflows are not a Premium-only feature.

Question: Can I use the ADLSg2 connector to work with CDM folder data?

Answer: Yes, but why would you?

If your data is already in CDM folders, using the ADLSg2 connector simply adds effort to consuming it in Power BI. You’ll be working with raw, untyped text files instead of strongly typed entities.

If your ADLSg2 data is already in CDM folders, strongly consider attaching the CDM folder as a dataflow. This means less up-front work for you, and less ongoing work for the users who need to get insights from the data.

Question: Why do we need an ADLSg2 connector if we have dataflows?

Answer: Now that is a good question!

Power BI dataflows store their data in CDM folder format, and they can be configured to store those CDM folders in your organization’s ADLSg2 data lake. In addition to this, you can attach a CDM folder in ADLSg2 as an external dataflow, making its data available to Power BI users even though the data ingress is taking place through another tool like Azure Data Factory.

But ADLSg2 is much, much more[2] than a repository for dataflows or CDM folders. ADLSg2 supports all sorts of file and blob data, not just CDM folders. And sometimes you need to work with that data in Power BI.

The ADLSg2 connector exists for these scenarios, when your data is not stored in CDM folders. With this connector, users in Power BI Desktop can connect to ADLSg2 resources and work with the files they contain, similar to the existing HDFS and Folder connectors.
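For comparison, here’s roughly what reaching the same kind of files from code (rather than through the connector) looks like with the azure-storage-file-datalake Python package. The account, file system, and folder names are placeholders:

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Sketch: list the files in an ADLSg2 folder, analogous to what the ADLSg2
# connector surfaces in Power BI Desktop. All names below are placeholders.
service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",
    credential="<account-key-or-sas-token>",
)
file_system = service.get_file_system_client("mycontainer")

for path in file_system.get_paths(path="sales/2019"):
    if not path.is_directory:
        print(path.name)
```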


[1] Yes, this is another catch-up post that has been waiting to be finished. No, I do not have any reason to believe that 2020 will be any more forgiving than 2019 has been.

[2] I could have linked to the product documentation or the official product page, but I believe that Melissa’s blog does the best job summing up ADLSg2 in a single post.