Where does Power BI end and Microsoft Fabric begin?

Over the holiday weekend I joined Adam Saxton for a Fabric-focused Guy in a Cube live stream. It was a ton of fun, as these streams always are, and we ended up going for 90 minutes instead of the scheduled 60[1].

Come for the hard-earned wisdom. Stay for the ridiculous facial expressions.

During the stream there were a few questions (including this one) where the answer involved understanding where Power BI ends and Microsoft Fabric begins. I suspect that this will be a common source of confusion in these early days of Fabric, and I hope this post will help clarify things.

Let’s begin at the beginning, with Power BI. When you think about Power BI, you’re probably thinking about something like this: Power BI as a SaaS cloud service for business intelligence.

The reality has always been a little more nuanced than this. Under the hood, Power BI has always had two main parts: BI artifacts, and the general-purpose, data-centric SaaS foundation on which they operate.

Power BI includes artifacts like reports, datasets, dataflows, and more. Each artifact includes a set of user experiences and the underlying capabilities required to make those experiences work – things like the Analysis Services Tabular engine for datasets.

Power BI also includes things like workspaces, authentication and authorization, tools for content sharing, discovery, and management, administration and governance, and more. These foundational capabilities apply to every BI artifact.

When Power BI added support for dataflows, paginated reports, and goals, they worked just the way you expected them to, based on your past experience with datasets and reports. This familiarity was a function of that shared SaaS foundation, but for most people there was never a reason to think about Power BI as a service foundation with an increasing number of workloads running on top of it.

With the introduction of Microsoft Fabric the distinction between the workloads and the service foundation becomes more apparent, because there are more workloads for more practitioner personas – and because the foundation includes OneLake[2], a shared capability that is used by all workloads, but isn’t fully abstracted into the service foundation.

With Fabric, the collection of artifacts and experiences running on top of the service foundation has expanded significantly, but those new experiences will be familiar to Power BI users because they run on the same foundation that Power BI experiences have been running on for years.

As you’re exploring Fabric you may find yourself asking questions like “will X feature from Power BI work for this new Fabric thing?” The most likely answer is that if the feature is shared across multiple Power BI artifacts and experiences, it will work for new Fabric artifacts and experiences as well. While Fabric is in preview there will be more exceptions to this rule[3], but as Fabric moves toward general availability these exceptions will become fewer.

With Fabric, the BI experiences and capabilities you’ve used and loved in Power BI will remain, and will continue to improve. And because all Fabric workloads run on the SaaS foundation that has supported Power BI for years, as that foundation grows and improves each workload will benefit from those improvements.

I’ll close with a gif. I’m generally not a big fan of gifs, but I think this one will illustrate and reinforce the story I’m trying to tell. Enjoy!

 


[1] Despite this bonus time we ended up leaving tons of questions unanswered. I think there might be enough interest in Fabric to keep the cube guys busy for a while.

[2] OneLake is a very big deal in Fabric, so if you haven’t already read my Introducing Microsoft Fabric and Microsoft Fabric and OneLake posts, you might want to check them out for additional information and context.

[3] If you watched the question and answer from the live stream, you’ll see an example of this.

 

Introducing Microsoft Fabric

This week at Microsoft Build, we announced the future.

With an introduction like that, I should probably remind everyone that this is my personal blog, my personal perspective, and my personal opinions. Although I am a Microsoft employee, I am not speaking for or otherwise representing my employer with this post or anything else on this blog.

With that disclaimer out of the way, let’s get back to the future. Let’s get back to Microsoft Fabric.

According to the official documentation, “Microsoft Fabric is an all-in-one analytics solution for enterprises that covers everything from data movement to data science, real-time analytics, and business intelligence.” Fabric is implemented as a SaaS service for all sorts of data, analytics, and BI capabilities, including data integration, data engineering, data science, data warehousing, real-time analytics, and business intelligence.

I’ve been working on Fabric for around 18 months[1], and I could not be more excited to finally share it with the world. I don’t own any of the features coming in Fabric, but my team and I have been running an NDA private preview program with thousands of users from hundreds of customer organizations around the world building solutions using Fabric, and providing feedback to the product team.

This introductory blog post won’t attempt to be super technical or comprehensive. Instead, I’m going to share the information I’ve shared most frequently and consistently during the Fabric private preview – the information and context that will help you get started, and that will help put the more technical content you’ll find elsewhere into context.

For folks who are already familiar with Power BI[2], Fabric is going to feel familiar from day one. This is because the SaaS foundation on which Fabric is built is the Power BI service you already use every day.

The SaaS foundation of Microsoft Fabric

The foundation is evolving and improving, and there are new capabilities in lots of places, but the foundation of Fabric is the foundation of Power BI. This means that from day one you already know how to use it:

  • Workspaces – Fabric workspaces behave like Power BI workspaces, but with more item types available.
  • Navigation – If you know how to move around the Power BI portal you know how to move around the Fabric portal, because it works the same way.
  • Collaboration and content management – You can collaborate and share with Fabric items and workspaces just like you do with Power BI.
  • Capacities – New Fabric workloads use the capacity-based compute model used by Power BI Premium. If you don’t already have a capacity, you can start a free trial.
  • Administration – Fabric administration works like Power BI administration, and the Fabric admin portal is the evolution of the Power BI admin portal. To enable the Fabric preview in your Power BI tenant or for a specific capacity, you can use the admin portal.
  • Much, much more – I won’t try to list everything here, because there’s already so much documentation available.

At this point you probably get the idea. If you’re familiar with Power BI, you’re going to have an easy time getting used to Fabric. Power BI will continue to evolve and grow, and there are a lot of exciting improvements coming to Power BI in Fabric[3] even without taking the new capabilities into account.

But what about those new capabilities? What about all the new data integration, data science, data engineering, data warehousing, and real-time analytics capabilities? How familiar will they be?

That’s a slightly more complicated question. In a lot of ways these new Fabric workloads are the evolution of existing Azure data services, including Azure Synapse, Azure Data Factory, and Azure Data Explorer. These established PaaS services have been updated and enhanced to run on the Fabric shared SaaS foundation, and their user experiences have been integrated into the Fabric portal.

If you’re already familiar with Azure Synapse, Azure Data Factory, and/or Azure Data Explorer, the new capabilities in Fabric will probably be familiar too. You already know how to work with pipelines and notebooks, and you already know how to write SQL and KQL queries – in Fabric you’ll just be doing these familiar things in a new context.

There are a few key Fabric concepts that I’ve seen new-to-Power BI preview customers ask the most questions about. If you or your colleagues are more Azure-savvy than Power BI-savvy, you’ll probably want to pay attention to:

  • Capacities – Fabric uses capacities for compute across all experiences[4], which provides a consistent billing and consumption model, but which will necessitate a change in thinking for folks who are used to other service-specific approaches to billing and consumption.
  • Workspaces – Other services don’t have the same concept of a workspace as Power BI and Fabric do… but some of them have different concepts with the same name. Since workspaces are a crucial tool for content creation, organization, and security, understanding them and how they work will be important for success with Fabric.
  • A “managed” SaaS data service – In most data services, the “catalog” of items and their relationships is expressed through the metadata of a given instance. This means that capabilities like discovery, lineage, and impact analysis are either absent, limited in scope, or only available through integration with an external data catalog or similar service. Fabric, like Power BI, maintains an internal data catalog of all items in the tenant and their relationships to each other. This information is exposed through APIs and integrated into experiences like the workspace lineage view and the data hub, making it easier to discover, understand, and use data (one example of these APIs is sketched after this list).
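
One example of “exposed through APIs” is the existing Power BI admin “scanner” (WorkspaceInfo) REST APIs, which return catalog and lineage metadata for workspaces. Here’s a minimal sketch of what calling them might look like; the workspace ID is a placeholder, and the call requires admin permissions (or an approved service principal), so treat this as an illustration rather than a recipe.

```python
# Minimal sketch: pulling catalog and lineage metadata for a workspace through the
# Power BI admin "scanner" (WorkspaceInfo) REST APIs. The workspace ID is a placeholder,
# and the caller needs admin permissions (or an approved service principal).
import time

import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token("https://analysis.windows.net/powerbi/api/.default")
headers = {"Authorization": f"Bearer {token.token}"}
base = "https://api.powerbi.com/v1.0/myorg/admin/workspaces"

# Request a scan of one workspace, including lineage and datasource details.
scan = requests.post(
    f"{base}/getInfo?lineage=True&datasourceDetails=True",
    headers=headers,
    json={"workspaces": ["00000000-0000-0000-0000-000000000000"]},  # placeholder workspace ID
).json()

# Poll until the scan completes, then read the catalog/lineage result.
while requests.get(f"{base}/scanStatus/{scan['id']}", headers=headers).json()["status"] != "Succeeded":
    time.sleep(5)

result = requests.get(f"{base}/scanResult/{scan['id']}", headers=headers).json()
for workspace in result["workspaces"]:
    print(workspace["name"], [report["name"] for report in workspace.get("reports", [])])
```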

In addition to things in Fabric that will be familiar to people with Power BI experience and things in Fabric that will be familiar to people with Azure data experience, there’s one huge part of Fabric that is going to be new to everyone: OneLake.

OneLake is a SaaS data lake that is a key component of the Fabric SaaS foundation[5]. Every Fabric tenant includes a single OneLake instance, and all Fabric experiences work natively with data in the lake.

  • OneLake is open – OneLake is built on ADLS Gen2. You can store any type of file, and use the same APIs you use when connecting to ADLS Gen2 (there’s a small sketch of this after this list). Storing data in OneLake doesn’t mean it’s locked into Fabric – it means it can be used where and how you need it to be used.
  • Delta by default – Fabric experiences store their data in OneLake as Delta parquet files. Delta is an open source format built on compressed, columnar parquet files; it supports ACID transactions and is supported by a wide range of tools.
  • Store once, use everywhere – Because there’s one OneLake, data can be ingested and stored once and used wherever it’s needed. A single set of Delta files can be exposed as a lakehouse and manipulated using notebooks, exposed as a warehouse and manipulated using SQL, and exposed as a Power BI tabular dataset in Direct Lake mode – all at the same time. This decoupling of storage and compute is enabled by OneLake, and I expect it to be one of the most significant game-changing aspects of Fabric as a whole.
  • OneLake is integrated – Being open makes it easy for you to store your data in OneLake while using it with whatever tools and compute engines you choose. OneLake shortcuts allow you to keep your data where you have it today while logically exposing it as if it were stored in OneLake.
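
To show what “the same APIs” means in practice, here’s a minimal sketch that reads OneLake content with the standard ADLS Gen2 Python SDK. The OneLake endpoint, workspace, lakehouse, and file paths below are assumptions for illustration – adjust them for your own tenant.

```python
# Minimal sketch: reading OneLake content with the standard ADLS Gen2 Python SDK.
# The OneLake endpoint, workspace, lakehouse, and file paths are assumptions for illustration.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# OneLake exposes an ADLS Gen2-compatible endpoint; each workspace appears as a
# filesystem (container), and items appear as folders within it.
service = DataLakeServiceClient(
    account_url="https://onelake.dfs.fabric.microsoft.com",  # assumed OneLake endpoint
    credential=DefaultAzureCredential(),
)
workspace = service.get_file_system_client("MyWorkspace")  # hypothetical workspace name

# List the files under a hypothetical lakehouse's Files area.
for path in workspace.get_paths(path="MyLakehouse.Lakehouse/Files"):
    print(path.name)

# Download a single file, exactly as you would from any ADLS Gen2 account.
file_client = workspace.get_file_client("MyLakehouse.Lakehouse/Files/sales/orders.csv")
orders_bytes = file_client.download_file().readall()
```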

OneLake takes the familiar concept of a data lake, and puts it where no one seems to expect it: at the center of work, where it makes sense, deeply integrated into the tools and experiences used by everyone contributing to a project or product.

With all of these new and familiar capabilities coming together into a single SaaS platform, the next thing that Fabric delivers is a comprehensive set of user experiences.

Modern data projects often involve a wide range of practitioners – data scientists, data engineers, data developers, BI developers, report authors, and more. Before Fabric, each practitioner persona would typically work in their own set of tools and experiences, each with its own strengths, weaknesses, and capabilities. Taken together, this means that most projects involve significant integration effort to make the output of one tool work with the next tool in the value chain – and often there are tradeoffs made to accommodate the mismatch between tools.

With Fabric, each task and persona has a purpose-built set of experiences that all work natively with the same data in OneLake. The result is that data practitioners can focus on delivering value through their data work, not on building integrations so their tools will work together. Teams can set up workspaces that contain the data and items they need – lakehouses, warehouses, notebooks, Spark jobs, pipelines, dataflows, datasets, reports, and more. Data in one workspace can be used in other workspaces as needed, and because of OneLake it can be stored once and used multiple times without duplication.

During the Fabric private preview, the chief data officer of a well-known global organization[6] said something to the effect of:

With Fabric I can finally be a Chief Data Officer instead of being a Chief Integration Officer.

And this is why I believe Fabric represents the future of data.

Think back 10-12 years, when the first generation of PaaS data services was becoming available. Many data practitioners looked at them and dismissed them as solutions to imaginary problems – why would we ever need a cloud service when we had these beautiful database servers in our own data centers, with IO subsystems we had designed to our own specs and fine-tuned to the nth degree[7]? It took time for people to realize the value and advantage of the cloud, but today there are entire classes of problems that simply don’t exist anymore because of the cloud.

I believe that the integrated, open, flexible, SaaS nature of Fabric means that we’re at an inflection point as significant for data as the advent of the cloud. Fabric will eliminate entire classes of problems that we take for granted today – in a few years we will take this new platform and this new paradigm for granted, and wonder how we ever thought those problems were an acceptable part of our professional lives.

Welcome to Fabric. Welcome to the future of data.

Ok, that’s my post. Where should you go from here? In addition to all of the links above, you should definitely check out the official blog post for the official big picture. You should also join us Wednesday and Thursday for a “simulive” virtual event as we go deeper into many of the key capabilities now available in Fabric.

I’ll see you there.


[1] Old man voice: Back when I was your age we called Fabric “Trident” and we weren’t allowed to talk about it in public because if the Kaiser heard about it our boys fighting in France would be at risk! Let me tell you about the time I…

[2] If you’re reading this post on this blog, I suspect this includes you. I’d love to know if you agree with my “feels familiar from day one” assertion.

[3] I’m working under the assumption that the interwebs will be flooded today with blogs and announcements and guys in cubes, so I’ll leave it up to you to track down what’s exciting for you.

[4] Yes, you need a capacity for all new Fabric experiences. Power BI licensing is not changing, but to work with the new Fabric capabilities you need a capacity to run them on. Fabric capacities are available in smaller SKUs than Power BI capacities.

[5] You probably noticed it in the diagram image above, sitting there in the middle all integrated and important.

[6] You know this company, but since this is my personal blog I’m probably not going to get their permission to name them. Also, as I write this blog post I can’t find the verbatim customer quote, so you’ll need to rely on my imperfect memory for this one.

[7] As I type this in 2023, I can’t remember the last time I worked with an on-prem production database. It was probably 2011 or 2012. It feels like something from another age, another life.

Data Moshpit

You read that correctly. Data moshpit.

This September the greatest thing in the history of great things happening will happen. On Saturday September 16, in a metal club in Berlin, data platform MVPs Ben Kettner and Frank Geisler are JOINing[1] the two best things in the world: data and heavy metal.

Data Moshpit is a one-day community event with a heavy metal theme. The sessions will be metal. The speakers will be metal. The atmosphere will be metal. There will be beer and food and, if my experience at metal shows in Germany has taught me anything, lots of brothers of metal and sisters of steel wearing leather and denim and the coolest jean jacket vests with giant Manowar back patches.

The Data Moshpit call for speakers is open now, and closes on July 7th. You can submit metal-themed sessions of your own, or just check out the exciting sessions already submitted.

If you’ve got a heart of steel and data for blood, this is the one event of 2023 that you cannot afford to miss. And if you’re not into metal[2], you should come anyway. It will be a great opportunity to connect with the community and learn something new. I hope to see you there!


[1] Yes, I made this joke. Don’t be CROSS.

[2] You can still be my friend, don’t worry.

Roche’s Maxim of Data Transformation

According to the internet, a maxim is a succinct formulation of a fundamental principle, general truth, or rule of conduct.[1] Maxims tend to relate to common situations and topics that are understandable by a broad range of people.

Topics like data transformation.

Roche’s Maxim of Data Transformation[2] states:

Data should be transformed as far upstream as possible, and as far downstream as necessary.

In this context “upstream” means closer to where the data is originally produced, and “downstream” means closer to where the data is consumed.

By transforming data closer to its ultimate source costs can be reduced, and the value added through data transformation can be applied to a greater range of uses. The farther downstream a given transformation is applied, the more expensive it tends to be – often because operations are performed more frequently – and the smaller the scope of potential value through reuse.

I’ve been using this guideline for many years, but I only started recognizing it as a maxim in the past year or so. The more I work with enterprise Power BI customers the more I realize how true and how important it is – and how many common problems could be avoided if more people thought about it when building data solutions.

Please note that this maxim generalizes to data solutions implemented using any tools or technologies. The examples below focus on Power BI because that’s where I spend my days, but these principles apply to every data platform I have used or seen used.

In day-to-day Power BI conversations, perhaps the most common question to which Roche’s Maxim applies is about where to implement a given unit of logic: “Should I do this in DAX or in Power Query?”

Short answer: Do it in Power Query.

If you’re ever faced with this question, always default to Power Query if Power Query is capable of doing what you need – Power Query is farther upstream. Performing data transformation in Power Query ensures that when your dataset is refreshed the data is loaded into the data model in the shape it needs to be in. Your report logic will be simplified and thus easier to maintain, and will likely perform better[3] because the Vertipaq engine will need to do less work as users interact with the report.[4]

But what if you need data transformation logic that depends on the context of the current user interacting with the report – things like slicers and cross-filtering? This is the perfect job for a DAX measure, because Power Query doesn’t have access to the report context. Implementing this logic farther downstream in DAX makes sense because it’s necessary.

Another common question to which Roche’s Maxim applies is also about where to implement a given unit of logic: “Should I do this in Power BI or in the data warehouse?”

Short answer: Do it in the data warehouse.

If you’re ever faced with this question, always default to transforming the data into its desired shape when loading it into the data warehouse – the data warehouse is farther upstream. Performing data transformation when loading the data warehouse ensures that any analytics solution that uses the data has ready access to what it needs – and that every solution downstream of the warehouse is using a consistent version of the data.

From a performance perspective, it is always better to perform a given data transformation as few times as possible, and it is best to not need to transform data at all.[5] Data transformation is a costly operation – transforming data once when loading into a common location like a data warehouse, data mart, or data lake, is inherently less costly than transforming it once for every report, app, or solution that uses that common location.

A much less common question to which Roche’s Maxim applies might be “What about that whole ‘not transforming at all’ pattern you mentioned a few paragraphs back – how exactly does that dark magic work?”

Short answer: Have the data already available in the format you need it to be in.

That short answer isn’t particularly useful, so here are two brief stories to illustrate what I mean.

Many years ago I was working with an oil & gas company in an engagement related to master data management. This company had a fundamental data problem: the equipment on their drilling platforms around the world was not standardized, and different meters reported the same production data differently. These differences in measurement meant that all downstream reporting and data processing could only take place using the least common denominator across their global set of meters… and this was no longer good enough. To solve the problem, they were standardizing on new meters everywhere, and updating their data estate to take advantage of the new hardware. My jaw dropped when I learned that the cost of upgrading was upwards of one hundred million dollars… which was a lot of money at the time.

Much more recently I was working with a retail company with over 5,000 locations across North America. They had similar challenges with similar root causes: their stores did not have consistent point of sale (POS) hardware[6], which meant that different stores produced different data and produced some common data at different grain, and analytics could only take place using the least common denominator data from all stores. Their solution was also similar: they upgraded all POS systems in all stores. I don’t have a dollar amount to put with this investment, but it was certainly significant – especially in an industry where margins are traditionally small and budgets traditionally very conservative.

Both of these stories illustrate organizations taking Roche’s Maxim to the extreme: they transformed their key data literally as far upstream as possible, by making the necessary changes to produce the data in its desired form.[7]

Each of these stories included both technical and non-technical factors. The technical factors revolve around data. The non-technical factors revolve around money. Each company looked at the cost and the benefit and decided that the benefit was greater. They implemented an upstream change that will benefit every downstream system and application, which will simplify their overall data estate, and which corrects a fundamental structural problem in their data supply chain that could only be mitigated, not corrected, by any downstream change.

There’s one additional part of Roche’s Maxim that’s worth elaborating on – what does “necessary” mean? This post has looked at multiple scenarios that emphasize the “as far upstream as possible” part of the maxim – what about the “as far downstream as necessary” part?

Some factors for pushing transformations downstream are technical, like the DAX context example above. Another technical factor might be the availability of data – you can’t produce a given output unless you have all the necessary inputs. Other factors may be organizational – if data is produced by a 3rd party, your ability to apply transformations before a given point may be constrained more by a contract than by technology.

Still other factors may be situational and pragmatic – if team priorities and available resources prevent you from implementing a unit of data transformation logic in the data warehouse, it may be necessary to implement it in your Power BI solution in order to meet project deadlines and commitments.

These are probably the most frustrating types of “necessary” factors, but they’re also some of the most common. Sometimes you need to deliver a less-than-ideal solution and incur technical debt that you would prefer to avoid. The next time you find yourself in such a situation, keep this maxim in mind, and remember that even though it may be necessary to move that data transformation logic downstream today, tomorrow is another day, with different constraints and new opportunities.

Callout: This may be my maxim, but this isn’t the first blog post on the topic. Stuart Box from the UK data consultancy BurningSuit blogged about it back in March, and got there first with this excellent article.


Update June 2021: The video from my June 11 DataMinutes presentation is now available, so if you prefer visual content, this video might be a good place to start.

 

Update May 2022: The fine folks at Greyskull Analytics have added a wonderful t-shirt to their online store. If you want to look cooler than I ever will, you might want to head on over to https://greyskullanalytics.com/ to order yours.

 

Update August 2022: There’s also a video from SQLBits, in case you want a slightly longer version presented to a live audience.

 

Update October 2022: I had the pleasure of visiting Patrick’s cube, and we recorded a video together. You should check it out.


[1] Or a men’s magazine. I really really wanted to use this more pop-culture meaning to make a “DQ” joke playing on the men’s magazine “GQ” but after watching this post languish in my drafts for many months and this joke not even beginning to cohere, I decided I should probably just let it go and move on.

But I did not let it go. Not really.

[2] If you think that sounds pretentious when you read it, imagine how it feels typing it in.

[3] The performance benefit here is not always obvious when working with smaller data volumes, but will become increasingly obvious as the data volume increases. And since the last thing you want to do in this situation is to retrofit your growing Power BI solution because you made poor decisions early on, why not refer to that maxim the next time you’re thinking about adding a calculated column?

[4] This post only spent a week or so in draft form, but during this week I watched an interesting work email conversation unfold. A Power BI customer was experiencing unexpected performance issues related to incremental refresh of a large dataset, and a DAX calculated column on a table with hundreds of millions of records was part of the scenario. The email thread was between members of the engineering and CAT teams, and a few points jumped out at me, including one CAT member observing “in my experience, calculated columns on large tables [can] increase processing times and also can greatly increase the time of doing a process recalc… it also depends on the complexity of the calculated column.”

I don’t have enough knowledge of the VertiPaq engine’s inner workings to jump into the conversation myself, but I did sip my coffee and smile to myself before moving on with my morning. I checked back in on the conversation later on, and saw that a Power BI group engineering manager (GEM) had shared this guidance, presented here with his approval:

“From a pure perf standpoint, it’s true that we can say:

  • The most efficient approach for a large fact table is to have all the columns be present in the source table (materialized views also might work), so that no extra processing is necessary during the import operation (either in Mashup or in DAX)
  • The next most efficient approach for a large fact table is usually going to be to have the computation be part of the M expression, so that it only needs to be evaluated for the rows in the partitions being processed
  • DAX calculated columns are a great option for flexibility and are particularly useful for dimension tables, but will be the least efficient compared to the above two options for large fact tables”

That sounds pretty familiar, doesn’t it? The GEM effectively summarized Roche’s Maxim, including specific guidance for the specific customer scenario. The details will differ from context to context, but I have never found a scenario to which the maxim did not apply.

Yes, this is a challenge for you to tell me where and how I’m wrong.

[5] Just as Sun Tzu said “To fight and conquer in all our battles is not supreme excellence; supreme excellence consists in breaking the enemy’s resistance without fighting,” supreme excellence in data transformation is not needing to transform the data at all.

[6] That’s “cash registers” to the less retail inclined readers.

[7] If you feel inclined to point out that in each of these stories there is additional data transformation taking place farther downstream, I won’t argue. You are almost certainly correct… but the Maxim still holds, as the key common transformations have been offloaded into the most upstream possible component in the data supply chain. Like a boss.[8]

[8] Like a supremely excellent[5] boss.

Session resources: Patterns for adopting dataflows in Power BI

Important: This post was written and published in 2020, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be more of an historical record and less of a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

This morning I presented a new webinar for the Istanbul Power BI user group, covering one of my favorite subjects: common patterns for successfully using and adopting dataflows in Power BI.

This session represents an intersection of my data culture series in that it presents lessons learned from successful enterprise customers, and my dataflows series in that… in that it’s about dataflows. I probably didn’t need to point out that part.

The session slides can be downloaded here: 2020-09-23 – Power BI Istanbul – Patterns for adopting dataflows in Power BI

The session recording is available for on-demand viewing. The presentation is around 50 minutes, with about 30 minutes of dataflows-centric Q&A at the end. Please check it out, and share it with your friends!

 

Real customers, real stories

This is my personal blog – I try to be consistently explicit in reminding all y’all about this when I post about topics that are related to my day job as a program manager on the Power BI CAT team. This is one of those posts.

If I had to oversimplify what I do at work, I’d say that I represent the voice of enterprise Power BI customers. I work with key stakeholders from some of the largest companies in the world, and ensure that their needs are well-represented in the Power BI planning and prioritization process, and that we deliver the capabilities that these enterprise customers need[1].

Looking behind this somewhat grandiose summary[2], a lot of what I do is tell stories. Not my own stories, mind you – I tell the customers’ stories.

Image by Daria Głodowska from Pixabay
It was the best of clouds, it was the worst of clouds.

On an ongoing basis, I ask customers to tell me their stories, and I help them along by asking these questions:

  • What goals are you working to achieve?
  • How are you using Power BI to achieve these goals?
  • Where does Power BI make it hard for you to do what you need to do?

When they’re done, I have a pretty good idea what’s going on, and do a bunch of work[3] to make sure that all of these stories are heard by the folks responsible for shipping the features that will make these customers more successful.

Most of the time these stories are never shared outside the Power BI team, but on occasion there are customers who want to share their stories more broadly. My amazing teammate Lauren has been doing the heavy lifting[4] to get them ready to publish for the world to see, and yesterday the fourth story from her efforts was published.

You should check them out:

  1. Metro Bank: Metro Bank quickly delivers business efficiency gains without requiring involvement from IT
  2. Cummins: Cummins uses self-service BI to increase productivity and reduce unnecessary costs
  3. Veolia: Environmental services company builds sustainable, data-driven solutions with Power BI and Azure
  4. Avanade: Microsoft platform–focused IT consulting company innovates with Power BI and Azure to improve employee retention
  5. Cerner: Global healthcare solutions provider moves to the cloud for a single source of truth in asset and configuration management

Update: Apparently the Cerner story was getting published while I was writing this post. Added to the list above.

I know that some people will look at these stories and discount them as marketing – there’s not a lot I can do to change that – but these are real stories that showcase how real customers are overcoming real challenges using Power BI and Azure. Being able to share these stories with the world is very exciting for me, because it’s an insight into the amazing work that these customers are doing, and how they’re using Power BI and Azure services to improve their businesses and to make people’s lives better. They’re demonstrating the art of the possible in a way that is concrete and real.

And for each public story, there are scores of stories that you’ll probably never hear. But the Power BI team is listening, and as long as they keep listening, I’ll keep helping the customers tell their stories…


[1] This makes me sound much more important than I actually am. I should ask for a raise.

[2] Seriously, if I do this, shouldn’t I be a VP or Partner or something?

[3] Mainly boring work that is not otherwise mentioned here.

[4] This is just one more reason why having a diverse team is so important – this is work that would be brutally difficult for me, and she makes it look so easy!

 

Power BI and ACLs in ADLSg2

In addition to using Azure Data Lake Storage Gen2 as the location for Power BI dataflows data, Power BI can also use ADLSg2 as a data source. As organizations choose ADLSg2 as the storage location for more and more data, this capability is key to enabling analysts and self-service BI users to get value from the data in the lake.

Oh buoy, that is one big data lake!

But how do you do this in as secure a manner as possible, so that the right users have the minimum necessary permissions on the right data?

The short answer is that you let the data source handle secure access to the data it manages. ADLSg2 has a robust security model, which supports both Azure role-based access control (RBAC) and POSIX-like access control lists (ACLs)[1].

The longer answer is that this robust security model may make it more difficult to know how to set up permissions in the data lake to meet your analytics and security requirements.

Earlier this week I received a question from a customer on how to get Power BI to work with data in ADLSg2 that is secured using ACLs. I didn’t know the answer, but I knew who would, so I looped in Ben Sack from the dataflows team. Ben answered the customer’s questions and unblocked their efforts, and he said that I could turn his answers into a blog post. Thank you, Ben![2]

Here’s what you should know:

1 – If you’re using ACLs, you must specify at least a filesystem name in the URL you use in the connector (the same applies if you access ADLS Gen2 via an API or any other client).

i.e. the path in the Power BI connector must be at least: https://storageaccountname.dfs.core.windows.net/FileSystemName/

2 – For every file whose contents you want to read, the filesystem and all parent folders must have the “x” (execute) ACL, and the file itself must have an “r” (read) ACL.

i.e. if you want to access the file: https://StorageAccountName.dfs.core.windows.net/FileSystemName/SubFolder1/File1.csv

3 – For files you want to list, the filesystem and all parent folders must have the “x” ACL, and the immediate parent folder must also have an “r” ACL.

i.e. if you want to view and access the files in this subfolder: https://StorageAccountName.dfs.core.windows.net/FileSystemName/SubFolder1/

4 – Default ACLs are a great way to have ACLs propagate to child items, but they have to be set before creating subfolders and files – otherwise you need to explicitly set ACLs on each item.[3]

5 – If permission management is going to be dynamic, use groups as much as possible rather than assigning permissions to individual users[4]. First add the groups to the folder and file ACLs, then manage access via membership in the groups.

6 – If you have an error accessing a path that is deep in the filesystem, work your way down from the filesystem level, fixing ACL settings at each step.

i.e. if you are having trouble accessing https://StorageAccountName.dfs.core.windows.net/FileSystemName/SubFolder1/SubFolder2/File(s)

First try: https://StorageAccountName.dfs.core.windows.net/FileSystemName

Then: https://StorageAccountName.dfs.core.windows.net/FileSystemName/SubFolder1

And so on.
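
To make this guidance a little more concrete, here’s a minimal sketch of points 2 through 5 using the ADLS Gen2 Python SDK (azure-storage-file-datalake). The storage account, filesystem, folder, and group ID are placeholders, and the recursive ACL update shown here is a newer SDK capability – just one way to apply the ACLs described above, not something Power BI itself requires.

```python
# A minimal sketch of points 2-5 above, using the azure-storage-file-datalake SDK.
# The storage account, filesystem, folder, and group object ID are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://StorageAccountName.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("FileSystemName")

group_id = "<aad-group-object-id>"  # point 5: grant access to a group, then manage membership

folder = fs.get_directory_client("SubFolder1")

# Points 2 and 3: folders on the path need "x" so they can be traversed, and the
# immediate parent needs "r" for listing; applying "r-x" recursively from here also
# gives the files underneath the "r" they need to be read. Remember that the
# filesystem root and any folders above SubFolder1 need at least "x" for this group too.
folder.update_access_control_recursive(acl=f"group:{group_id}:r-x")

# Point 4: "default" entries propagate to subfolders and files created later,
# so new items don't need to be ACLed one by one.
folder.update_access_control_recursive(acl=f"default:group:{group_id}:r-x")
```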

Update: James Baker, a Program Manager on the Azure Storage team has published on GitHub a PowerShell script to recursively set ACLs. Thanks to Simon for commenting on this post to make me aware of it, Josh from the Azure support team for pointing me to the GitHub repo, and of course to James for writing the actual script!


[1] This description is copied directly from the ADLSg2 documentation, which you should also read before acting on the information in this post.

[2] Disclaimer: This post is basically me using my blog as a way to get Ben’s assistance online so more people can get insights from it. If the information is helpful, all credit goes to Ben. If anything doesn’t work, it’s my fault. Ok, it may also be your fault, but it’s probably mine.

[3] This one is very important to know before you begin, even though it may be #3 on the list.

[4] This is a best practice pretty much everywhere, not just here.

New resource: Generating CDM folders from Azure Databricks

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

Most of my blog posts that discuss the integration of Azure data services and Power BI dataflows via Common Data Model folders[1][2][3] include links to a tutorial and sample originally published in late 2018 by the Azure team. This has long been the best resource to explain in depth how CDM folders fit in with the bigger picture of Azure data.

Now there’s something better.

Microsoft Solutions Architect Ted Malone has used the Azure sample as a starting point for a GitHub project of his own, and has extended this sample project to start making it suitable for more scenarios.


The thing that has me the most excited (beyond having Ted contributing to a GitHub repo, and having code that works with large datasets) is the plan to integrate with Apache Atlas for lineage and metadata. That’s the good stuff right there.

If you’re following my blog for more than just Power BI and recipes, this is a resource you need in your toolkit. Check it out, and be sure to let Ted know if it solves your problems.


[1] Power BIte: Creating dataflows by attaching external CDM folders

[2] Quick Tip: Working with dataflow-created CDM folders in ADLSg2

[3] Dataflows, CDM folders and the Common Data Model

Power BIte: Creating dataflows by attaching external CDM folders

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

This week’s Power BIte is the fourth and final entry in a series of videos[1] that present different ways to create new Power BI dataflows, and the results of each approach.

When creating a dataflow by attaching an external CDM folder, the dataflow will have the following characteristics:

  • Data ingress path – Ingress via Azure Data Factory, Databricks, or whatever Azure service or app has created the CDM folder.
  • Data location – Data is stored in ADLSg2 in the CDM folder created by the data ingress process.
  • Data refresh – The data is refreshed based on the execution schedule and properties of the data ingress process, not by any setting in Power BI.

The key to this scenario is the CDM folder storage format. CDM folders provide a simple and open way to persist data in a data lake. Because CDM folders are implemented using CSV data files and JSON metadata, any application can read from and write to CDM folders. This includes multiple Azure services that have libraries for reading and writing CDM folders and 3rd party data tools like Informatica that have implemented their own CDM folder connectors.
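
To give a sense of what that JSON metadata looks like, here’s a heavily trimmed, hypothetical model.json for a CDM folder containing a single entity. Real model.json files produced by dataflows and other tools include more properties, but the entities → attributes → partitions structure shown here is the part that matters.

```python
# A heavily trimmed, hypothetical model.json for a CDM folder with a single entity.
# Real model.json files produced by dataflows and other tools include more properties;
# the point here is the entities -> attributes -> partitions structure.
import json

model = {
    "name": "WideWorldImporters-Orders",  # illustrative names throughout
    "version": "1.0",
    "entities": [
        {
            "$type": "LocalEntity",
            "name": "Orders",
            "attributes": [
                {"name": "OrderID", "dataType": "int64"},
                {"name": "OrderDate", "dataType": "dateTime"},
                {"name": "TotalDue", "dataType": "double"},
            ],
            "partitions": [
                {
                    "name": "Orders-Part001",
                    # Each partition points at a CSV data file in the same CDM folder.
                    "location": "https://storageaccountname.dfs.core.windows.net/powerbi/Orders/Orders-Part001.csv",
                }
            ],
        }
    ],
}

print(json.dumps(model, indent=2))
```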

CDM folders enable scenarios like this one, which is implemented in a sample and tutorial published on GitHub by the Azure data team:

  • Create a Power BI dataflow by ingesting order data from the Wide World Importers sample database and save it as a CDM folder
  • Use an Azure Databricks notebook that prepares and cleanses the data in the CDM folder, and then writes the updated data to a new CDM folder in ADLS Gen2
  • Attach the CDM folder created by Databricks as an external dataflow in Power BI[2]
  • Use Azure Machine Learning to train and publish a model using data from the CDM folder
  • Use an Azure Data Factory pipeline to load data from the CDM folder into staging tables in Azure SQL Data Warehouse and then invoke stored procedures that transform the data into a dimensional model
  • Use Azure Data Factory to orchestrate the overall process and monitor execution

That’s it for this mini-series!

If all this information doesn’t make sense yet, now is the time to ask questions.


[1] New videos every Monday morning!

[2] I added this bullet to the list because it fits in with the rest of the post – the other bullets are copied from the sample description.

Quick Tip: Working with dataflow-created CDM folders in ADLSg2

Important: This post was written and published in 2019, and the content below may no longer represent the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.

If you’re using your own organizational Azure Data Lake Storage Gen2 account for Power BI dataflows, you can use the CDM folders that Power BI creates as a data source for other efforts, including data science with tools like Azure Machine Learning and Azure Databricks.

Image by Arek Socha from Pixabay
A world of possibilities appears before you…

This capability has been in preview since early this year, so it’s not really new, but there are enough pieces involved that it may not be obvious how to begin – and I continue to see enough questions about this topic that another blog post seemed warranted.

The key point is that because dataflows are writing data to ADLSg2 in CDM folder format, Azure Machine Learning and Azure Databricks can both read the data using the metadata in the model.json file.

This json file serves as the “endpoint” for the data in the CDM folder; it’s a single resource that you can connect to, and not have to worry about the complexities in the various subfolders and files that the CDM folder contains.
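
As a rough illustration of that “single resource” idea, here’s a minimal Python sketch that downloads model.json from ADLSg2 and uses it to load one entity’s CSV partitions into pandas. The storage account, filesystem, folder, and entity names are placeholders, and the snippet assumes the standard model.json structure (entities with attributes and partitions that point at CSV files).

```python
# Minimal sketch: using a CDM folder's model.json as the single entry point to its data.
# The storage account, filesystem, folder, and entity names are placeholders.
import io
import json

import pandas as pd
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

filesystem_name = "powerbi"  # hypothetical filesystem used for dataflow storage
service = DataLakeServiceClient(
    account_url="https://storageaccountname.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client(filesystem_name)

# model.json describes every entity in the CDM folder and where its data files live.
model_file = fs.get_file_client("MyWorkspace/MyDataflow/model.json")
model = json.loads(model_file.download_file().readall())

# Load all CSV partitions for one entity, assuming the standard
# entities -> attributes/partitions structure of model.json.
entity = next(e for e in model["entities"] if e["name"] == "Customers")
columns = [attribute["name"] for attribute in entity["attributes"]]

frames = []
for partition in entity.get("partitions", []):
    # Partition locations are full URLs; keep only the path inside the filesystem.
    relative_path = partition["location"].split(f"/{filesystem_name}/", 1)[1]
    csv_bytes = fs.get_file_client(relative_path).download_file().readall()
    # Dataflow-created CSVs don't include a header row; column names come from model.json.
    frames.append(pd.read_csv(io.BytesIO(csv_bytes), header=None, names=columns))

customers = pd.concat(frames, ignore_index=True)
print(customers.head())
```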

This tutorial is probably the best place to start if you want to know more[1]. It includes directions and sample code for creating and consuming CDM folders from a variety of different Azure services – and Power BI dataflows. If you’re one of the people who has recently asked about this, please go through this tutorial as your next step!


[1] It’s the best resource I’m aware of – if you find a better one, please let me know!