How much have they been processed since they were produced?
How much metadata is needed to understand what they are?
What are these?
How do you know?
How much have they been processed since they were produced?
How much metadata is needed to understand what they are?
Why do this payload and the one before it have a standard metadata package, even though the payloads are from different sources? What is the scope of the standard? Under what authority is the standard defined, and enforced?
What are these?
How do you know?
How much have they been processed since they were produced?
Without metadata, how do you evaluate the contents?
Without metadata, would you bother to evaluate the contents, or would you pass them by and instead look for payloads with complete metadata?
What are these?
How much have they been processed since they were produced?
Can you infer important details from the payload container format, even though the primary metadata is missing? Is this enough metadata for you to evaluate the payload for use?
How does the complexity of a payload relate to the complexity of the metadata? How does it relate to your requirements for considering the metadata to be complete enough?
Do you need more metadata to understand a payload that has been highly processed?
Is it easier or harder to use a simpler payload? How does the complexity of the desired application factor into your answer?
Important: This post was written and published in 2018, and the content below no longer represents the current capabilities of Power BI. Please consider this post to be an historical record and not a technical resource. All content on this site is the personal output of the author and not an official resource from Microsoft.
One of the exciting new preview capabilities in the October 2018 release of Power BI Desktop is support for data profiling in the Power Query editor. Having per-column data profile information available in the query editor is very useful to help understand the data you’re working with…
…but what about understanding data in a broader context?
The Power Query function language “M” contains a Table.Profile function that accepts a table as input and returns a table containing the data profile for the input table.[1] You can use this function in Power BI Desktop, but now that there is a data profiling UI, its value there is limited.
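If you want to see what Table.Profile returns, you can try it in Power BI Desktop against a small inline table. The sketch below uses a made-up sample table purely for illustration; the function returns one row per input column, with statistics such as the minimum, maximum, average, count, null count, and distinct count.
let
    // a tiny made-up table, just to see the shape of the profile output
    Sample = #table(
        type table [Id = Int64.Type, Amount = number],
        {{1, 10.5}, {2, 42.0}, {3, null}}
    ),
    Profile = Table.Profile(Sample)
in
    Profile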
This is where dataflows can help.
Remember the Excel-like, automatically-updating capabilities of linked and computed dataflow entities?[2] The most common use case for linked entities is for data transformation, but with Table.Profile you can also use linked entities to collect, consolidate, and maintain data profile information for the data stored in dataflow entities.
And it’s surprisingly simple.
Start with a workspace[3] and a dataflow, and add linked entities to it for each of the entities you want to profile.
For each linked entity in the dataflow, perform the following steps:
Right-click on the entity in the query editor and select “Reference” from the context menu to create a computed entity
Rename the new computed entity to include the word “profile”
Right-click on the renamed entity and select “Advanced editor” from the context menu
In the advanced editor, add a new query step that uses the Table.Profile function
Like this:
The edited query is very simple, and because all of the edits made to one query will apply without modification to each of the other queries, once the first one is done it’s just a copy and paste for each new profile entity. You can make this easier by putting the comma at the beginning of the profile line, rather than at the end of the source line, but it will work either way.
let
    Source = Site
    ,Profile = Table.Profile(Source) // note the placement of the comma
in
    Profile
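To show what that copy-and-paste workflow looks like, here is the same query for a second, hypothetical linked entity named Customer. Only the Source line changes; the profile step is pasted in unchanged, which is where the leading comma earns its keep.
let
    Source = Customer
    ,Profile = Table.Profile(Source) // pasted unchanged from the first profile entity
in
    Profile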
When you’re done, you’ll have a dataflow that contains data profiles for each linked entity, regardless of the workspace in which the linked entity originated.
Best of all, because the data profiles are stored in Power BI dataflow entities, which are in turn persisted in CDM Folders in Azure Data Lake Storage gen2, they can be consumed and processed in any tool for further analysis.
One of the biggest challenges for data governance is having current and accurate metadata available for enterprise data assets. Data profiles are only one part of this, but they’re a significant part. Because of the nature of linked entities in Power BI, we can now have up-to-date column-level profiles for our data, and we can have them without a major engineering effort and without any complex orchestration or management.
Life is good.
[1] If I understand correctly, the new feature in Power BI Desktop uses this function.
[2] If you don’t, you should probably read this post before you continue.
[3] Remember: to use linked entities this needs to be a new “v2” workspace, and it needs to be backed by Power BI Premium dedicated capacity.
I think about metadata a lot.[1] I probably think about metadata more than I think about swords, and that’s saying something.
I believe my love affair with metadata may have its roots in my college years when I took several anthropology courses from Dr. Ivan Brady. Dr. Brady changed the way I looked at the world, and I will never forget his most frequently used saying:
“Context is practically everything when it comes to determining meaning.”
— Dr. Ivan Brady
Dr. Brady wasn’t talking about metadata, but the statement still applies. Metadata provides context that is lacking from data. Metadata allows a user to understand the meaning of the data – its source, its purpose, its scope, its intended uses – without needing to explore the data itself in exhaustive detail.
In the context of enterprise data, metadata is absolutely vital. But not all metadata is created equal. Some metadata is swords, and some metadata is WiFi.
Please bear with me for a moment – I promise I’m going somewhere with this.
Oakeshott’s typology of medieval and early renaissance swords is among his most influential and most lasting works. Though his work was not entirely original, it was certainly groundbreaking. Dr. Jan Petersen had previously developed a typology for Viking swords consisting of twenty-six categories. Petersen’s typology was soon simplified by Dr. R. E. M. Wheeler to only seven categories (Types I–VII). This simplified typology was then slightly expanded by Oakeshott with the addition of two transitional types into its current nine categories (Types I–IX). From this basis, Oakeshott began work on his own thirteen-category typology of the medieval sword ranging from Type X to Type XXII.
What made Oakeshott’s typology unique was that he was one of the first people either within or outside of academia to seriously and systematically consider the shape and function of the blades of European medieval swords as well as the hilt, which had been the primary criterion of previous scholars. His typology traced the functional evolution of European swords over a period of five centuries, starting with the late Iron Age Type X, and took into consideration many factors: the shape of blades in cross section, profile taper, fullering, whether blades were stiff and pointed for thrusting or broad and flexible for cutting, etc. This was a breakthrough. Oakeshott’s books also dispelled many popular clichés about Western swords being heavy and clumsy. He listed the weights and measurements of many swords in his collection, which have become the basis for further academic work as well as templates for the creation of high-quality modern replicas.
And although the quote above doesn’t mention it, in addition to the primary types X through XXII, there are multiple subtypes as well, denoted by a lower-case letter following the roman numeral of the primary type.[4]
To summarize:
Oakeshott was working from a sample of data that wasn’t necessarily representative, and for which no meaningful metadata existed. He needed to reverse engineer the metadata from the available data, and to manually assign structure and consistency to it.
Earlier efforts to provide metadata for this data domain had focused on structural characteristics of the data, rather than the functional characteristics in which Oakeshott was interested.
Oakeshott was building on the efforts of earlier data stewards and expanded the work that they had done in one data domain, while also defining more comprehensive metadata for a new, larger, data domain.
Oakeshott’s work revealed significant discrepancies between the actual data and users’ perceptions of the data, and in doing so it enabled significant new opportunities to work with that data at scale.
Each metadata category is defined using an arcane and obtuse combination of letters and numbers to describe its members, such as Xa, XIIIb, and XVIIIb.
Even if you’ve never held a sword[3], this probably sounds familiar.
A lot of the data used in enterprise analytics wasn’t created with any metadata in mind. Other than table names, object names, and data types[5], there often isn’t much to go on. In order to understand the data, you need to look at and work with the data, at length. Efforts to develop structured metadata for these existing sources are more data archaeology than data science, and it is often difficult to know whether you have all of the data, whether you have taken into consideration every possible permutation of values… You get the idea. It’s hard, and it’s often very difficult to have strong confidence in the results you reach. Reverse-engineered metadata is better than no metadata, but…
But it’s better to take metadata into account right from the beginning, and to build it at the same time you’re building the data. Like WiFi.
Really.
OK, on to WiFi, in particular the IEEE 802.11 standard, also from Wikipedia:
The standard and amendments provide the basis for wireless network products using the Wi-Fi brand. While each amendment is officially revoked when it is incorporated in the latest version of the standard, the corporate world tends to market to the revisions because they concisely denote capabilities of their products. As a result, in the marketplace, each revision tends to become its own standard.
Let’s summarize this as well:
The metadata was defined before the data was created, rather than being inferred from existing data.
The metadata includes functional and structural characteristics, based on agreed-upon requirements.
All data is validated against the metadata in a consistent and standard manner as it is created.
Each metadata category is defined using an arcane and obtuse combination of letters and numbers to describe its members, such as 802.11ax, 802.11b, and 802.11n.
Each approach to metadata adds value, but it should be obvious that prioritizing metadata in your data architecture is key to data consistency, interoperability, and reuse.
When I buy a sword[6], I can use the Oakeshott type as a concrete way to describe and discuss the sword with its maker, or with my sword-loving friends. This is inherently valuable. But there are many swords that don’t fall neatly into this classification, which reduces that value.
When I buy wireless networking equipment, all I need to do is to look at the standards it implements. From this metadata I can immediately and authoritatively know what other networking equipment it will work with, and what functional characteristics it will implement.
Is your metadata swords, or is it WiFi? Would you rather have swords, or WiFi?
I really think about metadata a lot…
[1] I never metadata I didn’t like.
[2] If you’ve been watching Forged in Fire: Knife or Death, you’ve heard this name before. And if you know anything about swords and their classification, you cringed and cried out in pain when you heard this term misused by the hosts of the show.
[4] My favorite arming sword is a Type XIIIb. Its name is Joy.
[5] And if you’re using a data lake, you’ll be lucky to have this much.
[6] It will be an Angus Trim type XVII longsword, the younger twin of this one. It will be ready in January. I know this because I ordered it already. No, I haven’t told my wife yet, but she will understand.