Quick Tip: Factoring your dataflow entities

This post started as a response to this question from Mark, who was commenting on last week’s data lineage post:

How would you decide how big or how small to make each artifact in the lineage, in terms of the amount of transformations taking place inside the artifact? In my case they would only be shared with 2-3 other users.

For instance I could go all out and have every step that would previously take place in a query editor result in a new link in the data lineage chain, but that would probably be overkill.

I agree that “one step per dataflow” would be overkill, but beyond that the answer is largely “it depends.”

Image by Esi Grünhagen from Pixabay

The approach I generally take is to break the end-to-end data preparation down into blocks that look like this:

  1. Staging – getting the source data into the system (in this case a dataflow, but it could be a data mart, data warehouse, data lake, etc.) with zero or minimal transformations
  2. Cleansing – correcting known data quality and format problems in the staged data
  3. Transformation 1 – getting the cleansed data into the shape required for intended downstream purposes
  4. Enrichment – adding data from other sources, which have ideally already gone through steps 1 through 3
  5. Transformation 2 – getting the cleansed and enriched data into the shape required for analysis, typically as dimensions and facts
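
To make the five blocks a little more concrete, here is a minimal sketch in Python rather than Power Query, with each block written as its own function. Everything in it is an assumption made for illustration: the sample rows, the region lookup, and the function names are hypothetical and are not part of any Power BI API. In a real dataflow each function would roughly correspond to one entity that references the previous one.

```python
# Hypothetical sample data; none of these names come from Power BI.
RAW_ORDERS = [
    {"order_id": "1", "customer": " alice ", "amount": "10.50"},
    {"order_id": "2", "customer": "BOB", "amount": "7.00"},
    {"order_id": "3", "customer": "alice", "amount": "4.50"},
    {"order_id": "4", "customer": "carol", "amount": "n/a"},
]
# A second source, assumed to have already been staged and cleansed.
CUSTOMER_REGIONS = {"alice": "EMEA", "bob": "APAC"}


def stage(source):
    """Staging: copy the source rows as-is, with no transformations."""
    return [dict(row) for row in source]


def cleanse(rows):
    """Cleansing: fix known quality and format problems."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows whose amount cannot be parsed
        cleaned.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().lower(),
            "amount": amount,
        })
    return cleaned


def transform_1(rows):
    """Transformation 1: shape for downstream use (total per customer)."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return [{"customer": c, "total": t} for c, t in sorted(totals.items())]


def enrich(rows, regions):
    """Enrichment: add attributes from another prepared source."""
    return [{**row, "region": regions.get(row["customer"], "Unknown")} for row in rows]


def transform_2(rows):
    """Transformation 2: split into a dimension table and a fact table."""
    dim = [{"customer_key": i, "customer": r["customer"], "region": r["region"]}
           for i, r in enumerate(rows, start=1)]
    fact = [{"customer_key": d["customer_key"], "total": r["total"]}
            for d, r in zip(dim, rows)]
    return dim, fact


# Each block consumes only the output of the block before it.
staged = stage(RAW_ORDERS)
cleansed = cleanse(staged)
shaped = transform_1(cleansed)
enriched = enrich(shaped, CUSTOMER_REGIONS)
dim_customer, fact_sales = transform_2(enriched)
```

Because each block reads only the output of the previous one, a change to a cleansing rule touches one function (or one entity) and leaves staging and the downstream shapes alone, which is the maintainability benefit this kind of factoring is after.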

The final step may also be performed in the queries that are used to create the final tabular model when creating a dataset in Power BI Desktop. If a given dimension is likely to be used in multiple datasets, implement it as a dataflow entity. If it isn’t, implement it as a table in your dataset.
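
The reuse rule above can be written down as a small check. This is only a sketch in Python under stated assumptions: the dataset and dimension names are made up, and `place_dimensions` is a hypothetical helper, not something Power BI provides. It counts how many planned datasets use each dimension and suggests a dataflow entity for anything used more than once.

```python
from collections import Counter


def place_dimensions(datasets):
    """Suggest where each dimension should live, based on how many datasets use it."""
    # Count each dimension once per dataset, even if it is listed twice.
    usage = Counter(dim for dims in datasets.values() for dim in set(dims))
    return {
        dim: "dataflow entity" if count > 1 else "dataset table"
        for dim, count in usage.items()
    }


# Hypothetical plan: which dimensions each dataset is expected to need.
planned = {
    "Sales": ["Date", "Customer", "Promotion"],
    "Inventory": ["Date", "Warehouse"],
}
placement = place_dimensions(planned)
```

With this plan, only `Date` is shared by both datasets, so it is the one dimension the sketch would suggest building as a dataflow entity.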

These guidelines tend to create a moderate number of easily maintainable entities, but they’re obviously the bare minimum – take what works for you, and discard the rest.

I feel like I’m dating myself with this link[1], but I definitely recommend looking at the Kimball Group’s techniques for data warehousing and BI: resources link. Ralph Kimball and his amazing team know more about this stuff than I will ever forget (or something like that) and there’s a huge volume of guidance available. Do yourself a favor and check it out.


[1] I assume there are newer resources out there, but when I was your age it was the Kimball Method or the… synonym for highway that rhymes with method.

3 thoughts on “Quick Tip: Factoring your dataflow entities”

  1. Pingback: Dataflows in Power BI – BI Polar

  2. Thanks Matthew for turning this into a new blog post!
    The steps you mentioned provide a great guideline for beginners to Dataflow architecture.
    Added a Kimball book to my Safari O’Reilly read playlist too, good stuff.

  3. Pingback: Figuring Dataflow Boundaries – Curated SQL
