Web Scraping with Html.Table in Power Query

The older I get, the more consistently I encounter things that make me feel old. And it’s not just the usual things, like those damned kids on my lawn. It’s things like technology. And sometimes it’s things like new features in Power BI.

Let’s take, for example, the new “from web” experience that was introduced in Power BI Desktop over the summer, and which is generally available (out of preview) this month. This is a cool feature that at first glance looks pretty straightforward, but it caused me no end of problems.

I started off by reading the available information:

When it became clear that I wasn’t going to be able to do what I needed with only the UI, I started looking into the Power Query functions being used in the query created by the UI. The code in my PBIX file didn’t make any sense to me, so I look in more places, like:

And still, I couldn’t quite figure out how to solve my problem. I was frustrated and annoyed , not just because I couldn’t do what I needed, but because I was convinced that there was some simple, easy thing that I was missing.

And I was right.

This brings me back to feeling old.

When I was your age[1], we only had one HTML tag, and it was <marquee> and we were glad to have it. We didn’t need any of those fancy stylesheets that kids these days have cascading all over the place, because we had H1 and H2 and that was more than enough for any page we could think of.

All pretense aside, I didn’t know that CSS selectors existed before I started trying to use the Html.Table function in Power Query, and that was a real problem. The inline help mentioned CSS selectors in the function description, but without more prior knowledge and context, this wasn’t helpful to me. At all. So I asked for help.

One of the developers on the Power BI team took the time to listen to what I was trying to accomplish, and gave me the code that I’d been so consistently failing to write on my own[2]. Once I had something simple that worked, I was able – with a little more research and some reformatting – to build the queries that I needed, but I don’t think I could have done it without that help.

So, working under the assumption that I’m not the only old guy out there who knows data pretty well, but hasn’t kept up to date on web technologies like CSS, I wanted to share the story and the code.

Here’s the pattern I was trying to implement. This is a relatively common pattern in my experience, although the details are different each time:

  1. Extract a set of links and related data from a parent web page
  2. For each of the links extracted from the parent, extract a set of data from the child web page, where each of the child pages has a consistent format
  3. Use the data from the parent and the child pages to build a data set for reporting and analysis

The specific solution I needed to implement was this:

  1. Extract album name, release date, song names, and links to the individual song lyric pages, from this page: https://www.azlyrics.com/m/manowar.html
  2. Extract the lyrics from each individual song page, like this one: https://www.azlyrics.com/lyrics/manowar/fightingtheworld.html
  3. Perform mission-critical analysis on the lyrics.[3]

Let’s take a look at my learning process step by step.

I started off by using the UI, stepping through the new “From web” experience in Power BI Desktop. I knew that this wouldn’t give me everything I needed – you can’t get the URLs behind a link, you can only get the text – but I hoped it would give me a starting point.

from web

It gave me this:

Editor

I’m sharing this query as an image and not as text because I want to highlight one of the challenges I faced: the formatting. Because the Html.Table function accepts a complex parameter, I needed to “pretty format” it before I could visually parse it and compare the specific syntax to the available help, and to the source of the web page I was trying to use as a data source.

Even then, I was still stuck. There were not enough points of reference between the HTML and the query for me to relate them and to start modifying the query script. After fighting with this for longer than I care to admit, I asked for help. In return, I got this query:

let
    Source = Web.BrowserContents("https://www.azlyrics.com/m/manowar.html"),
    Custom1 = Html.Table(Source, {{"Song", "a"}, {"Song Link", "a", each [Attributes][href]}}, [RowSelector="div#listAlbum > a[href^='../lyrics/']"])
in
    Custom1

Before I could use it, I had to reformat it like this:

let
    Source = Web.BrowserContents("https://www.azlyrics.com/m/manowar.html"),
    Custom1 = Html.Table
        (
        Source,
            {
                {"Song", "a"},
                {"Song Link", "a", each [Attributes][href]}
            },
            [RowSelector="div#listAlbum > a[href^='../lyrics/']"]
        )
in
    Custom1

That was much better! With this code, formatted so it was clear how the various parts of the columnNameSelectorPairs parameter related to each other, I could finally make some progress on my own. In short order I had this query:

 let
    Source = Web.BrowserContents("https://www.azlyrics.com/m/manowar.html"),
    Custom1 = Html.Table
        (
        Source,
            {
                {"Song", "a"}
               ,{"Song Link", "a", each [Attributes][href]}
               ,{"Album", "div.album"}
            },
            [RowSelector="div#listAlbum > a[href^='../lyrics/'], .album , b"]
        ),
    #"Filled Down" = Table.FillDown(Custom1,{"Album"}),
    #"Inserted Text Between Delimiters" = Table.AddColumn(#"Filled Down", "Release Year", each Text.BetweenDelimiters([Album], "(", ")"), type text),
    #"Inserted Text Between Delimiters1" = Table.AddColumn(#"Inserted Text Between Delimiters", "Album Name", each Text.BetweenDelimiters([Album], """", """"), type text),
    #"Filtered Rows" = Table.SelectRows(#"Inserted Text Between Delimiters1", each [Song] <> null),
    #"Replaced Value" = Table.ReplaceValue(#"Filtered Rows","../","https://www.azlyrics.com/",Replacer.ReplaceText,{"Song Link"}),
    #"Removed Columns" = Table.RemoveColumns(#"Replaced Value",{"Album"}),
    #"Reordered Columns" = Table.ReorderColumns(#"Removed Columns",{"Album Name", "Release Year", "Song", "Song Link"}),
    #"Invoked Custom Function" = Table.AddColumn(#"Reordered Columns", "Lyrics", each fn_GetLyrics([Song Link])),
    #"Expanded Lyrics" = Table.ExpandTableColumn(#"Invoked Custom Function", "Lyrics", {"Lyrics"}, {"Lyrics.1"}),
    #"Renamed Columns" = Table.RenameColumns(#"Expanded Lyrics",{{"Lyrics.1", "Lyrics"}}),
    #"Removed Columns1" = Table.RemoveColumns(#"Renamed Columns",{"Song Link"})
in
    #"Removed Columns1"

The function fn_GetLyrics uses the same technique to pull the actual lyrics from the child pages:

(lyricsPageURL) =>

let
   Source = Web.BrowserContents(lyricsPageURL),
   Custom1 = Html.Table
   (
      Source,
      {
         {"Lyrics", "div"}
      }

      ,[RowSelector="br+ div"]
   ),
   #"Kept First Rows" = Table.FirstN(Custom1,1)
in
   #"Kept First Rows"

And now I have the data I need – I just need to analyze it.

There were two additional tools that were invaluable to my success.

The first tool is the CSS selectors reference at W3Schools.com,  which provides the syntax for each CSS selector, an example of the syntax, and a description of each example. Just what I needed!

The second tool is the SelectorGadget Chrome extension, which  lets you click on elements on the page, and see the CSS selector that would select them. This may not have been an absolute requirement, but it definitely saved me time and swearing.

Hopefully this has been useful. I may be the only old dog out there trying to learn new tricks, but then again, I may not be. Ether way, get off my lawn!

 


[1] Please go back and read that in your best “crotchety old man” voice. Thank you.

[2] Thanks again, Ehren!

[3] Or maybe build a demo for the BI Power Session at the PASS Summit next month. It’s hard to tell the difference sometimes.

9 thoughts on “Web Scraping with Html.Table in Power Query

  1. Pingback: Authoring Power BI Dataflows in Power BI Desktop – BI Polar

    1. You’re asking me to remember something from last year? 😉

      Yes, I do remember having problems with this specific site. I opened a bug with the PQ team, and they fixed it. As of this morning, I can load this page in both Power BI Desktop and Power Query Online. FWIW, I am using the November build of Desktop, so I cannot attempt a repro in the October build.

      Like

  2. Pingback: Diversity of Perspective – BI Polar

  3. Came to read about the webscraping but now I’m on a trip down Memory Lane and “We’re warriors, warriors of the world” is blasting at full power (in my mind thought, the kid’s asleep). Thanks 🙂

    Liked by 1 person

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s