The older I get, the more consistently I encounter things that make me feel old. And it’s not just the usual things, like those damned kids on my lawn. It’s things like technology. And sometimes it’s things like new features in Power BI.
Let’s take, for example, the new “from web” experience that was introduced in Power BI Desktop over the summer, and which is generally available (out of preview) this month. This is a cool feature that at first glance looks pretty straightforward, but it caused me no end of problems.
I started off by reading the available information:
- The announcement: https://powerbi.microsoft.com/en-us/blog/power-bi-desktop-july-2018-feature-summary/
- The documentation: https://docs.microsoft.com/en-us/power-bi/desktop-connect-to-web-by-example
- The video: https://www.youtube.com/watch?v=MhrhIXNtelw
When it became clear that I wasn’t going to be able to do what I needed with only the UI, I started looking into the Power Query functions being used in the query created by the UI. The code in my PBIX file didn’t make any sense to me, so I look in more places, like:
- The inline help in Power BI Desktop:
- Chris Webb’s blog: https://blog.crossjoin.co.uk/2018/08/30/power-bi-extract-urls-web-page/
- Miguel Escobar’s blog: https://www.poweredsolutions.co/2018/05/14/new-web-scraping-experience-in-power-bi-power-query-using-css-selectors/
And still, I couldn’t quite figure out how to solve my problem. I was frustrated and annoyed , not just because I couldn’t do what I needed, but because I was convinced that there was some simple, easy thing that I was missing.
And I was right.
This brings me back to feeling old.
When I was your age[1], we only had one HTML tag, and it was <marquee> and we were glad to have it. We didn’t need any of those fancy stylesheets that kids these days have cascading all over the place, because we had H1 and H2 and that was more than enough for any page we could think of.
All pretense aside, I didn’t know that CSS selectors existed before I started trying to use the Html.Table function in Power Query, and that was a real problem. The inline help mentioned CSS selectors in the function description, but without more prior knowledge and context, this wasn’t helpful to me. At all. So I asked for help.
One of the developers on the Power BI team took the time to listen to what I was trying to accomplish, and gave me the code that I’d been so consistently failing to write on my own[2]. Once I had something simple that worked, I was able – with a little more research and some reformatting – to build the queries that I needed, but I don’t think I could have done it without that help.
So, working under the assumption that I’m not the only old guy out there who knows data pretty well, but hasn’t kept up to date on web technologies like CSS, I wanted to share the story and the code.
Here’s the pattern I was trying to implement. This is a relatively common pattern in my experience, although the details are different each time:
- Extract a set of links and related data from a parent web page
- For each of the links extracted from the parent, extract a set of data from the child web page, where each of the child pages has a consistent format
- Use the data from the parent and the child pages to build a data set for reporting and analysis
The specific solution I needed to implement was this:
- Extract album name, release date, song names, and links to the individual song lyric pages, from this page: https://www.azlyrics.com/m/manowar.html
- Extract the lyrics from each individual song page, like this one: https://www.azlyrics.com/lyrics/manowar/fightingtheworld.html
- Perform mission-critical analysis on the lyrics.[3]
Let’s take a look at my learning process step by step.
I started off by using the UI, stepping through the new “From web” experience in Power BI Desktop. I knew that this wouldn’t give me everything I needed – you can’t get the URLs behind a link, you can only get the text – but I hoped it would give me a starting point.
It gave me this:
I’m sharing this query as an image and not as text because I want to highlight one of the challenges I faced: the formatting. Because the Html.Table function accepts a complex parameter, I needed to “pretty format” it before I could visually parse it and compare the specific syntax to the available help, and to the source of the web page I was trying to use as a data source.
Even then, I was still stuck. There were not enough points of reference between the HTML and the query for me to relate them and to start modifying the query script. After fighting with this for longer than I care to admit, I asked for help. In return, I got this query:
let Source = Web.BrowserContents("https://www.azlyrics.com/m/manowar.html"), Custom1 = Html.Table(Source, {{"Song", "a"}, {"Song Link", "a", each [Attributes][href]}}, [RowSelector="div#listAlbum > a[href^='../lyrics/']"]) in Custom1
Before I could use it, I had to reformat it like this:
let Source = Web.BrowserContents("https://www.azlyrics.com/m/manowar.html"), Custom1 = Html.Table ( Source, { {"Song", "a"}, {"Song Link", "a", each [Attributes][href]} }, [RowSelector="div#listAlbum > a[href^='../lyrics/']"] ) in Custom1
That was much better! With this code, formatted so it was clear how the various parts of the columnNameSelectorPairs parameter related to each other, I could finally make some progress on my own. In short order I had this query:
let Source = Web.BrowserContents("https://www.azlyrics.com/m/manowar.html"), Custom1 = Html.Table ( Source, { {"Song", "a"} ,{"Song Link", "a", each [Attributes][href]} ,{"Album", "div.album"} }, [RowSelector="div#listAlbum > a[href^='../lyrics/'], .album , b"] ), #"Filled Down" = Table.FillDown(Custom1,{"Album"}), #"Inserted Text Between Delimiters" = Table.AddColumn(#"Filled Down", "Release Year", each Text.BetweenDelimiters([Album], "(", ")"), type text), #"Inserted Text Between Delimiters1" = Table.AddColumn(#"Inserted Text Between Delimiters", "Album Name", each Text.BetweenDelimiters([Album], """", """"), type text), #"Filtered Rows" = Table.SelectRows(#"Inserted Text Between Delimiters1", each [Song] <> null), #"Replaced Value" = Table.ReplaceValue(#"Filtered Rows","../","https://www.azlyrics.com/",Replacer.ReplaceText,{"Song Link"}), #"Removed Columns" = Table.RemoveColumns(#"Replaced Value",{"Album"}), #"Reordered Columns" = Table.ReorderColumns(#"Removed Columns",{"Album Name", "Release Year", "Song", "Song Link"}), #"Invoked Custom Function" = Table.AddColumn(#"Reordered Columns", "Lyrics", each fn_GetLyrics([Song Link])), #"Expanded Lyrics" = Table.ExpandTableColumn(#"Invoked Custom Function", "Lyrics", {"Lyrics"}, {"Lyrics.1"}), #"Renamed Columns" = Table.RenameColumns(#"Expanded Lyrics",{{"Lyrics.1", "Lyrics"}}), #"Removed Columns1" = Table.RemoveColumns(#"Renamed Columns",{"Song Link"}) in #"Removed Columns1"
The function fn_GetLyrics uses the same technique to pull the actual lyrics from the child pages:
(lyricsPageURL) => let Source = Web.BrowserContents(lyricsPageURL), Custom1 = Html.Table ( Source, { {"Lyrics", "div"} } ,[RowSelector="br+ div"] ), #"Kept First Rows" = Table.FirstN(Custom1,1) in #"Kept First Rows"
And now I have the data I need – I just need to analyze it.
There were two additional tools that were invaluable to my success.
The first tool is the CSS selectors reference at W3Schools.com, which provides the syntax for each CSS selector, an example of the syntax, and a description of each example. Just what I needed!
The second tool is the SelectorGadget Chrome extension, which lets you click on elements on the page, and see the CSS selector that would select them. This may not have been an absolute requirement, but it definitely saved me time and swearing.
Hopefully this has been useful. I may be the only old dog out there trying to learn new tricks, but then again, I may not be. Ether way, get off my lawn!
[1] Please go back and read that in your best “crotchety old man” voice. Thank you.
[2] Thanks again, Ehren!
[3] Or maybe build a demo for the BI Power Session at the PASS Summit next month. It’s hard to tell the difference sometimes.
Pingback: Authoring Power BI Dataflows in Power BI Desktop – BI Polar
Thank you Matthew, I also falls into the “older” category with similar frustration. Happy hunting. Prinsloo
LikeLiked by 1 person
I just want to say thanks a lot brother! You’ve saved me many heart & head aches!
LikeLike
Thank you!
That sort of feedback makes it all worthwhile. 😉
LikeLike
Hi Matt,
Great work (I’m old too).
In the Oct 2019 Update of Power BI, I get an error “Something went wrong”
when trying to load https://www.azlyrics.com/lyrics/manowar/fightingtheworld.html
Did you come across this last year when initially attempting to use the Get Data –> Web functionality?
LikeLike
You’re asking me to remember something from last year? 😉
Yes, I do remember having problems with this specific site. I opened a bug with the PQ team, and they fixed it. As of this morning, I can load this page in both Power BI Desktop and Power Query Online. FWIW, I am using the November build of Desktop, so I cannot attempt a repro in the October build.
LikeLike
Pingback: Diversity of Perspective – BI Polar
Came to read about the webscraping but now I’m on a trip down Memory Lane and “We’re warriors, warriors of the world” is blasting at full power (in my mind thought, the kid’s asleep). Thanks 🙂
LikeLiked by 1 person
They still include this song on almost every tour. That memory could be a concert yet to come!
LikeLike