July is “Power Query Month”, where the theme is focused less on visualization and more on data preparation. Every Power BI challenge requires Power Query to get and transform data, but now we’re going deeper in July! If you haven’t checked out Dave’s challenge from last week, make sure to check it out here.
This week we are going to use Power Query to scrape data from a website utilizing a function to loop through the web scrape. We are then going to take that data and create the report that is embedded below.
- In Power Query, create a Web query that scrapes years 2000 through 2020 from the website: https://www.baseball-reference.com/leagues/majors/YYYY-standings.shtml
- First, create a list of the desired years using a Blank Query
- Second, use another Blank Query to create a Function that cycles through the base website for every one of the desired years
- Convert list of years to table, add a column using the Invoke Custom Function method
- Expand table
- Filter out “Average” from the field [Tm]
- Change data types as appropriate
- Create a new column in Power Query from the standings query that is Games Above .500 = [W] – [L]
- Create a second Web query that pulls in the data from the Team ID’s page from the follow website: https://www.baseball-reference.com/about/team_IDs.shtml
- Filter out all rows that have a Last Year before the first year that you pulled the data for (2000 in my instance)
- Create a data model relating the Tm field in the fact table to the Full Team Name in the dimension table
- You will find that there is one team name in the fact table, the Los Angeles Angels, that is not found in the dimension table. The Franchise ID that this value needs to be assigned to is LAA. Feel free to solve for adding this value to the dimension table in whatever way you see fit. I duplicated the original query, filtered down to one value with the wrong name, replaced values, then appended to original query.
- Using bar charts, create a small multiple chart showing each of the team’s seasons results with the Year being the X axis, the amount of games over or under .500 being the Y axis, and the Franchise IDs being the Small Multiples
- Differentiate seasons that teams won more than they lost by using different colors for the bars
This week’s dataset will be scraped from the web from the following two websites:
After you finish your workout, share on Twitter using the hashtags #WOW2021 and #PowerBI, and tag @JSBaucke, @MMarie, @shan_gsd and @dataveld. Also make sure to fill out the Submission Tracker so that we can count you as a participant this week in order to track our participation throughout the year.
3 thoughts on “2021 Week 28 | Power BI: Web Scraping with Power Query”
Hey Spencer! It’s easier and quicker to use the UI to create the parameter and the function. Check Patrick’s video: Do I need to be an expert in Power Query M? https://www.youtube.com/watch?v=Z0U9UL9ORh8
Loved this challenge. Couldn’t work out how you managed to have all the axis showing without the scroll bar so to see the tip on using the custom page size and then expanding was really useful
In addition to the missing Los Angeles Angels, I witnessed another problem.
Cleveland Indians received a new name due to some controversies, and is now called Cleveland Guardians.
I simply replaced all team name values in the fact table.