How to Build a Football Dataset with Web Scraping
By Otávio Simões Silveira, Economist, Aspiring Data Scientist
To deal with this problem, Selenium can be an interesting option. It works by opening an automated browser, which lets it access the entire content of the page and interact with it.
Understanding the Website
The Premier League website makes scraping multiple matches pretty simple thanks to its very straightforward URLs. The URL for a match consists of “https://www.premierleague.com/match/” followed by a unique match ID.
Each ID consists of a number and the IDs for all matches of each season are sequenced. For instance, the entire 2019/20 season goes from 46605 to 46984. All we need to do then is to loop through this interval and collect the data from each match.
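As a minimal sketch of that loop, the snippet below builds the URL for every match ID in the 2019/20 range quoted above. The names BASE_URL, FIRST_ID, LAST_ID and match_urls are illustrative and not taken from the original code.

```python
BASE_URL = "https://www.premierleague.com/match/"
FIRST_ID, LAST_ID = 46605, 46984  # 2019/20 season, as stated in the article


def match_urls(first_id=FIRST_ID, last_id=LAST_ID):
    """Return one URL per match ID, both ends of the range included."""
    return [f"{BASE_URL}{match_id}" for match_id in range(first_id, last_id + 1)]


urls = match_urls()
print(len(urls))  # 380
print(urls[0])    # https://www.premierleague.com/match/46605
```

The range is inclusive on both ends, which is why the upper bound is last_id + 1.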
We’ll use Liverpool’s 5–3 win over Chelsea as the example in this article. The ID of this game is 46968. You can type this ID after “premierleague.com/match/” to open the page and follow along with the scraping process described here. Refer back to the page whenever necessary.
To begin with the code, we’ll make our imports and initialize two empty lists, one for dealing with errors, which will be explained later in the article, and the other to store the data of every match we scrape.
Within the loop, the URL will be created using the match ID, we’ll set up Selenium, and the driver object will be instantiated. No advanced configuration will be used here. The option.headless = True line states that we don’t want to actually see the browser opening and going to the website to collect the data. With that done, we’ll use the driver object to get the page.
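A rough sketch of this setup could look like the following. It assumes Chrome with a matching driver available on the machine, and the helper names make_driver and open_match are hypothetical. Recent Selenium 4 releases no longer accept the option.headless attribute, so the sketch passes the headless flag as a browser argument instead.

```python
def make_driver():
    """Return a headless browser driver (sketch; Chrome is an assumption)."""
    # Imported here so that defining the helper does not itself require
    # Selenium to be installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    # Equivalent in spirit to `option.headless = True`: no visible window.
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def open_match(match_id, base_url="https://www.premierleague.com/match/"):
    """Start a driver and load the page of the given match ID."""
    driver = make_driver()
    driver.get(f"{base_url}{match_id}")
    return driver
```

Calling open_match(46968) would load the example match used in this article. The caller is responsible for calling driver.quit() afterwards.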
And we’re now set to begin the scraping. We’ll first collect the date of the match and the teams involved. We’ll also use datetime to convert the date format from “Wed 22 Jul 2020” to 07/22/2020.
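The date conversion can be sketched with the standard library alone. The function name convert_date is illustrative, and the format strings assume the page shows dates exactly as in the example above, with English day and month abbreviations.

```python
from datetime import datetime


def convert_date(raw_date):
    """Convert a date such as 'Wed 22 Jul 2020' to 'MM/DD/YYYY'."""
    parsed = datetime.strptime(raw_date, "%a %d %b %Y")
    return parsed.strftime("%m/%d/%Y")


print(convert_date("Wed 22 Jul 2020"))  # 07/22/2020
```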
Each element is found through its XPath, but it can also be found by name, class, tag, and more. Check all the selectors here.
Notice that we had to use the WebDriverWait and the
To scrape the final scores, we first need to get the text from inside what I call the score box, which returns the text “5–3”, and then split it into the home team score and the away team score.
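Splitting the score box text might look like the sketch below. The function name parse_score is hypothetical, and because the exact separator character on the page is not confirmed here, the pattern accepts both a plain hyphen and an en dash.

```python
import re


def parse_score(score_text):
    """Split text such as '5-3' into (home_score, away_score) integers."""
    home, away = re.split(r"\s*[-–]\s*", score_text.strip())
    return int(home), int(away)


home_score, away_score = parse_score("5-3")
print(home_score, away_score)  # 5 3
```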
The next step is to get the stats of the game. This data is a table under the stats tab on the page. We could simply read the page source using the Pandas read_html function, but this part of the page is only rendered after we click on the tab.
The first thing to do then is to find the tab element and click on it with Selenium. After that, we can use the read_html function. This function returns a list with all tables on the page stored as DataFrames. We then select the last element in the list, which is the one we are after. The scraping is now done, so we can just quit the driver.
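A sketch of this step, under a few assumptions: the helper name read_stats_table is hypothetical, the XPath of the stats tab is not given in this article and must be supplied by the caller, and pandas needs an HTML parser such as lxml installed. The page source is wrapped in StringIO because newer pandas versions discourage passing raw HTML strings to read_html.

```python
import io


def read_stats_table(driver, tab_xpath, timeout=10):
    """Click the stats tab, then return the last table on the page."""
    # Imported here so that defining the helper does not itself require
    # Selenium or pandas to be installed.
    import pandas as pd
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Wait until the tab can be clicked, then click it so the table renders.
    tab = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.XPATH, tab_xpath))
    )
    tab.click()

    # read_html returns one DataFrame per <table>; the stats table is last.
    tables = pd.read_html(io.StringIO(driver.page_source))
    return tables[-1]
```

If the table takes a moment to render after the click, an extra wait before reading the page source may be needed.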
Selenium can be a little unstable sometimes and take too long to load the page. This can raise a couple of errors since we’re scraping hundreds of pages.
To deal with this, we’ll need the try and except clauses. If an error is raised while collecting the data, the code will append the match ID to the errors list and move on to the next match without crashing. When all the scraping is done, you can check this list and scrape only the matches that are missing. This is the code for all this:
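The error-handling pattern on its own can be sketched as follows. Here scrape_match is a hypothetical stand-in for the whole per-match scraping routine, and fake_scrape simulates a page that fails to load so the example runs without a browser.

```python
def scrape_season(match_ids, scrape_match):
    """Scrape every match, recording the IDs that fail instead of crashing."""
    season, errors = [], []
    for match_id in match_ids:
        try:
            season.append(scrape_match(match_id))
        except Exception:
            # Keep the ID so this match can be retried later.
            errors.append(match_id)
    return season, errors


def fake_scrape(match_id):
    """Simulated scraper: one match raises, the others return a row."""
    if match_id == 46606:
        raise TimeoutError("page took too long to load")
    return [match_id]


season, errors = scrape_season(range(46605, 46608), fake_scrape)
print(errors)  # [46606]
```

Passing the errors list back into scrape_season afterwards retries only the missing matches.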
Manipulating the Stats
This is how the stats DataFrame looks right now:
    Liverpool       Unnamed: 1  Chelsea
0          50     Possession %       50
1           7  Shots on target        5
2          10            Shots       10
3         749          Touches      752
4         584           Passes      575
5          19          Tackles        9
6          14       Clearances       15
7           6          Corners        0
8           0         Offsides        3
9           1     Yellow cards        0
10          8   Fouls conceded       11
As we need to store all this in a row of a DataFrame, this format is not good. To fix this, we’ll create two dictionaries, one for each team, in which every key will represent a stat. This is the entire process:
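One way to sketch this reshaping is shown below. The function name split_stats is illustrative. It works on plain rows of (home value, stat name, away value), which is what stats.values.tolist() yields for a DataFrame laid out like the one above, and the sample rows are copied from that table.

```python
def split_stats(rows):
    """Turn (home_value, stat_name, away_value) rows into two dictionaries."""
    home_stats, away_stats = {}, {}
    for home_value, stat_name, away_value in rows:
        home_stats[stat_name] = home_value
        away_stats[stat_name] = away_value
    return home_stats, away_stats


rows = [
    (50, "Possession %", 50), (7, "Shots on target", 5), (10, "Shots", 10),
    (749, "Touches", 752), (584, "Passes", 575), (19, "Tackles", 9),
    (14, "Clearances", 15), (6, "Corners", 0), (0, "Offsides", 3),
    (1, "Yellow cards", 0), (8, "Fouls conceded", 11),
]
home_stats, away_stats = split_stats(rows)
print(home_stats["Tackles"], away_stats["Tackles"])  # 19 9
```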
Making the Data Consistent
Notice that we don’t have the red card stats in the stats DataFrame. That’s because there were no red cards in this game. When there are no occurrences of a stat, the website doesn't show that stat.
If this isn’t fixed, some rows will be longer than others and the data will be inconsistent. To fix this, we’ll use a list containing all the expected stats. If any value in this list is not a key of the stats dictionaries (we only need to check one of them), that stat will be added as a key to both dictionaries with the value zero.
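The check can be sketched like this. The contents of stats_check are an assumption: the stats shown for the example match plus red cards, and the real list may contain other entries. The function name fill_missing_stats is hypothetical.

```python
# Assumed list of expected stats (the example table plus "Red cards").
stats_check = [
    "Possession %", "Shots on target", "Shots", "Touches", "Passes",
    "Tackles", "Clearances", "Corners", "Offsides", "Yellow cards",
    "Red cards", "Fouls conceded",
]


def fill_missing_stats(home_stats, away_stats, expected=stats_check):
    """Add every expected stat that the page omitted, with the value zero."""
    for stat in expected:
        # Both dictionaries share the same keys, so checking one is enough.
        if stat not in home_stats:
            home_stats[stat] = 0
            away_stats[stat] = 0


home = {"Yellow cards": 1, "Fouls conceded": 8}
away = {"Yellow cards": 0, "Fouls conceded": 11}
fill_missing_stats(home, away)
print(home["Red cards"], away["Red cards"])  # 0 0
```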
All that is left now is to create a new list with everything that was scraped for this match and append this list to the season list that contains all the matches.
Once we’ve finished scraping all matches in the season, we can just transform the season list of lists into a DataFrame and export the data as a .csv file. The stats_check list is also used to build the list of names for the DataFrame’s columns.
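A possible sketch of this last step follows. The column names and the home_/away_ prefixes are hypothetical, as are the function names, and each row in the season list is assumed to follow the same order as the columns: date, teams, scores, home stats, then away stats.

```python
def build_columns(stats_check):
    """Build column names: match info, then home stats, then away stats."""
    base = ["date", "home_team", "away_team", "home_score", "away_score"]
    home = [f"home_{stat}" for stat in stats_check]
    away = [f"away_{stat}" for stat in stats_check]
    return base + home + away


def export_season(season, stats_check, path="season.csv"):
    """Turn the list of match rows into a DataFrame and write it to CSV."""
    # Imported here so that build_columns can be used without pandas.
    import pandas as pd

    frame = pd.DataFrame(season, columns=build_columns(stats_check))
    frame.to_csv(path, index=False)
    return frame


columns = build_columns(["Shots", "Red cards"])
print(len(columns))  # 9
```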
You can see the complete code here.
Finally, this is the data scraped:
380 matches. This is the entire Premier League 2019/20 season in a dataset! And you can do even more: if you use the ID 1 you’ll go back to the 1992/93 season. But the IDs aren’t linear from 1992 to today because at some point the IDs began to cover cup matches, youth matches, and women’s matches as well.
However, you can find the IDs for almost every Premier League match since the 2011/12 season here if you want to have a dataset with thousands and thousands of matches.
If you’re going for that, make sure to insert more pauses in your code, using WebDriverWait or even the sleep function, to avoid having your IP blocked for making too many requests to the website. Another possibility is to get in touch with a proxy provider, such as Infatica, as they’ll be able to provide you with a better infrastructure of IP addresses to keep your code running.
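A simple way to add such pauses is a randomized sleep between matches. This is only a sketch: the function name polite_pause and the default bounds are assumptions, and a suitable delay depends on the website.

```python
import random
import time


def polite_pause(min_seconds=2.0, max_seconds=5.0):
    """Sleep for a random duration within the bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


# Tiny bounds here so the example finishes immediately.
waited = polite_pause(0.0, 0.01)
```

Calling polite_pause() at the end of each loop iteration spaces out the requests, and the random component avoids a perfectly regular request pattern.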
And to go one step further, you can always scrape more data about each game. With a few more lines of code, you can have in your dataset information such as the referee, the stadium and the city where each match took place, the attendance, the halftime score, the goal scorers, the lineups, and much more!
Bio: Otávio Simões Silveira is an economist and aspiring data scientist.
Original. Reposted with permission.