How to Build a Football Dataset with Web Scraping

This article covers using Selenium to scrape JavaScript-rendered content.

By Otávio Simões Silveira, Economist, Aspiring Data Scientist

When scraping a website with Python using libraries such as BeautifulSoup, requests, or urllib, it’s common to have trouble accessing some parts of the site. That's because these parts are generated on the client side with JavaScript, which these libraries can’t execute.

Selenium is a good option for dealing with this problem. It works by opening an automated browser, so it can access the fully rendered content and interact with the page.

This article covers scraping JavaScript-rendered content with Selenium, using the Premier League website as an example and collecting the stats of every match in the 2019/20 season.

Understanding the Website

The Premier League website makes scraping multiple matches pretty simple thanks to its very straightforward URLs. The URL for a match consists of “https://www.premierleague.com/match/” followed by a unique match ID.

Each ID consists of a number and the IDs for all matches of each season are sequenced. For instance, the entire 2019/20 season goes from 46605 to 46984. All we need to do then is to loop through this interval and collect the data from each match.
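As a minimal sketch of that loop, the snippet below builds the URL for every match ID in the interval quoted above. The names BASE_URL, FIRST_ID, and LAST_ID are illustrative and not taken from the original code.

```python
# Base URL and ID range as quoted in the article; the variable names are illustrative.
BASE_URL = "https://www.premierleague.com/match/"
FIRST_ID, LAST_ID = 46605, 46984

# range() excludes its upper bound, so add 1 to include the last match.
match_urls = [f"{BASE_URL}{match_id}" for match_id in range(FIRST_ID, LAST_ID + 1)]

print(len(match_urls))  # 380
print(match_urls[0])    # https://www.premierleague.com/match/46605
```

The count of 380 matches what a full 20-team season should contain, which is a quick sanity check on the ID range.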

We’ll use Liverpool’s 5–3 win over Chelsea as an example in this article. The ID of this game is 46968. You can type this ID after “premierleague.com/match/” to open the page and follow along with the scraping process described in the article. Refer back to the page whenever necessary.

Scraping…

To begin with the code, we’ll make our imports and initialize two empty lists, one for dealing with errors, which will be explained later in the article, and the other to store the data of every match we scrape.

Within the loop, the URL will be created using the match ID, the driver object will be instantiated, and we’ll set up Selenium. No advanced configurations will be used here. The option.headless = True line states that we don’t want to actually see the browser opening and going to the website to collect the data. With that done, we’ll use the driver object to get the page.
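A rough sketch of this setup step follows. It assumes Chrome as the browser and a recent Selenium 4 release that can locate the browser driver on its own. It passes the --headless argument instead of setting option.headless, since newer Selenium releases deprecate that attribute. The function name open_match_page is illustrative.

```python
BASE_URL = "https://www.premierleague.com/match/"


def open_match_page(match_id):
    """Start a headless Chrome session and load the page of one match."""
    # Imported inside the function so the sketch loads even where Selenium
    # is not installed; at the top of a real script these would be module-level.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # no visible browser window
    driver = webdriver.Chrome(options=options)  # assumption: Chrome is installed
    driver.get(f"{BASE_URL}{match_id}")
    return driver
```

Calling open_match_page(46968) would return a driver pointed at the example match. The caller is responsible for calling driver.quit() when done.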

And we’re now set to begin with the scraping. We’ll first collect the date of the match and the teams involved. We’ll also use datetime to convert the date format from “Wed 22 Jul 2020” to 07/22/2020.

Each element is found through its XPath, but elements can also be found by name, class, tag, and more. Check all the selectors here.
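The date conversion can be sketched with the standard library alone. The function name reformat_match_date is illustrative, and the sketch assumes the default English locale for day and month names.

```python
from datetime import datetime


def reformat_match_date(raw_date):
    """Convert a date like 'Wed 22 Jul 2020' to 'MM/DD/YYYY'."""
    # %a = abbreviated weekday, %d = day, %b = abbreviated month, %Y = year
    parsed = datetime.strptime(raw_date, "%a %d %b %Y")
    return parsed.strftime("%m/%d/%Y")


print(reformat_match_date("Wed 22 Jul 2020"))  # 07/22/2020
```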

Notice that we had to use the WebDriverWait and the expected_conditions when collecting the match date. That’s because this is one of the parts of the page generated using JavaScript, and so we need to wait for the element to be rendered in order to avoid raising an error.
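One way to express this wait is a small helper like the one below. The helper name wait_for_text and the 20-second timeout are assumptions, and the XPath is left as a parameter because the page's real XPaths are not reproduced here.

```python
def wait_for_text(driver, xpath, timeout=20):
    """Wait until the element at `xpath` exists, then return its text."""
    # Imported inside the function so the sketch loads even where Selenium
    # is not installed; at the top of a real script these would be module-level.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # until() polls the condition and raises TimeoutException if the element
    # has not appeared after `timeout` seconds.
    element = WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.XPATH, xpath))
    )
    return element.text
```

A call would look like wait_for_text(driver, match_date_xpath), where match_date_xpath is a placeholder for the XPath copied from the browser's developer tools.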

If we tried to collect the match date using, let’s say, requests and BeautifulSoup only, we wouldn’t be able to access this information. requests downloads the HTML without running any JavaScript, so the element never appears in the content that BeautifulSoup parses.

To scrape the final scores, we first need to get the text from inside what I call the score box, which returns the text “5–3”, and then to assign the home team score and the away team score.
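A sketch of the score split is shown below. It accepts either a plain hyphen or an en dash as the separator, since the exact character the page uses is an assumption. The name split_score is illustrative.

```python
import re


def split_score(score_text):
    """Split text such as '5-3' into (home_score, away_score) integers."""
    # Accept a hyphen or an en dash (\u2013), with optional spaces around it.
    home, away = re.split(r"\s*[-\u2013]\s*", score_text.strip())
    return int(home), int(away)


home_score, away_score = split_score("5-3")
print(home_score, away_score)  # 5 3
```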

The next step is to get the stats of the game. This data is a table under the stats tab on the page. We could simply read the page source using the pandas read_html function, but this part of the page is only rendered after we click on the tab.

The first thing to do then is to find the tab element and click on it with Selenium. After that, we can use the read_html function. This function returns a list with all tables on the page stored as DataFrames. We then select the last element in the list, which is the one we are after. The scraping is now done, so we can just quit the driver.
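These steps might look roughly like the sketch below. The function name read_stats_table is illustrative, and the tab's XPath is a parameter because the real one is not reproduced here. Depending on how fast the tab renders, a short wait after the click may also be needed.

```python
import io


def read_stats_table(driver, stats_tab_xpath):
    """Click the stats tab, read the last table on the page, and close the browser."""
    # Imported inside the function so the sketch loads even where these
    # packages are not installed; normally they would be module-level imports.
    import pandas as pd
    from selenium.webdriver.common.by import By

    driver.find_element(By.XPATH, stats_tab_xpath).click()

    # read_html returns one DataFrame per <table> found in the HTML.
    # Wrapping the source in StringIO avoids the literal-string deprecation
    # in recent pandas versions.
    tables = pd.read_html(io.StringIO(driver.page_source))
    stats = tables[-1]  # the stats table is the last one on the page

    driver.quit()
    return stats
```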

Error Handling

Selenium can be a little unstable sometimes and take too long to load the page. This can raise a couple of errors since we’re scraping hundreds of pages.

To deal with this, we’ll need the try and except clauses. If an error is raised while collecting the data, the code will append the match ID to the errors list and move on to the next match without crashing. When all the scraping is done, you can check this list and rescrape only the matches that are missing. This is how the code for all this looks:
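The pattern can be sketched independently of Selenium. Here scrape_match stands for any function that collects one match and may raise an error. The name is hypothetical, and fake_scrape only simulates a failure for illustration.

```python
def scrape_season(match_ids, scrape_match):
    """Scrape every match, recording the IDs that fail instead of crashing."""
    season = []  # one entry per successfully scraped match
    errors = []  # IDs of matches that raised an error
    for match_id in match_ids:
        try:
            season.append(scrape_match(match_id))
        except Exception:
            # Remember the ID so this match can be retried later.
            errors.append(match_id)
    return season, errors


def fake_scrape(match_id):
    """Stand-in scraper that fails for one ID to simulate a slow page."""
    if match_id == 46606:
        raise TimeoutError("page took too long to load")
    return [match_id]


season, errors = scrape_season(range(46605, 46608), fake_scrape)
print(errors)  # [46606]
```

After a full run, passing the errors list back in as match_ids retries only the matches that are missing.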

Manipulating the Stats

This is how the stats DataFrame looks right now:

    Liverpool       Unnamed: 1  Chelsea
0          50     Possession %       50
1           7  Shots on target        5
2          10            Shots       10
3         749          Touches      752
4         584           Passes      575
5          19          Tackles        9
6          14       Clearances       15
7           6          Corners        0
8           0         Offsides        3
9           1     Yellow cards        0
10          8   Fouls conceded       11

As we need to store all this in a row of a DataFrame, this format is not good. To fix this, we’ll create two dictionaries, one for each team, in which every key will represent a stat. This is the entire process:
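A minimal sketch of that conversion follows, assuming the three-column layout shown above: home team, stat name, away team. The DataFrame is rebuilt by hand from the table so the example is self-contained, and stats_to_dicts is an illustrative name.

```python
import pandas as pd

# The stats table from the article, rebuilt by hand.
stats = pd.DataFrame({
    "Liverpool": [50, 7, 10, 749, 584, 19, 14, 6, 0, 1, 8],
    "Unnamed: 1": ["Possession %", "Shots on target", "Shots", "Touches",
                   "Passes", "Tackles", "Clearances", "Corners", "Offsides",
                   "Yellow cards", "Fouls conceded"],
    "Chelsea": [50, 5, 10, 752, 575, 9, 15, 0, 3, 0, 11],
})


def stats_to_dicts(stats):
    """Return (home_stats, away_stats), each mapping stat name to value."""
    home_col, label_col, away_col = stats.columns
    home_stats = dict(zip(stats[label_col], stats[home_col]))
    away_stats = dict(zip(stats[label_col], stats[away_col]))
    return home_stats, away_stats


home_stats, away_stats = stats_to_dicts(stats)
print(home_stats["Tackles"], away_stats["Tackles"])  # 19 9
```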

Making the Data Consistent

Notice that we don’t have the red card stats in the stats DataFrame. That’s because there were no red cards in this game. When there are no occurrences of a stat, the website doesn't show that stat.

If this isn’t fixed, some rows will be longer than others and the data will be inconsistent. To fix this, we’ll use a list containing all the expected stats. If any value in this list is not a key of the stats dictionaries (we only need to check one of them), that stat will be added as a key to both dictionaries with the value zero.
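The check can be sketched as follows. The contents of STATS_CHECK are an assumption built from the stats table shown earlier plus red cards. The real site may expose other stats that would need to be added.

```python
# Assumed list of expected stats: the rows of the table above plus red cards.
STATS_CHECK = [
    "Possession %", "Shots on target", "Shots", "Touches", "Passes",
    "Tackles", "Clearances", "Corners", "Offsides", "Yellow cards",
    "Red cards", "Fouls conceded",
]


def fill_missing_stats(home_stats, away_stats, expected=STATS_CHECK):
    """Add every expected stat that is missing to both dictionaries as zero."""
    for stat in expected:
        # Both dictionaries come from the same table, so checking one is enough.
        if stat not in home_stats:
            home_stats[stat] = 0
            away_stats[stat] = 0


home_stats = {"Possession %": 50, "Yellow cards": 1}
away_stats = {"Possession %": 50, "Yellow cards": 0}
fill_missing_stats(home_stats, away_stats)
print(home_stats["Red cards"], away_stats["Red cards"])  # 0 0
```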

All that is left now is to create a new list with everything that was scraped for this match and append this list to the season list that contains all the matches.

When we’ve finished scraping all matches in the season, we can just transform the season list of lists into a DataFrame and export the data as a .csv file. The stats_check list was used to create the list that names the DataFrame’s columns.
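A sketch of this final step is shown below. The column names, the home_ and away_ prefixes, and the shortened stats_check list are assumptions for illustration. The single row uses values from the example match.

```python
import pandas as pd

# Assumed names for the columns that come before the stats.
BASE_COLUMNS = ["date", "home_team", "away_team", "home_score", "away_score"]


def season_to_frame(season, stats_check):
    """Turn the season list of lists into a DataFrame with named columns."""
    columns = (
        BASE_COLUMNS
        + [f"home_{stat}" for stat in stats_check]
        + [f"away_{stat}" for stat in stats_check]
    )
    return pd.DataFrame(season, columns=columns)


# Shortened list of stats to keep the example small.
stats_check = ["Possession %", "Shots on target", "Shots"]
season = [
    ["07/22/2020", "Liverpool", "Chelsea", 5, 3, 50, 7, 10, 50, 5, 10],
]

df = season_to_frame(season, stats_check)
# With no path, to_csv returns the CSV text; pass a file path to write a file.
csv_text = df.to_csv(index=False)
```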

You can see the complete code here.

Wrapping up

Finally, this is the data scraped:

Image by Author

380 matches. This is the entire Premier League 2019/20 season in a dataset! And you can do even more: if you use the ID 1 you’ll go back to the 1992/93 season. But the IDs aren’t linear from 1992 to today because at some point the IDs began to cover cup matches, youth matches, and women’s matches as well.

However, you can find the IDs for almost every Premier League match since the 2011/12 season here if you want to have a dataset with thousands and thousands of matches.

If you’re going for that, make sure to insert more pauses in your code, using WebDriverWait or even the sleep function, to avoid having your IP blocked for making too many requests to the website. Another possibility is to get in touch with a proxy provider, such as Infatica, as they’ll be able to provide you with a better infrastructure of IP addresses to keep your code running.
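One simple way to add such pauses is a randomized sleep between matches, sketched below. The 2 to 5 second range is an arbitrary assumption, not a limit published by the website, and polite_pause is an illustrative name.

```python
import random
import time


def polite_pause(min_seconds=2.0, max_seconds=5.0, sleep=time.sleep):
    """Sleep for a random duration and return how long the pause was."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)  # `sleep` is injectable so the helper can be tested quickly
    return delay
```

Calling polite_pause() at the end of each loop iteration spreads the requests out without making the timing perfectly regular.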

And to go one step further, you can always scrape more data about each game. With a few more lines of code, you can have in your dataset information such as the referee, the stadium and the city where each match took place, the attendance, the halftime score, the goal scorers, the lineups, and much more!

Keep scraping!

I hope you’ve enjoyed this and that you find it useful. If you have a question, a suggestion, or just want to be in touch, feel free to contact me through Twitter, GitHub, or LinkedIn.

Bio: Otávio Simões Silveira is an economist and an aspiring data scientist.

Original. Reposted with permission.
