How to Use Python and Machine Learning to Predict Football Match Winners

We will be learning web scraping and training supervised machine-learning algorithms to predict winning teams.



Image by Freepik

 

Introduction

 

Python is one of the most versatile programming languages out there. Over the years, it has become one of the most popular languages for building machine learning applications.

A key element of such applications is often some kind of prediction based on the data available for processing. Predictions always carry a degree of uncertainty, and Python's statistics and machine learning libraries provide convenient tools for modelling it.

In this article, we will tackle one such problem: using Python to predict the result of a football match.

Because the outcome of a match is uncertain, the problem is a good fit for the statistical and machine learning tools available in Python, and that is what we will use here.

 

Overview

 

Football, like any other sport, involves many elements that are truly unpredictable in nature.

It is well-known that football matches often turn out to be different than what one would have anticipated.

In such a scenario, predicting football match winners is a challenge. However, even if we cannot know the events of a particular match beforehand, we do know what happened in past matches.

This data becomes the key element in carrying out a successful prediction. This is the basis of a data science problem: studying the statistics of the past to predict a likely future.

Thus, in this problem, we will base our results on data derived from past matches. We will carry out a statistical study of that data and predict the most likely winner of a football match.

To do so, we will use supervised machine learning in Python to build a prediction model.

 

Problem Statement

 

This article aims to perform:

  1. Web-scraping to collect data of past football matches
  2. Supervised machine learning using prediction models to predict the results of a football match on the basis of the collected data
  3. Evaluation of the prediction models

 

Steps Involved

 

1. Web-scraping

 

Web-scraping is the method of extracting relevant data from the huge amounts of data available on different websites on the internet.

The data to be extracted is mostly unstructured and in HTML format. Scraping converts it into a structured form, such as a list or a table, that is easy to process later.

For web-scraping to be carried out successfully, we need to narrow our search down to a website that contains data about football matches in particular.

Once that is fixed, we will use the URL of the website to download the HTML source of the page.

The scraper then converts this HTML into the required output format (for example a spreadsheet, a list, or a CSV/JSON file).

For the sake of this problem, we will be carrying out web-scraping on the data available on the website: FBref.com 

The steps involved can be:

  1. Navigate to the “Competitions” section of the above-mentioned website.
  2. Select any mentioned competition (such as Premier League 2022-23) whose results you want to extract for making predictions on.
  3. Go to the “Scores & Fixtures” section under the selected competition section.

The scores would be used to make predictions so we would need to web-scrape that information. Thus, copy the URL of the page.

For this case (let’s say, Premier League), the link would be: https://fbref.com/en/comps/9/schedule/Premier-League-Scores-and-Fixtures#sched_2022-2023_9_1

You could also get the link to some other competition as needed.

Note that any other website with match results could be used for the prediction as well. For instance, we could scrape match results from Wikipedia simply by providing the link to a page that lists the scores, such as https://en.wikipedia.org/wiki/2022_FIFA_World_Cup

Once the URL has been chosen, the scraping itself proceeds as follows:

  1. The copied URL is given to the web-scraping script as input, along with the id of the tables containing information about the championship.
  2. The script combines all the games in one season into a list or a .csv file, and the compiled list of all the matches is received as output.
  3. Information that is unnecessary, such as player statistics, is omitted.
  4. The information is restricted to match data mapped to team data, so that predictions as to which team will win can be made.
  5. The result, containing the data about matches and teams (without player-specific information), is stored in a DataFrame.
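The steps above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name table_to_frame, the table id sched_demo, the column names and the inline sample HTML are all made up here so the example runs offline, and a real page will need its own table id and columns.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_frame(html, table_id, keep_columns):
    """Turn the HTML table with the given id into a DataFrame of selected columns."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        raise ValueError(f"no table with id {table_id!r}")

    # The first row is assumed to hold the column headers.
    header = [th.get_text(strip=True) for th in table.find("tr").find_all("th")]

    rows = []
    for tr in table.find_all("tr")[1:]:
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if len(cells) == len(header):  # skip spacer or malformed rows
            rows.append(cells)

    frame = pd.DataFrame(rows, columns=header)
    # Keep only the match and team columns, dropping everything else.
    return frame[keep_columns]


# Hypothetical stand-in for a downloaded "Scores & Fixtures" page.
sample_html = """
<table id="sched_demo">
  <tr><th>Home</th><th>Score</th><th>Away</th><th>Referee</th></tr>
  <tr><td>Team A</td><td>2-1</td><td>Team B</td><td>Ref X</td></tr>
  <tr><td>Team C</td><td>0-0</td><td>Team D</td><td>Ref Y</td></tr>
</table>
"""

matches = table_to_frame(sample_html, "sched_demo", ["Home", "Score", "Away"])
print(matches)
```

In practice, the html argument would be the text of the response downloaded from the copied URL.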

This is broadly how web-scraping is done. The extracted data is the past data on the basis of which predictions about future winners will be made.

Let us understand this with the help of the following code snippets:

First, we will import the necessary libraries.

import pandas as pd
from bs4 import BeautifulSoup
import requests

 

Next, we will download the page and use Beautiful Soup to parse its HTML code.

url = 'https://en.wikipedia.org/wiki/2022_FIFA_World_Cup'
res = requests.get(url)
content = res.text
soup = BeautifulSoup(content, 'lxml')

 

Then, we will extract information for the matches on the basis of which we would predict, for instance, the data for the FIFA World Cup matches.

match_data = soup.find_all('div', class_='footballbox')

 

Next, we will extract the data/scores for the home and away teams.

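# The lists must exist before we can append to them
home_team, score, away_team = [], [], []

for match in match_data:
    # strip=True removes surrounding whitespace from each cell
    home_team.append(match.find('th', class_='fhome').get_text(strip=True))
    score.append(match.find('th', class_='fscore').get_text(strip=True))
    away_team.append(match.find('th', class_='faway').get_text(strip=True))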

 

Finally, we will store the data in a DataFrame to be exported to a .csv file.

dict_football = {'home_team': home_team, 'score': score, 'away_team': away_team}
df_football = pd.DataFrame(dict_football)

df_football.to_csv("fifa_worldcup_data.csv", index=False)
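At this point the score is still a single string, so it has to be split into numeric goal counts before any model can use it. A minimal sketch, assuming scores look like "0–2" (with an en dash or a hyphen) and may carry extra text such as "(a.e.t.)"; the sample rows and the column names home_goals and away_goals are made up for illustration.

```python
import pandas as pd

# Hypothetical rows in the same shape as the scraped DataFrame.
df = pd.DataFrame({
    "home_team": ["Team A", "Team B"],
    "score": ["0–2", "3–3 (a.e.t.)"],
    "away_team": ["Team C", "Team D"],
})

# Capture the two numbers on either side of the dash.
goals = df["score"].str.extract(r"(\d+)\s*[–-]\s*(\d+)")
df["home_goals"] = goals[0].astype(int)
df["away_goals"] = goals[1].astype(int)

print(df[["home_team", "home_goals", "away_goals", "away_team"]])
```

Rows whose score does not match the pattern (for example, fixtures that have not been played yet) would produce missing values and need to be dropped before the conversion to integers.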

 

2. Data Pre-Processing

 

It is crucial to process the data before running the actual prediction models on it, so we will do that here as well.

The steps include creating a variable to store the mean value of the scores from previous matches.

This is because predictions can only be made from data that is already available to us, since we do not have access to any future data.

We will calculate the average for the different variables storing information about the season matches.

Along with this, we will also store moving averages for various other variables. 

The points for a team are summed with each win counted as 3, a draw as 2, and a loss as 1. These values are used to total a team's points over its past few matches. 
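As a rough sketch of this step, the moving features for one team could be computed with pandas as shown below. The match log, the window of three matches and the column names are assumptions made for illustration; the 3/2/1 encoding follows the description above.

```python
import pandas as pd

# Hypothetical match log for one team, in chronological order.
team_log = pd.DataFrame({
    "goals_for": [2, 0, 1, 3, 1],
    "goals_against": [1, 0, 2, 0, 1],
})


def result_points(row):
    """Encode a result as 3 (win), 2 (draw) or 1 (loss)."""
    if row["goals_for"] > row["goals_against"]:
        return 3
    if row["goals_for"] == row["goals_against"]:
        return 2
    return 1


team_log["points"] = team_log.apply(result_points, axis=1)

# shift(1) makes each row see only earlier matches, so no future data leaks in.
team_log["form_points"] = (
    team_log["points"].shift(1).rolling(window=3, min_periods=1).sum()
)
team_log["avg_goals_for"] = (
    team_log["goals_for"].shift(1).rolling(window=3, min_periods=1).mean()
)

print(team_log)
```

The first row has no earlier matches, so its moving features are missing and it would normally be dropped before training.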

Next, to ensure that the distinction between home team and away team is made, we can do appropriate calculations.

However, for this case, we can assume that the results need to be derived for the FIFA World Cup.

Since the tournament comprises matches on neutral grounds, we can ignore the concept of home team and away team in this particular case.

If we do need to consider them, we have to take the difference between the results of the home team and those of the away team to check whether or not the home team is superior to the away team.

 

3. Implementing Prediction Models

 

For carrying out the actual prediction, we can use different kinds of models. In this case, we will consider four of them:

 

Poisson Distribution

 

The Poisson distribution is a discrete probability distribution that gives the probability of a given number of events occurring within a fixed interval, assuming the events occur at a constant mean rate.

In other words, it describes how many times an event might occur in a particular interval. This means that it provides a measure of the probability of an event, rather than a simple probable or not probable outcome.

Applied to football, the event is a goal and the interval is a match, so the goal probabilities of the two teams can be combined into a probability for each possible result. This is why the approach suits multi-class problems generally (home win, draw, away win), but works just as well for binary problems too. 
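To make the formula concrete, here is a minimal sketch of the Poisson probability mass function, P(k) = λ^k · e^(−λ) / k!, written with the standard library only. The mean rate of 1.5 goals per match is an assumed value for illustration, and poisson_pmf is a helper defined here, not a library function.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean rate is lam."""
    return lam ** k * exp(-lam) / factorial(k)


lam = 1.5  # assumed average number of goals a team scores per match
for goals in range(6):
    print(f"P({goals} goals) = {poisson_pmf(goals, lam):.3f}")
```

The probabilities over all possible goal counts add up to 1, which is what allows them to be combined into win, draw and loss probabilities.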

The code snippets used for the implementation are as follows:

Defining a function “predict” to calculate points for Home Team and Away Team.

from scipy.stats import poisson

def predict(home_team, away_team):

    # df_football is expected here to be indexed by team name, with
    # GoalsScored and GoalsConceded columns for each team.
    # Calculate the value of lambda (λ) for both Home Team and Away Team.
    if home_team in df_football.index and away_team in df_football.index:
        lambda_home_team = df_football.at[home_team,'GoalsScored'] * df_football.at[away_team,'GoalsConceded']
        lambda_away_team = df_football.at[away_team,'GoalsScored'] * df_football.at[home_team,'GoalsConceded']

 

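Next, use the formula for the Poisson distribution to calculate the value of “p”, the probability of a particular scoreline in which the home team scores x goals and the away team scores y goals.

This value is then added up over the possible scorelines to calculate the respective probabilities for a draw (pr_draw), the home team as winner (pr_home) and the away team as winner (pr_away).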

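# These lines continue inside predict(), after the lambdas are computed.
# x and y are the goals scored by the home and away team.
# The cap of 10 goals per team is an assumption made for this sketch.
pr_home, pr_draw, pr_away = 0, 0, 0
for x in range(0, 11):
    for y in range(0, 11):
        p = poisson.pmf(x, lambda_home_team) * poisson.pmf(y, lambda_away_team)
        if x == y:
            pr_draw += p
        elif x > y:
            pr_home += p
        else:
            pr_away += p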

 

The points for both Home Team and Away Team are calculated separately and then used to make the final prediction.

points_home_team = 3 * pr_home + pr_draw
points_away_team = 3 * pr_away + pr_draw
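Put together, the fragments above could look like the following self-contained sketch. The team names, the numbers in team_stats, the function name predict_points and the cap of 10 goals per team are all assumptions made so the example runs on its own; they are not taken from real data.

```python
import pandas as pd
from scipy.stats import poisson

# Hypothetical team strengths, indexed by team name.
team_stats = pd.DataFrame(
    {"GoalsScored": [2.0, 1.0], "GoalsConceded": [0.8, 1.5]},
    index=["Team A", "Team B"],
)


def predict_points(home_team, away_team, stats, max_goals=10):
    """Return expected points (home, away), or None if a team is unknown."""
    if home_team not in stats.index or away_team not in stats.index:
        return None

    # Expected goals for each side.
    lambda_home = stats.at[home_team, "GoalsScored"] * stats.at[away_team, "GoalsConceded"]
    lambda_away = stats.at[away_team, "GoalsScored"] * stats.at[home_team, "GoalsConceded"]

    pr_home = pr_draw = pr_away = 0.0
    # Add up the probability of every scoreline x-y up to max_goals each.
    for x in range(max_goals + 1):
        for y in range(max_goals + 1):
            p = poisson.pmf(x, lambda_home) * poisson.pmf(y, lambda_away)
            if x == y:
                pr_draw += p
            elif x > y:
                pr_home += p
            else:
                pr_away += p

    # 3 points for a win, 1 for a draw.
    points_home = 3 * pr_home + pr_draw
    points_away = 3 * pr_away + pr_draw
    return points_home, points_away


points_a, points_b = predict_points("Team A", "Team B", team_stats)
print(f"Team A: {points_a:.2f} expected points, Team B: {points_b:.2f}")
```

The team with the higher expected points is taken as the predicted winner.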

 

This is how we can make a basic prediction for a football game winner with the help of a statistical model (in this case, the Poisson distribution).

The same workflow can be extended to other models by swapping in the predictive model under consideration.

The final results would then be evaluated for the different models in a comparative study, so that we can pick the model that performs best on our data. 

Let us take a brief look at the various other models we can also use for making a similar prediction.

 

Support Vector Machine

 

SVM, or Support Vector Machine, is a supervised machine learning algorithm.

It is mainly used for classification problems. It classifies by creating a boundary between the different classes of data.

Since it operates as a separation between two groups of data, it is primarily a binary classification method.

But it can be extended to multi-class classification as well.

To carry out an SVM prediction using Python, we can use the following:

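from sklearn import svm
from sklearn.model_selection import train_test_split

# x holds the match features and y the match outcomes (labels)
svc_predict = svm.SVC()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30)
svc_predict.fit(x_train, y_train)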

 

Here, svc_predict is the SVM classifier, which is fitted on the training data. x_train and y_train comprise the data on which the model is trained, while x_test and y_test denote the data on which the model is tested.
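For a version that runs end to end, the sketch below uses synthetic data in place of real match features. The two feature columns (imagined here as differences in recent form and average goals) and the rule that generates the labels are assumptions made purely for illustration.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in features: [form_points_diff, avg_goals_diff]
x = rng.normal(size=(200, 2))
# Label 1 ("home win") when the combined difference is positive, else 0.
y = (x[:, 0] + x[:, 1] > 0).astype(int)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.30, random_state=0
)

svc_predict = svm.SVC()
svc_predict.fit(x_train, y_train)

# Predict the outcome of the held-out matches.
predictions = svc_predict.predict(x_test)
print(predictions[:10])
```

Setting random_state makes the split reproducible, which helps when comparing models on the same test set.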

 

KNN

 

K-Nearest Neighbours, or KNN, is also a supervised machine learning algorithm.

It classifies data with the help of class labels: every training example carries the label of the class it belongs to.

A new data point is assigned the class label that is most common among its “K” nearest labelled neighbours.

For regression cases, the prediction is made by taking the average of the “K” nearest neighbours.

The distance between neighbours is usually the Euclidean distance between them.

However, any other distance metric could also be used.

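from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

knn_predict = KNeighborsClassifier()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30)
knn_predict.fit(x_train, y_train)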
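The value of K and the distance metric can be set explicitly through the n_neighbors and metric parameters. A tiny sketch on made-up points, chosen so the neighbourhoods are easy to check by eye:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Six made-up training points forming two well-separated groups.
x_small = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # class 0
    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9],   # class 1
])
y_small = np.array([0, 0, 0, 1, 1, 1])

# K = 3 neighbours, Euclidean distance.
knn_small = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn_small.fit(x_small, y_small)

# Each new point takes the majority label of its 3 nearest neighbours.
labels = knn_small.predict([[0.15, 0.1], [1.0, 0.95]])
print(labels)
```

The first query point sits inside the class 0 group and the second inside the class 1 group, so they receive labels 0 and 1 respectively.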

 

Logistic Regression 

 

Logistic regression is a linear model for binary classification problems.

It estimates how probable an event is, which is why we use it in this case.

The output of a logistic regression is a probability, bounded between 0 and 1.

This is why it works well for binary classification problems, such as a win or lose scenario for a football match.

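from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

logistic_predict = LogisticRegression()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30)
logistic_predict.fit(x_train, y_train)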
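Because the model outputs a probability, it can report how likely a win is rather than only a yes or no. A small sketch on made-up data with a single feature (imagined here as a strength difference between the two teams), using predict_proba:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: one feature, label 1 = win, 0 = no win.
x_small = np.array([[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]])
y_small = np.array([0, 0, 0, 0, 1, 1, 1, 1])

logistic_small = LogisticRegression()
logistic_small.fit(x_small, y_small)

# Column 1 of predict_proba is the probability of class 1 (a win).
win_probability = logistic_small.predict_proba([[1.2], [-1.2]])[:, 1]
print(win_probability)
```

A positive strength difference yields a win probability above 0.5, and a negative one yields a probability below 0.5.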

 

4. Evaluating Results Using Metrics

 

To evaluate the results obtained from the different models, we can use metrics to determine which model performed better than the rest.

Here, we could calculate accuracy to determine the quality of performance of the models. The formula for the same might be stated as below:

Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)

A true positive is a correctly predicted positive outcome. Similarly, a true negative is a correctly predicted negative outcome.

A false negative is a wrongly predicted negative outcome. Similarly, a false positive is a wrongly predicted positive outcome. 

To check for accuracy, we need to compare the predicted outputs with the real outputs. This is how we can check which model makes a prediction which is the closest to the actual result.
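As a small worked example, the accuracy formula can be checked against scikit-learn's accuracy_score. The true and predicted labels below are made up for illustration (1 = win, 0 = no win).

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up real outcomes and model predictions for eight matches.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels 0/1, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```

Here there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so both calculations give 6 / 8 = 0.75. Running the same comparison for each model on the same test set shows which one is closest to the actual results.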

 

Conclusion

 

The problem was a complex one, and yet we could work towards a result with relatively little effort with the help of Python.

Even though the results are not absolutely accurate, the approach shows how Python can be applied to real-world prediction problems.

The models make their predictions systematically from past results, a task which, perhaps, humans cannot achieve without prior information about the games.

Such prediction models can be fine-tuned to achieve even better results in the future.

We hope you have understood how to make predictions using Python and machine learning. You can learn more about Python from free resources such as KDnuggets, Scaler, or freeCodeCamp.

Happy Learning!
 
 
Vaishnavi Amira Yada is a technical content writer. She has knowledge of Python, Java, DSA, C, etc. She found herself in writing and she loved it.