The Machine Learning Field Guide
This straightforward guide offers a structured overview of all machine learning prerequisites needed to start working on your project, including the complete data pipeline from importing and cleaning data to modelling and production.
By Kamron Bhavnagri, Up and coming machine learning engineer/data scientist.
We all start with either a dataset or a goal in mind. Once we've found, collected or scraped our data, we pull it up, and witness the overwhelming sight of merciless cells of numbers, more numbers, categories, and maybe some words! A naive thought crosses our mind, to use our machine learning prowess to deal with this tangled mess... but a quick search reveals the host of tasks we'll need to consider before training a model!
Once we overcome the shock of our unruly data, we look for ways to battle our formidable nemesis. We start by trying to get our data into Python. It is relatively simple on paper, but the process can be slightly... involved. Nonetheless, a little effort was all that was needed (lucky us).
Without wasting any time, we begin data cleaning to get rid of the bogus and expose the beautiful. Our methods start simple - observe and remove. It works a few times, but then we realise... it really doesn't do us justice! To deal with the mess, though, we find a powerful tool to add to our arsenal: charts! With our graphs, we can get a feel for our data, the patterns within it, and where things are missing. We can interpolate (fill in) or remove missing data.
Finally, we approach our highly anticipated challenge, data modelling! With a little research, we find out which tactics and models are commonly used. It is a little difficult to decipher which one we should use, but we still manage to get through it and figure it all out!
We can't finish a project without doing something impressive, though. So, a final product, website, app, or even a report will take us far! We know first impressions are important, so we fix up the GitHub repository and make sure everything's well documented and explained. Now we are finally able to show off our hard work to the rest of the world!
Chapter 1 - Importing Data
Data comes in all kinds of shapes and sizes, and so the process we use to get everything into code often varies.
Let's be real, importing data seems easy, but sometimes... it's a little pesky.
The hard part about data cleaning isn't the coding or theory, but instead our preparation! When we first start a new project and download our dataset, it can be tempting to open up a code editor and start typing... but this won't do us any good. If we want to get a head start, we need to prepare ourselves for the best and worst parts of our data. To do this, we'll need to start basic, by manually inspecting our spreadsheet(s). Once we understand the basic format of the data (filetype along with any particularities), we can move on to getting it all into Python.
When we're lucky and just have one spreadsheet we can use the Pandas read_csv function (letting it know where our data lies):
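A minimal sketch of that simplest case. The column names and values are made up for illustration, and an in-memory string stands in for a real file so the snippet runs on its own:

```python
import io
import pandas as pd

# Stand-in for a spreadsheet on disk. With a real file we would pass its
# path instead, e.g. pd.read_csv("data.csv") (hypothetical filename).
csv_file = io.StringIO(
    "country,year,population\n"
    "AU,2019,25.4\n"
    "NZ,2019,4.9\n"
)

data = pd.read_csv(csv_file)
print(data.head())  # quick look at the first few rows
```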
In reality, we run into way more complex situations, so look out for:
- The file starts with unneeded information (which we need to skip)
- We only want to import a few columns
- We want to rename our columns
- Data includes dates
- We want to combine data from multiple sources into one place
- Data can be grouped together
Although we're discussing a range of scenarios, we normally only deal with a few at a time.
Our first few problems (importing specific parts of our data/renaming columns) are easy enough to deal with using a few parameters, like the number of rows to skip, the specific columns to import and our column names:
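As a rough sketch, those three parameters map onto `skiprows`, `usecols` and `names` in `read_csv`. The file contents and column names below are invented for the example:

```python
import io
import pandas as pd

# Made-up export: two lines of junk, a cryptic header row, then the data.
raw = io.StringIO(
    "Exported by some tool\n"
    "Units: millions\n"
    "c,y,p,notes\n"
    "AU,2019,25.4,ok\n"
    "NZ,2019,4.9,ok\n"
)

data = pd.read_csv(
    raw,
    skiprows=3,  # skip the two junk lines and the original header row
    names=["country", "year", "population", "notes"],  # our own column names
    usecols=["country", "year", "population"],  # only import these columns
)
print(data)
```

Skipping the original header and supplying `names` is one way to rename on import. Calling `DataFrame.rename` afterwards works just as well.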
To combine several spreadsheets, we pass concat a list of dataframes (each imported just like before). The list can, of course, be obtained in any way (so a fancy list comprehension or a casual list of every file both work just as well), but just remember that we need dataframes, not filenames/paths!
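Something along these lines should work. The two in-memory "files" and their columns are placeholders, and `parse_dates` is included as one way to handle the date columns mentioned in the list of scenarios earlier:

```python
import io
import pandas as pd

# Two made-up sources with the same columns (stand-ins for real files).
sources = [
    io.StringIO("date,sales\n2020-01-01,10\n2020-01-02,12\n"),
    io.StringIO("date,sales\n2020-01-03,9\n"),
]

# Import each source first: concat needs dataframes, not filenames/paths.
frames = [pd.read_csv(source, parse_dates=["date"]) for source in sources]

# Stack the dataframes on top of each other and renumber the rows.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```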
After all the data is inside a Pandas dataframe, we need to double-check that our data is formatted correctly. In practice, this means checking each series' datatype and making sure they are not generic objects. We do this to ensure that we can utilize Pandas' built-in functionality for numeric, categorical, and date/time values. To look at this, just run DataFrame.dtypes. If the output seems reasonable (i.e., numbers are numeric, categories are categorical, etc.), then we should be fine to move on. However, this normally is not the case, and we need to change our datatypes! This can be done with Pandas DataFrame.astype. If this doesn't work, there should be another Pandas function for that specific conversion:
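A small sketch of that check-then-convert routine, using invented columns. Here `pd.to_numeric` and `pd.to_datetime` play the role of the "other Pandas functions" for specific conversions:

```python
import pandas as pd

# Everything starts out as text, as often happens after a messy import.
data = pd.DataFrame({
    "price": ["1.5", "2.0", "n/a"],
    "grade": ["a", "b", "a"],
    "day": ["2020-01-01", "2020-01-02", "2020-01-03"],
})
print(data.dtypes)  # none of these are numeric/categorical/datetime yet

data["grade"] = data["grade"].astype("category")
# astype(float) would fail on "n/a", so use the dedicated converter and
# turn anything unparseable into a missing value (NaN).
data["price"] = pd.to_numeric(data["price"], errors="coerce")
data["day"] = pd.to_datetime(data["day"])
print(data.dtypes)
```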
If we need to analyse separate groups of data (i.e., maybe our data is divided by country), we can use Pandas groupby. We can use groupby to select particular data, and to run functions on each group separately:
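For instance, with a made-up table divided by country, selecting one group and running a function on every group might look like this:

```python
import pandas as pd

data = pd.DataFrame({
    "country": ["AU", "AU", "NZ", "NZ"],
    "year": [2018, 2019, 2018, 2019],
    "population": [25.0, 25.4, 4.8, 4.9],
})

groups = data.groupby("country")

# Select the rows belonging to one particular group.
nz_only = groups.get_group("NZ")

# Run a function (here the mean) on each group separately.
mean_population = groups["population"].mean()
print(mean_population)
```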
Other more niche tricks like multi/hierarchical indices can also be helpful in specific scenarios but are more tricky to understand and use.
Chapter 2 - Data Cleaning
Data is useful, data is necessary. However, it needs to be clean and to the point! If our data is everywhere, it simply won't be of any use to our machine learning model.
Everyone is driven insane by missing data, but there's always a light at the end of the tunnel.
The easiest and quickest way to go through data cleaning is to ask ourselves:
What features within our data will impact our end-goal?
By end-goal, we mean whatever variable we are working towards predicting, categorising or analysing. The point of this is to narrow our scope and not get bogged down in useless information.
Once we know what our primary objective features are, we can try to find patterns, relations, missing data, and more. An easy and intuitive way to do this is graphing! Quickly use Pandas to sketch out each variable in the dataset, and try to see where everything fits into place.
Once we have identified potential problems or trends in the data, we can try and fix them. In general, we have the following options:
- Remove missing entries
- Remove full columns of data
- Fill in missing data entries
- Resample data (i.e., change the resolution)
- Gather more information
To go from identifying missing data to choosing what to do with it, we need to consider how it affects our end-goal. With missing data, we remove anything which doesn't seem to have a major influence on the end result (i.e., we couldn't find a meaningful pattern) or where there just seems too much missing to derive value. Sometimes we also decide to remove very small amounts of missing data (since it's easier than filling it in).
If we've decided to get rid of information, Pandas DataFrame.drop can be used. It removes columns or rows from a dataframe. It is quite easy to use, but remember that Pandas does not modify the source dataframe by default, so we either need to assign the result to a variable or specify inplace=True. Note that the axis parameter specifies whether rows (axis=0) or columns (axis=1) are being removed.
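A quick sketch on a toy dataframe (the column names are arbitrary), showing both the copy-returning default and `inplace=True`:

```python
import pandas as pd

data = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# By default drop returns a new dataframe and leaves `data` untouched.
without_c = data.drop("c", axis=1)        # axis=1: drop a column
without_first_row = data.drop(0, axis=0)  # axis=0: drop the row labelled 0

# With inplace=True the original dataframe itself is modified.
data.drop("b", axis=1, inplace=True)
print(data.columns.tolist())
```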
When not removing a full column, or particularly targeting missing data, it can often be useful to rely on a few nifty Pandas functions. For removing null values, DataFrame.dropna can be utilized. Do keep in mind though that, by default, dropna removes every row that contains at least one missing value. However, setting the parameter how to "all" (drop a row only when all of its values are missing) or setting a threshold (thresh, the minimum number of non-null values a row needs in order to be kept) can compensate for this.
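To see how the three behaviours differ, here is a small sketch on an invented dataframe whose rows have 3, 2, 0 and 1 non-null values respectively:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan, np.nan],
})

# Default: drop every row that has at least one missing value.
no_missing = data.dropna()

# how="all": only drop rows where every value is missing.
not_empty = data.dropna(how="all")

# thresh=2: keep only rows with at least two non-null values.
at_least_two = data.dropna(thresh=2)

print(list(no_missing.index), list(not_empty.index), list(at_least_two.index))
```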
If we've got small amounts of irregular missing values, we can fill them in several ways. The simplest is DataFrame.fillna, which sets the missing values to some preset value. The more complex, but flexible option is interpolation using DataFrame.interpolate. Interpolation lets us choose the method used to estimate each null value. These include the previous/next value, linear, and time (the last two estimate values from the surrounding data). When working with time series, time is a natural choice; otherwise, make a reasonable choice based on how much data is being interpolated and how complex it is.
Note that interpolate should be called on a dataframe containing just the (numeric) columns with missing data, as other columns, such as text, can cause an error to be thrown.
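A minimal sketch of these options on a made-up, unevenly spaced time series. `DataFrame.ffill` is used here for the "previous value" fill, which is an assumption about how one might do it in current Pandas:

```python
import numpy as np
import pandas as pd

# One value is missing, and the gap after it spans two days.
index = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-04"])
data = pd.DataFrame({"value": [1.0, np.nan, 4.0]}, index=index)

filled = data.fillna(0)                     # preset value
previous = data.ffill()                     # carry the previous value forward
linear = data.interpolate(method="linear")  # treats rows as evenly spaced
by_time = data.interpolate(method="time")   # accounts for the real timestamps

print(linear["value"].iloc[1], by_time["value"].iloc[1])
```

Linear interpolation lands halfway between the neighbours (2.5), while the time method notices that the missing point is only a third of the way through the gap (2.0).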
Resampling is useful whenever we see regularly missing data or have multiple sources of data using different timescales (like ensuring measurements in minutes and hours can be combined). It can be slightly difficult to intuitively understand resampling, but it is essential when you average measurements over a certain timeframe. For example, we can get monthly values by specifying that we want to get the mean of each month's values:
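The "M" stands for month and can be replaced with "Y" for the year and other options. Note that recent Pandas versions (2.2 onwards) prefer the aliases "ME" (month end) and "YE" (year end), and warn when "M" or "Y" is used.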
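A small sketch of monthly resampling on invented daily data. It uses the "MS" (month start) alias rather than "M" purely as an assumption to keep the snippet working across Pandas versions; the idea is the same:

```python
import pandas as pd

# Four made-up daily measurements straddling the end of January.
index = pd.date_range("2020-01-30", periods=4, freq="D")
data = pd.DataFrame({"value": [1.0, 3.0, 10.0, 20.0]}, index=index)

# Group the rows by calendar month and average each month's values.
monthly = data.resample("MS").mean()
print(monthly)
```

Each row of the result is labelled with the first day of its month, so January averages to 2.0 and February to 15.0.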
Although the data cleaning process can be quite challenging, if we remember our initial intent, it becomes a far more logical and straightforward task! If we still don't have the needed data, we may need to go back to phase one and collect some more. Note that missing data indicates a problem with data collection, so it's useful to carefully consider and note down where it occurs.
Chapter 3 - Visualisation
Visualisation sounds simple, and it is, but it's hard to... not overcomplicate. It's far too easy for us to consider plots as a chore to create. Yet, these bad boys do one thing very, very well - intuitively demonstrate the inner workings of our data! Just remember:
We graph data to find and explain how everything works.
Hence, when stuck for ideas, or not quite sure what to do, we basically can always fall back on identifying useful patterns and meaningful relationships. It may seem iffy, but it is really useful.
Our goal isn't to draw fancy hexagon plots, but instead to picture what is going on, so absolutely anyone can simply interpret a complex system!
A few techniques are undeniably useful:
- Resampling when we have too much data
- Secondary axis when plots have different scales
- Grouping when our data can be split categorically
To get started graphing, simply use Pandas .plot() on any series or dataframe! When we need more, we can delve into MatPlotLib, Seaborn, or an interactive plotting library.
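As a rough sketch, here is `.plot()` on two invented series with very different scales, using a secondary axis for the second one. The non-interactive "Agg" backend is an assumption so the snippet runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # no window needed; remove this line in a notebook
import pandas as pd

index = pd.date_range("2020-01-01", periods=6, freq="D")
data = pd.DataFrame(
    {
        "temperature": [20.1, 21.3, 19.8, 22.0, 23.4, 21.9],
        "visitors": [1200, 1500, 900, 1800, 2100, 1700],
    },
    index=index,
)

# Plot one series, then put the other on a secondary (right-hand) axis.
ax = data["temperature"].plot(legend=True)
data["visitors"].plot(ax=ax, secondary_y=True, legend=True)

# ax.figure.savefig("plot.png")  # uncomment to write the figure to disk
```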
About 90% of the time, this basic functionality will suffice, and where it doesn't, a search should reveal how to draw particularly exotic graphs.
Chapter 4 - Modelling
A Brief Overview
Now finally, for the fun stuff - deriving results. It seems so simple to train a scikit-learn model, but no one goes into the details! So, let's be honest here, not all datasets or models are equal.
Our approach to modelling will vary widely based on our data. There are three especially important factors:
- Type of problem
- Amount of data
- Complexity of data
Our type of problem comes down to whether we are trying to predict a class/label (called classification), a value (called regression), or to group data (called clustering). If we are trying to train a model on a dataset where we already have examples of what we're trying to predict, then we call our model supervised; if not, unsupervised. The amount of available data and how complex it is indicates how simple a model will suffice. Data with more features (i.e., columns) tends to be more complex.
The point of interpreting complexity is to understand which models are too good or too bad for our data.
A model's goodness of fit informs us of this! If a model struggles to interpret our data (too simple), we can say it underfits, and if it is completely overkill (too complex), we say it overfits. We can think of it as a spectrum from learning nothing to memorising everything. We need to strike a balance, to ensure our model is able to generalise our conclusions to new information. This is typically known as the bias-variance tradeoff. Note that complexity also affects model interpretability.
Complex models take substantially more time to train, especially with large datasets. So, upgrade that computer, run the model overnight, and chill for a while!
Splitting up data
Before training a model, it is important to note that we will need some dataset to test it on (so we know how well it performs). Hence, we often divide our dataset into separate training and testing sets. This allows us to test how well our model can generalise to new unseen data. This normally works because we know our data is decently representative of the real world.
The actual amount of test data doesn't matter too much, but 80% train and 20% test is often used.
In Python, scikit-learn's train_test_split function does this (it holds out 25% of the data by default, so we set test_size=0.2 for an 80/20 split):
train_data, test_data = train_test_split(data, test_size=0.2)
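A self-contained sketch of the split, with a toy dataframe standing in for real data. `random_state` is only there to make the shuffle reproducible:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows.
data = pd.DataFrame({"feature": range(10), "target": [0, 1] * 5})

# Shuffle the rows and hold out 20% of them for testing.
train_data, test_data = train_test_split(data, test_size=0.2, random_state=0)
print(len(train_data), len(test_data))
```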
Cross-validation is where a dataset is split into several folds (i.e., subsets or portions of the original dataset), with each fold taking a turn as the validation set while the model trains on the rest. This tends to be more robust and resistant to overfitting than using a single test/validation set! Several scikit-learn functions help with cross-validation. However, it's normally done straight through a grid or random search (discussed below).
cross_val_score(model, input_data, output_data, cv=5)
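To make that call concrete, here is a sketch using scikit-learn's bundled iris dataset and a logistic regression. Both are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

input_data, output_data = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Train and score the model five times, each time validating on a
# different fold and training on the remaining four.
scores = cross_val_score(model, input_data, output_data, cv=5)
print(scores, scores.mean())
```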
There are some factors our model cannot account for, and so we set certain hyperparameters. These vary model to model, but we can either find optimal values through manual trial and error or a simple algorithm like grid or random search. With grid search, we try every combination of the values we list (brute force), and with random search, we try random values sampled from some distribution/selection. Both approaches typically use cross-validation.
Grid search in scikit-learn works through a parameters dictionary. Each entry key represents the hyperparameter to tune, and the value (a list or tuple) is the selection of values to choose from:
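For example, a grid for a decision tree might look like the sketch below. The hyperparameters and values are arbitrary picks, and `ParameterGrid` is used only to show which combinations a grid search would try:

```python
from sklearn.model_selection import ParameterGrid

# Keys are hyperparameter names, values are the candidates to try.
parameters = {
    "max_depth": [2, 4, 8],
    "min_samples_leaf": (1, 5),
}

# A grid search tries every combination: 3 * 2 = 6 candidates.
combinations = list(ParameterGrid(parameters))
print(len(combinations))
```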
After we've created the grid, we can use it to train the models, and extract the scores:
The important thing here is to remember that we need to train on the training and not testing data. Even though cross-validation is used to test the models, we're ultimately trying to get the best fit on the training data and will proceed to test each model on the testing set afterward:
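Putting those steps together, a sketch of the whole routine could look like this. The dataset, model and grid values are illustrative assumptions, not a recommendation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

parameters = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), parameters, cv=5)

# Cross-validation happens inside fit, using the training data only.
grid.fit(X_train, y_train)
cv_scores = grid.cv_results_["mean_test_score"]  # one score per combination
print(grid.best_params_, grid.best_score_)

# Only now do we touch the testing set, using the best model found.
test_score = grid.score(X_test, y_test)
print(test_score)
```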
Random search in scikit-learn works similarly but is slightly more complex as we need to know what type of distribution each hyperparameter takes in. Although it, in theory, can yield the same or better results faster, that changes from situation to situation. For simplicity, it is likely best to stick to a grid search.
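For comparison, a random search might be sketched as follows. The `randint` distributions and their ranges are assumptions picked for the example:

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Instead of lists of values, we describe a distribution to sample from.
distributions = {
    "max_depth": randint(1, 10),         # integers from 1 to 9
    "min_samples_leaf": randint(1, 20),  # integers from 1 to 19
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    distributions,
    n_iter=10,  # number of random combinations to try
    cv=5,
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_)
```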
Using a model
With scikit-learn, it's as simple as finding our desired model's class and then just creating a variable for it. Check the scikit-learn documentation for further details on each model!
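A minimal sketch of that pattern, with logistic regression on the bundled iris dataset as an arbitrary example. Other scikit-learn models follow the same create, fit, predict steps:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Create a variable for the model, train it, then ask for predictions.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
predictions = model.predict(X[:3])
print(predictions)
```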
- Linear/Logistic Regression
Linear regression is trying to fit a straight line to our data. It is the most basic and fundamental model. There are several variants of linear regression, like lasso and ridge regression (which are regularisation methods to prevent overfitting). Polynomial regression can be used to fit curves of higher degrees (like parabolas and other curves). Logistic regression is another variant that can be used for classification.
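To illustrate the difference between a straight line and a higher-degree curve, here is a sketch on synthetic, noise-free parabola data. Building polynomial regression from `PolynomialFeatures` plus `LinearRegression` is one common way to do it, assumed here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up data following y = 2x^2 + 1.
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 2 * x.ravel() ** 2 + 1

# A straight line cannot follow the curve...
line = LinearRegression().fit(x, y)

# ...but a linear model on squared features can.
curve = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

print(line.score(x, y), curve.score(x, y))  # R^2 of each fit
```

Swapping `LinearRegression` for `Ridge` or `Lasso` (also in `sklearn.linear_model`) adds the regularisation mentioned above.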
- Support Vector Machines
Just like with linear/logistic regression, support vector machines (SVMs) try to fit a line or curve to data points. However, with SVMs the aim is to maximise the distance between a boundary and the points closest to it (instead of getting the line/curve to go through each point).
The main advantage of support vector machines is their ability to use different kernels. A kernel is a function that calculates the similarity between two data points. These kernels allow for both linear and non-linear data while staying decently efficient. The kernels implicitly map the input into a higher-dimensional space, where a separating boundary becomes possible. Kernel SVMs scale poorly to large datasets, though (training time grows at least quadratically with the number of samples), so a neural network or another model will then likely be a better choice!
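As a quick illustration of why kernels matter, this sketch compares a linear and an RBF kernel on synthetic concentric circles, a dataset chosen precisely because no straight line can separate it:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two classes arranged as an inner and an outer ring.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

# Accuracy on the training data, just to compare the two boundaries.
print(linear_svm.score(X, y), rbf_svm.score(X, y))
```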
- Neural Networks
All the buzz is always about deep learning and neural networks. They are complex, slow, and resource-intensive models that can be used for complex data. Yet, they are extremely useful when encountering large unstructured datasets.
When using a neural net, make sure to watch out for overfitting. An easy way is through tracking changes in error with time (known as learning curves).
Deep learning is an extremely rich field, so there is far too much to discuss here. In fact, scikit-learn is a machine learning library with few deep learning capabilities (compared to PyTorch or TensorFlow).
- Decision Trees
Decision trees are simple and quick ways to model relationships. They are basically a tree of decisions that help decide on what class or label a datapoint belongs to. Decision trees can be used for regression problems too. Although simple, to avoid overfitting, several hyperparameters must be chosen. These all, in general, relate to how deep the tree is and how many decisions are to be made.
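A small sketch of how one such hyperparameter, `max_depth`, limits the tree. The iris dataset and the depth of 2 are arbitrary choices:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A shallow tree is limited to two levels of decisions...
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# ...while an unrestricted tree keeps splitting until it fits the
# training data as closely as it can (and may overfit).
deep = DecisionTreeClassifier(random_state=0).fit(X, y)

print(shallow.get_depth(), deep.get_depth())
```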
- K-Means
We can group unlabeled data into several clusters using k-means. Normally the number of clusters present is a chosen hyperparameter.
K-means works by trying to optimize (reduce) some criterion (i.e., function) called inertia. It can be thought of as trying to minimize the distance from a set of centroids to each data point.
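A minimal sketch on two synthetic, well-separated blobs of points. The blob positions and the choice of two clusters are assumptions made for the example:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two made-up blobs of 50 points each, centred at (0, 0) and (5, 5).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
])

# The number of clusters is the hyperparameter we choose up front.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.cluster_centers_)  # the centroids that were found
print(kmeans.inertia_)          # sum of squared distances to the centroids
```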
- Random Forests
Random forests are combinations of multiple decision trees trained on random subsets of the data (bootstrapping). This process is called bagging and allows random forests to obtain a good fit (low bias and low variance) with complex data.
The rationale behind this can be likened to democracy.
One voter may vote for a bad candidate, but we'd hope that the majority of voters make informed, positive decisions.
For regression problems, we average each decision tree's outputs, and for classification, we choose the most popular one. This might not always work, but we generally assume it will (especially with large datasets with multiple columns).
Another advantage with random forests is that insignificant features shouldn't negatively impact performance because of the democratic-like bootstrapping process!
Hyperparameter choices are the same as those for decision trees but with the number of decision trees as well. For the reasons above, adding more trees generally reduces variance rather than causing overfitting!
Note that random forests sample rows with replacement (bootstrapping) and also consider only a random subset of the columns at each split!
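In scikit-learn this might be sketched as below. The parameter values are illustrative, and `bootstrap` and `max_features` are spelled out only to show where the random row and column subsets are controlled:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

forest = RandomForestClassifier(
    n_estimators=100,     # number of decision trees that get a vote
    bootstrap=True,       # each tree sees rows sampled with replacement
    max_features="sqrt",  # each split considers a random subset of columns
    random_state=0,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```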
- Ensemble Models
Ensemble models like AdaBoost or XGBoost work by training one model after another. The assumption here is that each successive weak learner will correct for the flaws of the previous one (hence called boosting). Hence, the combination of models should provide the advantages of each model without its potential pitfalls.
The iterative approach means previous models' performance affects current models, and better models are given a higher priority. Boosted models often perform slightly better than bagging models (e.g., random forests), but are also slightly more likely to overfit. The scikit-learn library provides AdaBoost for classification and regression.
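A short sketch of AdaBoost for classification. The bundled breast cancer dataset and the number of learners are arbitrary choices for the example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Up to 50 weak learners (shallow decision trees by default), each one
# focusing on the examples the previous ones got wrong.
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
boosted.fit(X_train, y_train)
print(boosted.score(X_test, y_test))
```

`AdaBoostRegressor` from the same module covers the regression case.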
Chapter 5 - Production
This is the last but potentially most important part of the process. We've put in all this work, and so we need to go the distance and create something impressive!
After trying most of these, I honestly would recommend sticking to Streamlit, because it is so much easier than the others!
Here it is important to start with a vision (the simpler the better) and try to find out which parts are most important. Then try and specifically work on those. Continue until completion! For websites, a hosting service like Heroku will be needed, so the rest of the world can see the amazing end-product of all our hard work.
Even if none of the options above suit the scenario, a report or article covering what we've done, what we've learned, and any suggestions, along with a well-documented GitHub repository, is indispensable! Make sure that the readme file is up to date.
Original. Reposted with permission.