How I Consistently Improve My Machine Learning Models From 80% to Over 90% Accuracy
Data science work typically requires a big lift near the end to increase the accuracy of any model developed. These five recommendations will help improve your machine learning models and help your projects reach their target goals.
Photo by Ricardo Arce on Unsplash.
If you’ve completed a few data science projects of your own, you’ve probably realized by now that achieving an accuracy of 80% isn’t too bad! But in the real world, 80% won’t cut it. In fact, most companies that I’ve worked for expect at least 90% accuracy (or 90% on whatever metric they’re tracking).
Therefore, I’m going to talk about 5 things that you can do to significantly improve your accuracy. I highly recommend that you read all five points thoroughly because there are a lot of details that I’ve included that most beginners don’t know.
By the end of this, you should understand that there are many more variables than you think that play a role in dictating how well your machine learning model performs.
With that said, here are 5 things that you can do to improve your machine learning models!
1. Handling Missing Values
One of the biggest mistakes I see is how people handle missing values, and it’s not necessarily their fault. A lot of material on the web says that you typically handle missing values through mean imputation, which replaces null values with the mean of the given feature. This usually isn’t the best method.
For example, imagine we have a table showing age and fitness score, and imagine that an eighty-year-old has a missing fitness score. If we took the average fitness score across an age range of 15 to 80, then the eighty-year-old would appear to have a much higher fitness score than they actually should.
Therefore, the first question you want to ask yourself is why the data is missing to begin with.
Next, consider other methods in handling missing data, aside from mean/median imputation:
- Feature Prediction Modeling: Referring back to my example regarding age and fitness scores, we can model the relationship between age and fitness scores and then use the model to find the expected fitness score for a given age. This can be done via several techniques, including regression, ANOVA, and more.
- K Nearest Neighbour Imputation: Using KNN imputation, the missing data is filled with a value taken from the most similar samples. For those who don’t know, the similarity in KNN is determined using a distance function (e.g., Euclidean distance).
- Deleting the row: Lastly, you can delete the row. This is not usually recommended, but it is acceptable when you have an immense amount of data to start with.
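To make the age and fitness score example concrete, here is a minimal sketch comparing mean imputation with the first two alternatives above. The numbers are made up for illustration, and I’m assuming scikit-learn’s LinearRegression and KNNImputer as the tools; note that KNNImputer fills the gap with the average of the nearest neighbours rather than copying a single sample.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression

# Made-up toy data: columns are [age, fitness_score]; np.nan marks the missing score.
data = np.array([
    [15, 90.0],
    [25, 85.0],
    [40, 70.0],
    [55, 55.0],
    [70, 40.0],
    [80, np.nan],
])

ages = data[:, 0]
scores = data[:, 1]
known = ~np.isnan(scores)

# 1) Mean imputation: ignores age entirely, so the 80-year-old gets 68.0.
mean_fill = scores[known].mean()

# 2) Feature prediction modeling: regress fitness score on age,
#    then predict the missing score from the age (about 31.6 here).
reg = LinearRegression().fit(ages[known].reshape(-1, 1), scores[known])
regression_fill = reg.predict(ages[~known].reshape(-1, 1))[0]

# 3) KNN imputation: average the scores of the 2 rows closest in age
#    (ages 70 and 55), which gives 47.5.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(data)
knn_fill = knn_filled[-1, 1]

print(f"mean: {mean_fill:.1f}, regression: {regression_fill:.1f}, KNN: {knn_fill:.1f}")
```

On this toy table, both alternatives give the eighty-year-old a noticeably lower score than the overall mean does, which is the behaviour the example above argues for.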
2. Feature Engineering
The second way you can significantly improve your machine learning model is through feature engineering. Feature engineering is the process of transforming raw data into features that better represent the underlying problem that one is trying to solve. There’s no specific way to go about this step, which is what makes data science as much of an art as it is a science. That being said, here are some things that you can consider:
- Converting a DateTime variable to extract just the day of the week, the month of the year, etc…
- Creating bins or buckets for a variable. (e.g., for a height variable, you could use 100–149 cm, 150–199 cm, 200–249 cm, etc.)
- Combining multiple features and/or values to create a new one. For example, one of the most accurate models for the Titanic challenge engineered a new variable called “Is_women_or_child” which was True if the person was a woman or a child and False otherwise.
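Here is a rough sketch of the three ideas above using pandas. The records, the column names, and the age cutoff for “child” are all my own assumptions for illustration; they are not taken from the Titanic dataset or from the model mentioned above.

```python
import pandas as pd

# Made-up records, purely for illustration.
df = pd.DataFrame({
    "boarded_at": pd.to_datetime(
        ["2021-01-04 09:30", "2021-02-13 18:45", "2021-03-21 07:10"]
    ),
    "height_cm": [148, 172, 201],
    "sex": ["female", "male", "male"],
    "age": [34, 41, 9],
})

# 1) Extract parts of a DateTime variable.
df["day_of_week"] = df["boarded_at"].dt.dayofweek  # Monday=0 ... Sunday=6
df["month"] = df["boarded_at"].dt.month

# 2) Bin a numeric variable. right=False makes each bin include its lower
#    edge and exclude its upper edge, e.g. [100, 150) covers 100-149 cm.
df["height_bin"] = pd.cut(
    df["height_cm"],
    bins=[100, 150, 200, 250],
    right=False,
    labels=["100-149", "150-199", "200-249"],
)

# 3) Combine features into a new one.
CHILD_AGE_LIMIT = 16  # hypothetical cutoff, chosen only for this sketch
df["is_woman_or_child"] = (df["sex"] == "female") | (df["age"] < CHILD_AGE_LIMIT)

print(df[["day_of_week", "month", "height_bin", "is_woman_or_child"]])
```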
3. Feature Selection
The third area where you can vastly improve the accuracy of your model is feature selection, which is choosing the most relevant/valuable features of your dataset. Too many features can cause your algorithm to overfit, and too few features can cause your algorithm to underfit.
There are two main methods that I like to use for selecting features:
- Feature importance: some algorithms, like random forests or XGBoost, allow you to determine which features were the most “important” in predicting the target variable’s value. By quickly creating one of these models and conducting feature importance, you’ll get an understanding of which variables are more useful than others.
- Dimensionality reduction: One of the most common dimensionality reduction techniques, Principal Component Analysis (PCA) takes a large number of features and uses linear algebra to reduce them to fewer features.
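Both methods take only a few lines with scikit-learn. The sketch below runs on synthetic data from make_classification (my assumption, since no dataset is given here), so the actual numbers mean nothing beyond showing the mechanics.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 features, only the first 3 carry signal
# (shuffle=False keeps the informative columns first).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

# Feature importance: fit a quick random forest and rank the features.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_   # one score per feature, summing to 1
ranking = np.argsort(importances)[::-1]     # feature indices, most important first
print("Features ranked by importance:", ranking)

# Dimensionality reduction: project the 10 features onto 3 principal components.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()
print(f"Variance kept by 3 components: {explained:.2f}")
```

Keep in mind that the two are not interchangeable: feature importance tells you which original columns to keep, while PCA builds new features that are combinations of the original ones.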
4. Ensemble Learning Algorithms
One of the easiest ways to improve your machine learning model is to simply choose a better machine learning algorithm. If you don’t already know what ensemble learning algorithms are, now is the time to learn about them!
Ensemble learning is a method where multiple learning algorithms are used in conjunction. The purpose of doing so is that it allows you to achieve higher predictive performance than if you were to use an individual algorithm by itself.
Popular ensemble learning algorithms include random forests, XGBoost, gradient boosting, and AdaBoost. To explain why ensemble learning algorithms are so powerful, I’ll give an example with random forests:
Random forests involve creating multiple decision trees using bootstrapped datasets of the original data. The model then selects the mode (the majority) of all of the predictions of each decision tree. What’s the point of this? By relying on a “majority wins” model, it reduces the risk of error from an individual tree.
For example, suppose we build four decision trees, and the third one predicts 0 while the other three predict 1. If we relied on that third tree alone, the prediction would be 0. But if we relied on the mode of all 4 decision trees, the predicted value would be 1. This is the power of ensemble learning!
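The bootstrap-and-vote idea can be sketched by hand in a few lines. This is a simplified illustration on synthetic data, not how scikit-learn implements RandomForestClassifier: a real random forest also considers only a random subset of features at each split. I use 5 trees instead of 4 here so that a vote between two classes can never tie.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (labels are 0 or 1).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

n_trees = 5  # odd, so a two-class vote cannot tie
trees = []
for _ in range(n_trees):
    # Bootstrapped dataset: sample row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# votes has one row per tree and one column per sample.
votes = np.array([tree.predict(X) for tree in trees])

# Majority vote (the mode) for 0/1 labels: predict 1 when more than
# half of the trees say 1, otherwise predict 0.
majority = (votes.sum(axis=0) > n_trees / 2).astype(int)

print("Votes for the first sample:", votes[:, 0], "-> majority:", majority[0])
```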
5. Adjusting Hyperparameters
Lastly, something that is not often talked about, but is still very important, is adjusting the hyperparameters of your model. This is where it’s essential that you clearly understand the ML model that you’re working with. Otherwise, it can be difficult to understand each hyperparameter.
Take a look at all of the hyperparameters for Random Forests in scikit-learn (the exact list and defaults vary between versions):
class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
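For example, it would probably be a good idea to understand what min_impurity_decrease is: a node is only split if the split reduces impurity by at least this amount. That way, when you want your machine learning model to grow simpler trees that are less prone to overfitting, you know that raising this parameter is one way to do it! ;)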
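Once you know what a hyperparameter does, you can let cross-validation pick its value instead of guessing. Below is a minimal sketch using scikit-learn’s GridSearchCV; the synthetic data and the candidate values in the grid are arbitrary choices of mine, so treat them as a starting point rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate values to try (arbitrary, for illustration only).
param_grid = {
    "min_impurity_decrease": [0.0, 0.01, 0.05],
    "max_depth": [None, 5],
}

# Try every combination with 3-fold cross-validation and keep the best one.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print("Best hyperparameters:", best_params)
print(f"Mean cross-validated accuracy: {best_score:.3f}")
```

A grid search gets expensive quickly as you add parameters, since every combination is trained once per fold, so it pays to tune only the hyperparameters you actually understand.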
Original. Reposted with permission.