Getting Started with Scikit-learn for Classification in Machine Learning

This tutorial introduces you to the scikit-learn library and its key features. It also gives you a brief overview of the multiclass classification problem, solved with several different algorithms.




 

Scikit-learn is one of the most widely used machine learning libraries for Python. Its popularity can be attributed to its simple and consistent code structure, which makes it friendly to beginners. There is also a high level of support available, along with the flexibility to integrate third-party functionality, which makes the library robust and suitable for production. The library contains multiple machine learning models for classification, regression, and clustering. In this tutorial, we will explore the problem of multiclass classification through various algorithms. Let’s dive right in and build our scikit-learn models.

 

Install the Latest Version

 

pip install scikit-learn
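
If scikit-learn is already installed, a quick way to confirm the version (any reasonably recent release should work for the snippets below) is:

python -c "import sklearn; print(sklearn.__version__)"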

 

Loading the Dataset

 

We will use the “Wine” dataset available in the datasets module of scikit-learn. It consists of 178 samples and 3 classes in total. The dataset is already pre-processed and converted to feature vectors, so we can use it directly to train our models.

from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
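
As a quick sanity check (an extra step, not required for the rest of the tutorial), you can also load the full dataset object and inspect its shape and class names:

from sklearn.datasets import load_wine

wine = load_wine()
print(wine.data.shape)      # (178, 13): 178 samples with 13 numeric features
print(wine.target_names)    # ['class_0' 'class_1' 'class_2']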

 

Creating Training and Testing Data

 

We will keep 67% of the data for training and the remaining 33% for testing.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
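
The split above is purely random. If you prefer each split to preserve the original class proportions, passing stratify=y is a common optional tweak (not used for the results reported below):

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)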

 

Now, we will experiment with 5 models of differing complexity and evaluate their results on our dataset.

 

Logistic Regression

 

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)
y_pred_lr = model_lr.predict(X_test)

print("Accuracy Score: ", accuracy_score(y_pred_lr, y_test))
print(classification_report(y_pred_lr, y_test))

 

Output

Accuracy Score:  0.9830508474576272
 
              precision    recall  f1-score   support
 
           0       1.00      0.95      0.98        21
           1       0.96      1.00      0.98        23
           2       1.00      1.00      1.00        15
 
    accuracy                           0.98        59
   macro avg       0.99      0.98      0.98        59
weighted avg       0.98      0.98      0.98        59
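
With default settings, LogisticRegression may emit a convergence warning on this dataset because the features are on very different scales. A minimal sketch of a common remedy, scaling the features inside a pipeline (the pipeline and the max_iter value are additions of mine, not part of the results above), looks like this:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the features, then fit logistic regression; max_iter raised as a safety margin
model_lr_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model_lr_scaled.fit(X_train, y_train)
print("Accuracy Score:", accuracy_score(y_test, model_lr_scaled.predict(X_test)))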

 

K-Nearest Neighbors

 

from sklearn.neighbors import KNeighborsClassifier

model_knn = KNeighborsClassifier(n_neighbors=1)
model_knn.fit(X_train, y_train)
y_pred_knn = model_knn.predict(X_test)

print("Accuracy Score:", accuracy_score(y_pred_knn, y_test))
print(classification_report(y_pred_knn, y_test))

 

Output 

Accuracy Score: 0.7796610169491526
 
              precision    recall  f1-score   support
 
           0       0.90      0.78      0.84        23
           1       0.75      0.82      0.78        22
           2       0.67      0.71      0.69        14
 
    accuracy                           0.78        59
   macro avg       0.77      0.77      0.77        59
weighted avg       0.79      0.78      0.78        59

 

Upon changing the parameter to n_neighbors=2, we observe a decrease in accuracy. This suggests that the dataset is simple enough that a single neighbor is sufficient, and considering more neighbors hurts performance. A sketch of choosing the number of neighbors with cross-validation follows the output below.

Output 

Accuracy Score: 0.6949152542372882
 
              precision    recall  f1-score   support
 
           0       0.90      0.72      0.80        25
           1       0.75      0.69      0.72        26
           2       0.33      0.62      0.43         8
 
    accuracy                           0.69        59
   macro avg       0.66      0.68      0.65        59
weighted avg       0.76      0.69      0.72        59
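
Rather than trying values of n_neighbors by hand, a minimal sketch of picking it with cross-validation on the training set is shown below (the range of k values is an arbitrary choice for illustration):

from sklearn.model_selection import cross_val_score

# Evaluate k = 1..10 with 5-fold cross-validation on the training data only
for k in range(1, 11):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print(f"k={k}: mean CV accuracy = {scores.mean():.3f}")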

 

Naive Bayes

 

from sklearn.naive_bayes import GaussianNB
 
model_nb = GaussianNB()
model_nb.fit(X_train, y_train)
y_pred_nb = model_nb.predict(X_test)
 
print("Accuracy Score:", accuracy_score(y_pred_nb, y_test))
print(classification_report(y_pred_nb, y_test))

 

Output

Accuracy Score: 1.0
 
              precision    recall  f1-score   support
 
           0       1.00      1.00      1.00        20
           1       1.00      1.00      1.00        24
           2       1.00      1.00      1.00        15
 
    accuracy                           1.00        59
   macro avg       1.00      1.00      1.00        59
weighted avg       1.00      1.00      1.00        59

 

Decision Tree Classifier

 

from sklearn.tree import DecisionTreeClassifier
 
model_dtclassifier = DecisionTreeClassifier()
model_dtclassifier.fit(X_train, y_train)
y_pred_dtclassifier = model_dtclassifier.predict(X_test)
 
print("Accuracy Score:", accuracy_score(y_pred_dtclassifier, y_test))
print(classification_report(y_pred_dtclassifier, y_test))

 

Output

Accuracy Score: 0.9661016949152542
 
              precision    recall  f1-score   support
 
           0       0.95      0.95      0.95        20
           1       1.00      0.96      0.98        25
           2       0.93      1.00      0.97        14
 
    accuracy                           0.97        59
   macro avg       0.96      0.97      0.97        59
weighted avg       0.97      0.97      0.97        59

 

Random Forest Classifier

 

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV


def get_best_parameters():

    params = {
        "n_estimators": [10, 50, 100],
        "max_features": ["auto", "sqrt", "log2"],
        "max_depth": [5, 10, 20, 50],
        "min_samples_split": [2, 4, 6],
        "min_samples_leaf": [2, 4, 6],
        "bootstrap": [True, False],
    }

    model_rfclassifier = RandomForestClassifier(random_state=42)

    rf_randomsearch = RandomizedSearchCV(
        estimator=model_rfclassifier,
        param_distributions=params,
        n_iter=5,
        cv=3,
        verbose=2,
        random_state=42,
    )

    rf_randomsearch.fit(X_train, y_train)

    best_parameters = rf_randomsearch.best_params_

    print("Best Parameters:", best_parameters)

    return best_parameters


parameters_rfclassifier = get_best_parameters()

model_rfclassifier = RandomForestClassifier(
    **parameters_rfclassifier, random_state=42
)

model_rfclassifier.fit(X_train, y_train)

y_pred_rfclassifier = model_rfclassifier.predict(X_test)

print("Accuracy Score:", accuracy_score(y_pred_rfclassifier, y_test))
print(classification_report(y_pred_rfclassifier, y_test))

 

Output


Best Parameters: {'n_estimators': 100, 'min_samples_split': 6, 'min_samples_leaf': 4, 'max_features': 'log2', 'max_depth': 5, 'bootstrap': True}
Accuracy Score: 0.9830508474576272
 
              precision    recall  f1-score   support
 
           0       1.00      0.95      0.98        21
           1       0.96      1.00      0.98        23
           2       1.00      1.00      1.00        15
 
    accuracy                           0.98        59
   macro avg       0.99      0.98      0.98        59
weighted avg       0.98      0.98      0.98        59

 

For this algorithm, we performed some hyperparameter tuning to achieve the best accuracy. We defined a parameter grid with multiple candidate values for each parameter. We then used RandomizedSearchCV to sample parameter combinations and evaluate them with cross-validation. Finally, we fed the best parameters into the classifier and retrained the model.
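
The fitted search object also exposes the cross-validated score and a refit model directly. A minimal sketch, assuming get_best_parameters() were changed to return the fitted rf_randomsearch object instead of only the parameter dictionary (a hypothetical variation on the code above):

# rf_randomsearch: the fitted RandomizedSearchCV object returned by the modified function
print("Best CV score:", rf_randomsearch.best_score_)  # mean cross-validated accuracy of the best combination
# best_estimator_ is already refit on the full training data, so it can predict directly
y_pred_best = rf_randomsearch.best_estimator_.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred_best))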

 

Comparison of Models

 

Model | Accuracy | Observations
Logistic Regression | 98.30% | Achieves great accuracy; the model generalizes well on the test set.
K-Nearest Neighbors | 77.96% | Does not learn the data representation well, likely because the features are unscaled and distance-based methods are sensitive to feature scale.
Naive Bayes | 100% | A simple model, yet it classifies this small test set perfectly, which suggests the classes are well separated under its assumptions.
Decision Tree Classifier | 96.61% | Achieves decent accuracy.
Random Forest Classifier | 98.30% | Being an ensemble approach, it performs better than a single decision tree; with hyperparameter tuning it matches the accuracy of logistic regression.
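
To reproduce this comparison in a single pass, a minimal sketch is to loop over the classifiers and print their test accuracies. The exact numbers can differ slightly from the table for the tree-based models, since the decision tree has no fixed random_state here and the random forest uses default rather than tuned parameters:

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

models = {
    "Logistic Regression": LogisticRegression(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=1),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Train each model on the same split and report its test accuracy
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {accuracy_score(y_test, model.predict(X_test)):.4f}")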

 

Conclusion

 

In this tutorial, we learned how to get started with building and training machine learning models in scikit-learn. We implemented and evaluated a few algorithms to get a basic idea of their performance. You can always adopt advanced strategies for feature engineering, hyperparameter tuning, or training to improve performance further. To read more about the functionality that scikit-learn offers, head over to the official documentation - Introduction to machine learning with scikit-learn, Machine Learning in Python with scikit-learn.

 
 
Yesha Shastri is a passionate AI developer and writer pursuing a Master’s in Machine Learning at Université de Montréal. Yesha is keen to explore responsible AI techniques that solve challenges to benefit society and to share her learnings with the community.