Beginner’s Guide to K-Nearest Neighbors in R: from Zero to Hero
This post presents a pipeline for building a KNN model in R and evaluating it with several performance metrics.
By Leihua Ye, UC Santa Barbara
“If you live 5-min away from Bill Gates, I bet you are rich.”
In the world of Machine Learning, I find that the K-Nearest Neighbors (KNN) classifier makes the most intuitive sense and is easily accessible to beginners, even without introducing any math notation.
To decide the label of an observation, we look at its neighbors and assign the neighbors’ label to the observation of interest. Certainly, looking at only one neighbor may create bias and inaccuracy, so the KNN method has a set of rules and procedures for choosing the number of neighbors, e.g., examining k > 1 neighbors and adopting majority rule to decide the category.
“To decide the label for new observations, we look at the closest neighbors.”
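To make the neighbor-voting idea concrete, here is a minimal base-R sketch. It assumes numeric features and Euclidean distance; the helper name `knn_predict` and the toy data are made up for illustration and are not part of the pipeline used later in this post.

```r
# Toy training data: two numeric features and a class label (made-up values).
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    6, 6,
                    7, 6,
                    6, 7), ncol = 2, byrow = TRUE)
train_y <- c("poor", "poor", "poor", "rich", "rich", "rich")

# Hypothetical helper: classify one new point by majority vote
# among its k nearest training points.
knn_predict <- function(new_x, train_x, train_y, k = 3) {
  # Euclidean distance from the new point to every training point
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Labels of the k closest training points
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote (ties go to the label that sorts first)
  votes <- table(nearest)
  names(votes)[which.max(votes)]
}

knn_predict(c(6.5, 6.5), train_x, train_y, k = 3)  # neighbors are all "rich"
knn_predict(c(1.5, 1.5), train_x, train_y, k = 3)  # neighbors are all "poor"
```

With k = 1 the prediction depends on a single point; with a larger k, one odd neighbor gets outvoted.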
Measure of Distance
To choose the nearest neighbors, we have to define what distance is. For categorical data, there are Hamming Distance and Edit Distance. More information can be found here, as I won’t delve into the math in this post.
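As a quick illustration of the two distances named above, the sketch below computes both in base R. The `hamming` helper and the two example customers are hypothetical; `adist()` is the base-R function for generalized edit (Levenshtein) distance.

```r
# Hamming distance: the number of positions at which two
# equal-length vectors differ.
hamming <- function(a, b) {
  stopifnot(length(a) == length(b))
  sum(a != b)
}

# Two made-up customers described by categorical features
cust1 <- c(job = "admin", marital = "married", housing = "yes")
cust2 <- c(job = "admin", marital = "single",  housing = "no")
hamming(cust1, cust2)  # 2: marital and housing differ

# Edit distance between two strings: the minimum number of
# insertions, deletions, and substitutions needed to turn one into the other
adist("kitten", "sitting")  # 3
```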
What is K-Fold Cross Validation?
In Machine Learning, Cross-Validation (CV) plays a crucial role in model selection and has a wide range of applications. In fact, CV has a rather straightforward design idea and also makes intuitive sense.
It can be briefly stated as follows:
- divide the data into K roughly equal-sized chunks/folds
- choose 1 chunk/fold as a test set and the remaining K-1 as a training set
- develop an ML model based on the training set
- apply the ML model to the test set and compare predicted values vs. true values on the test set only
- repeat K times, using each chunk as the test set once
- add up the metric scores for the model and average over the K folds
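The steps above can be sketched in a few lines of R. This is only an illustration under my own assumptions: it uses the built-in `iris` data rather than the banking data, `knn()` from the `class` package as the model, 5 folds, and 5 neighbors.

```r
library(class)  # provides knn(); ships with standard R installations

set.seed(42)
X <- scale(iris[, 1:4])   # numeric features, standardized once up front for brevity
y <- iris$Species         # class labels
K <- 5                    # number of folds (assumed for this example)

# Step 1: randomly assign each row to one of K roughly equal folds
fold_id <- sample(rep(seq_len(K), length.out = nrow(X)))

# Steps 2-5: hold out each fold in turn, train on the rest,
# and score on the held-out fold only
fold_error <- sapply(seq_len(K), function(f) {
  test_idx <- fold_id == f
  pred <- knn(train = X[!test_idx, ], test = X[test_idx, ],
              cl = y[!test_idx], k = 5)
  mean(pred != y[test_idx])   # misclassification rate on the held-out fold
})

# Step 6: average the metric over the K folds
cv_error <- mean(fold_error)
fold_error
cv_error
```

Every observation lands in the test set exactly once, so the averaged score uses all of the data without ever scoring a model on rows it was trained on.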
How to choose K?
As you probably noticed, the tricky part of CV is how to set the value for K. Let’s say the total sample size = n. Technically, we can set K to any value between 2 and n.
If k = n, we basically take 1 observation out as the test set and use the remaining n-1 cases as the training set. Then, we repeat the process for every observation in the dataset. This is called “Leave-one-out cross-validation” (LOOCV).
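A minimal LOOCV sketch, again assuming the built-in `iris` data and 5 neighbors purely for illustration. The loop spells out the hold-one-out logic; `knn.cv()` from the `class` package performs leave-one-out classification in a single call.

```r
library(class)

set.seed(1)               # knn() breaks distance ties at random
X <- scale(iris[, 1:4])
y <- iris$Species
n <- nrow(X)

# Manual LOOCV: each observation serves as the test set exactly once,
# and the other n-1 observations form the training set.
loo_pred <- sapply(seq_len(n), function(i) {
  as.character(knn(train = X[-i, ], test = X[i, , drop = FALSE],
                   cl = y[-i], k = 5))
})
loocv_error <- mean(loo_pred != y)

# The same idea in one call
loocv_builtin <- mean(knn.cv(train = X, cl = y, k = 5) != y)

loocv_error
loocv_builtin
```

The manual loop fits n separate models, which is why LOOCV gets expensive quickly as the dataset grows.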
Well, LOOCV requires a lot of computational power and may run forever if your dataset is big. Taking a step back, there is no such thing as the best k value, nor is it true that a higher k is always a better k.
To choose the most appropriate number of folds k, we have to make a tradeoff between bias and variance. If k is small, we have a high bias but a low variance for estimating test error; if k is big, we have a low bias and a high variance.
“Hello Neighbor! Come On In.”
Implementation in R
1. Software Preparation
After loading and cleaning the original dataset, it is a common practice to visually examine the distribution of our variables, checking for seasonality, patterns, outliers, etc.
As can be seen, the Outcome Variable (Banking Service Subscription) is not equally distributed, with many more “No”s than “Yes”s.
This is unsurprisingly inconvenient for supervised learning when we try to classify future labels correctly. As expected, the rate of false negatives (treating “Yes” as the positive class) would be high, as a lot of minority cases would be classified as the majority label.
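To see why imbalance matters, here is a small sketch with a made-up outcome vector (900 “no” and 100 “yes”; these counts are hypothetical, not the real banking data). A naive rule that always predicts the majority label looks accurate while missing every minority case.

```r
# Hypothetical outcome vector standing in for the subscription variable
y <- factor(c(rep("no", 900), rep("yes", 100)), levels = c("no", "yes"))

table(y)              # raw counts per class
prop.table(table(y))  # class shares

# A "classifier" that always predicts the majority class
majority_pred <- factor(rep("no", length(y)), levels = levels(y))

accuracy <- mean(majority_pred == y)                      # share of correct labels
missed_yes <- sum(majority_pred == "no" & y == "yes")     # minority cases labeled "no"

accuracy
missed_yes
table(predicted = majority_pred, actual = y)
```

This is why accuracy alone is not enough on unbalanced data, and why the metrics later in this post look at each type of error separately.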
In fact, an unbalanced distribution may favor a non-parametric ML classifier: my other post (Rare Event Classification Using 5 Classifiers) shows that KNN performs the best among the ML methods compared there. This may be caused by differences in the underlying math and statistical assumptions between parametric and non-parametric models.
2. Data Split
As mentioned above, we need to split the dataset into a training set and a test set and adopt k-fold cross-validation to pick the best ML model. As a rule of thumb, we stick to the “80–20” division: we train ML models on 80% of the data and test them on the remaining 20%. For Time Series data, the split is slightly different: 90% vs. 10%.
So far, we have finished data preparation and can move on to model selection.
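A random 80–20 split can be sketched in base R as follows. The built-in `iris` data stands in for the real dataset, and the object names are my own.

```r
set.seed(1)   # make the random split reproducible

n <- nrow(iris)
# Draw 80% of the row indices at random for training
train_idx <- sample(n, size = round(0.8 * n))

train_set <- iris[train_idx, ]    # 80% of the rows
test_set  <- iris[-train_idx, ]   # the remaining 20%

nrow(train_set)
nrow(test_set)
```

The test set is put aside at this point and only touched again once the model has been chosen.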
3. Train Models
Let’s create a new function (“calc_error_rate”) to record the misclassification rate. The function calculates the share of observations for which the label predicted by the trained model does not match the actual outcome label. It is the complement of classification accuracy (error rate = 1 - accuracy).
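One plausible definition of such a function is shown below. The argument names and the example labels are assumptions made for this sketch.

```r
# Misclassification rate: the share of predictions that do not
# match the true labels.
calc_error_rate <- function(predicted.value, true.value) {
  mean(true.value != predicted.value)
}

# Made-up example: one of four predictions is wrong
pred   <- c("yes", "no", "no",  "no")
actual <- c("yes", "no", "yes", "no")
calc_error_rate(pred, actual)      # 0.25
1 - calc_error_rate(pred, actual)  # accuracy: 0.75
```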
Then, we need another function, “do.chunk()”, to do k-fold Cross Validation. The function returns a data frame of the possible values of folds. The main purpose of this step is to select the best number of neighbors (the k in KNN).
The upcoming step is to find the value of k that minimizes the validation error.
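A sketch of what such a function might look like is given below. The argument names, the returned columns, and the use of `iris` with 5 folds are all assumptions for illustration; only the function names `do.chunk` and `calc_error_rate` come from the text.

```r
library(class)

calc_error_rate <- function(predicted.value, true.value) {
  mean(true.value != predicted.value)
}

# Hypothetical signature: fit KNN with k neighbors on all folds except
# `chunkid`, then report training and validation error for that fold.
do.chunk <- function(chunkid, folddef, Xdat, Ydat, k) {
  train <- (folddef != chunkid)                  # rows used for training
  Xtr <- Xdat[train, , drop = FALSE];  Ytr <- Ydat[train]
  Xvl <- Xdat[!train, , drop = FALSE]; Yvl <- Ydat[!train]
  predYtr <- knn(train = Xtr, test = Xtr, cl = Ytr, k = k)  # predictions on training rows
  predYvl <- knn(train = Xtr, test = Xvl, cl = Ytr, k = k)  # predictions on held-out rows
  data.frame(fold = chunkid,
             neighbors = k,
             train.error = calc_error_rate(predYtr, Ytr),
             val.error = calc_error_rate(predYvl, Yvl))
}

set.seed(7)
X <- scale(iris[, 1:4])
y <- iris$Species
nfold <- 5
folds <- sample(rep(seq_len(nfold), length.out = nrow(X)))

# One row of errors per fold, for a fixed number of neighbors
results <- do.call(rbind, lapply(seq_len(nfold), do.chunk,
                                 folddef = folds, Xdat = X, Ydat = y, k = 5))
results
```

Calling the function once per fold and per candidate number of neighbors gives a table of validation errors to compare.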
Using 10-fold cross-validation, the best number of neighbors turns out to be 20.
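The search behind a result like this can be sketched as follows. The example runs on the built-in `iris` data with a small, arbitrary grid of candidate values, so the selected number of neighbors will not be the 20 reported for the banking data.

```r
library(class)

set.seed(2024)
X <- scale(iris[, 1:4])
y <- iris$Species
nfold <- 10
folds <- sample(rep(seq_len(nfold), length.out = nrow(X)))
kvec <- c(1, 5, 10, 20, 30)   # candidate numbers of neighbors (assumed grid)

# Mean validation error across the folds for each candidate k
val_error <- sapply(kvec, function(k) {
  mean(sapply(seq_len(nfold), function(f) {
    held_out <- folds == f
    pred <- knn(train = X[!held_out, ], test = X[held_out, ],
                cl = y[!held_out], k = k)
    mean(pred != y[held_out])
  }))
})

cv_table <- data.frame(neighbors = kvec, val.error = val_error)
best_k <- kvec[which.min(val_error)]   # first k with the lowest validation error

cv_table
best_k
```

When several candidates tie, `which.min()` returns the first one, which here means the smallest number of neighbors.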
4. Some Model Metrics
The training error is 0.10.
The test error is 0.11.
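For readers who want to reproduce this kind of output, here is a hedged sketch of computing training error, test error, and a confusion matrix. It uses a binary label derived from `iris` (“yes” if the species is virginica) as a stand-in for the subscription outcome, with 5 neighbors; the numbers it prints are not the 0.10 and 0.11 reported above.

```r
library(class)

set.seed(123)
X <- scale(iris[, 1:4])
# Stand-in binary outcome (an assumption for this example)
y <- factor(ifelse(iris$Species == "virginica", "yes", "no"),
            levels = c("no", "yes"))

# 80-20 split
train_idx <- sample(nrow(X), size = round(0.8 * nrow(X)))

# Predict on the training rows and on the held-out test rows
pred_train <- knn(X[train_idx, ], X[train_idx, ],  y[train_idx], k = 5)
pred_test  <- knn(X[train_idx, ], X[-train_idx, ], y[train_idx], k = 5)

train_error <- mean(pred_train != y[train_idx])
test_error  <- mean(pred_test  != y[-train_idx])

# Confusion matrix on the test set: rows are predictions, columns are truth
conf_mat <- table(predicted = pred_test, actual = y[-train_idx])

train_error
test_error
conf_mat
```

The diagonal of the confusion matrix holds the correct predictions, so the test error equals one minus the diagonal share.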
Based on the above confusion matrix, we can calculate the following values and prepare for plotting the ROC curve.
Accuracy = (TP+TN)/(TP+FP+FN+TN)
TPR/Recall/Sensitivity = TP/(TP+FN)
Precision = TP/(TP+FP)
Specificity = TN/(TN+FP)
FPR = 1 - Specificity = FP/(TN+FP)
F1 Score = 2*TP/(2*TP+FP+FN) = 2*Precision*Recall/(Precision+Recall)
As you may notice, test accuracy rate + test error rate = 1, and I’m providing multiple ways of calculating each value.
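As a worked example, the sketch below plugs made-up confusion-matrix counts into these definitions (TP, FP, FN, and TN here are hypothetical, not the results of the banking model) and checks that the two ways of computing the F1 score agree.

```r
# Hypothetical confusion-matrix counts
TP <- 40; FP <- 10; FN <- 20; TN <- 130

accuracy    <- (TP + TN) / (TP + FP + FN + TN)   # 170/200
recall      <- TP / (TP + FN)                    # 40/60, also called TPR or sensitivity
precision   <- TP / (TP + FP)                    # 40/50
specificity <- TN / (TN + FP)                    # 130/140
fpr         <- 1 - specificity                   # same as FP / (TN + FP)

# Two equivalent ways to get the F1 score
f1_counts <- 2 * TP / (2 * TP + FP + FN)                        # 80/110
f1_pr     <- 2 * precision * recall / (precision + recall)

error_rate <- 1 - accuracy   # accuracy and error rate sum to 1

c(accuracy = accuracy, recall = recall, precision = precision,
  specificity = specificity, fpr = fpr, f1 = f1_counts,
  error_rate = error_rate)
all.equal(f1_counts, f1_pr)
```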
In conclusion, we have learned what KNN is and built a pipeline for fitting a KNN model in R. More importantly, we have learned the underlying idea behind K-Fold Cross-validation and how to cross-validate in R.
Enjoyed reading this one? If so, please check out my other posts on Machine Learning and programming.
A Big Challenge: How to Predict Rare Events using 5 Machine Learning Methods
Which ML method works best when the outcome variable is highly imbalanced? What are the tradeoffs?
Machine Learning 101: Predicting Drug Use Using Logistic Regression In R
Basics, link functions, and plots
Machine Learning 102: Logistic Regression With Polynomial Features
How to classify when there are nonlinear components
Unsupervised Machine Learning: Using PCA and Hierarchical Clustering To Analyze Genes and Leukemia
A real-life application of unsupervised learning
Image Compression In 10 Lines of R Code
An innovative way of using PCA in dimension reduction
Bio: Leihua Ye (@leihua_ye) is a Ph.D. Candidate at UC Santa Barbara. He has 5+ years of research and professional experience in Quantitative UX Research, Experimentation & Causal Inference, Machine Learning, and Data Science.
Original. Reposted with permission.