Avoiding Overfitting, Class Imbalance, & Feature Scaling Issues: The Machine Learning Practitioner’s Notebook

# Introduction
Machine learning practitioners encounter three persistent challenges that can undermine model performance: overfitting, class imbalance, and feature scaling issues. These problems appear across domains and model types, yet effective solutions exist when practitioners understand the underlying mechanics and apply targeted interventions.
# Avoiding Overfitting
Overfitting occurs when models learn training data patterns too well, capturing noise rather than generalizable relationships. The result — impressive training accuracy paired with disappointing real-world performance.
Cross-validation (CV) provides the foundation for detecting overfitting. K-fold CV splits data into K subsets, training on K-1 folds while validating on the remaining fold. This process repeats K times, producing robust performance estimates. The variance across folds also provides valuable information. High variance suggests the model is sensitive to particular training examples, which is another indicator of overfitting. Stratified CV maintains class proportions across folds, particularly important for imbalanced datasets where random splits might create folds with wildly different class distributions.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# Assuming X and y are already defined
model = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
Data quantity matters more than algorithmic sophistication. When models overfit, collecting additional training examples often delivers better results than hyperparameter tuning or architectural changes. Doubling the training data typically improves performance in a predictable way, though with diminishing returns: each additional batch helps less than the previous one. Acquiring labeled data carries financial, temporal, and logistical costs, but when overfitting is severe and additional data is obtainable, this investment frequently outperforms weeks of model optimization. The key question is whether gains from additional data have plateaued, at which point algorithmic changes offer better returns.
Model simplification offers a direct path to generalization. Reducing neural network layers, limiting tree depth, or decreasing polynomial feature degree all constrain the hypothesis space. This constraint prevents the model from fitting overly complex patterns that will not generalize. The art lies in finding the sweet spot — complex enough to capture genuine patterns, yet simple enough to avoid noise. For neural networks, techniques like pruning can systematically remove less important connections after initial training, maintaining performance while reducing complexity and improving generalization.
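To make this concrete, here is a minimal sketch (assuming the same X and y as in the earlier snippet) comparing an unconstrained decision tree with a depth-limited one; the depth cap shrinks the hypothesis space and often narrows the gap between training and validation performance.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
# Unconstrained tree: free to memorize noise in the training data
deep_tree = DecisionTreeClassifier(random_state=42)
# Depth-limited tree: a deliberately smaller hypothesis space
shallow_tree = DecisionTreeClassifier(max_depth=4, random_state=42)
print(f"Unconstrained: {cross_val_score(deep_tree, X, y, cv=5).mean():.3f}")
print(f"max_depth=4:   {cross_val_score(shallow_tree, X, y, cv=5).mean():.3f}")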
Ensemble methods reduce variance through diversity. Bagging trains multiple models on bootstrap samples of the training data, then averages predictions. Random forests extend this by introducing feature randomness at each split. These approaches smooth out individual model idiosyncrasies, reducing the likelihood that any single model's overfitting will dominate the final prediction. The number of trees in the ensemble matters: too few and the variance reduction is incomplete, but beyond a few hundred trees, additional trees typically provide diminishing returns while increasing computational cost.
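A minimal bagging sketch follows (assuming X_train and y_train are defined). BaggingClassifier uses decision trees as its default base estimator, and a couple hundred of them is usually enough for the variance reduction to level off.
from sklearn.ensemble import BaggingClassifier
# Bagging: each tree is trained on a bootstrap sample; predictions are averaged
bagging = BaggingClassifier(n_estimators=200, random_state=42)
bagging.fit(X_train, y_train)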
Learning curves visualize the overfitting process. Plotting training and validation error as training set size increases reveals whether models suffer from high bias (both errors remain high) or high variance (large gap between training and validation error). High bias suggests the model is too simple to capture the underlying patterns; adding more data will not help. High variance indicates overfitting: the model is too complex for the available data, and adding more examples should improve validation performance.
Learning curves also show whether performance has plateaued. If validation error continues decreasing as training set size increases, gathering more data will likely help. If both curves have flattened, model architecture changes become more promising.
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=5, n_jobs=-1,
    train_sizes=np.linspace(0.1, 1.0, 10))
plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation score')
plt.xlabel('Training examples')
plt.ylabel('Score')
plt.legend()
plt.show()
Data augmentation artificially expands training sets. For images, transformations like rotation or flipping create valid variations. Text data benefits from synonym replacement or back-translation. Time series can incorporate scaling or window slicing. The key principle is that augmentations should create realistic variations that preserve the label, helping the model learn invariances to these transformations. Domain knowledge guides the selection of appropriate augmentation strategies. Horizontal flipping makes sense for natural images but not for text images containing letters, while back-translation works well for sentiment analysis but may introduce semantic drift for technical documentation.
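As one possible sketch of image augmentation, using torchvision (an assumption; the rest of the article uses scikit-learn) applied to a hypothetical PIL image:
from torchvision import transforms
# Label-preserving transformations for natural images (not for text images)
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
# augmented = augment(pil_image)  # applied on the fly during each training epoch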
# Addressing Class Imbalance
Class imbalance emerges when one class significantly outnumbers others in training data. A fraud detection dataset might contain 99.5% legitimate transactions and only 0.5% fraudulent ones. Standard training procedures optimize for majority class performance, effectively ignoring minority classes.
Metric selection determines whether imbalance is properly measured. Accuracy misleads when classes are imbalanced: predicting all negatives achieves 99.5% accuracy in the fraud example while catching zero fraud cases. Precision measures positive prediction accuracy, while recall captures the fraction of actual positives identified. The F1 score balances both through their harmonic mean. The area under the receiver operating characteristic curve (AUC-ROC) evaluates performance across all classification thresholds, providing a threshold-independent assessment of model quality. For heavily imbalanced datasets, precision-recall (PR) curves and the area under the precision-recall curve (AUC-PR) often provide clearer insights than ROC curves, which can appear overly optimistic because the large number of true negatives dominates the calculation.
from sklearn.metrics import classification_report, roc_auc_score
# Assuming a fitted model and a held-out test set X_test, y_test
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC-ROC: {auc:.3f}")
Resampling strategies modify training distributions. Random oversampling duplicates minority examples, though this risks overfitting to repeated instances. Synthetic Minority Over-sampling Technique (SMOTE) generates synthetic examples by interpolating between existing minority samples. Adaptive Synthetic (ADASYN) sampling focuses synthesis on difficult-to-learn regions. Random undersampling discards majority examples but loses potentially valuable information, working best when the majority class contains redundant examples. Combined approaches that oversample minorities while undersampling majorities often work best in practice.
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
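A combined strategy can be sketched with SMOTETomek from imblearn, which oversamples the minority class with SMOTE and then removes Tomek links from the majority class (assuming the same X_train and y_train):
from imblearn.combine import SMOTETomek
# SMOTE oversampling followed by Tomek-link cleaning of the majority class
combo = SMOTETomek(random_state=42)
X_combined, y_combined = combo.fit_resample(X_train, y_train)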
Class weight adjustments modify the loss function. Most scikit-learn classifiers accept a class_weight parameter that penalizes minority class misclassifications more heavily. Setting class_weight='balanced' automatically computes weights inversely proportional to class frequencies. This approach keeps the original data intact while adjusting the learning process itself. Manual weight setting allows fine-grained control aligned with business costs: if missing a fraudulent transaction costs the business 100 times more than falsely flagging a legitimate one, setting weights to reflect this asymmetry optimizes for the actual objective rather than balanced accuracy.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)
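For the cost-sensitive case described above, a manual weight dictionary is one way to encode the asymmetry; the 1:100 ratio here is hypothetical and should come from the actual business costs.
from sklearn.linear_model import LogisticRegression
# Hypothetical costs: a missed fraud case (class 1) is ~100x worse than a false alarm
model = LogisticRegression(class_weight={0: 1, 1: 100})
model.fit(X_train, y_train)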
Specialized ensemble methods handle imbalance internally. BalancedRandomForest undersamples the majority class for each tree, while EasyEnsemble creates balanced subsets through iterative undersampling. These approaches combine ensemble variance reduction with imbalance correction, often outperforming manual resampling followed by standard algorithms. RUSBoost combines random undersampling with boosting, focusing subsequent learners on misclassified minority instances, which can be particularly effective when the minority class exhibits complex patterns.
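These estimators live in imblearn's ensemble module; a minimal sketch with the balanced random forest (assuming the same training split) looks like this:
from imblearn.ensemble import BalancedRandomForestClassifier
# Each tree sees a bootstrap sample with the majority class undersampled
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(X_train, y_train)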
Decision threshold tuning optimizes for business objectives. The default 0.5 probability threshold rarely aligns with real-world costs. When false negatives cost far more than false positives, lowering the threshold increases recall at the expense of precision. Precision-recall curves guide threshold selection. Cost-sensitive learning incorporates explicit cost matrices into threshold selection, choosing the threshold that minimizes expected cost given the business's specific cost structure. The optimal threshold often differs dramatically from 0.5. In medical diagnosis, where missing a serious condition is catastrophic, thresholds as low as 0.1 or 0.2 might be appropriate.
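As a hedged sketch of threshold selection (reusing the fitted model and test split from earlier, with a hypothetical requirement of at least 90% recall), the precision-recall curve can be scanned for the highest threshold that still meets the recall target:
from sklearn.metrics import precision_recall_curve
probs = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probs)
# Keep thresholds whose recall is at least 0.90, then pick the most conservative one
candidates = thresholds[recall[:-1] >= 0.90]
threshold = candidates.max() if candidates.size else 0.5
predictions = (probs >= threshold).astype(int)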
Targeted data collection addresses root causes. While algorithmic interventions help, gathering more minority class examples provides the most direct solution. Active learning identifies informative samples to label. Collaboration with domain experts can surface previously overlooked data sources, addressing fundamental data collection bias rather than working around it algorithmically. Sometimes imbalance reflects legitimate rarity, but often it stems from collection bias. Majority cases are easier or cheaper to gather, and addressing this through deliberate minority class collection can fundamentally resolve the problem.
Anomaly detection reframes extreme imbalance. When the minority class represents less than 1% of data, treating the problem as outlier detection rather than classification often performs better. One-class Support Vector Machines (SVM), isolation forests, and autoencoders excel at identifying rare patterns. These unsupervised or semi-supervised approaches sidestep the classification framework entirely. Isolation forests work particularly well because they exploit the fundamental property of anomalies — they are easier to isolate through random partitioning since they differ from normal patterns in multiple dimensions.
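A minimal isolation forest sketch follows; the contamination value is a hypothetical estimate of the anomaly rate, roughly the known fraud rate in this example.
from sklearn.ensemble import IsolationForest
# contamination: assumed fraction of anomalies in the data
iso = IsolationForest(contamination=0.005, random_state=42)
iso.fit(X_train)
labels = iso.predict(X_test)  # +1 for inliers, -1 for flagged anomalies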
# Resolving Feature Scaling Issues
Feature scaling ensures that all input features contribute appropriately to model training. Without scaling, features with larger numeric ranges can dominate distance calculations and gradient updates, distorting learning.
Algorithm selection determines scaling necessity. Distance-based methods like K-Nearest Neighbors (KNN) and SVM require scaling because they measure similarity using Euclidean distance or similar metrics. Tree-based models remain invariant to monotonic transformations and do not require scaling. Linear regression benefits from scaling for numerical stability and coefficient interpretability. Neural networks also need scaling, not because of distance calculations but because gradient descent struggles when features live on different scales: large-scale features produce large gradients that can cause instability or require very small learning rates, dramatically slowing convergence.
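To see the effect concretely, here is a quick comparison of KNN with and without scaling (assuming the same X and y as before); on features with very different ranges, the gap is usually substantial.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
knn_raw = KNeighborsClassifier()
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier())
print(f"KNN without scaling: {cross_val_score(knn_raw, X, y, cv=5).mean():.3f}")
print(f"KNN with scaling: {cross_val_score(knn_scaled, X, y, cv=5).mean():.3f}")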
Scaling method selection depends on data distribution. StandardScaler (z-score normalization) transforms features to have zero mean and unit variance. Formally, for a feature \( x \):
\[
z = \frac{x - \mu}{\sigma}
\]
where \( \mu \) is the mean and \( \sigma \) is the standard deviation. This works well for approximately normal distributions. MinMaxScaler rescales features to a fixed range (typically 0 to 1), preserving zero values and working well when distributions have hard boundaries. RobustScaler uses the median and interquartile range (IQR), remaining stable when outliers exist. MaxAbsScaler divides by the maximum absolute value, scaling to the range of -1 to 1 while preserving sparsity, which is ideal for sparse data.
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# StandardScaler: (x - mean) / std
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
# MinMaxScaler: (x - min) / (max - min)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)
# RobustScaler: (x - median) / IQR
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_train)
Proper train-test separation prevents data leakage. Scalers must be fit only on training data, then applied to both training and test sets. Fitting on the entire dataset allows information from test data to influence the transformation, artificially inflating performance estimates. Fitting only on training data simulates production conditions, where future data arrives without known statistics. The same principle extends to CV: each fold should fit its scaler on its training portion and apply it to its validation portion.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit and transform
X_test_scaled = scaler.transform(X_test) # Transform only
Categorical encoding requires special handling. One-hot encoded features already exist on a consistent 0-1 scale and should not be scaled. Ordinal encoded features may or may not benefit from scaling depending on whether their numeric encoding reflects meaningful intervals. The best practice is to separate numeric and categorical features in preprocessing pipelines. ColumnTransformer facilitates this separation, allowing different transformations for different feature types.
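A minimal ColumnTransformer sketch is shown below, assuming X_train is a pandas DataFrame; the column names are hypothetical placeholders to be replaced with the actual dataset's columns.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
numeric_features = ['age', 'income']   # hypothetical numeric columns
categorical_features = ['country']     # hypothetical categorical column
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])
X_train_prepared = preprocessor.fit_transform(X_train)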
Sparse data presents unique challenges. Scaling sparse matrices can destroy sparsity by making zero values non-zero, dramatically increasing memory requirements. MaxAbsScaler preserves sparsity. In some cases, skipping scaling entirely for sparse data proves optimal, particularly when using tree-based models. Consider a document-term matrix where most entries are zero; StandardScaler would subtract the mean from each feature, turning zeros into negative numbers and destroying the sparsity that makes text processing feasible.
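A small synthetic illustration of sparsity preservation (the random sparse matrix here stands in for a real document-term matrix):
from scipy.sparse import random as sparse_random
from sklearn.preprocessing import MaxAbsScaler
# A CSR matrix with ~1% non-zero entries, standing in for a document-term matrix
X_sparse = sparse_random(1000, 5000, density=0.01, format='csr', random_state=42)
X_scaled = MaxAbsScaler().fit_transform(X_sparse)
print(X_scaled.nnz == X_sparse.nnz)  # True: zeros stay zero, so sparsity is preserved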
Pipeline integration ensures reproducibility. The Pipeline class chains preprocessing and model training, ensuring all transformations are tracked and applied consistently during deployment. Pipelines also integrate seamlessly with CV and grid search, ensuring that all hyperparameter combinations receive proper preprocessing. The saved pipeline object contains everything needed to process new data identically to training data, reducing deployment errors.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
Target variable scaling requires inverse transformation. When predicting continuous values, scaling the target variable can improve training stability. However, predictions must be inverse transformed to return to the original scale for interpretation and evaluation. This is particularly important for neural networks where large target values can cause gradient explosion, or when using activation functions like sigmoid that output bounded ranges.
from sklearn.preprocessing import StandardScaler
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1))
# ... train the model on X_train, y_train_scaled ...
# After prediction, map results back to the original scale
predictions_scaled = model.predict(X_test)
predictions_original = y_scaler.inverse_transform(
    predictions_scaled.reshape(-1, 1))
# Conclusion
Overfitting, class imbalance, and feature scaling represent fundamental challenges in machine learning practice. Success requires understanding when each problem appears, recognizing its symptoms, and applying appropriate interventions. Cross-validation detects overfitting before deployment. Thoughtful metric selection and resampling address imbalance. Proper scaling ensures features contribute appropriately to learning. These techniques, applied systematically, transform problematic models into reliable production systems that deliver genuine business value. The practitioner's notebook should contain not just the techniques themselves but the diagnostic approaches that reveal when each intervention is needed, enabling principled decision-making rather than trial-and-error experimentation.
Rachel Kuznetsov has a Master's in Business Analytics and thrives on tackling complex data puzzles and searching for fresh challenges to take on. She's committed to making intricate data science concepts easier to understand and is exploring the various ways AI makes an impact on our lives. On her continuous quest to learn and grow, she documents her journey so others can learn alongside her. You can find her on LinkedIn.