5 Excel AI Lessons I Learned the Hard Way


 

Introduction

 
For many organizations, particularly those in regulated industries or with limited technical infrastructure, Excel and its XLMiner add-in serve as the primary platform for predictive modeling and machine learning workflows.

Yet Excel's accessibility masks a critical gap: the difference between running models and building trustworthy analytics systems. Working on a loan approval prediction project, I discovered that Excel-based machine learning fails not because of algorithmic limitations, but because of foundational practices that are frequently overlooked.

This article distills those hard-won experiences into five practical frameworks that will elevate your Excel-based machine learning work.

 

Lesson 1: Multiple Methods for Outlier Detection

 
Outlier handling is more art than science, and premature removal can eliminate legitimate extreme values that carry important information. In one case, all residential asset values flagged by a simple IQR rule, roughly everything above the 95th percentile, were removed under the assumption that they were errors. Later analysis revealed the removal of legitimate ultra-high-value properties, a relevant segment for large loan approvals.

The lesson: Use multiple detection methods and manual review before removal. Create a comprehensive outlier detection framework.

In a new sheet adjacent to the main data, create detection columns:

  • Column A: Original Value (residential_assets_value)
  • Column B: IQR Method
    =IF(A2 > QUARTILE.INC($A$2:$A$4270,3) + 1.5*(QUARTILE.INC($A$2:$A$4270,3)-QUARTILE.INC($A$2:$A$4270,1)), "Outlier_IQR", "Normal")
  • Column C: 3-Sigma Method
    =IF(ABS(A2-AVERAGE($A$2:$A$4270)) > 3*STDEV($A$2:$A$4270), "Outlier_3SD", "Normal")
  • Column D: Percentile Method
    =IF(A2 > PERCENTILE.INC($A$2:$A$4270,0.99), "Outlier_P99", "Normal")
  • Column E: Combined Flag
    =IF(COUNTIF(B2:D2,"Outlier*")>=2, "INVESTIGATE", "OK")
  • Column F: Manual Review [Notes after investigating]
  • Column G: Final Decision [Keep/Remove/Transform]

This multi-method approach revealed patterns in my loan data:

  • Values flagged by all three methods (IQR, 3-sigma, and percentile): Likely errors
  • Values flagged by IQR but not 3-sigma: Legitimate high values in skewed distributions
  • Values flagged only by percentile: The extreme but valid cases I almost lost

The "Manual Review" column is crucial. For each flagged observation, document findings such as: "Legitimate luxury property, verified against public records" or "Likely data entry error, value exceeds market maximum by 10x."

 

Lesson 2: Always Set Random Seeds

 
Few experiences are more frustrating than presenting excellent model results, then being unable to reproduce those exact numbers when preparing the final report. This scenario occurred with a classification tree model: one day's validation accuracy was 97.3%, but the next day's was 96.8%. The difference seems small, but it undermines credibility, leaving the audience wondering which number is real and how much the analysis can be trusted.

The lesson: the culprit is random partitioning without a fixed seed. Most machine learning workflows involve randomness at some stage:

  • Data partitioning: Which observations go into training vs. validation vs. test sets
  • Neural networks: Initial weight randomization
  • Some ensemble methods: Random feature selection

XLMiner uses random processes for partitioning the data. Running the same model twice with identical parameters yields slightly different results because the training/validation split differs each time.

The solution is simple but non-obvious. When using XLMiner's partition functionality (found in most model dialogs):

  1. Check the box labeled "Set seed" (it's unchecked by default)
  2. Enter a specific integer: 12345, 42, 2024, or any memorable number
  3. Document this seed value in the Model Log

Now, every time the model is run with this seed:

  • Identical training/validation/test splits
  • Identical model performance metrics
  • Identical predictions for the same observations
  • Perfect reproducibility

Here is an example from the loan approval dataset without a seed (three runs of the same logistic regression):

  • Run 1: Validation Accuracy = 92.4%, F1 = 0.917
  • Run 2: Validation Accuracy = 91.8%, F1 = 0.923
  • Run 3: Validation Accuracy = 92.1%, F1 = 0.919

And here it is with seed=12345 (three runs of the same logistic regression):

  • Run 1: Validation Accuracy = 92.1%, F1 = 0.928
  • Run 2: Validation Accuracy = 92.1%, F1 = 0.928
  • Run 3: Validation Accuracy = 92.1%, F1 = 0.928

The difference matters enormously for credibility. When asked to recreate the analysis, you can do so with confidence, knowing the numbers will match.

Important caveat: The seed controls randomness in partitioning and initialization, but it doesn't make the analysis immune to other changes. If data is modified (adding observations, changing transformations) or model parameters are adjusted, results will still differ, as they should.

 

Lesson 3: Proper Data Partitioning: The Three-Way Split

 
Related to reproducibility is partition strategy. XLMiner's default settings create a 60/40 training/validation split. This seems reasonable until the question arises: where's the test set?

A common mistake involves building a neural network, tuning it based on validation performance, then reporting those validation metrics as final results.

The lesson: Without a separate test set, the optimization occurs directly on the data being reported, inflating performance estimates. The correct partitioning strategy uses three sets.

1. Training Set (50% of Data)

  • Where the model learns patterns
  • Used to fit parameters, coefficients, or weights
  • For the loan dataset: ~2,135 observations

2. Validation Set (30% of Data)

  • For model selection and hyperparameter tuning
  • Used to compare different models or configurations
  • Helps select the best pruned tree, optimal cutoff value, or ideal neural network architecture
  • For the loan dataset: ~1,280 observations

3. Test Set (20% of Data)

  • "Final exam" - only score once
  • Used ONLY after all modeling decisions are complete
  • Provides unbiased estimate of real-world performance
  • For the loan dataset: ~854 observations

Critical rule: Never iterate on test set performance. The moment a model is selected because "it performs better on the test set," that test set becomes a second validation set, and performance estimates become biased.

This is my workflow now:

  1. Set seed to 12345
  2. Partition 50/30/20 (training/validation/test)
  3. Build multiple model variants, evaluating each on validation set only
  4. Select the best model based on validation performance and business requirements
  5. Score the test set exactly once with the chosen model
  6. Report test set performance as the expected real-world outcome
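If the split ever needs to be reproduced with formulas instead of the XLMiner partition dialog, a frozen random column does the job. A minimal sketch, assuming column H is filled with =RAND() and then pasted as values (so the assignments never change), with this label formula in column I:

=IF(RANK.EQ($H2,$H$2:$H$4270)<=0.5*COUNT($H$2:$H$4270),"Training",
    IF(RANK.EQ($H2,$H$2:$H$4270)<=0.8*COUNT($H$2:$H$4270),"Validation","Test"))

Rows whose rank falls in the first 50% go to training, the next 30% to validation, and the remaining 20% to test. Because the random numbers are frozen, the assignment is as reproducible as a seeded partition.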

Here is an example from the loan approval project:

| Model Version                  | Training Acc | Validation Acc | Test Acc   | Selected? |
|--------------------------------|--------------|----------------|------------|-----------|
| Logistic Regression (all vars) | 90.6%        | 89.2%          | Not scored | No        |
| Logistic Regression (stepwise) | 91.2%        | 92.1%          | Not scored | No        |
| Classification Tree (depth=7)  | 98.5%        | 97.3%          | Not scored | Yes       |
| Classification Tree (depth=5)  | 96.8%        | 96.9%          | Not scored | No        |
| Neural Network (7 nodes)       | 92.3%        | 90.1%          | Not scored | No        |

 

After selecting the Classification Tree (depth=7) based on validation performance, the test set was scored exactly once: 97.4% accuracy. This test accuracy represents the expected production performance.

 

Lesson 4: The Training/Validation Gap: Spotting Overfitting Before It Hurts

 
The problem: at first glance, the classification tree results from the project report seem promising.

Training Data Performance:

  • Accuracy: 98.45%
  • Precision: 99%
  • Recall: 96%
  • F1 Score: 98.7%

The model appeared successful until the focus shifted to the validation results.

Validation Data Performance:

  • Accuracy: 97.27%
  • Precision: 98%
  • Recall: 94%
  • F1 Score: 97.3%

The gap seemed small, just 1.18 percentage points of accuracy. But determining whether it constituted a problem required a systematic framework.

The lesson: it's critical to recognize when a model is memorizing the training data instead of learning generalizable patterns.

The practical solution: Create an overfitting monitor. Build a simple but systematic comparison sheet that makes overfitting obvious.

Step 1: Create the Comparison Framework

Here is a model performance comparison in the "Overfitting_Monitor" sheet:

| Metric      | Training | Validation | Gap   | Gap % | Status  |
|-------------|----------|------------|-------|-------|---------|
| Accuracy    | 98.45%   | 97.27%     | 1.18% | 1.20% | ✓ Good  |
| Precision   | 99.00%   | 98.00%     | 1.00% | 1.01% | ✓ Good  |
| Recall      | 96.27%   | 94.40%     | 1.87% | 1.94% | ✓ Good  |
| F1 Score    | 98.76%   | 97.27%     | 1.49% | 1.51% | ✓ Good  |
| Specificity | 96.56%   | 92.74%     | 3.82% | 4.06% | ? Watch |

 

And here are the interpretation rules:

  • Gap < 3%: ✅ Good - Model generalizing well
  • Gap 3-5%: ❓ Watch - Acceptable but monitor closely
  • Gap 5-10%: ⚠️ Concerning - Possible overfit, consider simplification
  • Gap > 10%: ❌ Problem - Definite overfit, must address

And this is the detailed analysis:

  • Overall Assessment: GOOD
  • Reasoning: All major metrics within 2% gap. Specificity gap slightly higher but still acceptable. Model appears to generalize well.
  • Recommendation: Proceed with test set evaluation.

Step 2: Add the Calculation Formulas

Cell: Gap (for Accuracy)
=[@Training] - [@Validation]

Cell: Gap % (for Accuracy)
=([@Training] - [@Validation]) / [@Training]

Cell: Status (for Accuracy)

=IF([@[Gap %]]<0.03, "✓ Good",
    IF([@[Gap %]]<0.05, "? Watch",
       IF([@[Gap %]]<0.10, "⚠ Concerning", "✗ Problem")))

 

Step 3: Create a Visual Overfitting Chart

Build a side-by-side bar chart comparing training vs. validation for each metric. This makes patterns instantly visible:

 
 

When bars are close, the model generalizes well. When training bars are much longer than validation bars, there is overfitting.

Comparing Across Different Models

The real value comes from comparing overfitting patterns across the candidate models. Here is the comparison from the "Model_Overfitting_Comparison" sheet:

| Model                     | Train Acc | Val Acc | Gap    | Overfitting Risk        |
|---------------------------|-----------|---------|--------|-------------------------|
| Logistic Regression       | 91.2%     | 92.1%   | -0.9%  | Low (negative gap)      |
| Classification Tree       | 98.5%     | 97.3%   | 1.2%   | Low                     |
| Neural Network (5 nodes)  | 90.7%     | 89.8%   | 0.9%   | Low                     |
| Neural Network (10 nodes) | 95.1%     | 88.2%   | 6.9%   | High – Reject this      |
| Neural Network (14 nodes) | 99.3%     | 85.4%   | 13.9%  | Very High – Reject this |
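The Gap and Overfitting Risk columns don't need to be filled in by hand. A minimal sketch, assuming the comparison is set up as an Excel Table with columns named Train Acc and Val Acc, and reusing the cutoffs from the interpretation rules above, applied here to the raw training/validation gap:

Cell: Gap
=[@[Train Acc]] - [@[Val Acc]]

Cell: Overfitting Risk
=IF([@Gap]<0.03, "Low",
    IF([@Gap]<0.05, "Moderate",
       IF([@Gap]<0.10, "High", "Very High")))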

 

Interpretation: The neural network with 10+ nodes is clearly overfitting. Despite high training accuracy (99.3%), validation accuracy drops to 85.4%. The model memorized training data patterns that don't generalize.

Best choice: Classification tree

  • High performance (97.3% validation)
  • Minimal overfitting (1.2% gap)
  • Good balance of complexity and generalization

Here are some simple ways to reduce overfitting when spotted:

  • For Classification Trees: Reduce maximum depth or increase minimum samples per node
  • For Neural Networks: Reduce number of nodes or layers
  • For Logistic Regression: Remove variables or use stepwise selection
  • For All Models: Add more training data if possible

 

Lesson 5: Implement Data Validation for Categorical Variables

 
Data entry errors are silent killers of machine learning projects. A single typo, such as "gradute" instead of "graduate", creates a third category in what should be a binary variable. The model now has an unexpected feature value it's never seen during training, potentially causing errors during deployment or, worse, silently producing incorrect predictions.

The prevention: Excel's data validation feature. Here's the implementation protocol for categorical variables:

In a hidden sheet (name it "Validation_Lists"), create lists of valid values:

  • For education: Create a column with entries "Graduate" and "Not Graduate"
  • For self-employment: Create a column with entries "Yes" and "No"
  • For loan status: Create a column with entries "Approved" and "Rejected"

In the data entry sheet:

  • Select the entire column for a categorical variable (e.g., column containing education data)
  • Data → Data Validation → Settings tab
  • Allow: List
  • Source: Navigate to the hidden validation sheet and select the appropriate list
  • Error Alert tab: Style = Stop, with a clear message: "Only 'Graduate' or 'Not Graduate' are valid entries"

Now it's impossible to enter invalid values. Users see a dropdown with valid choices, eliminating typographical errors entirely.
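Data validation only guards new entries; anything typed before the rule was applied slips through. An audit formula catches those existing problems. A minimal sketch, assuming the education values sit in column C of the data sheet and the valid list occupies Validation_Lists!A2:A3:

=SUMPRODUCT(--(COUNTIF(Validation_Lists!$A$2:$A$3, $C$2:$C$4270)=0))

The formula counts how many existing entries match none of the valid values; any result above zero points to rows that need cleanup before modeling.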

For numeric variables with known ranges, apply similar validation to prevent impossible values:

  • Credit scores: Must be between 300 and 900
  • Loan term: Must be between 1 and 30 years
  • Annual income: Must be greater than 0

Select the column, apply data validation, set:

  • Allow: Whole Number (or Decimal)
  • Data: between
  • Minimum: 300 (for credit scores)
  • Maximum: 900
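The same audit idea works for the numeric rules. A sketch, again assuming credit scores live in column D:

=COUNTIF($D$2:$D$4270,"<300") + COUNTIF($D$2:$D$4270,">900")

Any nonzero result flags scores outside the allowed 300 to 900 range that were entered before validation was switched on.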

 

Final Thoughts

 
Here is a summary of the lessons outlined in the article.

 


 

The five practices in this article — multi-method outlier detection, setting random seeds, three-way data partitioning, monitoring the training-validation gap, and implementing data validation — share a common thread: they're all simple to implement but devastating to omit.

None of these practices require advanced statistical knowledge or complex programming. They don't demand additional software or expensive tools. With these safeguards in place, Excel with XLMiner is a genuinely powerful platform for accessible machine learning.
 
 

Rachel Kuznetsov has a Master's in Business Analytics and thrives on tackling complex data puzzles and searching for fresh challenges to take on. She's committed to making intricate data science concepts easier to understand and is exploring the various ways AI makes an impact on our lives. On her continuous quest to learn and grow, she documents her journey so others can learn alongside her. You can find her on LinkedIn.

