The “Robust” Data Scientist: Winning with Messy Data and Pingouin
This article uncovers the craftsmanship of using robust statistics in data science processes: illustrating what to do when data fails the tests for standard statistical assumptions.

# Introduction
A harsh truth to begin with: textbook data science often falls apart in the real world. Concepts and techniques are taught on finely curated, beautifully bell-shaped variables, but as soon as we venture into the wild of real projects, we are hit with outliers, heavily skewed distributions, and unruly variances.
A previous article on building an exploratory data analysis (EDA) pipeline with Pingouin showed how to detect, through tests, cases where the data violates common assumptions like normality and homoscedasticity. But what happens when those tests fail? Throwing the data away isn't the solution: going robust is.
This article uncovers the craftsmanship of using robust statistics in data science processes. These are mathematical methods particularly built to yield reliable and valid results even when the data does not meet classical assumptions or is pervaded by outliers and noise. By adopting a "choose your own adventure" approach, we will create a trio of scenarios using Python's Pingouin to manage the ugliest aspects within the data you may encounter in your daily work.
# Initial Setup
Let's start by installing (if needed) and importing Pingouin and Pandas, after which we will load the wine quality dataset from the URL below.
!pip install pingouin pandas
import pandas as pd
import pingouin as pg
# Loading our messy, real-world-like dataset, containing red and white wine samples
url = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/wine-quality-white-and-red.csv"
df = pd.read_csv(url)
# Take a small peek at what we are about to deal with
df.head()
If you looked at the previous Pingouin article, you already know this is a notoriously messy dataset that failed to meet several common assumptions. Now we will embark on three different "adventures", each highlighting a scenario, a core problem, and a proposed robust fix to address it.
# Adventure 1: When the Normality Test Fails
Suppose we run normality tests on two groups: white wine samples and red wine samples.
white_wine_alcohol = df[df['type'] == 'white']['alcohol']
red_wine_alcohol = df[df['type'] == 'red']['alcohol']
print("Normality test for White Wine Alcohol content:")
print(pg.normality(white_wine_alcohol))
print("\nNormality test for Red Wine Alcohol content:")
print(pg.normality(red_wine_alcohol))
You will find that neither distribution is normal, with extremely low p-values. Although non-normality itself doesn't directly signal outliers or skewness, a strong deviation from normality often suggests such characteristics may be present in the data. Comparing means through a t-test in this situation would be dangerous and likely to yield unreliable results.
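Before reaching for a robust test, it can help to see why normality fails. A minimal sketch, assuming the same dataset loaded above, that inspects each group's shape with pandas' built-in skewness and excess-kurtosis methods:

```python
import pandas as pd

url = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/wine-quality-white-and-red.csv"
df = pd.read_csv(url)

# Skewness near 0 and excess kurtosis near 0 would suggest a roughly normal shape;
# large values hint at the asymmetry and heavy tails behind the failed test
for wine_type in ['white', 'red']:
    alcohol = df[df['type'] == wine_type]['alcohol']
    print(f"{wine_type}: skew={alcohol.skew():.2f}, kurtosis={alcohol.kurtosis():.2f}")
```

For this dataset, both groups should come out with positive skew, consistent with the rejected normality tests.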
The robust fix for a scenario like this is the Mann-Whitney U test. Instead of comparing averages, this test compares ranks in the data, sorting all wines in a group from lowest to highest alcohol content, for instance. This rank-based approach is what neutralizes outliers: an extreme value ends up as just the highest rank, no matter how far out it lies. Here's how:
# Running the robust Mann-Whitney U test on the two groups defined above
mwu_results = pg.mwu(x=red_wine_alcohol, y=white_wine_alcohol)
print(mwu_results)
Output:
         U-val alternative     p-val       RBC      CLES
MWU  3829043.5   two-sided  0.181845 -0.022193  0.488903
Since the p-value is not below 0.05, the test finds no statistically significant difference in alcohol content between the two wine types, and because the test operates on ranks, that conclusion is far more resistant to outliers and skewness than a t-test's would be.
# Adventure 2: When the Paired T-Test Fails
Say you now want to compare two measurements taken from the same subject — e.g. a patient's sugar level before and after a drug prototype, or two properties measured in the same bottle of wine. The focus here is on how the differences between paired measurements are distributed. When such differences are not normally distributed, a standard paired t-test will yield unreliable confidence intervals.
The ideal fix in this scenario is the Wilcoxon Signed-Rank Test: the robust sibling of the paired t-test, which works by taking the per-sample differences between the two columns and ranking their absolute values. In Pingouin, this test is called using pg.wilcoxon(), passing in the two columns containing the paired measures within the same subject — e.g. two types of wine acidity.
# Run the robust Wilcoxon signed-rank test for paired data
wilcoxon_results = pg.wilcoxon(x=df['fixed acidity'], y=df['volatile acidity'])
print(wilcoxon_results)
Result:
          W-val alternative  p-val  RBC  CLES
Wilcoxon    0.0   two-sided    0.0  1.0   1.0
The result above shows a statistically significant difference with a rank-biserial correlation (RBC) of 1.0, a "perfect separation" in which the difference between the two measurements has the same sign in every sample. Not only are the two wine properties different, but they also operate at entirely different magnitude tiers across the dataset.
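An RBC of 1.0 implies that every paired difference shares the same sign, and that claim is easy to verify directly:

```python
import pandas as pd

url = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/wine-quality-white-and-red.csv"
df = pd.read_csv(url)

# RBC = 1.0 means fixed acidity should exceed volatile acidity in every row
always_greater = (df['fixed acidity'] > df['volatile acidity']).all()
print(f"Fixed acidity exceeds volatile acidity in every sample: {always_greater}")
```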
# Adventure 3: When ANOVA Fails
In this third and final adventure, we want to check whether residual sugar levels in wine differ significantly across distinct quality ratings — note that the latter range between 3 and 9, taking integer values, and can therefore be treated as discrete categories.
If Pingouin's Levene test of homoscedasticity fails dramatically — for instance, because sugar variance in mediocre wines is huge but very small in top-quality wines — a classical one-way ANOVA may produce misleading results, as this test assumes equal variances among groups.
The fix is Welch's ANOVA, which drops the equal-variance assumption: it weights each group by the inverse of its variance and adjusts the degrees of freedom accordingly, making comparisons fair even when spreads differ wildly across categories. Here is how to run this robust alternative to traditional ANOVA using Pingouin:
# Run Welch's ANOVA to compare sugar across quality ratings
welch_results = pg.welch_anova(data=df, dv='residual sugar', between='quality')
print(welch_results)
Result:
    Source  ddof1      ddof2          F         p-unc       np2
0  quality      6  54.507934  10.918282  5.937951e-08  0.008353
Even where a one-way ANOVA might have struggled due to unequal variances, Welch's ANOVA delivers a solid conclusion. The very small p-value is clear evidence that residual sugar levels differ significantly across wine quality ratings. Bear in mind, however, that sugar is only a small piece of the puzzle influencing wine quality, a point underscored by the low partial eta-squared (np2) of 0.008.
# Wrapping Up
Through three example scenarios, each pairing a messy-data problem with a robust statistical strategy, we have learned that being a skilled data scientist doesn't mean having perfect data or tuning it perfectly — it means knowing what to do when the data gets difficult for different reasons. Pingouin's functions implement a variety of robust tests that help escape the failed-assumptions trap and extract mathematically sound insights with little extra effort.
Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.