The Statistical Ghost in Your AI
Simpson's Paradox and Why Your Model Is Probably Lying
Imagine a new drug is being tested in a clinical trial. The results are unambiguous:
In men, Drug A outperforms Drug B.
In women, Drug A outperforms Drug B.
The logical conclusion? Drug A is better overall. Right?
Wrong.
The overall analysis shows that Drug B has the highest success rate across all patients. How is that possible? Here are the concrete numbers:
Men (milder cases):
Drug A: 10 patients → 8 cured = 80%
Drug B: 90 patients → 63 cured = 70%
Women (more severe cases):
Drug A: 90 patients → 27 cured = 30%
Drug B: 10 patients → 2 cured = 20%
Total:
Drug A: 100 patients → 35 cured = 35% ← worse!
Drug B: 100 patients → 65 cured = 65% ← better?

A beats B in men (80% vs. 70%) and in women (30% vs. 20%) — but B wins the overall comparison by a landslide (65% vs. 35%).
The reason: Drug A was administered predominantly to women, who had more severe cases and were therefore harder to cure. Drug B was given mainly to men with milder conditions. Disease severity — correlated with the gender of the test group — is the hidden confounder that upends the aggregate statistics.
No calculation error, no fraud, no manipulated data. Just a mathematical phenomenon that has been known for over a century and yet continues to cause fatal misjudgements: Simpson’s Paradox.
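The arithmetic can be checked in a few lines of Python; the numbers are exactly those from the tables above:

```python
# (cured, treated) per subgroup, exactly as in the tables above
trial = {
    "A": {"men": (8, 10), "women": (27, 90)},
    "B": {"men": (63, 90), "women": (2, 10)},
}

for drug, groups in trial.items():
    cured = sum(c for c, _ in groups.values())
    treated = sum(n for _, n in groups.values())
    per_group = {g: c / n for g, (c, n) in groups.items()}
    print(drug, per_group, f"overall={cured / treated:.0%}")
```

Running this prints A winning in both subgroups (80% and 30% vs. 70% and 20%) while losing the overall comparison 35% to 65%.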
What Is Simpson’s Paradox?
Simpson’s Paradox — named after British statistician Edward H. Simpson, who formally described it in 1951, although Karl Pearson and Udny Yule had made similar observations decades earlier — describes a statistical phenomenon in which a trend or correlation visible across several subgroups of a dataset disappears or even reverses once those groups are merged into a single aggregate.
The core of the problem is easy to describe, but hard to internalise: Aggregates don’t lie — but they conceal. They conceal the structure of the data, the group weights, and the hidden variables that change everything.
Formally, the paradox can be expressed as follows:
If it holds that:
A₁/B₁ > C₁/D₁ (Group 1: success rate of X > success rate of Y)
A₂/B₂ > C₂/D₂ (Group 2: success rate of X > success rate of Y)

it is nevertheless possible that:

(A₁+A₂)/(B₁+B₂) < (C₁+C₂)/(D₁+D₂)

Meaning: X beats Y in every subgroup — but Y beats X in the overall comparison. How? Through differing group sizes that distort the weighting: X and Y draw their cases from the subgroups in very different proportions, so each aggregate is dominated by a different subgroup, regardless of which trend prevails within the subgroups.
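The condition can be tested mechanically. Below is a minimal sketch, with names of my own choosing, that takes per-group (successes, trials) counts for two options and reports whether they exhibit a full reversal:

```python
def simpson_reversal(x_groups, y_groups):
    """Return True if X beats Y in every subgroup but loses in aggregate.

    x_groups, y_groups: lists of (successes, trials) pairs, index-aligned
    so that position i in both lists refers to the same subgroup.
    """
    x_wins_each_group = all(
        xs / xt > ys / yt
        for (xs, xt), (ys, yt) in zip(x_groups, y_groups)
    )
    x_total = sum(s for s, _ in x_groups) / sum(t for _, t in x_groups)
    y_total = sum(s for s, _ in y_groups) / sum(t for _, t in y_groups)
    return x_wins_each_group and x_total < y_total

# The drug-trial numbers from the introduction:
print(simpson_reversal([(8, 10), (27, 90)], [(63, 90), (2, 10)]))  # True
```

A check like this is cheap to run over any pair of arms and any candidate stratification before trusting an aggregate comparison.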
The Berkeley Discrimination Case (1973)
This is probably the most well-known real-world example of Simpson’s Paradox — and it could have had serious legal consequences.
In 1973, the University of California, Berkeley, was accused of discriminating against women in its admissions process. The numbers seemed unambiguous:
Applications and admissions (Fall 1973):
Men: 8,442 applicants → 44% admitted
Women: 4,321 applicants → 35% admitted

A nine-percentage-point gap — this looked like systematic discrimination. But when statisticians Peter Bickel, Eugene Hammel, and William O’Connell analysed the data at the departmental level, they found something astonishing: in most individual departments, women were admitted at equal or higher rates than men.
How was that possible?
The key lay in the choice of subject. Women were disproportionately represented in departments with very high admission hurdles — such as law or medicine — that accepted few applicants regardless of gender. Men, by contrast, applied more frequently to departments with considerably higher acceptance rates.
Simplified illustration:
Department A (high barrier, 10% acceptance rate):
Women: 400 applicants → 40 admitted (10%)
Men: 100 applicants → 10 admitted (10%)
Department B (low barrier, 50% acceptance rate):
Women: 100 applicants → 50 admitted (50%)
Men: 400 applicants → 200 admitted (50%)
Overall:
Women: 500 applicants → 90 admitted (18%)
Men: 500 applicants → 210 admitted (42%)

No discrimination at the departmental level — but a massive gap in the aggregate statistic. The hidden variable was subject choice, acting as a confounder that produced the apparent trend.
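The simplified illustration can be reproduced directly; the per-department rates are identical for both genders, yet the aggregates diverge:

```python
# Simplified Berkeley illustration: (admitted, applicants) per department.
applications = {
    "women": {"dept_A": (40, 400), "dept_B": (50, 100)},
    "men":   {"dept_A": (10, 100), "dept_B": (200, 400)},
}

for gender, depts in applications.items():
    admitted = sum(a for a, _ in depts.values())
    applied = sum(n for _, n in depts.values())
    by_dept = {d: a / n for d, (a, n) in depts.items()}
    print(f"{gender}: {by_dept} overall={admitted / applied:.0%}")
```

Both genders face a 10% rate in department A and a 50% rate in department B; only the application mix produces the 18% vs. 42% aggregate gap.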
The study by Bickel et al. was published in Science in 1975 and remains to this day a textbook example of Simpson’s Paradox in the real world.
Kidney Stone Treatment — The Medical Textbook Case
In 1986, C.R. Charig and colleagues published a study in the British Medical Journal that became one of the most cited medical examples of Simpson’s Paradox.
Two methods for treating kidney stones were compared:
Treatment A: Open surgical removal (invasive)
Treatment B: Percutaneous nephrolithotomy (minimally invasive)
The overall results:
Overall success rates:
Treatment A: 273/350 = 78%
Treatment B: 289/350 = 83%

Treatment B seemed clearly superior. But broken down by stone size:
Small kidney stones (< 2 cm):
Treatment A: 81/87 = 93%
Treatment B: 234/270 = 87%
Large kidney stones (≥ 2 cm):
Treatment A: 192/263 = 73%
Treatment B: 55/80 = 69%

In both subgroups, Treatment A is superior. Yet the aggregate statistic declares Treatment B the winner.
The reason: Treatment A was disproportionately used in severe cases (large stones) — precisely those with inherently lower chances of success. Treatment B was applied more frequently to milder cases. The unequal distribution of cases across treatment groups completely corrupted the overall comparison.
The confounder here was stone size, which influenced both the choice of treatment and the likelihood of success.
The Mechanism: Why Does This Happen?
Every instance of Simpson’s Paradox shares the same underlying structure. There are always three elements:
1. The outcome variable — what is being measured (admission rate, hospitalisation, treatment success).
2. The grouping variable — the characteristic used for comparison (gender, vaccination status, treatment method).
3. The confounder — a hidden variable that influences both group membership and the outcome variable (subject choice, age, stone size).
Mathematically, the paradox arises because merging groups implicitly performs a reweighting. Each subgroup receives a weight in the aggregate proportional to its size, and when groups differ greatly in size, the largest groups dominate the overall result, regardless of internal trends.
This can also be framed as a problem of weighted averages:
If p₁ = success rate in group 1 (weight w₁)
and p₂ = success rate in group 2 (weight w₂)
then the overall average is:
p_total = (w₁ · p₁ + w₂ · p₂) / (w₁ + w₂)
The paradox occurs when the weights w₁ and w₂ differ substantially between the categories being compared.

Simply averaging the group rates without accounting for group sizes is mathematically not the same as the true overall average. This confusion is the most common error in practice.
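Instantiated with the kidney-stone numbers from the previous section, the formula makes the mechanism visible: each treatment's pooled rate is its stratum rates weighted by where its own patients landed.

```python
# Weighted-average view of the kidney-stone numbers: the pooled rate of each
# treatment is its per-stratum rates, weighted by how many of THAT
# treatment's patients fell into each stratum.
def pooled(rates_and_weights):
    """rates_and_weights: list of (success rate, patients) per stratum."""
    return (sum(p * w for p, w in rates_and_weights)
            / sum(w for _, w in rates_and_weights))

# strata: (small stones, large stones)
treatment_a = [(81 / 87, 87), (234 / 270, 0)]  # placeholder, replaced below
treatment_a = [(81 / 87, 87), (192 / 263, 263)]
treatment_b = [(234 / 270, 270), (55 / 80, 80)]

print(f"A pooled: {pooled(treatment_a):.0%}")  # 273/350 = 78%
print(f"B pooled: {pooled(treatment_b):.0%}")  # 289/350 = 83%
```

Treatment A puts 263 of its 350 patients in the hard stratum, Treatment B only 80: most of A's weight sits on the low-success-rate group, which drags its pooled rate below B's despite A winning inside both strata.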
For most of the 20th century, Simpson’s Paradox was treated as a statistical curiosity, a classroom puzzle to humble overconfident students.
Today, it is something far more serious.
The Paradox Meets the Algorithm
We live in an age of machine learning. Algorithms now decide who gets a loan, who gets a job interview, whether a defendant is released on bail, and which patients are flagged for urgent medical care. These systems are trained on data — vast oceans of it. And the uncomfortable truth is that more data does not protect you from Simpson’s Paradox. In some cases, it makes the problem harder to detect.
The reason is structural. Most machine learning models are, at their core, pattern-recognition machines. They find correlations in the data and use them to make predictions. But correlations are slippery. They can shift, invert, or vanish entirely depending on how you slice the dataset. A model trained on aggregated data will learn the aggregate pattern — and that pattern may be the opposite of what is true in every individual subgroup that matters.
Worse, the black-box nature of modern deep learning models means the paradox can be completely invisible. There is no output that says “warning: your training data contains a confounding variable.” The model simply produces a number, and that number feels authoritative.
A Criminal Justice Algorithm and a Statistical Ghost
The most consequential recent example comes from the American criminal justice system. COMPAS — Correctional Offender Management Profiling for Alternative Sanctions — is a commercial algorithm used in courtrooms across the United States to predict the likelihood that a defendant will re-offend. Its scores have influenced bail decisions, sentencing, and parole hearings.
In 2016, the investigative newsroom ProPublica published a detailed analysis of COMPAS data from Broward County, Florida. Their finding was stark: Black defendants were nearly twice as likely as white defendants to be incorrectly flagged as high risk for future crime, while white defendants who did go on to re-offend were more often incorrectly labeled low risk.
Northpointe, the company behind COMPAS, pushed back. Their own analysis showed that the algorithm was equally accurate across racial groups — that is, its overall prediction accuracy was roughly the same for Black and white defendants.
Here is the unsettling part: both analyses were correct.
They were measuring different things. ProPublica was examining false-positive rates within each racial group. Northpointe was looking at overall calibration across the full population. In a population where the base rates of re-offending differ between groups — due to decades of structural inequalities in policing and incarceration — it is mathematically impossible to satisfy both definitions of fairness simultaneously, unless the base rates are identical or the predictions are perfect. The aggregate picture and the subgroup picture told opposite stories. Simpson’s Paradox is embedded in the justice system.
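The incompatibility can be checked with a toy calculation (the numbers here are illustrative, not the COMPAS data): hold calibration (PPV) and the miss rate fixed across two groups with different base rates, and the false-positive rates are forced apart.

```python
def false_positive_rate(n, base_rate, ppv, fnr):
    """Derive the FPR implied by group size, base rate, PPV, and FNR."""
    positives = n * base_rate
    true_positives = positives * (1 - fnr)
    # From PPV = TP / (TP + FP), solve for the false positives:
    false_positives = true_positives * (1 - ppv) / ppv
    negatives = n - positives
    return false_positives / negatives

# Same PPV (0.6) and same miss rate (0.2) in both groups,
# but different base rates of re-offending:
fpr_group1 = false_positive_rate(1000, base_rate=0.5, ppv=0.6, fnr=0.2)
fpr_group2 = false_positive_rate(1000, base_rate=0.3, ppv=0.6, fnr=0.2)
print(fpr_group1, fpr_group2)
```

With these inputs the first group's false-positive rate works out to roughly 53% and the second's to roughly 23%, even though the score is equally well calibrated for both. Equalising the FPRs would instead break the calibration.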
The Vaccine That Looked Dangerous
In 2021, data from England circulated on social media that anti-vaccination advocates cited as proof of the supposed ineffectiveness of COVID-19 vaccines. The figures came from official reports by the UK Health Security Agency and were entirely accurate — but their interpretation was a textbook case of Simpson’s Paradox.
The data showed that among severely ill and deceased COVID patients, there were more vaccinated than unvaccinated individuals. Shocking at first glance. Could the vaccine actually be failing to protect?
The decisive confounder was age.
Simplified illustration:
Age group 60+ (high risk of severe illness):
Vaccinated: 900 of 1,000 people (90% vaccination rate)
Hospitalisation rate vaccinated: 1% → 9 hospitalisations
Hospitalisation rate unvaccinated: 10% → 10 hospitalisations
Age group under 60 (low risk):
Vaccinated: 100 of 1,000 people (10% vaccination rate)
Hospitalisation rate vaccinated: 0.1% → 0.1 hospitalisations
Hospitalisation rate unvaccinated: 0.5% → 4.5 hospitalisations
Total hospitalisations:
Vaccinated: ~9.1
Unvaccinated: ~14.5

Because the vaccination rate in the high-risk group was so high, vaccinated patients accounted for a large share of those hospitalised: in the over-60 group, nearly as many vaccinated as unvaccinated people ended up in hospital, even though the vaccine sharply reduced the risk in every age bracket. At the even higher real-world vaccination rates among the elderly, the vaccinated can outnumber the unvaccinated in absolute terms.
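The illustration's arithmetic, written out (the rates are the hypothetical ones from the text, not real UKHSA figures):

```python
# Hypothetical age-stratified numbers from the illustration above:
# population, vaccination rate, hospitalisation risk if (un)vaccinated.
strata = {
    "60+":      dict(n=1000, vax_rate=0.9, risk_vax=0.01,  risk_unvax=0.10),
    "under 60": dict(n=1000, vax_rate=0.1, risk_vax=0.001, risk_unvax=0.005),
}

hosp_vax = hosp_unvax = 0.0
for s in strata.values():
    vaccinated = s["n"] * s["vax_rate"]
    unvaccinated = s["n"] - vaccinated
    hosp_vax += vaccinated * s["risk_vax"]
    hosp_unvax += unvaccinated * s["risk_unvax"]

print(hosp_vax, hosp_unvax)  # ~9.1 vs ~14.5
```

The per-person risk is lower for the vaccinated in both strata, yet in the over-60 stratum the 900 vaccinated contribute almost as many hospitalisations (9) as the 100 unvaccinated (10), purely because there are nine times as many of them.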
The paradox arose from two compounding factors: the older population had both a higher vaccination rate and a higher inherent risk of severe illness. Those who looked only at the aggregate figures, without accounting for the age structure, reached the wrong conclusion.
Why AI Is Especially Vulnerable
For anyone working with data, machine learning, or AI systems, Simpson’s Paradox is no academic curiosity — it is an everyday hazard.
Bias in trained models
When a classification model is trained on aggregated data without accounting for the underlying group structure, it can learn patterns that appear valid at the overall level but are wrong for every subgroup. A credit scoring model might, for instance, deliver correct predictions in aggregate while systematically misclassifying every demographic subgroup.
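The same reversal afflicts continuous features, not just rates. A model fitted to the pooled synthetic data below would learn a positive slope even though the relationship is negative inside every subgroup (the data points are toy values constructed for the demonstration):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two synthetic subgroups: the trend is negative inside each...
group1 = ([1, 2, 3], [10, 9, 8])
group2 = ([11, 12, 13], [20, 19, 18])
print(pearson(*group1), pearson(*group2))  # ≈ -1 within each group

# ...but pooling the groups produces a strongly positive correlation:
print(pearson(group1[0] + group2[0], group1[1] + group2[1]))
```

A regression trained on the pooled points would confidently report the wrong sign for every individual in the dataset, which is exactly the failure mode described above.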
A/B tests and product decisions
In product development, A/B tests are evaluated daily. Anyone looking only at the overall conversion rate, without segmenting by user group, risks declaring the worse variant the winner. Many tech companies have learned this lesson the hard way.
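A sketch of how this looks in an A/B readout, with hypothetical segment numbers chosen so the reversal appears:

```python
# Hypothetical A/B test: (conversions, visitors) per device segment.
# Variant B wins the aggregate only because it was shown to far more
# mobile users, where conversion is easier; A wins inside both segments.
variants = {
    "A": {"desktop": (90, 1000), "mobile": (10, 50)},
    "B": {"desktop": (8, 100), "mobile": (171, 950)},
}

for name, segments in variants.items():
    conversions = sum(c for c, _ in segments.values())
    visitors = sum(v for _, v in segments.values())
    rates = {seg: c / v for seg, (c, v) in segments.items()}
    print(name, rates, f"overall={conversions / visitors:.1%}")
```

A converts at 9% on desktop and 20% on mobile against B's 8% and 18%, yet B's overall rate (about 17%) crushes A's (about 9.5%). Segmenting by traffic source before declaring a winner is the defence.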
Fairness and algorithmic discrimination
Perhaps the most explosive application: algorithms that demonstrate fairness at the aggregate level can simultaneously discriminate against every individual demographic subgroup. This is not a hypothetical problem — it is an active field of research in algorithmic fairness, and Simpson’s Paradox explains why aggregate fairness metrics alone are never sufficient.
Evaluating language models
The phenomenon also arises in the benchmarking of Large Language Models (LLMs). A model can outperform a competitor on an aggregate benchmark while being worse in every single task category — if it has processed a disproportionately large number of easy test cases.
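A sketch with hypothetical benchmark scores: reporting a macro average over task categories alongside the raw micro average exposes the imbalance.

```python
# Hypothetical benchmark: (correct, total) per task category. Model X loses
# every category but wins the headline number because its test set is
# dominated by the easy category.
results = {
    "X": {"easy": (880, 1000), "hard": (20, 100)},
    "Y": {"easy": (90, 100), "hard": (250, 1000)},
}

for model, cats in results.items():
    # micro average: pool all items, so big categories dominate
    micro = sum(c for c, _ in cats.values()) / sum(t for _, t in cats.values())
    # macro average: one vote per category, regardless of size
    macro = sum(c / t for c, t in cats.values()) / len(cats)
    print(model, f"micro={micro:.1%}", f"macro={macro:.1%}")
```

Y beats X in both categories (90% vs. 88% on easy, 25% vs. 20% on hard), but X's micro average is about 82% against Y's 31%. The macro average, which weights categories equally, ranks the models correctly here.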
How to Recognise and Avoid Simpson’s Paradox
The good news: Simpson’s Paradox is avoidable — if you ask the right questions.
1. Always segment
Never look only at the aggregate statistic. Always ask: Does this trend hold within the relevant subgroups? Identify potential confounding variables before conducting an analysis.
2. Actively search for confounders
A confounding variable has two properties: it correlates with the grouping variable and with the outcome variable. In the Berkeley example, subject choice correlated with both gender and admission probability. This double correlation is the warning signal.
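On the simplified Berkeley numbers, that double correlation can be checked directly: the candidate confounder (department) should be associated with both the grouping variable (gender) and the outcome (admission).

```python
# rows: (gender, department, admitted, applicants) from the simplified example
rows = [
    ("women", "A", 40, 400), ("women", "B", 50, 100),
    ("men",   "A", 10, 100), ("men",   "B", 200, 400),
]

# Association with the grouping variable: the application mix differs by gender.
share_dept_a = {
    g: sum(n for g2, d, _, n in rows if g2 == g and d == "A")
       / sum(n for g2, _, _, n in rows if g2 == g)
    for g in ("women", "men")
}

# Association with the outcome: admission rates differ by department.
rate_by_dept = {
    d: sum(a for _, d2, a, _ in rows if d2 == d)
       / sum(n for _, d2, _, n in rows if d2 == d)
    for d in ("A", "B")
}

print(share_dept_a)  # women apply mostly to A, men mostly to B
print(rate_by_dept)  # A admits 10%, B admits 50%
```

Both associations are strong (80% vs. 20% of applications going to department A; 10% vs. 50% admission rates), which is precisely the warning signal described above.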
3. Question aggregated metrics
Whenever someone presents you with a single figure — an overall success rate, an average conversion rate, a combined score — always ask: What do the subgroups look like? How were the groups weighted?
4. Use causal models
The deepest solution lies in the shift from purely correlational statistics to causal inference. Techniques such as Directed Acyclic Graphs (DAGs), developed by Judea Pearl, help explicitly model confounding structures and enable controlled comparisons. Pearl himself addressed Simpson’s Paradox extensively in his book The Book of Why (2018), arguing that the paradox is ultimately a problem of missing causal structure — not a problem of statistics alone.
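Pearl's back-door adjustment makes this concrete. Applied to the kidney-stone data, it averages each treatment's stratum rates with the same weights, namely the overall distribution of stone size, instead of each treatment's own skewed case mix. This is a minimal sketch under that adjustment formula, not code from the original study:

```python
# (successes, patients) by stone size for each treatment (Charig et al. counts)
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

# Distribution of the confounder (stone size), pooled over both arms.
n_total = sum(n for arm in data.values() for _, n in arm.values())
p_z = {
    z: sum(arm[z][1] for arm in data.values()) / n_total
    for z in ("small", "large")
}

def adjusted_rate(arm):
    # Back-door adjustment: sum over z of P(success | treatment, z) * P(z)
    return sum((arm[z][0] / arm[z][1]) * p_z[z] for z in p_z)

for treatment, arm in data.items():
    print(treatment, f"adjusted success rate: {adjusted_rate(arm):.1%}")
```

With both treatments standardised to the same stone-size distribution, Treatment A's adjusted rate (about 83%) comes out above Treatment B's (about 78%), reversing the raw pooled comparison and agreeing with the stratified verdict.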
5. Be sceptical of suspiciously clean results
If an analysis seems too neat — if the data shows exactly what one side wants to prove — that is a warning sign. Simpson’s Paradox appears most frequently in politically or economically charged debates, precisely because the temptation to present aggregate figures without context is greatest there.
What This Means for Anyone Building with Data
Simpson’s Paradox is not a bug that will be patched in the next software release. It is a fundamental feature of aggregated data, and it will be with us as long as we make decisions based on numbers. But awareness is a powerful first line of defence.
For anyone building or evaluating AI systems, three habits matter most. First, always stratify your results — break down every metric by the subgroups that matter for your application before drawing conclusions from aggregate figures. Second, be suspicious of any model evaluation that does not report per-group performance alongside overall accuracy. A single accuracy score for a system that affects different populations differently is not just incomplete; it can be actively misleading. Third, and most fundamentally, ask causal questions before statistical ones. What is the underlying structure of this problem? What hidden variables might be at play?
Edward Simpson described his paradox in a dry academic paper in 1951. He probably could not have imagined that, seventy-five years later, his statistical ghost would be haunting courtrooms, pandemic dashboards, and hiring pipelines — silently inverting the truth inside systems trusted to make decisions at scale.
The data never lies. But it does not always tell the whole truth either. And in the age of AI, the difference between those two things has never mattered more.

