January 31, 2026

AI Bias in Action: The Real-World COMPAS Algorithm Case Study


You hear about AI bias all the time. It's this vague, futuristic worry. But what does it actually look like when it touches real lives? It looks like courtrooms, job applications, and hospital wards. It looks like the COMPAS algorithm.

This isn't a story about a glitch. It's a story about how bias gets baked into the very foundation of a system, often with the best of intentions, and then gets scaled to affect millions. If you want to understand AI bias, you need to start here.

The COMPAS Case Study: A System Under Scrutiny

COMPAS stands for Correctional Offender Management Profiling for Alternative Sanctions. Developed by Northpointe (now Equivant), it was used by judges and parole officers across the United States to assess a defendant's likelihood of committing a future crime. The idea was noble: use data to make sentencing and parole decisions more objective, less prone to a judge's personal whims.

Here's how it worked. The system asked a series of 137 questions. Some were straightforward (criminal history). Others were... murkier. Questions about your family, your friends, your neighborhood, your mindset. This data was fed into a proprietary algorithm that spat out a risk score from 1 to 10.

The Crucial Flaw: Judges started relying on these scores. A high score could mean a longer sentence or denial of parole. The algorithm's judgment became a key piece of someone's fate.

Then, in 2016, investigative journalists at ProPublica published a bombshell analysis. They looked at over 10,000 criminal defendants in Broward County, Florida, and compared COMPAS's predictions to what actually happened over two years.

The findings were stark, and they cut along racial lines.

| Group | Key Finding from ProPublica's Analysis | What This Means |
| --- | --- | --- |
| Black defendants | Almost twice as likely as white defendants to be labeled higher risk yet not actually re-offend | False positives: the system was wrongfully flagging them as future dangers |
| White defendants | More likely than Black defendants to be labeled lower risk yet go on to commit another crime | False negatives: the system was giving them an undeserved pass |

Think about that for a second. The error wasn't random. It was systematically skewed. A Black person was more likely to be unfairly punished by the algorithm's prediction. A white person was more likely to be unfairly advantaged.

One specific example from the report: A 19-year-old Black woman was arrested for grand theft (stealing a kid's bicycle). She had no prior convictions. COMPAS gave her a high risk score. Meanwhile, a 41-year-old white man arrested for shoplifting had a history of armed robbery and drug charges. COMPAS gave him a low risk score.

The company defended its tool, arguing the scores were calibrated: a given risk score corresponded to roughly the same re-offense rate regardless of race. But that's the thing about bias: it often depends on what you're measuring and what you value. Is the goal to be "accurate" on paper, or is it to be fair in its consequences? COMPAS highlighted this tension perfectly.
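In fact, when the share of people who actually re-offend differs between groups, an imperfect score generally can't satisfy both definitions of fairness at once. Here's a minimal sketch with made-up confusion-matrix counts (not ProPublica's numbers) showing how both camps could look at the same tool and reach opposite conclusions:

```python
# Toy counts (hypothetical, not the real COMPAS data): two groups with different
# base rates of re-offense, scored by the same tool.

def rates(tp, fp, fn, tn):
    ppv = tp / (tp + fp)  # precision: of those flagged high risk, how many re-offended
    fpr = fp / (fp + tn)  # false positive rate: of those who did NOT re-offend, how many were flagged
    return ppv, fpr

ppv_a, fpr_a = rates(tp=40, fp=20, fn=20, tn=20)  # Group A: higher base rate in the data
ppv_b, fpr_b = rates(tp=20, fp=10, fn=10, tn=60)  # Group B: lower base rate

print(f"Group A: precision {ppv_a:.2f}, false positive rate {fpr_a:.2f}")  # 0.67 and 0.50
print(f"Group B: precision {ppv_b:.2f}, false positive rate {fpr_b:.2f}")  # 0.67 and 0.14
```

By the vendor's preferred yardstick (precision) the tool treats both groups identically; by ProPublica's yardstick (false positive rate) it punishes one group far more. Both readings are arithmetically true at the same time.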

How Bias Sneaks Into AI: It's the Data, Not the Code

Most people assume bias is in the algorithm—some sinister line of code. In my experience, that's almost never the case. The bias is in the fuel: the data.

COMPAS's bias likely came from a few interconnected sources:

  • Historical Data Echoes: The system was trained on historical arrest and conviction data. But policing isn't neutral. Decades of over-policing in minority neighborhoods mean more arrests there, regardless of actual crime rates. The algorithm learned that "being from neighborhood X" was correlated with "being arrested," mistaking a symptom of societal bias for a predictor of criminality.

You see this everywhere. An AI hiring tool trained on a company's past hires will learn to prefer candidates from the same universities and backgrounds as the (often homogenous) existing team. It's not inventing bias; it's automating the status quo.

  • Proxy Variables: Even if you remove "race" from the data, the algorithm finds proxies. Zip code, income level, frequency of moving, even the type of car you drive: these can all act as stand-ins for race or socioeconomic status. COMPAS's questions about neighborhood and stability were likely riddled with these proxies (a short sketch below shows how to test for this).
  • Labeling Bias: What counts as "recidivism" (re-offending)? Is it being arrested again? Convicted? The label itself can be biased. If one group is more likely to be arrested for the same behavior, then using "arrest" as the label for the AI to learn from bakes in that policing bias from the start.

This is the subtle mistake many tech teams make. They focus obsessively on improving the algorithm's accuracy metric while treating the training data as a given, neutral truth. It's not. The data is a mirror, and it's reflecting all our historical and social imperfections.
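To make the proxy point concrete, here's a minimal sketch on synthetic data (every column name and number below is invented): drop the sensitive attribute, then check whether the features you kept can predict it anyway.

```python
# Hypothetical proxy check: can the "removed" attribute be reconstructed from the
# remaining features? A simple classifier with AUC well above 0.5 says yes.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 5_000
race = rng.integers(0, 2, n)                               # the attribute we "removed"
zip_group = np.where(rng.random(n) < 0.8, race, 1 - race)  # strongly segregated neighborhoods
income_k = rng.normal(50 + 10 * race, 15, n)               # weaker correlation
prior_arrests = rng.poisson(1.5, n)                        # unrelated in this toy setup

X = pd.DataFrame({"zip_group": zip_group,
                  "income_k": income_k,
                  "prior_arrests": prior_arrests})

auc = cross_val_score(LogisticRegression(max_iter=1000), X, race,
                      cv=5, scoring="roc_auc").mean()
print(f"AUC predicting the removed attribute from the kept features: {auc:.2f}")
# Near 0.5 would mean little leakage; well above 0.5 means proxies are doing the work.
```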

Beyond Justice: Bias in Hiring, Healthcare, and Finance

COMPAS is the poster child, but the story repeats. Once you know what to look for, you see the pattern.

Hiring Algorithms That Filter Out Women

In 2018, Reuters reported that Amazon had scrapped an internal AI recruiting tool because it showed bias against women. The system was trained on resumes submitted to Amazon over a 10-year period, which were predominantly from men (a reflection of the tech industry's gender gap). The AI learned to downgrade resumes that included words like "women's" (as in "women's chess club captain") and penalized graduates of all-women's colleges. It wasn't programmed to be sexist. It learned that "successful Amazon employee patterns" in the data looked male.
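Here's a toy illustration of that mechanism (six fake one-line resumes, entirely synthetic, nothing to do with Amazon's actual system): train a simple text classifier on historically skewed hiring outcomes and watch the word itself acquire a negative weight.

```python
# Toy resume screener: the training labels, not the code, carry the bias.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

resumes = [
    "software engineer python java mens rugby club",
    "software engineer python chess club",
    "software engineer python womens chess club captain",
    "data analyst sql python womens soccer team",
    "backend developer java kubernetes",
    "frontend developer javascript react womens coding society",
]
hired = [1, 1, 0, 0, 1, 0]  # skewed historical outcomes from a male-dominated record

vec = CountVectorizer()
X = vec.fit_transform(resumes)
clf = LogisticRegression().fit(X, hired)

weights = dict(zip(vec.get_feature_names_out(), clf.coef_[0]))
print("weight for 'womens':", round(weights["womens"], 2))  # negative: the token itself is penalized
```

The toy model never sees a gender field; it simply rediscovers the skew baked into its labels.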

Healthcare Algorithms That Prioritize White Patients

A 2019 study published in Science found widespread racial bias in a healthcare algorithm used on over 200 million people in the US. The algorithm was designed to identify patients with complex health needs who would benefit from extra care programs. It used past healthcare costs as a proxy for health needs.

Here's the catch: Black patients often have less access to care and lower healthcare spending for the same level of need. The algorithm, by equating "cost" with "need," systematically underestimated the sickness of Black patients. White patients were prioritized for extra care, while equally sick Black patients were overlooked. The bias was hidden in the choice of proxy.
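A quick simulation of that failure mode (all numbers invented; this is not the study's actual model): give both groups the same distribution of true clinical need, deflate one group's spending by 30%, then "prioritize the costliest 10%."

```python
# Hypothetical cost-as-proxy simulation: identical need, unequal access, unequal flagging.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
group = rng.integers(0, 2, n)              # 0 = white, 1 = Black (synthetic labels)
need = rng.gamma(2.0, 2.0, n)              # same true-need distribution for both groups
access = np.where(group == 1, 0.7, 1.0)    # 30% less spending for the same level of need
cost = need * access * rng.lognormal(0.0, 0.2, n)

# Flag the costliest 10% for extra care (the proxy the real-world program effectively relied on).
flagged = cost >= np.quantile(cost, 0.90)

for g, name in [(0, "white"), (1, "Black")]:
    print(f"{name} patients flagged for extra care: {flagged[group == g].mean():.1%}")
# Same need, lower spending, fewer flags: the bias lives entirely in the choice of label.
```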

Credit Scoring That Redlines Digitally

Fintech companies promise to use "alternative data" (like your social network, shopping habits, or phone usage) to give credit to the "unbanked." But these alternative data points can be minefields of proxies. Where you shop, the type of phone you use, even how you fill out a digital form can correlate with race or neighborhood. Without extreme care, these systems can simply recreate the old, discriminatory practice of redlining in a shiny new algorithm.

The common thread? High stakes, historical data, and a failure to interrogate what that data really represents.

What Can Actually Be Done About Algorithmic Bias?

Fixing this isn't about finding a magic technical bullet. It's about process, auditing, and shifting mindset.

The Non-Consensus View: Many think the solution is just "more diverse data." That's necessary, but insufficient. You can have perfectly balanced demographic data and still have a biased system if the outcomes you're predicting (like "arrest" or "hired") are themselves the result of past discrimination. The harder work is rethinking the prediction task itself.

Here’s a more practical approach:

  1. Bias Auditing is Non-Negotiable. Before deployment, you must rigorously test the model's performance across different subgroups (race, gender, age, etc.). Don't just look at overall accuracy; look at false positive/negative rates, as the COMPAS case showed. Tools like IBM's AI Fairness 360 or Google's What-If Tool can help (a bare-bones version of the idea is sketched just after this list).
  2. Question Your Proxies and Labels. Is "arrest" a fair label? Is "healthcare cost" a good proxy for need? Often, you need to work with sociologists, ethicists, and domain experts—not just data scientists—to answer this.
  3. Build Diverse Teams. Homogeneous engineering teams are more likely to miss these blind spots. Diversity isn't just an HR metric; it's a critical component of risk mitigation for AI systems.
  4. Plan for Human Oversight. AI should be a decision-support tool, not a decision-maker in high-stakes areas like justice or medicine. Judges, doctors, and hiring managers must understand the tool's limitations and retain final accountability.
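To make step 1 tangible, here's a bare-bones audit sketch in plain pandas (the column names are placeholders; libraries like IBM's AI Fairness 360 compute these and many more metrics for you). The point is simply that every metric gets reported per subgroup, not just overall.

```python
# Minimal subgroup audit: error rates and flag rates broken out by group.
import pandas as pd

def audit_by_group(df: pd.DataFrame, group_col: str, y_true: str, y_pred: str) -> pd.DataFrame:
    rows = {}
    for group, g in df.groupby(group_col):
        tp = int(((g[y_pred] == 1) & (g[y_true] == 1)).sum())
        fp = int(((g[y_pred] == 1) & (g[y_true] == 0)).sum())
        fn = int(((g[y_pred] == 0) & (g[y_true] == 1)).sum())
        tn = int(((g[y_pred] == 0) & (g[y_true] == 0)).sum())
        rows[group] = {
            "n": len(g),
            "accuracy": (tp + tn) / len(g),
            "false_positive_rate": fp / (fp + tn) if (fp + tn) else float("nan"),
            "false_negative_rate": fn / (fn + tp) if (fn + tp) else float("nan"),
            "flag_rate": (tp + fp) / len(g),
        }
    return pd.DataFrame(rows).T

# Hypothetical usage, assuming a results table with these columns:
# audit_by_group(results, group_col="race", y_true="reoffended", y_pred="high_risk_flag")
```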

It's messy, ongoing work. There's no "bias-free" AI, just like there's no bias-free human. The goal is to be aware, to measure, and to mitigate.

Your Questions on AI Bias, Answered

How can I tell if an AI system I'm using is biased?

Look for disparities in outcomes across different demographic groups. For an AI hiring tool, check if it recommends candidates from one gender or ethnicity significantly more often than others, assuming equal qualifications. For a credit scoring algorithm, audit approval rates by zip code or income bracket. The clearest red flag is when the system's error rates are not uniform—like COMPAS having higher false positives for Black defendants. Ask the vendor for their bias audit reports. If they can't provide transparency on their training data and testing methodology, that's a major risk.
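For the hiring example, one widely used rule of thumb (illustrative only, not legal advice) is the "four-fifths rule" from US employment guidelines: if any group's selection rate falls below 80% of the highest group's rate, dig deeper. A minimal check with made-up numbers:

```python
# Hypothetical selection-rate check for a hiring tool's recommendations.
import pandas as pd

def selection_rate_ratio(df: pd.DataFrame, group_col: str, selected_col: str):
    rates = df.groupby(group_col)[selected_col].mean()
    return rates, rates.min() / rates.max()

df = pd.DataFrame({
    "gender":   ["M"] * 100 + ["F"] * 100,
    "selected": [1] * 40 + [0] * 60 + [1] * 25 + [0] * 75,
})
rates, ratio = selection_rate_ratio(df, "gender", "selected")
print(rates)                  # F: 0.25, M: 0.40
print(f"ratio: {ratio:.2f}")  # 0.62, below the 0.8 threshold: worth investigating
```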

Who is legally responsible when a biased AI system causes harm?

This is a legal gray area, but liability is increasingly falling on the organizations that deploy the AI, not just the developers. A court system relying on a biased risk score, or a company using a biased hiring tool, could face legal challenges and discrimination claims. The key is 'duty of care.' If you deploy an AI that makes high-stakes decisions, you are expected to have conducted reasonable due diligence to test for bias. Simply saying "the algorithm did it" is not a defense. The European Union's AI Act explicitly places obligations on deployers to monitor for risks. The trend is toward shared responsibility across the supply chain.

Can't we just remove demographic data to fix AI bias?

This is a common but flawed solution. Removing explicit fields like "race" or "gender" is called "fairness through blindness." It often fails because proxies for these attributes remain in the data. An algorithm can infer race from zip code, shopping patterns, or even name frequency. A more effective approach is to actively measure bias using demographic data you hold securely for testing purposes, then adjust the model to minimize disparities. The goal isn't to ignore sensitive attributes but to ensure the model doesn't use them to create unfair outcomes. Sometimes, you might need to include them in a controlled way to correct for historical bias.
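To make that last point concrete, here's one post-processing idea sketched on synthetic scores (invented numbers, purely illustrative, and not the only or necessarily the right mitigation): keep the sensitive attribute available at evaluation time and choose per-group cutoffs so that false positive rates roughly match.

```python
# Hypothetical per-group threshold adjustment to equalize false positive rates.
import numpy as np

def fpr_at(scores, labels, thr):
    negatives = labels == 0
    return float(np.mean(scores[negatives] >= thr))

def pick_threshold(scores, labels, target_fpr):
    # Scan candidate cutoffs and keep the one whose FPR is closest to the target.
    candidates = np.unique(scores)
    fprs = np.array([fpr_at(scores, labels, t) for t in candidates])
    return float(candidates[np.argmin(np.abs(fprs - target_fpr))])

rng = np.random.default_rng(1)
scores_a = np.clip(rng.normal(0.55, 0.2, 2000), 0, 1)  # group A's scores run hot
scores_b = np.clip(rng.normal(0.45, 0.2, 2000), 0, 1)
labels_a = rng.binomial(1, 0.35, 2000)                 # same true re-offense rate
labels_b = rng.binomial(1, 0.35, 2000)

shared = 0.5
print(f"Shared cutoff {shared}: FPR A {fpr_at(scores_a, labels_a, shared):.2f}, "
      f"FPR B {fpr_at(scores_b, labels_b, shared):.2f}")

target = 0.20
thr_a = pick_threshold(scores_a, labels_a, target)
thr_b = pick_threshold(scores_b, labels_b, target)
print(f"Per-group cutoffs {thr_a:.2f} / {thr_b:.2f}: FPR A "
      f"{fpr_at(scores_a, labels_a, thr_a):.2f}, FPR B {fpr_at(scores_b, labels_b, thr_b):.2f}")
```

Whether group-specific thresholds are appropriate, or even legal, depends entirely on the domain; the sketch only shows why you need to measure with the sensitive attribute before you can correct anything.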

The COMPAS story is a cautionary tale, but it's also a roadmap. It shows us where to look, what questions to ask, and why treating AI as an infallible oracle is a dangerous mistake. Real-world AI bias isn't a bug in some distant future system. It's here, it's measurable, and its consequences are profound. Understanding it is the first step toward building something better.