Proving fairness and explaining every no · The Builder's Stack

In November 2019, a tech founder posted that Apple Card had offered him twenty times the credit limit it offered his wife, despite her better credit score. The thread went viral, and within days New York's financial regulator had opened an investigation.

The report landed in 2021 and found no unlawful discrimination. That sounds like the end of the story, but it is the uncomfortable middle, because Goldman Sachs, the bank that issued the card, had spent more than a year unable to explain its own decisions in public. "The algorithm did it" satisfies nobody: not the customer who got the lower limit, not the press, and certainly not a regulator.

If your product shapes decisions about money, work, claims, or coverage, the lesson is direct: have the evidence that your decisions are fair ready before anyone asks for it, and have a specific reason ready to give for every denial.

Disparate treatment and disparate impact

The vocabulary your regulator will use comes from decades of US discrimination law, and the working terms are disparate treatment and disparate impact.

Disparate treatment means a protected trait was used in the decision itself. If sex, race, age, religion, or another protected characteristic feeds the model or the rule, that is treatment, and outside narrow exceptions it is illegal in credit, hiring, housing, and insurance.
Disparate impact means a facially neutral practice lands unequally on a protected group and lacks an adequate business justification. Nothing in the decision mentions the trait, and the outcomes split along it anyway.

AI products rarely commit the first, because no serious team knowingly feeds protected traits into a feature set. They routinely risk the second, because a trained model reproduces every correlation in its data, including the ones that track protected groups.

Deleting the protected column does not remove the bias

A model that was never given sex or race as an input can still produce outputs that vary by sex or race, because other features carry the signal in. Zip code tracks race in most American cities, first names track sex and national origin, the school on a resume tracks class and race, and shopping patterns track more than most teams expect, so a model fitted on enough of them can reassemble the trait you deleted.

Dropping the protected column does not make the model fair. The signal still arrives through other features that stand in for the trait you deleted, and now you can no longer measure whether the outputs split along that line.
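One way to see the leak is to ask how much better you can guess the deleted trait once you know a remaining feature. The sketch below is a minimal illustration, assuming you keep the trait in a separate audit sample that the model never sees; the `proxy_strength` helper, the `zip3` and `group` field names, and the data are all hypothetical.

```python
from collections import Counter, defaultdict


def proxy_strength(records, feature, trait):
    """Return (baseline, with_feature).

    baseline: share of records guessed correctly by always picking the
    most common trait value overall.
    with_feature: share guessed correctly by picking the most common
    trait value within each value of the feature.
    """
    overall = Counter(r[trait] for r in records)
    baseline = overall.most_common(1)[0][1] / len(records)

    by_value = defaultdict(Counter)
    for r in records:
        by_value[r[feature]][r[trait]] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return baseline, correct / len(records)


# Made-up audit sample: two coarse zip prefixes, two groups.
sample = (
    [{"zip3": "100", "group": "A"}] * 8
    + [{"zip3": "100", "group": "B"}] * 2
    + [{"zip3": "200", "group": "A"}] * 2
    + [{"zip3": "200", "group": "B"}] * 8
)

baseline, with_zip = proxy_strength(sample, "zip3", "group")
print(f"baseline={baseline:.2f} with_zip={with_zip:.2f}")
```

A jump from the baseline to the per-feature figure suggests the feature is carrying the trait. This is a rough first check, not a statistical test, and the numbers are chosen only to make the effect visible.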

Measure fairness by checking your eval results group by group

You prove fairness by measuring it, not by claiming it, and the machinery already exists in your eval stack. The case ledger you built in "Cases: build the set that samples reality" gains one more dimension here. Tag each case with the group memberships your regulator cares about, then read approval rates, error rates, and reason codes per group the same way you already read them per segment.
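A per-group read of the ledger can be sketched as follows. This is a minimal illustration, assuming each case records a group tag, the product's decision, and the expected outcome; the `Case` fields, the `rates_by_group` helper, and the sample data are hypothetical, and the group labels stand in for whatever your legal and compliance teams name.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Case:
    group: str            # tag named by legal/compliance (placeholder values here)
    approved: bool        # what the product decided
    should_approve: bool  # expected outcome recorded in the case ledger


def rates_by_group(cases):
    """Approval rate and false-denial rate for each group tag."""
    buckets = defaultdict(list)
    for case in cases:
        buckets[case.group].append(case)

    report = {}
    for group, items in buckets.items():
        qualified = [c for c in items if c.should_approve]
        wrongly_denied = [c for c in qualified if not c.approved]
        report[group] = {
            "n": len(items),
            "approval_rate": sum(c.approved for c in items) / len(items),
            # Undefined when a group has no cases that should be approved.
            "false_denial_rate": (
                len(wrongly_denied) / len(qualified) if qualified else None
            ),
        }
    return report


# Made-up ledger entries for illustration only.
ledger = [
    Case("A", True, True), Case("A", True, True),
    Case("A", True, True), Case("A", False, True),
    Case("B", True, True), Case("B", False, True),
    Case("B", False, False), Case("B", False, False),
]

report = rates_by_group(ledger)
for group, stats in sorted(report.items()):
    print(group, stats)
```

Reading the false-denial rate next to the approval rate matters because two groups can differ in approvals for legitimate reasons while the errors still fall more heavily on one of them.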

Keep the record next to the results, written and dated:

Write down what you tested: which decisions, which slices, and on what population.
Write down what you found, including the runs where the slices came back even.
Write down what you changed, with the retest that confirmed the change held.

Document it even when the news is good, because after a complaint or an exam, the absence of testing reads as indifference. Which groups belong on the list is a legal question that varies by domain, so treat your legal and compliance teams as the authority on it; your job is making the product measurable along whatever lines they name.

The decisions are yours even when the software is rented

Buying the model does not transfer the duty. The leading case is Mobley v. Workday, which tests whether the maker of an AI hiring tool, and not only the employer using it, can be held liable for biased screening. In 2024 a federal court held that Workday could be liable as the employer's agent and let the discrimination claims proceed. The case has escalated since. In 2025 the court conditionally certified a nationwide collective action for age discrimination, covering applicants aged forty and over who were rejected through Workday's screening tools, and in early 2026 it authorized notice to that group and rejected Workday's argument that age-discrimination law protects only employees and not applicants. An earlier EEOC settlement points the same way: in 2023, iTutorGroup paid $365,000 over recruiting software that auto-rejected older applicants.

So run rented models through the same slicing you run on your own. Ask the vendor for their fairness testing, then test anyway on your own population, because the rates that matter are the ones your applicants and customers actually experience.

Design the reasons for every no up front

US credit rules, chiefly the Equal Credit Opportunity Act and its Regulation B, attach a duty to the no itself. An adverse action notice (the formal explanation that follows a denial, a limit cut, or worse terms) must give the specific and accurate reasons for that decision. The CFPB, the federal consumer finance regulator, said in 2022 and again in 2023 that complex models are no excuse and that boilerplate reason codes do not count.

The product consequence is that reason codes get designed before launch, not reverse engineered after a complaint. For every decision the model influences, define the reasons a no can carry, confirm the model's actual decision drivers map onto them, and test that the mapping holds case by case. If the model cannot support a specific, accurate reason for an individual decision, the model is not ready to make that decision, whatever its aggregate accuracy says.
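The mapping check can be sketched as code. This is a minimal illustration, assuming some attribution step already yields a signed contribution per driver for each case, with negative values pushing toward a no; the driver names, the `REASONS` table, and the `reasons_for_denial` helper are hypothetical, and the reason wording is placeholder text.

```python
# Reasons defined before launch, keyed by the model driver they explain.
REASONS = {
    "utilization": "Balances are high relative to credit limits",
    "recent_delinquency": "A payment was reported late in the last 12 months",
    "short_history": "Credit history is shorter than required",
}


def reasons_for_denial(contributions, max_reasons=3):
    """Map the strongest drivers toward denial onto predefined reasons.

    contributions: driver name -> signed contribution to the score.
    Raises LookupError if a top driver has no specific reason defined,
    which is the signal that the model is not ready for this decision.
    """
    against = sorted(
        (item for item in contributions.items() if item[1] < 0),
        key=lambda item: item[1],  # most negative first
    )
    top = [name for name, _ in against[:max_reasons]]
    unmapped = [name for name in top if name not in REASONS]
    if unmapped:
        raise LookupError(f"no specific reason defined for drivers: {unmapped}")
    return [REASONS[name] for name in top]


# Made-up contributions for one denied case.
case = {"utilization": -0.4, "income": 0.2, "short_history": -0.1}
for reason in reasons_for_denial(case):
    print(reason)
```

Running a check like this over every denied case in the eval set is one way to test that the mapping holds case by case: any `LookupError` marks a decision the product cannot yet explain.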

Insurance regulators are headed the same way as credit

Credit got there first, and insurance is close behind. Colorado's 2021 law, SB 21-169, holds insurers accountable for unfair discrimination arising from external data and algorithms, with state regulations following it. If you build in insurance, assume the playbook above (slicing by group, documenting what you tested, and explaining each decision one customer at a time) is arriving on your desk with different statute numbers.

Try it now

Slice one decision. The drill runs on data you already have and takes about fifteen minutes.

Pick one decision your product influences. Choose an outcome where the product's output changes what happens to a person: an approval, a flag, a ranking, an escalation.

Slice the last fifty cases. Pull the most recent fifty cases through that decision and split them by one demographically meaningful proxy you can observe responsibly. Geography is usually available, and coarse regions are enough for a first look.

Compare outcome rates across the slices. Work out the rate of the outcome in each slice and look at the gap. A gap is not proof of discrimination, and an even result is not proof of fairness, but either one tells you what your fairness program has to investigate first. Write the numbers down with the date.
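The comparison in this step can be sketched in a few lines. This is a minimal illustration, assuming each case is a dict with a coarse `region` and a boolean `approved`; those field names, the `outcome_rates` and `gap_record` helpers, and the fifty made-up cases are hypothetical.

```python
from collections import defaultdict
from datetime import date


def outcome_rates(cases, slice_key, outcome_key):
    """Rate of the outcome within each slice."""
    counts = defaultdict(lambda: [0, 0])  # slice -> [outcomes, total]
    for case in cases:
        tally = counts[case[slice_key]]
        tally[0] += bool(case[outcome_key])
        tally[1] += 1
    return {s: hits / total for s, (hits, total) in counts.items()}


def gap_record(rates, run_date):
    """Dated record of the per-slice rates and the largest gap between them."""
    low, high = min(rates.values()), max(rates.values())
    return {
        "date": run_date.isoformat(),
        "rates": rates,
        "gap": round(high - low, 3),
    }


# Fifty made-up cases split across two coarse regions.
cases = (
    [{"region": "north", "approved": True}] * 15
    + [{"region": "north", "approved": False}] * 5
    + [{"region": "south", "approved": True}] * 15
    + [{"region": "south", "approved": False}] * 15
)

rates = outcome_rates(cases, "region", "approved")
# In practice, pass date.today(); a fixed date keeps the example repeatable.
record = gap_record(rates, date(2025, 1, 15))
print(record)
```

With only fifty cases the rates are noisy, so the gap is a pointer to what to investigate first, not a finding.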

Draft the reasons for your most recent no. Take the latest no the product produced and write the three specific reasons you would give the person, concrete enough that they could change something and try again. If everything you write sounds like boilerplate, you have found this chapter's work item.

Chapter Summary

In high-stakes products, you have to measure fairness, document it, and explain each decision, because claiming a product is fair counts for nothing without that evidence.
Your regulator works in two terms: disparate treatment is using a protected trait in the decision, and disparate impact is a neutral practice that lands unequally on a protected group without an adequate business justification.
Most AI products avoid the first and risk the second, because a trained model reproduces the correlations in its data, including ones that track protected groups.
Deleting the protected column does not fix bias. Other features stand in for the trait, and now you cannot measure whether the outputs split along it.
Tag your eval cases by the groups your regulator cares about, then compare approval rates, error rates, and reason codes group by group.
Write down what you tested, what you found, and what you changed, even when the result is even, because later the absence of testing reads as indifference.
Renting the model does not transfer the duty, so run vendor tools through the same group-by-group testing on your own population.
Design the specific reason behind every denial before launch, and if the model cannot support a concrete, accurate reason for an individual decision, it is not ready to make that decision.
Next is "Testing, evidence, and audit trails," which turns the records you produce here into the evidence a regulator will eventually ask to see.

Sources

New York Department of Financial Services (2021). Report on the Apple Card investigation.
Consumer Financial Protection Bureau (2022). Circular 2022-03 on adverse action notices and complex algorithms.
Consumer Financial Protection Bureau (2023). Guidance on adverse action notices when lenders use artificial intelligence.
United States District Court for the Northern District of California (2024-2026). Mobley v. Workday, Inc., No. 3:23-cv-00770-RFL. 2024 ruling that Workday could be liable as an agent; 2025 certification of a nationwide age-discrimination collective action; 2026 authorization of notice and ruling that the age law protects applicants.
U.S. Equal Employment Opportunity Commission (2023). Settlement with iTutorGroup over age discrimination in recruiting software.
Colorado General Assembly (2021). Senate Bill 21-169 on insurers' use of external consumer data and algorithms.
