When You're More Correct Than the Source of Truth: Validate the Validator

The Disagreement

We validated our new pipeline the responsible way: against a source of truth — a reference dataset, numbers already derived from the same hospital price files, already published, already trusted. The plan was standard: run our engine, compare it to the reference, ship when they match. A green check against a trusted oracle is how you earn the right to ship.

For most hospitals, they matched. For some, they did not.

The disciplined instinct at that fork is to assume you are the one who is wrong. The reference is trusted and in production. Your engine is new. When the new thing disagrees with the established thing, the base rate says the new thing has the bug — so the natural move is to figure out what the reference does and reproduce it.

We almost did. What stopped us was a thirty-second habit: before bending the engine to match, check why they disagree.

The Oracle Was Wrong

The reference was buggy. Not at the edges — in ways that silently dropped real data.

It had been generated on a case-insensitive filesystem, where two directory names that differ only in capitalization are treated as the same directory. The hospital files used procedure-code prefixes as directory names, and some of those prefixes differed only in case. They collapsed into one, and one silently won. The result: about 12,945 valid rows dropped across 7 hospitals. No error. No warning. The reference just had fewer rows than the files contained, and presented that as the truth.

That was not the only one. The reference crashed on a particular text encoding — files written in Windows-1252 instead of UTF-8 — and when it crashed on a hospital's file, that hospital never made it into the output. Six whole hospitals lost that way. It also could not read price files delivered as zip archives, which cost five more. Add it up: thousands of rows and well over a dozen hospitals, missing, all of it invisible, all of it presented with the same confidence as the parts it got right.

So the disagreement was not our bug. It was us being right. We were reading data the reference had thrown on the floor.

Two Doors

When your engine is more correct than your reference, you face two doors, and they look more alike than they are.

Behind the first, you make your validation pass. You bend the engine to reproduce the reference's behavior — drop the same rows on the same collision, fail on the same encoding, skip the same zip files. Now your output matches the oracle exactly, your test is green, and you can say, truthfully, "we reproduce the source of truth." You have also shipped its bugs on purpose, with a certificate of correctness on top.

Behind the second, you admit the reference is wrong — which throws away your validation. The thing you were checking against is no longer something you can check against, because you no longer believe it. Now you have the harder problem: how do you know your output is right, if the only oracle you had is the one you just rejected? An engine validated against nothing is an engine you are trusting on faith.

The first door is comfortable and false. The second is correct and leaves you with work to do.

The Faithful-Superset Gate

What we built to walk through the second door, we call a faithful-superset gate. The name is the idea: our output should be a faithful superset of the reference — everything the reference got right, reproduced exactly, plus the things it dropped, recovered and accounted for.

It has two halves, and the discipline is keeping them separate.

The first half is exact reproduction of the clean set. For every hospital the reference handled correctly — no collision, no encoding crash, no zip it couldn't open — our engine must reproduce the numbers exactly. Not within tolerance. Exactly. This is what keeps us honest: it means we have not quietly changed the methodology, drifted the math, or introduced our own silent drops. On the data the oracle got right, we still have an oracle, and we are still held to it.

The second half is accounted-for recovery of the rest. Every row and every hospital we recover beyond the reference — the 12,945 rows, the encoding-crash hospitals, the zip hospitals — is pinned to an explicit allowlist. Each recovery is named: this hospital, this many rows, for this reason. The allowlist is the record of every place we deliberately diverge, with a justification attached.

That second half is what lets the gate still catch a real bug — the failure the comfortable door cannot see. If you simply declare "we get more rows than the reference, more is better," then the day your engine accidentally double-counts and produces way more rows, you call that a triumph instead of a regression. "More than the oracle" is not a gate; it is the absence of one. The allowlist fixes that. Anything beyond the reference that is not on the allowlist fails the build. A genuine over-count, a duplicate, a parser inventing rows — all still trip the gate, because they show up as recovery nobody authorized. The fixes pass because we named them. The bugs fail because we didn't.

Colorado went from 59 hospitals to 70, and from 33.9 million rows to 41.3 million, through that gate. Every additional hospital and row is a place the trusted reference was wrong and we could prove it — with a clean set still reproduced exactly underneath.

The Real Constraint on the Information Layer

I keep coming back to a claim about where the Information layer is actually hard. The common framing is that the constraint is access: get your hands on the data and legibility follows. Price transparency is supposed to be the poster child — the data is public, so the problem is solved, so the only work left is presentation.

Access was never the wall. We had the files. The reference had the files. The wall was judgment: knowing the trusted output was wrong, and having the discipline to be more correct without swapping one silent error for another.

The faithful-superset gate denies you both comforts. You do not get to reproduce the reference's bug, and you do not get to free-float — trusting your own output precisely because it is yours. You reproduce the clean set exactly, and you account for every divergence. You stay more correct than your reference without quietly becoming less rigorous than it.

This is a governance stance on what "validated" is allowed to mean. The weak version, "validated means matches the trusted source," is the version that ships the source's bugs. The strong version, "validated means reproduces what the source got right and accounts for everything else," is the one that lets you be the more-correct party without losing the ability to catch yourself when you are wrong.

What I Do Not Yet Know

The gate works because, in this case, we could prove the reference was wrong. The case-insensitive collision is a fact about a filesystem; the dropped rows are countable; the encoding crash is reproducible. Every divergence was the legible kind, where you can point at the row and say "this is real, and it was thrown away."

What I do not have a good answer for is the divergence you cannot prove either way — a genuinely ambiguous file, a malformed record that could be parsed two defensible ways, a code that maps to different things under different reasonable assumptions. The allowlist handles the cases where recovery is provably correct. It does not tell me what to do when "correct" is contested and both readings hold up. Right now those get a human judgment call, logged — and that does not scale the way the rest of the gate does.

So the honest open question: when you validate against a reference and hit a disagreement that neither of you can prove, what is your tiebreaker — and how do you keep it from quietly becoming "trust whichever output is mine"? I have a gate that is rigorous about the provable divergences. For the unprovable ones, I have a judgment call and the uneasy knowledge that judgment calls are exactly where a more-correct engine slides back into a confidently-wrong one.

→ See the companion piece on keeping the prose honest once the data is right: Two Thousand Pages, One Moving Median.

→ See the full framework: The Decision Effectiveness Framework.

When You're More Correct Than the Source of Truth

The Disagreement

The Oracle Was Wrong

Two Doors

The Faithful-Superset Gate

The Real Constraint on the Information Layer

What I Do Not Yet Know

Related

Two Thousand Pages, One Moving Median

The Berkshire Test for AI: A Compounding Diagnostic for Leaders Allocating Capital to AI

The Tower of Babel in Your Boardroom