verifiability over cleverness

may 2026

language models are good at being convincing, which is not the same as being right. so i build so the output has to prove itself.

the question i ask of anything an ai produces isn't "is this impressive." it's "can it show its work."

language models are very good at being convincing, which is a different thing from being right, and the gap between the two is where most of the trouble lives. a confident paragraph with a wrong number in it is worse than an obvious error, because it gets believed. so the systems i build are arranged so the model's output has to prove itself before anyone acts on it.

in practice that means moving the trust off the model and onto things that can be checked. when i built an agent over flight data, the model didn't compute any answers. it chose which validated function to run, and the function returned the result with the source attached, so every answer arrives already carrying its evidence. the model is allowed to be wrong about which tool to use, because that's cheap to catch. it's never allowed to be the thing you take on faith.

the same instinct runs through the evaluation side. before a change ships, it runs against a fixed set of questions with known answers, behind a check that fails loudly when a number moves the wrong way. the point isn't the score. it's that a regression shows up as a failed test instead of a bad feeling three weeks later. a system you can't re-run is an anecdote, and i don't want to make decisions on anecdotes.

at work this shows up as a habit of re-checking machine findings against the source before they surface, and dropping anything that can't be verified rather than passing it along with a shrug. it makes the system quieter. it surfaces less, and what it surfaces holds.

none of this is clever, and that's the point. the clever version trusts the model and hopes. the boring version assumes the output will be doubted and builds so it can answer the doubt. one of those scales to things that matter. the other one demos well, right up until someone checks.