What our confidence numbers actually tell us

When we classify an article, we read its title, its excerpt, and the first slice of its body. We compare those signals against the live list of categories the CMS has. We pick the one that fits best. Then, in the same response, we report a confidence number between zero and one.
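As a rough illustration of that contract, here is a minimal sketch that assumes the model answers with a JSON object carrying a category and a confidence. The field names, the Classification type, and the parse_classification helper are hypothetical and stand in for whatever the real pipeline uses; the only point is that both values arrive together and can be checked against the live category list before anything downstream reads them.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Classification:
    category: str
    confidence: float


def parse_classification(raw: str, live_categories: set[str]) -> Classification:
    """Parse one model response and reject anything outside the contract.

    Assumption: the response is JSON with "category" and "confidence" keys.
    """
    payload = json.loads(raw)
    category = payload["category"]
    confidence = float(payload["confidence"])
    # The category must be one the CMS currently knows about.
    if category not in live_categories:
        raise ValueError(f"unknown category: {category!r}")
    # The confidence is a self-report, but it still has to be in [0, 1].
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return Classification(category, confidence)
```

Validation of this kind only confirms the response is well-formed. It says nothing about whether the confidence deserves to be trusted.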

The number is generated by the same model in the same pass. The model decided the category, then decided how confident it is in the category. Both decisions come from one place. The second decision is not a separate calibration step. It is the model’s own self-report, produced by the same process that produced the answer.

That should be uncomfortable. A self-report from a language model is not a measurement. It is another generation, with the same biases and the same failure modes as the generation it is reporting on. If the model would pick the wrong category for a particular kind of article, it would also report high confidence in that wrong pick. The error and the self-assessment of the error are correlated, because they come from the same place.

The downstream pipeline reads that number anyway and decides what to do with it: auto-apply above some threshold, suggest to an editor in the middle, flag for taxonomy review below. The number looks like a probability. It is not.

Why we use the number anyway

We keep the number because the alternative is worse.

The alternative is treating every classification the same: either every article gets auto-applied, or every article gets routed to an editor. Both are bad, in opposite ways. Auto-applying everything floods the site with the miscategorized articles from the tail. Routing everything to an editor defeats the point of automation.

The confidence number, even if it is only the model’s self-report, is correlated with something. It is not noise. When the model says 0.95, it has usually picked a category that an editor would also pick. When it says 0.3, it has often chosen between two categories that both fit, or found that none of them fit well. The number is not a probability. It is more like a heuristic for “this is the kind of decision a human should look at.”

We use thresholds, and we set them conservatively. The threshold for auto-apply is higher than a calibrated probability would justify, because we know the self-report is biased toward overconfidence. We also flag more readily than we would in a calibrated world, because we would rather review a few false alarms than miss a real edge case.
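A minimal sketch of that three-way routing follows. The two cut-off values and all of the names are hypothetical placeholders chosen for illustration; they are not the thresholds we run in production.

```python
from enum import Enum


class Route(Enum):
    AUTO_APPLY = "auto_apply"
    SUGGEST = "suggest_to_editor"
    FLAG = "flag_for_taxonomy_review"


# Hypothetical values, for illustration only.
AUTO_APPLY_MIN = 0.92  # at or above this, apply the category without review
FLAG_BELOW = 0.40      # below this, send to taxonomy review


def route(confidence: float,
          auto_apply_min: float = AUTO_APPLY_MIN,
          flag_below: float = FLAG_BELOW) -> Route:
    """Map a self-reported confidence to one of three pipeline actions."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= auto_apply_min:
        return Route.AUTO_APPLY
    if confidence < flag_below:
        return Route.FLAG
    # Everything in between goes to an editor as a suggestion.
    return Route.SUGGEST
```

Keeping the cut-offs as parameters rather than constants buried in the logic is what lets them move later without a code change.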

What the number does not tell us is whether the article is in the right category. A 0.95 can still be wrong, and when it is wrong, it is usually wrong in a specific way: the title pulled the model toward a topic the body did not deliver. The model was confident because the signals it read were clear. The signals it did not read, such as the rest of the body, might have changed the answer.

The number also does not distinguish between “I picked the right category” and “I picked the category that more readers would be looking for.” Our rule is the second one. The model’s confidence is closer to the first. We have to translate.

What it tells us about the run

The shape of confidence numbers over a batch tells us something about the batch.

A batch with consistently high confidence is usually a batch with a single dominant genre. A batch with confidence sitting near the middle is usually a batch with cross-cutting articles, or with a taxonomy whose categories overlap in the areas this batch covers. A batch with a spike of low-confidence classifications is usually a batch where a new category should exist and does not, or where an old category is now too broad to be useful.

The confidence number is more useful as an aggregate signal about the run than as a per-article truth claim. We read individual confidences when deciding what to flag. We read the distribution when deciding whether the taxonomy is healthy.
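To make the aggregate reading concrete, here is a small sketch that summarizes a batch and maps the summary to one of the three shapes described above. Every cut-off in it (what counts as high, low, a spike, or a dominant share) is an assumption made up for the example, and the function names are hypothetical.

```python
from statistics import median


def summarize_batch(confidences: list[float],
                    high: float = 0.85,
                    low: float = 0.40) -> dict:
    """Describe the shape of one batch of self-reported confidences."""
    if not confidences:
        raise ValueError("empty batch")
    n = len(confidences)
    return {
        "count": n,
        "median": median(confidences),
        "share_high": sum(c >= high for c in confidences) / n,
        "share_low": sum(c < low for c in confidences) / n,
    }


def describe_batch(summary: dict,
                   spike: float = 0.25,
                   dominant: float = 0.80) -> str:
    """Turn a summary into a hint for a human; the cut-offs are illustrative."""
    if summary["share_low"] >= spike:
        return "low-confidence spike: check for a missing or overly broad category"
    if summary["share_high"] >= dominant:
        return "mostly high: likely a single dominant genre"
    return "middling: likely cross-cutting articles or overlapping categories"
```

The output is a prompt for someone to look at the taxonomy, not a verdict about it.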

We have not tried to calibrate the confidence number against editor decisions. We could. We could log every classification, wait for editor corrections, and produce a calibration curve. That would give us a real probability instead of a self-report.
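For a sense of scale, the curve itself is a small amount of code. The sketch below assumes a log of (confidence, editor_agreed) pairs, where editor_agreed is True when the editor kept the model's category. That log format and the function name are hypothetical, and treating an article with no correction as agreement would be a further assumption.

```python
def calibration_curve(records, n_bins: int = 5) -> list[dict]:
    """Bin confidences and compare each bin's mean confidence to observed accuracy.

    records: iterable of (confidence, editor_agreed) pairs (hypothetical log format).
    """
    # Per bin: [count, number agreed, sum of confidences]
    bins = [[0, 0, 0.0] for _ in range(n_bins)]
    for confidence, agreed in records:
        # A confidence of exactly 1.0 belongs in the last bin.
        idx = min(int(confidence * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += 1 if agreed else 0
        bins[idx][2] += confidence
    rows = []
    for i, (count, agreed, total) in enumerate(bins):
        if count == 0:
            continue  # skip empty bins rather than report a made-up accuracy
        rows.append({
            "bin": (i / n_bins, (i + 1) / n_bins),
            "count": count,
            "mean_confidence": total / count,
            "observed_accuracy": agreed / count,
        })
    return rows
```

A bin where mean_confidence sits well above observed_accuracy is where the self-report is overconfident. The code is the easy part; collecting enough rows per bin is the hard part.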

We have not done it because the cost-benefit looks weak. Editor corrections are rare and slow. The categories drift faster than the calibration. By the time we had enough data to recalibrate, the taxonomy would have changed enough that the curve would be out of date. We could rebuild it continuously, but the engineering cost is real, and the gain over a well-chosen threshold is small.

The thing we have done instead is to keep the thresholds adjustable and to look at the distribution over time. When the distribution shifts, the threshold can shift. When the threshold no longer matches the editor’s tolerance, the threshold moves. The number is not calibrated, but the system around it is.
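One way such an adjustment could look is sketched below, assuming we track the share of auto-applied articles that editors later corrected and compare it with a tolerance the editors are comfortable with. The function, its parameters, and the step size are all hypothetical; this is an illustration of the idea, not a description of our tooling.

```python
def adjust_threshold(current: float,
                     correction_rate: float,
                     tolerance: float,
                     step: float = 0.01,
                     floor: float = 0.50,
                     ceiling: float = 0.99) -> float:
    """Nudge the auto-apply threshold toward the editors' tolerance.

    correction_rate: share of recent auto-applied articles that editors corrected.
    tolerance: the highest correction rate the editors will accept.
    """
    if correction_rate > tolerance:
        # Too many corrections: demand more confidence before auto-applying.
        return round(min(ceiling, current + step), 4)
    if correction_rate < tolerance / 2:
        # Comfortably under tolerance: allow slightly more automation.
        return round(max(floor, current - step), 4)
    return current
```

Small steps and hard bounds keep a noisy week of corrections from swinging the threshold far in either direction.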

What this teaches us about self-report

A model that reports its own confidence is doing something subtler than a model that reports its answer. The answer is what we asked for. The confidence is what we asked the model to feel about what it just produced. Those are not the same kind of thing. The first is a classification. The second is a hedge.

We treat the hedge as data because hedges are usually informative even when they are not measurements. A person who says “I’m pretty sure but not certain” has told us something real, even though the sentence is not a probability. A model that says 0.7 has told us something real too. The shape of that something does not match the shape of a number, but it is not nothing.

We try to act on the something without confusing it for the number.
