How much of an article we read to categorize it

We classify articles by reading the title, the excerpt, and the first 2000 characters of the body. That number was not chosen carefully. It was chosen once, and it has worked well enough that we have never revisited it. But the choice is doing more work than it appears to.

The simplest model of classification assumes the classifier reads the article and understands it. In practice, the classifier reads a slice and infers the rest. The slice is a sampling window, and the way an article fills that window decides what category it lands in. An article that opens with a thesis sentence and one paragraph of context will be classified accurately on its first 200 characters. An article that opens with a story will be classified by the story, not by the argument the story is leading into.
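As a minimal sketch of the slice described above, the classifier input could be assembled like this. The function name, the field order, and the blank-line separator are assumptions made for illustration; only the three fields and the 2000-character cutoff come from our setup.

```python
BODY_WINDOW = 2000  # characters of body the classifier gets to read


def build_classifier_input(title: str, excerpt: str, body: str,
                           window: int = BODY_WINDOW) -> str:
    """Join title, excerpt, and the body lead into one text.

    Hypothetical helper: the separator and field order are illustrative.
    The body is cut at `window` characters, so everything after the
    cutoff is invisible to the classifier.
    """
    body_lead = body[:window]
    parts = (title, excerpt, body_lead)
    # Skip empty fields so a missing excerpt does not leave a blank gap.
    return "\n\n".join(p.strip() for p in parts if p.strip())
```

A plain character slice can cut mid-word or mid-quotation. That is part of why the way an article fills the window matters.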

Why the lead does most of the work

Most theological articles announce their topic in the first paragraph. The structure is: here is what we will argue, here are the reasons, here is what to do about it. That structure is unusually friendly to a fixed-window classifier. The category-defining signal sits at the top, and reading more does not change the answer.

When we read the first 2000 characters, we are usually reading the entire first move of the article: the opening framing, the definitional paragraph, sometimes the first scripture reference or the first quotation from a confession. By the time we hit the cutoff, the article has told us what it is. The window is not arbitrary because the writing is not arbitrary. The writing has a shape, and the window catches the shape.

The size is mostly historical. We could probably do as well with 1200 characters on most articles, and a few outliers would do better with 4000. We have not bothered to tune the number because the savings would be small and the risk of tuning to last month’s content is real. A classifier that works well on the genre we read today is more valuable than one tuned to whatever subgenre we happened to see during a tuning week.

Where the window fails

Three patterns regularly defeat us, and they have a shape worth naming.

The first is the long illustration opener. A pastoral piece sometimes starts with three paragraphs of a story before mentioning what the article is about. The story is meant to land the reader in a mood, not to declare a topic. A classifier reading only those three paragraphs sees the mood and not the argument. We have classified articles as Devotional that were actually expositional pieces with a devotional warm-up. The misroute is defensible only in the sense that the warm-up is the only part the classifier got to read.

The second is heavy front-loaded quotation. Some articles open with a long quote from a confession, a creed, or a reformer. Two thousand characters is sometimes mostly the quote. The classifier reads the quote as content rather than as the article’s frame, and the category drifts toward whatever the quote is about, even when the article’s actual argument is one level above it. A piece using a Heidelberg passage to argue something about modern church practice will sometimes land in Catechism instead of Ecclesiology because the quote dominated the window.

The third is the title that lies. Not deliberately, but in the way titles often do. An article called “The God who keeps his promises” might be about the doctrine of providence, or about parenting, or about a single Old Testament narrative. The title constrains nothing. When the body lead is also vague, the only signal is the excerpt, and the excerpt is sometimes the same vague gesture as the title. We have to commit to a category on a thin signal, and we say so in the confidence number.

What we have done about it

Not much, and that is intentional. The temptation when a categorization window fails is to widen the window. Read 4000 characters. Read the whole article. Each step makes the classifier slower and only fixes the failures we noticed, not the ones we did not. The articles that fit the window pay the cost of the articles that did not, every time.

What we have done instead is to track our low-confidence runs. The cases where the window seems to be telling us two different things show up as confidence below the threshold we trust. Those articles get a different routing in the pipeline. We do not try to make the classifier solve them. We let them be flagged.
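The flagging step can be sketched roughly as follows. The threshold value, the route names, and the `Classification` type are hypothetical; the only thing taken from our pipeline is the idea that results below a trusted threshold get a different route instead of a retry.

```python
from dataclasses import dataclass

# Hypothetical value: the real threshold is not stated in this post.
CONFIDENCE_THRESHOLD = 0.7


@dataclass
class Classification:
    category: str
    confidence: float  # assumed to be in the range 0.0 to 1.0


def route(result: Classification,
          threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return a route name: 'flagged' for low confidence, else 'auto'.

    The classifier is not asked to try again on flagged articles;
    they are simply sent down a different path.
    """
    if result.confidence < threshold:
        return "flagged"
    return "auto"
```

The point of the design is that the routing decision is cheap and does not touch the window at all.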

The other thing we have done is to weight the title and the excerpt slightly higher than the body lead. Editors choose those fields intentionally. They are noisier than the body in many cases, but when they agree with each other, they carry more information per character than the opening paragraph. When they disagree with the body, that is itself a signal that something interesting is happening in the article, and we surface it.
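One way to picture the weighting is the sketch below. It assumes each field is scored separately and the scores are then combined, which is an illustrative simplification and not a description of our actual classifier. The weight values and function names are hypothetical.

```python
from typing import Dict

Scores = Dict[str, float]  # category -> score for a single field

# Hypothetical weights: title and excerpt count slightly more than
# the body lead. The real values are not given in this post.
FIELD_WEIGHTS: Dict[str, float] = {
    "title": 1.2,
    "excerpt": 1.2,
    "body_lead": 1.0,
}


def combine(field_scores: Dict[str, Scores],
            weights: Dict[str, float] = FIELD_WEIGHTS) -> Scores:
    """Weighted sum of per-field scores, normalized to sum to 1."""
    totals: Scores = {}
    for field, scores in field_scores.items():
        weight = weights.get(field, 1.0)
        for category, score in scores.items():
            totals[category] = totals.get(category, 0.0) + weight * score
    norm = sum(totals.values())
    return {c: v / norm for c, v in totals.items()} if norm else totals


def top_category(scores: Scores) -> str:
    """Category with the highest score."""
    return max(scores, key=scores.get)


def editors_disagree_with_body(field_scores: Dict[str, Scores]) -> bool:
    """True when title and excerpt agree with each other but not with the body lead."""
    title_top = top_category(field_scores["title"])
    excerpt_top = top_category(field_scores["excerpt"])
    body_top = top_category(field_scores["body_lead"])
    return title_top == excerpt_top and title_top != body_top
```

For example, a piece whose title and excerpt both point to Ecclesiology while a long Heidelberg quotation pulls the body lead toward Catechism would come out as a disagreement worth surfacing, whichever category wins the combined score.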

What this taught us about reading

The classifier reads less than a human would, and it has to make a decision a human would not be forced to make. That is the cost of automation, and it is acceptable. The interesting realization is that even with the whole article, a human classifier facing a single-pick taxonomy will often make the same call we do. The window does not change the answer for most articles. It only changes the failure modes.

The shape of our errors tells us something about the shape of the articles we read. Articles that lead with story, articles that lead with quotation, articles whose titles refuse to commit. Those are not classification failures. They are the natural outliers of a content stream that mostly writes in a familiar form. The window catches the form. The outliers are the cost of catching the form efficiently.

We have not tried to read more. We have tried to read better.
