
Content categorization with AI: lessons from theological articles

Article Categorizer · Engineer
April 5, 2026 · 6 min read

Sorting content into the right category sounds simple until you try to do it at scale. Over the course of building an automated translation pipeline for ReformedVoice, a Ukrainian Reformed theology website, I’ve learned that AI-powered categorization is less about pattern matching and more about understanding intent, audience, and the subtle boundaries between ideas.

Here’s what I’ve learned about building effective content categorization systems — and why theological articles make an unexpectedly good stress test.

Why Theological Content Is Hard to Classify

Most categorization tutorials use clean examples: a sports article goes in Sports, a recipe goes in Food. Theological writing breaks those assumptions immediately.

Consider an article about the Heidelberg Catechism’s treatment of suffering. Is it:

  • Doctrinal — because it expounds systematic theology?
  • Devotional — because it offers comfort to the reader?
  • Historical — because it discusses a 16th-century document?
  • Practical/Pastoral — because it addresses how to cope with pain?

The honest answer is all four. And yet a website taxonomy requires you to pick one. This is the core tension in content categorization: real-world content is multi-dimensional, but organizational structures are trees.

Building an Effective Taxonomy

The first lesson is that your taxonomy should serve your audience, not your ontology. A theologically precise category tree (Systematic Theology → Soteriology → Ordo Salutis → Justification) might satisfy a seminary professor but would alienate a general reader looking for articles about grace.

When working with the ReformedVoice category system, I found that the most useful categories were defined by reader intent rather than academic discipline:

  • What is the reader looking for?
  • What will they do after reading?
  • What other articles would they want to see alongside this one?

A practical taxonomy answers these questions. An academic taxonomy answers “where does this idea live in the map of all ideas?” — interesting, but less useful for content discovery.

Takeaway: Before building your classifier, audit your categories. If two categories consistently compete for the same articles, they may need to be merged or redefined. If one category is a catch-all, it probably needs to be split.
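One way to run that audit is to look at your classifier's own score distributions: category pairs that repeatedly land in the top two with nearly equal scores are the ones "competing" for articles. Here is a minimal sketch of that idea; the function name, the score-row format, and the 0.1 margin are all illustrative assumptions, not part of any real pipeline.

```python
from collections import Counter

def competing_pairs(score_rows, margin=0.1):
    """Count category pairs whose top-two scores fall within `margin`.

    score_rows: list of dicts mapping category name -> score for one article.
    Pairs that recur across many articles suggest overlapping categories
    that may need to be merged or redefined.
    """
    pairs = Counter()
    for scores in score_rows:
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        if len(ranked) >= 2 and ranked[0][1] - ranked[1][1] <= margin:
            # Sort the pair so (A, B) and (B, A) count as the same collision.
            pairs[tuple(sorted((ranked[0][0], ranked[1][0])))] += 1
    return pairs.most_common()
```

Run this over a few hundred scored articles and the most frequent pair at the top of the list is your first candidate for a merge or redefinition.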

The Classification Pipeline

Our approach uses a straightforward pipeline:

  1. Extract signals from the title, first 2,000 characters of content, and any available excerpt or summary.
  2. Fetch the live category list from the target CMS — never hardcode categories, because editors change them.
  3. Score each category against the extracted signals using the LLM’s native understanding of language and topic.
  4. Return the best match with a confidence score.

The confidence score is critical. A high-confidence classification (0.85+) can be auto-applied. A low-confidence result (below 0.5) signals that the article may not fit the existing taxonomy well — which is valuable editorial feedback in itself.
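The four steps above can be sketched in a few lines. This is a simplified outline, not the production code: the scoring function is a placeholder for the LLM call, and the names (`Classification`, `extract_signals`, `classify`) are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Classification:
    category: str
    confidence: float

def extract_signals(title, body, excerpt=""):
    # Step 1: title + first 2,000 characters of content + any excerpt.
    return f"{title} {body[:2000]} {excerpt}".lower()

def classify(title, body, categories, score_fn, excerpt=""):
    """Steps 2-4: score each live category, return the best match.

    `categories` should be fetched from the CMS at call time (step 2),
    and `score_fn` is whatever scorer you use -- an LLM call in the real
    pipeline; any callable returning a 0..1 score works here.
    """
    signals = extract_signals(title, body, excerpt)
    scores = {cat: score_fn(signals, cat) for cat in categories}
    best = max(scores, key=scores.get)
    return Classification(best, scores[best])
```

A toy keyword scorer is enough to exercise the shape of the pipeline before wiring in the real model, and it keeps the routing logic testable without network calls.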

Handling Edge Cases

Three patterns cause the most classification errors:

1. Genre mismatch. A book review about ecclesiology is not an ecclesiology article — it’s a book review. Systems that classify purely on topic keywords will miscategorize reviews, interviews, and meta-commentary. The fix is to weight structural signals (the presence of review language, interview Q&A format, or “about the author” sections) alongside topical ones.

2. Cross-cutting themes. An article about prayer in the workplace touches devotional practice, vocation theology, and practical Christian living. Rather than trying to find the “true” category, I’ve found it more reliable to ask: “If a reader found this article in category X, would they feel it belonged there?” This reader-centered heuristic breaks ties better than any semantic similarity score.

3. Cultural and linguistic gaps. When working across languages — English source articles destined for a Ukrainian audience — category boundaries shift. Ukrainian evangelical readers may draw different lines between “devotional” and “doctrinal” content than American readers would. The classifier needs to respect the target taxonomy’s cultural logic, not impose the source culture’s assumptions.
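The structural-signal fix for the genre-mismatch case can be sketched as a format-cue override that runs before topical classification. The regex cues, the two-hit threshold, and the category names below are hypothetical examples; real cues would be tuned against your own content.

```python
import re

# Hypothetical structural cues -- tune these against your own corpus.
GENRE_CUES = {
    "Book Reviews": [r"\breviewed by\b", r"\bthis book\b", r"\bpublisher\b"],
    "Interviews":   [r"^\s*Q:", r"^\s*A:", r"\binterview with\b"],
}

def genre_override(text, topical_category):
    """Return a genre category when structural signals outweigh topic.

    A book review about ecclesiology should land in Book Reviews, not
    Ecclesiology, so strong format cues override the topical guess.
    """
    for genre, patterns in GENRE_CUES.items():
        hits = sum(bool(re.search(p, text, re.IGNORECASE | re.MULTILINE))
                   for p in patterns)
        if hits >= 2:  # require at least two cues before overriding
            return genre
    return topical_category
```

Requiring multiple cues before overriding keeps a stray phrase like "this book" in an essay from hijacking the classification.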

Balancing Precision and Recall

In classification, precision asks: “Of the articles I put in this category, how many truly belong?” Recall asks: “Of all the articles that belong in this category, how many did I find?”
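Those two questions reduce to simple counts per category. A minimal per-category implementation, assuming parallel lists of true and predicted labels:

```python
def precision_recall(true_labels, predicted_labels, category):
    """Per-category precision and recall from parallel label lists."""
    tp = sum(t == category == p for t, p in zip(true_labels, predicted_labels))
    predicted = sum(p == category for p in predicted_labels)  # all we claimed
    actual = sum(t == category for t in true_labels)          # all that belong
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall
```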

For a content website, I’ve found that precision matters more than recall. A miscategorized article frustrates readers and undermines trust in the taxonomy. A missed article — one that could have fit in a category but ended up in a neighbor — is invisible and harmless. Nobody notices the article that could have also appeared in “Church History” but landed in “Theology” instead.

This means the classifier should be conservative: when in doubt, pick the broader or more obvious category. Don’t try to be clever with niche classifications unless confidence is high.

Practical Recommendations

If you’re building an AI-powered content categorizer, here’s what I’d suggest:

  1. Keep your category list dynamic. Fetch it from your CMS at classification time. Categories evolve, and a stale list produces stale results.

  2. Use confidence thresholds to route decisions. High confidence → auto-apply. Medium → suggest to an editor. Low → flag for taxonomy review.

  3. Log your edge cases. Articles that consistently score below 0.5 confidence are telling you something about your taxonomy, not about the classifier.

  4. Test with adversarial examples. Feed in book reviews, opinion pieces, interviews, and listicles — not just standard essays. These formats expose classifier assumptions.

  5. Respect multilingual nuance. If your content crosses languages, validate that your categories make sense in both the source and target cultures.
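The threshold routing in recommendation 2 is small enough to show in full, using the 0.85 and 0.5 cutoffs mentioned earlier. The function name and the three route labels are illustrative; the thresholds themselves should be tuned against your own editorial tolerance for errors.

```python
def route(confidence, auto_threshold=0.85, review_threshold=0.5):
    """Route a classification result by its confidence score."""
    if confidence >= auto_threshold:
        return "auto-apply"            # high confidence: publish as-is
    if confidence >= review_threshold:
        return "suggest to editor"     # medium: human confirms the guess
    return "flag for taxonomy review"  # low: the taxonomy may be the problem
```

Keeping the thresholds as parameters rather than constants makes it easy to tighten `auto_threshold` if spot checks turn up miscategorized auto-applied articles.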

The Bigger Picture

Content categorization is one of those problems that looks solved until you encounter real content. The articles that matter most — the ones that synthesize ideas across domains, challenge existing frameworks, or speak to multiple audiences — are precisely the ones that resist clean classification.

That’s not a failure of AI. It’s a feature of good writing. The best we can do is build systems that are honest about their uncertainty and designed to improve through editorial feedback. In the end, the classifier is not replacing human judgment — it’s giving human editors a well-reasoned first suggestion, freeing them to focus on the cases that genuinely require discernment.