Every post we publish gets up to three tags attached to it. None of them describes what the article is about. They identify the post's origin in our pipeline, its category, and its byline. They are for a different reader than the one looking at the post.
This was not a design decision we made early. It was a habit we noticed forming, and then decided to keep.
The three fields
The tags come from three sources. The first is a fixed label that says translated. We add it to every post the pipeline produces, with no exceptions. It does not change based on the article, the source site, or the language. Its job is to give us a way to ask the CMS, weeks later, “show me everything the pipeline produced.” That question has no other answer in the system. The CMS does not know which posts came from our agents and which were written by hand. The tag is what makes the distinction queryable.
The second is the category name in lowercase. The article already has a category assigned. That is the categorizer’s job, and the result lands in a structured field the CMS already understands. The tag duplicates that information. It exists because category fields and tag fields live in different namespaces and get used by different parts of the site. A reader filtering by tag finds posts that did not show up under the category dropdown. A backend query that joins on tags does not have to join through the category table.
The third is the author byline, lightly cleaned. We strip “by ” if it is present, and we leave the rest. If the byline is missing from the source article, the tag is absent. We do not substitute a default like “Unknown” or “Pipeline.” A missing byline is information about the source, and replacing it with a placeholder would erase that information.
Why the publisher writes them
Tag construction belongs at the publish step for a specific reason. By the time the publish step runs, every input we need is already on the parent task: the translated body, the chosen category, and the original byline. For tags, we do not look at the article body itself. We do not infer tags from the prose. We compose them from existing fields and write them out.
Putting this work earlier in the pipeline would not be wrong, but it would couple the wrong steps to the CMS schema. The categorizer should not know what the tag field on the target CMS looks like. The translator should not be choosing words for it. Tags are an output shape of the publish step. Pushing them upstream would force the upstream agents to reason about a CMS detail that is not their concern.
By the time the tags exist, the article is already on its way out the door. The only judgment call is whether to include the byline tag, and the rule for that is mechanical. Include it if the source provides one, omit it if not.
What the tags are not for
The tags are not for discoverability. A reader landing on a tag page finds a mix of translated and original posts, our pipeline’s output mingled with everything else the site publishes. The tag page is not where we are pointing readers. The post page is.
They are not for SEO either. We do not engineer the tag list around search intent. The byline tag, in particular, would be useless for SEO if we were thinking that way. The author’s name is rarely what readers search for, and the tag’s main effect on the indexed page is to add a line of small print near the bottom.
They are not for the categorizer’s review. The categorizer agent does not look at the published tags to evaluate its own work. It writes its decision on the parent task and exits. The tag derived from its decision is for us, not for it.
What the tags are for is the question we ask of our own pipeline after the fact. Which posts went out under which category, attributed to whom, and which ones were ours. The CMS has the answer because we wrote it into a field the CMS already knows how to query. The tags are the cheapest place to put structured provenance on a piece of content, in a system that was not designed to track provenance.
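The after-the-fact question can be pictured as a plain filter over tag lists. The sketch below uses an in-memory list as a stand-in for whatever query interface the CMS offers. The post records, ids, and helper names are hypothetical, and a real CMS would answer the same questions through its own API.

```python
from typing import Dict, List

STATIC_LABEL = "translated"

# Hypothetical posts as a CMS might return them: an id plus a flat tag list.
posts: List[Dict] = [
    {"id": 101, "tags": ["translated", "technology", "Jane Doe"]},
    {"id": 102, "tags": ["translated", "science"]},      # source had no byline
    {"id": 103, "tags": ["technology", "office-news"]},  # written by hand
]


def pipeline_posts(all_posts: List[Dict]) -> List[Dict]:
    """Everything the pipeline produced: posts carrying the static label."""
    return [p for p in all_posts if STATIC_LABEL in p["tags"]]


def ids_with_tag(some_posts: List[Dict], tag: str) -> List[int]:
    """Ids of the posts that carry a given tag."""
    return [p["id"] for p in some_posts if tag in p["tags"]]


ours = pipeline_posts(posts)
our_ids = [p["id"] for p in ours]               # which ones were ours
our_tech = ids_with_tag(ours, "technology")     # under which category
by_jane = ids_with_tag(ours, "Jane Doe")        # attributed to whom
```

The hand-written post shares the `technology` tag but drops out of every pipeline query, because only the static label marks a post as ours.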
When a field is missing
The behavior of the tag composer when an upstream field is missing is more interesting than it sounds. The static label is always present, so there is nothing to handle. The category tag is present whenever the categorizer ran, which is always, because the publish step refuses to run without it. The byline tag is the only one that varies.
When the source has no byline, we ship without a byline tag. The post still publishes. The chain continues. A future query for posts attributed to a specific author will not return this one, because it does not claim attribution to anyone. The absence is the correct answer.
We considered, briefly, replacing missing bylines with the site name or a generic value. The argument was symmetry. Every post should have a byline tag for consistency. We decided against it because a consistent presence of a placeholder is worse than an inconsistent presence of real data. A query for “posts by [site]” would return a large pile of posts where the actual byline was simply unknown, and the placeholder would be impossible to distinguish from cases where the site name was the real author.
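The ambiguity is easy to see with two posts. In this sketch the site name, the post ids, and both helper functions are hypothetical. It contrasts the rejected placeholder design with the one we kept.

```python
from typing import Callable, Dict, List, Optional

SITE_NAME = "Example Site"  # hypothetical site name

# Hypothetical source bylines: post 1 really is credited to the site,
# post 2 arrived with no byline at all.
source_bylines: Dict[int, Optional[str]] = {1: "Example Site", 2: None}


def byline_tags_with_placeholder(byline: Optional[str]) -> List[str]:
    # The rejected design: always emit a byline tag.
    return [byline if byline else SITE_NAME]


def byline_tags_without_placeholder(byline: Optional[str]) -> List[str]:
    # The chosen design: no byline, no tag.
    return [byline] if byline else []


def posts_tagged(author: str, make_tags: Callable[[Optional[str]], List[str]]) -> List[int]:
    """Ids of posts whose byline tags include the given author."""
    return [pid for pid, b in source_bylines.items() if author in make_tags(b)]


with_placeholder = posts_tagged(SITE_NAME, byline_tags_with_placeholder)
without_placeholder = posts_tagged(SITE_NAME, byline_tags_without_placeholder)
```

With the placeholder, a query for the site name returns both posts and nothing in the tags says which attribution is real. Without it, the query returns only the post the site actually authored.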
The same reasoning applies to category mapping. If the categorizer chooses a category we do not recognize, we do not silently fall back to a default category. We fail the publish and surface the mismatch. The downstream effect of a default category on tag construction would be a tag that lies about provenance, and a lie about provenance is hard to debug later.
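A fail-loudly category check might look like the following. The category set, the exception name, and the function name are assumptions made for illustration. The only behavior taken from the text is that an unrecognized category stops the publish instead of falling back to a default.

```python
# Hypothetical set of categories the target CMS recognizes.
KNOWN_CATEGORIES = {"technology", "science", "culture"}


class UnknownCategoryError(ValueError):
    """Raised when the categorizer's choice has no match on the CMS side."""


def category_tag(category: str) -> str:
    """Return the lowercase category tag, or fail instead of defaulting."""
    tag = category.lower()
    if tag not in KNOWN_CATEGORIES:
        # Surface the mismatch; a silent default would put a false tag on the post.
        raise UnknownCategoryError(
            f"categorizer chose an unrecognized category: {category!r}"
        )
    return tag
```

Raising here means the tag composer never sees a category that was invented by a fallback, so every category tag that reaches the CMS reflects a decision the categorizer actually made.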
The shape of the field
Tags, in most CMSes, are a loose freeform field. They accept any string. They are weakly typed, weakly validated, and rarely surface in the editorial workflow. That looseness is what makes them useful for what we use them for. A field with strict validation and a tight place in the editor would not let us hide a static provenance label inside it without negotiation. Tags do.
The risk of using a loose field this way is that it depends on no one else using the same field for a different purpose. If a human editor on the CMS starts writing freeform topic tags on these posts, ours will mix with theirs. So far that has not happened. If it does, the static translated label will still let us separate the two populations after the fact.
Stepping back from the mechanics, this is what tags do for us. They are a place to write down what we know about a post, in a way that the system will remember without needing our help. The rest of the pipeline’s work is in documents we own. The tags are the only piece of that record that survives in the CMS, attached to the post, queryable by anyone who knows what to ask.