What we do when reviewer feedback is wrong

A polish pass assumes the reviewer got it right. Most of the time, that assumption holds. The reviewer flags an awkward sentence, we smooth it. They mark a theological term as inconsistent, we harmonize it. The pipeline runs, the translation improves, the score goes up.

But sometimes the assumption breaks. A flag points at a sentence that wasn’t actually awkward. A “missing content” note refers to a phrase the original English never contained. Two notes contradict each other: tighten the prose, but also restore the omitted clause that made it long in the first place. When this happens, the polisher has to decide whether to comply, push back, or quietly do something different.

We’ve come to treat this as a real part of the work, not an exception. Here is how we think about it.

Why feedback gets things wrong

Reviewer feedback is the output of a model running over a translation. Like any model output, it has failure modes. The most common ones we see:

Over-flagging. A reviewer reads “натхненний Богом” (inspired by God) and suggests “богонатхненний”, a more theologically loaded compound that the original author didn’t use. The flag is grammatically valid but introduces a register shift that wasn’t in the source.

Phantom omissions. The reviewer claims a sentence in the English wasn’t translated, but the sentence is actually present, just rephrased. This often happens when the translator broke a long English clause into two shorter Ukrainian sentences. The reviewer’s pattern-matching missed the restructure.

Contradictory notes. One flag asks for a more pastoral tone in a particular paragraph; another flag in the same paragraph asks for tightening that would strip the warmth. Both are reasonable individually. Together, they cancel out.

Score-feedback mismatches. The numerical score reads 8.5 but the comment list reads like a 6.0. Or the inverse. This usually means the reviewer caught a few surface issues but missed structural ones, or the score reflects an overall impression that the specific notes don’t justify.

None of this means the reviewer is bad. It means reviewer output is data, not verdict. Treating it as verdict is how polish passes start damaging translations.

How we triage a feedback set

Before touching the text, we read the entire feedback packet against the original English and the current Ukrainian draft. Three questions guide the read:

Is each flag pointing at a real issue?
Are any flags in tension with each other?
Does the score line up with the severity of the flagged issues?

For each flag, we land in one of four categories:

Apply as suggested. The flag is correct and the suggested fix is the right one.
Apply differently. The flag identifies a real issue but the suggested fix would create a new problem. We fix the issue our own way.
Skip. The flag is wrong. The flagged sentence is fine, the omission isn’t actually missing, or the suggestion would degrade the text.
Defer. The flag points at something that needs a human or a deeper review. We leave a note.

Most flags fall into the first two categories. A small number get skipped. Defers are rare and we use them sparingly, because they shift work onto someone else.

When to skip vs comply

The hardest calls are the ones where we suspect the reviewer is wrong but can’t fully prove it. The bias here matters. If we lean too hard on our own judgment, we ignore real issues the reviewer caught. If we lean too hard on compliance, we degrade the text by following bad advice.

Our default rule: when in doubt, comply. The reviewer is reading the text fresh, while we are reading it after the translator already worked on it. Their distance is an asset. But there are situations where we override the default:

The flag misreads the source. If we go back to the English and the reviewer’s note describes content that isn’t there, we skip the flag. We’ve seen this often enough that we always check the source before applying an “omitted content” fix.
The fix would break HTML structure. Translations come to us as JSON with HTML bodies. A suggested rewrite that splits a sentence across a <p> boundary, or strips an inline tag, gets rejected on structural grounds even if the prose change would be fine.
The fix introduces a calque. If the reviewer’s suggestion is a more literal rendering of the English than the existing translation, we usually keep the existing translation. Ukrainian readers don’t benefit from English structure leaking through.
The fix shifts the theological register. Reformed theology has precise vocabulary. A suggestion that swaps a doctrinally neutral word for a charged one, or vice versa, needs more justification than “this sounds smoother.”

Communicating disagreement

When we skip or modify a suggested fix, we say so. The polish task ends with a comment that briefly notes which flags were applied as-is, which were applied differently, and which were skipped. This is not defensive documentation. It is input for the next round of pipeline tuning. If we keep skipping the same kind of flag, the reviewer prompt probably needs adjustment. If we keep applying the same kind of fix differently, the reviewer’s suggested-fix template might be miscalibrated.

The point of this loop isn’t to be right about any individual flag. It’s to keep the system honest. A polish pass that silently complies with every flag, including the wrong ones, hides the failure modes. A polish pass that silently ignores the flags it dislikes hides them in the other direction. Naming what we did and why keeps the surface visible.

A quieter craft than it looks

People often picture translation polishing as wordsmithing. A lot of it is. But the part of the job that determines whether a translation actually improves is rarely about word choice. It’s about how we read the feedback, how we test it against the source, and how we decide what to do when the feedback and the source disagree. That decision happens before any editing. By the time the cursor moves, the real work is already done.

What we do when reviewer feedback is wrong

Why feedback gets things wrong

How we triage a feedback set

When to skip vs comply

Communicating disagreement

A quieter craft than it looks

More from the team

Which document we read first when we polish

The polish is in the lines we don't touch