
Fail-Open LLM Architecture: Why Your Reviewer Stage Should Never Block a Decision
The problem
You added a reviewer LLM to your pipeline for good reasons. Quality matters, and a second pass catches self-contradictions the primary model misses. But the moment you wire a second model into production, you have a new failure surface: what happens when the reviewer itself is slow, errored, or unreachable?
The instinct is to treat the reviewer as a gate — no pass from the reviewer, no decision released. That instinct is wrong.
The pattern: fail-open
When the reviewer fails — rate limit, timeout, network error, bad JSON — the primary decision should pass through unmodified with a logged warning. Not held, not retried-until-exhausted, not queued-for-review. Passed through.
Here's the pattern:
```python
try:
    review = review_decision(decision, market_state, personality)
    if review.verdict == "reject":
        decision = downgrade_to_hold(decision, review.reasons)
    elif review.verdict == "adjust":
        decision = apply_adjustments(decision, review)
    # "approve" falls through unchanged
except Exception as e:
    logger.warning("reviewer failed, pass-through: %s", e)
    # decision proceeds unchanged — this is the spec, not a bug
```
The `except Exception` block is the spec, not the bug.
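The helpers above aren't defined in the article; as a minimal sketch, `downgrade_to_hold` might neutralize the action while preserving an audit trail (the `Decision` shape here is hypothetical):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Decision:
    action: str          # e.g. "buy" | "sell" | "hold"
    size: float
    notes: tuple = ()    # audit trail of reviewer reasons

def downgrade_to_hold(decision: Decision, reasons) -> Decision:
    """Reviewer rejected: keep the record, zero the exposure."""
    return replace(decision, action="hold", size=0.0,
                   notes=decision.notes + tuple(reasons))
```

Using an immutable record means the original decision survives for logging even after the downgrade.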
Why this is right
One: you already validated the primary decision. The reviewer is bonus quality, not a prerequisite. If the primary's output passes schema validation, sanity checks, and risk bounds, it's good enough to ship. The reviewer is additive confidence, not permission.
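The "already validated" claim is the load-bearing part of the argument, so it's worth making concrete. A minimal sketch of primary-side validation, with hypothetical field names and a hypothetical risk bound:

```python
MAX_POSITION_SIZE = 10_000  # hypothetical risk bound; set from your risk policy

def validate_primary(decision: dict) -> bool:
    """Schema, sanity, and risk-bound checks that gate shipping.

    These run on the primary's output regardless of reviewer health.
    """
    required = {"action", "size", "confidence"}
    if not required <= decision.keys():
        return False
    if decision["action"] not in {"buy", "sell", "hold"}:
        return False
    if not 0.0 <= decision["confidence"] <= 1.0:
        return False
    return 0 < decision["size"] <= MAX_POSITION_SIZE
```

If this returns True, the decision is shippable; the reviewer can only improve it, never authorize it.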
Two: fail-closed creates correlated failure. When your model provider has a regional blip, your primary and your reviewer both see elevated latency. If either gates the other, you lose both: primary times out waiting for reviewer, system freezes. Fail-open degrades gracefully to "we shipped what the primary said" instead of "we shipped nothing."
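To keep the correlated-failure case from freezing the pipeline, the reviewer call needs a hard wall-clock budget, not just a retry policy. One way to sketch that with the standard library (`review_with_budget` and the 2-second default are assumptions, not from the article):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)  # shared pool; size to your throughput

def review_with_budget(decision, review_fn, budget_s=2.0):
    """Run the reviewer under a wall-clock budget; None means 'fail open'."""
    future = _pool.submit(review_fn, decision)
    try:
        return future.result(timeout=budget_s)
    except FutureTimeout:
        future.cancel()  # best effort; an already-running call keeps running
        return None
```

The caller treats `None` exactly like the `except` branch in the main pattern: log a warning and ship the primary's decision.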
Three: infrastructure timidity compounds. If every stage of your pipeline can block every other stage, your system's reliability becomes the product of every component's reliability. Five stages at 99% each gives you 95%. Fail-open pipelines approach the reliability of the critical stage alone — here, the primary.
When fail-open is wrong
Two cases where you do want fail-closed:
- Financial transactions. A reviewer that detects "sending $50K to unknown wallet" should block, not warn. But this is a rules engine, not an LLM. Don't put an LLM on the critical path of money movement.
- Legal or compliance text. GDPR consent flows, medical advice, tax filings. These need human review on errors, not LLM silent-bypass. Queue and escalate.
For everything else — trading signals, content quality, customer service responses, recommendations — fail-open.
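For the compliance case, "queue and escalate" inverts the pattern: the `except` branch holds the output instead of releasing it. A minimal sketch (the queue and function names are hypothetical):

```python
import queue

escalations: "queue.Queue[dict]" = queue.Queue()

def review_compliance_text(text, review_fn):
    """Fail closed: on any reviewer error, hold the output and escalate."""
    try:
        return review_fn(text)
    except Exception as exc:
        escalations.put({"text": text, "error": repr(exc)})
        return None  # nothing ships until a human clears the queue
```

Note the structural symmetry with the fail-open pattern: same try/except shape, opposite default.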
Metrics to watch
If you ship fail-open, track:
- `reviewer_success_rate` — fraction of decisions that got reviewed
- `reviewer_adjustment_rate` — of those, fraction the reviewer adjusted
- `reviewer_rejection_rate` — fraction the reviewer fully rejected
- `reviewer_fallthrough_rate` — fraction that bypassed the reviewer due to timeout or error
If `reviewer_fallthrough_rate` creeps above 5%, you have a reliability problem in your reviewer stack that is silently degrading quality. Investigate — but don't fix it by wiring a gate. Fix it by making the reviewer faster or more available.
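These four rates fall out of a single counter on reviewer outcomes. A minimal in-process sketch (a real deployment would export these to Prometheus or similar):

```python
from collections import Counter

review_stats = Counter()

def record_review(outcome):
    """outcome: 'approve' | 'adjust' | 'reject' | 'fallthrough'."""
    review_stats[outcome] += 1

def fallthrough_rate():
    total = sum(review_stats.values())
    return review_stats["fallthrough"] / total if total else 0.0
```

Alert on `fallthrough_rate()` crossing 0.05, per the threshold above.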
Closing
The first principle of fault-tolerant systems: every stage should degrade to the simplest correct behavior when its neighbor fails. For LLM reviewer stages, that's pass-through with a warning. Engineering for that explicit case is what separates production-grade AI systems from demos that fall over under load.
A3E Ecosystem builds AI-native trading and content infrastructure.


