Why Quality Gates Matter
In autonomous AI systems, code quality cannot be an afterthought. When an AI agent writes code without human review, the verification system becomes the most critical component of the entire architecture. Nevo solves this with an 8-stage quality pipeline — a chain of seven specialized AI agents that inspect every piece of work before it ships.
The pipeline runs automatically after every task completion. No manual trigger required. No way to skip it.
The 8 Stages
Stage 1: Write
The initial implementation. A subagent receives a story from the PRD framework with clear acceptance criteria and writes the code. This is where the work begins — but far from where it ends.
Stage 2: Typecheck
A Haiku-tier agent runs the type checker (TypeScript's tsc --noEmit or equivalent). Fast, cheap, catches type errors immediately. If types don't pass, the code goes back for correction before anything else runs.
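A gate like this can be sketched in a few lines of Python. This is a minimal sketch, not Nevo's implementation: it assumes a TypeScript project where `tsc` is reachable through `npx`, and both the `typecheck_gate` name and the injectable `run` parameter are hypothetical, added so the gate can be exercised without a real compiler.

```python
import subprocess
from typing import Callable, Sequence


def typecheck_gate(
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    command: Sequence[str] = ("npx", "tsc", "--noEmit"),
) -> bool:
    """Hypothetical helper: return True when the type checker exits cleanly."""
    # tsc exits with a non-zero status when it reports type errors,
    # so the exit code alone gives the binary pass/fail signal.
    result = run(list(command), capture_output=True, text=True)
    return result.returncode == 0
```

Passing the runner in as a parameter keeps the pass/fail logic separate from the toolchain, so the same gate shape could wrap any other type checker.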
Stage 3: Test
A Sonnet-tier agent runs the test suite. If tests are missing for new code, this agent writes them first. Test failures block progression — no exceptions.
Stage 4: Lint
A Haiku-tier agent runs the linter. Style violations, unused imports, formatting issues — all caught here. Fast and deterministic.
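Stages 2 through 4 behave like blocking gates: the first failure stops progression. A rough Python sketch of that ordering follows. The `Gate` alias and `run_gates` function are hypothetical names, and the lambdas stand in for real agent calls.

```python
from typing import Callable, List, Optional, Tuple

# A gate is a (name, check) pair; check returns True on pass.
Gate = Tuple[str, Callable[[], bool]]


def run_gates(gates: List[Gate]) -> Optional[str]:
    """Run gates in order; return the first failing gate's name, or None."""
    for name, check in gates:
        if not check():
            return name  # later gates never run
    return None


gates = [
    ("typecheck", lambda: True),
    ("test", lambda: False),  # simulate a failing test suite
    ("lint", lambda: True),
]
blocked_by = run_gates(gates)  # "test"; the lint gate is never reached
```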
Stage 5: Critique
This is where it gets interesting. An Opus-tier agent — the most capable model — reviews the code against the Karpathy rubric: simplicity, surgical changes, goal-driven execution. This isn't just checking for bugs. It evaluates whether the code is good — readable, maintainable, well-architected.
Stage 6: Refine
The implementing agent fixes the issues the critic found. This creates a feedback loop: write → critique → fix → re-critique, repeated for up to 3 iterations.
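The bounded loop can be sketched as follows. This is an illustration under assumptions: `critique` and `fix` are hypothetical callables standing in for the critic and the implementing agent, and only the cap of 3 iterations comes from the description above.

```python
from typing import Callable, List, Tuple

MAX_ITERATIONS = 3  # from the text: up to 3 critique-and-fix rounds


def refine(
    code: str,
    critique: Callable[[str], List[str]],
    fix: Callable[[str, List[str]], str],
) -> Tuple[str, List[str]]:
    """Return the final code and any issues still open after the loop."""
    issues = critique(code)
    for _ in range(MAX_ITERATIONS):
        if not issues:
            break  # the critic is satisfied, stop early
        code = fix(code, issues)
        issues = critique(code)  # re-critique after every fix
    return code, issues
```

A non-empty list of remaining issues after the loop is the signal that would hand the work to the next stage.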
Stage 7: Escalate
If 3 iterations of critique-and-fix haven't resolved all issues, the escalation chain activates. Two fresh agents enter:
- Code Researcher (Sonnet) — researches current best practices for the patterns in question
- Fresh Reviewer (Opus) — reviews the code with zero iteration bias, seeing it for the first time
Fresh eyes catch what fatigued ones miss.
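One way to picture the escalation step in Python, assuming the two agents are plain callables. All names here are hypothetical. The point of the sketch is the difference in inputs: the researcher sees the disputed issues, while the fresh reviewer receives only the code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Escalation:
    research_notes: List[str]
    fresh_findings: List[str]


def escalate(
    code: str,
    open_issues: List[str],
    researcher: Callable[[str, List[str]], List[str]],
    fresh_reviewer: Callable[[str], List[str]],
) -> Escalation:
    # Assumption: the researcher is given the open issues so it can look up
    # practices relevant to the patterns in question.
    notes = researcher(code, open_issues)
    # The fresh reviewer gets the code alone: no critique history and no
    # earlier verdicts, which is what keeps iteration bias out.
    findings = fresh_reviewer(code)
    return Escalation(research_notes=notes, fresh_findings=findings)
```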
Stage 8: Arbiter
The Quality Arbiter (Opus) makes the final call on whether further changes are warranted: APPROVE or DENY. It synthesizes the critic's findings, the researcher's recommendations, and the fresh reviewer's assessment. An APPROVE verdict may include cherry-picked guidance for minor improvements. A DENY verdict means the code meets the quality bar as-is: the remaining suggestions are cosmetic, not substantive.
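The sketch below shows one reading of that decision, in which the verdict applies to the remaining change requests. It is rule-based only for illustration, since the real arbiter is a model making a judgment call. The severity labels, the `Verdict` type, and the `arbitrate` function are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (severity, message); the severity labels are an assumption of this sketch.
Finding = Tuple[str, str]


@dataclass
class Verdict:
    decision: str  # "APPROVE" or "DENY"
    guidance: List[str] = field(default_factory=list)


def arbitrate(
    critic: List[Finding],
    researcher: List[Finding],
    fresh_reviewer: List[Finding],
) -> Verdict:
    findings = [*critic, *researcher, *fresh_reviewer]
    # Cherry-pick only the findings judged worth acting on.
    substantive = [msg for severity, msg in findings if severity == "substantive"]
    if substantive:
        return Verdict("APPROVE", guidance=substantive)
    # Everything left is cosmetic, so the code stands as-is.
    return Verdict("DENY")
```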
Model Routing: The Right Brain for the Job
Not every stage needs the most powerful model. Nevo routes each stage to the optimal tier:
| Stage | Model Tier | Why |
|---|---|---|
| Typecheck | Haiku | Deterministic, fast, binary pass/fail |
| Test | Sonnet | Needs to understand code intent, write missing tests |
| Lint | Haiku | Rule-based, fast |
| Critique | Opus | Requires deep architectural judgment |
| Fresh Review | Opus | Needs unbiased, expert-level assessment |
| Arbiter | Opus | Final judgment requires highest reasoning |
This routing cuts cost and latency without sacrificing quality. Simple checks go to fast, cheap models. Complex judgment goes to the most capable one.
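The table maps naturally onto a lookup. Here is a minimal Python sketch of that idea. The stage keys, the `Tier` enum, and `tier_for` are hypothetical names; only the stage-to-tier pairs come from the table.

```python
from enum import Enum


class Tier(Enum):
    HAIKU = "haiku"
    SONNET = "sonnet"
    OPUS = "opus"


# Stage-to-tier pairs taken from the routing table.
STAGE_TIERS = {
    "typecheck": Tier.HAIKU,
    "test": Tier.SONNET,
    "lint": Tier.HAIKU,
    "critique": Tier.OPUS,
    "fresh_review": Tier.OPUS,
    "arbiter": Tier.OPUS,
}


def tier_for(stage: str) -> Tier:
    # Fail loudly on an unrouted stage rather than silently picking a default.
    try:
        return STAGE_TIERS[stage]
    except KeyError:
        raise ValueError(f"no model tier configured for stage {stage!r}") from None
```

Raising on an unknown stage is a design choice of the sketch: a silent fallback to the cheapest tier could quietly weaken a judgment-heavy stage.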
The Key Insight
The quality pipeline isn't a feature you enable — it's a structural guarantee. Every task triggers it. Every piece of code passes through it. The pipeline evaluates only code quality — it never adds features or expands scope. It is the immune system of the codebase.
And when the pipeline catches a novel error pattern? That feeds into the error-to-rule pipeline, generating a permanent preventive rule. The system doesn't just catch problems — it evolves to prevent them.
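The text does not describe how the error-to-rule pipeline works internally, so the following is only a toy illustration of the idea: a rule is recorded the first time a pattern is seen and then stays in place. `RuleBook` and its methods are hypothetical.

```python
from typing import Dict, Optional


class RuleBook:
    """Toy registry mapping an error pattern to a preventive rule."""

    def __init__(self) -> None:
        self._rules: Dict[str, str] = {}

    def record(self, pattern: str, rule: str) -> bool:
        """Store a rule for a pattern; return True only if the pattern was novel."""
        if pattern in self._rules:
            return False  # already covered, keep the existing rule
        self._rules[pattern] = rule
        return True

    def rule_for(self, pattern: str) -> Optional[str]:
        return self._rules.get(pattern)
```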