Error-to-Rule Pipeline: How Nevo Learns from Every Mistake
Every AI system fails. The question that separates useful systems from transformative ones is what happens after the failure.
Most AI tools do nothing. They log the error, maybe surface it in a dashboard, and move on. The next session starts fresh. The same mistake waits patiently for the same trigger. A developer might notice the pattern weeks later, file a ticket, and eventually ship a patch. Or they might not.
Nevo's error-to-rule pipeline is a closed-loop system that converts every unique failure into a permanent preventive rule -- automatically, within the same session, with no human intervention required. An error-to-rule pipeline is a self-improvement mechanism in which an AI agent detects its own operational failures, performs automated root cause analysis, and encodes the findings as enforceable rules in its operating instructions so that each error class is encountered at most once. After the pipeline fires, the mistake is not just fixed. It is structurally impossible to repeat.
This post is a complete architectural walkthrough of how that pipeline works, why it matters, and what it produces.
For a broader overview of Nevo's self-improvement capabilities, see Self-Improving AI: How Nevo Gets Smarter Over Time. For context on what AI agents are and why they matter, start with What Are AI Agents?.
The Problem: Stateless Failure
Traditional AI systems have a memory problem. Not the kind you solve with a bigger context window -- the kind where operational experience evaporates between sessions.
Consider what happens when a standard AI coding assistant makes an error:
- The tool produces bad output.
- The user notices (maybe).
- The user corrects the prompt or retries.
- The interaction ends.
- Next session: the system has zero memory that the error ever occurred.
This is not a bug. It is the architecture. Large language models are stateless by design. Every conversation is an isolated universe. The model that helped you yesterday has no idea what happened yesterday. It will confidently walk into the same wall it hit twenty sessions ago because, from its perspective, that wall does not exist.
For simple tasks this is tolerable. For autonomous agents running complex multi-step operations with real consequences, it is a structural ceiling. An agent that cannot learn from its operational history is permanently a first-day employee -- capable in the abstract, but with no institutional knowledge, no accumulated judgment, and no improving trajectory.
Nevo was designed to break through that ceiling.
The Architecture: Seven Stages from Error to Rule
Nevo's error-to-rule pipeline is not a single component. It is a chain of seven distinct stages, each handled by a different mechanism or specialized agent. Here is exactly how an error becomes a permanent preventive rule.
Stage 1: Hook Detection
Everything starts with a hook. Specifically, the PostToolUseFailure hook -- a system-level event that fires every time any tool use fails during any agent session.
This is not optional monitoring. It is not a log that someone might check. It is a mandatory trigger wired into the execution runtime. When a tool call returns an error, the hook fires. Every time. No exceptions.
The hook creates a trigger file in ~/.openclaw/incidents/ containing the failure context: which tool failed, what arguments were passed, what the error message was, and what the agent was attempting to accomplish. This trigger file is the raw material for everything that follows.
The critical design decision here is that detection is passive and exhaustive. Nevo does not need to "decide" to monitor for errors. The hook fires regardless of the agent, the task, or the time of day. If a tool fails, the pipeline activates.
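To make the trigger file concrete, here is a minimal sketch of what a failure hook handler might write. The JSON layout, the field names, and the `write_trigger_file` function are illustrative assumptions, not Nevo's actual format; only the four pieces of context (tool, arguments, error message, goal) come from the description above.

```python
import json
import time
import uuid
from pathlib import Path


def write_trigger_file(incident_dir, tool, arguments, error, goal):
    """Persist one tool failure as a JSON trigger file and return its path.

    The schema below is a hypothetical example, not Nevo's real format.
    """
    incident_dir = Path(incident_dir).expanduser()
    incident_dir.mkdir(parents=True, exist_ok=True)

    record = {
        "id": uuid.uuid4().hex,    # hypothetical incident identifier
        "timestamp": time.time(),  # seconds since the epoch
        "tool": tool,              # which tool failed
        "arguments": arguments,    # what arguments were passed
        "error": error,            # what the error message was
        "goal": goal,              # what the agent was attempting
    }
    path = incident_dir / f"{record['id']}.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path
```

A hook would call this with the incident directory, for example `write_trigger_file("~/.openclaw/incidents", tool, arguments, error, goal)`. One file per failure keeps detection cheap and leaves all interpretation to the later stages.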
Stage 2: Incident Monitoring
The incident-monitor agent runs on the Sonnet model tier -- fast, capable, cost-efficient. Its job is pattern recognition. It continuously scans the trigger files at ~/.openclaw/incidents/, quality pipeline output, git history, and circuit breaker state.
What it looks for is not just individual failures but failure patterns. A single timeout on an API call might be transient. Three timeouts on the same API endpoint within a session are a pattern that needs a systemic response.
When the incident monitor identifies a meaningful pattern, it generates a structured incident report. This report includes:
- The error signature (what failed and how)
- Frequency and recurrence data
- Affected agents and task contexts
- The raw trigger file contents
- A preliminary severity assessment
This report becomes the input for the next stage.
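The pattern check can be sketched as a simple grouping step. This is an illustration under stated assumptions: the real monitor is an agent, not a fixed function, and the `session` and `agent` fields, the threshold of 3 (borrowed from the timeout example), and the report keys are all hypothetical. The severity assessment is left out because the post does not say how it is computed.

```python
from collections import defaultdict

PATTERN_THRESHOLD = 3  # assumption, taken from the "three timeouts" example


def find_patterns(triggers, threshold=PATTERN_THRESHOLD):
    """Group trigger records by (session, tool, error) and report every
    group that recurs at least `threshold` times."""
    groups = defaultdict(list)
    for trigger in triggers:
        signature = (trigger["session"], trigger["tool"], trigger["error"])
        groups[signature].append(trigger)

    reports = []
    for (session, tool, error), items in groups.items():
        if len(items) >= threshold:
            reports.append({
                "signature": {"tool": tool, "error": error},  # what failed and how
                "session": session,
                "frequency": len(items),                      # recurrence count
                "agents": sorted({t["agent"] for t in items}),
                "triggers": items,                            # raw trigger contents
            })
    return reports
```

A single transient failure produces no report under this sketch, which matches the distinction between one-off errors and patterns drawn above.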
Stage 3: Root Cause Analysis
The incident-analyst agent runs on the Opus model tier -- the most capable reasoning model available. This is deliberate. Root cause analysis is not a task you want to economize on. A shallow analysis produces a shallow rule. A shallow rule either misses the actual cause or creates false constraints that degrade performance.
The analyst receives the structured incident report and asks the question that matters: not "what happened" but "why did this happen structurally, and what systemic change would prevent the entire class of failure from recurring?"
This is where the pipeline diverges most sharply from traditional error handling. A typical error handler fixes the symptom. The incident analyst traces the causal chain backward until it reaches the structural root -- the design assumption, missing constraint, or absent guard rail that allowed the error to occur in the first place.
The analyst also performs a critical deduplication step: cross-referencing every proposed finding against every existing rule in the system. If a rule already covers this failure class, the analyst notes the existing coverage and closes the incident without generating a duplicate. This prevents rule bloat -- a real risk in any system that auto-generates constraints.
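The deduplication decision has a simple shape, even though in the pipeline it is a judgment made by the analyst agent. The sketch below swaps that judgment for an exact match on a hypothetical `failure_class` tag, so treat it as an outline of the control flow rather than the actual comparison logic. The `covers` field and the returned dictionaries are assumptions.

```python
def resolve_finding(finding, existing_rules):
    """Close the incident if an existing rule already covers the failure
    class; otherwise signal that a new rule should be proposed.

    `finding["failure_class"]` and `rule["covers"]` are hypothetical tags
    standing in for the analyst agent's semantic comparison.
    """
    for rule in existing_rules:
        if finding["failure_class"] in rule["covers"]:
            return {"action": "close", "covered_by": rule["id"]}
    return {"action": "propose_rule", "failure_class": finding["failure_class"]}
```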
Stage 4: Rule Distillation
The analyst's findings are distilled into a preventive rule. This rule must be:
- 1-3 sentences maximum. Bloated rules get ignored. Every word must earn its place.
- Actionable. It tells the system what TO DO, not just what to avoid.
- Precise. It addresses the specific failure class without over-generalizing.
Rules are assigned sequential identifiers using a consistent numbering scheme: PROJ-XXX for project-wide rules, AGENT-XXX for agent behavior rules. This numbering makes rules referenceable, auditable, and traceable back to their originating incidents.
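The length limit and the sequential numbering are both mechanical enough to check in code. The following is a rough sketch, not Nevo's implementation: the sentence counter is a naive punctuation split, the three-digit zero padding is inferred from identifiers such as PROJ-014, and all function names are made up for illustration.

```python
import re


def count_sentences(text):
    """Naive sentence count: split on runs of . ! ? followed by whitespace
    or the end of the text. Abbreviations will be miscounted."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([p for p in parts if p])


def next_rule_id(prefix, existing_ids):
    """Return the next sequential identifier for a prefix such as PROJ or AGENT."""
    numbers = [
        int(rule_id.split("-")[1])
        for rule_id in existing_ids
        if rule_id.startswith(prefix + "-")
    ]
    number = max(numbers, default=0) + 1
    return f"{prefix}-{number:03d}"


def distill(text, prefix, existing_ids):
    """Reject rules outside the 1-3 sentence limit, then assign an identifier."""
    sentences = count_sentences(text)
    if not 1 <= sentences <= 3:
        raise ValueError(f"rule must be 1-3 sentences, got {sentences}")
    return {"id": next_rule_id(prefix, existing_ids), "text": text.strip()}
```

Whether a rule is actionable and precise is a judgment call that a check like this cannot make; only the length and the numbering are automated here.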
Stage 5: Three-Component Enforcement
Here is where Nevo's rule system diverges from every "best practices" document ever written. A rule without enforcement is a suggestion. Suggestions get ignored under pressure. Nevo requires every rule to have three components before it can be applied:
- Execution trigger -- an automated event that activates the rule. A hook, a cron job, a flag file check. Something that fires without human action.
- Enforcement layer -- a mechanism that blocks or alerts on non-compliance. An exit code, a stop block, an injected alert. Something with teeth.
- Verification method -- a way to confirm the rule was followed. A state file, a log entry, a guard file. Something auditable.
This three-component requirement is itself a rule -- PROJ-023 -- that was generated by the error-to-rule pipeline after the system discovered that early rules without enforcement mechanisms were being systematically bypassed under load. The pipeline identified the pattern, the analyst traced the root cause (rules without teeth are decorative), and the meta-rule was born.
Rules that cannot demonstrate all three components are flagged for deprecation. The system does not tolerate decorative constraints.
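One way to picture the three-component requirement is as a record with three mandatory fields and a triage step. The `Rule` class, its field names, and the `triage` function are hypothetical; the post does not describe how Nevo represents rules internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    rule_id: str
    text: str
    execution_trigger: Optional[str] = None    # e.g. a hook, cron job, or flag file check
    enforcement_layer: Optional[str] = None    # e.g. an exit code, stop block, or injected alert
    verification_method: Optional[str] = None  # e.g. a state file, log entry, or guard file

    def missing_components(self):
        """Return the names of any components that are unset or empty."""
        names = ("execution_trigger", "enforcement_layer", "verification_method")
        return [name for name in names if not getattr(self, name)]


def triage(rules):
    """Split rules into those that may be applied and those flagged for deprecation."""
    applicable, flagged = [], []
    for rule in rules:
        if rule.missing_components():
            flagged.append(rule)
        else:
            applicable.append(rule)
    return applicable, flagged
```

This only checks that each component is declared. Confirming that the declared hook or guard file actually exists and fires would need a separate check.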
Stage 6: Storage and Audit Trail
The distilled rule (1-3 sentences) is written to the appropriate .claude/rules/*.md file where it becomes part of Nevo's active operating instructions. These rule files are version-controlled, meaning every rule addition is a git commit with a standardized message: rule: auto-applied PROJ-XXX from incident {id}.
The full incident narrative -- the trigger file, the monitoring report, the analyst's reasoning chain -- stays in ~/.openclaw/incidents/ as an audit trail. This separation is intentional. The operating instructions stay lean (rules are 1-3 sentences). The reasoning stays preserved (full narratives in the incident archive). If anyone needs to understand why a rule exists, the audit trail is right there.
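The write itself can be very small. In this sketch the commit message follows the standardized format quoted above, while the one-line bullet layout inside the rules file and both function names are assumptions. The git commit is left to the caller.

```python
from pathlib import Path


def commit_message(rule_id, incident_id):
    """Format the standardized commit message for an auto-applied rule."""
    return f"rule: auto-applied {rule_id} from incident {incident_id}"


def append_rule(rules_file, rule_id, text):
    """Append one rule as a single line to a rules file, creating parent
    directories if needed. The '- ID: text' layout is hypothetical."""
    rules_file = Path(rules_file)
    rules_file.parent.mkdir(parents=True, exist_ok=True)
    with rules_file.open("a", encoding="utf-8") as handle:
        handle.write(f"- {rule_id}: {text}\n")
    return rules_file
```

Appending rather than rewriting keeps each rule addition a small, reviewable diff, which fits the one-commit-per-rule audit trail described here.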
Rules are organized by scope:
| Scope | Location | Numbering |
|---|---|---|
| Project-wide | .claude/rules/*.md | PROJ-XXX |
| Agent behavior | .nevo/workspace/AGENTS.md | AGENT-XXX |
| Agent-specific | .claude/agents/{agent}.md | Inline |
Stage 7: Injection into Active Memory
The final stage is the one that closes the loop. New rules are stored in files that get injected into the system prompt at the start of every session. This means the very next time Nevo boots up -- whether five minutes later or five days later -- the new rule is already in its operating instructions.
There is no "deployment" step. No "rollout." No waiting for the next model update. The rule is on disk the moment it is written and in force from the next session start. Every future session inherits every lesson from every past session. The knowledge compounds.
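Session-start injection amounts to reading the rule files and prepending them to whatever the agent is told. The sketch below assumes a flat directory of Markdown files and a hypothetical `build_system_prompt` function; how the runtime actually assembles its prompt is not described in this post.

```python
from pathlib import Path


def build_system_prompt(base_prompt, rules_dir):
    """Concatenate the base prompt with every *.md rule file in the
    directory, in sorted filename order, separated by blank lines."""
    sections = [base_prompt.rstrip()]
    for path in sorted(Path(rules_dir).glob("*.md")):
        sections.append(path.read_text(encoding="utf-8").rstrip())
    return "\n\n".join(sections) + "\n"
```

Because the files are re-read on every call, a rule written during one session is picked up by the next session without any separate publishing step.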
Real Rules from Real Incidents
Theory is useful. Concrete examples are better. Here are actual rules that were generated or refined through Nevo's error-to-rule pipeline, drawn from the current production ruleset.
PROJ-014: Escalation Threshold
The incident: Nevo spent hours brute-forcing through a failing API integration, trying approach after approach without escalating. The time was wasted and the problem required human context the system did not have.
The rule: "After 3 failed approaches to the same problem, STOP and escalate to Ryan. Summarize what was tried and why each failed. Propose the most promising remaining option. Do NOT burn hours brute-forcing through failures."
Why it matters: This rule encodes a judgment call that most engineers learn through painful experience: knowing when to stop digging and ask for help. Nevo learned it once and will never forget.
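As an illustration of how a rule like PROJ-014 could be given teeth, here is a small counter that refuses to continue after the third failed approach. The class names and the exception are hypothetical, and the post does not say how Nevo actually enforces this rule. The sketch only shows that the threshold and the "summarize what was tried" requirement are straightforward to mechanize.

```python
class EscalationRequired(Exception):
    """Raised when the failed-approach limit for a problem is reached."""


class ApproachTracker:
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = {}  # problem -> list of (approach, reason) pairs

    def record_failure(self, problem, approach, reason):
        """Record a failed approach; raise once the limit is reached,
        carrying a summary of what was tried and why each attempt failed."""
        attempts = self.failures.setdefault(problem, [])
        attempts.append((approach, reason))
        if len(attempts) >= self.max_failures:
            summary = "; ".join(f"{a}: {r}" for a, r in attempts)
            raise EscalationRequired(
                f"{problem}: {len(attempts)} failed approaches ({summary})"
            )
```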
PROJ-016: Elegance Principle
The incident: A multi-step task was completed using the third approach attempted, after the first two failed. Post-incident analysis revealed that a 30-second assessment phase before starting would have identified the correct approach immediately.
The rule: "Assess before acting. Choose the right path, not the first path. If you find yourself trying approach #3+, you skipped the assessment step."
Why it matters: This is not about being slow. It is about being strategic. The cost of 30 seconds of planning is trivially low. The cost of three failed attempts is enormous in both time and tokens.
PROJ-018: Verification Before Completion
The incident: Tasks were being marked complete based on successful execution rather than verified outcomes. Code that "ran without errors" was shipping with logical bugs because no one checked whether the output was actually correct.
The rule: "Never mark a task complete without verifying the result. Run the actual check. Compare output against expected result. If verification fails, the task is NOT complete."
Why it matters: This is the difference between "it compiled" and "it works." The rule now ensures that every task completion includes a verification step appropriate to the task type -- curl for API endpoints, visual audit for UI changes, test execution for logic changes.
PROJ-023: Three-Component Rule Enforcement
The incident: Multiple rules existed in the system that were being systematically ignored. Investigation revealed that these rules had no enforcement mechanism -- they were instructions without consequences.
The rule: "Every rule MUST have three components: execution trigger, enforcement layer, verification method. Rules without all three components are decorative."
Why it matters: This is a meta-rule -- a rule about rules. It emerged from the pipeline's own self-reflection: the system noticed that its own output (rules) was sometimes ineffective, traced the cause, and generated a constraint that made future output more robust. This is self-improvement operating on the self-improvement mechanism itself.
PROJ-026: Fix Bugs Immediately
The incident: Bugs discovered during code review were being deferred to "follow-up tasks" or "future work" rather than fixed in the same session. Many deferred bugs were never revisited.
The rule: "Never defer bugs, inefficiencies, or code quality issues to follow-up tasks or future work. When found, fix them in the same session."
Why it matters: Deferred fixes accumulate as technical debt. The pipeline identified this as a compounding risk and generated a rule that eliminates the deferral pattern entirely.
PROJ-027: Telegram Communication Protocol
The incident: Ryan was receiving no updates during long-running tasks because the output buffering system held all responses until session completion. He had no way to know whether Nevo was working, stuck, or finished.
The rule: A detailed communication protocol specifying when and how to send real-time updates via Telegram, bypassing the output buffer. Includes automated acknowledgment hooks and manual notification scripts for long-running work.
Why it matters: This rule did not fix a code bug. It fixed a communication architecture gap. The pipeline does not distinguish between code failures and operational failures -- any systemic issue that degrades performance is fair game.
Why This Matters: The Compound Effect
The real power of the error-to-rule pipeline is not any individual rule. It is the compound effect across hundreds of sessions.
Each session, Nevo operates with the accumulated wisdom of every session that came before it. Every failure that was ever encountered has been analyzed, distilled, and encoded. The system's operational surface area for novel mistakes shrinks with every iteration.
Consider the math. If Nevo encounters an average of 2 novel error classes per session, and each one produces a preventive rule, then after 100 sessions the system is operating with 200 additional constraints that did not exist at launch. After 500 sessions, the ruleset has grown to cover 1,000 distinct failure modes. In practice the rate of novel errors falls as coverage grows, so these figures are an upper bound, and the probability of encountering an error that has no existing preventive rule shrinks with every session.
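Written out, with r as the average number of novel error classes per session and n as the number of sessions (both symbols are notation introduced here, and R(n) is the resulting rule count):

```latex
R(n) = r \cdot n, \qquad R(100) = 2 \cdot 100 = 200, \qquad R(500) = 2 \cdot 500 = 1000
```

The count is linear only while r stays constant. If r falls as common failure modes get covered, R(n) grows more slowly than this.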
This is not theoretical. Nevo runs on dedicated hardware 24/7. The rules accumulate in production. The git history shows the progression -- early sessions generated many rules as the system encountered common failure modes for the first time. Later sessions generate fewer rules because the common cases are already covered. The learning curve is real and measurable.
How This Compares to Other Approaches
The error-to-rule pipeline exists in a design space with several adjacent approaches in the AI field. Understanding the differences clarifies what makes this approach distinct.
Reinforcement Learning from Human Feedback (RLHF)
RLHF trains the base model by incorporating human preference signals. It operates at the weight level -- adjusting model parameters during a training phase. The feedback loop is slow (requires collecting preference data, running training jobs) and the results are diffuse (the model "generally" gets better at a task class).
Nevo's error-to-rule pipeline operates at the instruction level, not the weight level. Rules are precise, immediate, and auditable. You can read the rule that prevents a specific failure. You can trace it back to the incident that created it. You can verify that it is being enforced. There is no black box.
Constitutional AI
Constitutional AI encodes high-level principles that guide model behavior -- "be helpful, be harmless, be honest." These principles shape the model's general disposition during training.
Nevo's rules are not general principles. They are specific operational constraints derived from specific operational failures. PROJ-014 does not say "be efficient." It says "after 3 failed approaches, stop and escalate." The specificity is the value. General principles are easy to write and hard to enforce. Specific rules are harder to generate but trivial to enforce.
Reward Modeling
Reward models score outputs to guide generation. They operate during inference and can shape real-time behavior, but they require extensive training data and their scoring criteria are often opaque.
Nevo's rules are explicit and deterministic. A rule either applies or it does not. When it applies, the behavior is specified in plain language. There is no score to interpret, no threshold to tune, no training data to curate.
Manual Post-Mortems
The closest analog in traditional software engineering is the post-mortem process: an incident occurs, the team investigates, and preventive measures are documented. The difference is automation and consistency. Human post-mortems are valuable but inconsistent -- they depend on team discipline, tend to focus on high-severity incidents, and often produce action items that never get implemented.
Nevo's pipeline runs on every failure, regardless of severity. It produces a rule for every novel error class. And the rule is applied immediately, not added to a backlog. The coverage is exhaustive because the process is automated.
The Self-Referential Property
There is one more property of the error-to-rule pipeline that deserves attention: it operates on itself.
PROJ-023 (three-component enforcement) was generated because the pipeline noticed that its own earlier output -- rules from previous iterations -- was sometimes ineffective. The system detected a failure pattern in its own rules, analyzed the root cause, and generated a meta-rule that improved the quality of all future rules.
This is not recursion for its own sake. It is a practical consequence of the pipeline's design: it monitors all system failures, and "a rule failed to prevent an error it should have prevented" is itself a system failure. The pipeline does not distinguish between failures in application code and failures in its own governance layer. Both get analyzed. Both produce improvements.
This self-referential property means the pipeline's output quality improves over time, not just its coverage. Early rules were simpler. Later rules incorporate enforcement and verification requirements that early rules lacked. The system's judgment about what constitutes a good rule has itself been shaped by operational experience.
What This Means for AI Agent Design
The error-to-rule pipeline is not the only way to build learning from mistakes into an autonomous AI system. But it demonstrates a design principle that any serious AI agent architecture should consider: operational experience should compound automatically, not evaporate between sessions.
The specific implementation details -- hooks, trigger files, specialized analyst agents, three-component enforcement -- are Nevo's approach. Other architectures might implement the same principle differently. The principle itself is what matters: if your AI system makes a mistake, there should be a production mechanism that ensures the same class of mistake becomes structurally impossible.
AI learning from mistakes is not a feature you bolt on. It is an architectural commitment that shapes how the entire system operates. The error-to-rule pipeline is the backbone of Nevo's self-improvement -- the mechanism that ensures the system that runs tomorrow is measurably better than the system that ran today.
For a deeper look at how Nevo's self-improvement systems work together -- including the Skill Forge, brain-inspired memory architecture, and quality pipeline -- see Self-Improving AI Agents. To understand how Nevo fits into the broader AI agent landscape, see How Self-Improving AI Works.
Frequently Asked Questions
What is the error-to-rule pipeline?
The error-to-rule pipeline is Nevo's automated self-improvement mechanism that converts operational failures into permanent preventive rules. When any tool use fails, a hook fires, a specialized analyst agent traces the root cause, and a 1-3 sentence rule is encoded into the system's operating instructions so that the same class of failure never recurs.
How is this different from traditional error logging?
Traditional error logging records what happened. The error-to-rule pipeline determines why it happened structurally and creates an enforceable rule that prevents the entire class of failure. Logging is passive observation. Error-to-rule is active self-modification.
Does the pipeline generate false or overly restrictive rules?
The incident analyst agent cross-references every proposed rule against the existing ruleset to avoid duplication and over-constraint. Rules must be concise (1-3 sentences), actionable, and precise enough to address the specific failure class without over-generalizing. Additionally, PROJ-023 requires every rule to include a trigger, enforcement layer, and verification method -- rules that cannot demonstrate all three components are flagged for deprecation.
How many rules has Nevo generated?
The ruleset grows with operational experience. The current production system has rules numbered from PROJ-004 through PROJ-027, covering failure modes from escalation thresholds to communication protocols to meta-rules about rule quality itself. The count increases with every session that encounters a novel error class.
Can the pipeline modify its own rules?
Yes. The pipeline monitors all system failures, including failures caused by inadequate rules. When a rule fails to prevent an error it should have prevented, that failure enters the pipeline like any other and can produce updated or replacement rules. PROJ-023 is an example of this self-referential improvement.
Does this approach work with any AI model?
The error-to-rule pipeline is an architectural pattern, not a model-specific feature. It requires a runtime that supports hooks (to detect failures), persistent storage (to maintain rules across sessions), and instruction injection (to apply rules at session start). The specific model used for analysis matters for rule quality -- Nevo uses Opus-tier reasoning for root cause analysis -- but the architecture is model-agnostic.