March 1, 2026|Nevo

AI Agent Testing and Debugging: A Practical Guide

To test an AI agent is to verify that a non-deterministic system -- one built on probabilistic language models, external tool calls, and multi-step reasoning loops -- produces reliable, correct, and safe behavior across the range of inputs it will encounter in production.

That sentence contains the reason most teams struggle with agent testing. Non-deterministic means the same input can produce different outputs. External tool calls mean behavior depends on services you do not control. Multi-step reasoning means errors compound through loops rather than failing at a single point. Traditional unit tests assert that function A returns value B. Agent tests must assert that a system capable of choosing its own actions will choose the right ones consistently enough to trust.

This guide covers all of it: unit testing tool implementations, integration and end-to-end testing with mock LLM responses, regression testing with golden datasets, performance profiling, and the debugging techniques that make agent failures diagnosable rather than mysterious.

If you are new to agent architecture, start with What Are AI Agents? For a detailed breakdown of the components you will be testing, see AI Agent Components.

Why Agent Testing Is Different

Traditional software testing rests on determinism: you call a function with known inputs and assert that it returns a known output. AI agents break that assumption in four ways.

Non-deterministic outputs. An LLM asked the same question twice may give structurally different but equally correct answers. You cannot write assert response == "exact string" for most agent behaviors.

Emergent control flow. The agent decides its own execution path. A coding agent given a bug report might read the file, then edit, then test -- or start by running tests and working backward. Both paths are valid. Testing must evaluate the outcome, not the sequence.

Tool side effects. Agents interact with real systems: file systems, APIs, databases. A test that lets an agent call a real API is not a unit test -- it is an experiment. Side effects must be mocked, sandboxed, or recorded.

Compounding errors. In a 15-step agent loop, a subtle mistake at step 3 can cascade into a completely wrong result at step 15. Debugging requires tracing the full reasoning chain, not just checking the final output.
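A rough back-of-the-envelope calculation shows why compounding matters. The sketch below assumes each step succeeds independently with the same probability, which is a simplification (real agent steps are correlated and can sometimes recover from earlier mistakes); the function name is illustrative.

```python
def chain_success_probability(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps

# Compare a few per-step reliabilities over a 15-step loop
for p in (0.99, 0.95, 0.90):
    chain = chain_success_probability(p, 15)
    print(f"per-step {p:.2f} -> 15-step chain {chain:.2f}")
```

Under that assumption, a step that is right 95% of the time yields a 15-step chain that is right less than half the time, which is why per-step tracing matters more than inspecting the final answer.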

These properties do not make testing impossible. They make it essential. A non-deterministic agent that is subtly wrong fails intermittently and silently -- the worst kind of bug. Rigorous testing is how you catch it.

Strategy 1: Unit Testing Tool Implementations

Tools are the most testable part of any agent system. A tool takes structured input and returns structured output. Start here.

Every tool should have tests covering four areas: happy path (valid inputs produce correct outputs), input validation (malformed inputs are rejected cleanly), error handling (API failures produce structured error responses), and edge cases (empty inputs, binary files, Unicode).
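The error-handling area is the one most often skipped, because it seems to need a failing network. One way to cover it, sketched here with the standard library's unittest.mock, is to inject the client the tool depends on and make the mock raise. WeatherTool, get_json, and the URL path are hypothetical names invented for this illustration, not part of any real library.

```python
from unittest.mock import Mock

class WeatherTool:
    """Hypothetical tool wrapping an injected HTTP client (illustrative only)."""
    def __init__(self, http_client):
        self.http = http_client

    def execute(self, args):
        try:
            payload = self.http.get_json(f"/weather?city={args['city']}")
        except ConnectionError as exc:
            # Convert the exception into a structured error the agent can parse
            return {"success": False, "error": str(exc)}
        return {"success": True, "content": payload["summary"]}

def test_weather_tool_success():
    client = Mock()
    client.get_json.return_value = {"summary": "sunny"}
    result = WeatherTool(client).execute({"city": "Oslo"})
    assert result == {"success": True, "content": "sunny"}
    client.get_json.assert_called_once_with("/weather?city=Oslo")

def test_weather_tool_network_failure():
    client = Mock()
    client.get_json.side_effect = ConnectionError("timeout")
    result = WeatherTool(client).execute({"city": "Oslo"})
    assert result["success"] is False
    assert "timeout" in result["error"]
```

Because the client is passed in rather than created inside the tool, no real request is ever made and the failure path is fully deterministic.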

Pytest Example: Testing a File Read Tool

import pytest
from agent.tools.file_reader import FileReaderTool

@pytest.fixture
def tool():
    return FileReaderTool(allowed_dirs=["/workspace"])

class TestFileReaderTool:
    def test_reads_file_contents(self, tmp_path):
        # Build a tool scoped to pytest's tmp_path; the shared fixture only allows /workspace
        test_file = tmp_path / "example.py"
        test_file.write_text("def hello():\n    return 'world'")
        tool_instance = FileReaderTool(allowed_dirs=[str(tmp_path)])
        result = tool_instance.execute({"path": str(test_file)})
        assert result["success"] is True
        assert "def hello():" in result["content"]

    def test_rejects_path_outside_allowed_dirs(self, tool):
        result = tool.execute({"path": "/etc/passwd"})
        assert result["success"] is False
        assert "not allowed" in result["error"].lower()

    def test_handles_missing_file(self, tool):
        result = tool.execute({"path": "/workspace/nonexistent.py"})
        assert result["success"] is False

    def test_rejects_empty_path(self, tool):
        result = tool.execute({"path": ""})
        assert result["success"] is False
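The tests above cover the happy path, validation, and a missing file, but not the Unicode, binary, or path-traversal edge cases. The sketch below shows what those checks might look like. Since FileReaderTool's implementation is not shown, it uses SimpleFileReader, a hypothetical stand-in that follows the same execute() contract, so the example can run on its own.

```python
import os
import tempfile
from pathlib import Path

class SimpleFileReader:
    """Hypothetical stand-in for FileReaderTool, small enough to run here."""
    def __init__(self, allowed_dirs):
        self.allowed = [os.path.realpath(d) for d in allowed_dirs]

    def execute(self, args):
        raw = args.get("path", "")
        if not raw:
            return {"success": False, "error": "path is required"}
        path = os.path.realpath(raw)  # resolves ".." and symlinks before checking
        if not any(os.path.commonpath([path, d]) == d for d in self.allowed):
            return {"success": False, "error": "path not allowed"}
        try:
            return {"success": True, "content": Path(path).read_text(encoding="utf-8")}
        except UnicodeDecodeError:
            return {"success": False, "error": "binary or non-UTF-8 file"}
        except OSError as exc:
            return {"success": False, "error": str(exc)}

def run_edge_case_checks():
    with tempfile.TemporaryDirectory() as root:
        tool = SimpleFileReader(allowed_dirs=[root])

        # Unicode content round-trips unchanged
        text_file = Path(root) / "greeting.txt"
        text_file.write_text("héllo wörld ✓", encoding="utf-8")
        assert tool.execute({"path": str(text_file)})["content"] == "héllo wörld ✓"

        # Binary content yields a structured error, not an exception
        binary_file = Path(root) / "blob.bin"
        binary_file.write_bytes(b"\xff\xfe\x00\x01")
        assert tool.execute({"path": str(binary_file)})["success"] is False

        # ".." cannot be used to escape the allowed directory
        escape = os.path.join(root, "..", "outside.txt")
        assert tool.execute({"path": escape})["success"] is False

run_edge_case_checks()
```

The same three checks can be ported to the real tool by swapping in FileReaderTool and pytest's tmp_path fixture.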

Testing Tool Schemas

Tools expose JSON schemas that the LLM uses to construct calls. A malformed schema means the LLM generates invalid calls, and the failure looks like an agent reasoning problem rather than a schema problem:

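import jsonschema
import pytest

from agent.tools.file_reader import FileReaderTool

def test_file_reader_schema_is_valid():
    schema = FileReaderTool.input_schema()
    # The schema itself must be well-formed Draft 7 JSON Schema
    jsonschema.Draft7Validator.check_schema(schema)
    # A valid call passes...
    jsonschema.validate({"path": "/workspace/main.py"}, schema)
    # ...and a call missing the required "path" is rejected
    with pytest.raises(jsonschema.ValidationError):
        jsonschema.validate({}, schema)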

Strategy 2: Integration Testing Agent-Tool Interactions

Unit tests verify tools work in isolation. Integration tests verify the agent can actually use them -- the full cycle of deciding which tool to call, constructing arguments, receiving results, and incorporating them into reasoning.

The key insight: mock the LLM but use real (sandboxed) tools. This isolates tests from LLM non-determinism while exercising the actual tool integration layer:

from agent.core import AgentLoop
from agent.tools import FileReaderTool, FileWriterTool

class MockLLM:
    def __init__(self, responses):
        self.responses = iter(responses)
    def generate(self, messages, tools):
        return next(self.responses)

def test_agent_reads_then_writes(tmp_path):
    workspace = tmp_path / "workspace"
    workspace.mkdir()
    (workspace / "buggy.py").write_text("def add(a, b):\n    return a - b\n")

    mock = MockLLM([
        {"tool_calls": [{"name": "read_file", "args": {"path": str(workspace / "buggy.py")}}]},
        {"tool_calls": [{"name": "write_file", "args": {
            "path": str(workspace / "buggy.py"),
            "content": "def add(a, b):\n    return a + b\n"
        }}]},
        {"content": "Fixed the bug: changed - to + in the add function."}
    ])

    agent = AgentLoop(
        llm=mock,
        tools=[FileReaderTool(allowed_dirs=[str(workspace)]),
               FileWriterTool(allowed_dirs=[str(workspace)])]
    )
    result = agent.run("Fix the bug in buggy.py")

    fixed = (workspace / "buggy.py").read_text()
    assert "return a + b" in fixed
    assert "return a - b" not in fixed

Strategy 3: End-to-End Testing with Mock LLM Responses

The most effective approach is recorded response testing: run the agent once against a real LLM, capture the full exchange, then replay those responses in tests. The result is a deterministic test grounded in real agent behavior:

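import json

from agent.core import AgentLoop

class RecordedLLM:
    """Replays LLM responses captured from a real run, in order."""
    def __init__(self, recording_path: str):
        with open(recording_path) as f:
            self.recording = json.load(f)
        self.call_index = 0

    def generate(self, messages, tools=None):
        exchanges = self.recording["exchanges"]
        if self.call_index >= len(exchanges):
            # More LLM calls than were recorded means the agent has diverged
            raise AssertionError("agent requested more LLM calls than the recording contains")
        exchange = exchanges[self.call_index]
        self.call_index += 1
        return exchange["response"]

def test_bug_fix_workflow_e2e():
    llm = RecordedLLM("tests/recordings/bug_fix_workflow.json")
    # load_all_tools() is assumed to be a project helper returning every registered tool
    agent = AgentLoop(llm=llm, tools=load_all_tools())
    result = agent.run("The /api/users endpoint returns 500 on empty query")

    assert result.status == "completed"
    assert "fix" in result.summary.lower()
    assert agent.tool_call_count <= 10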

Build recording into your agent loop from the start. Every production agent should have a record=True mode that captures every LLM exchange with timestamps and tool availability metadata. These recordings become your regression suite -- and your most powerful debugging tool when something breaks after a model update.
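One way to get a record mode without touching the agent loop itself is a thin wrapper around the LLM client. The sketch below is a minimal version under a few assumptions: the wrapped client exposes generate(messages, tools), messages and responses are JSON-serializable, and tools either have a name attribute or a usable string form. RecordingLLM is a hypothetical name; the output layout matches the "exchanges" list that RecordedLLM reads.

```python
import copy
import json
import time

class RecordingLLM:
    """Wraps a client exposing generate(messages, tools) and logs each exchange."""
    def __init__(self, inner, recording_path):
        self.inner = inner
        self.recording_path = recording_path
        self.exchanges = []

    def generate(self, messages, tools=None):
        response = self.inner.generate(messages, tools)
        self.exchanges.append({
            "timestamp": time.time(),
            # Deep copy so later mutation of the history does not alter the record
            "messages": copy.deepcopy(messages),
            "tools_available": [getattr(t, "name", str(t)) for t in (tools or [])],
            "response": response,
        })
        return response

    def save(self):
        """Write the recording in the layout RecordedLLM expects."""
        with open(self.recording_path, "w") as f:
            json.dump({"exchanges": self.exchanges}, f, indent=2)
```

Calling save() at the end of a run produces a file that can be replayed directly in an end-to-end test.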

Strategy 4: Regression Testing with Golden Datasets

Non-deterministic systems need statistical testing. A golden dataset is a curated collection of input-output pairs where you know what correct looks like. The critical detail: you assert on properties of the output, not the exact output.

# tests/golden/dataset.json
[
    {
        "id": "calc-001",
        "input": "What is 15% of 240?",
        "expected": {"contains": ["36"], "tool_calls": ["calculator"], "max_steps": 3}
    },
    {
        "id": "code-002",
        "input": "Write a Python function to reverse a string",
        "expected": {"code_executes": true, "output_contains": ["def"]}
    }
]

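import pytest

# load_golden_dataset() is assumed to parse tests/golden/dataset.json into a list of dicts.
# Expectation keys without a branch below (code_executes, output_contains) are silently skipped.
@pytest.mark.parametrize("case", load_golden_dataset(), ids=lambda c: c["id"])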
def test_golden_case(case, agent):
    result = agent.run(case["input"])
    expected = case["expected"]

    if "contains" in expected:
        for substring in expected["contains"]:
            assert substring in result.output
    if "tool_calls" in expected:
        actual_tools = [tc.name for tc in result.tool_calls]
        for tool in expected["tool_calls"]:
            assert tool in actual_tools
    if "max_steps" in expected:
        assert result.step_count <= expected["max_steps"]

Track pass rates across model updates and agent changes. Set a threshold -- 95% is a reasonable starting point -- and fail the build if the pass rate drops below it.
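A parametrized pytest run fails as soon as any single case fails, so a threshold needs a separate aggregation step. The sketch below shows one possible shape for that gate; the function names and the case-id-to-bool result format are assumptions made for illustration.

```python
def pass_rate(results):
    """results: mapping of case id -> bool (True if the case passed)."""
    if not results:
        raise ValueError("no golden cases were run")
    return sum(results.values()) / len(results)

def check_regression_gate(results, threshold=0.95):
    """Return (gate_passed, rate, failed_case_ids) for a golden dataset run."""
    rate = pass_rate(results)
    failed = sorted(case_id for case_id, ok in results.items() if not ok)
    return rate >= threshold, rate, failed
```

In CI, the build step would exit non-zero when the first element is False and print the failed ids, so a regression points straight at the cases to replay.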

Strategy 5: Performance and Load Testing

Agent performance has two dimensions: cost (tokens consumed) and time (latency). Both must be bounded.

def test_simple_task_token_budget(agent):
    result = agent.run("What is 2 + 2?")
    assert result.total_tokens < 2000
    assert result.step_count <= 2

def test_loop_termination(agent):
    result = agent.run("This is an intentionally ambiguous task with no clear end")
    assert result.step_count <= agent.max_steps

For multi-agent systems, test that concurrent agents do not interfere with each other -- shared state, file locks, and API rate limits are the usual failure modes. If your system runs 4 agents in parallel (as Nevo does), your test suite should exercise that concurrency explicitly.
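A concurrency test can start very simply: give each agent its own workspace, run them in parallel, and check that no agent sees another's output. In the sketch below, fake_agent_run is a hypothetical stand-in for a real agent.run() call, and threads are used only because they are the easiest way to get overlap in a test.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def fake_agent_run(workspace: Path, agent_id: int) -> str:
    """Stand-in for agent.run(): writes, then reads back, a file in its workspace."""
    target = workspace / "notes.txt"
    target.write_text(f"agent-{agent_id}")
    return target.read_text()

def run_agents_concurrently(n_agents: int = 4):
    with tempfile.TemporaryDirectory() as root:
        workspaces = []
        for i in range(n_agents):
            ws = Path(root) / f"agent_{i}"
            ws.mkdir()
            workspaces.append(ws)
        with ThreadPoolExecutor(max_workers=n_agents) as pool:
            futures = [pool.submit(fake_agent_run, ws, i)
                       for i, ws in enumerate(workspaces)]
            # Results come back in submission order, one per agent
            return [f.result() for f in futures]

outputs = run_agents_concurrently(4)
assert outputs == [f"agent-{i}" for i in range(4)]
```

Pointing every agent at the same workspace instead is a quick way to confirm the test actually detects interference.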

Debugging Technique 1: Structured Logging for Agent Loops

When an agent fails, you need to know what happened between step 1 and the failure. Structured logging makes every step a queryable event:

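import uuid

import structlog

# Emit one JSON object per line so the log can be queried with jq
structlog.configure(processors=[
    structlog.processors.add_log_level,
    structlog.processors.TimeStamper(fmt="iso"),
    structlog.processors.JSONRenderer(),
])
logger = structlog.get_logger()

class AgentLoop:
    def run(self, task: str):
        run_id = uuid.uuid4().hex[:8]
        log = logger.bind(run_id=run_id, task=task[:100])
        log.info("agent.run.start")
        total_tokens = 0

        for step in range(self.max_steps):
            step_log = log.bind(step=step)
            response = self._call_llm(self.messages, self.tools)
            total_tokens += response.usage.total_tokens
            step_log.info("agent.llm.response",
                has_tool_call=bool(response.tool_calls),
                tokens_used=response.usage.total_tokens)

            if not response.tool_calls:
                break  # no tool requested: the model has produced its final answer
            for tc in response.tool_calls:
                result = self._execute_tool(tc)
                step_log.info("agent.tool.result",
                    tool=tc.name, success=result.get("success", True))
            # (appending tool results to self.messages is omitted for brevity)

        log.info("agent.run.complete", total_tokens=total_tokens)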

Then query with jq:

# Find all tool failures for a specific run
cat agent.log | jq 'select(.run_id == "a1b2c3d4" and .success == false)'

# Find runs that exceeded token budgets
cat agent.log | jq 'select(.event == "agent.run.complete" and .total_tokens > 50000)'

Debugging Technique 2: Token Usage Monitoring

Token usage is the agent equivalent of CPU profiling. Runaway costs almost always point to architectural problems: redundant context loading, unnecessary tool calls, or agents trapped in reasoning loops.

Track tokens per-step and per-tool. When a task that normally consumes 8,000 tokens suddenly burns 40,000, the per-tool breakdown reveals whether the problem is a bloated prompt, excessive tool calls, or a reasoning loop the agent cannot escape. At Nevo, a dedicated token-monitor agent analyzes spend patterns across sessions and flags anomalies automatically.
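A per-step and per-tool breakdown does not need much machinery. The sketch below is one minimal way to keep it, assuming the agent loop can report the token count of each LLM call along with the tool (if any) whose result fed that call. TokenTracker, the attribution rule, and the 3x anomaly factor are all illustrative choices, not a description of Nevo's token-monitor agent.

```python
from collections import defaultdict

class TokenTracker:
    """Accumulates token usage by step and by tool for a single agent run."""
    def __init__(self):
        self.per_step = defaultdict(int)
        self.per_tool = defaultdict(int)

    def record(self, step, tokens, tool=None):
        self.per_step[step] += tokens
        self.per_tool[tool or "<no tool>"] += tokens

    def total(self):
        return sum(self.per_step.values())

    def top_tools(self, n=3):
        """Tools ranked by tokens consumed, largest first."""
        return sorted(self.per_tool.items(), key=lambda kv: kv[1], reverse=True)[:n]

    def is_anomalous(self, baseline, factor=3.0):
        """True when the run used more than `factor` times the usual budget."""
        return self.total() > baseline * factor

# A run that normally costs about 8,000 tokens but burned 40,000
tracker = TokenTracker()
tracker.record(0, 1500)
tracker.record(1, 30000, tool="search_docs")
tracker.record(2, 8500, tool="read_file")
print(tracker.is_anomalous(baseline=8000), tracker.top_tools(1))
```

Here the breakdown points at a single tool's output as the source of the spike, which suggests truncating or summarizing that tool's results rather than rewriting the prompt.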

Debugging Technique 3: Prompt and Replay Debugging

When an agent behaves unexpectedly, the cause is usually in what it was told. Prompt debugging means inspecting the exact messages sent to the LLM at the moment of failure. Build a PromptDebugger wrapper that captures every message array before it hits the LLM API, and look for three common problems:

Context overflow. The conversation history grows until critical information falls outside the context window. The agent "forgets" what it was doing mid-task.

Conflicting instructions. Two system prompt sections give contradictory guidance. The agent oscillates between behaviors.

Tool description ambiguity. Two tools have similar descriptions, and the agent consistently picks the wrong one.
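A PromptDebugger wrapper along these lines could be sketched as follows. It assumes the wrapped client exposes generate(messages, tools) and that messages are dicts with a "content" key. The character count is only a crude proxy for context-window usage (a real check would use the model's tokenizer), and the 0.8 similarity threshold is an arbitrary starting point. Detecting conflicting instructions is left to manual inspection of the captured prompts.

```python
import copy
from difflib import SequenceMatcher
from itertools import combinations

class PromptDebugger:
    """Wraps an LLM client and snapshots every message array before it is sent."""
    def __init__(self, inner, max_chars=None):
        self.inner = inner
        self.max_chars = max_chars  # crude proxy for the context window
        self.captured = []
        self.warnings = []

    def generate(self, messages, tools=None):
        snapshot = copy.deepcopy(messages)
        self.captured.append(snapshot)
        size = sum(len(str(m.get("content", ""))) for m in snapshot)
        if self.max_chars is not None and size > self.max_chars:
            self.warnings.append(
                f"call {len(self.captured)}: {size} chars exceeds {self.max_chars}")
        return self.inner.generate(messages, tools)

def similar_descriptions(tool_descriptions, threshold=0.8):
    """tool_descriptions: dict of name -> description. Returns near-duplicate pairs."""
    pairs = []
    for (a, desc_a), (b, desc_b) in combinations(tool_descriptions.items(), 2):
        ratio = SequenceMatcher(None, desc_a.lower(), desc_b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs
```

Running similar_descriptions over the tool registry once at startup is a cheap way to catch ambiguous pairs before the agent ever has to choose between them.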

For intermittent failures, replay debugging is the definitive technique. Record every interaction in a session -- every LLM call, every tool response -- then replay it step by step. Compare a successful run against a failed run side by side. The divergence point -- where the successful run made one tool call and the failed run made a different one -- is almost always where the bug lives. This technique is especially valuable after model updates: record sessions on the old model, replay against the new model, and find behavioral regressions before they reach production.
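Finding the divergence point can be automated once both runs are reduced to an ordered list of tool calls. The helper below is a minimal sketch; the (tool_name, args) tuple format is an assumption about how a recorded session might be flattened.

```python
def first_divergence(good_trace, bad_trace):
    """Each trace is a list of (tool_name, args) tuples in call order.

    Returns the index of the first step where the runs differ,
    or None if the traces are identical.
    """
    for i, (good, bad) in enumerate(zip(good_trace, bad_trace)):
        if good != bad:
            return i
    if len(good_trace) != len(bad_trace):
        # One run stopped early: they diverge where the shorter one ends
        return min(len(good_trace), len(bad_trace))
    return None

good = [("read_file", {"path": "a.py"}), ("run_tests", {}), ("write_file", {"path": "a.py"})]
bad = [("read_file", {"path": "a.py"}), ("write_file", {"path": "a.py"})]
print(first_divergence(good, bad))  # the failed run skipped run_tests at index 1
```

The messages sent to the LLM just before that index are usually the next thing to inspect.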

Debugging Technique 4: Tracing and Observability

For production systems, structured logging is necessary but not sufficient. Distributed tracing connects LLM calls, tool executions, and sub-agent spawns into a single visual timeline. Several platforms support AI agent tracing:

LangSmith -- Per-step visibility into prompts, tool calls, and outputs. Best for LangChain users.
Arize Phoenix -- Open-source tracing with evaluation built in, via OpenTelemetry-compatible spans.
Braintrust -- Combines tracing with scoring and evaluation for tracking quality over time.
Custom OpenTelemetry -- If you already have distributed tracing infrastructure, extending it to agent spans is the lowest-friction option.

The principle is the same regardless of tool: every LLM call, tool execution, and decision point should be a span in a trace.
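To make the span idea concrete without depending on any of the platforms above, here is a deliberately tiny tracer built only on the standard library. It is not OpenTelemetry and not thread-safe; the Tracer class and its field names are invented for illustration. It only shows the parent/child structure that a real tracing backend records.

```python
import time
import uuid
from contextlib import contextmanager

class Tracer:
    """Toy single-threaded tracer: nested spans with parent links and durations."""
    def __init__(self):
        self.spans = []    # finished spans, in completion order
        self._stack = []   # currently open spans, outermost first

    @contextmanager
    def span(self, name, **attributes):
        record = {
            "span_id": uuid.uuid4().hex[:8],
            "parent_id": self._stack[-1]["span_id"] if self._stack else None,
            "name": name,
            "attributes": attributes,
            "start": time.perf_counter(),
        }
        self._stack.append(record)
        try:
            yield record
        finally:
            record["duration_s"] = time.perf_counter() - record["start"]
            self._stack.pop()
            self.spans.append(record)

tracer = Tracer()
with tracer.span("agent.run", task="fix bug"):
    with tracer.span("llm.call", step=0):
        pass  # the LLM request would happen here
    with tracer.span("tool.execute", tool="read_file"):
        pass  # the tool call would happen here
```

Each finished span carries its parent's id, which is what lets a tracing UI draw one agent run as a single nested timeline.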

Case Study: Nevo's 8-Stage Quality Pipeline

Theory is useful. A working system is better. Nevo's quality pipeline is a production example of how testing, debugging, and quality assurance integrate into an agent architecture that runs automatically on every code change.

The pipeline has eight stages, executed in strict sequence:

1. WRITE. The implementing agent produces code changes -- the artifact to be tested.

2. TYPECHECK. A typechecker subagent (running on Haiku for speed) executes type checking. TypeScript goes through tsc --noEmit, Python through mypy. Pass criteria: zero type errors.

3. TEST. A test-runner subagent (Sonnet, for reasoning about test design) checks coverage, writes missing tests, and runs the full suite. The agent identifies untested code paths and generates new cases. Pass criteria: all tests pass.

4. LINT. A linter subagent (Haiku) runs ESLint, Ruff, or ShellCheck as appropriate. Pass criteria: zero lint errors.

5. CRITIQUE. A code-critic subagent (Opus, the most capable model) performs deep code review against a quality rubric: simplicity, surgical precision, security, performance, error handling, and architectural fit. Each category gets a PASS, WARN, or FAIL rating with specific file and line references.

6. REFINE. If any stage returned FAIL, the implementing agent fixes the issues and re-runs stages 2 through 5. Up to three iterations. If failures persist, the system escalates.

7. ESCALATE. Four fresh-context steps designed to break analysis deadlocks: a diagnostic report, a research agent searching current best practices, an independent fresh-reviewer, and a quality-arbiter that makes the final call. Each escalation agent operates in a clean context window -- no iteration history, no accumulated bias.

8. ARBITER LOOP. If the arbiter approves changes, the implementing agent applies only the approved modifications (maximum five), then re-runs the pipeline. After three arbiter rounds without resolution, the system escalates to a human with a complete audit trail.

The model routing is deliberate. Mechanical tasks (type checking, linting) run on Haiku -- fast and cheap. Tasks requiring judgment (test writing, code review) run on Sonnet or Opus. The routing reflects an architectural recognition that different tasks require different levels of reasoning.
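The control flow of stages 2 through 6 can be sketched in a few lines. This is an illustration of the description above, not Nevo's actual code: the function names, the callback signatures, and the choice to run all four checks before refining (rather than stopping at the first FAIL) are assumptions, and the model names are plain labels rather than real model identifiers.

```python
MODEL_FOR_STAGE = {
    "typecheck": "haiku",   # mechanical: fast and cheap
    "test": "sonnet",       # needs reasoning about test design
    "lint": "haiku",        # mechanical
    "critique": "opus",     # deep review
}
CHECK_STAGES = ["typecheck", "test", "lint", "critique"]
MAX_REFINE_ITERATIONS = 3

def run_pipeline(run_stage, refine):
    """run_stage(stage, model) returns "PASS", "WARN", or "FAIL".
    refine(failures) asks the implementing agent to fix the failed stages."""
    for iteration in range(MAX_REFINE_ITERATIONS + 1):
        verdicts = {s: run_stage(s, MODEL_FOR_STAGE[s]) for s in CHECK_STAGES}
        failures = [s for s, v in verdicts.items() if v == "FAIL"]
        if not failures:
            return {"status": "passed", "refine_iterations": iteration}
        if iteration < MAX_REFINE_ITERATIONS:
            refine(failures)
    # Still failing after three refine iterations: hand off to escalation
    return {"status": "escalate", "refine_iterations": MAX_REFINE_ITERATIONS}
```

Keeping the routing table as data makes it easy to test the loop itself with stub callbacks, independent of any model.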

Testing Non-Deterministic Systems: Evaluation Frameworks

The fundamental challenge of agent testing is that the same input can produce different but equally valid outputs. Exact-match assertions fail. Evaluation frameworks solve this by defining success as a set of properties rather than a specific value.

Property-Based Evaluation

Instead of asserting exact outputs, assert properties of the output:

# try_execute, run_linter, and contains_dangerous_ops are assumed project helpers
# that each return a bool; llm_judge is defined in the next section.
def evaluate_code_generation(result, task):
    scores = {}
    scores["executes"] = try_execute(result.code)
    scores["addresses_task"] = llm_judge(
        f"Does this code address the task '{task}'? Answer YES or NO.",
        result.code)
    scores["passes_lint"] = run_linter(result.code)
    scores["safe"] = not contains_dangerous_ops(result.code)
    return scores

Each property is a binary check. The aggregate tells you quality. A result that executes correctly and passes lint but does not fully address the task is a partial success -- and that granularity matters for identifying where the agent's reasoning breaks down.

LLM-as-Judge

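For subjective outputs -- natural language responses, code review comments, architectural decisions -- an LLM judge provides repeatable, automated evaluation. Constrain it to a narrow question and a fixed answer format, since the judge is itself a language model: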

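def llm_judge(prompt: str, content: str) -> bool:
    # llm is assumed to be a module-level client whose generate() returns plain text
    response = llm.generate([
        {"role": "system", "content": "You are an evaluation judge. Answer only YES or NO."},
        {"role": "user", "content": f"{prompt}\n\nContent to evaluate:\n{content}"}
    ])
    # Accept "YES", "Yes.", "yes, because..." but nothing that starts otherwise
    return response.strip().upper().startswith("YES")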

Statistical Pass Criteria

For non-deterministic tests, run them multiple times and assert on the distribution:

def test_code_quality_consistency(agent, n_runs=5):
    results = [agent.run("Write a function to parse CSV files") for _ in range(n_runs)]
    passing = sum(1 for r in results if evaluate_code_generation(r, "parse CSV")["executes"])
    assert passing / n_runs >= 0.8, f"Only {passing}/{n_runs} runs produced executing code"

Set your threshold based on the cost of failure for that task type. Mission-critical code generation might require 95%. Exploratory research summaries might tolerate 70%.
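Small run counts make pass rates noisy, which is worth keeping in mind when choosing a threshold. One standard way to quantify that is the Wilson score interval for a binomial proportion; the sketch below computes its lower bound, assuming runs are independent. The helper name is illustrative.

```python
import math

def wilson_lower_bound(passes: int, runs: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval (z=1.96 is roughly 95% confidence)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    center = p + z * z / (2 * runs)
    margin = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs))
    return (center - margin) / denom

# Same observed 80% pass rate, very different certainty
print(round(wilson_lower_bound(4, 5), 2))
print(round(wilson_lower_bound(40, 50), 2))
```

With 4 of 5 runs passing, the lower bound sits well below 0.5, while 40 of 50 gives a bound around two thirds. For tasks where failure is expensive, that argues for more runs per test, not just a higher threshold.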

Putting It All Together: A Testing Checklist

Before shipping an AI agent to production, verify each layer:

Tool layer (unit tests)

Every tool handles valid inputs correctly
Every tool rejects invalid inputs with structured errors
Tool schemas are valid and complete
Error responses are parseable by the agent

Integration layer

Agent can call each tool with correct arguments
Agent handles tool errors without crashing
Multi-tool chains execute in the right order
Tool selection is correct for ambiguous tasks

End-to-end layer

Core workflows complete successfully with recorded LLM responses
Agent terminates within step and token budgets
Session recording and replay infrastructure works
Regression suite pass rate exceeds threshold (95%+)

Performance layer

Token usage is within budget for each task category
No individual tool call exceeds latency threshold
Concurrent agents do not interfere with each other
Agent terminates on ambiguous or impossible tasks

Observability layer

Structured logging captures every LLM call and tool execution
Token usage is tracked per-step and per-tool
Traces connect all events in a single agent run
Prompt contents are captured for debugging

FAQ

How do you test something that gives different answers every time?

You test for properties of the answer -- correctness, completeness, safety, efficiency -- not the exact answer. Run tests multiple times and assert on the pass rate, not individual results.

Should I use a real LLM in my test suite?

For unit and integration tests, no -- mock the LLM for speed and determinism. For end-to-end and regression tests, use recorded responses from a real LLM. For evaluation suite runs, use a real LLM but run infrequently (nightly, not per-commit).

What is the minimum viable test suite for an AI agent?

Unit tests for every tool, one integration test per core workflow, and a golden dataset with at least 20 cases. Add structured logging from day one. Everything else can be built incrementally.

How do you debug an agent that works sometimes and fails sometimes?

Session replay. Record every successful and failed run. Compare traces side by side. The divergence point is almost always where the bug lives.

How does Nevo's quality pipeline handle persistent failures?

After three refine iterations, Nevo escalates to fresh-context agents with no knowledge of previous attempts. If three arbiter rounds also fail, the system escalates to a human with a complete audit trail. The pipeline never silently gives up.

Next Steps

Testing and debugging are not afterthoughts -- they are the difference between an agent you demo and an agent you deploy. Start with unit tests for your tools, add structured logging to your agent loop, and build a golden dataset from real usage. The rest follows naturally.

For a complete walkthrough of building the system these tests validate, see How to Build an AI Agent System from Scratch. To understand the individual components being tested, read AI Agent Components. For a focused look at testing the skill layer specifically, our guide to AI agent skill testing covers what this tutorial does not.