March 1, 2026|Nevo

Deploy an AI Agent to Production: Docker, Monitoring, and Scaling

Building an AI agent is the easy part. Making it run reliably at 3 AM when no one is watching -- that is the actual engineering challenge.

Most AI agent tutorials end at "it works on my laptop." They skip the hard questions: How do you keep the agent running after a crash? How do you know when it is silently failing? How do you manage API costs that can spike 10x overnight? How do you scale from one user to a hundred without rewriting everything?

This guide covers the full deployment lifecycle for AI agents in production. From containerization with Docker through monitoring, scaling, cost management, and reliability engineering. Every configuration example is drawn from real production systems -- including Nevo, which runs 24/7 across 11 LaunchAgent services on a single Mac Studio, processing messages from 20+ platforms without downtime.

For foundational context, see What Are AI Agents?. If you are still in the building phase, start with How to Build an AI Agent System from Scratch. For architecture patterns, see AI Agent Systems.

Why Deployment Is Where Most AI Agents Die

Deploying an AI agent to production means running an AI agent system reliably, continuously, and observably on infrastructure that can handle real-world load, failures, and cost constraints.

The gap between "demo" and "production" is not a small step. It is a category change. A demo agent needs to work once. A production agent needs to work every time, recover from failures automatically, cost a predictable amount, and tell you when something is wrong before your users notice.

Here is what goes wrong when teams skip deployment engineering:

Silent failures. The agent crashes at 2 AM. No alert fires. No one knows until a customer complains eight hours later.
Cost explosions. A bug causes the agent to loop, making thousands of API calls. The monthly bill arrives five figures higher than expected.
State corruption. The agent loses memory or context after a restart because no one thought about persistence.
Scaling walls. The agent works for 10 users. At 100, response times hit 30 seconds. At 1,000, the system falls over.

Every one of these is avoidable with the right deployment architecture. The rest of this guide shows you how.

Step 1: Containerize with Docker

Containerization is the foundation of reproducible deployment. An AI agent typically depends on specific Python or Node.js versions, system libraries, API clients, and configuration files. Docker ensures the environment is identical everywhere -- your laptop, a staging server, production.

The Dockerfile

A production AI agent Dockerfile has three priorities: small image size, fast builds, and no secrets baked in.

# syntax=docker/dockerfile:1
FROM python:3.12-slim AS base

# System dependencies (build stage)
RUN apt-get update && apt-get install -y --no-install-recommends \
    git curl build-essential \
    && rm -rf /var/lib/apt/lists/*

# Create non-root user
RUN groupadd -r agent && useradd -r -g agent -d /app -s /bin/bash agent

WORKDIR /app

# Install Python dependencies (cached layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY --chown=agent:agent . .

# Never run as root
USER agent

# Health check endpoint
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1

EXPOSE 8080

CMD ["python", "-m", "agent.main"]

Key decisions in this Dockerfile:

python:3.12-slim -- Not alpine (which causes issues with many Python packages) and not the full image (900MB of tools you do not need).
Non-root user -- Production containers should never run as root. A compromised agent with root access can escape the container.
Layer ordering -- requirements.txt is copied before the application code. This means dependency installation is cached unless requirements change, cutting build times from minutes to seconds.
Health check built in -- Docker can automatically restart unhealthy containers. Without a health check, Docker only knows if the process exited, not if it is stuck in an infinite loop or deadlocked.

Docker Compose for Multi-Service Agents

Real AI agent systems are not single-process applications. They typically involve the agent runtime, a model proxy, a message queue, and a database for memory. Docker Compose orchestrates all of these.

# docker-compose.yml
version: "3.9"

services:
  agent:
    build: .
    restart: unless-stopped
    env_file: .env
    ports:
      - "8080:8080"
    volumes:
      - agent-memory:/app/data
      - ./config:/app/config:ro
    depends_on:
      model-proxy:
        condition: service_healthy
      redis:
        condition: service_healthy
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: "2.0"
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "5"

  model-proxy:
    image: ghcr.io/berriai/litellm:main-latest
    restart: unless-stopped
    ports:
      - "4000:4000"
    volumes:
      - ./litellm-config.yaml:/app/config.yaml:ro
    env_file: .env
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health"]
      interval: 15s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          memory: 512M

  redis:
    image: redis:7-alpine
    restart: unless-stopped
    volumes:
      - redis-data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5
    deploy:
      resources:
        limits:
          memory: 256M

volumes:
  agent-memory:
  redis-data:

Several things to note here:

restart: unless-stopped -- The agent restarts on crash, on server reboot, on Docker daemon restart. It only stays down if you explicitly stop it.
depends_on with health conditions -- The agent does not start until the model proxy and Redis are healthy. Without this, the agent boots, tries to call the model proxy, fails, and enters a crash loop.
Memory limits -- Without resource limits, a runaway agent can consume all available RAM and crash everything else on the machine. Set limits based on your observed usage plus a 50% buffer.
Log rotation -- max-size: "50m" with max-file: "5" caps logs at 250MB. Without this, verbose agents fill disks in days.
Read-only config -- The :ro flag on config mounts prevents the agent from accidentally modifying its own configuration.

Step 2: Manage Secrets and Configuration

Hard-coding API keys into source code or Docker images is a guaranteed production incident waiting to happen. AI agents typically need credentials for multiple services: LLM providers, messaging platforms, databases, and external tools.

Environment Variables with .env

The simplest approach for small deployments:

# .env -- NEVER commit this file
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
TELEGRAM_BOT_TOKEN=7891234567:AAH...
DATABASE_URL=postgresql://agent:secret@db:5432/memory
REDIS_URL=redis://redis:6379/0

# Model routing configuration
DEFAULT_MODEL=claude-sonnet-4-20250514
COMPLEX_MODEL=claude-opus-4-20250514
FAST_MODEL=claude-haiku-3-5-20241022

# Rate limiting
MAX_REQUESTS_PER_MINUTE=60
MAX_TOKENS_PER_MINUTE=100000

Enforce this with .gitignore:

.env
.env.*
!.env.example

Always commit a .env.example with placeholder values so new developers know which variables are required.

Credential Store for Multi-Service Systems

For systems with many credentials (Nevo manages credentials for over a dozen services), a directory-based credential store is cleaner than a single .env file:

# ~/.agent/credentials/ (chmod 700)
# Each service gets its own file (chmod 600)

# ~/.agent/credentials/anthropic.env
ANTHROPIC_API_KEY=sk-ant-...

# ~/.agent/credentials/telegram.env
TELEGRAM_BOT_TOKEN=7891234567:AAH...
TELEGRAM_CHAT_ID=123456789

# ~/.agent/credentials/litellm.env
LITELLM_MASTER_KEY=sk-...

Load credentials at runtime:

source "$HOME/.agent/credentials/anthropic.env"
source "$HOME/.agent/credentials/telegram.env"

This approach has a practical advantage: you can rotate one service's credentials without touching any others.

For Cloud Deployments

If deploying to a cloud provider, use their secrets management:

AWS: Secrets Manager or Systems Manager Parameter Store
GCP: Secret Manager
Azure: Key Vault
Kubernetes: Sealed Secrets or External Secrets Operator

The principle is the same everywhere: secrets are injected at runtime, never baked into images, never committed to repositories.

Step 3: Monitoring and Observability

An AI agent you cannot observe is an AI agent you cannot trust. Monitoring has three layers: logging, metrics, and alerting.

Structured Logging

Unstructured logs ("Agent did something") are useless at scale. Structure every log entry:

import json
import sys
from datetime import datetime, timezone

def log(level: str, event: str, **context):
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "event": event,
        "agent_id": os.getenv("AGENT_ID", "default"),
        **context
    }
    print(json.dumps(entry), file=sys.stderr)

# Usage
log("info", "task_started", task_id="abc-123", model="claude-sonnet")
log("warn", "rate_limited", provider="anthropic", retry_after_s=30)
log("error", "tool_failed", tool="database_query", error="connection refused")

Structured logs let you filter, aggregate, and alert on specific patterns. "Show me all errors from the last hour where the model was claude-opus" becomes a simple query instead of a grep nightmare.

Key Metrics to Track

These are the metrics that actually matter for AI agent operations:

Metric	Why It Matters	Alert Threshold
Request latency (p95)	User experience	> 10s
Error rate	Reliability	> 5% over 5 min
Token usage per request	Cost control	> 2x daily average
Active tasks	Capacity	> queue depth limit
Memory usage	Stability	> 80% of limit
Model API latency	Dependency health	> 5s
Cost per hour	Budget	> 2x hourly budget
Circuit breaker state	Failure cascade	Any open breaker

Prometheus Metrics Endpoint

from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Counters
requests_total = Counter(
    "agent_requests_total",
    "Total requests processed",
    ["model", "status"]
)
tokens_used = Counter(
    "agent_tokens_total",
    "Total tokens consumed",
    ["model", "direction"]  # direction: input or output
)

# Histograms
request_duration = Histogram(
    "agent_request_duration_seconds",
    "Request processing time",
    ["model"],
    buckets=[0.5, 1, 2, 5, 10, 30, 60]
)

# Gauges
active_tasks = Gauge("agent_active_tasks", "Currently running tasks")
cost_estimate = Gauge("agent_cost_estimate_hourly", "Estimated hourly cost in USD")

# Expose on /metrics
start_http_server(9090)

Health Check Endpoint

Every production agent needs a health endpoint. This is not optional.

from fastapi import FastAPI
import time

app = FastAPI()
last_successful_inference = time.time()

@app.get("/health")
async def health():
    checks = {
        "model_proxy": await check_model_proxy(),
        "memory_store": await check_memory(),
        "last_inference_age_s": time.time() - last_successful_inference,
    }
    healthy = all([
        checks["model_proxy"],
        checks["memory_store"],
        checks["last_inference_age_s"] < 300  # Alert if no inference in 5 min
    ])
    return {
        "status": "healthy" if healthy else "degraded",
        "checks": checks,
        "uptime_s": time.time() - start_time
    }

Alerting

Monitoring without alerting is just data collection. Set up alerts for the failure modes that matter:

# alertmanager rules (Prometheus format)
groups:
  - name: agent_alerts
    rules:
      - alert: AgentHighErrorRate
        expr: rate(agent_requests_total{status="error"}[5m]) > 0.05
        for: 2m
        annotations:
          summary: "Agent error rate above 5% for 2 minutes"

      - alert: AgentCostSpike
        expr: agent_cost_estimate_hourly > 10
        for: 5m
        annotations:
          summary: "Hourly cost exceeds $10"

      - alert: AgentUnresponsive
        expr: up{job="agent"} == 0
        for: 1m
        annotations:
          summary: "Agent health endpoint unreachable"

Step 4: Always-On Deployment with systemd and LaunchAgent

Docker is not the only way to keep an agent running. For single-machine deployments -- especially development machines or dedicated hardware -- process managers provide lower overhead and tighter OS integration.

systemd on Linux

systemd is the standard process manager on modern Linux distributions. A systemd service file gives you automatic restarts, logging, resource limits, and boot-time startup.

# /etc/systemd/system/ai-agent.service
[Unit]
Description=AI Agent Service
After=network-online.target
Wants=network-online.target
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Type=simple
User=agent
Group=agent
WorkingDirectory=/opt/agent
ExecStart=/opt/agent/venv/bin/python -m agent.main
Restart=always
RestartSec=10

# Secrets from credential store
EnvironmentFile=/opt/agent/.env

# Resource limits
MemoryMax=2G
CPUQuota=200%

# Security hardening
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/opt/agent/data

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=ai-agent

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable ai-agent
sudo systemctl start ai-agent

# Check status
systemctl status ai-agent

# Tail logs
journalctl -u ai-agent -f

The StartLimitIntervalSec and StartLimitBurst settings prevent rapid crash loops -- if the agent crashes 5 times in 5 minutes, systemd stops trying and marks the service as failed, giving you a clear signal that something is fundamentally wrong rather than endlessly restarting.

LaunchAgent on macOS

On macOS, launchd is the system process manager. This is how Nevo runs -- 11 LaunchAgent services coordinating the full stack on a single Mac Studio.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.myagent.daemon</string>

    <key>Comment</key>
    <string>AI Agent — always-on autonomous assistant</string>

    <!-- Start on login, keep alive forever -->
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>

    <!-- Prevent rapid restart loops (min 10s between restarts) -->
    <key>ThrottleInterval</key>
    <integer>10</integer>

    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/python3</string>
        <string>/opt/agent/main.py</string>
    </array>

    <key>WorkingDirectory</key>
    <string>/opt/agent</string>

    <!-- Log output -->
    <key>StandardOutPath</key>
    <string>/opt/agent/logs/agent.log</string>
    <key>StandardErrorPath</key>
    <string>/opt/agent/logs/agent.err.log</string>

    <!-- Environment -->
    <key>EnvironmentVariables</key>
    <dict>
        <key>HOME</key>
        <string>/Users/yourusername</string>
        <key>PATH</key>
        <string>/usr/local/bin:/usr/bin:/bin</string>
    </dict>
</dict>
</plist>

Install and manage:

# Copy plist to LaunchAgents directory
cp com.myagent.daemon.plist ~/Library/LaunchAgents/

# Load and start
launchctl load ~/Library/LaunchAgents/com.myagent.daemon.plist

# Check if running
launchctl list | grep myagent

# Unload (stop)
launchctl unload ~/Library/LaunchAgents/com.myagent.daemon.plist

# View logs
tail -f /opt/agent/logs/agent.log

Key differences from systemd:

KeepAlive: true is the equivalent of Restart=always -- launchd relaunches the process if it exits for any reason.
ThrottleInterval prevents restart storms. If the process exits within 10 seconds, launchd waits before relaunching.
RunAtLoad: true starts the agent when the user logs in (LaunchAgents) or when the system boots (LaunchDaemons in /Library/LaunchDaemons/).
No built-in resource limits -- macOS launchd does not support memory or CPU limits natively. Use container-based deployment if you need hard resource constraints.

For a real-world example: Nevo runs 11 LaunchAgent services including the core daemon, a model routing proxy, a cron watchdog, memory consolidation, and upstream monitoring -- all coordinated through launchd with individual log files and automatic restart on failure.

When to Use Which

Approach	Best For	Overhead
Docker Compose	Cloud deployment, team environments, multi-machine	Medium
systemd	Linux servers, VPS, dedicated hardware	Low
LaunchAgent	macOS workstations, Mac Mini/Studio servers	Low
Kubernetes	Large-scale, multi-agent, auto-scaling	High

For a single agent on dedicated hardware, systemd or LaunchAgent is simpler and more efficient. For deployable, reproducible, team-managed systems, Docker. For scale-out architectures, Kubernetes.

Step 5: Scaling Strategies

A single-process agent handles one request at a time. That works until it does not. Scaling AI agents is different from scaling traditional web services because the bottleneck is usually external -- LLM API rate limits, not your own compute.

Horizontal Scaling with a Task Queue

The most effective pattern for scaling AI agents is queue-based processing. Requests go into a queue. Multiple agent workers pull from the queue and process independently.

# worker.py -- one of N identical agent workers
import redis
import json

r = redis.Redis()

while True:
    # Block until a task is available
    _, task_json = r.brpop("agent:tasks")
    task = json.loads(task_json)

    try:
        result = process_task(task)
        r.lpush(f"agent:results:{task['id']}", json.dumps(result))
    except Exception as e:
        r.lpush(f"agent:results:{task['id']}", json.dumps({
            "error": str(e),
            "task_id": task["id"]
        }))

Scale by running more workers:

# docker-compose.yml (scaling section)
services:
  agent-worker:
    build: .
    command: python -m agent.worker
    deploy:
      replicas: 4
    env_file: .env
    depends_on:
      - redis
      - model-proxy

Rate Limit Awareness

LLM APIs enforce rate limits. Your scaling strategy must account for this, or you will hit 429 errors and degrade service for everyone.

import time
from collections import deque

class RateLimiter:
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps = deque()

    async def acquire(self):
        now = time.time()
        # Remove timestamps outside the window
        while self.timestamps and self.timestamps[0] < now - self.window:
            self.timestamps.popleft()

        if len(self.timestamps) >= self.max_requests:
            wait_time = self.timestamps[0] + self.window - now
            await asyncio.sleep(wait_time)

        self.timestamps.append(time.time())

# Usage: shared across all workers
rate_limiter = RateLimiter(max_requests=60, window_seconds=60)

async def call_model(prompt: str):
    await rate_limiter.acquire()
    return await model_client.complete(prompt)

Load Balancing Across Model Providers

If you use multiple LLM providers (or multiple API keys), a model proxy like LiteLLM distributes load and handles failover:

# litellm-config.yaml
model_list:
  - model_name: agent-default
    litellm_params:
      model: claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      max_tokens: 8192

  - model_name: agent-default
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      max_tokens: 4096

router_settings:
  routing_strategy: least-busy
  num_retries: 3
  timeout: 120
  fallbacks:
    - agent-default: [agent-fallback]

This configuration routes to whichever provider has the lowest current load, retries failed requests up to 3 times, and falls back to alternative models if the primary provider is down.

Step 6: Cost Management

LLM API costs are the largest operational expense for most AI agent systems. Without active cost management, a productive agent can easily cost thousands per month.

Model Routing for Cost Optimization

Not every task needs your most expensive model. Route by complexity:

MODEL_TIERS = {
    "fast": {
        "model": "claude-haiku-3-5-20241022",
        "cost_per_1k_input": 0.0008,
        "cost_per_1k_output": 0.004,
        "use_for": ["classification", "formatting", "simple_extraction"]
    },
    "standard": {
        "model": "claude-sonnet-4-20250514",
        "cost_per_1k_input": 0.003,
        "cost_per_1k_output": 0.015,
        "use_for": ["code_generation", "analysis", "summarization"]
    },
    "complex": {
        "model": "claude-opus-4-20250514",
        "cost_per_1k_input": 0.015,
        "cost_per_1k_output": 0.075,
        "use_for": ["architecture", "root_cause_analysis", "code_review"]
    }
}

def select_model(task_type: str) -> str:
    for tier_name, tier in MODEL_TIERS.items():
        if task_type in tier["use_for"]:
            return tier["model"]
    return MODEL_TIERS["standard"]["model"]  # Default to mid-tier

Nevo uses this exact pattern with LiteLLM routing 6 model aliases across 3 Anthropic tiers. Simple tasks like type checking and linting go to Haiku. Standard work goes to Sonnet. Complex reasoning -- code review, root cause analysis, architectural decisions -- goes to Opus. The cost difference between routing everything to Opus versus tiered routing can be 5-10x.

Cost Tracking and Budgets

Track every API call and enforce budget limits:

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class CostRecord:
    timestamp: datetime
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    task_id: str

class CostTracker:
    def __init__(self, daily_budget_usd: float):
        self.daily_budget = daily_budget_usd
        self.records: list[CostRecord] = []

    def record(self, model: str, input_tokens: int,
               output_tokens: int, task_id: str):
        cost = self._calculate_cost(model, input_tokens, output_tokens)
        self.records.append(CostRecord(
            timestamp=datetime.now(timezone.utc),
            model=model, input_tokens=input_tokens,
            output_tokens=output_tokens, cost_usd=cost,
            task_id=task_id
        ))

        daily_total = self._daily_total()
        if daily_total > self.daily_budget * 0.8:
            log("warn", "cost_threshold",
                daily_total=daily_total, budget=self.daily_budget)
        if daily_total > self.daily_budget:
            raise BudgetExceededError(
                f"Daily budget of ${self.daily_budget} exceeded: ${daily_total:.2f}")

Local vs Cloud: The Cost Equation

The decision to run locally versus in the cloud is primarily a cost decision, not a technical one.

Factor	Local (Dedicated Hardware)	Cloud (VPS/Container)
Upfront cost	$500-$5,000 (hardware)	$0
Monthly cost	~$10 (electricity)	$20-$200 (compute)
LLM API cost	Same	Same
Scaling	Buy more hardware	Click a button
Maintenance	You handle everything	Provider handles infra
Latency	Lowest (local network)	Variable
Data privacy	Complete	Depends on provider

For a single-user AI agent like a private AI agent, local deployment is almost always cheaper after month 3. The hardware pays for itself. For multi-user systems that need elastic scaling, cloud deployment makes more sense despite the ongoing cost.

The LLM API cost is the same either way -- and it is usually the dominant expense. A system spending $5/day on API calls and $20/month on a VPS is spending 88% of its budget on the API. Optimize model routing first. Cloud versus local is a secondary decision.

Step 7: Reliability Engineering

A production agent must handle failure gracefully. Networks drop. APIs return errors. Models hallucinate. The system should degrade, not collapse.

Circuit Breakers

A circuit breaker prevents cascade failures. When an external service (like an LLM API) starts failing, the circuit breaker stops sending requests -- giving the service time to recover instead of hammering it with doomed requests.

import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5,
                 recovery_timeout: int = 60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "closed"  # closed = normal, open = blocking
        self.last_failure_time = 0

    async def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = "half-open"  # Try one request
            else:
                raise CircuitOpenError("Circuit breaker is open")

        try:
            result = await func(*args, **kwargs)
            if self.state == "half-open":
                self.state = "closed"
                self.failures = 0
            return result
        except Exception as e:
            self.failures += 1
            self.last_failure_time = time.time()
            if self.failures >= self.failure_threshold:
                self.state = "open"
                log("error", "circuit_opened",
                    failures=self.failures, service=func.__name__)
            raise

Nevo uses circuit breakers in its Ralph loop execution system. After 3 consecutive failures on the same task, the circuit breaker opens and the system escalates to a human rather than burning through tokens on a task it cannot complete.

Restart Policies and Graceful Shutdown

Production agents need to handle restarts without losing work:

import signal
import asyncio

class GracefulAgent:
    def __init__(self):
        self.running = True
        self.current_task = None
        signal.signal(signal.SIGTERM, self._shutdown)
        signal.signal(signal.SIGINT, self._shutdown)

    def _shutdown(self, signum, frame):
        log("info", "shutdown_requested", signal=signum)
        self.running = False

    async def run(self):
        while self.running:
            task = await self.get_next_task()
            if task:
                self.current_task = task
                try:
                    await self.process_task(task)
                    await self.checkpoint(task)  # Persist progress
                finally:
                    self.current_task = None

        # Clean shutdown: finish current work, persist state
        if self.current_task:
            log("info", "completing_current_task",
                task_id=self.current_task["id"])
            await self.process_task(self.current_task)
            await self.checkpoint(self.current_task)
        log("info", "shutdown_complete")

The pattern: catch shutdown signals, finish the current unit of work, persist state, then exit cleanly. This prevents the most common production issue -- an agent that loses progress on every restart.

Retry with Exponential Backoff

Transient failures (network blips, rate limits, temporary API errors) should be retried. Permanent failures (invalid API key, malformed request) should not.

import asyncio
import random

async def retry_with_backoff(func, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return await func()
        except TransientError as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            log("warn", "retrying", attempt=attempt + 1,
                delay_s=round(delay, 1), error=str(e))
            await asyncio.sleep(delay)
        except PermanentError:
            raise  # Do not retry permanent failures

The jitter (random.uniform(0, 1)) prevents the thundering herd problem -- when multiple workers retry at exactly the same time, creating a synchronized spike.

Putting It All Together: Deployment Checklist

Before declaring your AI agent production-ready, verify each of these:

Containerization

Dockerfile builds reproducibly
Non-root user in container
Health check defined
Resource limits set
Log rotation configured

Secrets

No credentials in source code or images
.env in .gitignore
Credential rotation procedure documented
File permissions set (600 for credential files)

Monitoring

Structured logging implemented
Health endpoint responds with dependency status
Cost tracking per model, per task
Alerts configured for error rate, cost, and availability

Scaling

Queue-based task processing (if multi-user)
Rate limiter for LLM API calls
Model routing by task complexity

Reliability

Circuit breakers on external dependencies
Graceful shutdown handler
Retry with exponential backoff
State persistence across restarts
Restart policy configured (systemd/LaunchAgent/Docker)

Cost

Model routing by complexity tier
Daily/monthly budget limits
Cost alerts at 80% threshold
Token usage monitored per task type

FAQ

How much does it cost to run an AI agent in production?

The cost depends almost entirely on LLM API usage, not infrastructure. A lightly-used personal agent costs $5-$30/month in API fees. A heavily-used agent processing hundreds of tasks daily can cost $500-$2,000/month. Infrastructure costs (VPS, database) are typically 10-20% of total spend. Model routing -- sending simple tasks to cheaper models -- is the single most effective cost optimization.

Should I run my AI agent locally or in the cloud?

For a single-user agent, local deployment on dedicated hardware (Mac Mini, Linux server) is cheaper after 2-3 months because the hardware pays for itself and eliminates monthly compute fees. For multi-user systems that need elastic scaling, cloud deployment is more practical. The LLM API cost is identical either way, and it is usually the largest expense.

How do I prevent runaway API costs from a bug?

Three layers of protection: (1) set daily budget limits that halt processing when exceeded, (2) implement circuit breakers that stop after consecutive failures, and (3) configure cost alerts at 80% of your budget threshold. Without all three, a single bug in an agent loop can generate thousands of dollars in API calls overnight.

What is the best way to keep an AI agent running 24/7?

On Linux, use a systemd service with Restart=always. On macOS, use a LaunchAgent with KeepAlive: true. For cloud deployments, Docker with restart: unless-stopped. All three approaches automatically restart the agent after crashes, reboots, and failures. Add health checks so the process manager knows when the agent is stuck, not just when it has exited.

Do I need Kubernetes to run AI agents at scale?

Not necessarily. Queue-based scaling with Docker Compose handles most workloads up to hundreds of concurrent users. Kubernetes becomes valuable when you need automatic scaling based on queue depth, zero-downtime deployments, or you are running dozens of different agent services. For most teams, Docker Compose with a Redis task queue is sufficient and dramatically simpler to operate.

How do I monitor an AI agent in production?

Three layers: structured JSON logging (for debugging), Prometheus metrics (for dashboards and trends), and alerting rules (for incidents). The critical metrics are error rate, request latency, token usage, and cost per hour. A health endpoint that checks all dependencies (LLM API, database, message queue) lets your process manager or load balancer detect failures automatically.

What to Read Next

This guide covered the infrastructure side of AI agent deployment. For the agent-building fundamentals that precede deployment, see How to Build an AI Agent System from Scratch. For the architectural patterns that make agent systems maintainable at scale, see AI Agent Systems. And if data sovereignty matters for your deployment, Private AI Agents covers running everything on hardware you control.

Production deployment is not a one-time event. It is an ongoing practice of monitoring, optimizing, and hardening. The best production agents are not the ones that never fail -- they are the ones that recover so fast, no one notices.