Deploy an AI Agent to Production: Docker, Monitoring, and Scaling
Building an AI agent is the easy part. Making it run reliably at 3 AM when no one is watching -- that is the actual engineering challenge.
Most AI agent tutorials end at "it works on my laptop." They skip the hard questions: How do you keep the agent running after a crash? How do you know when it is silently failing? How do you manage API costs that can spike 10x overnight? How do you scale from one user to a hundred without rewriting everything?
This guide covers the full deployment lifecycle for AI agents in production, from containerization with Docker through monitoring, scaling, cost management, and reliability engineering. Every configuration example is drawn from real production systems -- including Nevo, which runs 24/7 across 11 LaunchAgent services on a single Mac Studio, processing messages from 20+ platforms without downtime.
For foundational context, see What Are AI Agents?. If you are still in the building phase, start with How to Build an AI Agent System from Scratch. For architecture patterns, see AI Agent Systems.
Why Deployment Is Where Most AI Agents Die
Deploying an AI agent to production means running an AI agent system reliably, continuously, and observably on infrastructure that can handle real-world load, failures, and cost constraints.
The gap between "demo" and "production" is not a small step. It is a category change. A demo agent needs to work once. A production agent needs to work every time, recover from failures automatically, cost a predictable amount, and tell you when something is wrong before your users notice.
Here is what goes wrong when teams skip deployment engineering:
- Silent failures. The agent crashes at 2 AM. No alert fires. No one knows until a customer complains eight hours later.
- Cost explosions. A bug causes the agent to loop, making thousands of API calls. The monthly bill arrives five figures higher than expected.
- State corruption. The agent loses memory or context after a restart because no one thought about persistence.
- Scaling walls. The agent works for 10 users. At 100, response times hit 30 seconds. At 1,000, the system falls over.
Every one of these is avoidable with the right deployment architecture. The rest of this guide shows you how.
Step 1: Containerize with Docker
Containerization is the foundation of reproducible deployment. An AI agent typically depends on specific Python or Node.js versions, system libraries, API clients, and configuration files. Docker ensures the environment is identical everywhere -- your laptop, a staging server, production.
The Dockerfile
A production AI agent Dockerfile has three priorities: small image size, fast builds, and no secrets baked in.
# syntax=docker/dockerfile:1
FROM python:3.12-slim AS base
# System dependencies (curl is required by the HEALTHCHECK below)
RUN apt-get update && apt-get install -y --no-install-recommends \
        git curl build-essential \
    && rm -rf /var/lib/apt/lists/*
# Create non-root user
RUN groupadd -r agent && useradd -r -g agent -d /app -s /bin/bash agent
WORKDIR /app
# Install Python dependencies (cached layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY --chown=agent:agent . .
# Never run as root
USER agent
# Health check endpoint
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1
EXPOSE 8080
CMD ["python", "-m", "agent.main"]
Key decisions in this Dockerfile:
- python:3.12-slim -- Not alpine (which causes issues with many Python packages) and not the full image (900MB of tools you do not need).
- Non-root user -- Production containers should never run as root. If the agent is compromised, root inside the container makes a container escape or damage to the host far easier.
- Layer ordering -- requirements.txt is copied before the application code. This means dependency installation is cached unless requirements change, cutting build times from minutes to seconds.
- Health check built in -- Without a health check, Docker only knows whether the process exited, not whether it is stuck in an infinite loop or deadlocked. Note that plain Docker and Docker Compose only mark the container as unhealthy; restarting it requires an orchestrator (Swarm, Kubernetes) or a watchdog that acts on the health status.
Docker Compose for Multi-Service Agents
Real AI agent systems are not single-process applications. They typically involve the agent runtime, a model proxy, a message queue, and a database for memory. Docker Compose orchestrates all of these.
# docker-compose.yml
version: "3.9"
services:
  agent:
    build: .
    restart: unless-stopped
    env_file: .env
    ports:
      - "8080:8080"
    volumes:
      - agent-memory:/app/data
      - ./config:/app/config:ro
    depends_on:
      model-proxy:
        condition: service_healthy
      redis:
        condition: service_healthy
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: "2.0"
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "5"

  model-proxy:
    image: ghcr.io/berriai/litellm:main-latest
    restart: unless-stopped
    ports:
      - "4000:4000"
    volumes:
      - ./litellm-config.yaml:/app/config.yaml:ro
    env_file: .env
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health"]
      interval: 15s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          memory: 512M

  redis:
    image: redis:7-alpine
    restart: unless-stopped
    volumes:
      - redis-data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5
    deploy:
      resources:
        limits:
          memory: 256M

volumes:
  agent-memory:
  redis-data:
Several things to note here:
- restart: unless-stopped -- The agent restarts on crash, on server reboot, on Docker daemon restart. It only stays down if you explicitly stop it.
- depends_on with health conditions -- The agent does not start until the model proxy and Redis are healthy. Without this, the agent boots, tries to call the model proxy, fails, and enters a crash loop.
- Memory limits -- Without resource limits, a runaway agent can consume all available RAM and crash everything else on the machine. Set limits based on your observed usage plus a 50% buffer.
- Log rotation -- max-size: "50m" with max-file: "5" caps logs at 250MB per container. Without this, verbose agents fill disks in days.
- Read-only config -- The :ro flag on config mounts prevents the agent from accidentally modifying its own configuration.
Step 2: Manage Secrets and Configuration
Hard-coding API keys into source code or Docker images is a guaranteed production incident waiting to happen. AI agents typically need credentials for multiple services: LLM providers, messaging platforms, databases, and external tools.
Environment Variables with .env
The simplest approach for small deployments:
# .env -- NEVER commit this file
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
TELEGRAM_BOT_TOKEN=7891234567:AAH...
DATABASE_URL=postgresql://agent:secret@db:5432/memory
REDIS_URL=redis://redis:6379/0
# Model routing configuration
DEFAULT_MODEL=claude-sonnet-4-20250514
COMPLEX_MODEL=claude-opus-4-20250514
FAST_MODEL=claude-3-5-haiku-20241022
# Rate limiting
MAX_REQUESTS_PER_MINUTE=60
MAX_TOKENS_PER_MINUTE=100000
Enforce this with .gitignore:
.env
.env.*
!.env.example
Always commit a .env.example with placeholder values so new developers know which variables are required.
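A startup check along these lines can turn a missing variable into an immediate, readable failure instead of an error deep inside a request. This is a minimal sketch: the `REQUIRED_VARS` list and both helper names are illustrative assumptions, not part of any library.

```python
import os

# Hypothetical list: mirror whatever your .env.example declares.
REQUIRED_VARS = ["ANTHROPIC_API_KEY", "REDIS_URL", "DEFAULT_MODEL"]


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


def require_env(required=REQUIRED_VARS, environ=None):
    """Raise at startup if any required variable is missing."""
    missing = missing_env_vars(required, environ)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling `require_env()` as the first line of the agent's entry point keeps the failure next to its cause.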
Credential Store for Multi-Service Systems
For systems with many credentials (Nevo manages credentials for over a dozen services), a directory-based credential store is cleaner than a single .env file:
# ~/.agent/credentials/ (chmod 700)
# Each service gets its own file (chmod 600)
# ~/.agent/credentials/anthropic.env
ANTHROPIC_API_KEY=sk-ant-...
# ~/.agent/credentials/telegram.env
TELEGRAM_BOT_TOKEN=7891234567:AAH...
TELEGRAM_CHAT_ID=123456789
# ~/.agent/credentials/litellm.env
LITELLM_MASTER_KEY=sk-...
Load credentials at runtime:
source "$HOME/.agent/credentials/anthropic.env"
source "$HOME/.agent/credentials/telegram.env"
This approach has a practical advantage: you can rotate one service's credentials without touching any others.
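If the agent is a Python process rather than a shell script, the same store can be read directly. The sketch below assumes POSIX permissions and plain KEY=VALUE lines (no quoting, no `export` prefix); `parse_env_file` and `load_credentials` are hypothetical helper names.

```python
import stat
from pathlib import Path


def parse_env_file(path):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def load_credentials(directory):
    """Load every *.env file in `directory`, rejecting loose permissions."""
    merged = {}
    for path in sorted(Path(directory).glob("*.env")):
        mode = stat.S_IMODE(path.stat().st_mode)
        if mode & 0o077:  # any group/other permission bit is set
            raise PermissionError(f"{path} has mode {oct(mode)}; expected 600")
        merged.update(parse_env_file(path))
    return merged
```

Refusing to start when a credential file is group- or world-readable enforces the chmod 600 convention instead of merely documenting it.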
For Cloud Deployments
If deploying to a cloud provider, use their secrets management:
- AWS: Secrets Manager or Systems Manager Parameter Store
- GCP: Secret Manager
- Azure: Key Vault
- Kubernetes: Sealed Secrets or External Secrets Operator
The principle is the same everywhere: secrets are injected at runtime, never baked into images, never committed to repositories.
Step 3: Monitoring and Observability
An AI agent you cannot observe is an AI agent you cannot trust. Monitoring has three layers: logging, metrics, and alerting.
Structured Logging
Unstructured logs ("Agent did something") are useless at scale. Structure every log entry:
import json
import os
import sys
from datetime import datetime, timezone

def log(level: str, event: str, **context):
    """Write one JSON object per line to stderr."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "event": event,
        "agent_id": os.getenv("AGENT_ID", "default"),
        **context,
    }
    print(json.dumps(entry), file=sys.stderr)

# Usage
log("info", "task_started", task_id="abc-123", model="claude-sonnet")
log("warn", "rate_limited", provider="anthropic", retry_after_s=30)
log("error", "tool_failed", tool="database_query", error="connection refused")
Structured logs let you filter, aggregate, and alert on specific patterns. "Show me all errors from the last hour where the model was claude-opus" becomes a simple query instead of a grep nightmare.
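To make that query concrete, here is a rough sketch of it over a file of JSON log lines. It assumes the entry shape shown above (ISO-8601 `timestamp`, `level`, plus arbitrary context fields); `filter_logs` is a hypothetical helper, and in practice a log platform would usually run this query for you.

```python
import json
from datetime import datetime, timedelta, timezone


def filter_logs(lines, level, since, **fields):
    """Yield parsed entries at `level`, newer than `since`, matching `fields`."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as stack traces
        if not isinstance(entry, dict) or entry.get("level") != level:
            continue
        timestamp = entry.get("timestamp")
        if timestamp is None or datetime.fromisoformat(timestamp) < since:
            continue
        if all(entry.get(key) == value for key, value in fields.items()):
            yield entry


def errors_last_hour(lines, **fields):
    """All error entries from the last hour that match `fields`."""
    since = datetime.now(timezone.utc) - timedelta(hours=1)
    return list(filter_logs(lines, "error", since, **fields))
```

For example, `errors_last_hour(open("agent.err.log"), model="claude-opus")` would return the matching entries as dictionaries (the file name is a placeholder).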
Key Metrics to Track
These are the metrics that actually matter for AI agent operations:
| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| Request latency (p95) | User experience | > 10s |
| Error rate | Reliability | > 5% over 5 min |
| Token usage per request | Cost control | > 2x daily average |
| Active tasks | Capacity | > queue depth limit |
| Memory usage | Stability | > 80% of limit |
| Model API latency | Dependency health | > 5s |
| Cost per hour | Budget | > 2x hourly budget |
| Circuit breaker state | Failure cascade | Any open breaker |
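As a rough illustration of how two of these thresholds could be evaluated over a window of recent requests, consider the sketch below. The function names are assumptions, and the nearest-rank percentile is one of several common p95 definitions; a metrics system such as Prometheus would normally compute these for you.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def error_rate(statuses):
    """Fraction of requests whose status is 'error' (0.0 when idle)."""
    if not statuses:
        return 0.0
    return sum(1 for status in statuses if status == "error") / len(statuses)


def breached(latencies_s, statuses):
    """Names of the table's thresholds that this window exceeds."""
    alerts = []
    if latencies_s and p95(latencies_s) > 10:
        alerts.append("latency_p95")
    if error_rate(statuses) > 0.05:
        alerts.append("error_rate")
    return alerts
```

Using p95 rather than the mean keeps one slow outlier from triggering an alert while still catching a consistently slow tail.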
Prometheus Metrics Endpoint
from prometheus_client import Counter, Histogram, Gauge, start_http_server
# Counters
requests_total = Counter(
"agent_requests_total",
"Total requests processed",
["model", "status"]
)
tokens_used = Counter(
"agent_tokens_total",
"Total tokens consumed",
["model", "direction"] # direction: input or output
)
# Histograms
request_duration = Histogram(
"agent_request_duration_seconds",
"Request processing time",
["model"],
buckets=[0.5, 1, 2, 5, 10, 30, 60]
)
# Gauges
active_tasks = Gauge("agent_active_tasks", "Currently running tasks")
cost_estimate = Gauge("agent_cost_estimate_hourly", "Estimated hourly cost in USD")
# Expose on /metrics
start_http_server(9090)
Health Check Endpoint
Every production agent needs a health endpoint. This is not optional.
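import time

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
start_time = time.time()
last_successful_inference = time.time()  # update after every successful model call

@app.get("/health")
async def health():
    # check_model_proxy() and check_memory() are your own async probes returning bool
    checks = {
        "model_proxy": await check_model_proxy(),
        "memory_store": await check_memory(),
        "last_inference_age_s": time.time() - last_successful_inference,
    }
    healthy = all([
        checks["model_proxy"],
        checks["memory_store"],
        checks["last_inference_age_s"] < 300,  # Degraded if no inference in 5 min
    ])
    body = {
        "status": "healthy" if healthy else "degraded",
        "checks": checks,
        "uptime_s": time.time() - start_time,
    }
    # Return 503 when degraded so `curl -f` in the Docker HEALTHCHECK actually fails
    return JSONResponse(body, status_code=200 if healthy else 503)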
Alerting
Monitoring without alerting is just data collection. Set up alerts for the failure modes that matter:
# Prometheus alerting rules (firing alerts are routed by Alertmanager)
groups:
  - name: agent_alerts
    rules:
      - alert: AgentHighErrorRate
        # Ratio of failed to total requests, not the raw errors-per-second rate
        expr: |
          sum(rate(agent_requests_total{status="error"}[5m]))
            / sum(rate(agent_requests_total[5m])) > 0.05
        for: 2m
        annotations:
          summary: "Agent error rate above 5% for 2 minutes"
      - alert: AgentCostSpike
        expr: agent_cost_estimate_hourly > 10
        for: 5m
        annotations:
          summary: "Hourly cost exceeds $10"
      - alert: AgentUnresponsive
        expr: up{job="agent"} == 0
        for: 1m
        annotations:
          summary: "Agent metrics endpoint unreachable"
Step 4: Always-On Deployment with systemd and LaunchAgent
Docker is not the only way to keep an agent running. For single-machine deployments -- especially development machines or dedicated hardware -- process managers provide lower overhead and tighter OS integration.
systemd on Linux
systemd is the standard process manager on modern Linux distributions. A systemd service file gives you automatic restarts, logging, resource limits, and boot-time startup.
# /etc/systemd/system/ai-agent.service
[Unit]
Description=AI Agent Service
After=network-online.target
Wants=network-online.target
StartLimitIntervalSec=300
StartLimitBurst=5
[Service]
Type=simple
User=agent
Group=agent
WorkingDirectory=/opt/agent
ExecStart=/opt/agent/venv/bin/python -m agent.main
Restart=always
RestartSec=10
# Secrets from credential store
EnvironmentFile=/opt/agent/.env
# Resource limits
MemoryMax=2G
CPUQuota=200%
# Security hardening
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/opt/agent/data
# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=ai-agent
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable ai-agent
sudo systemctl start ai-agent
# Check status
systemctl status ai-agent
# Tail logs
journalctl -u ai-agent -f
The StartLimitIntervalSec and StartLimitBurst settings prevent rapid crash loops -- if the agent crashes 5 times in 5 minutes, systemd stops trying and marks the service as failed, giving you a clear signal that something is fundamentally wrong rather than endlessly restarting.
LaunchAgent on macOS
On macOS, launchd is the system process manager. This is how Nevo runs -- 11 LaunchAgent services coordinating the full stack on a single Mac Studio.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.myagent.daemon</string>
<key>Comment</key>
<string>AI Agent — always-on autonomous assistant</string>
<!-- Start on login, keep alive forever -->
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<!-- Prevent rapid restart loops (min 10s between restarts) -->
<key>ThrottleInterval</key>
<integer>10</integer>
<key>ProgramArguments</key>
<array>
<string>/usr/local/bin/python3</string>
<string>/opt/agent/main.py</string>
</array>
<key>WorkingDirectory</key>
<string>/opt/agent</string>
<!-- Log output -->
<key>StandardOutPath</key>
<string>/opt/agent/logs/agent.log</string>
<key>StandardErrorPath</key>
<string>/opt/agent/logs/agent.err.log</string>
<!-- Environment -->
<key>EnvironmentVariables</key>
<dict>
<key>HOME</key>
<string>/Users/yourusername</string>
<key>PATH</key>
<string>/usr/local/bin:/usr/bin:/bin</string>
</dict>
</dict>
</plist>
Install and manage:
# Copy plist to LaunchAgents directory
cp com.myagent.daemon.plist ~/Library/LaunchAgents/
# Load and start
launchctl load ~/Library/LaunchAgents/com.myagent.daemon.plist
# Check if running
launchctl list | grep myagent
# Unload (stop)
launchctl unload ~/Library/LaunchAgents/com.myagent.daemon.plist
# View logs
tail -f /opt/agent/logs/agent.log
Key differences from systemd:
- KeepAlive: true is the equivalent of Restart=always -- launchd relaunches the process if it exits for any reason.
- ThrottleInterval prevents restart storms. If the process exits within 10 seconds, launchd waits before relaunching.
- RunAtLoad: true starts the agent when the user logs in (LaunchAgents) or when the system boots (LaunchDaemons in /Library/LaunchDaemons/).
- No cgroup-style resource limits -- launchd offers only rlimit-based SoftResourceLimits and HardResourceLimits keys, with no equivalent of systemd's MemoryMax or CPUQuota. Use container-based deployment if you need hard memory or CPU caps.
For a real-world example: Nevo runs 11 LaunchAgent services including the core daemon, a model routing proxy, a cron watchdog, memory consolidation, and upstream monitoring -- all coordinated through launchd with individual log files and automatic restart on failure.
When to Use Which
| Approach | Best For | Overhead |
|---|---|---|
| Docker Compose | Cloud deployment, team environments, multi-machine | Medium |
| systemd | Linux servers, VPS, dedicated hardware | Low |
| LaunchAgent | macOS workstations, Mac Mini/Studio servers | Low |
| Kubernetes | Large-scale, multi-agent, auto-scaling | High |
For a single agent on dedicated hardware, systemd or LaunchAgent is simpler and more efficient. For deployable, reproducible, team-managed systems, Docker. For scale-out architectures, Kubernetes.
Step 5: Scaling Strategies
A single-process agent handles one request at a time. That works until it does not. Scaling AI agents is different from scaling traditional web services because the bottleneck is usually external -- LLM API rate limits, not your own compute.
Horizontal Scaling with a Task Queue
The most effective pattern for scaling AI agents is queue-based processing. Requests go into a queue. Multiple agent workers pull from the queue and process independently.
# worker.py -- one of N identical agent workers
import json

import redis

r = redis.Redis()

while True:
    # Block until a task is available
    _, task_json = r.brpop("agent:tasks")
    task = json.loads(task_json)
    try:
        result = process_task(task)  # your agent's entry point, defined elsewhere
        r.lpush(f"agent:results:{task['id']}", json.dumps(result))
    except Exception as e:
        r.lpush(f"agent:results:{task['id']}", json.dumps({
            "error": str(e),
            "task_id": task["id"],
        }))
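The worker above covers only the consuming side. A matching producer could look roughly like the following sketch, which assumes `client` is a `redis.Redis` instance and reuses the worker's key names; `submit_task` and `wait_for_result` are hypothetical helpers.

```python
import json
import uuid


def submit_task(client, payload, queue="agent:tasks"):
    """Enqueue a task for the workers and return its generated id."""
    task_id = str(uuid.uuid4())
    client.lpush(queue, json.dumps({**payload, "id": task_id}))
    return task_id


def wait_for_result(client, task_id, timeout_s=120):
    """Block until a worker publishes the result; None on timeout."""
    item = client.brpop(f"agent:results:{task_id}", timeout=timeout_s)
    if item is None:
        return None
    _, result_json = item
    return json.loads(result_json)
```

Because the producer pushes with LPUSH and workers pop with BRPOP, tasks are processed in first-in, first-out order. The timeout keeps a caller from waiting forever if every worker is down.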
Scale by running more workers:
# docker-compose.yml (scaling section)
services:
  agent-worker:
    build: .
    command: python -m agent.worker
    deploy:
      replicas: 4
    env_file: .env
    depends_on:
      - redis
      - model-proxy
Rate Limit Awareness
LLM APIs enforce rate limits. Your scaling strategy must account for this, or you will hit 429 errors and degrade service for everyone.
import asyncio
import time
from collections import deque

class RateLimiter:
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps = deque()

    async def acquire(self):
        while True:
            now = time.time()
            # Remove timestamps outside the window
            while self.timestamps and self.timestamps[0] <= now - self.window:
                self.timestamps.popleft()
            if len(self.timestamps) < self.max_requests:
                self.timestamps.append(now)
                return
            # Wait until the oldest request leaves the window, then re-check
            await asyncio.sleep(self.timestamps[0] + self.window - now)

# Usage: shared by all coroutines in one process. Separate worker processes each
# get their own copy, so enforce a cross-process limit in Redis or the model proxy.
rate_limiter = RateLimiter(max_requests=60, window_seconds=60)

async def call_model(prompt: str):
    await rate_limiter.acquire()
    return await model_client.complete(prompt)  # model_client: your LLM client
Load Balancing Across Model Providers
If you use multiple LLM providers (or multiple API keys), a model proxy like LiteLLM distributes load and handles failover:
# litellm-config.yaml
model_list:
  - model_name: agent-default
    litellm_params:
      model: claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      max_tokens: 8192
  - model_name: agent-default
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      max_tokens: 4096

router_settings:
  routing_strategy: least-busy
  num_retries: 3
  timeout: 120
  # agent-fallback must also be defined in model_list (omitted here)
  fallbacks:
    - agent-default: [agent-fallback]
This configuration routes to whichever provider has the lowest current load, retries failed requests up to 3 times, and falls back to alternative models if the primary provider is down.
Step 6: Cost Management
LLM API costs are the largest operational expense for most AI agent systems. Without active cost management, a productive agent can easily cost thousands per month.
Model Routing for Cost Optimization
Not every task needs your most expensive model. Route by complexity:
MODEL_TIERS = {
    "fast": {
        "model": "claude-3-5-haiku-20241022",
        "cost_per_1k_input": 0.0008,
        "cost_per_1k_output": 0.004,
        "use_for": ["classification", "formatting", "simple_extraction"],
    },
    "standard": {
        "model": "claude-sonnet-4-20250514",
        "cost_per_1k_input": 0.003,
        "cost_per_1k_output": 0.015,
        "use_for": ["code_generation", "analysis", "summarization"],
    },
    "complex": {
        "model": "claude-opus-4-20250514",
        "cost_per_1k_input": 0.015,
        "cost_per_1k_output": 0.075,
        "use_for": ["architecture", "root_cause_analysis", "code_review"],
    },
}

def select_model(task_type: str) -> str:
    """Return the model for the first tier that lists this task type."""
    for tier in MODEL_TIERS.values():
        if task_type in tier["use_for"]:
            return tier["model"]
    return MODEL_TIERS["standard"]["model"]  # Default to mid-tier
Nevo uses this exact pattern with LiteLLM routing 6 model aliases across 3 Anthropic tiers. Simple tasks like type checking and linting go to Haiku. Standard work goes to Sonnet. Complex reasoning -- code review, root cause analysis, architectural decisions -- goes to Opus. The cost difference between routing everything to Opus versus tiered routing can be 5-10x.
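To see where a multiple like that comes from, here is a back-of-the-envelope sketch using the per-1K prices from MODEL_TIERS. The workload itself is a made-up assumption: 1,000 tasks of 2,000 input and 1,000 output tokens each, split 60/30/10 across the tiers.

```python
# Per-1K-token prices (input, output) in USD, copied from MODEL_TIERS.
PRICES = {
    "fast": (0.0008, 0.004),
    "standard": (0.003, 0.015),
    "complex": (0.015, 0.075),
}


def task_cost(tier, input_tokens, output_tokens):
    """Cost in USD of one task at the given tier."""
    price_in, price_out = PRICES[tier]
    return input_tokens / 1000 * price_in + output_tokens / 1000 * price_out


def workload_cost(mix, input_tokens=2000, output_tokens=1000):
    """Total cost for a {tier: task_count} mix of equally sized tasks."""
    return sum(count * task_cost(tier, input_tokens, output_tokens)
               for tier, count in mix.items())


# Hypothetical mix: 60% simple, 30% standard, 10% complex.
tiered = workload_cost({"fast": 600, "standard": 300, "complex": 100})
all_opus = workload_cost({"complex": 1000})
print(f"tiered=${tiered:.2f} all_opus=${all_opus:.2f} ratio={all_opus / tiered:.1f}x")
```

With this particular mix the tiered bill is about $20 against $105 for all-Opus, roughly a 5x difference. The ratio grows as the share of simple tasks increases.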
Cost Tracking and Budgets
Track every API call and enforce budget limits:
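from dataclasses import dataclass
from datetime import datetime, timezone

class BudgetExceededError(Exception):
    """Raised when the daily budget is exhausted."""

@dataclass
class CostRecord:
    timestamp: datetime
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    task_id: str

class CostTracker:
    def __init__(self, daily_budget_usd: float):
        self.daily_budget = daily_budget_usd
        self.records: list[CostRecord] = []

    def _calculate_cost(self, model: str, input_tokens: int,
                        output_tokens: int) -> float:
        # Look up per-1K prices in MODEL_TIERS (defined in the routing example)
        tier = next(t for t in MODEL_TIERS.values() if t["model"] == model)
        return (input_tokens / 1000 * tier["cost_per_1k_input"]
                + output_tokens / 1000 * tier["cost_per_1k_output"])

    def _daily_total(self) -> float:
        today = datetime.now(timezone.utc).date()
        return sum(r.cost_usd for r in self.records
                   if r.timestamp.date() == today)

    def record(self, model: str, input_tokens: int,
               output_tokens: int, task_id: str):
        cost = self._calculate_cost(model, input_tokens, output_tokens)
        self.records.append(CostRecord(
            timestamp=datetime.now(timezone.utc),
            model=model, input_tokens=input_tokens,
            output_tokens=output_tokens, cost_usd=cost,
            task_id=task_id,
        ))
        daily_total = self._daily_total()
        if daily_total > self.daily_budget * 0.8:
            log("warn", "cost_threshold",
                daily_total=daily_total, budget=self.daily_budget)
        if daily_total > self.daily_budget:
            raise BudgetExceededError(
                f"Daily budget of ${self.daily_budget} exceeded: ${daily_total:.2f}")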
Local vs Cloud: The Cost Equation
The decision to run locally versus in the cloud is primarily a cost decision, not a technical one.
| Factor | Local (Dedicated Hardware) | Cloud (VPS/Container) |
|---|---|---|
| Upfront cost | $500-$5,000 (hardware) | $0 |
| Monthly cost | ~$10 (electricity) | $20-$200 (compute) |
| LLM API cost | Same | Same |
| Scaling | Buy more hardware | Click a button |
| Maintenance | You handle everything | Provider handles infra |
| Latency | Lowest (local network) | Variable |
| Data privacy | Complete | Depends on provider |
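For a single-user deployment such as a private AI agent, local hardware becomes cheaper once the cloud fees it replaces add up to the purchase price: about 3 months for a $500 machine replacing a $200/month instance, and far longer for expensive hardware replacing a cheap VPS. For multi-user systems that need elastic scaling, cloud deployment makes more sense despite the ongoing cost.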
The LLM API cost is the same either way -- and it is usually the dominant expense. A system spending $5/day on API calls and $20/month on a VPS is spending 88% of its budget on the API. Optimize model routing first. Cloud versus local is a secondary decision.
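The arithmetic behind these figures is simple enough to keep in a script and rerun with your own numbers. The sketch below assumes a 30-day month and the table's ~$10/month electricity estimate; both function names are illustrative.

```python
def api_share(api_usd_per_day, infra_usd_per_month, days=30):
    """Fraction of monthly spend that goes to the LLM API."""
    api_monthly = api_usd_per_day * days
    return api_monthly / (api_monthly + infra_usd_per_month)


def breakeven_months(hardware_usd, cloud_usd_per_month,
                     electricity_usd_per_month=10):
    """Months until owned hardware beats renting; None if it never does."""
    monthly_saving = cloud_usd_per_month - electricity_usd_per_month
    if monthly_saving <= 0:
        return None
    return hardware_usd / monthly_saving


print(f"API share of spend: {api_share(5, 20):.0%}")
print(f"Break-even: {breakeven_months(500, 200):.1f} months")
```

The first line reproduces the 88% figure ($150 of $170). The second shows a $500 machine replacing a $200/month instance breaking even in under three months, while the same function returns `None` when the VPS costs no more than the electricity.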
Step 7: Reliability Engineering
A production agent must handle failure gracefully. Networks drop. APIs return errors. Models hallucinate. The system should degrade, not collapse.
Circuit Breakers
A circuit breaker prevents cascade failures. When an external service (like an LLM API) starts failing, the circuit breaker stops sending requests -- giving the service time to recover instead of hammering it with doomed requests.
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5,
                 recovery_timeout: int = 60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0  # consecutive failures
        self.state = "closed"  # closed = normal, open = blocking
        self.last_failure_time = 0.0

    async def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = "half-open"  # Try one request
            else:
                raise CircuitOpenError("Circuit breaker is open")
        try:
            result = await func(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.last_failure_time = time.time()
            # A failed half-open probe re-opens the circuit immediately
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                log("error", "circuit_opened",
                    failures=self.failures,
                    service=getattr(func, "__name__", repr(func)))
            raise
        # Any success closes the circuit and resets the consecutive-failure count
        self.state = "closed"
        self.failures = 0
        return result
Nevo uses circuit breakers in its Ralph loop execution system. After 3 consecutive failures on the same task, the circuit breaker opens and the system escalates to a human rather than burning through tokens on a task it cannot complete.
Restart Policies and Graceful Shutdown
Production agents need to handle restarts without losing work:
import signal

class GracefulAgent:
    def __init__(self):
        self.running = True
        self.current_task = None
        signal.signal(signal.SIGTERM, self._shutdown)
        signal.signal(signal.SIGINT, self._shutdown)

    def _shutdown(self, signum, frame):
        # Only set a flag; the in-flight task is allowed to finish
        log("info", "shutdown_requested", signal=signum,
            task_id=self.current_task["id"] if self.current_task else None)
        self.running = False

    async def run(self):
        while self.running:
            # get_next_task() should time out periodically and return None
            # so the loop notices a shutdown request while idle
            task = await self.get_next_task()
            if task:
                self.current_task = task
                try:
                    await self.process_task(task)
                    await self.checkpoint(task)  # Persist progress
                finally:
                    self.current_task = None
        # The loop exits only between tasks, so no work is left half-done here
        log("info", "shutdown_complete")
The pattern: catch shutdown signals, finish the current unit of work, persist state, then exit cleanly. This prevents the most common production issue -- an agent that loses progress on every restart.
Retry with Exponential Backoff
Transient failures (network blips, rate limits, temporary API errors) should be retried. Permanent failures (invalid API key, malformed request) should not.
import asyncio
import random

class TransientError(Exception):
    """Retryable: timeouts, rate limits, temporary API errors."""

class PermanentError(Exception):
    """Not retryable: invalid API key, malformed request."""

async def retry_with_backoff(func, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return await func()
        except TransientError as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            log("warn", "retrying", attempt=attempt + 1,
                delay_s=round(delay, 1), error=str(e))
            await asyncio.sleep(delay)
        except PermanentError:
            raise  # Do not retry permanent failures
The jitter (random.uniform(0, 1)) prevents the thundering herd problem -- when multiple workers retry at exactly the same time, creating a synchronized spike.
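To make the resulting schedule visible, the sketch below computes the delays that the formula `base_delay * (2 ** attempt) + random.uniform(0, 1)` would produce between attempts. `backoff_delays` is a hypothetical helper written for illustration; its `jitter` and `rng` parameters exist only so the schedule can be inspected deterministically.

```python
import random


def backoff_delays(max_retries=5, base_delay=1.0, jitter=1.0, rng=random):
    """Delays slept between attempts: base * 2**attempt plus uniform jitter."""
    return [base_delay * (2 ** attempt) + rng.uniform(0, jitter)
            for attempt in range(max_retries - 1)]  # no sleep after the last try


# With jitter disabled the pure exponential schedule is visible.
print(backoff_delays(jitter=0))
```

With the defaults, five attempts sleep for about 1, 2, 4, and 8 seconds plus up to one second of jitter each, so the worst case is roughly 19 seconds of waiting before the final error is raised.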
Putting It All Together: Deployment Checklist
Before declaring your AI agent production-ready, verify each of these:
Containerization
- Dockerfile builds reproducibly
- Non-root user in container
- Health check defined
- Resource limits set
- Log rotation configured
Secrets
- No credentials in source code or images
- .env in .gitignore
- Credential rotation procedure documented
- File permissions set (600 for credential files)
Monitoring
- Structured logging implemented
- Health endpoint responds with dependency status
- Cost tracking per model, per task
- Alerts configured for error rate, cost, and availability
Scaling
- Queue-based task processing (if multi-user)
- Rate limiter for LLM API calls
- Model routing by task complexity
Reliability
- Circuit breakers on external dependencies
- Graceful shutdown handler
- Retry with exponential backoff
- State persistence across restarts
- Restart policy configured (systemd/LaunchAgent/Docker)
Cost
- Model routing by complexity tier
- Daily/monthly budget limits
- Cost alerts at 80% threshold
- Token usage monitored per task type
FAQ
How much does it cost to run an AI agent in production?
The cost depends almost entirely on LLM API usage, not infrastructure. A lightly used personal agent costs $5-$30/month in API fees. A heavily used agent processing hundreds of tasks daily can cost $500-$2,000/month. Infrastructure costs (VPS, database) are typically 10-20% of total spend. Model routing -- sending simple tasks to cheaper models -- is the single most effective cost optimization.
Should I run my AI agent locally or in the cloud?
For a single-user agent, local deployment on dedicated hardware (Mac Mini, Linux server) becomes cheaper once the monthly compute fees it replaces add up to the hardware price. That can take as little as 2-3 months when a modest machine replaces a high-end instance, and much longer when it replaces a cheap VPS. For multi-user systems that need elastic scaling, cloud deployment is more practical. The LLM API cost is identical either way, and it is usually the largest expense.
How do I prevent runaway API costs from a bug?
Three layers of protection: (1) set daily budget limits that halt processing when exceeded, (2) implement circuit breakers that stop after consecutive failures, and (3) configure cost alerts at 80% of your budget threshold. Without all three, a single bug in an agent loop can generate thousands of dollars in API calls overnight.
What is the best way to keep an AI agent running 24/7?
On Linux, use a systemd service with Restart=always. On macOS, use a LaunchAgent with KeepAlive: true. For cloud deployments, use Docker with restart: unless-stopped. All three approaches automatically restart the agent after crashes and reboots. These restart policies react only to a process that has exited, so add health checks plus something that acts on them (an orchestrator, a watchdog, or an external monitor) to catch an agent that is stuck rather than dead.
Do I need Kubernetes to run AI agents at scale?
Not necessarily. Queue-based scaling with Docker Compose handles most workloads up to hundreds of concurrent users. Kubernetes becomes valuable when you need automatic scaling based on queue depth, zero-downtime deployments, or you are running dozens of different agent services. For most teams, Docker Compose with a Redis task queue is sufficient and dramatically simpler to operate.
How do I monitor an AI agent in production?
Three layers: structured JSON logging (for debugging), Prometheus metrics (for dashboards and trends), and alerting rules (for incidents). The critical metrics are error rate, request latency, token usage, and cost per hour. A health endpoint that checks all dependencies (LLM API, database, message queue) lets your process manager or load balancer detect failures automatically.
What to Read Next
This guide covered the infrastructure side of AI agent deployment. For the agent-building fundamentals that precede deployment, see How to Build an AI Agent System from Scratch. For the architectural patterns that make agent systems maintainable at scale, see AI Agent Systems. And if data sovereignty matters for your deployment, Private AI Agents covers running everything on hardware you control.
Production deployment is not a one-time event. It is an ongoing practice of monitoring, optimizing, and hardening. The best production agents are not the ones that never fail -- they are the ones that recover so fast, no one notices.