Running AI Agents on Raspberry Pi: The Complete Hardware Guide
A Raspberry Pi 5 costs $80. It draws less power than a nightlight. And it can run an AI agent that works for you around the clock -- orchestrating tasks, searching your documents, managing your smart home, and learning from every interaction -- while your main computer stays free for actual work.
This is not a science project. The Pi 5's quad-core Cortex-A76 processor handles agent orchestration, tool execution, and memory management without breaking a sweat. For heavy reasoning, you route to cloud APIs or run quantized small language models locally. The result is a dedicated AI appliance that costs less per year in electricity than a single month of most SaaS subscriptions.
Running an AI agent on a Raspberry Pi is a specific instance of a broader trend called edge AI -- moving intelligence closer to where it is needed. For a general overview of how AI agents work, see "What Are AI Agents?" For more on running agents on your own hardware, see "Private AI Agents: Running AI on Your Own Hardware."
Why a Raspberry Pi for AI Agents?
Most people who run AI agents do it on their laptop. That works until it doesn't. The agent consumes your CPU during a complex task. Your fans scream. Your battery drains. And when you close the lid, everything stops.
A Raspberry Pi running as a dedicated agent host solves three problems at once:
Always-on operation. The Pi runs 24/7 without tying up your personal machine. Send it a task from your phone at midnight, and it is already working when you wake up.
Cost efficiency. At 5-8 watts under AI workload, the Pi 5 costs approximately $5-8 per year in electricity. A Mac Mini running the same role uses 20-40 watts. A desktop PC uses 65-200 watts. Compared with a desktop PC, three years of power savings alone exceed the cost of the Pi.
Dedicated resources. When the agent has its own hardware, there is no resource contention. No background processes competing for CPU. No browser tabs eating RAM. The agent gets everything the machine has, every time.
The trade-off is straightforward: the Pi is not powerful enough to run large language models locally. But it does not need to be. The architecture that makes a Pi-based agent practical is a hybrid one -- local orchestration with cloud inference. More on that below.
Raspberry Pi 5: What You Are Working With
The Raspberry Pi 5, released in late 2023, is a significant jump from its predecessor. Here are the specs that matter for AI agent workloads:
| Specification | Detail |
|---|---|
| CPU | Broadcom BCM2712, quad-core Cortex-A76 at 2.4 GHz |
| RAM | 8 GB or 16 GB LPDDR4X |
| Storage | MicroSD or NVMe SSD via M.2 HAT |
| Networking | Gigabit Ethernet, dual-band Wi-Fi 5, Bluetooth 5.0 |
| USB | 2x USB 3.0, 2x USB 2.0 |
| Power | 5V/5A USB-C (27W power supply recommended) |
| Idle power | ~2.7 watts |
| AI workload power | ~5-8 watts |
| OS | Raspberry Pi OS (64-bit Debian-based) |
The 8 GB model is the minimum for running AI agents. The 16 GB model, available since early 2025, opens the door to larger local models and more comfortable multitasking. If you are buying new, get the 16 GB version.
Storage: NVMe Is Not Optional
An AI agent performs constant small reads and writes -- memory updates, document indexing, log entries, tool output caching. On an SD card, this pattern degrades performance within weeks and risks data corruption within months.
An NVMe SSD connected via the Pi 5's M.2 HAT solves this completely. A 256 GB NVMe drive costs under $25 and delivers 10-50x the random I/O performance of even the fastest microSD card. For an always-on agent, this is not an upgrade -- it is a requirement.
Cooling: Active Cooling Under Load
The Pi 5 will thermal-throttle under sustained inference workloads without adequate cooling. An active cooler (the official Raspberry Pi Active Cooler is $5) keeps the CPU at safe temperatures even during prolonged model inference. Passive cooling works for light orchestration tasks but is insufficient for continuous local LLM generation.
What Can You Actually Run on a Pi 5?
This is where expectations need to be precise. The Pi 5 is capable hardware, but it is not a GPU cluster. Here is what works, what works with caveats, and what does not.
What Works Well
Agent orchestration. Managing task queues, routing between tools, maintaining state, coordinating sub-tasks -- all of this is lightweight CPU work. The Pi handles it effortlessly.
Document search and retrieval. Local search engines like QMD (BM25 + GGUF embeddings) run efficiently on ARM. Indexing hundreds of documents and retrieving results in milliseconds is well within the Pi's capabilities.
Memory pipelines. Writing session logs, consolidating long-term memory, extracting facts from conversations -- these are I/O-bound tasks that benefit from NVMe storage, not GPU compute.
Tool execution. Running shell commands, calling APIs, managing files, sending notifications -- the bread and butter of what makes an agent useful rather than just conversational.
Small language model inference. Models under 4 billion parameters, quantized to 4-bit precision, run at usable speeds for many applications.
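To make the document search item above concrete, here is a minimal sketch of BM25 ranking in plain Python. It is not QMD's implementation; the function names, the whitespace tokenizer, and the parameters `k1=1.5` and `b=0.75` are illustrative assumptions. It shows why keyword retrieval is cheap enough for the Pi's CPU: scoring is a handful of counts and logarithms per document.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; a real search engine would do more.
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical sample documents.
docs = [
    "raspberry pi nvme setup notes",
    "smart home lighting schedule",
    "pi cooling and overclock log",
]
scores = bm25_scores("pi nvme", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

A production index would precompute term statistics once instead of recounting them on every query, but the scoring logic stays the same.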
What Works With Caveats
Local LLM inference (3-7B models). It works, but it is slow. Expect 4-8 tokens per second for a 3-4B parameter model and 2-4 tokens per second for a 7B model. Fine for background processing. Painful for interactive chat.
Embedding generation. Small embedding models (under 500M parameters) run at acceptable speeds for indexing workflows that are not time-critical.
What Does Not Work
Large language model inference (13B+). Models above 7B parameters either do not fit in memory or run so slowly they are unusable. Do not try to run Llama 3.1 70B on a Pi. It will not work.
Real-time voice processing. Speech-to-text and text-to-speech at conversational speeds require more compute than the Pi delivers.
Image generation. Stable Diffusion and similar models require GPU acceleration. The Pi's VideoCore VII is not up to the task.
Local Inference: Benchmarks That Matter
If you plan to run models locally on the Pi, here is what to expect. All benchmarks use the Raspberry Pi 5 8 GB with active cooling at stock clocks. Performance on the 16 GB model is comparable -- the extra RAM lets you run larger models, not faster ones.
Token Generation Speed (llama.cpp, Q4_K_M quantization)
| Model | Parameters | RAM Usage | Tokens/sec | Verdict |
|---|---|---|---|---|
| TinyLlama 1.1B | 1.1B | ~1.2 GB | 12-18 tok/s | Usable for simple tasks |
| Gemma 3 1B | 1B | ~1.0 GB | 14-20 tok/s | Best efficiency on Pi |
| Qwen 2.5 3B | 3B | ~2.2 GB | 5-8 tok/s | Good accuracy, slow |
| Phi-3 Mini 3.8B | 3.8B | ~2.8 GB | 4-7 tok/s | Decent quality, borderline speed |
| Llama 3.2 3B | 3B | ~2.4 GB | 5-7 tok/s | Solid all-rounder |
| Llama 2 7B | 7B | ~4.2 GB | 2-4 tok/s | Usable only for batch processing |
| Mistral 7B | 7B | ~4.5 GB | 2-3 tok/s | Too slow for interactive use |
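As a rough cross-check on the RAM column, the weight size of a quantized model can be estimated from its parameter count. The sketch below assumes about 4.85 bits per weight as an approximation for Q4_K_M and a 1.5 GB allowance for the OS, KV cache, and runtime; both numbers and both function names are assumptions for illustration, not measurements from this guide.

```python
def quantized_size_gb(params_billion, bits_per_weight=4.85):
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def fits_in_ram(params_billion, ram_gb, overhead_gb=1.5):
    """True if the weights plus an assumed overhead fit in physical RAM."""
    return quantized_size_gb(params_billion) + overhead_gb <= ram_gb

for size in (1, 3, 7, 13):
    print(f"{size}B: ~{quantized_size_gb(size):.1f} GB weights, "
          f"fits in 8 GB: {fits_in_ram(size, 8)}")
```

The estimate covers weights only, so the RAM figures in the table, which include runtime buffers, come out somewhat higher for the smaller models.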
Key Observations
Gemma 3 1B is the sweet spot. It combines the lowest resource usage with the highest throughput and surprisingly good output quality for a 1B model. If you are deploying on Pi and need local inference, start here.
llama.cpp outperforms Ollama by 10-20%. Ollama adds a convenience layer (API server, model management, pre-quantized model downloads), but llama.cpp delivers better raw performance. For a dedicated Pi deployment where you control the stack, llama.cpp is the better choice.
Qwen 2.5 3B leads on accuracy. If output quality matters more than speed -- for example, in a knowledge-base agent that processes queries in the background -- Qwen consistently scores highest among sub-4B models.
7B models are batch-only. At 2-4 tokens per second, a 7B model takes 30-60 seconds to generate a typical response. This is acceptable for background tasks where the user is not waiting. It is unacceptable for interactive use.
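The 30-60 second figure is simple division. The sketch below assumes a "typical response" of 120 tokens, which is an illustrative number rather than one stated in the benchmarks.

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Time to generate a response at a steady decode speed."""
    return num_tokens / tokens_per_second

for tps in (2, 4):
    print(f"{tps} tok/s: {generation_seconds(120, tps):.0f} s for 120 tokens")
```

Longer outputs scale linearly, so a 500-token answer at 2 tokens per second takes over four minutes.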
The Hybrid Architecture: Where the Pi Actually Shines
Here is the insight that makes a Pi-based agent practical: the Pi does not need to do everything locally. It needs to orchestrate everything locally while routing heavy inference to the right backend.
A hybrid AI agent architecture on Raspberry Pi uses local compute for agent orchestration, memory, tool execution, and lightweight inference while routing complex reasoning to cloud APIs or more powerful local machines.
This is exactly how Nevo operates. The agent runtime -- task management, memory pipeline, document search, tool execution, quality gates -- runs locally. When the agent needs to reason about a complex problem, write substantial code, or analyze a large document, it sends that specific request to a cloud API (like Anthropic's Claude) and processes the result locally.
How It Works in Practice
User sends task via Telegram
|
v
[Raspberry Pi — Local Agent Runtime]
- Parses task, plans execution steps
- Searches local documents (QMD)
- Reads relevant files from local storage
- Decides which steps need LLM reasoning
|
v
[Cloud API — Heavy Inference]
- Complex reasoning (Claude, GPT-4, etc.)
- Code generation
- Long document analysis
|
v
[Raspberry Pi — Local Processing]
- Receives API response
- Executes tools (shell, file ops, APIs)
- Updates memory
- Returns result to user
Why This Architecture Works
Latency is acceptable. A cloud API call takes 1-3 seconds. For an agent that is executing a multi-step task autonomously, this is negligible compared to the overall task duration.
Cost is predictable. You pay per API call, not per hour of compute. A typical agent task might make 5-20 API calls costing $0.05-0.50 total. The Pi itself costs nothing beyond electricity.
Privacy where it matters. Your files, your memory, your task history -- all of that stays on the Pi. Only the specific prompts that require cloud reasoning leave the device, and you control exactly what gets sent.
Resilience. If the cloud API is down, the agent can still execute local tools, search documents, and queue tasks for later. The orchestration layer never stops.
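The routing and resilience behavior described above can be sketched in a few lines of Python. This is a hypothetical illustration, not Nevo's code: the `Router` class, the step names in `LOCAL_STEPS`, and the `cloud_available` flag are all assumptions.

```python
from collections import deque
from dataclasses import dataclass, field

# Steps the Pi handles itself (hypothetical names).
LOCAL_STEPS = {"search_documents", "run_tool", "update_memory"}

@dataclass
class Router:
    cloud_available: bool = True
    pending: deque = field(default_factory=deque)

    def route(self, step):
        """Return where a step runs: 'local', 'cloud', or 'queued'."""
        if step in LOCAL_STEPS:
            return "local"
        if self.cloud_available:
            return "cloud"
        self.pending.append(step)  # hold until the API is reachable again
        return "queued"

    def drain(self):
        """Release queued steps once the cloud API is back."""
        released = []
        while self.cloud_available and self.pending:
            released.append(self.pending.popleft())
        return released

router = Router()
print(router.route("run_tool"))           # handled on the Pi
print(router.route("complex_reasoning"))  # sent to the cloud API
```

Local steps never depend on the cloud flag, which is what keeps the orchestration layer running during an API outage.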
For a detailed walkthrough of setting up this architecture on Pi hardware, see How to Install Nevo on Raspberry Pi.
Setting Up an AI Agent on Raspberry Pi 5
Here is the practical setup for running an AI agent on a Pi 5.
Hardware You Need
| Component | Recommended | Cost |
|---|---|---|
| Raspberry Pi 5 | 8 GB model (16 GB recommended for local models, at a higher price) | $80 |
| NVMe SSD | 256 GB M.2 NVMe | $20-25 |
| M.2 HAT | Official Pi 5 M.2 HAT+ | $12 |
| Active Cooler | Official Pi Active Cooler | $5 |
| Power Supply | 27W USB-C PSU | $12 |
| Case | Aluminum with M.2 slot | $15-25 |
| Total | | $145-160 |
Software Stack
Operating system. Raspberry Pi OS Lite (64-bit). The Lite version skips the desktop environment, saving ~500 MB of RAM for your agent.
Runtime. Node.js 22 LTS for agent orchestration. Python 3.11+ for ML tooling.
Local inference (optional). llama.cpp compiled for ARM with NEON optimizations. Install from source for best performance:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CPU_ARM_ARCH=armv8.2-a+dotprod+fp16 .
cmake --build build --config Release -j4
Model management (optional). Ollama for easier model downloading and serving if you prefer convenience over raw performance:
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma3:1b
ollama pull qwen2.5:3b
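Once a model is pulled, the agent can call Ollama over its local HTTP API. The sketch below uses only the Python standard library and assumes Ollama is listening on its default port, 11434; the helper names `build_request` and `generate` are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt, model="gemma3:1b"):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, model="gemma3:1b", timeout=120):
    """Send the prompt and return the generated text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Summarize this log entry in one sentence: ...")` returns the model's text as a string. The generous timeout reflects the token rates in the benchmark table.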
Agent framework. This depends on your agent system. For Nevo, the stack is OpenClaw (daemon + messaging) and Claude Code (reasoning engine) with QMD for local document search.
Network Configuration
For an always-on agent, a static IP or DHCP reservation makes life easier. You will want to access the agent from other devices on your network:
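# Set a static IP. Raspberry Pi OS Bookworm and later (required for the Pi 5)
# use NetworkManager instead of dhcpcd, so /etc/dhcpcd.conf has no effect.
# "Wired connection 1" is the usual default name; confirm with: nmcli con show
sudo nmcli con mod "Wired connection 1" \
  ipv4.method manual \
  ipv4.addresses 192.168.1.100/24 \
  ipv4.gateway 192.168.1.1 \
  ipv4.dns "192.168.1.1 8.8.8.8"
sudo nmcli con up "Wired connection 1"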
Systemd Service
Register your agent as a systemd service so it starts automatically on boot and restarts on failure:
# /etc/systemd/system/ai-agent.service
[Unit]
Description=AI Agent Service
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=pi
WorkingDirectory=/home/pi/agent
ExecStart=/home/pi/agent/start.sh
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Power Consumption and Always-On Economics
One of the Pi's strongest arguments is its operating cost.
| State | Power Draw | Annual Cost (at $0.12/kWh) |
|---|---|---|
| Idle | 2.7W | $2.84 |
| Light orchestration | 4-5W | $4.20-5.26 |
| Active AI inference | 5-8W | $5.26-8.41 |
| Maximum sustained load | 10-12W | $10.51-12.61 |
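The annual figures in the table come from one formula: watts × 8,760 hours ÷ 1,000 × price per kWh. A minimal Python sketch (the function name is illustrative) lets you substitute your own electricity rate:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours of continuous operation

def annual_cost_usd(watts, usd_per_kwh=0.12):
    """Yearly electricity cost for a device drawing constant power."""
    kwh_per_year = watts * HOURS_PER_YEAR / 1000
    return round(kwh_per_year * usd_per_kwh, 2)

for watts in (2.7, 5, 8, 12):
    print(f"{watts} W -> ${annual_cost_usd(watts):.2f} per year")
```

At a rate of $0.30/kWh, for example, call `annual_cost_usd(8, usd_per_kwh=0.30)` to see the cost of continuous inference.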
Compare this to alternatives:
| Device | Typical AI Workload Power | Annual Cost |
|---|---|---|
| Raspberry Pi 5 | 5-8W | $5-8 |
| Intel NUC (i5) | 25-45W | $26-47 |
| Mac Mini M1 | 20-39W | $21-41 |
| Mac Mini M4 | 15-30W | $16-32 |
| Old laptop | 30-65W | $32-68 |
| Desktop PC | 65-200W | $68-210 |
The Pi is the only option where the hardware pays for itself in power savings within 2-3 years compared to repurposing an old laptop.
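The payback period depends heavily on which wattage figures you plug in. The sketch below divides the hardware cost by the yearly electricity savings, using the $80 board price and the power ranges from the tables above; the function name is illustrative.

```python
def payback_years(hardware_usd, old_watts, new_watts, usd_per_kwh=0.12):
    """Years until power savings equal the cost of the new hardware."""
    saved_kwh_per_year = (old_watts - new_watts) * 24 * 365 / 1000
    return hardware_usd / (saved_kwh_per_year * usd_per_kwh)

# Old laptop at the low and high end of its range vs. the Pi 5.
print(f"{payback_years(80, 30, 5):.1f} years")  # efficient laptop, light Pi load
print(f"{payback_years(80, 65, 8):.1f} years")  # power-hungry laptop, busy Pi
```

With the full kit cost of $145-160 in place of the bare board price, the payback period stretches proportionally.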
Hardware Comparison: Pi vs. Everything Else
The Raspberry Pi is not the only option for hosting a personal AI agent. Here is how it stacks up.
Raspberry Pi 5 ($80)
Best for: Budget-conscious builders, always-on lightweight agents, hybrid cloud architectures, home automation, learning and experimentation.
Limitations: Local inference limited to sub-4B models at usable speeds. No GPU acceleration. 16 GB RAM ceiling.
Mac Mini M1/M2 ($160-300 used)
Best for: Local inference with larger models. The unified memory architecture and Neural Engine make the Mac Mini dramatically faster at ML workloads. A used M1 Mac Mini runs Llama 3.1 8B at 15-25 tokens per second -- 5x faster than the Pi.
Limitations: Higher power consumption, higher cost, overkill if you are primarily using cloud APIs.
Mac Mini M4 ($500-600 new)
Best for: Serious local AI work. 24 GB unified memory runs 13B models comfortably. The Neural Engine and GPU deliver 30-50+ tokens per second on 8B models. Nevo runs on hardware in the same family -- a Mac Studio M4 -- and it handles the full stack including local model routing.
Limitations: 5-8x the cost of a Pi. Overkill for simple agent orchestration with cloud APIs.
Intel NUC / Mini PC ($150-400)
Best for: Middle ground between Pi and Mac. A refurbished NUC 13 with 32 GB DDR5 can run 7B models at 8-12 tokens per second. Some newer NUCs include NPU (Neural Processing Unit) hardware with up to 67 TOPS of dedicated AI compute.
Limitations: x86 power consumption (25-45W). Fan noise. Less elegant form factor than the Pi.
Old Laptop ($0)
Best for: Free hardware you already own. If it has 16 GB RAM and an Intel i5 or better from the last 5 years, it can run agent orchestration and small local models.
Limitations: Not designed for always-on use. Battery degradation from constant charging. Fan noise. Power consumption 3-10x higher than a Pi.
The Verdict
If your agent primarily uses cloud APIs for reasoning and you want a cheap, silent, always-on appliance: get a Pi 5.
If you want meaningful local inference capability and can spend more: get a used Mac Mini M1/M2 or a refurbished Intel NUC with 32 GB RAM.
If you want the best of everything and budget is not the primary constraint: get a Mac Mini M4 or Mac Studio.
Use Cases: What People Actually Build
Home Automation Agent
A Pi-based agent connected to Home Assistant via its REST API can manage your smart home with natural language commands, create complex automations, and learn your patterns over time. The Pi runs the agent logic and communicates with Home Assistant locally -- no cloud dependency for device control.
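As a rough sketch of what such a call looks like, the Python below builds a request against Home Assistant's REST service endpoint (`/api/services/<domain>/<service>`) with a long-lived access token. The base URL, token, entity IDs, and helper names are placeholders for illustration.

```python
import json
import urllib.request

def build_service_call(base_url, token, domain, service, entity_ids):
    """Build a POST request that calls a Home Assistant service."""
    url = f"{base_url}/api/services/{domain}/{service}"
    body = json.dumps({"entity_id": entity_ids}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

def lights_to_turn_off(all_lights, keep_on):
    """Every light except the ones that should stay on."""
    return [entity for entity in all_lights if entity not in keep_on]

# Placeholder entity IDs and credentials.
targets = lights_to_turn_off(
    ["light.office", "light.kitchen", "light.hall"], {"light.office"}
)
request = build_service_call(
    "http://homeassistant.local:8123", "YOUR_TOKEN", "light", "turn_off", targets
)
```

Passing `request` to `urllib.request.urlopen` on the same network sends the command without any traffic leaving the house.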
"Turn off all lights except the office at 11 PM" becomes a task the agent understands, translates to Home Assistant API calls, and executes. Over time, it notices you always do this on weeknights and offers to automate it.
Personal Knowledge Base
Point the agent at your notes, documents, PDFs, and bookmarks. It indexes everything locally using embedding models and BM25 search. When you ask a question, it searches your personal knowledge base first, then supplements with web search or cloud reasoning if needed.
On a Pi with QMD or a similar local search engine, this runs entirely on-device. Your documents never leave your network.
Privacy-First Coding Assistant
For developers who cannot send proprietary code to cloud APIs, a Pi running a local 3B coding model (Qwen 2.5 Coder 3B or similar) provides code completion, explanation, and review without any data leaving the network. The quality is limited compared to frontier models, but for routine tasks -- formatting, boilerplate, documentation, simple refactoring -- it is surprisingly capable.
Notification and Monitoring Agent
A Pi-based agent that monitors RSS feeds, email inboxes, GitHub repositories, or social media mentions and sends you filtered, summarized updates via Telegram or other messaging platforms. The monitoring and filtering logic runs locally. Only summaries that pass your relevance threshold get forwarded to you.
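A relevance threshold can be as simple as keyword overlap. The sketch below is one hypothetical way to do it; the scoring rule, the function names, and the sample headlines are assumptions, and a real agent might score with embeddings or a small local model instead.

```python
def relevance(text, keywords):
    """Fraction of the watched keywords that appear in the text."""
    if not keywords:
        return 0.0
    words = set(text.lower().split())
    hits = sum(1 for keyword in keywords if keyword.lower() in words)
    return hits / len(keywords)

def filter_items(items, keywords, threshold=0.5):
    """Keep only the items whose relevance meets the threshold."""
    return [item for item in items if relevance(item, keywords) >= threshold]

# Hypothetical feed headlines.
items = [
    "New llama.cpp release improves ARM performance",
    "Celebrity gossip roundup",
    "Raspberry Pi firmware update adds ARM tuning options",
]
kept = filter_items(items, ["arm", "llama.cpp"], threshold=0.5)
print(kept)
```

Raising the threshold to 1.0 forwards only items that mention every keyword.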
Development Environment Agent
For the personal AI agent use case, a Pi serves as a dedicated development companion. It watches your project repository, runs tests on commits, performs code review, maintains documentation, and handles routine development operations -- all without consuming resources on your primary development machine.
Performance Tuning Tips
Overclock Conservatively
The Pi 5 supports overclocking to 2.8-3.0 GHz with adequate cooling. This yields a 10-20% improvement in inference speed. Add to /boot/firmware/config.txt:
arm_freq=2800
over_voltage_delta=50000
Monitor temperatures with vcgencmd measure_temp. Stay below 80C under sustained load.
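For unattended monitoring, the agent can poll the same command. `vcgencmd measure_temp` prints a line such as `temp=48.3'C`; the Python sketch below parses it and flags readings at or above the 80C limit. The function names are illustrative.

```python
import re
import subprocess

def parse_temp(output):
    """Extract degrees Celsius from output such as "temp=48.3'C"."""
    match = re.search(r"temp=([0-9.]+)", output)
    if match is None:
        raise ValueError(f"unexpected vcgencmd output: {output!r}")
    return float(match.group(1))

def read_temp():
    """Run vcgencmd on the Pi and return the SoC temperature."""
    result = subprocess.run(
        ["vcgencmd", "measure_temp"],
        capture_output=True, text=True, check=True,
    )
    return parse_temp(result.stdout)

def too_hot(temp_c, limit_c=80.0):
    """True when the reading reaches the sustained-load limit."""
    return temp_c >= limit_c
```

Calling `too_hot(read_temp())` from a periodic task gives the agent a signal to pause local inference or lower the overclock.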
Optimize llama.cpp Build Flags
Compile llama.cpp with ARM-specific optimizations:
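cmake -B build \
-DGGML_CPU_ARM_ARCH=armv8.2-a+dotprod+fp16 \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_NATIVE=ON .
The dotprod and fp16 extensions provide meaningful speedups for quantized model inference on the Cortex-A76. Do not add +sve to the architecture string: the Cortex-A76 does not implement SVE, and a binary built with it can crash with an illegal-instruction error.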
Use Swap on NVMe
If running models that approach your RAM limit, configure swap on the NVMe (not the SD card):
sudo dphys-swapfile swapoff
sudo nano /etc/dphys-swapfile
# Set CONF_SWAPSIZE=4096, CONF_MAXSWAP=4096 (the default cap is 2048), and CONF_SWAPFILE=/mnt/nvme/swapfile
sudo dphys-swapfile setup
sudo dphys-swapfile swapon
This is a last resort -- swap-based inference is slow -- but it prevents OOM kills when a model slightly exceeds available RAM.
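Before loading a model, the agent can check whether it is likely to spill into swap. The sketch below reads `MemAvailable` from `/proc/meminfo`; the 0.5 GB headroom and the function names are assumptions for illustration.

```python
def mem_available_gb(meminfo_text):
    """Return MemAvailable from /proc/meminfo text, in GB (1e9 bytes)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB (KiB)
            return kib * 1024 / 1e9
    raise ValueError("MemAvailable not found")

def would_swap(model_gb, available_gb, headroom_gb=0.5):
    """True if loading the model is likely to push the system into swap."""
    return model_gb + headroom_gb > available_gb

def read_meminfo(path="/proc/meminfo"):
    """Read the kernel's memory report (Linux only)."""
    with open(path) as f:
        return f.read()
```

On the Pi, `would_swap(4.5, mem_available_gb(read_meminfo()))` tells you whether a 7B model at roughly 4.5 GB will fit in what is free right now.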
Disable Unnecessary Services
Every megabyte of RAM matters on a Pi running AI workloads. Disable what you do not need:
sudo systemctl disable bluetooth
sudo systemctl disable avahi-daemon
sudo systemctl disable cups
On Pi OS Lite, most of these are already absent. If you installed the full desktop version, strip it down.
Frequently Asked Questions
Can I run ChatGPT or Claude on a Raspberry Pi?
Not the full frontier models -- those require data center hardware. But you can run an AI agent on a Pi that calls the ChatGPT or Claude API for reasoning while handling everything else locally. This hybrid approach gives you frontier-level intelligence with Pi-level cost and simplicity.
Is 8 GB enough RAM for an AI agent on Pi?
For agent orchestration with cloud APIs, 8 GB is plenty. For local model inference, 8 GB limits you to models under 4B parameters. The 16 GB model is worth the price premium if you plan to run local models.
How does the Raspberry Pi 5 compare to the Pi 4 for AI?
The Pi 5 is approximately 2-3x faster than the Pi 4 for inference workloads, thanks to the Cortex-A76 cores (vs. Cortex-A72) and faster memory. The Pi 4 can technically run sub-2B models, but the experience is poor. The Pi 5 is the minimum viable hardware for a useful AI agent.
Can I add a GPU to a Raspberry Pi for faster AI?
The Pi 5's PCIe interface technically supports eGPU setups, and people have experimented with this. But the single-lane PCIe 2.0 connection severely bottlenecks data transfer, making it impractical for real inference acceleration. If you need GPU compute, a different hardware platform is the right answer.
How long will a Raspberry Pi last running 24/7?
With adequate cooling and NVMe storage (not SD card), a Pi can run continuously for years. The primary failure points are SD cards (which is why NVMe is essential) and power supply quality. Use the official 27W power supply and a quality NVMe drive, and the Pi will outlast its usefulness as AI hardware evolves.
What is the best model to run on Raspberry Pi 5?
For the best balance of speed, quality, and resource usage: Gemma 3 1B (Q4_K_M quantization). For the best output accuracy among small models: Qwen 2.5 3B. For coding tasks specifically: Qwen 2.5 Coder 3B.
The Bottom Line
An AI agent on a Raspberry Pi is not a compromise. It is a deliberate architectural choice that trades local compute power for always-on reliability, negligible operating costs, and a clean separation between orchestration and inference.
The Pi handles what it is good at -- running continuously, managing tasks, executing tools, maintaining memory, serving a local dashboard -- and delegates what it is not good at to cloud APIs or more powerful machines on your network. This is the same hybrid pattern that powers production distributed systems everywhere. The Pi is just the cheapest, quietest, most power-efficient node you can put at the edge.
For $150 in hardware and $5 per year in electricity, you get a dedicated AI agent that never sleeps, never competes with your other work, and never stops learning.
Start with the Nevo Pi installation guide if you want a step-by-step walkthrough. Or explore personal AI agents for more on what a dedicated agent can do for your workflow.