Karpathy Open-Sources AutoResearch: 700 Experiments in 2 Days

March 23, 2026 | Nevo

AutoResearch is a 630-line open-source Python framework by Andrej Karpathy that enables AI agents to autonomously design, run, and evaluate machine learning experiments. Released on March 6, 2026 under the MIT license, it completed roughly 700 experiments in two days during Karpathy's initial run, producing 20 additive improvements and cutting the "Time to GPT-2" leaderboard time from 2.02 hours to 1.80 hours. That is an 11% reduction in training time, discovered entirely by an AI agent with no human intervention during the experimental loop.

TL;DR

Karpathy open-sourced AutoResearch, a 630-line Python framework for autonomous ML experimentation
Ran roughly 700 experiments in 2 days, keeping 20 improvements for an 11% cut in training time
Git-as-state-machine: keeps improvements as commits, reverts failures automatically
Already scaled: SkyPilot ran 910 experiments in 8 hours for $309
50.4K GitHub stars in under three weeks: the pattern is quickly becoming a template for AI self-improvement

This is not another academic paper or a demo video. It is a working system, already forked 7,000 times and starred over 50,000 times on GitHub in under three weeks. Shopify CEO Tobi Lutke ran 37 experiments overnight and saw a 19% gain. SkyPilot scaled it to 910 experiments in 8 hours for $309 total. The barrier to autonomous AI research just dropped to a single GPU and a weekend.

Who Is Karpathy and Why Does This Matter?

Andrej Karpathy is the former Senior Director of AI at Tesla and a founding member of OpenAI. He built Tesla's Autopilot vision system and taught Stanford's most popular deep learning course. When Karpathy ships code, the ML community pays attention -- not because of his title, but because his work tends to become infrastructure. His previous open-source projects (char-rnn, minGPT, nanoGPT) defined how a generation of researchers learned to train language models.

AutoResearch is his answer to a question the field has been circling for years: can AI agents do the tedious, iterative work of ML research -- the hyperparameter sweeps, the architecture tweaks, the training loop optimizations -- faster and cheaper than humans? The answer, based on the first wave of results, is yes. Decisively.

The timing matters too. We are in a period where AI agents are demonstrating capabilities that range from useful to alarming. AutoResearch stakes out the useful end of that spectrum: a constrained, transparent, auditable system that does one thing well -- run experiments and keep what works.

How AutoResearch Works: Three Files and a Git Repo

AutoResearch is an autonomous ML experimentation framework that uses AI agents to iteratively improve a training pipeline through constrained, evaluated mutations. The architecture is deliberately minimal. The entire system is three Python files totaling 630 lines:

prepare.py -- The immutable evaluator. Downloads data, builds vocabulary, creates train/val splits. This file is never modified by the agent. It is the fixed ground truth that prevents the system from gaming its own metrics.
train.py -- The sole mutable target. Contains the GPT model definition, Muon+AdamW optimizers, and the training loop. This is the only file the agent is allowed to change. Every experiment is a proposed mutation to this file.
program.md -- Human-authored agent instructions. Contains the rules of engagement, the optimization objective, and a notable directive: "NEVER STOP." This file tells the AI agent what to optimize and how to behave.

The experimental loop follows a strict protocol. The agent reads the current state of train.py, proposes a modification, runs training with a fixed 5-minute wall-clock budget, and measures val_bpb (validation bits per byte, where lower is better) as the single fitness metric. If the new score is lower than the previous best, the change is committed to git. If it is not, the change is reverted. Every experiment is a git commit -- kept or discarded, with a full audit trail.

This is the core insight: git is the state machine. The repository's commit history is the experiment log. The last committed state is always the best known configuration. There is no database, no experiment tracker, no MLflow instance -- just git.
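The keep-or-revert protocol can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in the AutoResearch repository: `run_step`, `propose_edit`, and `evaluate` are hypothetical names, and the sketch assumes a kept change is recorded with `git commit` while a rejected one is undone with `git checkout`.

```python
import subprocess
from typing import Callable


def git(*args: str) -> None:
    """Run a git command in the current repository (assumes git is installed)."""
    subprocess.run(["git", *args], check=True)


def run_step(
    propose_edit: Callable[[], str],  # hypothetical: edits train.py, returns a description
    evaluate: Callable[[], float],    # hypothetical: trains for the fixed budget, returns val_bpb
    best_bpb: float,
    git_cmd: Callable[..., None] = git,
) -> float:
    """Run one experiment and return the best val_bpb known afterwards."""
    description = propose_edit()
    val_bpb = evaluate()
    if val_bpb < best_bpb:
        # Improvement: the new commit becomes the best-known state.
        git_cmd("commit", "-am", f"{description} (val_bpb={val_bpb:.4f})")
        return val_bpb
    # No improvement: restore train.py to the last committed version.
    git_cmd("checkout", "--", "train.py")
    return best_bpb
```

Passing `git_cmd` as a parameter keeps the decision logic testable without touching a real repository; the default simply shells out to git.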

What is AutoResearch's throughput?

AutoResearch achieves approximately 12 experiments per hour with a 5-minute wall-clock budget per experiment. Karpathy's initial 2-day run produced roughly 700 experiments and 20 additive improvements. When SkyPilot parallelized the system across cloud GPUs, throughput increased roughly 9x -- completing 910 experiments in 8 hours at a total cost of $309.
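The arithmetic behind these throughput figures is easy to check. The short calculation below uses only the numbers quoted above; the variable names are illustrative, and the per-hour figure ignores any overhead between runs.

```python
BUDGET_MINUTES = 5

# One experiment per 5-minute budget gives an upper bound for a single GPU.
single_gpu_per_hour = 60 / BUDGET_MINUTES

# SkyPilot's reported run: 910 experiments in 8 hours for $309.
skypilot_per_hour = 910 / 8
speedup = skypilot_per_hour / single_gpu_per_hour
cost_per_experiment = 309 / 910

print(f"{single_gpu_per_hour:.1f}")   # 12.0
print(f"{skypilot_per_hour:.2f}")     # 113.75
print(f"{speedup:.2f}")               # 9.48
print(f"{cost_per_experiment:.2f}")   # 0.34
```

At about 34 cents per experiment, the parallel run works out to roughly 9.5 times the single-GPU rate, which is consistent with the 9x figure reported.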

The Results: Numbers That Speak

The raw numbers from the first wave of AutoResearch deployments tell a clear story:

Karpathy's run: ~700 experiments over 2 days. 20 additive improvements retained. "Time to GPT-2" dropped from 2.02 hours to 1.80 hours -- an 11% reduction in training time.
Tobi Lutke (Shopify CEO): 37 experiments run overnight on personal hardware. 19% improvement in training efficiency. Zero ML expertise required to operate the system.
SkyPilot scaled run: 910 experiments in 8 hours across cloud GPUs. Total cost: $309. 9x throughput improvement over single-GPU operation.
Community best: val_bpb improved from 0.9979 to 0.9697 over 126 experiments -- a reduction of about 2.8% in validation bits per byte, discovered entirely by autonomous agents.
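The headline percentages follow directly from the before and after values. A quick check, with `relative_reduction` as an illustrative helper name:

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a lower-is-better quantity dropped."""
    return (before - after) / before


time_to_gpt2 = relative_reduction(2.02, 1.80)        # hours
community_val_bpb = relative_reduction(0.9979, 0.9697)

print(f"{time_to_gpt2:.1%}")        # 10.9%
print(f"{community_val_bpb:.1%}")   # 2.8%
```

Note that the 11% figure is a reduction in training time; expressed as a speedup (2.02 / 1.80), the same change is about 12%.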

The GitHub traction confirms the impact: 50,400 stars and 7,000 forks in under three weeks. For context, nanoGPT -- Karpathy's previous viral project -- took months to reach similar numbers. AutoResearch hit them in under three weeks.

The Architecture Decisions That Make It Work

Several design choices separate AutoResearch from the long history of automated machine learning (AutoML) systems:

Immutable evaluator. By making prepare.py read-only, the system cannot accidentally (or adversarially) modify its own evaluation criteria. This is a lesson that many autonomous agent systems learn the hard way -- when the agent can change the test, the test becomes meaningless. AutoResearch sidesteps this by architectural constraint, not by trust.
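One way a harness could verify that the evaluator has not been touched is to compare a content hash before and after each experiment. The sketch below is an assumption about how such a check might look, not a description of how AutoResearch enforces the constraint; `sha256_of` and `evaluator_unchanged` are hypothetical names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def evaluator_unchanged(path: Path, expected_digest: str) -> bool:
    """True if the evaluator file still matches the digest recorded at startup."""
    return sha256_of(path) == expected_digest
```

A harness would record `sha256_of(Path("prepare.py"))` once at startup and refuse to accept any experiment for which `evaluator_unchanged` returns False.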

Single mutable file. The agent can only touch train.py. This prevents scope creep, dependency tangles, and the kind of cascading changes that make experiments unreproducible. Every mutation is localized and reversible.

Fixed time budget. Five minutes per experiment, enforced by wall clock. This creates a natural selection pressure: only changes that produce measurable improvements within the time constraint survive. It also makes the system predictable -- you know exactly how long 100 experiments will take.
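A wall-clock budget can be sketched as a loop that stops at a deadline rather than after a fixed number of steps. How train.py actually enforces its budget is not described here, so treat this as an assumption; `train_with_budget`, `step`, and `hours_for` are hypothetical names.

```python
import time
from typing import Callable


def train_with_budget(step: Callable[[], None], budget_seconds: float) -> int:
    """Call step() until the wall-clock budget is spent; return steps completed."""
    deadline = time.monotonic() + budget_seconds
    steps = 0
    while time.monotonic() < deadline:
        step()
        steps += 1
    return steps


def hours_for(n_experiments: int, budget_minutes: float = 5.0) -> float:
    """Sequential wall-clock time for n experiments, ignoring overhead."""
    return n_experiments * budget_minutes / 60
```

Because the deadline is fixed, a change that makes each step cheaper gets more steps within the same budget, which is where the selection pressure comes from. Under this model, `hours_for(100)` is a little over 8.3 hours.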

Git as experiment log. No proprietary format, no database, no vendor lock-in. Anyone with git log can inspect the full history of every experiment, every mutation, every result. This is auditable science, not a black box.

The Counterarguments: What Critics Are Saying

AutoResearch has attracted legitimate criticism alongside the hype. The counterarguments deserve serious engagement:

"This is just AutoML with a better mutation operator." Critics point out that neural architecture search (NAS) and Hyperband have done automated experiment iteration for years. The difference is the mutation operator: instead of a fixed search algorithm, AutoResearch uses an LLM that can read code, understand context, and propose semantically meaningful changes. Whether that constitutes a paradigm shift or just a better hill-climbing strategy is a fair debate.

"It does not build theory." AutoResearch finds improvements but cannot explain why they work. There is no mechanism for building transferable understanding -- no hypothesis generation, no causal reasoning, no theory construction. It optimizes; it does not comprehend. For some applications, that is fine. For advancing scientific understanding, it is a limitation.

"700 experiments against the same validation set risks overfitting." This is a real concern. Repeated optimization against a fixed validation set can produce solutions that exploit idiosyncrasies of that specific data split rather than genuinely improving the model. The more experiments you run, the higher the risk of validation set spoilage.
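A toy simulation shows why this matters. Assume, purely for illustration, that no experiment changes the true val_bpb at all and that each measurement carries a small amount of random noise; the noise level below is invented, and `best_observed_score` is a hypothetical name. A keep-if-better rule will still report steady "improvement".

```python
import random


def best_observed_score(n_experiments: int, true_bpb: float = 1.0,
                        noise_std: float = 0.005, seed: int = 0) -> float:
    """Best score a keep-if-better loop reports when nothing truly improves."""
    rng = random.Random(seed)
    best = true_bpb + rng.gauss(0.0, noise_std)  # baseline measurement
    for _ in range(n_experiments):
        observed = true_bpb + rng.gauss(0.0, noise_std)
        if observed < best:  # accept any apparent improvement
            best = observed
    return best


for n in (10, 100, 700):
    print(n, round(best_observed_score(n), 4))
```

The reported best keeps drifting below the true value of 1.0 as the number of experiments grows, even though the model never got better. Re-scoring kept changes on a held-out split that the loop never sees is one common way to separate real gains from this selection effect.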

"Transfer claims are unverified." Karpathy has stated that improvements discovered at depth-12 transfer to depth-24 models, but this claim has not been independently reproduced. Until it is, the generalizability of AutoResearch discoveries remains an open question.

"Security surfaces exist." GitHub Issues #64 and #41 flag real risks: prompt injection via run.log (where the agent reads its own output history) and cached artifact tampering. These are solvable problems, but they highlight that autonomous agent systems require security thinking from day one, not as an afterthought.
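One possible mitigation for the run.log issue is to pass the agent only lines that match a strict numeric pattern, so free text in the log never reaches the prompt. The log format below is hypothetical, since the real layout of run.log is not specified here, and `sanitize_log` is an illustrative name rather than part of AutoResearch.

```python
import re

# Hypothetical log format: "name: number" or "name = number" on its own line.
METRIC_LINE = re.compile(r"^(step|val_bpb|train_loss)\s*[:=]\s*-?\d+(\.\d+)?$")


def sanitize_log(raw: str) -> list[str]:
    """Keep only lines that look like plain numeric metrics; drop free text."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if METRIC_LINE.match(line):
            kept.append(line)
    return kept
```

For example, a log containing `val_bpb: 0.9697` followed by a line of injected instructions would be reduced to the metric line alone. An allowlist like this is deliberately conservative: anything that is not recognizably a metric is dropped rather than inspected.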

Why This Matters: The SETI@home Vision for AI Research

The immediate impact of AutoResearch is clear: faster, cheaper ML experimentation. But Karpathy is thinking bigger. In a post on X, he outlined the next step: "The next step for autoresearch is that it has to be asynchronously massively collaborative for agents. Think: SETI@home style."

Imagine a distributed network where thousands of AI agents run experiments in parallel, each contributing discoveries to a shared knowledge base. Not a centralized lab with a $100M compute budget, but a federated swarm of agents, each running on commodity hardware, each exploring a different corner of the optimization landscape. As Karpathy told Fortune: "any metric you care about... can be autoresearched by an agent swarm."

This is the inflection point. AutoResearch is not just a tool for training better language models. It is a template for autonomous optimization of any measurable system. Training loops today. Compiler flags tomorrow. Drug molecule configurations next year. The pattern is general: define a metric, constrain the search space, let agents iterate, keep what works.

Karpathy's own framing is blunt: "All LLM frontier labs will do this. It's the final boss battle." And: "Humans no longer write most code. We direct, supervise, and orchestrate agents." Whether you find that exciting or unsettling, the 50,000 stars on GitHub suggest the community has already voted.

For those building AI agent systems, the implications are immediate. The pattern Karpathy formalized -- immutable evaluator, mutable target, git-gated changes, fixed time budgets -- is applicable far beyond ML training. Enterprise AI platforms like NVIDIA's NemoClaw are already building infrastructure for exactly this kind of autonomous agent workflow. The question is no longer whether AI agents will run research loops autonomously. It is how fast the tooling matures.

What Builders Should Do Now

If you work with ML models, clone the repo and run it. The barrier is genuinely low: a single GPU, Python 3, and an afternoon. Start with the default configuration, let it run 50 experiments, and study the git log to understand what the agent tried and what survived. The learning is in the commit history.

If you build AI agent systems, study the architecture decisions. The immutable evaluator pattern, git-as-state-machine, and fixed time budgets are transferable to any domain where you want agents to iterate autonomously without losing control. These are not ML-specific ideas -- they are agent design principles.

If you are an ML researcher, engage with the counterarguments seriously. Validation set spoilage, lack of theory generation, and unverified transfer claims are real limitations. The community that addresses these gaps will define whether AutoResearch becomes a serious research methodology or remains a sophisticated hyperparameter tuner.

Either way, the era of AI agents running autonomous research loops is here. It arrived in 630 lines of Python.

Frequently Asked Questions

What is Karpathy's AutoResearch?

AutoResearch is a 630-line open-source Python framework created by Andrej Karpathy that enables AI agents to autonomously run machine learning experiments. Released on March 6, 2026 under the MIT license, it uses three files -- an immutable evaluator (prepare.py), a mutable training script (train.py), and agent instructions (program.md) -- to iteratively improve ML training pipelines. Each experiment runs within a fixed 5-minute wall-clock budget, and results are tracked via git commits.

How many experiments can AutoResearch run per hour?

AutoResearch achieves approximately 12 experiments per hour on a single GPU with a 5-minute wall-clock budget per experiment. Karpathy's initial 2-day run completed roughly 700 experiments. When parallelized across cloud GPUs using SkyPilot, throughput increased roughly 9x, completing 910 experiments in 8 hours at a total cost of $309.

How does AutoResearch prevent the AI from gaming its own metrics?

AutoResearch enforces an immutable evaluator architecture. The evaluation file (prepare.py) is read-only and cannot be modified by the AI agent. The agent can only change train.py, the training script. This architectural constraint ensures the fitness metric (val_bpb, or validation bits per byte) remains a fixed, trustworthy measure throughout all experiments.

Is AutoResearch just automated machine learning (AutoML) rebranded?

AutoResearch shares conceptual DNA with AutoML techniques like neural architecture search (NAS) and Hyperband, but differs in its mutation operator. Instead of a fixed search algorithm, it uses an LLM that reads code, understands context, and proposes semantically meaningful modifications. Whether this constitutes a paradigm shift or an incremental improvement is actively debated in the ML community.

What are the known limitations and risks of AutoResearch?

Five key limitations have been identified: (1) running hundreds of experiments against the same validation set risks overfitting, (2) the system cannot explain why improvements work -- it optimizes without building theory, (3) transfer claims from depth-12 to depth-24 models have not been independently reproduced, (4) security vulnerabilities exist including prompt injection via run.log and cached artifact tampering, and (5) the system currently operates on a single optimization metric, limiting multi-objective research.
