Anthropic Claude: $20,000, 16 AI Agents, and a Compiler That Builds Linux

TL;DR

Anthropic researcher Nicholas Carlini orchestrated 16 autonomous Claude agents working in parallel to build a 100,000-line C compiler in Rust. Using a custom harness for task coordination, testing, and conflict resolution, the agent team produced a compiler capable of building Linux 6.9 across multiple architectures.

Key Points

The project resulted in a 100,000-line compiler capable of building Linux 6.9 on x86, ARM, and RISC-V architectures.

The project surfaced challenges including task synchronization, frequent merge conflicts, quality assurance issues, and functional limitations.

The work describes the use of agent teams to execute tasks in parallel, coordinated through a simple file-based locking mechanism.

The agent team approach demonstrates that complex software projects can be produced autonomously with limited human supervision.

The authors emphasize the importance of high-quality tests and strict continuous integration to keep autonomous agents on track.

A recent project has developed a Rust-based C compiler using a novel method called "agent teams." This approach involves running 16 instances of the language model Claude in parallel, without active human intervention. Over nearly 2,000 Claude Code sessions and approximately two weeks of continuous execution, the agent team produced a 100,000-line compiler capable of building Linux 6.9 across x86, ARM, and RISC-V architectures. In total, the project consumed roughly 2 billion input tokens, generated 140 million output tokens, and cost just under $20,000 in API usage. The primary focus of the work was not only the compiler itself, but the design of frameworks that allow autonomous agent teams to make sustained progress and execute tasks in parallel. Challenges such as task synchronization, merge conflicts, and regression control were central to the experiment.

The "agent teams" concept allows multiple Claude instances to autonomously work on a shared codebase. Each agent runs inside its own isolated container with a local copy of the repository, while a bare upstream Git repository is used for synchronization. To avoid agents duplicating work, the framework implements a simple file-based locking mechanism, where an agent claims a task by creating a lock file before starting work. If a conflict occurs, Git forces the agent to rebase and choose a different task. Once a task is completed, the agent merges changes back into the upstream repository and releases the lock. A continuous execution loop keeps spawning fresh agent sessions, enabling long-running autonomous development without manual supervision.

Maintaining correctness and preventing regressions proved to be one of the most difficult aspects of the project. As the compiler grew in scope, agents frequently introduced changes that broke existing functionality. To address this, the project incorporated high-quality test suites, strict continuous integration pipelines, and known-good compiler oracles. In particular, GCC was used as a reference compiler to compare outputs and isolate failing files when compiling large codebases such as the Linux kernel. This technique enabled agents to split a monolithic task into smaller, parallelizable units. Using this approach, the compiler was able to build not only the Linux kernel but also large real-world projects including QEMU, FFmpeg, SQLite, Postgres, Redis, Lua, and libjpeg, achieving a ~99% pass rate on most compiler test suites, including the GCC torture tests.
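To make the oracle idea concrete, the sketch below shows one way per-file isolation could work in Rust: start from an all-GCC set of object files that is known to boot, swap in the new compiler's object for one translation unit at a time, and re-test. The helper script name and the one-file-at-a-time loop are assumptions for illustration; the write-up describes the technique only at the level of "isolate failing files", and a real harness would more plausibly bisect.

```rust
use std::path::PathBuf;
use std::process::Command;

/// Relink the kernel from the given object files and run a boot check.
/// The helper script is hypothetical; the actual project relied on GCC-built
/// reference kernels and emulator boot tests as the known-good baseline.
fn kernel_still_works(objects: &[PathBuf]) -> bool {
    Command::new("./scripts/link_and_boot_test.sh")
        .args(objects)
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

/// Differential isolation against a known-good oracle: `gcc_objs[i]` and
/// `ours_objs[i]` are the two compilers' objects for the same source file.
/// Returns the indices of files whose object from the new compiler breaks
/// the boot test. (A real harness would bisect rather than loop linearly.)
fn find_miscompiled_files(gcc_objs: &[PathBuf], ours_objs: &[PathBuf]) -> Vec<usize> {
    assert_eq!(gcc_objs.len(), ours_objs.len());
    let mut failing = Vec::new();
    for i in 0..gcc_objs.len() {
        // Known-good GCC build with exactly one object swapped for ours.
        let mut mix = gcc_objs.to_vec();
        mix[i] = ours_objs[i].clone();
        if !kernel_still_works(&mix) {
            failing.push(i);
        }
    }
    failing
}
```

Each file flagged this way becomes a small, self-contained bug-fixing task that can be handed to a different agent, which is what turns a monolithic "the kernel doesn't boot" failure into parallelizable work.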

Despite these successes, the compiler has clear limitations. It lacks a native 16-bit x86 code generator, which is required to boot Linux from real mode, and instead relies on GCC for that stage. The project also does not yet have a fully reliable in-house assembler or linker, again falling back to GCC tooling in some cases. Even with optimization passes enabled, the generated machine code is significantly less efficient than GCC’s output, sometimes performing worse than GCC with optimizations disabled. While the Rust codebase is functional and maintainable, it does not yet match the quality or performance of production-grade compilers written by expert human teams.

Overall, the project serves as a stress test of the current limits of autonomous agent teams. It demonstrated both their surprising capabilities and the constraints that still require careful human oversight.

Key Numbers

16: Claude instances run in parallel.
100,000 lines: size of the Rust-based C compiler.
6.9: version of the Linux kernel the compiler can build.
~2,000: Claude Code sessions run over the course of the project.
2 billion: input tokens consumed.
140 million: output tokens generated.
Just under $20,000: total API cost of the project.
~99%: pass rate on most compiler test suites, including the GCC torture tests.
~2 weeks: approximate duration of continuous execution.
3: CPU architectures supported (x86, ARM, RISC-V).

Stakeholder Relationships

People

Nicholas Carlini (Researcher)

Researcher on Anthropic’s Safeguards team who designed and ran the experiment using parallel Claude agents to build a C compiler.

Organizations

Anthropic (Research Organization)

AI research organization that develops Claude and where the agent team compiler experiment was conducted.

Tools

Claude (Language Model)

A large language model instantiated in multiple parallel agents to autonomously write, test, and debug a Rust-based C compiler.

Claude Code (Agent Environment)

The development environment used to run Claude agents in continuous loops for autonomous coding and testing.

GCC (Compiler)

A known-good C compiler used as an oracle to validate correctness and isolate failures during kernel compilation.

Git (Version Control System)

Used for task locking, synchronization, merging changes, and coordinating work between parallel Claude agents.

Docker (Container Runtime)

Used to run each Claude agent in an isolated container with its own workspace and controlled environment.

Timeline of Events

Timeline of key events and milestones.
Late 2025: First functional compiler produced with Opus 4.5

Opus 4.5 was the first model version able to produce a functional compiler that could pass large test suites, although it was still incapable of compiling large real-world projects.

Jan 2026: Opus 4.6 large-scale compiler experiment

Over nearly two weeks, Opus 4.6 was run across approximately 2,000 Claude Code sessions, consuming 2 billion input tokens and generating 140 million output tokens at a cost just under $20,000. The resulting compiler could build a bootable Linux 6.9 on x86, ARM, and RISC-V, and compile projects such as QEMU, FFmpeg, SQLite, Postgres, and Redis.

Feb 05, 2026: Discussion of the agent teams experiment

Discussion of the experiment, in which 16 parallel Claude agents developed a 100,000-line Rust-based C compiler capable of building Linux 6.9 on multiple architectures.
