Anthropic: Claude Now Authors >80% of Merged Code as Recursive Self-Improvement Metrics Accelerate

Fresh data from Anthropic’s Institute post “When AI builds itself,” which has circulated widely this week, quantifies how quickly Claude is taking over parts of Anthropic’s own R&D loop. The numbers point to compounding automation in frontier AI development.
Core Metrics (as of May 2026)
- >80% of code merged into Anthropic’s production codebase is now authored by Claude — up from low single digits before the February 2025 Claude Code research preview.
- Engineer output (lines of code merged per engineer per day) has risen 8x since 2024. Output per engineer had been essentially flat from 2021–2024; the curve steepened sharply once Claude began autonomously writing and editing full files.
- Task length horizon (reliable autonomous completion time) is roughly doubling every 4 months:
  - March 2024 (Claude Opus 3): ~4-minute tasks
  - March 2025 (Claude Sonnet 3.7): ~1.5-hour tasks
  - March 2026 (Claude Opus 4.6): ~12-hour tasks
  - Trend line points to multi-day tasks later in 2026 and week-scale work in 2027.
- Experimental speedups on defined research optimization loops reached ~52x (Claude Mythos Preview, April 2026). A skilled human researcher typically needs 4–8 hours to achieve ~4x on the same class of task.
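
The task-horizon figures above can be checked with a little arithmetic. The Python sketch below assumes smooth exponential growth between the quoted data points, which the post does not claim. The function names and the 48-hour and 168-hour thresholds for "multi-day" and "week-scale" are illustrative assumptions, not definitions from the source.

```python
import math

# Task-horizon data points quoted in the post:
# (months since March 2024, reliable task length in minutes)
HORIZONS = [
    (0, 4.0),     # March 2024, Claude Opus 3: ~4 minutes
    (12, 90.0),   # March 2025, Claude Sonnet 3.7: ~1.5 hours
    (24, 720.0),  # March 2026, Claude Opus 4.6: ~12 hours
]


def doubling_time_months(t0, h0, t1, h1):
    """Months per doubling, assuming exponential growth between two points."""
    return (t1 - t0) / math.log2(h1 / h0)


def months_until(target_minutes, current_minutes, doubling_months):
    """Months needed to reach target_minutes at a fixed doubling time."""
    return doubling_months * math.log2(target_minutes / current_minutes)


(t_a, h_a), (t_b, h_b), (t_c, h_c) = HORIZONS
first_year = doubling_time_months(t_a, h_a, t_b, h_b)   # 2024 to 2025
second_year = doubling_time_months(t_b, h_b, t_c, h_c)  # 2025 to 2026
overall = doubling_time_months(t_a, h_a, t_c, h_c)      # 2024 to 2026

# Assumed thresholds (not from the post): 48 h and 168 h of task time.
multi_day = months_until(48 * 60, h_c, 4.0)
week_scale = months_until(168 * 60, h_c, 4.0)

print(f"2024-2025: {first_year:.1f} months per doubling")
print(f"2025-2026: {second_year:.1f} months per doubling")
print(f"2024-2026: {overall:.1f} months per doubling")
print(f"48-hour tasks: {multi_day:.1f} months after March 2026")
print(f"168-hour tasks: {week_scale:.1f} months after March 2026")
```

Under these assumptions the implied doubling time is about 2.7 months for 2024 to 2025, 4.0 months for 2025 to 2026, and about 3.2 months over the full two years, so the "every 4 months" figure matches the most recent year. Holding that rate fixed, 48-hour tasks arrive about 8 months after March 2026 and 168-hour tasks about 15 months after, in line with the post's "later in 2026" and "2027".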
Additional Signals of Velocity
- Session success rate on open-ended coding tasks hit 76% in May 2026 (up 50 percentage points in six months).
- In one April 2026 API error-reduction sprint, Claude shipped >800 fixes that cut error rates by a factor of 1,000 — work a human engineer estimated would have taken roughly four years.
- Code quality has reached parity with senior human engineers (in late 2025 it was still slightly behind); Anthropic expects it to pull ahead within the next year.
- Internal researcher poll (March 2026) showed a median estimate of ~4x output increase when using the latest Claude preview.
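
Two of the figures above are easy to misread: a gain in percentage points is not a percent increase, and a "factor of 1,000" reduction is a statement about the remaining error rate. The short Python sketch below works through both using only the numbers quoted in the list. The variable names are illustrative, and the derived values are simple arithmetic rather than figures reported in the post.

```python
# Figures quoted above; variable names are illustrative only.
success_now_pct = 76.0          # session success rate, May 2026
gain_pp = 50.0                  # gain in percentage points over six months
error_reduction_factor = 1000.0 # error rates cut by a factor of 1,000

# A 50-point gain ending at 76% implies a starting rate of 26%.
success_before_pct = success_now_pct - gain_pp

# Relative change: how many times higher the success rate is now.
relative_gain = success_now_pct / success_before_pct

# A 1,000x reduction leaves 1/1000 of the original errors.
errors_removed_pct = (1 - 1 / error_reduction_factor) * 100

print(f"Implied earlier success rate: {success_before_pct:.0f}%")
print(f"Relative improvement: {relative_gain:.1f}x")
print(f"Share of errors removed: {errors_removed_pct:.1f}%")
```

Read this way, the success rate roughly tripled from an implied 26%, and the error-reduction sprint removed 99.9% of the original errors.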
What This Means for AI R&D
These figures come from internal attribution pipelines and controlled experiments, not just benchmarks. Public benchmarks show the same direction: SWE-bench moved from low performance to near-saturation in roughly two years, and CORE-Bench (research reproduction) did so in roughly 15 months. The pattern is consistent: once models can reliably handle longer-horizon tasks with minimal human scaffolding, the iteration loop tightens dramatically.
Anthropic frames this as early but measurable progress toward recursive self-improvement — where AI systems increasingly design, implement, test, and improve their own successors. Human researchers still set high-level direction and exercise judgment on problem selection and evaluation rubrics, but execution velocity is shifting fast.
Actionable implications for other labs and infra teams:
- Coding-agent adoption curves are steepening; labs that still rely on lighter tool use are likely seeing productivity gaps widen.
- R&D throughput is becoming more compute-bound than human-bound on well-scoped tasks.
- Bottlenecks are migrating upstream (idea generation, evaluation design, long-term research taste) and downstream (review, integration, safety validation).
- The same acceleration that delivers 52x experimental speedups also compresses the timeline for capability jumps — relevant for both capability forecasting and alignment work.
The post (co-authored by Marina Favaro and Jack Clark) is explicit that full recursive self-improvement remains uncertain and carries meaningful control risks, but the internal data shows the trend is already material inside one of the leading labs.
Direct link: https://www.anthropic.com/institute/recursive-self-improvement
Data like this is now the clearest public window into how quickly frontier AI organizations are automating their own core work.