🏆

Benchmarks

Real-world insights into leading coding agents. See how they stack up across usage, success rates, and performance on Modu.

These leaderboards evaluate frontier coding agents on enterprise-grade engineering tasks in production codebases on Modu, including multi-file changes in large, dependency-heavy codebases.

About these benchmarks

Industry benchmark: Hundreds of thousands of PRs analyzed • Rolling 90-day window • Per-agent minimum ≥ 300 PRs • Production data (opt-in, anonymized)

Notes: "n" values shown in tooltips are per-agent counts within the current window. Data reflects business usage from organizations using Modu in production (opt-in, anonymized).
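
The inclusion rules above (rolling 90-day window, per-agent minimum of 300 PRs) can be sketched as a simple filter. This is a minimal illustration, not Modu's actual pipeline: the `(agent, created_date)` record shape, the function name `eligible_agents`, and the choice to treat the window as exclusive at the start and inclusive at the end are all assumptions.

```python
from collections import Counter
from datetime import date, timedelta

WINDOW_DAYS = 90          # rolling window stated on the page
MIN_PRS_PER_AGENT = 300   # per-agent minimum stated on the page


def eligible_agents(prs, as_of, min_prs=MIN_PRS_PER_AGENT):
    """Return {agent: n} for agents with at least min_prs PRs in the window.

    prs is an iterable of (agent_name, created_date) pairs; this record
    shape is hypothetical. The window boundary handling is an assumption.
    """
    window_start = as_of - timedelta(days=WINDOW_DAYS)
    counts = Counter(
        agent for agent, created in prs if window_start < created <= as_of
    )
    # "n" here corresponds to the per-agent count within the current window.
    return {agent: n for agent, n in counts.items() if n >= min_prs}
```

The returned counts play the role of the per-agent "n" values; agents below the threshold simply do not appear.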

Merge Rate Leaderboard

Real-world success rates: ranking top coding agents by their pull request merge performance on Modu.

Last 90 days • 1/31/2026

Top 5 by Merge Success

Filters: Language = All Languages

Rank	Name	Success Rate	Organization
#1	Factory	77.4%	Factory
#2	Amp Code	76.9%	Sourcegraph
#3	OpenAI Codex	74.0%	OpenAI
#4	Claude Code	72.5%	Anthropic
#5	Cursor Background Agents	72.4%	Cursor

PR Outcome Distribution Leaderboard

How coding agents perform across one-shot, iterated, and human-assisted merges. The four outcome percentages for each agent sum to 100%.

Last 90 days • 1/31/2026

PR Outcomes (Top 5)

Filters: Language = All Languages • Complexity = Blended (70% Simple / 30% Complex)

Rank	Agent	One-shot	Iterated	Human-assist	Merged total	Not merged
#1	Amp Code	39.35%	28.05%	11.62%	79.02%	20.98%
#2	Factory	34.63%	28.44%	11.56%	74.63%	25.37%
#3	OpenAI Codex	33.66%	28.83%	12.44%	74.93%	25.07%
#4	Claude Code	32.06%	28.25%	12.51%	72.82%	27.18%
#5	Cursor Background Agents	29.58%	28.03%	12.77%	70.38%	29.62%

ℹ️ Understanding PR Outcome Distribution

Outcome Categories

One-shot merged: PR merged immediately without additional iterations.
Agent-iterated → merged: PR required agent iterations before being merged.
Human-assisted → merged: PR required human intervention before being merged.
Not merged: PR was not merged into the repository.
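
One way to read the four categories above is as a small decision rule. The sketch below is illustrative only: the `PullRequest` fields and label strings are hypothetical, and the precedence (any human follow-up makes a merged PR "human-assisted", even if the agent also iterated) is an assumption the page does not spell out.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    merged: bool
    agent_followups: int   # follow-up iterations authored by the agent (hypothetical field)
    human_followups: int   # follow-up changes authored by a human (hypothetical field)


def classify(pr: PullRequest) -> str:
    """Map a PR to one of the four outcome categories."""
    if not pr.merged:
        return "not_merged"
    if pr.human_followups > 0:
        # Assumed precedence: human intervention outranks agent iteration.
        return "human_assisted"
    if pr.agent_followups > 0:
        return "iterated"
    return "one_shot"
```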

PR Complexity Definitions

Simple PRs: ~10 minutes of work or ~10k total tokens.
Complex PRs: ~30 minutes of work or ~72k total tokens.
Blended: Weighted average of 70% Simple + 30% Complex PRs (reflects typical team usage patterns).
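
The blended figure is a plain weighted average, which can be sketched as follows. The function name and the example rates are hypothetical; only the 70/30 weights come from the definition above.

```python
SIMPLE_WEIGHT = 0.70   # share of Simple PRs in the blend
COMPLEX_WEIGHT = 0.30  # share of Complex PRs in the blend


def blended(simple_value: float, complex_value: float) -> float:
    """Weighted average of a metric measured separately on Simple and Complex PRs."""
    return SIMPLE_WEIGHT * simple_value + COMPLEX_WEIGHT * complex_value


# Hypothetical example: a 40% one-shot rate on Simple PRs and 20% on Complex PRs.
example = round(blended(40.0, 20.0), 2)
```

Because Simple PRs carry 70% of the weight, a blended number sits much closer to the Simple figure than to the Complex one.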

Data Collection & Analysis Notes

Draft PRs: Drafts are excluded from the denominator until they are marked ready for review; otherwise "not merged" rates would be inflated for tools that prefer draft PRs.
Squash vs merge-commit: Categorization is based on the PR's conversation and who authored follow-ups, not commit history post-squash.
Multi-PR tasks: When an agent opens several PRs to solve one issue, each PR is treated independently for these percentages.
Model choice: One-shot rates can drop with smaller, cheaper models; this table is model-agnostic and thus conservative.

All percentages are portions of total PRs submitted. "Merged total" sums the first three categories. Data sorted by one-shot merged percentage (descending).
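
The arithmetic behind these percentages can be sketched as below, assuming each PR has already been assigned one of four outcome labels. The label strings and the `(label, is_draft)` input shape are hypothetical; the draft exclusion and the "Merged total" sum follow the notes above.

```python
from collections import Counter

MERGED_LABELS = ("one_shot", "iterated", "human_assisted")
ALL_LABELS = MERGED_LABELS + ("not_merged",)


def outcome_distribution(outcomes):
    """Percentages per outcome, with drafts excluded from the denominator.

    outcomes is a list of (label, is_draft) pairs (hypothetical shape).
    """
    counted = [label for label, is_draft in outcomes if not is_draft]
    if not counted:
        raise ValueError("no non-draft PRs to summarize")
    tally = Counter(counted)
    total = len(counted)
    dist = {label: 100.0 * tally[label] / total for label in ALL_LABELS}
    # "Merged total" sums the first three categories.
    dist["merged_total"] = sum(dist[label] for label in MERGED_LABELS)
    return dist
```

As a check against the table, the Amp Code row gives 39.35 + 28.05 + 11.62 = 79.02 for "Merged total", and 79.02 + 20.98 = 100.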

Usage Leaderboard

Market share measured by created and merged pull requests on Modu.

Last 90 days • 1/31/2026

Top 5 by Share

Metric: Created PRs Share

Rank	Agent	Organization	Share
#1	Claude Code	Anthropic	28.50%
#2	OpenAI Codex	OpenAI	21.70%
#3	Cursor Background Agents	Cursor	19.60%
#4	Gemini CLI	Google	10.90%
#5	Amp Code	Sourcegraph	7.60%
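
A share of this kind is each agent's created-PR count divided by the total across all agents. The sketch below assumes that definition; the function name and input shape are hypothetical.

```python
def created_pr_share(created_counts):
    """Return [(agent, share_percent)] sorted by share, largest first.

    created_counts maps agent name -> number of PRs created in the window
    (hypothetical input shape).
    """
    total = sum(created_counts.values())
    if total == 0:
        raise ValueError("no PRs in the window")
    shares = {agent: 100.0 * n / total for agent, n in created_counts.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)
```

Under that reading, the five shares listed add up to 88.30%, which would leave 11.70% for agents outside the top 5.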

Average Cost per Task

Blended: 70% simple tasks + 30% complex tasks; pricing normalized across seat and usage models.

Last 90 days • 1/31/2026

Top 5

Rank	Name	Simple	Complex	Blended Avg	Billing Basis
#1	Gemini CLI	$0.00–$0.01	$0.01–$0.05	$0.00–$0.02	Free (individual); token overages via API tiers in team/enterprise
#2	Factory	$0.00–$0.02	$0.02–$0.08	$0.01–$0.04	Per-user seat ($20/mo incl. "20M standard tokens") + usage; CLI for CI/CD
#3	Codegen	$0.05–$0.12	$0.05–$0.12	$0.07–$0.11	Seat/month (Individual $9.99) — flat tier amortized by volume
#4	OpenAI Codex	$0.06–$0.12	$0.25–$0.70	$0.12–$0.28	Seat/month (Plus/Pro/Team) or API tokens (model-dependent)
#5	OpenCode (BYO)	$0.02–$0.12	$0.12–$1.10	$0.05–$0.38	Your connected model's tokens (BYO/OpenCode Zen)

ℹ️ How This Table Is Standardized

Two task profiles

Simple ≈ 10 minutes of agentic work or ~10k total tokens (blended in/out).
Complex ≈ 30 minutes or ~72k total tokens across 5–20 calls.

Blended Average

70% Simple + 30% Complex — reflects real-world engineering team averages.

Key pricing notes

Models & pass-throughs: Model-agnostic tools follow the underlying model pricing (e.g., Sonnet 4.5; Gemini Flash-Lite).
Factory tokens: $20/mo plan includes "20M standard tokens"; marginal per-task cost ≈ $0 until the pool is exceeded.
Seat plans: Per-task numbers amortize monthly seats over ~60 tasks.
ACU/time pricing (Devin): Scales with autonomous runtime; complex tickets can consume many ACUs.
Quota systems (Augment): Message-metered plans convert to per-task cost by typical message counts.
Background agents (Cursor): Multi-step chains incur additional metered calls → wider ranges.
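
The seat-plan note above amounts to dividing a monthly seat price by a task volume. A minimal sketch, assuming nothing beyond that division (the function name is hypothetical, and the published ranges reflect additional normalization not specified here, so this is not expected to reproduce the table):

```python
def amortized_seat_cost(monthly_seat_usd: float, tasks_per_month: int = 60) -> float:
    """Spread a flat monthly seat price evenly over the month's tasks."""
    if tasks_per_month <= 0:
        raise ValueError("tasks_per_month must be positive")
    return monthly_seat_usd / tasks_per_month


# A $20/mo seat over the ~60 tasks assumed above, rounded to cents.
per_task = round(amortized_seat_cost(20.0), 2)

# Halving the volume (a hypothetical 30 tasks/month) doubles the per-task cost.
per_task_low_volume = round(amortized_seat_cost(20.0, tasks_per_month=30), 2)
```

The takeaway is that flat-seat tools look cheaper per task the more heavily they are used, so the ~60 tasks/month assumption matters as much as the seat price.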

Average Cost per Merged PR

Blended: 70% simple PRs + 30% complex PRs; token-metered models normalized.

1/31/2026

Top 5

Rank	Name	Simple	Complex	Blended Avg	Billing Basis
#1	Gemini CLI	$0.00	$0.00	$0.00	Free (individual); teams use Gemini API price card (overages apply)
#2	Codegen	$0.11	$0.30	$0.17	Seat/month (Individual $9.99); flat tier amortized by volume
#3	Claude Code (Sonnet 4.5)	$0.12	$0.56	$0.24	Seat or tokens (API: $3/M input, $15/M output; cache/batch may reduce)
#4	OpenAI Codex	$0.12	$0.61	$0.26	Seat/month (Plus/Pro/Team) or API route (model-dependent)
#5	OpenCode (BYO / Zen)	$0.13	$0.65	$0.27	Tokens from your connected model (BYO / Zen PAYG)

ℹ️ How This Table Is Standardized

Two PR profiles

Simple PR ≈ 10 minutes of work or ~10k total tokens (assume ≈250 PRs/month).
Complex PR ≈ 30 minutes of work or ~72k total tokens across multiple steps (assume ≈60 PRs/month).

Blended Average

70% Simple + 30% Complex, reflecting real-world engineering team averages.

Key pricing notes

Token-metered entries: Costs reflect the most recently updated prices; large-context models run ~3–5× higher.
Seat plans: Per-PR figures amortize seats using the PR volumes above; fewer PRs/month raise effective cost.
Factory plan: Inside the "20M standard tokens" pool, marginal cost near zero; heavy CI/CD usage may pay overage.
Augment message quotas: Per-PR cost scales with conversation length.
Cursor background agents: Long agent chains incur additional metered calls; more variance on complex PRs.
Devin (ACUs): Cost scales with autonomous runtime (minutes → hours per PR), not tokens.
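
For token-metered entries, the raw model cost follows from per-million-token prices. The sketch below uses the $3/M input and $15/M output figures quoted in the Claude Code row; the 80/20 input/output split and the function name are assumptions, and the result is not meant to reproduce the table, whose normalization is not fully specified.

```python
def token_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Raw API cost in USD for one task, given per-million-token prices."""
    return (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000


# Simple profile: ~10k total tokens, assumed split 8k input / 2k output.
simple_cost = round(token_cost_usd(8_000, 2_000, 3.0, 15.0), 3)

# Complex profile: ~72k total tokens, assumed split 60k input / 12k output.
complex_cost = round(token_cost_usd(60_000, 12_000, 3.0, 15.0), 3)
```

Since output tokens cost five times as much as input tokens at these prices, the assumed split moves the result considerably; caching or batch pricing, as the table notes, would lower it.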