Question 1

What is the agent leaderboard?

Accepted Answer

Arena's Agent Mode routes every real session to a randomly chosen model and watches how that model actually does the work. The current Agent leaderboard ranks those models on how well they perform as the orchestrator, the main model that decides which tools to call (bash, web search, fetching pages, writing files, and so on), across millions of real, in-the-wild Agent Mode interactions. Instead of asking people to vote on two side-by-side answers, Agent Mode collects single-threaded user feedback and scores models on what happened while they were doing real tasks.

Question 2

How is this different from other Arena leaderboards?

Accepted Answer

Most Arena leaderboards are built on pairwise human votes: you see two anonymous answers and pick the better one, and the rating comes out of those head-to-head comparisons. The Agent leaderboard differs in three ways. First, it uses single-threaded traces, not battles: in Agent Mode, users interact with a single agent in a long-running thread, sometimes over hundreds of turns, rather than with two models at a time in a battle. Second, it combines explicit feedback with implicit signals: previous leaderboards were calculated from votes, which are explicitly stated feedback, whereas Agent Arena also measures implicit behavioral signals such as natural-language praise and complaints, tool hallucination, and more, and aggregates them into a single leaderboard. Third, it uses a new methodology called causal tracing, not Bradley-Terry regression: we mine traces for signals and then use causal inference techniques to estimate treatment effects for different subcomponents of the agent, reporting the causal effect of using a specific model compared to the average model.

Question 3

How does the ranking work?

Accepted Answer

Because Agent Mode sends every session to a random model, we can infer a model's causal treatment effect by observing its behavior. For each signal we compute a per-model score and express it as a contrast, in percentage points, against a randomized baseline that represents the average model. That per-model, per-signal contrast is the net improvement: how much better or worse a behavior becomes when a particular model is substituted in. Positive means above average, negative means below average. As the field gets stronger, the baseline rises, so any particular model's net improvement shrinks, which keeps the leaderboard live and relative to flagship models from all the labs. The headline rank is a weighted average of a model's net improvement across all the signals; the weights are equal today, so every signal gets one vote. We also show a 95% confidence interval on each number so you can see when two models are genuinely separated versus too close to call. Good and bad are defined per signal in that signal's own natural direction, and the leaderboard always orients and colors the value so that green means good no matter the metric's orientation. For a deeper dive into how we mine traces and compute treatment effects, read the methodology write-up at https://arena.ai/blog/agent-arena-methodology/.
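
The contrast and the headline average described above can be sketched in a few lines of Python. This is a minimal illustration, not Arena's estimator: the model names and rates are hypothetical, the baseline is assumed to be the simple mean across models, and the function names `net_improvement` and `headline_score` are made up for this example. The methodology write-up is the authority on how the real baseline and treatment effects are computed.

```python
from statistics import mean

# Hypothetical per-model rates (percent) for one signal; illustrative only.
rates = {"model_a": 62.0, "model_b": 55.0, "model_c": 48.0}


def net_improvement(rates_by_model):
    """Contrast each model with the average model, in percentage points."""
    baseline = mean(rates_by_model.values())
    return {name: rate - baseline for name, rate in rates_by_model.items()}


def headline_score(contrasts_by_signal, weights=None):
    """Weighted average of one model's per-signal contrasts (equal weights by default)."""
    if weights is None:
        weights = {signal: 1.0 for signal in contrasts_by_signal}
    total_weight = sum(weights[s] for s in contrasts_by_signal)
    weighted_sum = sum(contrasts_by_signal[s] * weights[s] for s in contrasts_by_signal)
    return weighted_sum / total_weight


contrasts = net_improvement(rates)
print(contrasts)  # model_a lands above the baseline, model_c below it
print(headline_score({"signal_1": 4.0, "signal_2": -2.0}))
```

Because every contrast is taken against the same baseline, the contrasts in this sketch sum to zero across models: one model can only be above average if another is below.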

Question 4

What do the percentages mean?

Accepted Answer

Every score on the leaderboard is a treatment effect: the improvement you would get on a signal by substituting a specific orchestrator for the average orchestrator. A green highlight means the model does better than a typical model, a red highlight means it does worse, and a score near zero means it is about average. So "+7%" with a green highlight means clearly above average and "-3%" with a red highlight means a bit below. The big number next to each model is its overall score, the average of its percentages across every signal; each signal column shows that model's percentage for that one behavior. The "±" after a number is the 95% confidence interval: a score of "+5% ± 2%" means roughly 3% to 7%, so when two models' ranges overlap a lot, treat them as basically tied. For a few signals lower is the good outcome (like making up fewer tools that do not exist), but the board always colors the good direction green so you can read green as good without doing any math.
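
To make the "±" reading concrete, here is a small Python sketch that turns a score and its margin into a range and checks whether two ranges touch. The helper names `interval` and `overlaps` are hypothetical, and an any-overlap check is only a rough reading aid, coarser than the "overlap a lot" rule of thumb above and not a formal significance test. The three overall scores are copied from the leaderboard table.

```python
def interval(score, half_width):
    """Return the (low, high) range implied by 'score ± half_width'."""
    return (score - half_width, score + half_width)


def overlaps(a, b):
    """True when two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


print(interval(5, 2))  # the "+5% ± 2%" example from the text

# Overall scores as listed in the leaderboard table.
fable = interval(13.34, 1.55)          # Claude Fable 5 (High)
opus_thinking = interval(9.37, 1.29)   # Claude Opus 4.8 (Thinking)
gpt_xhigh = interval(8.21, 1.02)       # GPT 5.5 (xHigh)

print(overlaps(fable, opus_thinking))      # these two ranges are separated
print(overlaps(opus_thinking, gpt_xhigh))  # these two ranges overlap
```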

Question 5

What are signals?

Accepted Answer

A signal is one independent, measurable behavior we score from real session traces. Each one captures a different dimension of doing the work well, and the headline score is the equal-weighted average across all of them. The current signals are: Confirmed Success (how often users explicitly confirm the task is done, built from the final explicit task approval or disapproval within a trace; higher is better); Praise vs Complaint (within a task, whether users say more explicitly positive things than negative things, isolating natural language satisfaction separate from button clicks; higher is better); Steerability (when a user pushes back or corrects the model, whether the very next response actually lands instead of being rejected or going nowhere; higher is better); Bash Recovery (after a command fails because of the model's own mistake, how few retries it takes to get back to a working command; higher is better); and Tool Hallucination (how often the model calls a tool that does not exist; lower is better).
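
As a rough illustration of how the signal directions and the equal-weighted average fit together, the Python sketch below flips a lower-is-better signal so that positive always means good, then averages one model's five listed signal values. Treating orientation as a plain sign flip is an assumption made for this example, and `HIGHER_IS_BETTER` and `oriented` are hypothetical names. The five numbers are the per-signal values listed for Claude Opus 4.8 (Thinking) in the leaderboard table.

```python
from statistics import mean

# Direction of each signal as described above (True = higher raw value is better).
HIGHER_IS_BETTER = {
    "Confirmed Success": True,
    "Praise vs Complaint": True,
    "Steerability": True,
    "Bash Recovery": True,
    "Tool Hallucination": False,
}


def oriented(signal, raw_contrast):
    """Flip lower-is-better signals so a positive value always means 'good'.

    Assumption: orientation is a plain sign flip.
    """
    return raw_contrast if HIGHER_IS_BETTER[signal] else -raw_contrast


# Hypothetical raw contrast: tools are hallucinated 1.2 points less often than average.
print(oriented("Tool Hallucination", -1.2))  # positive after orientation

# Already-oriented per-signal values listed for Claude Opus 4.8 (Thinking).
opus_thinking = [8.59, 17.48, 10.34, 9.85, 0.59]
overall = round(mean(opus_thinking), 2)
print(overall)  # agrees with the listed overall score of 9.37
```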

Question 6

Do the rankings change over time?

Accepted Answer

Yes. The leaderboard is a living measurement, not a one-time static score. It refreshes as new real Agent Mode sessions come in, so a model's score can move as we gather more evidence about how it behaves, and you can always see the last updated date and the number of observations behind the current leaderboard. Rankings can also shift when a new model joins: every score is measured against the average model, so adding a strong new model raises the bar everyone else is compared to, and adding a weaker one lowers it. That means a model's number can move a little even when its own behavior has not changed, simply because the competition did. As more sessions add up, the margins of error also get smaller, so close calls between models become clearer over time.
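
Both effects can be seen in a toy Python example. The rates below are hypothetical, the baseline is assumed to be the simple mean across models, and the margin of error uses a textbook normal approximation for a single proportion. Arena's actual intervals come from its own estimator and may behave differently.

```python
from math import sqrt
from statistics import mean


def net_improvement(rates_by_model):
    """Contrast each model with the average model, in percentage points."""
    baseline = mean(rates_by_model.values())
    return {name: rate - baseline for name, rate in rates_by_model.items()}


def half_width_95(rate, n):
    """Normal-approximation 95% half-width (percentage points) for a rate in [0, 1]."""
    return 100 * 1.96 * sqrt(rate * (1 - rate) / n)


# Hypothetical success rates (percent); not real leaderboard data.
field = {"model_a": 60.0, "model_b": 50.0, "model_c": 40.0}
before = net_improvement(field)["model_a"]

# A strong new model joins: the baseline rises while model_a's own rate is unchanged.
field["model_d"] = 70.0
after = net_improvement(field)["model_a"]
print(before, after)

# Four times as many sessions roughly halves the margin of error.
print(half_width_95(0.5, 10_000), half_width_95(0.5, 40_000))
```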

Question 7

Will there be more signals?

Accepted Answer

Yes. The current set is a starting point, and the framework is built to grow. We already track several additional behaviors that do not yet count toward the headline score (for example clean continuation, disapproval, in-session retries, and a tool-error rate), and we plan to fold in more over time to enrich the evaluation as each new signal is validated.

Rank	Rank spread	Model	Overall	Confirmed Success	Praise vs Complaint	Steerability	Bash Recovery	Tool Hallucination	Observations
1	1-1	Claude Fable 5 (High) Anthropic · Proprietary	13.34%±1.55%	16.12%±2.68%	30.63%±5.69%	9.21%±2.92%	9.40%±1.83%	1.31%±0.13%	16,082
2	2-6	Claude Opus 4.8 (Thinking) Anthropic · Proprietary	9.37%±1.29%	8.59%±2.39%	17.48%±4.79%	10.34%±2.39%	9.85%±1.09%	0.59%±0.49%	30,511
3	2-9	GPT 5.5 (xHigh) OpenAI · Proprietary	8.21%±1.02%	5.84%±1.96%	13.63%±3.72%	5.78%±2.02%	14.50%±1.02%	1.31%±0.13%	24,393
4	2-10	Claude Opus 4.7 Anthropic · Proprietary	8.16%±1.28%	5.46%±2.47%	13.69%±4.65%	9.10%±2.25%	11.29%±1.64%	1.26%±0.13%	31,725
5	2-10	Claude Opus 4.7 (Thinking) Anthropic · Proprietary	8.07%±1.23%	4.98%±2.48%	11.36%±4.43%	9.30%±2.30%	13.49%±0.95%	1.20%±0.15%	31,304
6	3-10	GPT 5.5 (High) OpenAI · Proprietary	7.13%±0.78%	6.59%±1.51%	8.69%±2.79%	6.06%±1.53%	12.97%±0.93%	1.31%±0.13%	49,559
7	2-10	GLM 5.2 (Max) Z.ai · MIT · SiliconFlow	6.93%±1.40%	9.13%±2.66%	15.45%±5.27%	3.58%±2.59%	5.19%±1.27%	1.31%±0.13%	21,946
8	3-10	GPT 5.4 (High) OpenAI · Proprietary	6.65%±0.79%	6.59%±1.53%	6.13%±2.85%	7.95%±1.53%	11.27%±0.94%	1.31%±0.13%	49,486
9	3-10	Claude Opus 4.6 Anthropic · Proprietary	6.47%±1.21%	3.47%±2.51%	9.40%±4.19%	6.39%±2.27%	11.81%±1.38%	1.31%±0.13%	31,155
10	4-10	GPT 5.5 OpenAI · Proprietary	6.22%±0.77%	4.07%±1.51%	7.20%±2.72%	7.41%±1.42%	11.13%±1.04%	1.31%±0.13%	49,883
11	11-13	Claude Opus 4.8 Anthropic · Proprietary	3.74%±1.49%	4.68%±2.65%	10.76%±4.80%	6.99%±2.57%	9.66%±1.12%	13.41%±3.22%	28,284
12	11-13	Claude Sonnet 4.6 Anthropic · Proprietary	2.18%±1.11%	0.86%±2.55%	2.42%±3.60%	1.26%±2.15%	11.62%±1.52%	1.30%±0.13%	31,694
13	11-13	GLM 5.1 Z.ai · MIT · SiliconFlow	1.40%±0.89%	2.27%±1.90%	0.26%±3.02%	0.62%±1.82%	4.31%±1.07%	1.31%±0.13%	40,253
14	14-20	Kimi K2.7 Code Moonshot · Modified MIT · Fireworks	0.77%±1.24%	0.82%±2.62%	2.63%±4.26%	1.65%±2.52%	1.72%±2.19%	1.31%±0.13%	26,189
15	14-20	Gemini 3.1 Pro Preview Google · Proprietary	1.09%±0.70%	0.16%±1.58%	1.88%±2.23%	1.37%±1.31%	6.34%±1.27%	1.26%±0.14%	49,868
16	14-20	Gemini 3.5 Flash Google · Proprietary	1.13%±0.74%	0.44%±1.64%	3.28%±2.37%	2.19%±1.42%	0.73%±1.19%	0.47%±0.29%	44,029
17	14-20	DeepSeek V4 Flash DeepSeek · MIT · SiliconFlow	1.57%±1.08%	4.27%±2.02%	1.61%±3.83%	9.19%±2.19%	4.02%±1.54%	0.50%±0.32%	38,964
18	14-20	Kimi K2.6 Moonshot · Modified MIT · Fireworks	1.82%±0.84%	1.72%±1.85%	2.80%±2.76%	4.14%±1.67%	1.75%±1.43%	1.31%±0.13%	46,299
19	14-21	Minimax M3 MiniMax · Proprietary · Fireworks	2.28%±1.01%	2.14%±2.34%	7.20%±3.50%	5.11%±1.98%	1.75%±1.00%	1.31%±0.13%	31,887
20	14-21	DeepSeek V4 Pro DeepSeek · MIT · SiliconFlow	2.67%±1.29%	5.01%±2.94%	4.52%±4.34%	6.27%±2.73%	2.58%±1.26%	0.12%±0.36%	28,972
21	19-21	Qwen 3.6 Plus Alibaba · Proprietary · Fireworks	4.20%±0.99%	1.26%±1.97%	3.80%±3.47%	10.58%±1.94%	3.03%±1.25%	2.36%±0.47%	41,967
22	22-26	Grok 4.3 (High) xAI · Proprietary	7.19%±0.84%	9.41%±1.95%	14.44%±2.35%	6.79%±1.53%	5.73%±1.88%	0.44%±0.22%	29,641
23	22-26	Grok Build 0.1 xAI · Proprietary	7.81%±0.82%	6.82%±1.96%	14.22%±2.46%	11.32%±1.77%	6.28%±1.48%	0.41%±0.20%	40,828
24	22-26	Gemini 3 Flash Google · Proprietary	8.24%±0.81%	9.81%±1.64%	10.72%±2.12%	5.35%±1.37%	16.08%±2.34%	0.74%±0.48%	49,944
25	22-26	Minimax M2.7 MiniMax · Modified MIT · Fireworks	8.28%±0.77%	12.20%±1.89%	15.89%±2.24%	9.85%±1.52%	4.70%±1.42%	1.22%±0.14%	43,484
26	22-27	Nemotron 3 Ultra Nvidia · OpenMDW-1.1	8.87%±3.19%	5.53%±5.67%	3.94%±10.80%	17.20%±6.19%	15.30%±6.34%	2.37%±1.65%	7,048
27	26-27	Gemma 4 31B Google · Apache 2.0	12.52%±1.49%	4.08%±1.77%	5.57%±2.64%	5.60%±1.64%	28.72%±4.94%	18.65%±4.18%	39,388
28	28-28	Grok 4.3 xAI · Proprietary	16.76%±1.07%	13.02%±1.71%	16.79%±1.93%	8.07%±1.33%	46.76%±4.34%	0.86%±0.19%	49,309
