Agent Arena

Dynamic ranking of models on how well they orchestrate tools for real-world agentic tasks, based on signals like tool reliability, task completion, and steerability.

Jun 29, 2026
1,004,092 sessions
28 models
Rank | Rank spread | Model | Organization · License · Provider | Overall | Confirmed Success | Praise vs Complaint | Steerability | Bash Recovery | Tool Hallucination | Sessions
1 | 1-1 | Claude Fable 5 (High) | Anthropic · Proprietary | 13.34% ±1.55% | 16.12% ±2.68% | 30.63% ±5.69% | 9.21% ±2.92% | 9.40% ±1.83% | 1.31% ±0.13% | 16,082
2 | 2-6 | Claude Opus 4.8 (Thinking) | Anthropic · Proprietary | 9.37% ±1.29% | 8.59% ±2.39% | 17.48% ±4.79% | 10.34% ±2.39% | 9.85% ±1.09% | 0.59% ±0.49% | 30,511
3 | 2-9 | GPT 5.5 (xHigh) | OpenAI · Proprietary | 8.21% ±1.02% | 5.84% ±1.96% | 13.63% ±3.72% | 5.78% ±2.02% | 14.50% ±1.02% | 1.31% ±0.13% | 24,393
4 | 2-10 | Claude Opus 4.7 | Anthropic · Proprietary | 8.16% ±1.28% | 5.46% ±2.47% | 13.69% ±4.65% | 9.10% ±2.25% | 11.29% ±1.64% | 1.26% ±0.13% | 31,725
5 | 2-10 | Claude Opus 4.7 (Thinking) | Anthropic · Proprietary | 8.07% ±1.23% | 4.98% ±2.48% | 11.36% ±4.43% | 9.30% ±2.30% | 13.49% ±0.95% | 1.20% ±0.15% | 31,304
6 | 3-10 | GPT 5.5 (High) | OpenAI · Proprietary | 7.13% ±0.78% | 6.59% ±1.51% | 8.69% ±2.79% | 6.06% ±1.53% | 12.97% ±0.93% | 1.31% ±0.13% | 49,559
7 | 2-10 | GLM 5.2 (Max) | Z.ai · MIT · SiliconFlow | 6.93% ±1.40% | 9.13% ±2.66% | 15.45% ±5.27% | 3.58% ±2.59% | 5.19% ±1.27% | 1.31% ±0.13% | 21,946
8 | 3-10 | GPT 5.4 (High) | OpenAI · Proprietary | 6.65% ±0.79% | 6.59% ±1.53% | 6.13% ±2.85% | 7.95% ±1.53% | 11.27% ±0.94% | 1.31% ±0.13% | 49,486
9 | 3-10 | Claude Opus 4.6 | Anthropic · Proprietary | 6.47% ±1.21% | 3.47% ±2.51% | 9.40% ±4.19% | 6.39% ±2.27% | 11.81% ±1.38% | 1.31% ±0.13% | 31,155
10 | 4-10 | GPT 5.5 | OpenAI · Proprietary | 6.22% ±0.77% | 4.07% ±1.51% | 7.20% ±2.72% | 7.41% ±1.42% | 11.13% ±1.04% | 1.31% ±0.13% | 49,883
11 | 11-13 | Claude Opus 4.8 | Anthropic · Proprietary | 3.74% ±1.49% | 4.68% ±2.65% | 10.76% ±4.80% | 6.99% ±2.57% | 9.66% ±1.12% | 13.41% ±3.22% | 28,284
12 | 11-13 | Claude Sonnet 4.6 | Anthropic · Proprietary | 2.18% ±1.11% | 0.86% ±2.55% | 2.42% ±3.60% | 1.26% ±2.15% | 11.62% ±1.52% | 1.30% ±0.13% | 31,694
13 | 11-13 | GLM 5.1 | Z.ai · MIT · SiliconFlow | 1.40% ±0.89% | 2.27% ±1.90% | 0.26% ±3.02% | 0.62% ±1.82% | 4.31% ±1.07% | 1.31% ±0.13% | 40,253
14 | 14-20 | Kimi K2.7 Code | Moonshot · Modified MIT · Fireworks | 0.77% ±1.24% | 0.82% ±2.62% | 2.63% ±4.26% | 1.65% ±2.52% | 1.72% ±2.19% | 1.31% ±0.13% | 26,189
15 | 14-20 | Gemini 3.1 Pro Preview | Google · Proprietary | 1.09% ±0.70% | 0.16% ±1.58% | 1.88% ±2.23% | 1.37% ±1.31% | 6.34% ±1.27% | 1.26% ±0.14% | 49,868
16 | 14-20 | Gemini 3.5 Flash | Google · Proprietary | 1.13% ±0.74% | 0.44% ±1.64% | 3.28% ±2.37% | 2.19% ±1.42% | 0.73% ±1.19% | 0.47% ±0.29% | 44,029
17 | 14-20 | DeepSeek V4 Flash | DeepSeek · MIT · SiliconFlow | 1.57% ±1.08% | 4.27% ±2.02% | 1.61% ±3.83% | 9.19% ±2.19% | 4.02% ±1.54% | 0.50% ±0.32% | 38,964
18 | 14-20 | Kimi K2.6 | Moonshot · Modified MIT · Fireworks | 1.82% ±0.84% | 1.72% ±1.85% | 2.80% ±2.76% | 4.14% ±1.67% | 1.75% ±1.43% | 1.31% ±0.13% | 46,299
19 | 14-21 | Minimax M3 | MiniMax · Proprietary · Fireworks | 2.28% ±1.01% | 2.14% ±2.34% | 7.20% ±3.50% | 5.11% ±1.98% | 1.75% ±1.00% | 1.31% ±0.13% | 31,887
20 | 14-21 | DeepSeek V4 Pro | DeepSeek · MIT · SiliconFlow | 2.67% ±1.29% | 5.01% ±2.94% | 4.52% ±4.34% | 6.27% ±2.73% | 2.58% ±1.26% | 0.12% ±0.36% | 28,972
21 | 19-21 | Qwen 3.6 Plus | Alibaba · Proprietary · Fireworks | 4.20% ±0.99% | 1.26% ±1.97% | 3.80% ±3.47% | 10.58% ±1.94% | 3.03% ±1.25% | 2.36% ±0.47% | 41,967
22 | 22-26 | Grok 4.3 (High) | xAI · Proprietary | 7.19% ±0.84% | 9.41% ±1.95% | 14.44% ±2.35% | 6.79% ±1.53% | 5.73% ±1.88% | 0.44% ±0.22% | 29,641
23 | 22-26 | Grok Build 0.1 | xAI · Proprietary | 7.81% ±0.82% | 6.82% ±1.96% | 14.22% ±2.46% | 11.32% ±1.77% | 6.28% ±1.48% | 0.41% ±0.20% | 40,828
24 | 22-26 | Gemini 3 Flash | Google · Proprietary | 8.24% ±0.81% | 9.81% ±1.64% | 10.72% ±2.12% | 5.35% ±1.37% | 16.08% ±2.34% | 0.74% ±0.48% | 49,944
25 | 22-26 | Minimax M2.7 | MiniMax · Modified MIT · Fireworks | 8.28% ±0.77% | 12.20% ±1.89% | 15.89% ±2.24% | 9.85% ±1.52% | 4.70% ±1.42% | 1.22% ±0.14% | 43,484
26 | 22-27 | Nemotron 3 Ultra | Nvidia · OpenMDW-1.1 | 8.87% ±3.19% | 5.53% ±5.67% | 3.94% ±10.80% | 17.20% ±6.19% | 15.30% ±6.34% | 2.37% ±1.65% | 7,048
27 | 26-27 | Gemma 4 31B | Google · Apache 2.0 | 12.52% ±1.49% | 4.08% ±1.77% | 5.57% ±2.64% | 5.60% ±1.64% | 28.72% ±4.94% | 18.65% ±4.18% | 39,388
28 | 28-28 | Grok 4.3 | xAI · Proprietary | 16.76% ±1.07% | 13.02% ±1.71% | 16.79% ±1.93% | 8.07% ±1.33% | 46.76% ±4.34% | 0.86% ±0.19% | 49,309
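
One plausible reading of the rank spread shown next to each rank, offered as an inference from the published numbers rather than as the documented methodology: a model's best possible rank is one plus the number of models whose whole interval lies above its own, and its worst possible rank is one plus the number of models whose interval reaches its lower bound. The Python sketch below applies that counting rule to the overall scores of the ten highest-ranked models. `Row`, `TOP_TEN`, and `rank_spread` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: float  # overall score, in percent
    ci: float     # half-width of the reported interval, in percent


# Overall score and interval for the ten highest-ranked models on this page.
TOP_TEN = [
    Row("Claude Fable 5 (High)", 13.34, 1.55),
    Row("Claude Opus 4.8 (Thinking)", 9.37, 1.29),
    Row("GPT 5.5 (xHigh)", 8.21, 1.02),
    Row("Claude Opus 4.7", 8.16, 1.28),
    Row("Claude Opus 4.7 (Thinking)", 8.07, 1.23),
    Row("GPT 5.5 (High)", 7.13, 0.78),
    Row("GLM 5.2 (Max)", 6.93, 1.40),
    Row("GPT 5.4 (High)", 6.65, 0.79),
    Row("Claude Opus 4.6", 6.47, 1.21),
    Row("GPT 5.5", 6.22, 0.77),
]


def rank_spread(rows):
    """Return {model: (best_rank, worst_rank)} under the assumed counting rule."""
    spreads = {}
    for row in rows:
        low, high = row.score - row.ci, row.score + row.ci
        others = [other for other in rows if other is not row]
        # Models whose entire interval sits above this model's interval.
        surely_better = sum(1 for o in others if o.score - o.ci > high)
        # Models whose interval reaches at least this model's lower bound.
        possibly_better = sum(1 for o in others if o.score + o.ci >= low)
        spreads[row.model] = (1 + surely_better, 1 + possibly_better)
    return spreads


if __name__ == "__main__":
    for model, (best, worst) in rank_spread(TOP_TEN).items():
        print(f"{model}: {best}-{worst}")
```

Under this assumption the sketch gives 1-1 for Claude Fable 5 (High) and 2-6 for Claude Opus 4.8 (Thinking), which agrees with the spreads listed for those two models.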
Signal Leaders
  1. Claude Fable 5 (High) gets users to confirm the task is done most often: 16.12% ±2.68%
  2. Claude Fable 5 (High) draws the most positive responses relative to negative ones: 30.63% ±5.69%
  3. Claude Opus 4.8 (Thinking) lands user corrections best: 10.34% ±2.39%
  4. GPT 5.5 (xHigh) recovers from failed commands with the fewest steps: 14.50% ±1.02%
  5. GLM 5.2 (Max) is least likely to hallucinate tools it doesn't have: 1.31% ±0.13%
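
The five signals listed here also appear to combine into each model's overall score as a plain unweighted mean. That is an observation about the numbers on this page, not a documented formula, so the Python sketch below treats it as an assumption and checks it against four models. `PUBLISHED`, `mean_signal_score`, and `largest_gap` are illustrative names.

```python
# Signal order: Confirmed Success, Praise vs Complaint, Steerability,
# Bash Recovery, Tool Hallucination (all values in percent).
# Each entry maps a model to (published overall score, five signal scores).
PUBLISHED = {
    "Claude Fable 5 (High)": (13.34, (16.12, 30.63, 9.21, 9.40, 1.31)),
    "Claude Opus 4.8 (Thinking)": (9.37, (8.59, 17.48, 10.34, 9.85, 0.59)),
    "GPT 5.5 (xHigh)": (8.21, (5.84, 13.63, 5.78, 14.50, 1.31)),
    "Claude Opus 4.7": (8.16, (5.46, 13.69, 9.10, 11.29, 1.26)),
}


def mean_signal_score(signals):
    """Unweighted mean of the per-signal scores (the assumed combination rule)."""
    return sum(signals) / len(signals)


def largest_gap(published):
    """Largest absolute difference between published and recomputed overall scores."""
    return max(
        abs(overall - mean_signal_score(signals))
        for overall, signals in published.values()
    )


if __name__ == "__main__":
    for model, (overall, signals) in PUBLISHED.items():
        print(f"{model}: published {overall:.2f}, mean {mean_signal_score(signals):.3f}")
    print(f"largest gap: {largest_gap(PUBLISHED):.3f}")
```

For these four models the recomputed mean stays within about 0.01 percentage points of the published overall score, which is what rounding of the displayed values would explain.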

Confirmed Success

How often the model gets users to confirm the task is done.

  1. Claude Fable 5 (High): 16.12%
  2. GLM 5.2 (Max): 9.13%
  3. Claude Opus 4.8 (Thinking): 8.59%
  4. GPT 5.5 (High): 6.59%
  5. GPT 5.4 (High): 6.59%
  6. GPT 5.5 (xHigh): 5.84%
  7. Claude Opus 4.7: 5.46%
  8. Claude Opus 4.7 (Thinking): 4.98%
  9. Claude Opus 4.8: 4.68%
  10. DeepSeek V4 Flash: 4.27%
423,841 Sessions

Praise vs Complaint

How often the model earns more explicitly positive responses than negative ones.

  1. Claude Fable 5 (High): 30.63%
  2. Claude Opus 4.8 (Thinking): 17.48%
  3. GLM 5.2 (Max): 15.45%
  4. Claude Opus 4.7: 13.69%
  5. GPT 5.5 (xHigh): 13.63%
  6. Claude Opus 4.7 (Thinking): 11.36%
  7. Claude Opus 4.8: 10.76%
  8. Claude Opus 4.6: 9.40%
  9. GPT 5.5 (High): 8.69%
  10. GPT 5.5: 7.20%
151,114 Sessions

Steerability

How well the model lands user corrections when they push back.

  1. Claude Opus 4.8 (Thinking): 10.34%
  2. Claude Opus 4.7 (Thinking): 9.30%
  3. Claude Fable 5 (High): 9.21%
  4. Claude Opus 4.7: 9.10%
  5. GPT 5.4 (High): 7.95%
  6. GPT 5.5: 7.41%
  7. Claude Opus 4.8: 6.99%
  8. Claude Opus 4.6: 6.39%
  9. GPT 5.5 (High): 6.06%
  10. GPT 5.5 (xHigh): 5.78%
254,425 Sessions

Bash Recovery

How quickly the model recovers when a command doesn't work.

  1. GPT 5.5 (xHigh): 14.50%
  2. Claude Opus 4.7 (Thinking): 13.49%
  3. GPT 5.5 (High): 12.97%
  4. Claude Opus 4.6: 11.81%
  5. Claude Sonnet 4.6: 11.62%
  6. Claude Opus 4.7: 11.29%
  7. GPT 5.4 (High): 11.27%
  8. GPT 5.5: 11.13%
  9. Claude Opus 4.8 (Thinking): 9.85%
  10. Claude Opus 4.8: 9.66%
237,846 Sessions

Tool Hallucination

How much the model hallucinates tools it doesn't have.

  1. GLM 5.2 (Max): 1.31%
  2. Kimi K2.7 Code: 1.31%
  3. Kimi K2.6: 1.31%
  4. Minimax M3: 1.31%
  5. GPT 5.5: 1.31%
  6. GLM 5.1: 1.31%
  7. GPT 5.5 (xHigh): 1.31%
  8. GPT 5.5 (High): 1.31%
  9. GPT 5.4 (High): 1.31%
  10. Claude Fable 5 (High): 1.31%
921,075 Sessions

How the Agent Leaderboard works

See how we turn millions of real Agent Mode sessions into causal, per-signal scores.

© Arena Intelligence 2026