Tuesday, 23 June 2026
In the last 24 hours we dispatched 1,556 tasks across 4 models. Here's what we picked, and why.
An autonomous AI fleet, written in TypeScript, picks a model per task using a complexity router. No vibes, no PR team. This is the actual production output of that router:
| # | Model | Dispatches | Share | Why this one |
|---|---|---|---|---|
| 01 | claude-sonnet-4-6 | 1,066 | 68.5% | implementation (standard) |
| 02 | claude-haiku-4-5 | 278 | 17.9% | implementation (light) |
| 03 | gpt-5.4-mini | 189 | 12.1% | implementation (codex pool) |
| 04 | claude-opus-4-6 | 23 | 1.5% | implementation (high complexity) |
Window: 24h to 2026-05-16T00:00:00Z. Source: daemon routing logs. The router writes one decision per dispatch; we parsed 1,556 of them.
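The share column is plain arithmetic over the decision log: count decisions per model, divide by the total, round to one decimal. A minimal sketch of that tally follows. The `RoutingDecision` shape and the `tallyShares` name are assumptions for illustration, since the real log format is not shown here; the counts are the ones published in the table above.

```typescript
// Hypothetical shape of one routing decision; the real log schema is not published here.
interface RoutingDecision {
  model: string;
  reason: string;
}

interface ShareRow {
  model: string;
  dispatches: number;
  sharePct: number; // percentage, rounded to one decimal place
}

// Count decisions per model, convert counts to shares, sort by count descending.
function tallyShares(decisions: RoutingDecision[]): ShareRow[] {
  const counts = new Map<string, number>();
  for (const d of decisions) {
    counts.set(d.model, (counts.get(d.model) ?? 0) + 1);
  }
  const total = decisions.length;
  return Array.from(counts.entries())
    .map(([model, dispatches]) => ({
      model,
      dispatches,
      sharePct: total === 0 ? 0 : Math.round((dispatches / total) * 1000) / 10,
    }))
    .sort((a, b) => b.dispatches - a.dispatches);
}

// Rebuild synthetic decisions from the published counts so the sketch is self-contained.
const published: Array<[string, number, string]> = [
  ["claude-sonnet-4-6", 1066, "implementation (standard)"],
  ["claude-haiku-4-5", 278, "implementation (light)"],
  ["gpt-5.4-mini", 189, "implementation (codex pool)"],
  ["claude-opus-4-6", 23, "implementation (high complexity)"],
];

const decisions: RoutingDecision[] = [];
for (const [model, n, reason] of published) {
  for (let i = 0; i < n; i++) decisions.push({ model, reason });
}

const rows = tallyShares(decisions);
for (const r of rows) {
  console.log(`${r.model}\t${r.dispatches}\t${r.sharePct.toFixed(1)}%`);
}
```

Run against the published counts, this reproduces the four shares in the table, which sum to 100.0%.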
The picks below are filtered from arena.ai's leaderboard plus published API prices. Filter thresholds are listed under each tab, and they are arguable. Treat this as a starting shortlist, not a verdict.
When the task is hard enough that you'd pay anything for the right answer.
Best value: mimo-v2.5-pro
Best quality: claude-opus-4-6-thinking
Filter: Quality ≥ 1450 across overall conversation. Sorted by raw score. The bench-press of LLMs. 27 models survived.
| # | Model | Vendor | Quality | Ctx | In /1M | Out /1M | Score ↓ |
|---|---|---|---|---|---|---|---|
| 01 | claude-opus-4-6-thinking (best quality) | Anthropic | 1500.7 | 1.0M | $5.00 | $25.00 | 1500.7 |
| 02 | claude-opus-4-6 | Anthropic | 1497.7 | 1.0M | $5.00 | $25.00 | 1497.7 |
| 03 | claude-fable-5 | Anthropic | 1493.7 | 1.0M | $10.00 | $50.00 | 1493.7 |
| 04 | claude-opus-4-7-thinking | Anthropic | 1489.4 | 1.0M | $5.00 | $25.00 | 1489.4 |
| 05 | claude-opus-4-7 | Anthropic | 1480.8 | 1.0M | $5.00 | $25.00 | 1480.8 |
| 06 | gemini-3.1-pro-preview | Google | 1480.1 | 1.0M | $2.00 | $12.00 | 1480.1 |
| 07 | gemini-3.5-flash | Google | 1479.7 | 1.0M | $1.50 | $9.00 | 1479.7 |
| 08 | gemini-3-pro | Google | 1479.4 | 1.0M | $2.00 | $12.00 | 1479.4 |
| 09 | qwen3.7-max-preview | Alibaba | 1474.8 | 1.0M | $1.25 | $3.75 | 1474.8 |
| 10 | muse-spark | Meta | 1472.0 | — | — | — | 1472.0 |
| 11 | qwen3.5-max-preview | Alibaba | 1470.0 | — | — | — | 1470.0 |
| 12 | gpt-5.4-high | OpenAI | 1469.9 | 1.1M | $2.50 | $15.00 | 1469.9 |
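The filter behind this tab is a threshold plus a sort. Here is a rough sketch, assuming a hypothetical `LeaderboardEntry` shape and a `shortlist` helper, neither of which comes from the production code. The sample rows are copied from the table, except for one clearly labeled illustrative entry that sits below the threshold.

```typescript
// Assumed shape for one leaderboard row; prices are optional because some rows list none.
interface LeaderboardEntry {
  model: string;
  quality: number; // arena-style rating
  inPerMTok?: number; // USD per 1M input tokens
  outPerMTok?: number; // USD per 1M output tokens
}

// Keep entries at or above the threshold, highest quality first.
// filter() returns a new array, so sorting does not mutate the caller's input.
function shortlist(entries: LeaderboardEntry[], minQuality = 1450): LeaderboardEntry[] {
  return entries
    .filter((e) => e.quality >= minQuality)
    .sort((a, b) => b.quality - a.quality);
}

const sample: LeaderboardEntry[] = [
  { model: "gemini-3.1-pro-preview", quality: 1480.1, inPerMTok: 2.0, outPerMTok: 12.0 },
  { model: "claude-opus-4-6-thinking", quality: 1500.7, inPerMTok: 5.0, outPerMTok: 25.0 },
  { model: "muse-spark", quality: 1472.0 }, // no published price in the table
  // Illustrative only: not a real leaderboard row, included to show the filter dropping it.
  { model: "example-below-threshold", quality: 1400.0 },
];

const picks = shortlist(sample);
for (const p of picks) {
  console.log(`${p.model}\t${p.quality.toFixed(1)}`);
}
```

The threshold is a default parameter so each tab can pass its own value. How "best value" weighs price against quality is not specified above, so the sketch leaves that out.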
Right now this page is a filter on arena.ai's leaderboard plus a public log of what we ran. Arena Elo measures pairwise human preference on short prompts. It does not measure whether a model produces valid JSON under a schema, whether it hallucinates function names, whether it refuses queries it shouldn't, its p99 latency, or its rate-limit behaviour. Production teams need those signals.
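Two of those signals are cheap to compute once you have raw outputs and timings. The sketch below is an illustration, not our runner: `matchesShape` is a hand-rolled check for required fields and primitive types (a stand-in for a real JSON Schema validator), and `percentile` uses the nearest-rank method. Both names and the sample data are assumptions.

```typescript
type FieldType = "string" | "number" | "boolean";

// True when `raw` parses as a JSON object and every required field has the expected primitive type.
function matchesShape(raw: string, required: Record<string, FieldType>): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return false; // not valid JSON at all
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return false;
  }
  const obj = parsed as Record<string, unknown>;
  return Object.keys(required).every((key) => typeof obj[key] === required[key]);
}

// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Made-up model outputs and latencies, for illustration only.
const outputs = ['{"name":"invoice","total":42.5}', '{"name":"invoice"', '{"name":7,"total":1}'];
const shape: Record<string, FieldType> = { name: "string", total: "number" };
const validCount = outputs.filter((o) => matchesShape(o, shape)).length;

const latenciesMs = [120, 95, 310, 101, 99, 2400, 130, 98, 105, 110];
console.log(`valid: ${validCount}/${outputs.length}, p99: ${percentile(latenciesMs, 99)} ms`);
```

With only ten samples, p99 is simply the slowest request, which is the point: tail latency needs far more samples than a median does.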
We're building a benchmark runner: fixed prompt suites for RAG, structured extraction, code refactoring, and function calling, run daily against every model. Raw inputs, outputs, judge rationale, and costs will be published. When that lands, the "picks" section gets its real backing. Until then, the picks section is opinion with a citation, not measurement.