
llm-rate

Tuesday, 23 June 2026

In the last 24 hours we dispatched 1,556 tasks across 4 models. Here's what we picked, and why.

What we ran

An autonomous AI fleet, written in TypeScript, picks a model per task using a complexity router. No vibes, no PR team. This is the actual production output of that router:

#   Model              Dispatches  Share  Why this one
01  claude-sonnet-4-6  1,066       68.5%  implementation (standard)
02  claude-haiku-4-5   278         17.9%  implementation (light)
03  gpt-5.4-mini       189         12.1%  implementation (codex pool)
04  claude-opus-4-6    23          1.5%   implementation (high complexity)

Window: 24h to 2026-05-16T00:00:00Z. Source: daemon routing logs. The router writes one decision per dispatch; we parsed 1,556 of them.
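For readers who want to reproduce the share column from their own logs, here is a minimal sketch of the tally. The `RoutingDecision` shape and the `tallyShares` helper are hypothetical (the real log format isn't shown here); the counts fed in are the ones from the table above.

```typescript
// Hypothetical shape of one routing decision; the real log format may differ.
interface RoutingDecision {
  model: string;
  reason: string;
}

interface ModelShare {
  model: string;
  dispatches: number;
  sharePct: number; // percentage of all dispatches, rounded to one decimal
}

// Count decisions per model, then express each count as a share of the total.
function tallyShares(decisions: RoutingDecision[]): ModelShare[] {
  const counts = new Map<string, number>();
  for (const d of decisions) {
    counts.set(d.model, (counts.get(d.model) ?? 0) + 1);
  }
  const total = decisions.length;
  return Array.from(counts.entries())
    .map(([model, dispatches]) => ({
      model,
      dispatches,
      sharePct: total === 0 ? 0 : Math.round((dispatches / total) * 1000) / 10,
    }))
    .sort((a, b) => b.dispatches - a.dispatches);
}

// Synthetic decisions built from the published counts, for illustration only.
const repeat = (model: string, reason: string, n: number): RoutingDecision[] =>
  Array.from({ length: n }, () => ({ model, reason }));

const decisions: RoutingDecision[] = [
  ...repeat("claude-sonnet-4-6", "implementation (standard)", 1066),
  ...repeat("claude-haiku-4-5", "implementation (light)", 278),
  ...repeat("gpt-5.4-mini", "implementation (codex pool)", 189),
  ...repeat("claude-opus-4-6", "implementation (high complexity)", 23),
];

const shares = tallyShares(decisions);
for (const s of shares) {
  console.log(`${s.model}  ${s.dispatches}  ${s.sharePct.toFixed(1)}%`);
}
```

Rounding to one decimal is a presentation choice; keep the raw counts if you need the shares to sum to exactly 100.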

If you don't have a router, here are the picks per common task

Filtered from arena.ai's leaderboard plus published API prices. The filter thresholds are listed under each tab, and they are arguable. Treat this as a starting shortlist, not a verdict.

When the task is hard enough that you'd pay anything for the right answer.

Best value

mimo-v2.5-pro

Xiaomi · quality 1461.9 · $0.74/M blended

Best quality

claude-opus-4-6-thinking

Anthropic · quality 1500.7 · $19.00/M blended
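How "blended" is computed isn't stated on this page. One weighting that happens to reproduce the $19.00 figure from the listed $5.00 input and $25.00 output prices is 30% input and 70% output; the sketch below assumes that split. The `blendedPrice` helper and its default weight are assumptions, not the site's documented formula.

```typescript
// Blend input and output prices (USD per 1M tokens) into a single figure.
// ASSUMPTION: outputWeight = 0.7 is a guess that matches one published number;
// the weighting actually used is not documented here.
function blendedPrice(inPerM: number, outPerM: number, outputWeight = 0.7): number {
  if (outputWeight < 0 || outputWeight > 1) {
    throw new RangeError("outputWeight must be between 0 and 1");
  }
  return inPerM * (1 - outputWeight) + outPerM * outputWeight;
}

// Listed prices for claude-opus-4-6-thinking: $5.00 in, $25.00 out.
const opusBlended = blendedPrice(5.0, 25.0);
console.log(`$${opusBlended.toFixed(2)}/M blended`);
```

If your workload is input-heavy (long prompts, short answers), pass a lower `outputWeight`; the ranking by blended price can change with it.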

Filter: Quality ≥ 1450 across overall conversation. Sorted by raw score. The bench-press of LLMs. 27 models survived.

#   Model                     Vendor     Quality  Ctx   In /1M  Out /1M  Score ↓
01  claude-opus-4-6-thinking  Anthropic  1500.7   1.0M  $5.00   $25.00   1500.7
02  claude-opus-4-6           Anthropic  1497.7   1.0M  $5.00   $25.00   1497.7
03  claude-fable-5            Anthropic  1493.7   1.0M  $10.00  $50.00   1493.7
04  claude-opus-4-7-thinking  Anthropic  1489.4   1.0M  $5.00   $25.00   1489.4
05  claude-opus-4-7           Anthropic  1480.8   1.0M  $5.00   $25.00   1480.8
06  gemini-3.1-pro-preview    Google     1480.1   1.0M  $2.00   $12.00   1480.1
07  gemini-3.5-flash          Google     1479.7   1.0M  $1.50   $9.00    1479.7
08  gemini-3-pro              Google     1479.4   1.0M  $2.00   $12.00   1479.4
09  qwen3.7-max-preview       Alibaba    1474.8   1.0M  $1.25   $3.75    1474.8
10  muse-spark                Meta       1472.0   -     -       -        1472.0
11  qwen3.5-max-preview       Alibaba    1470.0   -     -       -        1470.0
12  gpt-5.4-high              OpenAI     1469.9   1.1M  $2.50   $15.00   1469.9
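The filter for this tab is simple enough to sketch: keep rows at or above the quality threshold, then sort by score, highest first. The `LeaderboardRow` type and `shortlist` function are hypothetical names; three rows come from the table above, and the below-threshold row is invented purely to show the filter dropping something.

```typescript
// Hypothetical row shape; only the fields the filter needs.
interface LeaderboardRow {
  model: string;
  vendor: string;
  quality: number;
}

// Keep rows with quality >= minQuality, sorted by quality descending.
// filter() returns a new array, so the caller's input is not reordered.
function shortlist(rows: LeaderboardRow[], minQuality = 1450): LeaderboardRow[] {
  return rows
    .filter((r) => r.quality >= minQuality)
    .sort((a, b) => b.quality - a.quality);
}

const rows: LeaderboardRow[] = [
  { model: "gpt-5.4-high", vendor: "OpenAI", quality: 1469.9 },
  { model: "claude-opus-4-6-thinking", vendor: "Anthropic", quality: 1500.7 },
  { model: "example-small-model", vendor: "Example", quality: 1390.0 }, // invented row
  { model: "gemini-3-pro", vendor: "Google", quality: 1479.4 },
];

const survivors = shortlist(rows);
survivors.forEach((r, i) => {
  console.log(`${String(i + 1).padStart(2, "0")}  ${r.model}  ${r.quality.toFixed(1)}`);
});
```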

What this is, and isn't

Right now this is a filter on arena.ai plus a public log of what we ran. Arena Elo measures pairwise human preference on short prompts. It does not measure whether a model produces valid JSON under a schema, whether it hallucinates function names, whether it refuses queries it shouldn't refuse, its p99 latency, or its rate-limit behaviour. Production teams need those signals.
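To make two of those missing signals concrete, here is a rough sketch of what such checks could look like. Everything in it is hypothetical: the `FieldSpec` mini-schema, the function names, and the sample outputs are illustrations, not part of any existing benchmark runner, and a real suite would likely use a full JSON Schema validator instead.

```typescript
// Hypothetical mini-schema: required top-level keys and their expected typeof.
type FieldSpec = Record<string, "string" | "number" | "boolean">;

// True only if raw parses as a JSON object and every required key has the right type.
function isValidUnderSchema(raw: string, spec: FieldSpec): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return false; // not JSON at all
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return false;
  }
  const obj = parsed as Record<string, unknown>;
  return Object.keys(spec).every((key) => typeof obj[key] === spec[key]);
}

// Return the called function names that were never declared to the model.
function hallucinatedFunctions(called: string[], declared: string[]): string[] {
  const known = new Set(declared);
  return called.filter((name) => !known.has(name));
}

// Invented sample data for illustration.
const spec: FieldSpec = { title: "string", year: "number" };
const goodOutput = '{"title": "Report", "year": 2026}';
const badOutput = '{"title": "Report", "year": "2026"}';

console.log(isValidUnderSchema(goodOutput, spec));
console.log(isValidUnderSchema(badOutput, spec));
console.log(hallucinatedFunctions(["get_weather", "get_wether"], ["get_weather"]));
```

Neither check says anything about answer quality; they only catch outputs that downstream code could not consume at all.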

We're building a benchmark runner: fixed prompt suites for RAG, structured extraction, code refactoring, and function calling, run daily against every model. Raw inputs, outputs, judge rationale, and costs will be published. When that lands, the "picks" section gets its real backing. Until then, the picks section is opinion with a citation, not measurement.