Tuesday, 23 June 2026
In the last 24 hours we dispatched 1,556 tasks across 4 models. Here's what we picked, and why.
An autonomous AI fleet, written in TypeScript, picks a model per task using a complexity router. No vibes, no PR team. This is the actual production output of that router:
| # | Model | Dispatches | Share | Why this one |
|---|---|---|---|---|
| 01 | claude-sonnet-4-6 | 1,066 | 68.5% | implementation (standard) |
| 02 | claude-haiku-4-5 | 278 | 17.9% | implementation (light) |
| 03 | gpt-5.4-mini | 189 | 12.1% | implementation (codex pool) |
| 04 | claude-opus-4-6 | 23 | 1.5% | implementation (high complexity) |
Window: 24h to 2026-05-16T00:00:00Z. Source: daemon routing logs. The router writes one decision per dispatch; we parsed 1,556 of them.
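A minimal sketch of how per-dispatch decisions could be tallied into the shares above. The `RoutingDecision` shape and the function name are assumptions for illustration, not the daemon's actual log schema, and the log is rebuilt synthetically from the table's counts rather than read from real log lines.

```typescript
// Hypothetical shape of one routing-log entry; the real log schema is not shown here.
interface RoutingDecision {
  model: string;
  reason: string;
}

interface ModelShare {
  model: string;
  dispatches: number;
  sharePct: string; // share of all dispatches, one decimal place
}

// Count decisions per model, then express each count as a percentage of the total.
function tallyShares(decisions: RoutingDecision[]): ModelShare[] {
  const counts = new Map<string, number>();
  for (const d of decisions) {
    counts.set(d.model, (counts.get(d.model) ?? 0) + 1);
  }
  const total = decisions.length;
  const shares: ModelShare[] = [];
  counts.forEach((dispatches, model) => {
    shares.push({ model, dispatches, sharePct: ((dispatches / total) * 100).toFixed(1) });
  });
  // Busiest model first, matching the table's ordering.
  return shares.sort((a, b) => b.dispatches - a.dispatches);
}

// Synthetic stand-in for the parsed log, rebuilt from the counts in the table.
const tableRows: Array<[string, string, number]> = [
  ["claude-sonnet-4-6", "implementation (standard)", 1066],
  ["claude-haiku-4-5", "implementation (light)", 278],
  ["gpt-5.4-mini", "implementation (codex pool)", 189],
  ["claude-opus-4-6", "implementation (high complexity)", 23],
];
const syntheticLog: RoutingDecision[] = [];
for (const [model, reason, n] of tableRows) {
  for (let i = 0; i < n; i++) syntheticLog.push({ model, reason });
}

const dispatchShares = tallyShares(syntheticLog);
```

Keeping the share as a derived value, not a stored one, means the table can be regenerated from the raw decisions for any window.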
Filtered from arena.ai's leaderboard plus published API prices. Filter thresholds are listed under each tab and are open to argument. Treat this as a starting shortlist, not a verdict.
High volume, schemas, low complexity. Reliability and low cost matter most.
Best value: qwen3-235b-a22b-thinking-2507
Best quality: mimo-v2.5-pro
Filter: Blended price ≤ $2/M, quality ≥ 1250. Sorted by value. These workloads run at scale; small price diffs are real money. 106 models survived.
| # | Model | Quality | Ctx | In /1M | Out /1M | Value ↓ |
|---|---|---|---|---|---|---|
| 01 | qwen3-235b-a22b-thinking-2507 (Alibaba, best value) | 1413.6 | 262k | $0.10 | $0.10 | 413600.0 |
| 02 | granite-4.1-8b (IBM) | 1292.5 | 131k | $0.05 | $0.10 | 344070.6 |
| 03 | gemma-3-4b-it (Google) | 1290.8 | 131k | $0.05 | $0.10 | 342129.4 |
| 04 | gemma-3n-e4b-it (Google) | 1306.2 | 33k | $0.06 | $0.12 | 300235.3 |
| 05 | deepseek-v4-flash (DeepSeek) | 1429.6 | 1.0M | $0.09 | $0.18 | 280790.8 |
| 06 | gemma-3-12b-it (Google) | 1334.2 | 131k | $0.05 | $0.15 | 278516.7 |
| 07 | gpt-oss-20b (OpenAI) | 1287.7 | 131k | $0.03 | $0.14 | 269672.0 |
| 08 | gpt-oss-120b (OpenAI) | 1365.5 | 131k | $0.04 | $0.18 | 265410.3 |
| 09 | gemma-3-27b-it (Google) | 1358.2 | 131k | $0.08 | $0.16 | 263389.7 |
| 10 | qwen3-30b-a3b-instruct-2507 (Alibaba) | 1383.8 | 131k | $0.05 | $0.19 | 256618.5 |
| 11 | nvidia-nemotron-3-nano-30b-a3b-bf16 (Nvidia) | 1349.1 | 262k | $0.06 | $0.24 | 187682.8 |
| 12 | mimo-v2.5 (Xiaomi) | 1426.1 | 1.0M | $0.14 | $0.28 | 179025.2 |
| 13 | mimo-v2.5-pro (Xiaomi, best quality) | 1461.9 | 1.0M | $0.43 | $0.87 | 62461.1 |
| 14 | deepseek-v4-pro (DeepSeek) | 1448.3 | 1.0M | $0.43 | $0.87 | 60619.3 |
| 15 | deepseek-v4-pro-thinking (DeepSeek) | 1447.2 | 1.0M | $0.43 | $0.87 | 60478.7 |
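The filter stated above the table (blended price ≤ $2/M, quality ≥ 1250, sorted by value) can be sketched as follows. Two things here are assumptions, because the page does not define them: the 30/70 input/output weighting used for the blended price, and the `valueScore` formula (quality points above a 1000 baseline per blended dollar, scaled by 100). For the first row, where input and output prices are equal and the weights cancel, this placeholder gives roughly 413,600, which matches the listed figure. That is not confirmation that the table's Value column is computed this way. The two `hypothetical-*` rows are invented to show the filter rejecting something.

```typescript
interface Candidate {
  model: string;
  quality: number; // arena-style score
  inPerM: number; // USD per 1M input tokens
  outPerM: number; // USD per 1M output tokens
}

// ASSUMPTION: the blend weighting is not stated in the source; 30/70 is illustrative.
const INPUT_WEIGHT = 0.3;

function blendedPrice(c: Candidate): number {
  return INPUT_WEIGHT * c.inPerM + (1 - INPUT_WEIGHT) * c.outPerM;
}

// ASSUMPTION: placeholder definition of "value"; the source does not give the formula.
function valueScore(c: Candidate): number {
  return ((c.quality - 1000) / blendedPrice(c)) * 100;
}

// Keep models under the price cap and over the quality floor, best value first.
function shortlist(cands: Candidate[], maxBlended = 2, minQuality = 1250): Candidate[] {
  return cands
    .filter((c) => blendedPrice(c) <= maxBlended && c.quality >= minQuality)
    .sort((a, b) => valueScore(b) - valueScore(a));
}

const candidates: Candidate[] = [
  { model: "mimo-v2.5-pro", quality: 1461.9, inPerM: 0.43, outPerM: 0.87 },
  { model: "hypothetical-pricey", quality: 1480, inPerM: 3, outPerM: 15 }, // fails price cap
  { model: "granite-4.1-8b", quality: 1292.5, inPerM: 0.05, outPerM: 0.1 },
  { model: "hypothetical-weak", quality: 1200, inPerM: 0.02, outPerM: 0.05 }, // fails quality floor
  { model: "qwen3-235b-a22b-thinking-2507", quality: 1413.6, inPerM: 0.1, outPerM: 0.1 },
];

const picks = shortlist(candidates).map((c) => c.model);
```

Making the thresholds parameters keeps them easy to argue with: a different tab can pass its own cap and floor without touching the scoring.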
Right now this is a filter over arena.ai's leaderboard plus a public log of what we ran. Arena Elo measures pairwise human preference on short prompts. It does not measure whether a model produces valid JSON under a schema, whether it hallucinates function names, whether it refuses queries it shouldn't refuse, its p99 latency, or its rate-limit behaviour. Production teams need those signals.
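Two of those missing signals are cheap to compute once you have raw outputs and timings. A minimal sketch, assuming a deliberately simplified flat schema (field name to primitive type) instead of full JSON Schema, and a nearest-rank definition of p99. All names and sample data below are hypothetical.

```typescript
// Hypothetical minimal schema: required field name -> expected primitive type.
type FieldType = "string" | "number" | "boolean";
type FlatSchema = Record<string, FieldType>;

// True only if `raw` parses as a JSON object and every required field has the expected type.
function conformsToSchema(raw: string, schema: FlatSchema): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return false; // not JSON at all
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return false;
  }
  const obj = parsed as Record<string, unknown>;
  return Object.keys(schema).every((key) => typeof obj[key] === schema[key]);
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Invented model outputs: one valid, one with a wrong type, one wrapped in prose.
const orderSchema: FlatSchema = { name: "string", qty: "number" };
const modelOutputs = [
  '{"name":"widget","qty":2}',
  '{"name":"widget","qty":"2"}',
  'Sure! Here is the JSON: {"name":"widget","qty":2}',
];
const validCount = modelOutputs.filter((o) => conformsToSchema(o, orderSchema)).length;

// Invented latencies of 1..100 ms, so the nearest-rank p99 is the 99th smallest sample.
const latenciesMs: number[] = [];
for (let ms = 1; ms <= 100; ms++) latenciesMs.push(ms);
const p99Ms = percentile(latenciesMs, 99);
```

A preference score would rate all three outputs above about the same, but only the first one survives a parser.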
We're building a benchmark runner: fixed prompt suites for RAG, structured extraction, code refactoring, and function calling, run daily against every model. Raw inputs, outputs, judge rationale, and costs will be published. When that lands, the "picks" section gets its real backing. Until then, it is opinion with a citation, not measurement.
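A rough sketch of what one pass of such a runner might look like, under stated assumptions: every type, field, and function name here is hypothetical, the model call and the judge are injected as plain functions so they can be stubbed, and the loop is synchronous for brevity where real API calls would be async.

```typescript
type Suite = "rag" | "structured-extraction" | "code-refactoring" | "function-calling";

interface PromptCase {
  suite: Suite;
  id: string;
  input: string;
}

interface ModelReply {
  output: string;
  costUsd: number;
}

interface Verdict {
  pass: boolean;
  rationale: string;
}

// One published row: raw input, raw output, judge rationale, and cost.
interface RunRecord {
  model: string;
  suite: Suite;
  caseId: string;
  input: string;
  output: string;
  pass: boolean;
  judgeRationale: string;
  costUsd: number;
}

// Run every fixed case against every model and keep the full record of each attempt.
function runSuite(
  models: string[],
  cases: PromptCase[],
  callModel: (model: string, input: string) => ModelReply,
  judge: (c: PromptCase, output: string) => Verdict,
): RunRecord[] {
  const records: RunRecord[] = [];
  for (const model of models) {
    for (const c of cases) {
      const reply = callModel(model, c.input);
      const verdict = judge(c, reply.output);
      records.push({
        model,
        suite: c.suite,
        caseId: c.id,
        input: c.input,
        output: reply.output,
        pass: verdict.pass,
        judgeRationale: verdict.rationale,
        costUsd: reply.costUsd,
      });
    }
  }
  return records;
}

// Stubs standing in for a real API client and a real judge.
const stubCases: PromptCase[] = [
  { suite: "structured-extraction", id: "se-001", input: "Extract the invoice total as JSON." },
  { suite: "function-calling", id: "fc-001", input: "Call get_weather for Oslo." },
];
const stubCall = (model: string, input: string): ModelReply => ({
  output: model + " reply to: " + input,
  costUsd: 0.001,
});
const stubJudge = (c: PromptCase, output: string): Verdict => ({
  pass: output.length > 0,
  rationale: "non-empty output for " + c.id,
});

const runRecords = runSuite(["model-a", "model-b"], stubCases, stubCall, stubJudge);
```

Storing the input and rationale on every record, not just the pass flag, is what would let a reader audit a pick without rerunning it.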