llm-rate

Tuesday, 23 June 2026

In the last 24 hours we dispatched 1,556 tasks across 4 models. Here's what we picked, and why.

What we ran

An autonomous AI fleet, written in TypeScript, picks a model per task using a complexity router. No vibes, no PR team. This is the actual production output of that router:

#   Model              Dispatches  Share   Why this one
01  claude-sonnet-4-6  1,066       68.5%   implementation (standard)
02  claude-haiku-4-5   278         17.9%   implementation (light)
03  gpt-5.4-mini       189         12.1%   implementation (codex pool)
04  claude-opus-4-6    23          1.5%    implementation (high complexity)

Window: 24h to 2026-05-16T00:00:00Z. Source: daemon routing logs. The router writes one decision per dispatch; we parsed 1,556 of them.
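To make the table reproducible, here is a minimal sketch of how a complexity router and a log tally could fit together. It is not the production router: the `FleetTask` shape, the `complexity` score, and the 0.3 / 0.85 thresholds are all hypothetical. Only the model names and the tier labels come from the table above.

```typescript
// Hypothetical sketch. Field names and thresholds are assumptions, not the real router.
interface FleetTask {
  id: string;
  complexity: number; // assumed score in [0, 1]
  codexPool: boolean; // assumed flag for tasks sent to the codex pool
}

interface RoutingDecision {
  taskId: string;
  model: string;
  reason: string;
}

// Tier-to-model mapping mirrors the "Why this one" column of the table.
function pickModel(task: FleetTask): RoutingDecision {
  if (task.codexPool) {
    return { taskId: task.id, model: "gpt-5.4-mini", reason: "implementation (codex pool)" };
  }
  if (task.complexity >= 0.85) {
    return { taskId: task.id, model: "claude-opus-4-6", reason: "implementation (high complexity)" };
  }
  if (task.complexity < 0.3) {
    return { taskId: task.id, model: "claude-haiku-4-5", reason: "implementation (light)" };
  }
  return { taskId: task.id, model: "claude-sonnet-4-6", reason: "implementation (standard)" };
}

interface ShareRow {
  model: string;
  dispatches: number;
  sharePct: number; // rounded to one decimal place
}

// Count logged decisions per model and convert the counts to percentage shares,
// sorted by dispatch count, highest first.
function tallyShares(decisions: RoutingDecision[]): ShareRow[] {
  const counts: Record<string, number> = {};
  for (const d of decisions) {
    counts[d.model] = (counts[d.model] ?? 0) + 1;
  }
  const total = decisions.length;
  return Object.keys(counts)
    .map((model) => ({
      model,
      dispatches: counts[model],
      sharePct: total === 0 ? 0 : Math.round((counts[model] / total) * 1000) / 10,
    }))
    .sort((a, b) => b.dispatches - a.dispatches);
}
```

Feeding `tallyShares` the 1,556 parsed decisions would yield one row per model in the same order as the table.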

If you don't have a router, here are the picks per common task

Filtered from arena.ai's leaderboard plus published API prices. Filter thresholds are listed under each tab and are open to argument. Treat this as a starting shortlist, not a verdict.

Marketing copy, summaries, first drafts. Per-token cost dominates.

Best value

llama-3.1-8b-instruct

Meta · quality 1186.7 · $0.03/M blended

Best quality

mimo-v2.5-pro

Xiaomi · quality 1461.9 · $0.74/M blended

Filter: Blended price ≤ $1/M, no quality floor — let the cheap end shine. Sorted by value. 131 models survived.

#   Model                            Vendor     Quality  Ctx    In /1M  Out /1M  Value ↓
01  llama-3.1-8b-instruct [value]    Meta       1186.7   131k   $0.02   $0.03    691444.4
02  qwen3-235b-a22b-thinking-2507    Alibaba    1413.6   262k   $0.10   $0.10    413600.0
03  granite-4.1-8b                   IBM        1292.5   131k   $0.05   $0.10    344070.6
04  gemma-3-4b-it                    Google     1290.8   131k   $0.05   $0.10    342129.4
05  mistral-small-24b-instruct-2501  Mistral    1233.6   33k    $0.05   $0.08    329028.2
06  gemma-2-9b-it-simpo              Princeton  1227.2   8k     $0.03   $0.09    315527.8
07  gemma-3n-e4b-it                  Google     1306.2   33k    $0.06   $0.12    300235.3
08  gemma-2-9b-it                    Google     1207.5   8k     $0.03   $0.09    288194.4
09  deepseek-v4-flash                DeepSeek   1429.6   1.0M   $0.09   $0.18    280790.8
10  gemma-3-12b-it                   Google     1334.2   131k   $0.05   $0.15    278516.7
11  gpt-oss-20b                      OpenAI     1287.7   131k   $0.03   $0.14    269672.0
12  gpt-oss-120b                     OpenAI     1365.5   131k   $0.04   $0.18    265410.3
13  mimo-v2.5-pro [quality]          Xiaomi     1461.9   1.0M   $0.43   $0.87    62461.1
14  deepseek-v4-pro                  DeepSeek   1448.3   1.0M   $0.43   $0.87    60619.3
15  deepseek-v4-pro-thinking         DeepSeek   1447.2   1.0M   $0.43   $0.87    60478.7
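The page does not state how the blended price or the Value column is computed. The displayed numbers are consistent, to within what rounding of the shown prices and scores would explain, with a blend of 30% input price and 70% output price, and a value of 100 × (quality − 1000) ÷ blended price. The sketch below assumes exactly that. Both formulas are inferred from the table rather than confirmed, and the `ModelRow` shape and function names are illustrative.

```typescript
// Inferred, not confirmed: weights and baseline were reverse-engineered from the table.
interface ModelRow {
  name: string;
  quality: number; // arena score
  inPerM: number;  // USD per 1M input tokens
  outPerM: number; // USD per 1M output tokens
}

// Assumed blend: 30% input price, 70% output price.
function blendedPrice(m: ModelRow): number {
  return 0.3 * m.inPerM + 0.7 * m.outPerM;
}

// Assumed value score: points above a 1000 baseline per blended dollar, times 100.
function valueScore(m: ModelRow): number {
  return (100 * (m.quality - 1000)) / blendedPrice(m);
}

// Apply the tab's filter (blended price at or below the cap, no quality floor)
// and sort by value, highest first. The input array is not mutated.
function shortlist(models: ModelRow[], maxBlended: number = 1.0): ModelRow[] {
  return models
    .filter((m) => blendedPrice(m) <= maxBlended)
    .sort((a, b) => valueScore(b) - valueScore(a));
}
```

Under these assumptions, `qwen3-235b-a22b-thinking-2507` (quality 1413.6, $0.10 in, $0.10 out) has a blended price of $0.10 and a value of about 413,600, matching its row. Rows with rounded inputs land close to, but not exactly on, the displayed figure.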

What this is, and isn't

Right now this is a filter on arena.ai plus a public log of what we ran. Arena Elo measures pairwise human preference on short prompts. It does not measure whether a model produces valid JSON under a schema, whether it hallucinates function names, whether it refuses queries it should answer, its p99 latency, or its rate-limit behaviour. Production teams need those signals.
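For a sense of what those missing signals look like in practice, here is a rough sketch of three of them as plain checks. Everything in it is illustrative: the `ToolCall` shape and function names are hypothetical, and the required-keys check is a much weaker stand-in for real schema validation.

```typescript
// Illustrative checks only; names and shapes are assumptions.

// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are at or below it. Use p = 99 for p99 latency.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

interface ToolCall {
  name: string;
  arguments: string; // raw JSON text emitted by the model
}

// Names the model called that were never declared to it (hallucinated functions).
function unknownToolNames(calls: ToolCall[], declared: string[]): string[] {
  return calls.map((c) => c.name).filter((n) => declared.indexOf(n) === -1);
}

// Weak structural check: the text parses as a JSON object and has every required key.
function hasRequiredKeys(text: string, requiredKeys: string[]): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(text);
  } catch {
    return false;
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return false;
  }
  const obj = parsed as Record<string, unknown>;
  return requiredKeys.every((k) => Object.prototype.hasOwnProperty.call(obj, k));
}
```

None of these can be derived from a preference score; each needs the model's raw outputs and timings.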

We're building a benchmark runner — fixed prompt suites for RAG, structured extraction, code refactoring, function calling — run daily against every model. Raw inputs, outputs, judge rationale, costs published. When that lands, the "picks" section gets its real backing. Until then, the picks section is opinion with a citation, not measurement.
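As a rough picture of what "run daily against every model" implies, the sketch below enumerates one planned run per model and suite and shows a record shape that could hold the published fields. The runner does not exist yet, so every name here (`Suite`, `RunRecord`, `planDailyRuns`) is hypothetical; only the four suite topics and the list of published fields come from the paragraph above.

```typescript
// Hypothetical shapes; the real runner is not built yet.
type Suite = "rag" | "structured-extraction" | "code-refactoring" | "function-calling";

const SUITES: Suite[] = ["rag", "structured-extraction", "code-refactoring", "function-calling"];

// One published result: raw input, raw output, judge rationale, and cost.
interface RunRecord {
  date: string; // ISO date of the daily run
  model: string;
  suite: Suite;
  promptId: string;
  input: string;
  output: string;
  judgeRationale: string;
  costUsd: number;
}

interface PlannedRun {
  date: string;
  model: string;
  suite: Suite;
}

// Every model runs every suite once per day.
function planDailyRuns(models: string[], date: string): PlannedRun[] {
  const runs: PlannedRun[] = [];
  for (const model of models) {
    for (const suite of SUITES) {
      runs.push({ date, model, suite });
    }
  }
  return runs;
}
```

With the four suites above, the plan grows linearly with the number of models: two models mean eight runs per day.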