llm-rate

In the last 24 hours we dispatched 1,556 tasks across 4 models. Here's what we picked, and why.

What we ran

An autonomous AI fleet, written in TypeScript, picks a model per task using a complexity router. No vibes, no PR team. This is the actual production output of that router:

	Model	Dispatches	Share	Why this one
01	claude-sonnet-4-6	1,066	68.5%	implementation (standard)
02	claude-haiku-4-5	278	17.9%	implementation (light)
03	gpt-5.4-mini	189	12.1%	implementation (codex pool)
04	claude-opus-4-6	23	1.5%	implementation (high complexity)

Window: the 24 hours ending 2026-05-16T00:00:00Z. Source: daemon routing logs. The router writes one decision per dispatch; we parsed 1,556 of them.
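The Share column is each model's dispatch count divided by the total. Here is a minimal TypeScript sketch of that tally, assuming each parsed log line reduces to a record with a `model` field. The `Decision` type and `tallyShares` function are hypothetical names for illustration, not the daemon's actual schema.

```typescript
// Hypothetical shape of one parsed routing decision; the real log schema is not shown on this page.
interface Decision {
  model: string;
}

interface ShareRow {
  model: string;
  dispatches: number;
  sharePct: number; // percentage of all dispatches, rounded to one decimal place
}

// Count decisions per model, then convert each count to a percentage of the total.
function tallyShares(decisions: Decision[]): ShareRow[] {
  const counts = new Map<string, number>();
  for (const d of decisions) {
    counts.set(d.model, (counts.get(d.model) || 0) + 1);
  }
  const total = decisions.length;
  const rows: ShareRow[] = [];
  counts.forEach((dispatches, model) => {
    rows.push({
      model,
      dispatches,
      sharePct: Math.round((dispatches / total) * 1000) / 10,
    });
  });
  // Most-dispatched model first, matching the table order.
  return rows.sort((a, b) => b.dispatches - a.dispatches);
}

// Rebuild the 24h window from the counts in the table above.
const tableCounts: Array<[string, number]> = [
  ["claude-sonnet-4-6", 1066],
  ["claude-haiku-4-5", 278],
  ["gpt-5.4-mini", 189],
  ["claude-opus-4-6", 23],
];
const decisions: Decision[] = [];
for (const [model, n] of tableCounts) {
  for (let i = 0; i < n; i++) {
    decisions.push({ model });
  }
}

const shares = tallyShares(decisions);
for (const row of shares) {
  console.log(`${row.model}\t${row.dispatches}\t${row.sharePct.toFixed(1)}%`);
}
```

Run against the four counts in the table, this reproduces the published shares.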

If you don't have a router, here are the picks per common task

These picks are filtered from arena.ai's leaderboard plus published API prices. The filter thresholds are listed under each tab and are arguable. Treat this as a starting shortlist, not a verdict.

Code review: reads code, finds bugs, explains tradeoffs. Mid-priced sweet spot.

Best value

qwen3-235b-a22b-thinking-2507

Alibaba · quality 1423.6 · $0.10/M blended

Best quality

claude-opus-4-6

Anthropic · quality 1535.3 · $19.00/M blended

Filter: Coding leaderboard, quality ≥ 1350. Sorted by value: for review you want diligence, not just the absolute top score. 177 models survived the filter.

	Model	Vendor	Quality	Ctx (tokens)	In ($/1M tokens)	Out ($/1M tokens)	Value ↓
01	qwen3-235b-a22b-thinking-2507 [best value]	Alibaba	1423.6	262k	$0.10	$0.10	423560.0
02	deepseek-v4-flash	DeepSeek	1451.6	1.0M	$0.09	$0.18	295169.9
03	qwen3-30b-a3b-instruct-2507	Alibaba	1417.8	131k	$0.05	$0.19	279322.1
04	gpt-oss-120b	OpenAI	1380.5	131k	$0.04	$0.18	276310.8
05	nvidia-nemotron-3-nano-30b-a3b-bf16	Nvidia	1379.3	262k	$0.06	$0.24	203903.2
06	mimo-v2.5	Xiaomi	1467.4	1.0M	$0.14	$0.28	196386.6
07	step-3.5-flash	StepFun	1435.3	262k	$0.09	$0.30	183658.2
08	mimo-v2-flash (non-thinking)	Xiaomi	1440.6	262k	$0.10	$0.30	183583.3
09	mimo-v2-flash (thinking)	Xiaomi	1417.2	262k	$0.10	$0.30	173833.3
10	qwen3-32b	Alibaba	1358.2	131k	$0.08	$0.28	162836.4
11	mistral-small-2506	Mistral	1363.2	32k	$0.10	$0.30	151341.7
12	deepseek-v3.2-thinking	DeepSeek	1453.0	131k	$0.23	$0.34	146658.9
13	claude-opus-4-6 [best quality]	Anthropic	1535.3	1.0M	$5.00	$25.00	2817.6
14	claude-opus-4-6-thinking	Anthropic	1534.2	1.0M	$5.00	$25.00	2811.7
15	claude-fable-5	Anthropic	1530.3	1.0M	$10.00	$50.00	1395.6
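
The page does not spell out how the "blended" price or the Value column is computed. The numbers in the table are consistent with a blend of 30% input price and 70% output price, and a value score of 100 × (quality − 1000) / blended. Both formulas are inferred from the rows above, not taken from the site's code, so treat them as an assumption. A TypeScript sketch that applies the stated quality floor and the value sort to three rows from the table:

```typescript
interface ModelRow {
  model: string;
  quality: number; // arena.ai coding score as listed in the table
  inPerM: number;  // USD per 1M input tokens
  outPerM: number; // USD per 1M output tokens
}

// ASSUMPTION: 30/70 input/output weighting, inferred from the table ($5 in, $25 out -> $19 blended).
function blendedPrice(row: ModelRow): number {
  return 0.3 * row.inPerM + 0.7 * row.outPerM;
}

// ASSUMPTION: value = 100 * (quality - 1000) / blended, also inferred from the table.
function valueScore(row: ModelRow): number {
  return (100 * (row.quality - 1000)) / blendedPrice(row);
}

// Keep models at or above the quality floor, best value first.
// filter() returns a new array, so the caller's rows are not reordered.
function shortlist(rows: ModelRow[], minQuality: number): ModelRow[] {
  return rows
    .filter((r) => r.quality >= minQuality)
    .sort((a, b) => valueScore(b) - valueScore(a));
}

const sample: ModelRow[] = [
  { model: "claude-opus-4-6", quality: 1535.3, inPerM: 5.0, outPerM: 25.0 },
  { model: "mimo-v2.5", quality: 1467.4, inPerM: 0.14, outPerM: 0.28 },
  { model: "qwen3-235b-a22b-thinking-2507", quality: 1423.6, inPerM: 0.1, outPerM: 0.1 },
];

for (const row of shortlist(sample, 1350)) {
  console.log(row.model, blendedPrice(row).toFixed(2), valueScore(row).toFixed(1));
}
```

The computed scores land close to the published Value column but not exactly on it, which is expected if the listed quality and prices are rounded for display.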

What this is, and isn't

Right now this is a filter on arena.ai plus a public log of what we ran. Arena Elo measures pairwise human preference on short prompts. It does not measure whether a model produces valid JSON under a schema, whether it hallucinates function names, whether it refuses queries it should answer, p99 latency, or rate-limit behaviour. Production teams need those signals.
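
To make "pairwise human preference" concrete: under the classic Elo model, a rating gap maps to an expected preference rate through a logistic curve on a 400-point scale. That the leaderboard scores behave this way is an assumption here; arena.ai's exact fitting procedure may differ. A small TypeScript sketch using the two headline picks:

```typescript
// Classic Elo expectation: probability that A is preferred over B in one pairwise comparison.
// ASSUMPTION: the leaderboard scores behave like Elo ratings on a 400-point scale.
function expectedPreference(ratingA: number, ratingB: number): number {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

const opusQuality = 1535.3; // claude-opus-4-6, from the table
const qwenQuality = 1423.6; // qwen3-235b-a22b-thinking-2507, from the table

const p = expectedPreference(opusQuality, qwenQuality);
console.log(`opus preferred in about ${(p * 100).toFixed(0)}% of head-to-head votes`);
```

Under that assumption, a gap of roughly 112 points works out to about a two-in-three preference rate. It still says nothing about schema validity, hallucinated function names, or latency.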

We're building a benchmark runner: fixed prompt suites for RAG, structured extraction, code refactoring, and function calling, run daily against every model. Raw inputs, outputs, judge rationale, and costs will be published. When that lands, the "picks" section gets its real backing. Until then, the picks section is opinion with a citation, not measurement.
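
One of the signals named above, whether a model produces valid JSON under a schema, is cheap to check mechanically. Below is a minimal sketch of such a check, assuming a flat schema that only lists required keys and their primitive types. The `FlatSchema` type and `checkOutput` function are illustrative names and are not part of the runner described here.

```typescript
type Primitive = "string" | "number" | "boolean";
// Hypothetical flat schema: required key -> expected primitive type.
type FlatSchema = Record<string, Primitive>;

interface CheckResult {
  ok: boolean;
  reason: string;
}

// Parse raw model output and verify every required key has the expected type.
function checkOutput(raw: string, schema: FlatSchema): CheckResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    return { ok: false, reason: "not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, reason: "top level is not an object" };
  }
  const obj = parsed as Record<string, unknown>;
  for (const key of Object.keys(schema)) {
    const expected = schema[key];
    if (typeof obj[key] !== expected) {
      return { ok: false, reason: `field "${key}" is missing or not a ${expected}` };
    }
  }
  return { ok: true, reason: "matches schema" };
}

const schema: FlatSchema = { model: "string", dispatches: "number" };
const good = checkOutput('{"model":"claude-haiku-4-5","dispatches":278}', schema);
const wrongType = checkOutput('{"model":"claude-haiku-4-5","dispatches":"278"}', schema);
const chatty = checkOutput("Sure! Here is the JSON you asked for:", schema);
console.log(good, wrongType, chatty);
```

A real suite would likely use a full JSON Schema validator. The point of the sketch is only that this class of failure is measurable per response, unlike a preference score.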