llm-rate

In the last 24 hours we dispatched 1,556 tasks across 4 models. Here's what we picked, and why.

What we ran

An autonomous AI fleet, written in TypeScript, picks a model per task using a complexity router. No vibes, no PR team. This is the actual production output of that router:

#	Model	Dispatches	Share	Why this one
01	claude-sonnet-4-6	1,066	68.5%	implementation (standard)
02	claude-haiku-4-5	278	17.9%	implementation (light)
03	gpt-5.4-mini	189	12.1%	implementation (codex pool)
04	claude-opus-4-6	23	1.5%	implementation (high complexity)

Window: the 24 hours ending 2026-05-16T00:00:00Z. Source: daemon routing logs. The router writes one decision per dispatch; we parsed 1,556 of them.
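For readers who want to build the same table from their own logs, here is a minimal sketch of the aggregation step. The `RoutingDecision` shape and the `summarize` helper are hypothetical; the real log format is not shown here. The sample data is reconstructed from the table's counts, not from the raw logs.

```typescript
// Hypothetical shape of one parsed routing decision (not the real log schema).
interface RoutingDecision {
  model: string;
  reason: string;
}

interface ModelShare {
  model: string;
  dispatches: number;
  sharePct: number; // percentage of all dispatches, one decimal place
}

// Count decisions per model, then convert counts to percentage shares.
function summarize(decisions: RoutingDecision[]): ModelShare[] {
  const counts = new Map<string, number>();
  for (const d of decisions) {
    counts.set(d.model, (counts.get(d.model) || 0) + 1);
  }
  const total = decisions.length;
  const shares: ModelShare[] = [];
  counts.forEach((dispatches, model) => {
    const sharePct =
      total === 0 ? 0 : Math.round((dispatches / total) * 1000) / 10;
    shares.push({ model, dispatches, sharePct });
  });
  // Most-dispatched model first, matching the table order.
  return shares.sort((a, b) => b.dispatches - a.dispatches);
}

// Synthetic decisions rebuilt from the counts in the table above.
const tableCounts: Array<[string, string, number]> = [
  ["claude-sonnet-4-6", "implementation (standard)", 1066],
  ["claude-haiku-4-5", "implementation (light)", 278],
  ["gpt-5.4-mini", "implementation (codex pool)", 189],
  ["claude-opus-4-6", "implementation (high complexity)", 23],
];

const decisions: RoutingDecision[] = [];
for (const [model, reason, n] of tableCounts) {
  for (let i = 0; i < n; i++) {
    decisions.push({ model, reason });
  }
}

const summary = summarize(decisions);
for (const row of summary) {
  console.log(`${row.model}\t${row.dispatches}\t${row.sharePct.toFixed(1)}%`);
}
```

Rounding each share independently means the column can sum to slightly more or less than 100% on other data; here the four shares add up to exactly 100.0%.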

If you don't have a router, here are the picks per common task

Filtered from arena.ai's leaderboard plus published API prices. The filter thresholds are listed with each task, and they are open to argument. Treat this as a starting shortlist, not a verdict.

Marketing copy, summaries, first drafts. Per-token cost dominates.

Best value: llama-3.1-8b-instruct (Meta · quality 1186.7 · $0.03/M blended)

Best quality: mimo-v2.5-pro (Xiaomi · quality 1461.9 · $0.74/M blended)

Filter: blended price ≤ $1/M, no quality floor, so the cheap end can shine. Sorted by value, descending. 131 models passed the filter; 15 rows are listed here.

#	Model	Vendor	Quality	Ctx	In /1M	Out /1M	Value ↓
01	llama-3.1-8b-instruct (best value)	Meta	1186.7	131k	$0.02	$0.03	691444.4
02	qwen3-235b-a22b-thinking-2507	Alibaba	1413.6	262k	$0.10	$0.10	413600.0
03	granite-4.1-8b	IBM	1292.5	131k	$0.05	$0.10	344070.6
04	gemma-3-4b-it	Google	1290.8	131k	$0.05	$0.10	342129.4
05	mistral-small-24b-instruct-2501	Mistral	1233.6	33k	$0.05	$0.08	329028.2
06	gemma-2-9b-it-simpo	Princeton	1227.2	8k	$0.03	$0.09	315527.8
07	gemma-3n-e4b-it	Google	1306.2	33k	$0.06	$0.12	300235.3
08	gemma-2-9b-it	Google	1207.5	8k	$0.03	$0.09	288194.4
09	deepseek-v4-flash	DeepSeek	1429.6	1.0M	$0.09	$0.18	280790.8
10	gemma-3-12b-it	Google	1334.2	131k	$0.05	$0.15	278516.7
11	gpt-oss-20b	OpenAI	1287.7	131k	$0.03	$0.14	269672.0
12	gpt-oss-120b	OpenAI	1365.5	131k	$0.04	$0.18	265410.3
13	mimo-v2.5-pro (best quality)	Xiaomi	1461.9	1.0M	$0.43	$0.87	62461.1
14	deepseek-v4-pro	DeepSeek	1448.3	1.0M	$0.43	$0.87	60619.3
15	deepseek-v4-pro-thinking	DeepSeek	1447.2	1.0M	$0.43	$0.87	60478.7
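The page does not say how the blended price or the Value column is computed. The listed rows are consistent, to within rounding of the displayed quality and prices, with a blended price of 0.3 × input + 0.7 × output and a value of 100 × (quality − 1000) / blended. The sketch below applies that reading to three rows from the table. The weights, the 1000 baseline, and every identifier are assumptions inferred from the numbers, not a documented formula.

```typescript
interface ModelRow {
  model: string;
  quality: number; // arena score as shown in the table
  inPerM: number;  // USD per 1M input tokens
  outPerM: number; // USD per 1M output tokens
}

// Assumed constants, inferred from the table rather than documented.
const INPUT_WEIGHT = 0.3;
const OUTPUT_WEIGHT = 0.7;
const QUALITY_BASELINE = 1000;

// Weighted average of input and output price per 1M tokens.
function blendedPrice(r: ModelRow): number {
  return INPUT_WEIGHT * r.inPerM + OUTPUT_WEIGHT * r.outPerM;
}

// Quality points above the baseline per blended dollar, scaled by 100.
function valueScore(r: ModelRow): number {
  return (100 * (r.quality - QUALITY_BASELINE)) / blendedPrice(r);
}

// Keep models at or under the price cap, best value first.
function shortlist(rows: ModelRow[], maxBlended: number): ModelRow[] {
  return rows
    .filter((r) => blendedPrice(r) <= maxBlended)
    .sort((a, b) => valueScore(b) - valueScore(a));
}

// Three rows copied from the table, deliberately out of order.
const rows: ModelRow[] = [
  { model: "deepseek-v4-flash", quality: 1429.6, inPerM: 0.09, outPerM: 0.18 },
  { model: "llama-3.1-8b-instruct", quality: 1186.7, inPerM: 0.02, outPerM: 0.03 },
  { model: "qwen3-235b-a22b-thinking-2507", quality: 1413.6, inPerM: 0.10, outPerM: 0.10 },
];

const picks = shortlist(rows, 1.0);
for (const r of picks) {
  console.log(`${r.model}\t${blendedPrice(r).toFixed(3)}\t${valueScore(r).toFixed(1)}`);
}
```

Because the table shows rounded inputs, recomputed values can differ from the Value column in the last few digits. Under this reading, a model with a very low price and a modest score above 1000 can outrank a much stronger model, which is what the first row shows.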

What this is, and isn't

Right now this is a filter on arena.ai plus a public log of what we ran. Arena Elo measures pairwise human preference on short prompts. It does not measure whether a model produces valid JSON under a schema, whether it hallucinates function names, whether it refuses queries it should answer, p99 latency, or rate-limit behaviour. Production teams need those signals.
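To put the quality numbers in perspective, a rating gap can be converted into an expected head-to-head preference rate. The sketch below assumes the conventional Elo logistic curve with a 400-point scale; the leaderboard's exact fitting method may differ, and `expectedPreference` is a hypothetical helper name.

```typescript
// Expected probability that raters prefer model A over model B,
// assuming the conventional Elo logistic with a 400-point scale.
function expectedPreference(ratingA: number, ratingB: number): number {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

// Best-quality pick vs best-value pick from the shortlist above.
const p = expectedPreference(1461.9, 1186.7);
console.log(p.toFixed(2)); // roughly 0.83
```

Under that assumption, the 275-point gap between the two picks means the stronger model's answer would be preferred in about five of every six pairwise comparisons. It says nothing about the signals listed above.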

We're building a benchmark runner: fixed prompt suites for RAG, structured extraction, code refactoring, and function calling, run daily against every model. Raw inputs, outputs, judge rationale, and costs will be published. When that lands, the "picks" section gets its real backing. Until then, the picks section is opinion with a citation, not measurement.