llm-rate

In the last 24 hours we dispatched 1,556 tasks across 4 models. Here's what we picked, and why.

What we ran

An autonomous AI fleet, written in TypeScript, picks a model per task using a complexity router. No vibes, no PR team. This is the actual production output of that router:

#	Model	Dispatches	Share	Why this one
01	claude-sonnet-4-6	1,066	68.5%	implementation (standard)
02	claude-haiku-4-5	278	17.9%	implementation (light)
03	gpt-5.4-mini	189	12.1%	implementation (codex pool)
04	claude-opus-4-6	23	1.5%	implementation (high complexity)
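
The tiers in the "Why this one" column suggest a router shaped roughly like the sketch below. The router itself is not published here, so this is an illustration under assumptions: the `Task` shape, the 0 to 1 `complexity` score, the `pool` flag, and both thresholds are hypothetical. Only the model names and tier labels come from the table.

```typescript
// Hypothetical sketch of a complexity router. Only the model names and
// tier labels come from the table; the Task shape, the complexity score,
// and the thresholds are invented for illustration.
type Task = {
  id: string;
  complexity: number; // assumed score in [0, 1]
  pool?: "codex";     // assumed flag for tasks sent to the codex pool
};

type Decision = { taskId: string; model: string; reason: string };

function route(task: Task): Decision {
  if (task.pool === "codex") {
    return { taskId: task.id, model: "gpt-5.4-mini", reason: "implementation (codex pool)" };
  }
  if (task.complexity >= 0.9) {
    return { taskId: task.id, model: "claude-opus-4-6", reason: "implementation (high complexity)" };
  }
  if (task.complexity < 0.3) {
    return { taskId: task.id, model: "claude-haiku-4-5", reason: "implementation (light)" };
  }
  // Everything in between lands on the default tier.
  return { taskId: task.id, model: "claude-sonnet-4-6", reason: "implementation (standard)" };
}

console.log(route({ id: "task-1", complexity: 0.5 }));
```

Returning the reason alongside the model is what makes a per-dispatch log like ours possible: each decision explains itself.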

Window: the 24 hours ending 2026-05-16T00:00:00Z. Source: daemon routing logs. The router writes one decision per dispatch; we parsed 1,556 of them.
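
The Dispatches and Share columns are a straight tally of those decisions. A minimal sketch of the arithmetic, assuming each parsed decision has already been reduced to a model name (the real log format is not shown here, and `tally` is a hypothetical helper):

```typescript
// Tally routing decisions into dispatch counts and percentage shares.
// Input is one model name per decision, an assumed simplification of the log.
type ShareRow = { model: string; dispatches: number; share: string };

function tally(models: string[]): ShareRow[] {
  const counts: Record<string, number> = {};
  for (const m of models) counts[m] = (counts[m] || 0) + 1;
  return Object.keys(counts)
    .map((model) => ({
      model,
      dispatches: counts[model],
      share: ((counts[model] / models.length) * 100).toFixed(1) + "%",
    }))
    .sort((a, b) => b.dispatches - a.dispatches);
}

// Rebuild a synthetic log from the published counts to check the shares.
const publishedCounts: Array<[string, number]> = [
  ["claude-sonnet-4-6", 1066],
  ["claude-haiku-4-5", 278],
  ["gpt-5.4-mini", 189],
  ["claude-opus-4-6", 23],
];
const syntheticLog: string[] = [];
for (const [model, n] of publishedCounts) {
  for (let i = 0; i < n; i++) syntheticLog.push(model);
}

const shareRows = tally(syntheticLog);
for (const row of shareRows) console.log(row.model, row.dispatches, row.share);
```

Run against the published counts, this reproduces the shares in the table (68.5%, 17.9%, 12.1%, 1.5%), and the counts sum to 1,556.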

If you don't have a router, here are the picks per common task

Filtered from arena.ai's leaderboard plus published API prices. The filter thresholds are listed under each tab and are open to argument. Treat this as a starting shortlist, not a verdict.

Complex multi-step problems. Where to send the gnarly stuff.

Best value

qwen3-235b-a22b-thinking-2507

Alibaba · quality 1412.2 · $0.10/M blended

Best quality

gemini-3.5-flash

Google · quality 1519.7 · $6.75/M blended

Filter: Math leaderboard, quality ≥ 1400, sorted by quality. As a routing rule, use the cheaper end of this list when you're sure the task is hard. 113 models survived the filter; the top 12 are shown.

#	Model	Vendor	Quality	Ctx	In /1M	Out /1M	Score ↓
01	gemini-3.5-flash	Google	1519.7	1.0M	$1.50	$9.00	1519.7
02	claude-opus-4-6-thinking	Anthropic	1514.6	1.0M	$5.00	$25.00	1514.6
03	claude-fable-5	Anthropic	1513.0	1.0M	$10.00	$50.00	1513.0
04	claude-opus-4-6	Anthropic	1508.1	1.0M	$5.00	$25.00	1508.1
05	claude-opus-4-7-thinking	Anthropic	1497.7	1.0M	$5.00	$25.00	1497.7
06	gpt-5.4-high	OpenAI	1497.2	1.1M	$2.50	$15.00	1497.2
07	qwen3.7-max-preview	Alibaba	1495.3	1.0M	$1.25	$3.75	1495.3
08	claude-opus-4-7	Anthropic	1493.5	1.0M	$5.00	$25.00	1493.5
09	gemini-3.1-pro-preview	Google	1491.4	1.0M	$2.00	$12.00	1491.4
10	claude-opus-4-8-thinking	Anthropic	1488.5	1.0M	$5.00	$25.00	1488.5
11	claude-opus-4-8	Anthropic	1487.6	1.0M	$5.00	$25.00	1487.6
12	mimo-v2.5-pro	Xiaomi	1484.2	1.0M	$0.43	$0.87	1484.2
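
The filter behind this table is simple enough to sketch. The block below keeps models at or above the quality threshold and sorts by quality, as the filter line states. The blended price is a guess: the page does not define "blended", but a 30/70 input/output weighting happens to reproduce the $6.75 figure quoted for gemini-3.5-flash, so the sketch uses that as a labeled assumption. `ModelRow`, `blendedPrice`, and `shortlist` are hypothetical names.

```typescript
// Sketch of the stated filter: keep quality >= 1400, sort by quality descending.
// The 30/70 input/output weighting is an assumption, not a documented formula.
type ModelRow = {
  name: string;
  vendor: string;
  quality: number;
  inPerM: number;  // USD per 1M input tokens
  outPerM: number; // USD per 1M output tokens
};

const MIN_QUALITY = 1400;

function blendedPrice(m: ModelRow, inputWeight: number = 0.3): number {
  return inputWeight * m.inPerM + (1 - inputWeight) * m.outPerM;
}

function shortlist(models: ModelRow[]): ModelRow[] {
  return models
    .filter((m) => m.quality >= MIN_QUALITY)
    .sort((a, b) => b.quality - a.quality);
}

// Three rows copied from the table, plus one made-up row that fails the filter.
const sampleRows: ModelRow[] = [
  { name: "mimo-v2.5-pro", vendor: "Xiaomi", quality: 1484.2, inPerM: 0.43, outPerM: 0.87 },
  { name: "gemini-3.5-flash", vendor: "Google", quality: 1519.7, inPerM: 1.5, outPerM: 9.0 },
  { name: "gpt-5.4-high", vendor: "OpenAI", quality: 1497.2, inPerM: 2.5, outPerM: 15.0 },
  { name: "hypothetical-small-model", vendor: "Example", quality: 1350.0, inPerM: 0.05, outPerM: 0.1 },
];

const survivors = shortlist(sampleRows);
const bestQuality = survivors[0];
console.log(bestQuality.name, "$" + blendedPrice(bestQuality).toFixed(2) + "/M blended");
```

If your workload is input-heavy (long RAG contexts, short answers), pass a larger `inputWeight`; the ranking by cost can change noticeably.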

What this is, and isn't

Right now this is a filter over arena.ai plus a public log of what we ran. Arena Elo measures pairwise human preference on short prompts. It does not measure whether a model produces valid JSON under a schema, whether it hallucinates function names, whether it refuses queries it should answer, p99 latency, or rate-limit behaviour. Production teams need those signals.
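
To make "pairwise human preference" concrete: under the standard Elo model, a rating gap maps to an expected head-to-head preference rate. The sketch below assumes the leaderboard's quality score sits on the conventional Elo scale (base 10, divisor 400). This page does not confirm that, so treat the result as a rough guide only.

```typescript
// Standard Elo expected score: probability that A is preferred over B.
// Assumes the quality column uses the conventional base-10, 400-point scale.
function expectedWinRate(ratingA: number, ratingB: number): number {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

// Best-quality pick (1519.7) versus best-value pick (1412.2) from this page.
const preference = expectedWinRate(1519.7, 1412.2);
console.log(preference.toFixed(2));
```

Under that assumption, the 107.5-point gap between the two picks works out to roughly a 65% preference rate for the higher-rated model. That is a statement about which answer people liked more, and nothing else.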

We're building a benchmark runner — fixed prompt suites for RAG, structured extraction, code refactoring, function calling — run daily against every model. Raw inputs, outputs, judge rationale, costs published. When that lands, the "picks" section gets its real backing. Until then, the picks section is opinion with a citation, not measurement.
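
As an illustration of the kind of check a function-calling suite could apply, the sketch below scores one raw model output for two of the signals Arena Elo misses: whether the output parses as a JSON object, and whether the function name is one that was actually offered. The `ToolCallResult` record, the `checkToolCall` name, and the tool list are all hypothetical; this is not the runner itself.

```typescript
// Hypothetical per-output check for a function-calling suite.
type ToolCallResult = {
  validJson: boolean;     // output parsed as a JSON object
  knownFunction: boolean; // "name" is one of the offered tools
  hasArguments: boolean;  // "arguments" is a JSON object
};

function checkToolCall(rawOutput: string, offeredTools: string[]): ToolCallResult {
  const failed: ToolCallResult = { validJson: false, knownFunction: false, hasArguments: false };
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawOutput);
  } catch {
    return failed; // not JSON at all
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return failed; // JSON, but not an object
  }
  const call = parsed as { name?: unknown; arguments?: unknown };
  const args = call.arguments;
  return {
    validJson: true,
    knownFunction: typeof call.name === "string" && offeredTools.indexOf(call.name) !== -1,
    hasArguments: typeof args === "object" && args !== null && !Array.isArray(args),
  };
}

// Example tool names are made up for the demo.
const offered = ["get_weather", "search_docs"];
const goodCall = checkToolCall('{"name":"get_weather","arguments":{"city":"Oslo"}}', offered);
const hallucinatedCall = checkToolCall('{"name":"fetch_weather","arguments":{}}', offered);
const notJson = checkToolCall("Sure! Here is the call you asked for.", offered);
console.log(goodCall, hallucinatedCall, notJson);
```

Keeping the three flags separate, rather than collapsing them into one pass/fail, lets a published result show why a model failed a prompt and not just that it did.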