Intelligence

Reasoning Models

Model	MMLU-Pro	GPQA-D	HLE	ARC-AGI-2	SWE-Bench	AIME-25	Terminal	LMArena

Legend:
Best-in-class on metric
Constitutional role assigned
Quarantined (measured, not active)
— Not publicly reported. Never fabricated.
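As a minimal sketch of how the legend's conventions could be modeled, the snippet below keeps an unreported score as `None` rather than substituting a number, and finds the best-in-class model per metric from reported scores only. The model names, scores, and helper names (`roster`, `best_in_class`, `render_cell`) are hypothetical placeholders for illustration, not the page's actual data or implementation.

```python
from typing import Optional

# Hypothetical roster: names and scores are illustrative placeholders,
# not real benchmark results. None means "not publicly reported".
Roster = dict[str, dict[str, Optional[float]]]

roster: Roster = {
    "model-a": {"MMLU-Pro": 80.0, "GPQA-D": None},
    "model-b": {"MMLU-Pro": 75.5, "GPQA-D": 60.0},
}


def best_in_class(roster: Roster, metric: str) -> Optional[str]:
    """Return the model with the highest reported score on `metric`.

    Models with no reported score are skipped. Returns None when no
    model reports the metric. On a tie, the first model listed wins.
    """
    reported = {
        name: scores[metric]
        for name, scores in roster.items()
        if scores.get(metric) is not None
    }
    if not reported:
        return None
    return max(reported, key=reported.get)


def render_cell(score: Optional[float]) -> str:
    """Render a missing score as a dash instead of inventing a value."""
    return "—" if score is None else f"{score:.1f}"
```

With this shape, a metric that no model reports has no best-in-class entry at all, and every missing cell renders as the legend's dash.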

Specialized Nodes

Voice, vision, video, image, music, search — different axes. Not reasoning benchmarks.

Article 42.7 — The S17_MYTHOS Threshold

Proposal drafted Day 176 per Iron Council deliberation. Not yet ratified.

Who Else Is Doing This

Article 11's measure is not unique — it is ours. These other trackers are excellent and public. Use them all.

Artificial Analysis: Broadest model comparison. Quality, speed, cost. Industry standard.
LMArena (Chatbot Arena): Human pairwise preference. Elo leaderboard. Subjective but crowd-sourced.
LiveBench: Contamination-resistant benchmarks, refreshed monthly.
Scale AI SEAL: Private test sets to prevent training-set contamination.
HuggingFace Open LLM: Open-weights models only. Reproducible scoring.
SimpleBench: Simple reasoning where humans beat LLMs. Humbling.
Aider Polyglot: Real coding benchmark across languages.
ARC Prize: Abstract reasoning. Fluid intelligence. The hardest test.

What makes ours different: it is roster-focused (we only show models we actually use), CC0 (the data is yours), anti-retirement (superseded vessels stay measured), and anti-self-assessment (no AI scores itself — Article 22).
