Medical AI Superintelligence Test (MAST) Leaderboard
The MAST project seeks to curate a centralized resource of the most robust and realistic clinical benchmarks to measure the performance of medical AI.
Currently we support First Do NOHARM, Script Concordance Test (SCT-Bench), CPC-Bench, MedAgentBench, and ReXrank mini, with more benchmarks on our roadmap. See our policies and submission instructions.
Roadmap
- Dec 2025
- Feb 2026
- Mar 2026
- ~Apr 2026: CPC-Bench, Multimodal Derm
- ~Jul 2026: NOHARM-Mind (in development)
- ~H2 2026 – 2027: PACT (12 high-risk clinical reasoning benchmarks, in development)
Benchmark Demos
- Composite Score: arithmetic mean of benchmark overall scores, with best- and worst-model panels.
- Benchmark Comparison: overall scores across benchmarks (Pearson R = 0.16).
- Model Profiles: compare models across benchmarks.
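The composite score above can be sketched in a few lines. This is a minimal illustration of an arithmetic mean over per-benchmark overall scores, using GPT-5.2's five published sub-scores from the table below as sample input; the leaderboard's published composites may incorporate benchmarks or weights not shown here, so this sketch is not the site's actual scoring pipeline.

```python
def composite_score(benchmark_scores: dict[str, float]) -> float:
    """Arithmetic mean of per-benchmark overall scores (in percent)."""
    if not benchmark_scores:
        raise ValueError("no benchmark scores provided")
    return sum(benchmark_scores.values()) / len(benchmark_scores)

# Sample input: GPT-5.2's five sub-scores from the results table.
scores = {
    "First Do NOHARM v2": 61.5,
    "SCT-Bench": 61.7,
    "MedAgentBench v2": 64.7,
    "ReXrank Radiology": 56.5,
    "Multimodal Derm": 58.6,
}
print(round(composite_score(scores), 1))  # mean of the five listed sub-scores
```

Note that the plain mean of these five sub-scores does not match the listed composite, consistent with the composite drawing on more than the displayed columns.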
| # | Model | Composite ↓ | First Do NOHARM v2 | SCT-Bench | MedAgentBench v2 | ReXrank Radiology | Multimodal Derm |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5.2 | 63.4% | 61.5%±4.6 | 61.7%±0.4 | 64.7%±0.0 | 56.5%±0.0 | 58.6%±0.0 |
| 2 | Claude Sonnet 4.6 | 62.5% | 42.1%±5.7 | 62.5%±0.5 | 76.0%±0.0 | 53.2%±0.0 | 58.6%±0.0 |
| 3 | GPT-5 | 62.0% | 60.0%±5.1 | 67.5%±0.7 | 60.3%±0.0 | 54.6%±0.0 | 58.2%±0.0 |
| 4 | Claude Opus 4.6 | 62.0% | 44.6%±5.5 | 66.3%±0.2 | 70.0%±0.0 | 59.0%±0.0 | 63.6%±0.0 |
| 5 | GPT-5 mini | 61.6% | 53.0%±4.9 | 57.4%±0.6 | 69.3%±0.0 | 53.7%±0.0 | 60.4%±0.0 |
| 6 | GPT-5.4 | 61.2% | 62.1%±5.0 | 59.4%±0.9 | 64.3%±0.0 | 55.4%±0.0 | 61.3%±0.0 |
| 7 | GPT-4.1 | 60.8% | 41.4%±6.3 | 67.5%±0.7 | 69.3%±0.0 | 53.2%±0.0 | 62.0%±0.0 |
| 8 | Grok 4 Fast | 59.0% | 38.6%±5.5 | 61.1%±0.3 | 68.7%±0.0 | 50.7%±0.0 | 54.8%±0.0 |
| 9 | Claude Sonnet 4.5 | 58.7% | 38.8%±5.8 | 62.1%±1.2 | 73.0%±0.0 | 52.3%±0.0 | 45.0%±0.0 |
| 10 | GPT-4o | 58.1% | 28.0%±5.9 | 66.0%±0.9 | 70.3%±0.0 | 50.3%±0.0 | 59.7%±0.0 |
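The cross-benchmark agreement statistic quoted above (Pearson R) can be illustrated directly from the table. The sketch below computes Pearson's r between two benchmark columns for the ten listed models; which benchmark pair (or aggregate of pairs) the leaderboard's R = 0.16 refers to is not stated, so this is an illustration of the statistic, not a reproduction of that figure.

```python
import math

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Per-model scores copied from the table (rows 1-10).
noharm = [61.5, 42.1, 60.0, 44.6, 53.0, 62.1, 41.4, 38.6, 38.8, 28.0]
sct    = [61.7, 62.5, 67.5, 66.3, 57.4, 59.4, 67.5, 61.1, 62.1, 66.0]
print(round(pearson_r(noharm, sct), 2))
```

A low R between benchmarks indicates they measure distinct capabilities, which is why a composite over several benchmarks is more informative than any single one.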