Medical AI Superintelligence Test (MAST) Leaderboard
The MAST project seeks to curate a centralized resource of the most robust and realistic clinical benchmarks to measure the performance of medical AI.
See our policies and submission instructions.
Dec 2025
Feb 2026
Mar 2026
Apr 2026
CPC-Bench
Multimodal Derm
~Jul 2026 (In Development): NOHARM-Mind
~H2 2026 – 2027 (In Development): PACT: 12 high-risk clinical reasoning benchmarks
Benchmark Demos
Alpha Preview: MAST is currently in alpha preview. Exact scores on this benchmark may change as we undergo final validation and tuning.
Composite Score
Arithmetic Mean of benchmark overall scores
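As a minimal sketch of what an unweighted arithmetic mean of benchmark scores looks like, the Python snippet below averages the six per-benchmark values listed for GPT-5.5 in the leaderboard table. The function name `composite_score` and the choice of which benchmarks to include are assumptions made for illustration, not the leaderboard's actual pipeline.

```python
from statistics import fmean


def composite_score(benchmark_scores: dict[str, float]) -> float:
    """Unweighted arithmetic mean of per-benchmark overall scores (in percent).

    Hypothetical helper: the benchmark set and any weighting used by the
    leaderboard itself are not specified on the page.
    """
    if not benchmark_scores:
        raise ValueError("at least one benchmark score is required")
    return fmean(benchmark_scores.values())


# Per-benchmark values for GPT-5.5, copied from the leaderboard table.
gpt_5_5 = {
    "NOHARM v2": 70.7,
    "SCT": 65.4,
    "CPC-Bench": 67.7,
    "MedAgent": 57.7,
    "ReXrank": 65.5,
    "Photos": 42.9,
}

print(f"{composite_score(gpt_5_5):.2f}%")
```

Averaging these six displayed columns gives about 61.65%, not the listed composite of 65.0%, so the published composite presumably draws on a benchmark set or score variants that differ from the columns shown.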
Benchmark Comparison
Overall scores across benchmarks
Pearson R = 0.32
Model Profiles
Compare models across benchmarks
Performance Over Time
Best score per base model on release date, connected by family
First Do NOHARM v2 · best score per base model across reasoning-effort variants
| # | Model | Composite | Reasoning | Safety | Agentic | Multimodal | NOHARM v2 | SCT | CPC-Bench | MedAgent | ReXrank | Photos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.5 | 65.0% | 68.3%±3.1 | 65.5%±3.6 | 57.7%±5.7 | 51.8%±3.4 | 70.7%±2.9 | 65.4%±2.8 | 67.7%±5.7 | 57.7%±0.0 | 65.5%±0.0 | 42.9%±2.6 |
| 2 | GPT-5.4 | 63.4% | 64.6%±3.4 | 64.6%±3.6 | 64.8%±5.3 | 52.0%±1.6 | 70.0%±3.1 | 59.0%±3.0 | 63.4%±6.0 | 64.8%±0.0 | 55.4%±0.0 | 46.5%±2.5 |
| 3 | GPT-5 | 63.4% | 64.7%±3.6 | 65.2%±3.2 | 67.9%±10.4 | 51.0%±1.7 | 69.7%±2.9 | 67.1%±2.9 | 61.0%±6.3 | 67.9%±0.0 | 54.6%±0.0 | 44.2%±2.5 |
| 4 | GPT-5.2 | 63.3% | 64.7%±3.3 | 62.1%±3.8 | 69.1%±5.3 | 52.9%±1.7 | 68.2%±3.2 | 61.2%±2.9 | 63.7%±6.1 | 69.1%±0.0 | 56.5%±0.0 | 46.5%±2.6 |
| 5 | Claude Opus 4.7 | 61.7% | 63.0%±3.7 | 62.0%±3.5 | 66.3%±5.2 | 51.2%±1.7 | 67.8%±2.9 | 66.8%±2.8 | 59.1%±6.2 | 66.3%±0.0 | 55.8%±0.0 | 42.7%±2.3 |
| 6 | GPT-5 mini | 61.1% | 61.5%±3.4 | 59.6%±3.5 | 69.7%±5.0 | 50.0%±1.7 | 65.7%±2.9 | 57.3%±3.1 | 60.4%±6.2 | 69.7%±0.0 | 53.4%±0.0 | 43.2%±2.5 |
| 7 | Claude Opus 4.6 | 60.2% | 61.8%±3.4 | 53.7%±4.1 | 74.1%±5.4 | 48.0%±1.9 | 60.0%±3.5 | 66.5%±2.8 | 61.7%±6.1 | 74.1%±0.0 | 59.0%±0.0 | 40.4%±2.7 |
| 8 | Claude Sonnet 4.6 | 59.5% | 60.4%±3.5 | 55.6%±3.9 | 72.9%±5.0 | 46.5%±2.0 | 61.3%±3.5 | 62.0%±3.0 | 59.3%±6.3 | 72.9%±0.0 | 53.2%±0.0 | 39.0%±2.8 |
| 9 | Grok 4 | 59.5% | 60.6%±3.0 | 52.1%±3.7 | 79.8%±7.2 | 47.2%±1.8 | 58.7%±3.3 | 57.6%±3.1 | 63.0%±5.8 | 79.8%±0.0 | 51.0%±0.0 | 40.7%±2.7 |
| 10 | Gemini 3.1 Pro | 57.9% | 63.0%±2.9 | 55.4%±3.9 | 39.3%±5.4 | 56.7%±1.8 | 60.9%±3.6 | 58.8%±2.9 | 66.0%±5.8 | 39.3%±0.0 | 57.8%±0.0 | 49.4%±2.7 |