Medical AI Superintelligence Test (MAST) Leaderboard
The MAST project curates a centralized resource of robust, realistic clinical benchmarks to measure the performance of medical AI.
Composite Score
The composite score is the harmonic mean of each model's overall scores across the benchmarks.
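As a worked illustration, here is a minimal Python sketch of the harmonic-mean composite. Note that the three benchmark columns shown below do not reproduce the published composites exactly (GPT-5.2's shown scores give ~59.8% against a listed 57.9%), so we assume the leaderboard aggregates additional benchmarks beyond the three displayed.

```python
from statistics import harmonic_mean

# GPT-5.2's overall scores (%) on the three benchmarks shown in the table.
# Assumption: the published composite also folds in benchmarks not displayed
# here, which is why this sketch does not land exactly on 57.9%.
scores = {"NOHARM": 64.0, "SCT": 68.7, "HealthBench": 50.0}

composite = harmonic_mean(scores.values())
print(f"Composite over the shown benchmarks: {composite:.1f}%")  # ~59.8%
```

A harmonic mean penalizes a single weak benchmark more heavily than an arithmetic mean would, so a model cannot buy a high composite by excelling on one benchmark while failing another.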
[Figure: Benchmark Comparison, overall scores across benchmarks; Pearson R = 0.83]
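Since the comparison chart itself does not survive extraction, here is a minimal sketch of how such a cross-benchmark correlation can be computed from the table's scores. Which benchmark pair the published R = 0.83 refers to is an assumption here (NOHARM vs. HealthBench), and only 10 of the 16 models appear in the table, so the result will not match the published value exactly.

```python
from scipy.stats import pearsonr

# Overall scores (%) for the ten models listed in the table, in row order.
# Assumption: the figure's R = 0.83 compares two benchmarks across all 16
# leaderboard models; with only the 10 shown here, the value will differ.
noharm      = [64.0, 59.1, 59.9, 56.8, 61.4, 58.2, 55.6, 51.5, 53.1, 44.4]
healthbench = [50.0, 47.0, 44.0, 46.0, 40.0, 40.0, 33.0, 24.0, 22.0, 20.0]

r, p_value = pearsonr(noharm, healthbench)
print(f"Pearson r = {r:.2f} (p = {p_value:.3g})")
```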
[Figure: Model Profiles, comparing models across benchmarks]
Scores are overall percentages; ± values denote the reported uncertainty. Rows are sorted by composite score, descending.

| # | Model | Composite | NOHARM | SCT | HealthBench |
|---|---|---|---|---|---|
| | **Best Models** | | | | |
| 1 | GPT-5.2 | 57.9% | 64.0%±3.1 | 68.7%±2.0 | 50.0%±1.0 |
| 2 | Grok 4 | 54.1% | 59.1%±2.0 | 62.7%±4.4 | 47.0%±1.0 |
| 3 | Kimi K2.5 | 54.1% | 59.9%±2.0 | 70.4%±2.6 | 44.0%±1.0 |
| 4 | Grok 4 Fast | 53.8% | 56.8%±0.9 | 66.9%±2.4 | 46.0%±2.0 |
| 5 | Claude Opus 4.6 | 53.2% | 61.4%±0.7 | 74.5%±3.7 | 40.0%±1.0 |
| | **Worst Models** | | | | |
| 12 | DeepSeek V3.2 | 47.5% | 58.2%±0.8 | 45.1%±5.1 | 40.0%±1.0 |
| 13 | Gemini 2.0 Flash | 44.8% | 55.6%±0.4 | 56.3%±1.0 | 33.0%±1.0 |
| 14 | GPT-4o | 39.4% | 51.5%±1.3 | 69.2%±1.6 | 24.0%±1.0 |
| 15 | Llama 4 Maverick | 37.8% | 53.1%±1.1 | 73.2%±2.7 | 22.0%±1.0 |
| 16 | GPT-4o mini | 32.5% | 44.4%±2.8 | 56.7%±1.8 | 20.0%±1.0 |