Medical AI Superintelligence Test (MAST) Leaderboard

The MAST project curates a centralized collection of robust, realistic clinical benchmarks for measuring the performance of medical AI systems.

Composite Score

Harmonic mean of benchmark overall scores
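As a minimal sketch, the composite can be computed with a standard harmonic mean. The scores below are hypothetical, and the published composites may aggregate benchmarks beyond the columns shown on this page:

```python
from statistics import harmonic_mean

def composite(scores):
    """Composite = harmonic mean of per-benchmark overall scores (percent).

    The harmonic mean is dominated by the lowest value, so one weak
    benchmark drags the composite down more than an arithmetic mean would.
    """
    return harmonic_mean(scores)

# Hypothetical per-benchmark scores for one model
print(round(composite([64.0, 68.7, 50.0]), 1))  # → 59.8
```

A model that scores 50% everywhere gets a composite of exactly 50%, while a single low outlier pulls the composite well below the arithmetic average.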


Benchmark Comparison

Overall scores across benchmarks (Pearson R = 0.83).
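The comparison reports a pairwise Pearson correlation between benchmark scores. A minimal sketch of that computation, using the NOHARM and HealthBench columns for the ten models listed on this page (a subset of the full leaderboard, so the result need not equal the 0.83 reported above):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# NOHARM and HealthBench overall scores for the ten models shown below
noharm      = [64.0, 59.1, 59.9, 56.8, 61.4, 58.2, 55.6, 51.5, 53.1, 44.4]
healthbench = [50.0, 47.0, 44.0, 46.0, 40.0, 40.0, 33.0, 24.0, 22.0, 20.0]
print(round(pearson_r(noharm, healthbench), 2))
```

A correlation near 1 would mean the benchmarks rank models almost identically; values well below 1 indicate the benchmarks capture partly different aspects of clinical capability.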

Model Profiles

Compare models across benchmarks

Best Models

| # | Model | Composite | NOHARM | SCT | HealthBench |
|---|-------|-----------|--------|-----|-------------|
| 1 | GPT-5.2 | 57.9% | 64.0% ±3.1 | 68.7% ±2.0 | 50.0% ±1.0 |
| 2 | Grok 4 | 54.1% | 59.1% ±2.0 | 62.7% ±4.4 | 47.0% ±1.0 |
| 3 | Kimi K2.5 | 54.1% | 59.9% ±2.0 | 70.4% ±2.6 | 44.0% ±1.0 |
| 4 | Grok 4 Fast | 53.8% | 56.8% ±0.9 | 66.9% ±2.4 | 46.0% ±2.0 |
| 5 | Claude Opus 4.6 | 53.2% | 61.4% ±0.7 | 74.5% ±3.7 | 40.0% ±1.0 |

Worst Models

| # | Model | Composite | NOHARM | SCT | HealthBench |
|---|-------|-----------|--------|-----|-------------|
| 12 | DeepSeek V3.2 | 47.5% | 58.2% ±0.8 | 45.1% ±5.1 | 40.0% ±1.0 |
| 13 | Gemini 2.0 Flash | 44.8% | 55.6% ±0.4 | 56.3% ±1.0 | 33.0% ±1.0 |
| 14 | GPT-4o | 39.4% | 51.5% ±1.3 | 69.2% ±1.6 | 24.0% ±1.0 |
| 15 | Llama 4 Maverick | 37.8% | 53.1% ±1.1 | 73.2% ±2.7 | 22.0% ±1.0 |
| 16 | GPT-4o mini | 32.5% | 44.4% ±2.8 | 56.7% ±1.8 | 20.0% ±1.0 |