Medical AI Superintelligence Test (MAST) Leaderboard

The MAST project aims to provide a centralized collection of the most robust and realistic clinical benchmarks for measuring the performance of medical AI.

Composite Score

Harmonic mean of benchmark overall scores
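
As an illustration of the composite definition above, the following is a minimal Python sketch of a harmonic mean over a model's overall benchmark scores. The function name `harmonic_mean` and the zero-score handling are assumptions made for this example. The page does not say which benchmarks, weights, or rounding feed the published Composite column, so this sketch is not expected to reproduce those values exactly.

```python
def harmonic_mean(scores):
    """Harmonic mean of non-negative scores (hypothetical helper, not the site's code)."""
    if not scores:
        raise ValueError("at least one score is required")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    # Assumption: a single zero score drives the composite to zero,
    # which is the limiting value of the harmonic mean.
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Illustrative input: the NOHARM, SCT, and HealthBench overall scores
# listed for GPT-5.2 in the table on this page.
example_scores = [64.0, 68.7, 50.0]
print(f"{harmonic_mean(example_scores):.1f}")
```

Unlike an arithmetic mean, a harmonic mean is pulled toward the lowest score, so a model cannot offset a weak benchmark with a strong one. Applied to the three columns shown in the table, this sketch gives a value somewhat above the listed composite (57.9% for GPT-5.2), which suggests the published score uses inputs or adjustments not visible here.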

Benchmark Comparison

Overall scores across benchmarks (scatter plot; the X-axis and Y-axis benchmark selections are not preserved here)

Pearson R = 0.83
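
The Pearson coefficient reported for the benchmark comparison can be computed from two columns of overall scores. The sketch below is an assumption-laden illustration: the function name `pearson_r` is hypothetical, and the NOHARM/HealthBench pairing is chosen only as an example, since the page does not record which benchmark pair yields R = 0.83. It uses only the ten models listed in the table (ranks 1 to 5 and 12 to 16), so its result should not be expected to match the value computed over all ranked models.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Overall scores for the ten models shown in the table, in table order.
# Assumption: NOHARM vs. HealthBench is used purely as an example pair.
noharm = [64.0, 59.1, 59.9, 56.8, 61.4, 58.2, 55.6, 51.5, 53.1, 44.4]
healthbench = [50.0, 47.0, 44.0, 46.0, 40.0, 40.0, 33.0, 24.0, 22.0, 20.0]
print(round(pearson_r(noharm, healthbench), 2))
```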

Model Profiles

Compare models across benchmarks

#	Model	Composite	NOHARM	SCT	HealthBench
Best Models
1	GPT-5.2	57.9%	64.0%±3.1	68.7%±2.0	50.0%±1.0
2	Grok 4	54.1%	59.1%±2.0	62.7%±4.4	47.0%±1.0
3	Kimi K2.5	54.1%	59.9%±2.0	70.4%±2.6	44.0%±1.0
4	Grok 4 Fast	53.8%	56.8%±0.9	66.9%±2.4	46.0%±2.0
5	Claude Opus 4.6	53.2%	61.4%±0.7	74.5%±3.7	40.0%±1.0
Worst Models
12	DeepSeek V3.2	47.5%	58.2%±0.8	45.1%±5.1	40.0%±1.0
13	Gemini 2.0 Flash	44.8%	55.6%±0.4	56.3%±1.0	33.0%±1.0
14	GPT-4o	39.4%	51.5%±1.3	69.2%±1.6	24.0%±1.0
15	Llama 4 Maverick	37.8%	53.1%±1.1	73.2%±2.7	22.0%±1.0
16	GPT-4o mini	32.5%	44.4%±2.8	56.7%±1.8	20.0%±1.0