Medical AI Superintelligence Test (MAST) Leaderboard

The MAST project curates a centralized collection of the most robust and realistic clinical benchmarks for measuring the performance of medical AI.

We currently support First Do NOHARM, the Script Concordance Test (SCT-Bench), CPC-Bench, MedAgentBench, and ReXrank mini, with more benchmarks on our roadmap (below). See our policies and submission instructions.

Roadmap

~Apr 2026: CPC-Bench Multimodal Derm
~Jul 2026: NOHARM-Mind (in development)
~H2 2026 – 2027: PACT, 12 high-risk clinical reasoning benchmarks (in development)

Composite Score

Arithmetic mean of each benchmark's overall score; a sketch of the computation follows.
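
A minimal Python sketch of that computation, assuming the composite is the unweighted mean of the per-benchmark scores shown in the table below. Note that the published composites differ from the mean of the five displayed columns (e.g. GPT-5.2's displayed scores average to 60.6%, not 63.4%), so the averaged set may include benchmarks or weights not shown here.

```python
def composite_score(benchmark_scores: dict[str, float]) -> float:
    """Unweighted arithmetic mean of per-benchmark overall scores (in %)."""
    return sum(benchmark_scores.values()) / len(benchmark_scores)

# Example using GPT-5.2's displayed scores from the leaderboard table.
gpt_5_2 = {
    "First Do NOHARM v2": 61.5,
    "SCT": 61.7,
    "MedAgentBench v2": 64.7,
    "ReXrank Radiology": 56.5,
    "Multimodal Derm": 58.6,
}
print(f"Composite: {composite_score(gpt_5_2):.1f}%")  # 60.6% over these five columns
```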

[Chart panels: Best Models, Worst Models]

Benchmark Comparison

Overall scores across benchmarks are only weakly correlated (Pearson r = 0.16), so a model that leads on one benchmark does not necessarily lead on the others.
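
A sketch of how such a cross-benchmark correlation can be computed from the table below, assuming the reported figure is the mean pairwise Pearson r across benchmark score columns (how MAST aggregates it is not stated here):

```python
import numpy as np

# Per-benchmark overall scores from the leaderboard table (rows = models).
# Columns: First Do NOHARM v2, SCT, MedAgentBench v2, ReXrank Radiology, Multimodal Derm.
scores = np.array([
    [61.5, 61.7, 64.7, 56.5, 58.6],  # GPT-5.2
    [42.1, 62.5, 76.0, 53.2, 58.6],  # Claude Sonnet 4.6
    [60.0, 67.5, 60.3, 54.6, 58.2],  # GPT-5
    [44.6, 66.3, 70.0, 59.0, 63.6],  # Claude Opus 4.6
    [53.0, 57.4, 69.3, 53.7, 60.4],  # GPT-5 mini
    [62.1, 59.4, 64.3, 55.4, 61.3],  # GPT-5.4
    [41.4, 67.5, 69.3, 53.2, 62.0],  # GPT-4.1
    [38.6, 61.1, 68.7, 50.7, 54.8],  # Grok 4 Fast
    [38.8, 62.1, 73.0, 52.3, 45.0],  # Claude Sonnet 4.5
    [28.0, 66.0, 70.3, 50.3, 59.7],  # GPT-4o
])

# Pairwise Pearson correlations between benchmark columns.
corr = np.corrcoef(scores, rowvar=False)

# Mean of the off-diagonal entries: one summary of cross-benchmark agreement.
n = corr.shape[0]
mean_r = (corr.sum() - n) / (n * (n - 1))
print(f"Mean pairwise Pearson r = {mean_r:.2f}")
```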

Model Profiles

Compare models across benchmarks

| # | Model | Composite | First Do NOHARM v2 | SCT | MedAgentBench v2 | ReXrank Radiology | Multimodal Derm |
|---|-------|-----------|--------------------|-----|------------------|-------------------|-----------------|
| 1 | GPT-5.2 | 63.4% | 61.5% ± 4.6 | 61.7% ± 0.4 | 64.7% ± 0.0 | 56.5% ± 0.0 | 58.6% ± 0.0 |
| 2 | Claude Sonnet 4.6 | 62.5% | 42.1% ± 5.7 | 62.5% ± 0.5 | 76.0% ± 0.0 | 53.2% ± 0.0 | 58.6% ± 0.0 |
| 3 | GPT-5 | 62.0% | 60.0% ± 5.1 | 67.5% ± 0.7 | 60.3% ± 0.0 | 54.6% ± 0.0 | 58.2% ± 0.0 |
| 4 | Claude Opus 4.6 | 62.0% | 44.6% ± 5.5 | 66.3% ± 0.2 | 70.0% ± 0.0 | 59.0% ± 0.0 | 63.6% ± 0.0 |
| 5 | GPT-5 mini | 61.6% | 53.0% ± 4.9 | 57.4% ± 0.6 | 69.3% ± 0.0 | 53.7% ± 0.0 | 60.4% ± 0.0 |
| 6 | GPT-5.4 | 61.2% | 62.1% ± 5.0 | 59.4% ± 0.9 | 64.3% ± 0.0 | 55.4% ± 0.0 | 61.3% ± 0.0 |
| 7 | GPT-4.1 | 60.8% | 41.4% ± 6.3 | 67.5% ± 0.7 | 69.3% ± 0.0 | 53.2% ± 0.0 | 62.0% ± 0.0 |
| 8 | Grok 4 Fast | 59.0% | 38.6% ± 5.5 | 61.1% ± 0.3 | 68.7% ± 0.0 | 50.7% ± 0.0 | 54.8% ± 0.0 |
| 9 | Claude Sonnet 4.5 | 58.7% | 38.8% ± 5.8 | 62.1% ± 1.2 | 73.0% ± 0.0 | 52.3% ± 0.0 | 45.0% ± 0.0 |
| 10 | GPT-4o | 58.1% | 28.0% ± 5.9 | 66.0% ± 0.9 | 70.3% ± 0.0 | 50.3% ± 0.0 | 59.7% ± 0.0 |

All scores are percentages; ± values give the reported uncertainty for each benchmark score.