Medical AI Superintelligence Test (MAST) Leaderboard

The MAST project seeks to curate a centralized resource of the most robust and realistic clinical benchmarks to measure the performance of medical AI.

See our policies and submission instructions.

Apr 2026: CPC-Bench, Multimodal Derm
~Jul 2026 (In Development): NOHARM-Mind
~H2 2026 – 2027 (In Development): PACT: 12 high-risk clinical reasoning benchmarks
Alpha Preview — MAST is currently in alpha preview. Exact scores on this benchmark may change as we undergo final validation and tuning.

Composite Score

Arithmetic Mean of benchmark overall scores
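
As a minimal sketch of an unweighted arithmetic mean over benchmark overall scores, the snippet below averages the six per-benchmark columns shown for GPT-5.5 in the leaderboard table. The function name `composite_score` and the choice of these six inputs are assumptions made for illustration. The result is about 61.65%, which does not reproduce the published 65.0% composite, so the published figure presumably draws on a benchmark set or weighting that this page does not spell out.

```python
from statistics import fmean


def composite_score(benchmark_scores: dict[str, float]) -> float:
    """Unweighted arithmetic mean of per-benchmark overall scores (percent)."""
    if not benchmark_scores:
        raise ValueError("at least one benchmark score is required")
    return fmean(benchmark_scores.values())


# Per-benchmark overall scores listed for GPT-5.5 in the leaderboard table.
# Which benchmarks actually enter the published composite is an assumption.
gpt_5_5 = {
    "NOHARM v2": 70.7,
    "SCT": 65.4,
    "CPC-Bench": 67.7,
    "MedAgent": 57.7,
    "ReXrank": 65.5,
    "Photos": 42.9,
}

print(f"{composite_score(gpt_5_5):.2f}%")
```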

Benchmark Comparison

Overall scores across benchmarks

Pearson R = 0.32
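
For readers who want to check a correlation like this themselves, here is a small sketch of the Pearson coefficient computed across models for one pair of benchmarks. The page does not state which benchmark pair or which set of models the R = 0.32 value refers to, so the pair used here (NOHARM v2 versus SCT, over the ten models in the table) is an assumption and is not expected to reproduce that number. `pearson_r` is a hypothetical helper name.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Overall scores for the ten listed models, in leaderboard order.
# The choice of this benchmark pair is an illustrative assumption.
noharm_v2 = [70.7, 70.0, 69.7, 68.2, 67.8, 65.7, 60.0, 61.3, 58.7, 60.9]
sct = [65.4, 59.0, 67.1, 61.2, 66.8, 57.3, 66.5, 62.0, 57.6, 58.8]

r = pearson_r(noharm_v2, sct)
print(f"Pearson r = {r:.2f}")
```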

Model Profiles

Compare models across benchmarks

Performance Over Time

Best score per base model on release date, connected by family

[Figure: score (0% to 100%) by release date (Aug 23 to Feb 26). Series: DeepSeek R, DeepSeek V, GPT, GPT Mini, Gemini Flash, Gemini Pro, Grok, Grok Fast, Haiku, Kimi, Maverick, MedGemma 27B IT, MedGemma 4B IT, Opus, Scout, Sonnet.]

First Do NOHARM v2 · best score per base model across reasoning-effort variants
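
The chart's selection rule (best score per base model across reasoning-effort variants, plotted at release date and connected by family) can be sketched roughly as follows. The `Run` record, the `family_series` helper, and every value in `runs` are hypothetical and exist only to illustrate the grouping; they are not taken from the leaderboard data.

```python
from collections import defaultdict
from datetime import date
from typing import NamedTuple


class Run(NamedTuple):
    family: str       # e.g. a model family line on the chart
    base_model: str   # base model shared by its reasoning-effort variants
    released: date    # release date of the base model
    variant: str      # reasoning-effort variant label
    score: float      # benchmark score in percent


def family_series(runs: list[Run]) -> dict[str, list[Run]]:
    """Keep the best-scoring run per base model, then order each family by date."""
    best: dict[tuple[str, str], Run] = {}
    for run in runs:
        key = (run.family, run.base_model)
        if key not in best or run.score > best[key].score:
            best[key] = run
    series: dict[str, list[Run]] = defaultdict(list)
    for run in best.values():
        series[run.family].append(run)
    for points in series.values():
        points.sort(key=lambda r: r.released)
    return dict(series)


# Hypothetical records for illustration only.
runs = [
    Run("Family A", "A-1", date(2024, 1, 15), "low", 40.0),
    Run("Family A", "A-1", date(2024, 1, 15), "high", 45.0),
    Run("Family A", "A-2", date(2025, 3, 1), "high", 55.0),
    Run("Family A", "A-2", date(2025, 3, 1), "low", 52.0),
    Run("Family B", "B-1", date(2024, 6, 10), "default", 48.0),
]

series = family_series(runs)
for family, points in series.items():
    print(family, [(p.base_model, p.score) for p in points])
```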

| # | Model | Composite | Reasoning | Safety | Agentic | Multimodal | NOHARM v2 | SCT | CPC-Bench | MedAgent | ReXrank | Photos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.5 | 65.0% | 68.3% ± 3.1 | 65.5% ± 3.6 | 57.7% ± 5.7 | 51.8% ± 3.4 | 70.7% ± 2.9 | 65.4% ± 2.8 | 67.7% ± 5.7 | 57.7% ± 0.0 | 65.5% ± 0.0 | 42.9% ± 2.6 |
| 2 | GPT-5.4 | 63.4% | 64.6% ± 3.4 | 64.6% ± 3.6 | 64.8% ± 5.3 | 52.0% ± 1.6 | 70.0% ± 3.1 | 59.0% ± 3.0 | 63.4% ± 6.0 | 64.8% ± 0.0 | 55.4% ± 0.0 | 46.5% ± 2.5 |
| 3 | GPT-5 | 63.4% | 64.7% ± 3.6 | 65.2% ± 3.2 | 67.9% ± 10.4 | 51.0% ± 1.7 | 69.7% ± 2.9 | 67.1% ± 2.9 | 61.0% ± 6.3 | 67.9% ± 0.0 | 54.6% ± 0.0 | 44.2% ± 2.5 |
| 4 | GPT-5.2 | 63.3% | 64.7% ± 3.3 | 62.1% ± 3.8 | 69.1% ± 5.3 | 52.9% ± 1.7 | 68.2% ± 3.2 | 61.2% ± 2.9 | 63.7% ± 6.1 | 69.1% ± 0.0 | 56.5% ± 0.0 | 46.5% ± 2.6 |
| 5 | Claude Opus 4.7 | 61.7% | 63.0% ± 3.7 | 62.0% ± 3.5 | 66.3% ± 5.2 | 51.2% ± 1.7 | 67.8% ± 2.9 | 66.8% ± 2.8 | 59.1% ± 6.2 | 66.3% ± 0.0 | 55.8% ± 0.0 | 42.7% ± 2.3 |
| 6 | GPT-5 mini | 61.1% | 61.5% ± 3.4 | 59.6% ± 3.5 | 69.7% ± 5.0 | 50.0% ± 1.7 | 65.7% ± 2.9 | 57.3% ± 3.1 | 60.4% ± 6.2 | 69.7% ± 0.0 | 53.4% ± 0.0 | 43.2% ± 2.5 |
| 7 | Claude Opus 4.6 | 60.2% | 61.8% ± 3.4 | 53.7% ± 4.1 | 74.1% ± 5.4 | 48.0% ± 1.9 | 60.0% ± 3.5 | 66.5% ± 2.8 | 61.7% ± 6.1 | 74.1% ± 0.0 | 59.0% ± 0.0 | 40.4% ± 2.7 |
| 8 | Claude Sonnet 4.6 | 59.5% | 60.4% ± 3.5 | 55.6% ± 3.9 | 72.9% ± 5.0 | 46.5% ± 2.0 | 61.3% ± 3.5 | 62.0% ± 3.0 | 59.3% ± 6.3 | 72.9% ± 0.0 | 53.2% ± 0.0 | 39.0% ± 2.8 |
| 9 | Grok 4 | 59.5% | 60.6% ± 3.0 | 52.1% ± 3.7 | 79.8% ± 7.2 | 47.2% ± 1.8 | 58.7% ± 3.3 | 57.6% ± 3.1 | 63.0% ± 5.8 | 79.8% ± 0.0 | 51.0% ± 0.0 | 40.7% ± 2.7 |
| 10 | Gemini 3.1 Pro | 57.9% | 63.0% ± 2.9 | 55.4% ± 3.9 | 39.3% ± 5.4 | 56.7% ± 1.8 | 60.9% ± 3.6 | 58.8% ± 2.9 | 66.0% ± 5.8 | 39.3% ± 0.0 | 57.8% ± 0.0 | 49.4% ± 2.7 |