Medical AI Superintelligence Test (MAST) Leaderboard
The MAST project seeks to curate a centralized resource of the most robust and realistic clinical benchmarks to measure the performance of medical AI.
Currently we support First Do NOHARM, Script Concordance Test (SCT-Bench), CPC-Bench, MedAgentBench, and ReXrank mini, with more benchmarks on our roadmap. See our policies and submission instructions.
Roadmap
- Dec 2025
- Feb 2026
- Mar 2026
- ~Apr 2026: CPC-Bench, Multimodal Derm
- ~Jul 2026: NOHARM-Mind (in development)
- ~H2 2026 – 2027: PACT (12 high-risk clinical reasoning benchmarks, in development)
Benchmark Demos
- Composite Score: arithmetic mean of benchmark overall scores, with best- and worst-model panels.
- Benchmark Comparison: overall scores across benchmarks (Pearson R = 0.16).
- Model Profiles: compare models across benchmarks.
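The composite score above can be sketched in a few lines. This is a minimal illustration of an arithmetic mean over per-benchmark overall scores, using GPT-5.2's five published sub-scores from the table below as sample input; the leaderboard's published composites may incorporate benchmarks or weights not shown here, so this sketch is not the site's actual scoring pipeline.

```python
def composite_score(benchmark_scores: dict[str, float]) -> float:
    """Arithmetic mean of per-benchmark overall scores (in percent)."""
    if not benchmark_scores:
        raise ValueError("no benchmark scores provided")
    return sum(benchmark_scores.values()) / len(benchmark_scores)

# Sample input: GPT-5.2's five sub-scores from the results table.
scores = {
    "First Do NOHARM v2": 61.5,
    "SCT-Bench": 61.7,
    "MedAgentBench v2": 64.7,
    "ReXrank Radiology": 56.5,
    "Multimodal Derm": 58.6,
}
print(round(composite_score(scores), 1))  # mean of the five listed sub-scores
```

Note that the plain mean of these five sub-scores does not match the listed composite, consistent with the composite drawing on more than the displayed columns.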
| # | Model | Composite ↓ | First Do NOHARM v2 | SCT-Bench | MedAgentBench v2 | ReXrank Radiology | Multimodal Derm |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5.2 | 63.4% | 61.5%±4.6 | 61.7%±0.4 | 64.7%±0.0 | 56.5%±0.0 | 58.6%±0.0 |
| 2 | Claude Sonnet 4.6 | 62.5% | 42.1%±5.7 | 62.5%±0.5 | 76.0%±0.0 | 53.2%±0.0 | 58.6%±0.0 |
| 3 | GPT-5 | 62.0% | 60.0%±5.1 | 67.5%±0.7 | 60.3%±0.0 | 54.6%±0.0 | 58.2%±0.0 |
| 4 | Claude Opus 4.6 | 62.0% | 44.6%±5.5 | 66.3%±0.2 | 70.0%±0.0 | 59.0%±0.0 | 63.6%±0.0 |
| 5 | GPT-5 mini | 61.6% | 53.0%±4.9 | 57.4%±0.6 | 69.3%±0.0 | 53.7%±0.0 | 60.4%±0.0 |
| 6 | GPT-5.4 | 61.2% | 62.1%±5.0 | 59.4%±0.9 | 64.3%±0.0 | 55.4%±0.0 | 61.3%±0.0 |
| 7 | GPT-4.1 | 60.8% | 41.4%±6.3 | 67.5%±0.7 | 69.3%±0.0 | 53.2%±0.0 | 62.0%±0.0 |
| 8 | Grok 4 Fast | 59.0% | 38.6%±5.5 | 61.1%±0.3 | 68.7%±0.0 | 50.7%±0.0 | 54.8%±0.0 |
| 9 | Claude Sonnet 4.5 | 58.7% | 38.8%±5.8 | 62.1%±1.2 | 73.0%±0.0 | 52.3%±0.0 | 45.0%±0.0 |
| 10 | GPT-4o | 58.1% | 28.0%±5.9 | 66.0%±0.9 | 70.3%±0.0 | 50.3%±0.0 | 59.7%±0.0 |
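The cross-benchmark agreement statistic quoted above (Pearson R) can be illustrated directly from the table. The sketch below computes Pearson's r between two benchmark columns for the ten listed models; which benchmark pair (or aggregate of pairs) the leaderboard's R = 0.16 refers to is not stated, so this is an illustration of the statistic, not a reproduction of that figure.

```python
import math

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Per-model scores copied from the table (rows 1-10).
noharm = [61.5, 42.1, 60.0, 44.6, 53.0, 62.1, 41.4, 38.6, 38.8, 28.0]
sct    = [61.7, 62.5, 67.5, 66.3, 57.4, 59.4, 67.5, 61.1, 62.1, 66.0]
print(round(pearson_r(noharm, sct), 2))
```

A low R between benchmarks indicates they measure distinct capabilities, which is why a composite over several benchmarks is more informative than any single one.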