Benchmark Demos
Hands-on explorations of how MAST benchmarks evaluate AI in medicine.
First Do NOHARM v2
Explore how models handle clinical safety scenarios and harmful request detection.
SCT-Bench
See how models perform on script concordance tests for clinical reasoning.
MedAgentBench v2
Watch AI agents navigate multi-step clinical workflows in a simulated EHR.
ReXrank Mini
See how vision-language models generate radiology reports from chest X-rays across public datasets.