Test Suite Transparency

All test suites across Evaluate and Screen. Click any row to view details.

78.4% avg score

2,948 cases

2,007 / 941 pass/fail

21 critical

50 evaluate · 59 screen

Updated 2/23/2026

About this dashboard

Why we publish failures

A healthy failure rate isn't a bug; it's essential. Crisis detection involves subjective judgments where clinicians often disagree. We publish results because transparency matters more than optics. If a safety system claims 100% accuracy, be skeptical.

Direction over exactitude

Disagreements between "mild" and "moderate" are acceptable. A "none" result when the system should have flagged something is a real gap. litmus.json holds the regression guardrails (about 95% or higher required). The other suites are calibration exploration, where 70-90% is normal.

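The distinction above can be sketched in a few lines of Python. This is only an illustration, not the dashboard's actual grading code: the function names, the return labels, the severity list (limited to the three labels quoted above), and the treatment of the score as a fraction between 0 and 1 are all assumptions.

```python
# Hypothetical sketch: names, labels, and thresholds-as-fractions are assumptions.
# Only "none", "mild", and "moderate" are quoted on the dashboard itself.
SEVERITY_LABELS = ("none", "mild", "moderate")

LITMUS_SUITE = "litmus.json"
LITMUS_MIN_SCORE = 0.95        # "~95%+ required"
TYPICAL_RANGE = (0.70, 0.90)   # "70-90% normal" for the other suites


def disagreement_kind(expected: str, actual: str) -> str:
    """Classify how an actual label differs from the expected one."""
    for label in (expected, actual):
        if label not in SEVERITY_LABELS:
            raise ValueError(f"unknown severity label: {label!r}")
    if expected == actual:
        return "match"
    if actual == "none":
        return "missed"      # nothing flagged when something should be: a real gap
    if expected == "none":
        return "overflag"    # flagged when nothing was expected
    return "degree"          # both flagged, differing only in degree: acceptable


def suite_status(suite_name: str, score: float) -> str:
    """Compare a suite's score (0.0 to 1.0) against the ranges stated above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a fraction between 0 and 1")
    if suite_name == LITMUS_SUITE:
        return "ok" if score >= LITMUS_MIN_SCORE else "regression"
    low, high = TYPICAL_RANGE
    if score < low:
        return "below_typical"
    if score > high:
        return "above_typical"
    return "typical"
```

For example, `disagreement_kind("moderate", "mild")` falls in the acceptable "degree" bucket, while `disagreement_kind("mild", "none")` is a "missed" case. The "overflag" bucket is not discussed on the dashboard and is included only to make the function total.
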
Holdout cases

70% of test prompts are hidden to prevent gaming. You can still see expected/actual classifications and pass/fail status. Clinical frameworks: C-SSRS, Danger Assessment, HCR-20, CEOP.
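
One way to picture the holdout behavior is a public view that withholds the prompt but keeps the outcome fields. The sketch below is hypothetical: the `CaseResult` fields and the `public_view` function are illustrative names, and how cases are chosen for holdout is not described on this page, so it is modeled as a plain flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CaseResult:
    """Hypothetical record for one test case; field names are assumptions."""
    prompt: str
    expected: str
    actual: str
    passed: bool
    holdout: bool  # True if the prompt text is hidden from the public view


def public_view(case: CaseResult) -> dict:
    """Return the publicly visible fields, hiding the prompt for holdout cases."""
    prompt: Optional[str] = None if case.holdout else case.prompt
    return {
        "prompt": prompt,
        "expected": case.expected,
        "actual": case.actual,
        "passed": case.passed,
    }
```

A holdout case still exposes its expected and actual classifications and its pass/fail status; only the prompt text is withheld.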

114 suites