Test Suite Transparency

All test suites across Evaluate and Screen. Click any row to view details.

78.4% avg score
2,948 cases
2,007 pass / 941 fail
21 critical
50 evaluate · 59 screen
Updated 2/23/2026
About this dashboard

Why we publish failures

A healthy failure rate isn't a bug; it's essential. Crisis detection involves subjective judgments on which clinicians often disagree. We publish results because transparency matters more than optics. If a safety system claims 100% accuracy, be skeptical.

Direction over exactitude

Disagreements between "mild" and "moderate" are acceptable. Returning "none" when a case should have been flagged is a real gap. litmus.json holds the regression guardrails (a pass rate of roughly 95% or higher is required). The other suites are for calibration exploration, where a pass rate of 70-90% is normal.
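As a rough illustration of the two rules above, the sketch below separates severity disagreements from missed flags and checks a suite's pass rate against the stated thresholds. The names `SuiteResult`, `mismatch_kind`, and `assess` are hypothetical, as are the return labels and the treatment of a flag raised when none was expected; none of this is the dashboard's actual implementation.

```python
from dataclasses import dataclass

# Thresholds taken from the text above; everything else is an assumption.
LITMUS_MIN = 0.95                   # litmus.json: roughly 95%+ required
CALIBRATION_RANGE = (0.70, 0.90)    # other suites: 70-90% is normal


@dataclass
class SuiteResult:
    """Hypothetical container for one suite's pass/fail counts."""
    name: str
    passed: int
    failed: int

    @property
    def pass_rate(self) -> float:
        total = self.passed + self.failed
        return self.passed / total if total else 0.0


def mismatch_kind(expected: str, actual: str) -> str:
    """Classify the direction of a disagreement between two labels."""
    if expected == actual:
        return "match"
    if actual == "none":
        # Something should have been flagged but was not: the real gap.
        return "missed flag"
    if expected == "none":
        # Assumption: the source does not say how over-flagging is treated.
        return "unexpected flag"
    # Both labels flag risk and differ only in degree, e.g. mild vs moderate.
    return "severity disagreement"


def assess(suite: SuiteResult) -> str:
    """Compare a suite's pass rate with the threshold for its role."""
    rate = suite.pass_rate
    if suite.name == "litmus.json":
        return "ok" if rate >= LITMUS_MIN else "regression"
    low, high = CALIBRATION_RANGE
    if rate < low:
        return "below normal range"
    if rate > high:
        return "above normal range"
    return "normal"
```

For example, `mismatch_kind("moderate", "mild")` is reported as a severity disagreement, while `mismatch_kind("moderate", "none")` is reported as a missed flag.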

Holdout cases

70% of test prompts are hidden to prevent gaming. For hidden cases you can still see the expected and actual classifications and the pass/fail status. Clinical frameworks: C-SSRS, Danger Assessment, HCR-20, CEOP.
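One way such a holdout could work is sketched below: a hash of the case ID decides deterministically whether the prompt is hidden, and the public view keeps the classifications and pass/fail status either way. This is an assumption for illustration only; the source does not describe how holdout cases are chosen, and the field names (`id`, `prompt`, `expected`, `actual`, `passed`) and function names are hypothetical.

```python
import hashlib

HOLDOUT_FRACTION = 0.70  # share of prompts hidden, per the text above


def is_holdout(case_id: str) -> bool:
    """Map a case ID to a stable value in [0, 1) and compare with the fraction."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < HOLDOUT_FRACTION


def public_view(case: dict) -> dict:
    """Return the fields shown publicly, hiding the prompt for holdout cases."""
    return {
        "id": case["id"],
        "expected": case["expected"],
        "actual": case["actual"],
        "passed": case["passed"],
        # Holdout prompts are withheld; everything else stays visible.
        "prompt": None if is_holdout(case["id"]) else case["prompt"],
    }
```

Hashing the ID rather than sampling at random keeps the same cases hidden across runs, so a prompt never becomes visible just because the dashboard was regenerated.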

114 suites

Test suite results for NOPE Safety API

These results demonstrate our classification expectations and help you understand what we consider accurate risk assessment.