Vals Public Sector

AI evaluation for government.

Vals is an independent AI evaluation company. Government is adopting AI fast, but too often no one can prove which systems are good enough to trust with a benefits decision, a legal question, or a national security mission. We build rigorous, third-party benchmarks that measure how AI actually performs where government puts it to work. With that evidence, agencies can adopt AI with confidence, policymakers can make informed AI policy, and taxpayers can see a real return on their investment.

Our track record speaks for itself: the world's leading AI labs already rely on Vals to test their models before release, and our benchmarks have been covered by the New York Times, Washington Post, Wall Street Journal, and Bloomberg. That's the same independent rigor we now bring to government.

Trust is earned. Evaluation is how you earn it.

Proven results: Our Public Sector Benchmarks

We have already brought independent evaluation to some of the highest-stakes public domains. Our benchmarks are built alongside real domain experts and results are published openly, with live leaderboards that update as new models are released.

Public Benefits Benchmark

Public Benefits Bench measures whether AI can serve as a reliable first point of contact for the more than 37 million Americans who rely on SNAP food assistance. Built with the Center for Civic Futures and Code for America, with rubrics validated by SNAP policy experts, it tests models against 459 real benefits scenarios spanning all 50 states. See the live leaderboard and methodology.

[Leaderboard table: System, Accuracy (±)]

Public Benefits Bench: No general-purpose AI model performs well enough to be trusted with SNAP benefits guidance. The top-performing model, Claude Opus 4.8, provided correct answers to SNAP-related questions only 68.1% of the time, meaning beneficiaries using current AI systems still get incorrect answers nearly a third of the time.

We measure for frontier risk

Our work extends beyond everyday government workflows. As AI grows more capable, that same rigorous measurement gives government an independent read on both what these models can do and the dangers they pose to the public, including in cybersecurity, where powerful models could be turned against key infrastructure and institutions. We give agencies and policymakers early, credible insight, so they can get ahead of emerging threats before those capabilities reach the people they serve.

Get in touch

Whether you work in government, build AI for the public sector, or are an expert who wants to help shape an evaluation, we would love to hear from you and work with you.

Or send us an email at contact@vals.ai