Vals Public Sector

AI evaluation for government.

Vals is an independent AI evaluation company. Government is adopting AI fast, but too often no one can prove which systems are good enough to trust with a benefits decision, a legal question, or a national security mission. We build rigorous, third-party benchmarks that measure how AI actually performs where government puts it to work. With that evidence, agencies can adopt AI with confidence, policymakers can make informed AI policy, and taxpayers can see a real return on their investment.

Our track record speaks for itself: the world's leading AI labs already rely on Vals to test their models before release, and our benchmarks have been covered by the New York Times, Washington Post, Wall Street Journal, and Bloomberg. That's the same independent rigor we now bring to government.

Trust is earned. Evaluation is how you earn it.


Proven results: Our Public Sector Benchmarks

We have already brought independent evaluation to some of the highest-stakes public domains. Our benchmarks are built alongside real domain experts and results are published openly, with live leaderboards that update as new models are released.

Public Benefits Benchmark

Public Benefits Bench measures whether AI can serve as a reliable first point of contact for the more than 37 million Americans who rely on SNAP food assistance. Built with the Center for Civic Futures and Code for America, with rubrics validated by SNAP policy experts, it tests models against 459 real benefits scenarios spanning all 50 states. See the live leaderboard and methodology.

System                            Accuracy    ±
Claude Opus 4.8                   68.1%       1.21
Claude Sonnet 5                   -           1.23
MiniMax-M3                        -           1.25
DeepSeek V4                       -           1.26
Claude Sonnet 4.6                 -           1.26
GLM 5.1                           -           1.26
GPT 5.5                           -           1.27
Gemini 3.5 Flash                  -           1.28
Kimi K2.6                         -           1.29
Claude Haiku 4.5 (Nonthinking)    -           1.30

Public Benefits Bench: No general-purpose AI model performs well enough to be trusted with SNAP benefits guidance. The top-performing model, Claude Opus 4.8, answered SNAP-related questions correctly only 68.1% of the time, so even beneficiaries using the best current system would get an incorrect answer nearly a third of the time.

Trusted across the hardest domains

Beyond the public sector, our benchmarks span finance, law, education, and software engineering. Our work is trusted and cited by the leading AI labs, and has been covered by the New York Times, Washington Post, Wall Street Journal, and Bloomberg.

Proprietary

Legal Research Bench
Updated 7/1/2026
Evaluating agents on legal research tasks across diverse areas of US law
Top models: Claude Opus 4.8, Claude Sonnet 5, GPT 5.5

Finance Agent v2
Updated 7/1/2026
Evaluating agents on core financial analyst tasks
Top models: Gemini 3.5 Flash, Claude Fable 5, Claude Opus 4.8

SAGE (Student Assessment with Generative Evaluation)
Updated 7/1/2026
Top models: Claude Opus 4.7, Gemma 4 31B IT, Claude Opus 4.8

Vibe Code Bench v1.1
Updated 7/1/2026
Can models build web applications from scratch?
Top models: Claude Fable 5, Claude Opus 4.8, Claude Sonnet 5

We measure for frontier risk

Our work extends beyond everyday government workflows. As AI grows more capable, the same rigorous measurement gives government an independent read on both what these models can do and the dangers they pose to the public, including in cybersecurity, where powerful models could be turned against key infrastructure and institutions. We give agencies and policymakers early, credible insight so they can get ahead of emerging threats before those capabilities reach the people they serve.

Get in touch

Whether you work in government, build AI for the public sector, or are an expert who wants to help shape an evaluation, we would love to hear from you and work with you.