About Vals AI
Independent platform committed to advancing the future of Gen AI through unbiased benchmarks and scalable evaluation infrastructure for labs and engineering teams.
Our benchmarks measure the capability and reliability of AI models and agents in realistic tasks. In contrast with contrived exam-style benchmarks, we focus on economically valuable and scientifically important domains—finance, healthcare, math, coding, and more. Developed in collaboration with domain experts, our datasets are carefully curated to be of extremely high quality and push models to their limits.
Our benchmarks reflect the complexity of real-world tasks, which necessitates evaluating multiple types of capabilities:
A major problem with evaluations of AI models is test-set leakage [1]. Benchmark data can contaminate training sets either directly or through synthetic data [2], undermining the validity of reported results. We therefore offer private benchmarking; for transparency and fairness, we also provide (for most benchmarks):
Benchmarks often report only accuracy numbers; however, it is also important to consider factors such as efficiency, cost, time taken per test, and failure modes. Our evaluation framework provides detailed insight into model performance through multiple metrics:
This information enables us to offer a more comprehensive, holistic view of model performance, including accuracy, reliability, efficiency, and qualitative insights.
We report standard errors alongside benchmark scores to reflect statistical uncertainty.
Our methodology depends on how the benchmark is structured:
For benchmarks evaluated once, we follow standard uncertainty reporting practice, as suggested by “Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations” by Evan Miller [3]. Error bars are computed as the standard error of the mean (SEM) over instance-level scores.
In particular, let $x_1, \ldots, x_n$ be instance-level scores. The standard error of the mean (SEM), using the sample standard deviation $s$, is given by:

$$\mathrm{SEM} = \frac{s}{\sqrt{n}}, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the mean.
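As a minimal sketch of this computation (the function name is our own, and only the standard library is assumed), the instance-level SEM can be written as:

```python
import math

def sem(scores):
    """Standard error of the mean over instance-level scores,
    using the sample standard deviation (n - 1 denominator)."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance with Bessel's correction
    var = sum((x - mean) ** 2 for x in scores) / (n - 1)
    return math.sqrt(var) / math.sqrt(n)
```

For binary pass/fail scores such as `[0, 1, 0, 1]`, this reduces to the usual error bar on an accuracy estimate.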
These error bars capture measurement uncertainty in the benchmark itself. They do not reflect variability across prompts, seeds, deployment settings, or the stochastic nature of LLM generation.
When a benchmark includes multiple independent runs, we compute the SEM of the per-run scores, estimating uncertainty over runs rather than over individual instances.
Let $\bar{x}_1, \ldots, \bar{x}_K$ be the average scores from each of $K$ independent runs, and let the overall average be $\bar{x} = \frac{1}{K}\sum_{k=1}^{K} \bar{x}_k$.

The standard error over runs is then given by:

$$\mathrm{SEM}_{\text{runs}} = \frac{s_{\text{runs}}}{\sqrt{K}}, \qquad s_{\text{runs}} = \sqrt{\frac{1}{K-1}\sum_{k=1}^{K}(\bar{x}_k - \bar{x})^2}$$
For benchmarks that combine multiple tasks, we propagate uncertainty from each component using weighted variance pooling.
Let the component standard errors be $\sigma_1, \ldots, \sigma_m$ and the weights be $w_1, \ldots, w_m$, with $\sum_{i=1}^{m} w_i = 1$.

The propagated standard error is then:

$$\sigma = \sqrt{\sum_{i=1}^{m} w_i^2 \sigma_i^2}$$
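As a sketch of weighted variance pooling (the function name is ours, and independence between components is an assumption), the propagated standard error of a weighted average can be computed as:

```python
import math

def pooled_sem(component_sems, weights):
    """Propagate component standard errors through a weighted average,
    assuming the components are statistically independent."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    return math.sqrt(sum((w * s) ** 2 for w, s in zip(weights, component_sems)))
```

For example, pooling two equally weighted tasks with standard errors 0.01 and 0.02 yields an overall error smaller than a naive average of the two, because independent errors partially cancel.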
In all cases, we use standard statistical definitions of the standard error of the mean, with sample standard deviation where applicable.
Since LLMs are often used as part of agentic systems with general scaffolds, and within larger workflows or products, it is important to design evaluations that measure the capabilities of these systems. Our benchmarks test crucial aspects of this, such as tool calling, multi-turn flows, coding skill, and computer use.
In the future, our benchmarks will also evolve to test not only agentic systems we design, but also custom user-provided scaffolds and products.
These benchmarks ensure comprehensive evaluation of AI systems, addressing their growing utility in real-world applications and their ability to function autonomously in larger systems.
[1] https://arxiv.org/abs/2410.08385
[2] https://arxiv.org/abs/2407.07565
[3] https://arxiv.org/abs/2411.00640