Today, Evaluating Evaluations is introducing Every Eval Ever, a unified, open data format and public dataset for AI evaluation results.
Evaluation data is everywhere: in model releases, competitive leaderboards, arenas, and papers, and it is used for capability and risk measurement, model selection, and governance. But these results are reported in formats that are not meaningfully compatible with one another, which imposes real costs on research, reproducibility, and simply making sense of the numbers.
We are releasing standards for both an aggregate schema and an instance-level schema! We took great care to incorporate nearly every edge case we found in how people report LLM evaluations. And we are not done: multimodal and other support is in the works.
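To give a flavor of what an instance-level record could capture, here is a minimal sketch in Python. The field names below are illustrative assumptions, not the published schema; the actual spec is in the links.

# Hypothetical sketch of an instance-level evaluation record.
# Field names are illustrative assumptions, not the published schema.
record = {
    "model": "example-lab/example-model-7b",  # model identifier
    "benchmark": "example-benchmark",         # evaluation / task name
    "instance_id": "item-00042",              # which test item this is
    "prompt": "What is 2 + 2?",               # input shown to the model
    "output": "4",                            # model's raw response
    "score": 1.0,                             # per-instance score
    "metric": "exact_match",                  # how the score was computed
    "harness": "lm-eval",                     # tool that produced the run
}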
Out of the gate, we have written converters for the HELM, Inspect AI, and lm-eval formats. We are starting to populate a massive dataset of every eval ever on Hugging Face :)
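Once the dataset is public, pulling it should be a standard Hugging Face datasets call. A minimal sketch, with the repository id below as a placeholder (use the actual dataset linked in the comments):

# Minimal sketch: load the aggregated results with the Hugging Face datasets library.
from datasets import load_dataset

# Placeholder repository id; substitute the dataset linked in the comments.
ds = load_dataset("evaluating-evaluations/every-eval-ever")
print(ds)  # inspect the available splits and record fields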
We ate our own dogfood: an internal version of this dataset helped power our big AI benchmark saturation study, which we will be releasing in the coming weeks. Having all the data in one place unlocks meta-research that is very hard to do meaningfully otherwise.
To turbocharge the data collection, we are also launching a shared task at #ACL 2026 to solicit data submissions from folks :) There will be prizes and paper authorship!
Leading this project and getting feedback from incredible partners across the ecosystem has truly been a dream. We received feedback from researchers at orgs including the US CAISI, Hugging Face, EleutherAI, Inspect, HELM, Technical University of Munich, Massachusetts Institute of Technology, Northeastern University, University of Copenhagen (Københavns Universitet), IBM, Noma Security, Trustible, Meridian, AI Verification and Evaluation Research Institute (AVERI), The Collective Intelligence Project, Weizenbaum Institute, and Evidence Prime.
Can't wait to see what you all do with this standard and with the data. All links in comments!