
Oct 2, 2025 · 6 min read

Accelerating AI in Biology With Community-Driven Benchmarks

CZI’s new benchmarking suite provides the emerging field of virtual cell modeling with the capabilities to readily assess both biological relevance and technical performance.

AI-driven virtual cell models represent one of the most promising and ambitious research frontiers, poised to guide groundbreaking experimental studies and speed discoveries about human health and disease. Yet the biological AI field has been slowed by a major technical and systemic bottleneck: the lack of trustworthy, reproducible benchmarks for evaluating the performance of biological models.

Without unified evaluation methods, the same model yields different performance scores across laboratories, not because of scientific factors but because of differences in implementation. This forces researchers to spend three weeks building custom evaluation pipelines for tasks that should require three hours with proper infrastructure. The result? Valuable research time diverted from discovery to debugging.

In collaboration with industry partners and a community working group initially focused on single-cell transcriptomics, the Chan Zuckerberg Initiative has released the first suite of tools to enable robust and broad task-based benchmarking to drive virtual cell model development. This standardized toolkit provides the emerging field of virtual cell modeling with the capabilities to readily assess both biological relevance and technical performance.

The impact is immediate: model developers can spend less time figuring out how to evaluate their models and more time improving them to solve real biological problems. Meanwhile, biologists can confidently evaluate prospective models before investing significant time and effort to deploy them.

Built for the Community, With the Community

CZI’s new benchmarking suite addresses a recognized community need for resources that are more usable, transparent, and biologically relevant. Following a recent workshop that convened machine learning and computational biology experts from across 42 top science and engineering institutions — including CZI, Stanford University, Harvard Medical School, Genentech, Johnson & Johnson, and NVIDIA — participants concluded in a preprint that AI model measurement in biology has been plagued by reproducibility challenges, biases, and a fragmented ecosystem of publicly available resources.

The group highlighted several areas in which current benchmarking efforts fall short. Often, model developers create bespoke benchmarks for individual publications, using custom, one-off approaches that showcase their models’ strengths. This can lead to cherry-picked results that look good in isolation but are difficult to cross-check across studies or reproduce in practice, slowing progress due to the lack of true comparability and trust in the models.

Additionally, the field has struggled with overfitting to static benchmarks. When a community aligns too tightly around a small, fixed set of tasks and metrics, developers may optimize for benchmark success rather than biological relevance. Resulting models may perform well on curated tests but fail to generalize to new datasets or research questions. In these cases, benchmarking can create the illusion of progress while stalling real-world impact.

With knowledge of these gaps, the CZI team collaborated with community working groups and industry partners to build and design a standardized benchmarking suite. The resulting resource is a living, evolving product where individual researchers, research teams, and industry partners can propose new tasks, contribute evaluation data, and share models.

CZI collaborated with community working groups and industry partners to build and design a standardized benchmarking suite that includes six tasks widely used by the biology community for single-cell analysis.

Designed To Build Better Models, Faster

CZI’s benchmarking suite is freely available on its virtual cells platform and is designed for widespread adoption across expertise levels. Users can explore and apply benchmarking tools matched to their technical background. For instance, developers can opt for one of two programming tools: a command-line tool that lets users easily reproduce the benchmarking results displayed on the platform, or an open-source Python package, cz-benchmarks, co-developed with NVIDIA, for embedding evaluations alongside training or inference code. With just a few lines of code, benchmarking can be run at any development stage, including intermediate checkpoints. The modular package integrates seamlessly with experiment-tracking tools like TensorBoard or MLflow.
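As a rough sketch of what embedding evaluation alongside training might look like, the snippet below runs an evaluation at intermediate checkpoints and logs every metric to MLflow. Note that `evaluate_checkpoint` and `model.training_step` are hypothetical stand-ins, not the cz-benchmarks API; consult the package documentation for the actual calls.

```python
# Sketch: benchmarking a model at intermediate training checkpoints and
# logging the scores to an experiment tracker (MLflow shown here).
# NOTE: evaluate_checkpoint is a hypothetical placeholder for a
# cz-benchmarks task run; the real package API may differ.
import mlflow


def evaluate_checkpoint(model, task_name: str) -> dict:
    """Placeholder: run a benchmarking task (e.g., cell clustering)
    against the current model and return a dict of metric scores."""
    raise NotImplementedError("Replace with the cz-benchmarks call you use.")


def train(model, data_loader, num_epochs: int, eval_every: int = 1):
    with mlflow.start_run(run_name="virtual-cell-model"):
        for epoch in range(num_epochs):
            for batch in data_loader:
                model.training_step(batch)  # your usual training loop (hypothetical method)
            if epoch % eval_every == 0:
                scores = evaluate_checkpoint(model, task_name="cell_clustering")
                for metric, value in scores.items():
                    # Log each metric so checkpoint-to-checkpoint progress is visible.
                    mlflow.log_metric(f"cell_clustering/{metric}", value, step=epoch)
```

Because the evaluation is just another function call inside the training loop, the same pattern works for inference-only workflows or for comparing several checkpoints after the fact.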

Users without a computational background can engage with the interactive, no-code web-based interface to explore and compare one model’s performance against others. Whether optimizing their own models or evaluating existing ones, users can sort and filter by task, dataset, or metric to find what matters most to their research.

The initial release of CZI’s benchmarking suite includes six tasks widely used by the biology community for single-cell analysis, contributed by CZI and partners including NVIDIA, the Allen Institute, and a single-cell community working group: cell clustering, cell type classification, cross-species integration, perturbation expression prediction, sequential ordering assessment, and cross-species disease label transfer. Unlike past benchmarking efforts that often relied on single metrics, each task in CZI’s toolkit is paired with multiple metrics for a more thorough view of performance.
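To illustrate why pairing a task with multiple metrics matters, the sketch below scores the same toy clustering result with two standard metrics, adjusted Rand index and normalized mutual information, using scikit-learn. This is a generic illustration of the idea, not the toolkit’s own implementation.

```python
# Illustration only: scoring one clustering result with multiple metrics,
# since a single metric can hide failure modes that another metric exposes.
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Toy example: ground-truth cell type labels vs. predicted cluster assignments.
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
predicted   = [0, 0, 1, 1, 1, 1, 2, 2, 0]

scores = {
    "adjusted_rand_index": adjusted_rand_score(true_labels, predicted),
    "normalized_mutual_info": normalized_mutual_info_score(true_labels, predicted),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```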

“As an AI research scientist, I rely on benchmarks to understand model performance and guide improvements,” says Kasia Kedzierska, AI research scientist at the Allen Institute. “Cz-benchmarks serves as a valuable community resource, with unified datasets, models, and tasks, that is dynamic and open for contributions. Our team at the Allen Institute was glad to contribute a dataset on immune variation and flu vaccine response, along with a benchmark task built from it, and we hope this collective effort will inspire others to contribute and strengthen model development.”

“One of the biggest challenges with benchmarking today is that, beyond combing through the literature, researchers often have to implement and test models themselves to see how they perform in a specific study context,” says Rafaela Merika, graduate student at the Earlham Institute. “This makes systematic, community-based benchmarking especially critical, since shared, unbiased evaluations can dramatically reduce that workload, allowing us to spend more time pushing science forward, rather than continually chasing and re-testing new models.”

See a demo of CZI’s powerful new benchmarking command-line tool, designed to help you compare pre-trained ML models on single-cell datasets, evaluate performance across clustering, embedding, and label prediction tasks, and even use your own datasets with pre-computed representations. With this tool, you can also easily reproduce published benchmark results and accelerate your research.

What Lies Ahead

CZI is leading the development and maintenance of these tools as a living, community-driven resource that will evolve alongside the field, incorporating new data, refining metrics, and adapting to emerging biological questions. Backed by standardized, rigorous benchmarking, AI can live up to the hype in accelerating biological research, producing robust models to tackle some of the most complex, pressing challenges in biology and medicine today.

In the coming months, CZI will expand the suite with additional community-defined benchmarking assets, including held-out data evaluation sets, and develop tasks and metrics for other biological domains, including imaging and genetic variant effect prediction.

Virtual cell benchmarking can and should grow as shared, evolving infrastructure that is transparent, trusted, and representative of real scientific needs. With an open benchmarking ecosystem, rigorous model evaluation will become not only easier, but an expected part of building useful models in biology.

Ready to get started? Evaluate models, explore community benchmarks, or contribute to this ecosystem by visiting our platform or contacting us at virtualcellmodels@chanzuckerberg.com.
