You thought we'd stop at Spatial? Here's something wild: the sequencing platform you choose affects AI accuracy more than the model itself.

We just released scBench - 394 real scRNA-seq analysis problems across 6 platforms (BD, 10x Genomics, CS Genetics, Illumina, Mission Bio, Parse Biosciences). The accuracy gap between platforms? 33 points. The gap between best and worst frontier models? 24 points. Translation: models basically memorized Scanpy tutorials. Put them on an underrepresented platform and they collapse.

This matters because scRNA-seq is the dominant assay in modern biology - way more adoption than spatial, way more public data. If we want agents that can actually do computational biology, this is the test.

Current state: Opus 4.6 hits 53% accuracy. Better than spatial (38%), but still failing every other routine task. Procedural stuff like normalization? Getting there (70%). But judgment calls - cell typing (35%), differential expression (27%) - that's where models break down. Turns out writing code ≠ doing science.
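For a concrete picture of that procedural/judgment split, here's a minimal sketch of a generic Scanpy workflow (not taken from the benchmark; the input file name and parameter values are illustrative). The preprocessing calls at the top are the rote steps models mostly get right; the annotation and DE interpretation at the bottom are where they fall apart.

```python
import scanpy as sc

# Procedural steps - the part models score ~70% on
adata = sc.read_h5ad("pbmc_example.h5ad")          # hypothetical input file
sc.pp.filter_cells(adata, min_genes=200)           # basic QC
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)       # library-size normalization
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata)
sc.tl.leiden(adata)                                # unsupervised clustering

# Judgment calls - where accuracy drops to 27-35%
sc.tl.rank_genes_groups(adata, "leiden", method="wilcoxon")
# Deciding which clusters are real cell types, which markers to trust,
# and how platform-specific artifacts shape the DE results is the part
# no tutorial spells out - and the part the benchmark actually scores.
```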
Unclear to me how much of this effect is due to (a) differences in training coverage of single-cell workflows, (b) differences between datasets (harder analysis problems), or (c) differences in technical artifacts from the instrument that affect downstream reliability. Seems hard to make the benchmark actionable without disentangling these three.
It turns out transcriptomic data are not ground truth observations😙
These are the kinds of comparisons that are so sorely needed! Headed over to read your manuscript on this, but in the meantime, nice work!
This is surprising, in a pleasant way. I honestly expected the models to perform better. It seems we will not be replaced just yet :)
Try it in two months ... AI coding capability is supposed to double every 70 days :)
• arXiv Preprint: https://latch.bio/scbench
• Code (example evals, trajectories, etc.): https://github.com/latchbio/scbench
• Live Benchmarks: https://benchmarks.bio/