yesterday i wrote about why writing a biology benchmark isn't like writing a test suite: tolerances need to encode biological variation - some of it methodological, some of it genuinely wrong. but that's only half of it.

each scBench eval is a snapshot of experimental data immediately before an analysis step: prior preprocessing already applied, target analysis not yet performed. that design choice creates a problem we had to engineer against. AnnData objects accumulate state, and when you snapshot real workflows, that state comes with the data. you might leave adata.obsm["X_pca"] sitting in a file where the eval tests whether the agent can compute that representation itself. if you don't remove those fields, the eval measures whether the agent can find the answer in a file.

so every eval in our benchmark had to pass three adversarial checks before it was allowed in: can you answer from .obs/.uns directly? from textbook knowledge? do wrong answers leak the right one by being biologically implausible? any problem that failed a check was revised or cut.

one example: an agent is given DRG (peripheral nervous system) data and asked which brain-region signature scores highest. the correct answer is microglia homeostatic - not because microglia are present, but because shared macrophage-like markers make that signature score highest in the data. textbook biology gets it wrong. computation gets it right.

tolerances encode what variation is methodological. shortcut removal encodes what the agent is actually being asked to do. both are required before your benchmark accuracy means anything. the best model hit 52.8%. it's not much but it's honest work.
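the state-stripping step can be sketched roughly like this. a minimal, hedged sketch - the AnnData object is stood in for by a plain dict of slot -> {key: value}, and the helper name and leak map are illustrative assumptions, not scBench's actual code; a real implementation would mutate adata.obsm / adata.varm / adata.uns directly:

```python
# hypothetical helper: remove fields that would let an agent read the
# answer out of the snapshot instead of computing it. "snapshot" mimics
# an AnnData object as a dict of slots (obsm, uns, ...), each mapping
# keys to stored arrays/metadata.

def strip_leaked_state(snapshot: dict, leaked: dict) -> dict:
    """Drop every key listed in `leaked` (slot -> set of keys) from the snapshot."""
    cleaned = {}
    for slot, entries in snapshot.items():
        drop = leaked.get(slot, set())
        # keep only fields that do not reveal the target analysis result
        cleaned[slot] = {k: v for k, v in entries.items() if k not in drop}
    return cleaned

# example: the eval asks the agent to compute PCA, so precomputed PCA
# results (X_pca in .obsm, pca params in .uns) must not ship with the data
snapshot = {
    "obsm": {"X_pca": "...", "X_umap": "..."},
    "uns": {"pca": "...", "neighbors": "..."},
}
leaked = {"obsm": {"X_pca"}, "uns": {"pca"}}

cleaned = strip_leaked_state(snapshot, leaked)
# cleaned keeps X_umap and neighbors but drops X_pca and pca
```

the point of the leak map being per-eval is that what counts as a shortcut depends on the target analysis: X_pca is fine to keep in an eval about differential expression, fatal in one about dimensionality reduction.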
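the three-check gate is just an all-or-nothing filter over candidate evals. a minimal sketch, assuming each check is a predicate that returns True when the eval passes (no shortcut found) - the check names and the flag fields on the eval dict are illustrative assumptions, not scBench internals:

```python
from typing import Callable

Check = Callable[[dict], bool]

def passes_gate(eval_item: dict, checks: list[Check]) -> bool:
    # an eval enters the benchmark only if every adversarial check passes;
    # anything that fails gets revised or cut
    return all(check(eval_item) for check in checks)

# illustrative checks mirroring the three questions above
def not_answerable_from_metadata(e: dict) -> bool:
    return not e.get("answer_in_obs_uns", False)

def not_answerable_from_textbook(e: dict) -> bool:
    return not e.get("textbook_answerable", False)

def distractors_plausible(e: dict) -> bool:
    # wrong answers must be biologically plausible, or they leak the right one
    return e.get("distractors_plausible", True)

checks = [not_answerable_from_metadata,
          not_answerable_from_textbook,
          distractors_plausible]
```

a usage note: `passes_gate({"textbook_answerable": True}, checks)` returns False - the DRG example above is exactly an eval engineered to survive the textbook check, because textbook reasoning picks the wrong signature.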