LatchBio


Biotechnology Research

San Francisco, CA 8,579 followers

Data Infrastructure for Biology

About us

AI agents and data infrastructure for 40+ biotech solution providers. Bundled with kits + machines to serve 300+ biopharmas and R&D Labs around the world.

Website
https://latch.bio/
Industry
Biotechnology Research
Company size
11-50 employees
Headquarters
San Francisco, CA
Type
Privately Held
Founded
2021


Updates

  • LatchBio reposted this

    yesterday i wrote about why writing a biology benchmark isn't like writing a test suite. tolerances need to encode which biological variation is methodological and which is wrong. but that's only half of it. each scBench eval is a snapshot of experimental data immediately before an analysis step. prior preprocessing already applied. target analysis not yet performed. that design choice creates a problem we had to engineer against. AnnData objects accumulate state. when you snapshot real workflows, that state comes with the data. for example, adata.obsm["X_pca"] might be sitting in a file where the eval was testing whether the agent could compute that representation itself. if you don't remove those fields, the eval measures whether the agent can find the answer in a file. every eval in our benchmark passed three adversarial checks before it was allowed in: can you answer from .obs/.uns directly? from textbook knowledge? do wrong answers leak the right one by being biologically implausible? any problem that failed a check was revised or cut. one example: an agent is given DRG (peripheral nervous system) data and asked which brain-region signature scores highest. the correct answer is microglia homeostatic - not because microglia are present, but because shared macrophage-like markers make that signature score highest in the data. textbook biology gets it wrong. computation gets it right. tolerances encode which variation is methodological. shortcut removal encodes what the agent is actually being asked to do. both are required before your benchmark accuracy means anything. the best model hit 52.8%. it's not much, but it's honest work.
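The shortcut-removal pass described in the post can be sketched roughly as follows. A real implementation would operate on an AnnData object's .obsm/.uns mappings; plain dicts stand in here so the sketch has no dependencies, and the leaked field names are illustrative assumptions, not scBench's actual list.

```python
# Strip precomputed state that would let an agent read the answer out of the
# snapshot instead of computing it. Slot/key names are illustrative; in real
# code these would be the AnnData .obsm and .uns mappings.
LEAKED_KEYS = {
    "obsm": {"X_pca", "X_umap"},            # embeddings the agent should derive itself
    "uns": {"pca", "neighbors", "leiden"},  # parameters/results of the target step
}

def strip_leaked_state(snapshot: dict) -> dict:
    """Return a copy of the snapshot with answer-leaking fields removed."""
    cleaned = {slot: dict(mapping) for slot, mapping in snapshot.items()}
    for slot, keys in LEAKED_KEYS.items():
        for key in keys:
            cleaned.get(slot, {}).pop(key, None)  # missing keys are fine
    return cleaned

snapshot = {
    "obsm": {"X_pca": "...", "spatial": "..."},
    "uns": {"leiden": "...", "log1p": "..."},
}
clean = strip_leaked_state(snapshot)
# legitimate preprocessing state ("spatial", "log1p") survives;
# the leaked PCA/clustering state is gone
```

The point of copying rather than mutating in place is that the original snapshot stays available for auditing which fields were stripped.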

  • LatchBio reposted this

    ask an agent to cluster a scRNA-seq dataset and report the number of clusters. what's the right answer? it depends on the resolution parameter. the algorithm. the tissue type. whether you're doing discovery or validation. two expert bioinformaticians looking at the same data will defensibly arrive at different numbers, and both could be correct. this is the grading problem when testing LLMs. in software, the test suite is the oracle. run the tests, they pass or they fail. writing good tests is not that straightforward in computational biology, which is why benchmark design for biology is a fundamentally different engineering challenge than benchmark design for code. here's what it actually takes: you have to run the analysis with many valid methods, measure the range of outputs those methods produce, and calibrate a tolerance window that accepts all of them while still rejecting genuinely wrong answers. for cell counts after QC filtering: ±50 cells on a dataset of 1.4M. for cell type distributions: ±5 percentage points per category. for gene marker lists: a recall threshold (≥0.50) instead of exact match, because an agent that finds novel markers not in the canonical set might also be correct. Leiden clustering is stochastic across random seeds. UMAP coordinates are arbitrary across library versions. if you don't account for that in your design, your eval will flip pass/fail on two correct answers. this is what "verifiable" actually requires in biology. not just deterministic grading, but deeply reasoned tolerances that encode what variation is methodological and what variation is wrong.
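The calibration procedure in the post can be sketched in a few lines: run the target analysis with several defensible methods, take the spread of their outputs, pad it, and accept any answer inside the window; grade marker lists by recall rather than exact match. All numbers and gene names below are illustrative, not scBench's actual reference values.

```python
def calibrate_window(reference_outputs, pad=0.0):
    """Tolerance window spanning the outputs of all valid methods, plus padding."""
    lo, hi = min(reference_outputs), max(reference_outputs)
    return (lo - pad, hi + pad)

def within_tolerance(answer, window):
    lo, hi = window
    return lo <= answer <= hi

def marker_recall(predicted, canonical):
    """Recall against the canonical marker set. A threshold (e.g. >= 0.50)
    instead of exact match means novel markers outside the canonical list
    are not penalized."""
    canonical = set(canonical)
    return len(set(predicted) & canonical) / len(canonical)

# cell counts after QC under three defensible filtering choices, padded +/-50
window = calibrate_window([1_399_980, 1_400_020, 1_400_003], pad=50)
ok = within_tolerance(1_400_030, window)  # inside the padded range

# two canonical markers recovered plus one novel one -> recall 2/4
recall = marker_recall(["CD3D", "CD3E", "NOVEL1"], ["CD3D", "CD3E", "CD2", "IL7R"])
```

The design choice is that the window is derived from measured method-to-method variation, not picked by hand, so it encodes exactly the variation that counts as methodological.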

  • LatchBio reposted this

    people underestimate how much of biology will be computational biology. not next year, but in the next 10 years most of biology will be in silico. in the 70s and 80s, physics went through a simulation revolution. systems that were well-understood analytically - fluid dynamics, structural mechanics - got compressed into computational models. the math was clean enough that you could write equations, discretize them, and simulate forward. it changed everything about how those fields operate. biology never got there. biological systems are too messy, too nonlinear, too context-dependent. you can't write down the equations governing cell fate decisions the way you can write navier-stokes. there was no clean mathematical framework to discretize. AI changes this. not because it solves the equations - there are no equations. but because it can learn compressed representations of imperfectly understood systems directly from data. it doesn't need a closed-form model. it learns the function from observations. we're going from "biology is too complex to simulate" to "biology is too complex to simulate analytically, but we can learn compressions from enough data and use those as the simulator." we're early. the models aren't reliable yet. but it's all rapidly falling into place. and 5 years from now, the idea that a biologist would design an experiment without a computational model predicting the end-to-end outcome will seem as strange as an aerospace engineer skipping CFD.

  • LatchBio reposted this

    SWE-bench changed how people measured coding ability. biology needs the same shift. the part i'd add: you can't build the biology version from the outside. we see what breaks because we power 30+ bioinformatics platforms for companies like AtlasXomics Inc. and CS Genetics. the QC threshold that silently fails on a specific kit, the cell typing convention that changes by tissue — you only write that into a benchmark if you've seen it happen in production. that's why we built benchmarks.bio from inside real workflows, not from papers.

    View profile for Alfredo Andere 🦖

    Co-Founder and CEO at LatchBio — The Cloud for Biology | F. 30U30

    software engineering went through this exact transition. coding benchmarks tested multiple choice: function completion, syntax tricks, isolated puzzles. models got very good at those. then SWE-bench asked: can you fix a bug in a real repository? scores dropped. everyone worried. the problems that mattered in practice weren't the problems anyone had been measuring. biology is at that inflection point now. most biology evaluations still reward textbook recall. clean inputs. one right answer. but real computational biology is messy datasets from specific machines with specific artifacts. QC decisions that change by tissue type. marker genes that mean different things across platforms. a bioinformatician doesn't pick from four options - they load data, write code, make judgment calls, and produce results over many days that drive months of downstream work. if your benchmark doesn't recreate that, you're measuring something, but it's not comp bio competence. the people building benchmarks will need to be on the ground floor next to the people doing the work. not reading about data flows - sitting inside them. knowing that a QC threshold that works on Chromium will silently produce garbage on a MissionBio Tapestri run. only then will the benchmarks be a true measure of how useful these models are to the people doing the real work. SWE has made a lot of progress in this transition, and i'm confident that with the right people behind it comp bio will too. (Pictured: if you're interested in good benchmark design for comp bio agents my co-founder and CTO Kenny Workman will be discussing some of our work at an in-person talk at UC Berkeley this coming Tuesday)

  • LatchBio reposted this

    agent (dot) bio refuses to do most types of data analyses. By design: most agents are designed to attempt everything. any dataset, any question, any assay type. the demo looks great. the results don't hold up. agents confidently return wrong biology - wrong cell types, wrong genes, wrong conclusions - and nothing in the system can flag it. So we made a design choice. narrow scope. deep competence. our agent handles single-cell and spatial transcriptomics workflows - specific platforms, assay types, and analysis steps. everything else gets refused. Biology doesn't grade on effort. a wrong cell type annotation wastes months of downstream experiments. a bad differential expression call sends a team chasing genes that aren't relevant. in this field, a confident wrong answer is more expensive than no answer. So when the exact same model (Opus 4.5) wrapped in a domain-specific harness almost doubles accuracy (38% to 62%), it's clear that what you choose not to attempt matters most. (pictured: a recent viral example of chatGPT being very confident but wrong)
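A "narrow scope, deep competence" gate like the one described can be sketched as a routing check that refuses anything outside the validated surface. The supported assay/step pairs and field names here are illustrative assumptions, not the product's actual configuration.

```python
# Requests are only accepted when the (assay, step) pair is one the harness
# has been validated on; everything else is refused with a reason.
SUPPORTED = {
    ("scRNA-seq", "normalization"),
    ("scRNA-seq", "clustering"),
    ("spatial", "cell_typing"),
}

def route_request(assay: str, step: str) -> dict:
    if (assay, step) not in SUPPORTED:
        # a confident wrong answer costs more than no answer, so refuse
        return {"status": "refused",
                "reason": f"{assay}/{step} is outside the validated scope"}
    return {"status": "accepted"}

refused = route_request("bulk RNA-seq", "differential_expression")
accepted = route_request("scRNA-seq", "clustering")
```

Keeping the supported set explicit and small makes the refusal behavior auditable: expanding scope means adding a pair to the set only after it has been validated.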

  • LatchBio reposted this

    Speaking in-person at Berkeley next Tuesday in Cory Hall. Will discuss internals of SpatialBench:
    - Our approach to verifiability with messy real-world biology data
    - Building data infrastructure for hundreds of parallel agent environments
    - Scientific behavior of frontier models analyzing thousands of trajectories
    All engineers and scientists interested in the practical details of large-scale benchmarks for agents in biology are welcome. Thanks to Arshia Nayebnazar, Aakarsh Vermani, Sarrah Rose, and Machine Learning at Berkeley for having me.

  • LatchBio reposted this

    Three benchmarks now exist for AI agents in computational biology: BioAgent Bench, BixBench, and our BenchmarksBio. All trying to answer whether agents can do real computational biology. BioAgent Bench (Entropic) tests pipeline execution. 10 end-to-end bioinformatics workflows - variant calling, metagenomics, differential expression. Can the agent install tools, process data, and produce the requested output? Top model Opus 4.5 hits 100% completion. But inject a decoy file from an unrelated organism and the agent incorporates it in 2 of 10 tasks. Add filler text to the prompt and agents complete 28% fewer of the necessary pipeline steps. Scoring uses an LLM judge - because finishing the pipeline is not the same as understanding the data - which means grading isn't fully reproducible. BixBench (FutureHouse) tests scientific reasoning. 53 real scenarios built by bioinformaticians. The agent gets data and questions (296), picks its own methods, runs analysis in a Jupyter notebook. Best open-answer accuracy at time of evaluation is 17% by Sonnet 3.5. They grade open answers with an LLM judge + some numeric/range verifiers. When an "I don't know" option is added, models perform close to random. scBench + SpatialBench (benchmarks.bio) tests step-level accuracy. 540 problems from real workflows across 11 single-cell and spatial platforms. Deterministic grading - no LLM scoring - makes it fully reproducible. Best model is Opus 4.6 with 52.8% on scRNA-seq and Opus 4.5 with 38.4% on spatial. Easier tasks like normalization are approaching reliability (best model ~84%) while even the best model only reaches 48% on cell typing and 41% on differential expression - the stages where scientific judgment matters most. What everyone found: Agents can orchestrate tools but can't yet be trusted with scientific conclusions. Performance depends heavily on what technology the data came from, not just which model you use. And open-weight models lag behind closed ones - which matters when your data can't leave the building. BixBench: https://lnkd.in/dxhac6Wy BioAgent Bench: https://lnkd.in/dh2s2CBA Benchmarks.bio: https://lnkd.in/dAHmKFqK + https://lnkd.in/dWCvM6EE

  • LatchBio reposted this

    It's not "which model is best." It's what's wrapped around the model. We ran the exact same model — Opus 4.5 — on 146 real spatial biology analysis problems under three different agent harnesses. Same data. Same problems. Same model. The accuracy swing was massive. • Base harness: 38% • Claude Code: 48% • LatchBio agent: 62% That's a 23-point gap from engineering alone. For context, the gap between the best and worst frontier model on SpatialBench is 18 points. The scaffolding around the model moved the needle as much as swapping the model entirely. Where it matters most: the hard stuff. Clustering went from 33% to 66%. Differential expression from 37% to 64%. The tasks that require real scientific judgment — not just writing code — are exactly where harness design has the biggest effect. What is a harness? The tools, prompts, control flow, and execution environment that wrap a base model. Most people treat this as glue code. Our data says it can matter as much as the model itself. This doesn't mean models don't matter. They do. But progress requires joint optimization of both model and harness. Our data suggests harnesses should be treated as first-class objects in engineering and benchmarking — reported as rigorously as model versions. If you're evaluating AI agents for computational biology and only comparing base models, you're measuring the wrong thing.
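Treating the harness as a first-class, reported object, as the post argues, could look like an eval record that pins harness identity and version alongside the model. The field names and values below are illustrative, not an actual SpatialBench schema.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class HarnessSpec:
    """Identity of the scaffolding around the model, reported like a model version."""
    name: str               # e.g. "base", "claude-code", "latchbio-agent"
    version: str
    tools: tuple            # tool surface exposed to the model
    system_prompt_hash: str # pins the prompt without embedding it

@dataclass(frozen=True)
class EvalRecord:
    model: str
    harness: HarnessSpec
    benchmark: str
    accuracy: float

run = EvalRecord(
    model="opus-4.5",
    harness=HarnessSpec("latchbio-agent", "1.2.0", ("run_code", "read_file"), "ab12cd"),
    benchmark="SpatialBench",
    accuracy=0.62,
)
record = asdict(run)  # serializable, so harness and model are reported together
```

Hashing the system prompt rather than storing it keeps the record compact while still making two runs with different prompts distinguishable.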

  • LatchBio reposted this

    You thought we'd stop at Spatial? Here's something wild: the sequencing platform you choose affects AI accuracy more than the model itself. We just released scBench - 394 real scRNA-seq analysis problems across 6 platforms (BD, 10x Genomics, CS Genetics, Illumina, Mission Bio, Parse Biosciences). The accuracy gap between platforms? 33 points. The gap between best and worst frontier models? 24 points. Translation: models basically memorized Scanpy tutorials. Put them on an underrepresented platform and they collapse. This matters because scRNA-seq is the dominant assay in modern biology. Way more adoption than spatial, way more public data. If we want agents that can actually do computational biology, this is the test. Current state: Opus 4.6 hits 53% accuracy. Better than spatial (38%), but still failing roughly every other task. Procedural stuff like normalization? Getting there (70%). But judgment calls - cell typing (35%), differential expression (27%) - that's where models break down. Turns out writing code ≠ doing science.

