Blog
Company Updates & Technology Articles
November 20, 2025
Agentex Tutorial: How to Build and Scale Long-Running Enterprise Agents
Earlier this week, we open-sourced Agentex to enable long-running enterprise agents. Today, we’re releasing a tutorial we created with Temporal that shows how to build a long-running procurement agent. It’s a concrete example of an agent that manages extended workflows, responds to external signals, and escalates to humans only when needed.
November 19, 2025
The Limits of Data Filtering in Bio-Foundation Models
In collaboration with Princeton University, UMD, SecureBio, and the Center for AI Safety, we introduce BioRiskEval, the first comprehensive framework for assessing dual-use risks in open-weight bio-foundation models. Our stress tests on the Evo 2 model reveal a critical vulnerability: dangerous knowledge removed via data filtering often persists in hidden layers or can be rapidly restored with minimal compute. These findings challenge the reliance on simple data curation and underscore the urgent need for "defense-in-depth" strategies to secure the future of biological AI.
November 18, 2025
What Enterprises Can Learn from Public GenAI Failures | Human in the Loop Episode 15
Today on the podcast, the team discusses what happens when enterprise GenAI goes wrong, digging into recent public AI failures, reviewing the impact of each, and asking whether they could have been prevented, and if so, how.
November 18, 2025
Investing in the People Behind Reliable AI
Scale AI is strengthening its commitment to the contributors behind Outlier, investing in improvements that make work more consistent, transparent, and rewarding.
November 14, 2025
Breaking Out of the Lab: Testing AI in Professional Domains
AI excels on academic tests, but it fails at real professional jobs. That's the stark finding from PRBench, our new benchmark series designed to move AI testing out of the lab and into the real world. We're launching the series with two of the most complex domains: Law and Finance. Using 1,100 high-stakes tasks sourced from 182 professionals, we tested how today's frontier models handle the nuanced reasoning that defines these fields. While models are great at following instructions, they fall short on the expert judgment, auditable reasoning, and deep diligence required for tasks with real economic consequences.
November 13, 2025
Introducing Agentex: Open-Source Infrastructure for Enterprise AI Agents
We are open-sourcing the agentic infrastructure layer in Scale GenAI Platform: Agentex. Our Enterprise team sits down to demo Agentex and share how it’s used across our enterprise customers today. We also dive into our decision to open-source and our hopes for collaborating with the community.
November 7, 2025
Beyond "Out-of-the-Box": Why Enterprises Need Specialized RL Agents
While general-purpose AI models are powerful, they often fail to deliver on complex, specialized enterprise workflows that use private data. We share results from our real-world work in the insurance and legal industries, highlighting how our RL-tuned agents outperformed leading LLMs, and dive into how we achieved these performance gains.
November 5, 2025
Expanding Our Presence with New Offices Around the World
Scale AI is expanding offices in New York, London, Washington D.C., and St. Louis to support growth, innovation, and reliable AI development worldwide.
October 29, 2025
Why I Joined Scale: Building the Applications for Saudi Arabia's AI Future
Talal AlBakr joins Scale AI to build production-ready AI applications that power Saudi Arabia’s Vision 2030.
October 29, 2025
The Remote Labor Index: Measuring the Automation of Work
Can AI actually automate complex, professional jobs? The new Remote Labor Index (RLI) from Scale and the Center for AI Safety (CAIS) provides the first data-driven answer. By testing AI agents against 240 real-world, paid freelance projects, the RLI found that the best-performing agents could successfully automate only 2.5% of them. This new benchmark reveals a critical gap between AI's generative skill and the end-to-end reliability required for professional work, showing that AI's immediate impact is augmentation, not mass automation.