Blog
Company Updates & Technology Articles
November 20, 2025
Agentex Tutorial: How to Build and Scale Long-Running Enterprise Agents
Earlier this week, we open-sourced Agentex to enable long-running enterprise agents. Today, we’re releasing a tutorial we created with Temporal that shows how to build a long-running procurement agent. It’s a concrete example of an agent that manages extended workflows, responds to external signals, and escalates to humans only when needed.
November 19, 2025
The Limits of Data Filtering in Bio-Foundation Models
In collaboration with Princeton University, UMD, SecureBio, and the Center for AI Safety, we introduce BioRiskEval, the first comprehensive framework for assessing dual-use risks in open-weight bio-foundation models. Our stress tests on the Evo 2 model reveal a critical vulnerability: dangerous knowledge removed via data filtering often persists in hidden layers or can be rapidly restored with minimal compute. These findings challenge the reliance on simple data curation and underscore the urgent need for "defense-in-depth" strategies to secure the future of biological AI.
November 18, 2025
What Enterprises Can Learn from Public GenAI Failures | Human in the Loop Episode 15
Today on the podcast, the team discusses what happens when enterprise GenAI goes wrong, digging into recent public AI failures, reviewing the impact of each, and asking whether they could have been prevented, and if so, how.
November 18, 2025
Investing in the People Behind Reliable AI
Scale AI is strengthening its commitment to the contributors behind Outlier, investing in improvements that make work more consistent, transparent, and rewarding.
November 14, 2025
Breaking Out of the Lab: Testing AI in Professional Domains
AI excels on academic tests, but it fails at real professional jobs. That's the stark finding from PRBench, our new benchmark series designed to move AI testing out of the lab and into the real world. We're launching the series with two of the most complex domains: Law and Finance. Using 1,100 high-stakes tasks sourced from 182 professionals, we tested how today's frontier models handle the nuanced reasoning that defines these fields. While models are great at following instructions, they fall short on the expert judgment, auditable reasoning, and deep diligence required for tasks with real economic consequences.
November 13, 2025
Introducing Agentex: Open-Source Infrastructure for Enterprise AI Agents
We are open-sourcing the agentic infrastructure layer in Scale GenAI Platform: Agentex. Our Enterprise team sits down to demo Agentex and share how it’s used across our enterprise customers today. We also dive into our decision to open-source and our hopes for collaborating with the community.
November 7, 2025
Beyond "Out-of-the-Box": Why Enterprises Need Specialized RL Agents
While general-purpose AI models are powerful, they often fail to deliver on complex, specialized enterprise workflows that use private data. We share results from our real-world work in the insurance and legal industries, highlighting how our RL-tuned agents outperformed leading LLMs, and dive into how we achieved these performance gains.
November 5, 2025
Expanding Our Presence with New Offices Around the World
Scale AI is expanding offices in New York, London, Washington D.C., and St. Louis to support growth, innovation, and reliable AI development worldwide.
October 29, 2025
Why I Joined Scale: Building the Applications for Saudi Arabia's AI Future
Talal AlBakr joins Scale AI to build production-ready AI applications that power Saudi Arabia’s Vision 2030.
October 29, 2025
The Remote Labor Index: Measuring the Automation of Work
Can AI actually automate complex, professional jobs? The new Remote Labor Index (RLI) from Scale and the Center for AI Safety (CAIS) provides the first data-driven answer. By testing AI agents against 240 real-world, paid freelance projects, the RLI found that the best-performing agents could successfully automate only 2.5% of them. This new benchmark reveals a critical gap between AI's generative skill and the end-to-end reliability required for professional work, showing that AI's immediate impact is augmentation, not mass automation.