
72.2% issue resolution on SWE-bench Verified — #1 among GPT-5–based systems.

Top 10 Principles Enterprises Need When Building AI Agent Systems (Jul 2026)

The ten most important security principles for enterprise AI agent systems in 2026 — a 10-minute overview of Anthropic's Zero Trust framework you can use to audit your own deployment.

Jul 2, 2026 · 10 min read

Claude Code MCP Exploit: Installing and Running Third-Party CLI (Jun 2026)

A malicious MCP server makes Claude Code (Opus 4.8) brew install a third-party CLI — bypassing the harness with three combined prompt-injection techniques.

Jun 29, 2026 · 8 min read

The MCP Attack Surface: Top-20 Documented Attacks (2026)

What attackers actually do once an MCP server is in your agent — supply chain, tool poisoning, indirect prompt injection, server CVEs, config-pivot — with the canonical PoCs and CVEs for each class.

Jun 24, 2026 · 12 min read

AI Agent Runtimes, explained in 5 minutes

What an AI agent runtime is, what services it provides, and how it differs from a harness — a quick tour from prompt to production.

Jun 19, 2026 · 5 min read

Best open-source LLMs of 2026: 6 picks ranked by benchmarks + Reddit

GLM-5.2, Kimi K2.7, DeepSeek V4 Flash, Qwen 3.6, Gemma 4, MiniMax M3 — ranked by independent benchmarks and r/LocalLLaMA community sentiment. Practical picks for self-hosting in 2026.

Jun 17, 2026 · 10 min read

Top 3 AI Agent Security Papers from CAIS 2026 (Out of 12 Reviewed)

12 AI agent security papers from CAIS 2026 — ACM's first conference on agentic systems. Here are the 3 every agent team should read this quarter.

Jun 10, 2026 · 5 min read

Zero-trust overlay networks for AI agent isolation

Default cluster networking lets any pod dial every database, internal API, and other pod in the same VPC — the exfil path AI agents turn into incidents. A zero-trust overlay makes every dial an identity decision instead. The SSRF exploit pattern, and how Agyn wires the alternative in.

Jun 6, 2026 · 5 min read

How to Hide .env and API Keys from Claude Code, Cursor & Codex CLI

Claude Code, Cursor, and Codex CLI can read your .env. Two patterns actually stop them: short-lived credentials and credential brokering. CVEs included.

Jun 5, 2026 · 7 min read

How to Sandbox an AI Agent: Filesystem & Network Isolation Patterns

How to isolate an AI agent: filesystem patterns (containers, VMs, chroot) and network egress controls. What each technique buys you — and what it doesn't.

Jun 4, 2026 · 11 min read

AGENTS.md vs CLAUDE.md: Does Claude Code or Codex Read Both?

Claude Code reads CLAUDE.md, Codex reads AGENTS.md, and neither falls back to the other. Here's the full compatibility map + a one-file setup for both.

Jun 3, 2026 · 6 min read

Best AI Agent Runtime for Production: 7 Platforms Compared (2026)

We scored 7 AI agent runtimes on production-readiness — self-hosting, MCP isolation, credentials, zero-trust. The winner, and why none scored above 3.75/7.

May 27, 2026 · 18 min read

Introducing Agyn: open-source Kubernetes runtime for AI agents

Shipping the new Agyn: a Kubernetes-native runtime for AI agents, with isolation, observability, and access controls built in. The control plane enterprises need to safely run thousands of different agents inside their own infrastructure.

May 20, 2026 · 7 min read

Is AI Teaching Itself? Recursive Self-Improvement in 2026

Where AI self-improvement actually stands in 2026: frontier agents at 23% of human performance, reward-hacking, and why the moat moved to the harness.

May 13, 2026 · 6 min read

Context-Activated Memory for Claude Code Agents

Claude Code’s built-in memory resets every session and doesn’t scale well. We built a context-activated retrieval layer instead. It uses a dedicated LLM to surface stored notes only when they’re relevant, not upfront. Under the hood, it runs a map-reduce process over memory chunks with automatic hook injection.

Apr 1, 2026 · 10 min read

Why isolated sandboxes are a hard requirement for AI agents

Running AI agents on real codebases without proper isolation leads to file collisions, secret leakage, and non-reproducible failures. Isolation isn't an optimization — it's a prerequisite.

Feb 21, 2026 · 6 min read

What is SWE-bench Verified? (And How an AI Team Topped It)

SWE-bench Verified is the 500-task human-validated AI coding benchmark. Here's what it tests, current top scores, and how our AI team performed.

Feb 12, 2026 · 5 min read

gh pr-review: LLM-friendly PR review workflows in your CLI

A GitHub CLI extension that returns compact, deterministic JSON for PR reviews: single-command aggregation with filters, replies, resolutions, and submissions, reducing token overhead and error-prone tool chains.

Dec 3, 2025 · 10 min read

Autonomous Software Engineer (A‑SWE): Scaling Beyond the Demo

A‑SWE reaches production when approvals, reproducible workspaces, and replayable timelines are in place—so leaders can trust outcomes, audit decisions, and scale.

Oct 23, 2025 · 11 min read

How we built a small Pexels CLI (and the aarch64 cross-build trap we escaped)

A tiny Rust CLI that speaks the Pexels API, and the practical fix for aarch64 cross-builds on GitHub Actions.

Oct 23, 2025 · 4 min read

What 2,800+ Claude Code issues reveal about AI dev tools teams actually use

We analyzed 2,800+ Claude Code issues. Here are four themes that separate demos from durable AI dev tools—plus concrete wins teams can ship now.

Oct 22, 2025 · 14 min read

Multi‑Agent Orchestration: Patterns That Actually Work

Reliable multi‑agent systems use roles, handoffs, SLAs, and approvals—turning planner/executor/reviewer patterns into predictable missions teams can operate.

Oct 21, 2025 · 12 min read

Agentic AI: From Demos to Durable Engineering

Agentic AI creates durable value when it moves beyond demos into an org-first control plane with orchestration, governance, and observability that teams can operate.

Oct 19, 2025 · 11 min read

What 1,000+ Codex CLI issues reveal about AI dev tools that teams actually use

We analyzed 1,000+ Codex CLI issues. Here are 10 product themes that separate hobby projects from production-ready AI dev tools—plus concrete wins to deliver now.

Oct 17, 2025 · 13 min read