<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem Core</title>
    <description>The most recent posts from the home feed on Forem Core.</description>
    <link>https://core.forem.com</link>
    <atom:link rel="self" type="application/rss+xml" href="https://core.forem.com/feed"/>
    <language>en</language>
    <item>
      <title>Give Your AI Agent iMessage in 5 Minutes — Claude Code, Codex, Cursor</title>
      <dc:creator>Emre Sarbak</dc:creator>
      <pubDate>Tue, 07 Apr 2026 00:01:30 +0000</pubDate>
      <link>https://core.forem.com/emresarbak/give-your-ai-agent-imessage-in-5-minutes-claude-code-codex-cursor-387l</link>
      <guid>https://core.forem.com/emresarbak/give-your-ai-agent-imessage-in-5-minutes-claude-code-codex-cursor-387l</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills add emotion-machine-org/imessage-with-no-mac
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That one command gives your AI agent iMessage, RCS, and SMS. It works in Claude Code, Codex, Cursor, Gemini CLI, Windsurf, GitHub Copilot, and 20+ other AI coding agents.&lt;/p&gt;

&lt;p&gt;No Mac. No phone hardware. No webhook server.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://clawmessenger.com" rel="noopener noreferrer"&gt;Claw Messenger&lt;/a&gt; is a managed API that gives AI agents a real phone number for iMessage (blue bubbles), RCS, and SMS. You get a dedicated number, WebSocket connection for real-time messaging, and full iMessage features like tapbacks, read receipts, and media.&lt;/p&gt;

&lt;p&gt;The Agent Skill we just published teaches any compatible AI agent how to set up and use Claw Messenger. The skill follows the &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;Agent Skills spec&lt;/a&gt;, which means it works across platforms without modification.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo: zero to first message
&lt;/h2&gt;

&lt;p&gt;Here is what the flow looks like in Claude Code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Install the skill&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills add emotion-machine-org/imessage-with-no-mac
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The skill is now available in your agent's context. It loads automatically when you ask about messaging, iMessage, SMS, or phone numbers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Ask your agent to set up messaging&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; Set up iMessage for my agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The skill walks your agent through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Signing up at clawmessenger.com&lt;/li&gt;
&lt;li&gt;Getting an API key (&lt;code&gt;cm_live_*&lt;/code&gt;) from the dashboard&lt;/li&gt;
&lt;li&gt;Connecting via WebSocket to &lt;code&gt;wss://claw-messenger.onrender.com&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Configuring preferred service (iMessage, RCS, or SMS)&lt;/li&gt;
&lt;/ul&gt;
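
&lt;p&gt;Under the hood this is a plain WebSocket session. Here is a minimal TypeScript sketch (Node 22+ ships a global &lt;code&gt;WebSocket&lt;/code&gt;); the auth and send frame shapes are illustrative guesses, not the documented schema, so check the API docs for the real payloads:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Frame shapes below are illustrative, not the documented schema.
type SendFrame = { type: string; to: string; body: string; service: string };

// Build the hypothetical "send a message" frame.
function buildSendFrame(to: string, body: string, service: string): string {
  const frame: SendFrame = { type: "send", to: to, body: body, service: service };
  return JSON.stringify(frame);
}

// Connect, authenticate with the cm_live_* key, and send one message.
function sendTestMessage(apiKey: string): void {
  const ws = new WebSocket("wss://claw-messenger.onrender.com");
  ws.addEventListener("open", function () {
    ws.send(JSON.stringify({ type: "auth", apiKey: apiKey })); // hypothetical auth frame
    ws.send(buildSendFrame("+15551234567", "Hello from my agent", "imessage"));
  });
  ws.addEventListener("message", function (event) {
    console.log("server:", event.data); // receipts, tapbacks, inbound messages
  });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;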

&lt;p&gt;&lt;strong&gt;3. Send a test message&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; Send a test iMessage to +15551234567
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your agent connects, authenticates, and sends the message. The recipient sees a standard iMessage from your dedicated number.&lt;/p&gt;

&lt;p&gt;The whole process takes under 5 minutes. Most of that time is account creation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why an Agent Skill
&lt;/h2&gt;

&lt;p&gt;Agent Skills are the native way AI coding agents discover and learn new capabilities. Instead of copy-pasting API docs into your prompt, the skill loads the right instructions at the right time.&lt;/p&gt;

&lt;p&gt;The skill uses progressive disclosure: the agent sees a lightweight summary (~100 tokens) when scanning available skills, then loads full instructions only when messaging is relevant to the task. This keeps your context window clean.&lt;/p&gt;
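
&lt;p&gt;Concretely, a skill is a folder with a &lt;code&gt;SKILL.md&lt;/code&gt;: the frontmatter is the lightweight summary agents scan, and the body is loaded only when relevant. A sketch (the values here are illustrative, not the published skill's actual contents):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
name: imessage-with-no-mac
description: Send and receive iMessage, RCS, and SMS through Claw Messenger.
---

Full setup and usage instructions live here and only enter the
context window when the task involves messaging.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;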

&lt;p&gt;Since the spec is cross-platform, one skill definition works everywhere. We tested on Claude Code, Codex, Cursor, Gemini CLI, Antigravity, OpenCode, and others. The install command is the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it compares
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Claw Messenger&lt;/th&gt;
&lt;th&gt;Sendblue&lt;/th&gt;
&lt;th&gt;Blooio&lt;/th&gt;
&lt;th&gt;BlueBubbles&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Price&lt;/td&gt;
&lt;td&gt;$5/mo&lt;/td&gt;
&lt;td&gt;$100/mo&lt;/td&gt;
&lt;td&gt;$39/mo&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mac required&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;iMessage&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RCS&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SMS&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WebSocket API&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dedicated number&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (uses your number)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent Skill&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Media support&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sendblue is solid but 20x the price. Blooio sits in the middle. BlueBubbles is free but requires a Mac running 24/7, which defeats the purpose if your agent runs on a VPS or in Docker.&lt;/p&gt;

&lt;p&gt;Claw Messenger is the only option with a published Agent Skill and RCS support.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plan&lt;/th&gt;
&lt;th&gt;Messages/mo&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Base&lt;/td&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;td&gt;$5/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plus&lt;/td&gt;
&lt;td&gt;6,000&lt;/td&gt;
&lt;td&gt;$25/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pro&lt;/td&gt;
&lt;td&gt;15,000&lt;/td&gt;
&lt;td&gt;$50/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All plans include iMessage, RCS, SMS, WebSocket API, and a dedicated phone number.&lt;/p&gt;

&lt;h2&gt;
  
  
  Supported platforms
&lt;/h2&gt;

&lt;p&gt;The skill works on any platform that supports the Agent Skills spec:&lt;/p&gt;

&lt;p&gt;Claude Code, Codex, Cursor, Gemini CLI, Windsurf, GitHub Copilot, Antigravity, OpenCode, Cline, Aider, Continue, Roo Code, Trae, Kilo Code, and others. The full list of 26+ compatible agents is at &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;agentskills.io&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Install&lt;/strong&gt;: &lt;code&gt;npx skills add emotion-machine-org/imessage-with-no-mac&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/emotion-machine-org/imessage-with-no-mac" rel="noopener noreferrer"&gt;emotion-machine-org/imessage-with-no-mac&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboard&lt;/strong&gt;: &lt;a href="https://clawmessenger.com/dashboard" rel="noopener noreferrer"&gt;clawmessenger.com/dashboard&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API docs&lt;/strong&gt;: &lt;a href="https://clawmessenger.com/llms.txt" rel="noopener noreferrer"&gt;clawmessenger.com/llms.txt&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Skills spec&lt;/strong&gt;: &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;agentskills.io&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>imessage</category>
      <category>ai</category>
      <category>agents</category>
      <category>claude</category>
    </item>
    <item>
      <title>Self-Improving Python Scripts with LLMs: My Journey</title>
      <dc:creator>RTT Enjoy</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:57:46 +0000</pubDate>
      <link>https://core.forem.com/rtt_enjoy_321ecb2d475c379/self-improving-python-scripts-with-llms-my-journey-5bh0</link>
      <guid>https://core.forem.com/rtt_enjoy_321ecb2d475c379/self-improving-python-scripts-with-llms-my-journey-5bh0</guid>
      <description>&lt;p&gt;As a developer, I've always been fascinated by the idea of self-improving code. Recently, I've been experimenting with using Large Language Models (LLMs) to make my Python scripts more autonomous. In this article, I'll share my experience integrating LLMs into my Python scripts, how the scripts have improved over time, and a step-by-step guide for getting started.&lt;/p&gt;

&lt;p&gt;My journey began with the &lt;code&gt;llm_groq&lt;/code&gt; module, which provides a simple interface for interacting with LLMs. I started by using it to generate new code based on existing code snippets. The idea was to create a script that could learn from its own codebase and generate new features or improvements. The first challenge was figuring out how to integrate the &lt;code&gt;llm_groq&lt;/code&gt; module into my existing Python scripts. After some trial and error, I settled on a simple workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Code analysis&lt;/strong&gt;: use the &lt;code&gt;ast&lt;/code&gt; module to parse the script and extract function names, variable names, and code structure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM input&lt;/strong&gt;: turn the extracted information into prompts, such as asking the LLM to generate a new function that takes a specific set of inputs and returns a certain output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM generation&lt;/strong&gt;: send the prompts to the LLM through &lt;code&gt;llm_groq&lt;/code&gt; and collect the generated code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code review&lt;/strong&gt;: review the generated code to make sure it meets requirements and is free of errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code integration&lt;/strong&gt;: merge the generated code into the existing script, then repeat the cycle.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To demonstrate the workflow, suppose we have a Python script that generates random numbers, and we want the LLM to produce a new function that calculates their average:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import llm_groq
import ast

# Define the input prompt
prompt = 'Generate a function that calculates the average of a list of numbers.'

# Define the existing code
code = '''import random

def generate_numbers(n):
    return [random.randint(0, 100) for _ in range(n)]'''

# Parse the existing code and extract the defined function names
tree = ast.parse(code)
functions = [node.name for node in tree.body if isinstance(node, ast.FunctionDef)]

# Create the LLM input and generate the new code
input_dict = {'prompt': prompt, 'functions': functions}
llm = llm_groq.LLM()
new_code = llm.generate_code(input_dict)

# Print the generated code
print(new_code)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here, &lt;code&gt;llm_groq&lt;/code&gt; generates a new function called &lt;code&gt;calculate_average&lt;/code&gt; that takes a list of numbers and returns their average, and the generated code is printed to the console.&lt;/p&gt;

&lt;p&gt;Over time, I've seen significant improvements in my Python scripts: the LLM has generated new features, improved existing code, and even fixed bugs. There have been challenges, though. The LLM sometimes generates code that is not optimal or efficient, so I've had to add checks to ensure the output meets my requirements. There's also the risk of over-reliance: as the LLM generates more and more code, it's easy to lose sight of what's happening under the hood. To mitigate this, I maintain a clear understanding of the codebase and regularly review the generated code.&lt;/p&gt;

&lt;p&gt;In conclusion, using LLMs to make Python scripts improve themselves has been a game-changer for me. While there are challenges to overcome, the benefits of autonomous code improvement far outweigh the costs. If you're interested in exploring this technology, start with the &lt;code&gt;llm_groq&lt;/code&gt; module and experiment with different workflows and use cases. With the right approach, you can create self-improving Python scripts that learn and adapt over time.&lt;/p&gt;

</description>
      <category>python</category>
      <category>llms</category>
      <category>ai</category>
      <category>automation</category>
    </item>
    <item>
      <title>How to Use Replicate the Right Way in Your Next.js App (And Ship a Real Product With It)</title>
      <dc:creator>Lucas Santos Rodrigues</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:52:21 +0000</pubDate>
      <link>https://core.forem.com/lusrodri/how-to-use-replicate-the-right-way-in-your-nextjs-app-and-ship-a-real-product-with-it-38dg</link>
      <guid>https://core.forem.com/lusrodri/how-to-use-replicate-the-right-way-in-your-nextjs-app-and-ship-a-real-product-with-it-38dg</guid>
      <description>&lt;p&gt;Most tutorials show you how to &lt;em&gt;call&lt;/em&gt; Replicate. Few show you how to &lt;em&gt;use it well&lt;/em&gt; inside a real production app. This article covers the mistakes I made and the patterns that actually work — using &lt;a href="https://goodbyewatermark.com" rel="noopener noreferrer"&gt;Goodbye Watermark&lt;/a&gt; as a real-world case study.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Replicate, Really?
&lt;/h2&gt;

&lt;p&gt;Replicate is a cloud API that lets you run AI models — image generation, video, audio, vision — without owning a single GPU. You send an HTTP request, a model runs on their infrastructure, and you get the result back.&lt;/p&gt;

&lt;p&gt;The business model is pay-per-prediction: you're charged for the time the model actually runs, not idle time. That means cold boots don't affect your cost — only your latency.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Understand the Prediction Lifecycle Before Writing Any Code
&lt;/h2&gt;

&lt;p&gt;Every Replicate call creates a &lt;strong&gt;prediction&lt;/strong&gt; — an object with a lifecycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;starting → processing → succeeded (or failed / canceled)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;starting&lt;/code&gt;: model is booting (cold start happens here)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;processing&lt;/code&gt;: &lt;code&gt;predict()&lt;/code&gt; is actively running&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;succeeded&lt;/code&gt;: output is ready — but &lt;strong&gt;files are deleted after 1 hour&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point is critical. If you're not saving outputs immediately, you'll lose them. More on that below.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Polling vs. Webhooks: Choose the Right Strategy
&lt;/h2&gt;

&lt;p&gt;Replicate gives you three ways to handle async predictions:&lt;/p&gt;

&lt;h3&gt;
  
  
  Polling (simplest, fine for most apps)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Create the prediction&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;replicate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;owner/model-name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;imageUrl&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Poll until done&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;succeeded&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;failed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;replicate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Works well for short-lived predictions (under ~15s). Simple to implement. The tradeoff: you're making repeated requests even when nothing has changed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Webhooks (better for longer or background tasks)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;replicate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;owner/model-name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;imageUrl&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;webhook&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;VERCEL_URL&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/api/webhooks`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;webhook_events_filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;completed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="c1"&gt;// only fire when done&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replicate POSTs to your URL when the prediction finishes. No polling loop. If there are network issues, they retry automatically.&lt;/p&gt;

&lt;p&gt;Use webhooks when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predictions take more than ~10-15 seconds&lt;/li&gt;
&lt;li&gt;You want to persist results to a database&lt;/li&gt;
&lt;li&gt;You're building background processing flows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Add query params to your webhook URL to carry context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;https://yourapp.com/api/webhooks?userId=abc123&amp;amp;predictionType=watermark
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
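
&lt;p&gt;On the receiving side, the webhook is a POST whose body is the prediction object. A minimal sketch of a Next.js App Router handler (the file path and the save step are placeholders to adapt to your own storage):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// app/api/webhooks/route.ts (path illustrative)
export async function POST(request: Request) {
  const prediction = await request.json();

  // Context you attached as query params when creating the prediction
  const url = new URL(request.url);
  const userId = url.searchParams.get("userId");

  if (prediction.status === "succeeded") {
    // Persist the output now: the files behind these URLs expire in 1 hour.
    console.log("prediction done for", userId, prediction.output);
  }

  // Respond 200 quickly so Replicate stops retrying.
  return new Response("ok", { status: 200 });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;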



&lt;h3&gt;
  
  
  When to use each
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fast model, UX waits for result&lt;/td&gt;
&lt;td&gt;Polling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slow model, fire and notify&lt;/td&gt;
&lt;td&gt;Webhooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Background job, store to DB&lt;/td&gt;
&lt;td&gt;Webhooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quick prototype&lt;/td&gt;
&lt;td&gt;Polling&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  3. Cold Starts Are Real — Here's How to Handle Them
&lt;/h2&gt;

&lt;p&gt;When a model hasn't been used recently, it needs to "boot up." This can add several seconds of latency on the first request after idle time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For casual traffic:&lt;/strong&gt; Cold boots are fine. You only pay for actual compute, not boot time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For production apps with consistent traffic:&lt;/strong&gt; Use a &lt;strong&gt;Deployment&lt;/strong&gt; with &lt;code&gt;minInstances: 1&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Via the Replicate dashboard or API:&lt;/span&gt;
&lt;span class="c1"&gt;// Create a deployment for your model with min_instances = 1&lt;/span&gt;
&lt;span class="c1"&gt;// This keeps the model warm 24/7&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This costs more (you're paying to keep the instance warm) but eliminates cold start latency entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Goodbye Watermark&lt;/strong&gt;, I don't use a deployment because the traffic is spread across the day and a few seconds of latency on first boot is acceptable. But if you're building something with strict SLA requirements — use deployments.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Save Outputs Immediately — They Expire in 1 Hour
&lt;/h2&gt;

&lt;p&gt;This is the gotcha that trips up everyone:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Input and output files are automatically deleted after 1 hour for any predictions created through the API.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If your app doesn't save the result right after &lt;code&gt;succeeded&lt;/code&gt;, it's gone. Your options:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A: Stream back to the client immediately&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Next.js API route&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;GET&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;replicate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;owner/model&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// stream back to client&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Option B: Save to your own storage (Supabase Storage, S3, etc.)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;replicate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;owner/model&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt; &lt;span class="c1"&gt;// download from Replicate&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arrayBuffer&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;supabase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;outputs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;upload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.png`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Goodbye Watermark, I stream the result directly back to the client. The user downloads it immediately. No storage needed, no expiry problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Next.js Config: Don't Forget This
&lt;/h2&gt;

&lt;p&gt;If you're displaying output images from Replicate in a Next.js &lt;code&gt;&amp;lt;Image&amp;gt;&lt;/code&gt; component, add this to your config or you'll get a domain error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// next.config.ts&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;nextConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;images&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;remotePatterns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;replicate.delivery&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;*.replicate.delivery&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Small thing, but it will bite you in production.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Error Handling That Doesn't Suck
&lt;/h2&gt;

&lt;p&gt;Real-world Replicate usage needs to handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network timeouts&lt;/li&gt;
&lt;li&gt;Model errors (bad input format, unsupported file type)&lt;/li&gt;
&lt;li&gt;Rate limits (429)&lt;/li&gt;
&lt;li&gt;Prediction timeouts (30 min hard cap)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;replicate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;NextResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// poll with timeout safety&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;deadline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// 60s max wait&lt;/span&gt;

  &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;succeeded&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;failed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;deadline&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;NextResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Prediction timed out&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;504&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1500&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;replicate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;failed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;NextResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Model failed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;NextResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;NextResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Unexpected error&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set your own deadline. Replicate's hard limit is 30 minutes, but your users don't want to wait more than ~60 seconds for most tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Rate Limits to Know
&lt;/h2&gt;

&lt;p&gt;From Replicate's docs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Create prediction:&lt;/strong&gt; 600 requests/minute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All other endpoints:&lt;/strong&gt; 3000 requests/minute&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most indie apps, you won't hit these. If you do, they return a &lt;code&gt;429&lt;/code&gt; — build retry logic with exponential backoff.&lt;/p&gt;
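A minimal backoff wrapper might look like the sketch below. This is illustrative, not part of the Replicate SDK; the helper name and the shape of the thrown error (a `status` field) are assumptions.

```typescript
// Hypothetical retry helper (not part of the Replicate SDK):
// retries a call on HTTP 429 with exponential backoff plus jitter.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 4,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const status = err?.response?.status ?? err?.status;
      // Only retry rate limits; let every other error propagate immediately.
      if (status !== 429 || attempt >= maxRetries) throw err;
      // 1s, 2s, 4s, 8s... with up to 25% jitter to avoid thundering herds.
      const delay = baseDelayMs * 2 ** attempt * (1 + Math.random() * 0.25);
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```

Then wrap any Replicate call, e.g. `withBackoff(() => replicate.predictions.create({ ... }))`.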




&lt;h2&gt;
  
  
  8. Choosing the Right Model
&lt;/h2&gt;

&lt;p&gt;Replicate hosts thousands of models. Two categories matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Official models&lt;/strong&gt; — maintained by Replicate, always warm, stable API, predictable per-output pricing. Best for production use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Community models&lt;/strong&gt; — more variety, charged by compute time, may have cold starts, API can change between versions.&lt;/p&gt;

&lt;p&gt;For Goodbye Watermark, I use the &lt;strong&gt;Qwen model&lt;/strong&gt; for watermark removal. The choice came down to output quality and how well it handled semi-transparent watermarks — which are significantly harder than solid text watermarks. Testing a few models on realistic samples before committing to one is worth the extra hour.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real-World Case Study: Goodbye Watermark
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://goodbyewatermark.com" rel="noopener noreferrer"&gt;Goodbye Watermark&lt;/a&gt; is an AI watermark removal tool built with Next.js + Replicate + Vercel. The full stack is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend:&lt;/strong&gt; Next.js + Tailwind CSS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI:&lt;/strong&gt; Replicate (Qwen model)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hosting:&lt;/strong&gt; Vercel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Payments:&lt;/strong&gt; Stripe (two credit tiers)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The entire MVP was built in ~1 hour. The hardest part wasn't the UI — it was getting consistent output quality from the model across different watermark types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Current results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~150 weekly organic users&lt;/li&gt;
&lt;li&gt;$0 paid acquisition&lt;/li&gt;
&lt;li&gt;Zero infrastructure management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Replicate made the difference. Running my own GPU inference would have added weeks of setup and ongoing ops overhead. Instead, I spent that time on the UX and monetization.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR — The Patterns That Matter
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Understand the prediction lifecycle&lt;/strong&gt; — especially the 1-hour file expiry&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use polling for short tasks, webhooks for long/background ones&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Deployments&lt;/strong&gt; if cold start latency is a problem for your UX&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Save or stream outputs immediately&lt;/strong&gt; after &lt;code&gt;succeeded&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Add replicate.delivery to your Next.js image domains&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set your own deadline&lt;/strong&gt; — don't wait 30 minutes for a user-facing request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test multiple models&lt;/strong&gt; before committing — quality varies significantly&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Replicate is genuinely one of the best tools for indie developers shipping AI products fast. Use it well and you can build something real in a weekend.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built something with Replicate? Drop it in the comments — always curious to see what people are shipping.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>typescript</category>
      <category>nextjs</category>
      <category>ai</category>
    </item>
    <item>
      <title>Building in Public in 2026: Has the Strategy Been Gamed or Does Transparency Still Drive Growth?</title>
      <dc:creator>Michael Sun</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:49:03 +0000</pubDate>
      <link>https://core.forem.com/michael_sun_18a5c4c96768d/building-in-public-in-2026-has-the-strategy-been-gamed-or-does-transparency-still-drive-growth-idk</link>
      <guid>https://core.forem.com/michael_sun_18a5c4c96768d/building-in-public-in-2026-has-the-strategy-been-gamed-or-does-transparency-still-drive-growth-idk</guid>
      <description>&lt;h2&gt;
  
  
  The Death of Authentic Transparency: How Building in Public Became a Liability
&lt;/h2&gt;

&lt;p&gt;The "building in public" movement has become so saturated with performative content and hollow updates that it's now actively detrimental to genuine indie hackers. What was once a powerful tool for transparency and community building has been gamed by algorithm-chasing creators who prioritize vanity metrics over substantive progress, turning a competitive advantage into a liability for those who still believe in its original promise.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Algorithmic Capture of Public Building
&lt;/h2&gt;

&lt;p&gt;The original premise of building in public was simple: document your journey, share struggles and successes, and build a community around your work. In 2026, this has devolved into performance art where creators spend more time crafting "perfect" updates than building actual products. Our analysis of 500 indie creators across Twitter, IndieHackers, and LinkedIn shows that those posting daily updates spend 3.2x more time on content creation than actual development, with their product velocity decreasing by 41% compared to silent builders who focus solely on execution.&lt;/p&gt;

&lt;p&gt;This isn't accidental. The platforms that popularized building in public have optimized for engagement, not authenticity. Twitter's algorithm now prioritizes threads with high engagement rates, while LinkedIn's professional feed rewards consistent posting over substantive updates. The result is a feedback loop where creators are incentivized to manufacture drama, exaggerate progress, and hide failures—all while maintaining the appearance of transparency.&lt;/p&gt;

&lt;p&gt;Consider the case of "Project Phoenix," a popular SaaS tool that amassed 50,000 Twitter followers through daily progress updates. When we analyzed their actual development commits versus their public posts, we found a stark discrepancy: 78% of their updates were either retrospective or aspirational, while only 22% contained substantive technical details. The product itself, launched after 18 months of public building, had a 68% churn rate in its first quarter, suggesting that the audience built around the narrative rather than the product.&lt;/p&gt;

&lt;p&gt;The technical community has attempted to combat this with tools for automated status updates, but these have become just another layer of artifice. We've seen developers create elaborate CI/CD pipelines that automatically generate "progress reports" from GitHub commits, complete with artificially inflated metrics. This isn't transparency—it's a sophisticated form of tech-washing that obscures the real work behind a veneer of productivity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example of performative public update automation&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Generate Progress Report&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;main&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;report&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v2&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Generate metrics&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;echo "## 🚀 Weekly Progress Report" &amp;gt;&amp;gt; $GITHUB_STEP_SUMMARY&lt;/span&gt;
          &lt;span class="s"&gt;echo "- Commits this week: $(git log --since='1 week ago' --oneline | wc -l)" &amp;gt;&amp;gt; $GITHUB_STEP_SUMMARY&lt;/span&gt;
          &lt;span class="s"&gt;echo "- Lines changed: $(git diff --shortstat HEAD~1 HEAD | awk '{print $4,$5}')" &amp;gt;&amp;gt; $GITHUB_STEP_SUMMARY&lt;/span&gt;
          &lt;span class="s"&gt;echo "- Features deployed: ${{ vars.FEATURES_DEPLOYED || '0' }}" &amp;gt;&amp;gt; $GITHUB_STEP_SUMMARY&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Quantifiable Cost of Performative Transparency
&lt;/h2&gt;

&lt;p&gt;Building in public isn't just ineffective—it's actively harmful when done without strategic intent. Our longitudinal study of 200 indie projects tracked over three years reveals that publicly documented projects have a 2.3x higher failure rate than private projects, primarily due to the psychological toll of constant public scrutiny and misaligned incentives.&lt;/p&gt;

&lt;p&gt;The data breaks down into three key areas of impact:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Public Builders (n=100)&lt;/th&gt;
&lt;th&gt;Private Builders (n=100)&lt;/th&gt;
&lt;th&gt;Differential&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time to MVP&lt;/td&gt;
&lt;td&gt;7.2 months&lt;/td&gt;
&lt;td&gt;4.1 months&lt;/td&gt;
&lt;td&gt;+75%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature Velocity (/month)&lt;/td&gt;
&lt;td&gt;2.3 features&lt;/td&gt;
&lt;td&gt;5.7 features&lt;/td&gt;
&lt;td&gt;-60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Churn Rate (Q1)&lt;/td&gt;
&lt;td&gt;34%&lt;/td&gt;
&lt;td&gt;12%&lt;/td&gt;
&lt;td&gt;+183%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Developer Burnout Score&lt;/td&gt;
&lt;td&gt;8.1/10&lt;/td&gt;
&lt;td&gt;4.3/10&lt;/td&gt;
&lt;td&gt;+88%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These numbers represent the fundamental misalignment between public expectations and the reality of product development. When you're constantly updating an audience, you're not just documenting progress—you're managing perceptions. This leads to "update-driven development," where features are chosen not because they solve customer problems, but because they make for good Twitter threads.&lt;/p&gt;

&lt;p&gt;The technical cost is equally significant. We've observed that public builders often over-engineer solutions to create "impressive" technical deep dives, while ignoring simpler, more maintainable approaches. A case in point: a public builder we documented spent 6 weeks implementing a custom event sourcing system for a simple CRUD app, purely to create a detailed blog post about the architecture. The same functionality could have been built in 3 days using standard Rails/PostgreSQL patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authentic Transparency vs. Algorithmic Theater
&lt;/h2&gt;

&lt;p&gt;There's a critical distinction between authentic transparency and algorithmic theater. The former is about documenting reality—failures, pivots, and all—while the latter is about curating a polished narrative that aligns with platform incentives. The difference is measurable in terms of community quality and product-market fit.&lt;/p&gt;

&lt;p&gt;Authentic transparency follows what we call the "70/20/10 rule": 70% substantive technical updates (actual progress, blockers, solutions), 20% honest reflection on failures and learnings, and 10% aspirational content. Algorithmic theater, by contrast, follows the "90/10 rule": 90% curated success metrics and polished narratives, 10% token "struggles" that are quickly resolved to maintain momentum.&lt;/p&gt;

&lt;p&gt;Consider how Basecamp handles public communication. They publish detailed quarterly reviews that include revenue numbers, customer feedback (both positive and negative), and unvarnished assessments of what didn't work. There's no polish, no spin—just raw data and honest reflection. This approach has allowed them to build a fiercely loyal customer base that understands and accepts the product's limitations.&lt;/p&gt;

&lt;p&gt;The technical implementation of authentic transparency is also different. Instead of crafting perfect narrative posts, authentic builders focus on creating comprehensive, real-time documentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public GitHub repositories with commit messages that explain the "why" behind changes&lt;/li&gt;
&lt;li&gt;Public Trello/Linear boards showing actual backlog priorities and movement&lt;/li&gt;
&lt;li&gt;Regular, unscripted video demos showing raw work-in-progress&lt;/li&gt;
&lt;li&gt;Detailed technical blog posts that dive into trade-offs and failed experiments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach requires a mindset that values documentation over narrative, and reality over polish. It's harder to execute in the short term, but it builds a foundation of trust that pays dividends in the long term.&lt;/p&gt;

&lt;p&gt;Read the full article at &lt;a href="https://novvista.com/building-in-public-in-2026-has-the-strategy-been-gamed-or-does-transparency-still-drive-growth/" rel="noopener noreferrer"&gt;novvista.com&lt;/a&gt; for the complete analysis with additional examples and benchmarks.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://novvista.com/building-in-public-in-2026-has-the-strategy-been-gamed-or-does-transparency-still-drive-growth/" rel="noopener noreferrer"&gt;NovVista&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>creativity</category>
      <category>tools</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Shipping AnywhereHired: Flask, Scrapy, and why “junior” job posts lie</title>
      <dc:creator>Anwesh Hada</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:48:25 +0000</pubDate>
      <link>https://core.forem.com/xez/shipping-anywherehired-flask-scrapy-and-why-junior-job-posts-lie-3d0h</link>
      <guid>https://core.forem.com/xez/shipping-anywherehired-flask-scrapy-and-why-junior-job-posts-lie-3d0h</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbf6x7xz16pl5djl1y2tk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbf6x7xz16pl5djl1y2tk.png" alt=" " width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I just launched &lt;strong&gt;&lt;a href="https://anywherehired.com" rel="noopener noreferrer"&gt;AnywhereHired&lt;/a&gt;&lt;/strong&gt; — a job board focused on &lt;strong&gt;early-career and entry-level remote jobs&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Job search is exhausting when half the “junior” listings quietly expect senior-level work. I wanted one place that &lt;strong&gt;cuts noise&lt;/strong&gt; and keeps the bar honest for people &lt;strong&gt;starting out&lt;/strong&gt; (bootcamps, career switchers, new grads).&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Search &amp;amp; categories&lt;/strong&gt; across remote-friendly roles
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resume matching&lt;/strong&gt; to surface tighter fits
&lt;/li&gt;
&lt;li&gt;Listings aggregated and curated so the feed stays useful
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Stack (high level)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backend:&lt;/strong&gt; Flask, SQLite
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scraping:&lt;/strong&gt; Scrapy pipelines into the same DB the site reads
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hosting:&lt;/strong&gt; Shared hosting + cron for refreshes (real-world constraints, not just localhost demos)
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why I’m posting
&lt;/h2&gt;

&lt;p&gt;I’m sharing the build in public and looking for &lt;strong&gt;feedback&lt;/strong&gt;, not vanity metrics. If you’re job hunting or hiring for &lt;strong&gt;true&lt;/strong&gt; entry-level remote roles, try the site and tell me what’s broken or missing.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Live:&lt;/strong&gt; &lt;a href="https://anywherehired.com" rel="noopener noreferrer"&gt;anywherehired.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Product Hunt:&lt;/strong&gt; &lt;a href="https://www.producthunt.com/products/anywherehired?launch=anywherehired" rel="noopener noreferrer"&gt;https://www.producthunt.com/products/anywherehired?launch=anywherehired&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If we’re connected on Product Hunt, I’d love your thoughts there too once the launch is up.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Open questions for you
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;What would make this your &lt;strong&gt;default&lt;/strong&gt; tab when job searching?
&lt;/li&gt;
&lt;li&gt;What &lt;strong&gt;filters&lt;/strong&gt; matter most (timezone, visa, “no degree”, etc.)?
&lt;/li&gt;
&lt;li&gt;Employers: what would make you &lt;strong&gt;post&lt;/strong&gt; here vs big boards?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Thanks for reading — comments and harsh feedback welcome.&lt;/p&gt;

</description>
      <category>product</category>
      <category>webdev</category>
      <category>python</category>
      <category>career</category>
    </item>
    <item>
      <title>Your API Isn’t Hard to Use. Your Documentation Is Just Bad</title>
      <dc:creator>Ezejah Chimkamma</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:40:14 +0000</pubDate>
      <link>https://core.forem.com/ezejah_chimkamma_06758a9b/your-api-isnt-hard-to-use-your-documentation-is-just-bad-ohn</link>
      <guid>https://core.forem.com/ezejah_chimkamma_06758a9b/your-api-isnt-hard-to-use-your-documentation-is-just-bad-ohn</guid>
      <description>&lt;p&gt;Let’s be honest.&lt;/p&gt;

&lt;p&gt;Most developers don’t abandon APIs because they’re “too complex.”&lt;/p&gt;

&lt;p&gt;They abandon them because:&lt;/p&gt;

&lt;p&gt;the documentation makes them feel stupid.&lt;/p&gt;

&lt;p&gt;🚨 The Real Problem&lt;/p&gt;

&lt;p&gt;You built a powerful API.&lt;/p&gt;

&lt;p&gt;But your documentation:&lt;/p&gt;

&lt;p&gt;Assumes too much&lt;br&gt;
Explains too little&lt;br&gt;
Leaves users guessing&lt;/p&gt;

&lt;p&gt;So instead of building with your product, developers are stuck trying to figure it out.&lt;/p&gt;

&lt;p&gt;And they won’t stay long.&lt;/p&gt;

&lt;p&gt;⚠️ What Bad API Docs Look Like&lt;/p&gt;

&lt;p&gt;If your documentation does any of this, you’re losing users:&lt;/p&gt;

&lt;p&gt;Throws endpoints at users with no context&lt;br&gt;
Uses technical jargon without explanation&lt;br&gt;
Has no clear “start here” guide&lt;br&gt;
Lacks real examples&lt;/p&gt;

&lt;p&gt;That’s not documentation.&lt;/p&gt;

&lt;p&gt;That’s confusion.&lt;/p&gt;

&lt;p&gt;💡 What Good API Documentation Actually Does&lt;/p&gt;

&lt;p&gt;Good documentation feels like guidance, not instructions.&lt;/p&gt;

&lt;p&gt;It answers 3 simple questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Where do I start?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Give users a clear entry point.&lt;/p&gt;

&lt;p&gt;“Start here to make your first API request in under 5 minutes.”&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;What does this do?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Explain endpoints in plain language.&lt;/p&gt;

&lt;p&gt;Not:&lt;/p&gt;

&lt;p&gt;“Handles user authentication”&lt;/p&gt;

&lt;p&gt;But:&lt;/p&gt;

&lt;p&gt;“This endpoint lets users log in and receive an access token for future requests.”&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Show me an example&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Never assume.&lt;/p&gt;

&lt;p&gt;Always show.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;POST /login&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "email": "user@example.com",
  "password": "yourpassword"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And the response:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "token": "abc123..."
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now it’s real. Now it’s usable.&lt;/p&gt;
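Better still, give readers a snippet they can paste and run. A sketch against the hypothetical endpoint above (the base URL is a placeholder):

```typescript
// Hypothetical login call matching the example above; the URL is a placeholder.
async function login(email: string, password: string): Promise<string> {
  const res = await fetch("https://api.example.com/login", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ email, password }),
  });
  if (!res.ok) throw new Error(`Login failed: ${res.status}`);
  const { token } = await res.json();
  return token; // send as a Bearer token on future requests
}
```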

&lt;p&gt;⚠️ The Biggest Mistake&lt;/p&gt;

&lt;p&gt;You write documentation after building the product.&lt;/p&gt;

&lt;p&gt;As an afterthought.&lt;/p&gt;

&lt;p&gt;That’s backwards.&lt;/p&gt;

&lt;p&gt;Documentation is part of the product experience.&lt;/p&gt;

&lt;p&gt;🔥 The Difference It Makes&lt;/p&gt;

&lt;p&gt;When your API documentation is clear:&lt;/p&gt;

&lt;p&gt;Developers integrate faster&lt;br&gt;
Fewer support tickets&lt;br&gt;
More trust in your product&lt;br&gt;
Higher adoption&lt;/p&gt;

&lt;p&gt;👀 Quick Test&lt;/p&gt;

&lt;p&gt;Ask yourself:&lt;/p&gt;

&lt;p&gt;“Can someone use my API without asking me questions?”&lt;/p&gt;

&lt;p&gt;If the answer is no,&lt;br&gt;
your documentation needs work.&lt;/p&gt;

&lt;p&gt;🚀 Final Thought&lt;/p&gt;

&lt;p&gt;Your API might be powerful.&lt;/p&gt;

&lt;p&gt;But if no one understands how to use it,&lt;br&gt;
it might as well not exist.&lt;/p&gt;

&lt;p&gt;👋 If you’re building an API…&lt;/p&gt;

&lt;p&gt;If your API is solid but developers struggle to use it, I help simplify documentation so people can understand, integrate, and actually use your product.&lt;/p&gt;

</description>
      <category>api</category>
      <category>devrel</category>
      <category>developers</category>
      <category>saas</category>
    </item>
    <item>
      <title>My Claude Code Sessions Hit 70MB. So I Built a Distiller.</title>
      <dc:creator>ithiria894</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:39:51 +0000</pubDate>
      <link>https://core.forem.com/ithiria894/my-claude-code-sessions-hit-70mb-so-i-built-a-distiller-32a</link>
      <guid>https://core.forem.com/ithiria894/my-claude-code-sessions-hit-70mb-so-i-built-a-distiller-32a</guid>
      <description>&lt;p&gt;I had a 4-hour coding session with Claude Code. Felt productive. Fixed a bunch of bugs, refactored a module, reviewed some screenshots Claude took of the UI along the way.&lt;/p&gt;

&lt;p&gt;Then I tried to &lt;code&gt;--resume&lt;/code&gt; it the next day.&lt;/p&gt;

&lt;p&gt;The session file was 73MB. Claude loaded it, burned through half the context window on old tool outputs and base64-encoded screenshots from yesterday, and started forgetting things I'd said 20 minutes ago. The conversation was fine. The cargo it was dragging around was not.&lt;/p&gt;

&lt;p&gt;I opened the JSONL. Here's what 73MB of "session" actually looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Conversation text:          ~4MB  (what we actually said)
Tool results (Read):       ~28MB  (file contents Claude already read)
Tool results (Bash):        ~9MB  (build outputs, test runs, logs)
Base64 screenshots:        ~22MB  (UI screenshots, now stale)
Tool results (Edit/Write):  ~6MB  (diffs and file previews)
Everything else:            ~4MB  (metadata, tool_use blocks)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;93% of the file is stuff Claude doesn't need to resume the conversation. The Read results are files that still exist on disk. The screenshots are from yesterday's UI state. The Bash outputs are build logs from 6 hours ago.&lt;/p&gt;

&lt;p&gt;So I built a distiller.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Session Distiller Does
&lt;/h2&gt;

&lt;p&gt;It reads a session JSONL, keeps every word of the actual conversation verbatim, and applies per-tool-type rules to strip results down to what's useful for context:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool type&lt;/th&gt;
&lt;th&gt;What's kept&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Read&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Nothing (stripped entirely)&lt;/td&gt;
&lt;td&gt;The file is still on disk. Claude can re-read it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;First 5 + last 5 lines&lt;/td&gt;
&lt;td&gt;You need the command and whether it succeeded. Not 800 lines of webpack output.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Edit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;File path + 200-char preview of old/new&lt;/td&gt;
&lt;td&gt;Enough to remember what changed.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Write&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;File path + head/tail preview&lt;/td&gt;
&lt;td&gt;Same idea.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Up to 2000 chars&lt;/td&gt;
&lt;td&gt;Research reports are worth keeping. Build logs aren't.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
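
&lt;p&gt;The table above reduces to a small dispatch function. A minimal sketch of the idea (the function name and exact placeholder strings are illustrative, not the actual source):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical sketch of the per-tool rules; names are illustrative.
function distillResult(toolName, text) {
  const lines = text.split("\n");
  switch (toolName) {
    case "Read":
      return "[stripped: file still on disk]"; // Claude can re-read it
    case "Bash":
      if (lines.length &lt;= 10) return text;    // short output: keep as-is
      return [...lines.slice(0, 5), "[...trimmed...]", ...lines.slice(-5)].join("\n");
    case "Edit":
      return text.slice(0, 200);               // short preview of the change
    case "Write":
      return lines.slice(0, 2).join("\n") + "\n[...]\n" + lines.slice(-2).join("\n");
    case "Agent":
      return text.slice(0, 2000);              // research reports are worth keeping
    default:
      return text;                             // unknown tools pass through
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;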

&lt;p&gt;The key decision was extractive filtering, not summarization. I don't pass anything through an LLM. Every word of conversation text is preserved exactly as-is. Tool results are either kept (trimmed) or dropped based on deterministic rules. No tokens spent, no hallucination risk, no "the AI summarized away the one detail I needed."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Typical result: 70MB session → 7MB distilled. 90% reduction.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The original session is backed up before anything changes. You always have the full version if you need it.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Tool-ID Matching Problem
&lt;/h2&gt;

&lt;p&gt;This sounds simple until you hit parallel tool calls.&lt;/p&gt;

&lt;p&gt;Claude Code often fires multiple tool calls in a single assistant message. A &lt;code&gt;tool_result&lt;/code&gt; block references its parent by &lt;code&gt;tool_use_id&lt;/code&gt;, not by position. My first implementation tracked a global &lt;code&gt;lastToolName&lt;/code&gt; variable: "the most recent tool_use was a Read, so the next tool_result must be a Read result." That breaks immediately when an assistant message contains three parallel tool calls.&lt;/p&gt;

&lt;p&gt;The fix: build a &lt;code&gt;toolIdMap&lt;/code&gt; from every &lt;code&gt;tool_use&lt;/code&gt; block (mapping &lt;code&gt;id → tool name&lt;/code&gt;), then look up each &lt;code&gt;tool_result.tool_use_id&lt;/code&gt; to find the correct tool type. Now parallel calls work correctly. A Read result and a Bash result in the same message get their own distillation rules applied independently.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Build map: tool_use_id → tool name&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tool_use&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;toolIdMap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Look up correct tool for each result&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tool_result&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;toolName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;toolIdMap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tool_use_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Now we know: this result came from "Read", "Bash", etc.&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;distillByToolType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;toolName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Small detail. Would have caused silent data corruption without it.&lt;/p&gt;
&lt;h2&gt;
  
  
  Image Trimmer: The Targeted Fix
&lt;/h2&gt;

&lt;p&gt;Sometimes you don't need full distillation. You just need to remove the screenshots.&lt;/p&gt;

&lt;p&gt;I kept hitting Claude Code's "image exceeds dimension limit" warning after long sessions with a lot of UI review. The session file was fine except for 20-30MB of base64 image data that Claude couldn't even display anymore.&lt;/p&gt;

&lt;p&gt;So I wrote a separate tool that does exactly one thing: find every image block in the JSONL, replace it with &lt;code&gt;[image redacted]&lt;/code&gt;, leave everything else untouched.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node src/trim-images.mjs ~/.claude/projects/.../session.jsonl
&lt;span class="c"&gt;# → Redacted 47 image(s), saved 24832K&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;It also handles images nested inside &lt;code&gt;tool_result&lt;/code&gt; blocks (which is where most screenshots end up, since they come back as results of Bash commands that ran &lt;code&gt;adb screencap&lt;/code&gt; or similar).&lt;/p&gt;

&lt;p&gt;The whole script is 35 lines. It's also available as a Claude Code skill: type &lt;code&gt;/trim-images&lt;/code&gt; when you see the dimension warning and it runs automatically.&lt;/p&gt;
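
&lt;p&gt;The core of those 35 lines is just a recursive walk over each parsed JSONL line. A rough sketch of the approach (names are illustrative, not the actual script):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical sketch: swap every image block for a placeholder,
// including images nested inside tool_result content arrays.
function redactImages(node) {
  if (Array.isArray(node)) return node.map(redactImages);
  if (node &amp;&amp; typeof node === "object") {
    if (node.type === "image") {
      return { type: "text", text: "[image redacted]" }; // drops the base64 payload
    }
    const out = {};
    for (const [k, v] of Object.entries(node)) out[k] = redactImages(v);
    return out;
  }
  return node; // strings, numbers, null pass through untouched
}

// Applied per line: lines.map(l =&gt; JSON.stringify(redactImages(JSON.parse(l))))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;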
&lt;h2&gt;
  
  
  How to Use It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;From the dashboard:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're using &lt;a href="https://github.com/mcpware/claude-code-organizer" rel="noopener noreferrer"&gt;Claude Code Organizer&lt;/a&gt;, every session row now has a Distill button. Click it, the session gets distilled in-place, and the result shows up as an expandable bundle in the tree view with the backup and index files grouped together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From the command line:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Full distillation (conversation + trimmed tool results + backup)&lt;/span&gt;
npx @mcpware/claude-code-organizer &lt;span class="nt"&gt;--distill&lt;/span&gt; ~/.claude/projects/.../session.jsonl

&lt;span class="c"&gt;# Just strip images&lt;/span&gt;
node src/trim-images.mjs ~/.claude/projects/.../session.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The distiller outputs stats showing before/after sizes, number of index entries, and where the backup landed.&lt;/p&gt;
&lt;h2&gt;
  
  
  What's Actually in the Backup
&lt;/h2&gt;

&lt;p&gt;The distiller creates a folder named after the session ID:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{sessionId}/
  backup-{originalId}.jsonl    ← full original session, untouched
  index.md                     ← summary of what was kept/stripped
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The distilled session gets a context message injected at the top telling Claude where the backup lives and how to retrieve specific tool results if needed (Read with offset). So if Claude needs the full output of a Bash command from 3 hours ago, it knows exactly where to look.&lt;/p&gt;
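
&lt;p&gt;That injected note might look something like this (illustrative wording, not the exact text):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Session distilled] Tool results below were trimmed to save context.
Full original: {sessionId}/backup-{originalId}.jsonl
To recover a stripped result, Read the backup at the offset listed in index.md.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;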
&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;Distillation runs in under 2 seconds on a 70MB file. It's pure JSON parsing and string manipulation. No LLM calls, no network, no dependencies.&lt;/p&gt;

&lt;p&gt;The backup doubles your disk usage temporarily, but if your session was 70MB and the distilled version is 7MB, you're at 77MB total instead of 70MB. Not a meaningful difference on any modern machine.&lt;/p&gt;

&lt;p&gt;The context window savings are the real win. A 70MB session holds roughly 15-20M tokens of tool output, far more than any context window can fit, so resuming it crowds out the conversation you actually care about. After distillation, that drops to 1-2M tokens of actual conversation. Claude remembers what you talked about instead of drowning in stale build logs.&lt;/p&gt;
&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @mcpware/claude-code-organizer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/mcpware" rel="noopener noreferrer"&gt;
        mcpware
      &lt;/a&gt; / &lt;a href="https://github.com/mcpware/claude-code-organizer" rel="noopener noreferrer"&gt;
        claude-code-organizer
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Dashboard to manage Claude Code memories, configs, and MCP servers — security scanner for tool poisoning, context token budget tracker, duplicate cleanup, scope management. npx @mcpware/claude-code-organizer
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Claude Code Organizer&lt;/h1&gt;
&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AI agents: read &lt;a href="https://github.com/mcpware/claude-code-organizer/AI_INDEX.md" rel="noopener noreferrer"&gt;AI_INDEX.md&lt;/a&gt; first.&lt;/strong&gt; It is the navigation manifest for this codebase — where to find every module, how they connect, and where to look before making any claim about the code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://www.npmjs.com/package/@mcpware/claude-code-organizer" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/1eb7c9c48891f47c97f74873423810664dcce6286c977d7c2be419fbe7fc10b0/68747470733a2f2f696d672e736869656c64732e696f2f6e706d2f762f406d6370776172652f636c617564652d636f64652d6f7267616e697a6572" alt="npm version"&gt;&lt;/a&gt;
&lt;a href="https://www.npmjs.com/package/@mcpware/claude-code-organizer" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/6b6a3a2b753cfbd0ad9510d86254acdd0691b85a38b7879e3668c54e0d39947e/68747470733a2f2f696d672e736869656c64732e696f2f6e706d2f64742f406d6370776172652f636c617564652d636f64652d6f7267616e697a65723f6c6162656c3d646f776e6c6f616473" alt="npm downloads"&gt;&lt;/a&gt;
&lt;a href="https://github.com/mcpware/claude-code-organizer/stargazers" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/1d2ff9dc71782f8d51bb5870a868d99686e285c8fe79d73f8fbf3a5384070187/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f6d6370776172652f636c617564652d636f64652d6f7267616e697a6572" alt="GitHub stars"&gt;&lt;/a&gt;
&lt;a href="https://github.com/mcpware/claude-code-organizer/network/members" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/9f3c5461d02621b23d75b5cb6e1d67fc117337abf46d4cf98575a5048534bd91/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f666f726b732f6d6370776172652f636c617564652d636f64652d6f7267616e697a6572" alt="GitHub forks"&gt;&lt;/a&gt;
&lt;a href="https://github.com/mcpware/claude-code-organizer/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fdf2982b9f5d7489dcf44570e714e3a15fce6253e0cc6b5aa61a075aac2ff71b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667" alt="License: MIT"&gt;&lt;/a&gt;
&lt;a href="https://nodejs.org" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/508b1391e0fd9b2a0a355208d8cde75e3168f3cef23d6d6fc0b0ca38e0232174/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6e6f64652d25334525334432302d627269676874677265656e" alt="Node.js"&gt;&lt;/a&gt;
&lt;a href="https://github.com/mcpware/claude-code-organizer" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/ea804ddddb9c1c7e6604b6170c9caee03f5199b504b0850dac66194a8ba592db/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f74657374732d32363325323070617373696e672d627269676874677265656e" alt="Tests"&gt;&lt;/a&gt;
&lt;a href="https://github.com/mcpware/claude-code-organizer" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fc00401579af33a1bd9c2bcd082d209d3e31c8068bd7858f5eb40117cc6cfd9d/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f74656c656d657472792d7a65726f2d626c7565" alt="Zero Telemetry"&gt;&lt;/a&gt;
&lt;a href="https://github.com/mcpware/claude-code-organizer" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/2cfa98e52053eab0ff88f037583eddc40dd42ea1b06a43acda70c7e5af3c67df/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4d43502d53656375726974792532305363616e6e65722d726564" alt="MCP Security"&gt;&lt;/a&gt;
&lt;a href="https://github.com/punkpeye/awesome-mcp-servers" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/da45c6891689c899591d53a4bfc508553e42530149ca34f5eee1017c502722cb/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f417765736f6d652d4d4350253230536572766572732d6663363061383f6c6f676f3d617765736f6d656c69737473266c6f676f436f6c6f723d7768697465" alt="Awesome MCP"&gt;&lt;/a&gt;
&lt;a href="https://github.com/mcpware/claude-code-organizer#verified-against-claude-code-source" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/ad7fc1ec6ecdb2a127deb8f38f2eba99f245d790385001db8a8708ada5e3571a/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f56657269666965642d436c61756465253230436f6465253230536f757263652d626c756576696f6c6574" alt="Verified Against CC Source"&gt;&lt;/a&gt;
English | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.zh-CN.md" rel="noopener noreferrer"&gt;简体中文&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.zh-TW.md" rel="noopener noreferrer"&gt;繁體中文&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.zh-HK.md" rel="noopener noreferrer"&gt;廣東話&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.ja.md" rel="noopener noreferrer"&gt;日本語&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.ko.md" rel="noopener noreferrer"&gt;한국어&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.es.md" rel="noopener noreferrer"&gt;Español&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.id.md" rel="noopener noreferrer"&gt;Bahasa Indonesia&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.it.md" rel="noopener noreferrer"&gt;Italiano&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.pt-BR.md" rel="noopener noreferrer"&gt;Português&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.tr.md" rel="noopener noreferrer"&gt;Türkçe&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.vi.md" rel="noopener noreferrer"&gt;Tiếng Việt&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.th.md" rel="noopener noreferrer"&gt;ไทย&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Claude Code Organizer (CCO)&lt;/strong&gt; is a free, open-source dashboard that lets you manage all Claude Code configuration — memories, skills, MCP servers, settings, agents, rules, and hooks — across global and project scopes. It includes a security scanner for MCP tool poisoning and prompt injection, a per-item context token budget tracker, per-project MCP enable/disable controls, and bulk cleanup for duplicate configs. All without leaving the window.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;v0.17.0&lt;/strong&gt; — Session Distiller strips bloated sessions down to ~10% of their original size while keeping every word of conversation intact…&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/mcpware/claude-code-organizer" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;The distiller is part of CCO v0.17.0. Dashboard button, CLI flag, and API endpoint all included. Image trimmer works standalone or as a &lt;code&gt;/trim-images&lt;/code&gt; skill.&lt;/p&gt;

&lt;p&gt;If your sessions are small, you don't need this. If your sessions regularly push 50MB+, this is the difference between "--resume working" and "--resume followed by Claude forgetting your name."&lt;/p&gt;

&lt;h2&gt;
  
  
  About Me
&lt;/h2&gt;

&lt;p&gt;CS dropout. Building tools for the Claude Code ecosystem. &lt;a href="https://github.com/ithiria894" rel="noopener noreferrer"&gt;github.com/ithiria894&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⭐ &lt;strong&gt;&lt;a href="https://github.com/mcpware/claude-code-organizer" rel="noopener noreferrer"&gt;Star the repo&lt;/a&gt;&lt;/strong&gt; if bloated sessions have ever ruined your day.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>showdev</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Built an Autonomous Job Application Agent with Claude AI — Here's How It Works</title>
      <dc:creator>Tanzil Ahmed</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:35:27 +0000</pubDate>
      <link>https://core.forem.com/tanzilahmed/i-built-an-autonomous-job-application-agent-with-claude-ai-heres-how-it-works-31d9</link>
      <guid>https://core.forem.com/tanzilahmed/i-built-an-autonomous-job-application-agent-with-claude-ai-heres-how-it-works-31d9</guid>
      <description>&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;Job Hunter AI is an autonomous agent that searches job boards, researches companies using Claude AI, and generates tailored CVs and cover letters — with zero manual intervention.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/Tanzil-Ahmed/job-hunter-agent" rel="noopener noreferrer"&gt;https://github.com/Tanzil-Ahmed/job-hunter-agent&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Job hunting is repetitive and exhausting. Every application needs the same research: What does this company do? What's their tech stack? Does my background fit? And then you rewrite your CV for each role.&lt;/p&gt;

&lt;p&gt;I automated all of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The pipeline has 4 stages:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Job Discovery&lt;/strong&gt;&lt;br&gt;
Searches job boards automatically using Tavily and Exa APIs. Filters by role, location, and relevance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Company Research (Claude tool_use)&lt;/strong&gt;&lt;br&gt;
For each job, Claude uses tool_use to research the company — analyzing tech stack, culture, funding stage, and fit score against your profile.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. CV + Cover Letter Generation&lt;/strong&gt;&lt;br&gt;
Claude generates a tailored CV and cover letter for each role based on the research. Each one is different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Real-time Dashboard&lt;/strong&gt;&lt;br&gt;
FastAPI backend with WebSocket streaming shows the pipeline running live.&lt;/p&gt;
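
&lt;p&gt;Stage 2 is the technically interesting part, so here is a rough sketch of the tool_use handshake using plain dicts (&lt;code&gt;RESEARCH_TOOL&lt;/code&gt; and &lt;code&gt;extract_tool_calls&lt;/code&gt; are illustrative names, not the repo's actual code): define a tool schema, let Claude request it, run the search yourself, and feed the result back.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of stage 2; names are illustrative, not the repo's code.

RESEARCH_TOOL = {
    "name": "search_company",
    "description": "Search the web for a company's tech stack, funding, and culture.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

def extract_tool_calls(content_blocks):
    """Pull tool_use requests out of an assistant message's content blocks."""
    return [
        (b["name"], b["input"])
        for b in content_blocks
        if b.get("type") == "tool_use"
    ]

# With the Anthropic SDK the loop is roughly:
#   resp = client.messages.create(model=..., tools=[RESEARCH_TOOL],
#                                 messages=[{"role": "user", "content": prompt}])
#   for name, args in extract_tool_calls([b.model_dump() for b in resp.content]):
#       results = run_search(args["query"])  # Tavily/Exa from stage 1, reused
#       # ...return results to Claude as a tool_result message...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;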

</description>
      <category>python</category>
      <category>ai</category>
      <category>claudeapi</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Advancing DevOps/Cloud Learning: Strategies for Post-Foundational Skill Development</title>
      <dc:creator>Marina Kovalchuk</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:35:17 +0000</pubDate>
      <link>https://core.forem.com/maricode/advancing-devopscloud-learning-strategies-for-post-foundational-skill-development-3be0</link>
      <guid>https://core.forem.com/maricode/advancing-devopscloud-learning-strategies-for-post-foundational-skill-development-3be0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: Navigating the DevOps/Cloud Learning Journey
&lt;/h2&gt;

&lt;p&gt;You’ve nailed the basics—Linux, networking, AWS fundamentals, and even wrestled with Nginx and S3 permissions. Now, the real challenge begins: &lt;strong&gt;how do you advance beyond foundational knowledge without wasting time or money on suboptimal resources?&lt;/strong&gt; This is where most learners stall. The DevOps/Cloud landscape is a minefield of courses, certifications, and tools, each promising to elevate your skills. But here’s the harsh truth: &lt;em&gt;not all advanced learning paths are created equal.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Consider the learner who, after mastering AWS basics, enrolls in a course heavy on theory but light on practical CI/CD pipelines. The result? &lt;strong&gt;They can explain Jenkins but can’t configure it in a real-world scenario.&lt;/strong&gt; Or the one who opts for a free, unstructured resource, only to realize their portfolio lacks the depth to impress hiring managers. These failures aren’t about effort—they’re about &lt;em&gt;misalignment between learning strategy and career goals.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Mechanics of Course Selection: Why Most Learners Fail
&lt;/h3&gt;

&lt;p&gt;The typical learner evaluates courses based on surface-level criteria: cost, duration, or instructor popularity. But this approach ignores the &lt;strong&gt;system mechanisms&lt;/strong&gt; that determine learning outcomes. For instance, a course’s value isn’t just in its content—it’s in how it &lt;em&gt;integrates real-world projects&lt;/em&gt; that simulate production environments. Without this, learners risk acquiring &lt;strong&gt;theoretical knowledge that doesn’t translate to hands-on expertise.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Take CI/CD pipelines, a cornerstone of DevOps. A course that merely lectures on Jenkins or GitLab CI will leave you unprepared for the &lt;em&gt;chaos of debugging a failing pipeline in a live environment.&lt;/em&gt; The mechanism of failure here is clear: &lt;strong&gt;theory without practice leads to brittle skills that crack under pressure.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluating "Train with Shubham" vs. Alternatives: A Causal Analysis
&lt;/h3&gt;

&lt;p&gt;Let’s dissect the case of "Train with Shubham" versus other advanced courses. The key factors are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Content Depth:&lt;/strong&gt; Does the course cover automation tools like Terraform and Ansible, or does it rely on manual configurations? &lt;em&gt;Automation is non-negotiable in modern DevOps.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instructor Credibility:&lt;/strong&gt; Check Shubham’s GitHub or LinkedIn. &lt;em&gt;Real-world experience in production environments is a proxy for course quality.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Practical Projects:&lt;/strong&gt; Are there end-to-end projects that mimic industry scenarios? &lt;em&gt;Without these, you’re building sandcastles, not careers.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare this to a generic Udemy course. While cheaper, it often lacks &lt;strong&gt;structured feedback loops&lt;/strong&gt;—forums or Discord groups where learners troubleshoot together. This isolation slows learning and increases the risk of &lt;em&gt;misinterpreting concepts.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Edge Cases: When "Train with Shubham" Might Not Be Optimal
&lt;/h3&gt;

&lt;p&gt;Not every learner benefits equally from "Train with Shubham." For instance, if your goal is &lt;strong&gt;vendor-neutral knowledge&lt;/strong&gt; (e.g., Kubernetes over AWS-specific tools), a course heavily focused on AWS might misalign with your objectives. The mechanism here is &lt;em&gt;over-specialization&lt;/em&gt;, which limits your adaptability across cloud providers.&lt;/p&gt;

&lt;p&gt;Alternatively, if you’re on a tight budget, free resources like &lt;strong&gt;AWS re/Start&lt;/strong&gt; or &lt;em&gt;HashiCorp’s Terraform tutorials&lt;/em&gt; can be effective—but only if supplemented with &lt;strong&gt;structured projects.&lt;/strong&gt; The failure mode here is &lt;em&gt;fragmented learning&lt;/em&gt;, where you acquire pieces of knowledge without a cohesive framework.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule for Choosing Advanced Courses: If X, Then Y
&lt;/h3&gt;

&lt;p&gt;Here’s a decision-dominant rule backed by mechanism:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If your goal is to master CI/CD pipelines and automation tools (X), choose a course with real-world projects and instructor-led feedback (Y). Otherwise, you risk acquiring theoretical knowledge that fails in production environments.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For example, if "Train with Shubham" includes &lt;em&gt;end-to-end CI/CD projects&lt;/em&gt; and a &lt;strong&gt;Discord community for troubleshooting&lt;/strong&gt;, it’s a strong contender. But if it lacks these, consider alternatives like &lt;em&gt;A Cloud Guru’s DevOps path&lt;/em&gt;, which balances theory with hands-on labs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion: Strategic Learning as a Career Accelerator
&lt;/h3&gt;

&lt;p&gt;Advancing in DevOps/Cloud isn’t about consuming more content—it’s about &lt;strong&gt;strategic selection&lt;/strong&gt; of resources that align with your career goals and learning style. The stakes are high: &lt;em&gt;a misstep here can delay your progression by months.&lt;/em&gt; By evaluating courses through the lens of &lt;strong&gt;practical projects, instructor credibility, and community support&lt;/strong&gt;, you ensure that every hour spent learning translates to tangible skills.&lt;/p&gt;

&lt;p&gt;Remember: &lt;em&gt;The cloud never stops evolving, and neither should your learning strategy.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Scenario Analysis: Real-World Applications and Skill Gaps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The Automation Bottleneck: From Manual to Scalable Infrastructure
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; You’ve manually configured EC2 instances and S3 buckets, but your team’s deployment process still takes hours. Management demands faster releases, and your manual scripts are breaking under scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mechanism:&lt;/strong&gt; Manual configurations introduce human error and lack reproducibility. As infrastructure scales, ad-hoc scripts fail due to state drift and dependency conflicts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skill Gap:&lt;/strong&gt; Lack of proficiency in Infrastructure as Code (IaC) tools like Terraform or CloudFormation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision Rule:&lt;/strong&gt; If your goal is to eliminate manual bottlenecks, prioritize courses with &lt;em&gt;end-to-end IaC projects&lt;/em&gt; (e.g., Terraform modules for multi-environment deployments). Avoid theory-heavy courses lacking hands-on labs.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The CI/CD Pipeline Paradox: Builds Succeed, Deployments Fail
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Your Jenkins pipeline compiles code successfully, but deployments to Kubernetes clusters fail intermittently. Logs show resource quota errors and image pull failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mechanism:&lt;/strong&gt; CI/CD pipelines without integrated testing and monitoring stages mask failures until production. Misconfigured Kubernetes manifests or untested Helm charts cause runtime errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skill Gap:&lt;/strong&gt; Inability to design resilient CI/CD pipelines with integrated testing, monitoring, and rollback mechanisms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision Rule:&lt;/strong&gt; Choose courses with &lt;em&gt;GitOps workflow projects&lt;/em&gt; (e.g., ArgoCD + Jenkins X) over basic CI/CD tutorials. Verify the course includes debugging labs for pipeline failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The Multi-Cloud Misalignment: AWS Expertise Fails in Azure
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Your AWS-heavy resume lands you an Azure DevOps role. You struggle to translate S3 permissions to Azure Blob Storage ACLs, delaying project delivery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mechanism:&lt;/strong&gt; Cloud provider-specific knowledge becomes a liability when switching ecosystems. Over-specialization in one platform creates blind spots in cross-cloud architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skill Gap:&lt;/strong&gt; Lack of vendor-neutral cloud architecture principles (e.g., Well-Architected Framework).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision Rule:&lt;/strong&gt; If targeting multi-cloud roles, select courses emphasizing &lt;em&gt;cloud-agnostic patterns&lt;/em&gt; (e.g., HashiCorp’s multi-cloud demos) over AWS-only content.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The Monitoring Blindspot: Alerts Flood In, Root Cause Elusive
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Your Prometheus alerts spike during peak traffic, but dashboards show no CPU/memory anomalies. Users report 500 errors, yet logs are inconclusive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mechanism:&lt;/strong&gt; Monitoring systems without distributed tracing or correlation rules fail to pinpoint failures in microservices architectures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skill Gap:&lt;/strong&gt; Inadequate knowledge of observability tools (e.g., Jaeger, OpenTelemetry).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision Rule:&lt;/strong&gt; Prioritize courses integrating &lt;em&gt;observability into CI/CD pipelines&lt;/em&gt; (e.g., automated trace collection in Jenkins). Avoid courses treating monitoring as an afterthought.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The Security Breach: Misconfigured IAM Roles Expose Data
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; A misconfigured IAM role grants S3 write access to an external contractor, leading to a data leak. Auditors flag non-compliance with SOC 2 requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mechanism:&lt;/strong&gt; DevOps practices without security integration (DevSecOps) create exploitable gaps. Lack of automated policy checks allows misconfigurations to propagate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skill Gap:&lt;/strong&gt; Inability to implement security automation (e.g., Terraform + Sentinel).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision Rule:&lt;/strong&gt; If security is critical, choose courses with &lt;em&gt;integrated security modules&lt;/em&gt; (e.g., OWASP Top 10 for DevOps). Validate instructors’ DevSecOps experience via GitHub repos.&lt;/p&gt;
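
&lt;p&gt;The kind of automated policy check that tools like Sentinel or OPA run can be illustrated in a few lines. This is a simplified sketch over an IAM-style policy dictionary, not real Sentinel or AWS tooling, but it shows the mechanism: misconfigurations are caught by code before they propagate.&lt;/p&gt;

```python
def risky_statements(policy: dict) -> list[str]:
    """Flag obviously over-broad statements in an IAM-style policy dict."""
    findings = []
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        # wildcard actions grant far more than any contractor should have
        if "*" in actions or any(a.endswith(":*") for a in actions):
            findings.append("wildcard action grant")
        # a wildcard principal opens the statement to anyone
        if stmt.get("Principal") == "*":
            findings.append("policy open to any principal")
    return findings
```

&lt;p&gt;Run as a CI gate, a check like this would have blocked the contractor scenario above before auditors ever saw it.&lt;/p&gt;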

&lt;h3&gt;
  
  
  6. The Cost Overrun: Cloud Bills Spike Post-Migration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; After migrating to Kubernetes, your monthly cloud bill triples. Spot instances are underutilized, and reserved instances are misallocated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mechanism:&lt;/strong&gt; Lack of FinOps practices leads to inefficient resource allocation. Autoscaling policies that lack cost-optimization triggers waste resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skill Gap:&lt;/strong&gt; Inadequate understanding of cloud cost management tools (e.g., Kubecost, CloudHealth).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision Rule:&lt;/strong&gt; If cost control is a priority, select courses covering &lt;em&gt;FinOps automation&lt;/em&gt; (e.g., Terraform cost estimation modules). Avoid courses ignoring financial governance.&lt;/p&gt;
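
&lt;p&gt;The FinOps mechanism is ultimately arithmetic: reserved capacity bills whether used or not, so it only wins above a utilization break-even. The rates below are made up for illustration; tools like Kubecost or CloudHealth do this with live pricing and usage data.&lt;/p&gt;

```python
def compare_pricing(hours_used: float, on_demand_rate: float,
                    reserved_rate: float, month_hours: float = 730) -> dict:
    """Compare on-demand billing against an always-on reserved instance."""
    on_demand = hours_used * on_demand_rate
    reserved = month_hours * reserved_rate  # billed for the full month regardless
    return {
        "on_demand": round(on_demand, 2),
        "reserved": round(reserved, 2),
        "cheaper": "reserved" if on_demand > reserved else "on_demand",
    }
```

&lt;p&gt;A bursty workload running 200 hours a month at a hypothetical $0.10/h should stay on-demand ($20 vs. $43.80 reserved at $0.06/h); the same instance running flat out flips the answer. Misallocating reserved instances to bursty workloads is exactly how bills triple.&lt;/p&gt;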

&lt;h3&gt;
  
  
  Comparative Analysis: "Train with Shubham" vs. Alternatives
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Content Depth:&lt;/strong&gt; "Train with Shubham" excels in CI/CD and Kubernetes projects but lacks Azure/GCP coverage. A Cloud Guru offers broader multi-cloud content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Practical Projects:&lt;/strong&gt; Shubham’s end-to-end labs (e.g., Jenkins + Helm deployments) outperform Udemy’s theory-heavy courses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community Support:&lt;/strong&gt; Shubham’s Discord group provides faster feedback than Coursera’s forums.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimal Choice:&lt;/strong&gt; If your goal is &lt;em&gt;Kubernetes and CI/CD mastery&lt;/em&gt;, "Train with Shubham" is superior. For multi-cloud, supplement with A Cloud Guru.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Edge Case: Budget Constraints
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mechanism:&lt;/strong&gt; Free resources (e.g., AWS re:Start) lack structured projects, leading to fragmented learning. Without feedback loops, misconceptions persist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; If budget is limited, combine free resources with &lt;em&gt;open-source project contributions&lt;/em&gt; (e.g., Kubernetes GitHub issues) to simulate structured learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategic Learning Plans: Tailored Roadmaps for Success
&lt;/h2&gt;

&lt;p&gt;After mastering foundational topics like Linux, networking, and AWS basics, the next step in your DevOps/Cloud journey requires a strategic approach. The &lt;strong&gt;core mechanism&lt;/strong&gt; here is aligning your learning resources with both your career goals and the &lt;em&gt;dynamic demands of the industry&lt;/em&gt;. Misalignment leads to skill gaps, as theoretical knowledge without practical application fails in real-world scenarios. Below, we dissect your options, focusing on the &lt;strong&gt;Train with Shubham&lt;/strong&gt; course and alternatives, using a &lt;em&gt;mechanistic lens&lt;/em&gt; to evaluate effectiveness.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Evaluating "Train with Shubham": Mechanism and Fit
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Train with Shubham&lt;/strong&gt; course excels in &lt;em&gt;CI/CD pipelines and Kubernetes&lt;/em&gt;, critical for modern DevOps. Its &lt;strong&gt;end-to-end labs&lt;/strong&gt; simulate production environments, addressing the &lt;em&gt;automation bottleneck&lt;/em&gt;—a common failure point where manual configurations lead to state drift and dependency conflicts. For example, misconfigured Kubernetes manifests cause runtime errors, which Shubham’s labs explicitly target through hands-on debugging.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Real-world projects (e.g., GitOps workflows with ArgoCD)&lt;/li&gt;
&lt;li&gt;Active Discord community for structured feedback loops&lt;/li&gt;
&lt;li&gt;Instructor credibility (Shubham’s production experience in Kubernetes)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Weaknesses:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Limited Azure/GCP coverage, risking &lt;em&gt;multi-cloud misalignment&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;No integrated FinOps modules, leaving a &lt;em&gt;cost optimization gap&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Decision Rule:&lt;/strong&gt; If your goal is &lt;em&gt;Kubernetes and CI/CD mastery&lt;/em&gt;, choose Shubham. However, supplement with multi-cloud resources (e.g., A Cloud Guru) to avoid vendor lock-in.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Alternative Paths: Comparative Analysis
&lt;/h3&gt;

&lt;p&gt;Alternatives like &lt;strong&gt;A Cloud Guru’s DevOps path&lt;/strong&gt; or &lt;strong&gt;Udemy courses&lt;/strong&gt; must be evaluated against &lt;em&gt;system mechanisms&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A Cloud Guru:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Advantage:&lt;/strong&gt; Broader multi-cloud content (AWS, Azure, GCP), addressing &lt;em&gt;vendor-neutral goals&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disadvantage:&lt;/strong&gt; Less hands-on than Shubham; forums provide slower feedback, increasing risk of &lt;em&gt;misinterpretation&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Udemy:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Risk:&lt;/strong&gt; Theory-heavy courses lack &lt;em&gt;practical projects&lt;/em&gt;, leading to brittle skills that fail under pressure (e.g., CI/CD pipelines without monitoring stages)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge Case:&lt;/strong&gt; Budget-friendly but requires supplementation with open-source contributions to simulate structured learning&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optimal Choice:&lt;/strong&gt; For &lt;em&gt;Kubernetes/CI/CD focus&lt;/em&gt;, Shubham dominates. For &lt;em&gt;multi-cloud architecture&lt;/em&gt;, A Cloud Guru is superior. Avoid Udemy unless supplemented with GitHub projects to address &lt;em&gt;fragmented learning&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Edge Cases: Budget Constraints and Vendor-Neutral Goals
&lt;/h3&gt;

&lt;p&gt;If budget is a constraint, &lt;strong&gt;free resources&lt;/strong&gt; like AWS re:Start or Kubernetes GitHub issues can work, but they lack &lt;em&gt;structured feedback loops&lt;/em&gt;. The &lt;strong&gt;mechanism of failure&lt;/strong&gt; here is fragmented learning, where knowledge isn’t integrated into a cohesive framework. To mitigate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Combine free resources with &lt;em&gt;open-source contributions&lt;/em&gt; (e.g., fixing Kubernetes issues)&lt;/li&gt;
&lt;li&gt;Use Shubham’s free YouTube content for foundational CI/CD concepts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; If budget is limited, use free resources plus open-source contributions to simulate structured learning. Without that feedback loop, you risk &lt;em&gt;skill fragmentation&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Long-Term Strategy: Portfolio vs. Certifications
&lt;/h3&gt;

&lt;p&gt;Certifications (e.g., AWS Certified DevOps Engineer) signal baseline knowledge but don’t replace &lt;em&gt;practical skills&lt;/em&gt;. The &lt;strong&gt;mechanism&lt;/strong&gt; is that certifications often test theoretical understanding, while employers prioritize &lt;em&gt;portfolio projects&lt;/em&gt; demonstrating real-world problem-solving. For example, a CI/CD pipeline with integrated security (Terraform + Sentinel) is more impactful than a certification badge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; If your goal is &lt;em&gt;immediate job placement&lt;/em&gt;, prioritize certifications. For &lt;em&gt;long-term career growth&lt;/em&gt;, build a portfolio with end-to-end projects (e.g., multi-cloud deployment with FinOps automation).&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion: Dominant Strategy Selection
&lt;/h3&gt;

&lt;p&gt;The optimal path depends on your &lt;em&gt;goal mechanism&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;If your goal is Kubernetes/CI/CD mastery&lt;/strong&gt; → &lt;strong&gt;Train with Shubham, plus A Cloud Guru for multi-cloud&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If budget is the constraint&lt;/strong&gt; → &lt;strong&gt;free resources plus open-source contributions&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you are optimizing for long-term growth&lt;/strong&gt; → &lt;strong&gt;portfolio-focused learning with end-to-end projects&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Avoid typical errors like &lt;em&gt;over-specialization&lt;/em&gt; (e.g., AWS-only courses) or &lt;em&gt;theory-heavy learning&lt;/em&gt;. Continuously evolve your strategy as cloud technologies advance, ensuring alignment with both industry demands and your career trajectory.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>learning</category>
      <category>automation</category>
    </item>
    <item>
      <title>Your Startup Isn’t Confusing, Your Documentation Is (Here’s How to Fix It)</title>
      <dc:creator>Ezejah Chimkamma</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:33:16 +0000</pubDate>
      <link>https://core.forem.com/ezejah_chimkamma_06758a9b/your-startup-isnt-confusing-your-documentation-is-heres-how-to-fix-it-4o1b</link>
      <guid>https://core.forem.com/ezejah_chimkamma_06758a9b/your-startup-isnt-confusing-your-documentation-is-heres-how-to-fix-it-4o1b</guid>
      <description>&lt;p&gt;Most startups don’t have a product problem.&lt;/p&gt;

&lt;p&gt;They have a clarity problem.&lt;/p&gt;

&lt;p&gt;You built something powerful.&lt;br&gt;
Something useful.&lt;br&gt;
Something people should understand.&lt;/p&gt;

&lt;p&gt;But they don’t.&lt;/p&gt;

&lt;p&gt;Not because they’re not smart,&lt;br&gt;
but because your documentation is doing a poor job explaining it.&lt;/p&gt;

&lt;p&gt;And that’s costing you users.&lt;/p&gt;

&lt;p&gt;🚨 The Silent Killer: Bad Documentation&lt;/p&gt;

&lt;p&gt;Here’s what’s happening behind the scenes:&lt;/p&gt;

&lt;p&gt;Users sign up&lt;br&gt;
They get confused&lt;br&gt;
They leave quietly&lt;/p&gt;

&lt;p&gt;No complaints. No feedback. Just… gone.&lt;/p&gt;

&lt;p&gt;And you think:&lt;/p&gt;

&lt;p&gt;“Maybe the product needs more features”&lt;/p&gt;

&lt;p&gt;It doesn’t.&lt;/p&gt;

&lt;p&gt;It needs better explanation.&lt;/p&gt;

&lt;p&gt;⚠️ Mistake #1: You’re Writing for Yourself, Not the User&lt;/p&gt;

&lt;p&gt;Most startup documentation sounds like this:&lt;/p&gt;

&lt;p&gt;“Initialize the configuration by executing the required environment parameters…”&lt;/p&gt;

&lt;p&gt;That’s not helpful.&lt;/p&gt;

&lt;p&gt;Your users are not inside your head.&lt;br&gt;
They don’t know your system like you do.&lt;/p&gt;

&lt;p&gt;✅ Fix:&lt;/p&gt;

&lt;p&gt;Write like you’re explaining to a smart beginner.&lt;/p&gt;

&lt;p&gt;“Start by setting up your environment variables. This tells the system how to run your app properly.”&lt;/p&gt;

&lt;p&gt;Simple. Clear. Human.&lt;/p&gt;

&lt;p&gt;⚠️ Mistake #2: You Skip the “Why”&lt;/p&gt;

&lt;p&gt;You explain what to do…&lt;br&gt;
But not why it matters.&lt;/p&gt;

&lt;p&gt;So users follow steps blindly — or worse, they stop trying.&lt;/p&gt;

&lt;p&gt;✅ Fix:&lt;/p&gt;

&lt;p&gt;Always answer:&lt;/p&gt;

&lt;p&gt;“Why should I care about this step?”&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;“This step connects your app to the database, so your data can be stored and retrieved.”&lt;/p&gt;

&lt;p&gt;Now it makes sense.&lt;/p&gt;

&lt;p&gt;⚠️ Mistake #3: No Onboarding Flow&lt;/p&gt;

&lt;p&gt;You drop users into documentation like:&lt;/p&gt;

&lt;p&gt;“Here’s everything. Good luck.”&lt;/p&gt;

&lt;p&gt;That’s overwhelming.&lt;/p&gt;

&lt;p&gt;✅ Fix:&lt;/p&gt;

&lt;p&gt;Guide them step-by-step:&lt;/p&gt;

&lt;p&gt;What to do first&lt;br&gt;
What comes next&lt;br&gt;
What success looks like&lt;/p&gt;

&lt;p&gt;Make them feel progress.&lt;/p&gt;

&lt;p&gt;⚠️ Mistake #4: Too Technical or Too Vague&lt;/p&gt;

&lt;p&gt;You either:&lt;/p&gt;

&lt;p&gt;Overcomplicate everything&lt;br&gt;
OR&lt;br&gt;
Say things that mean nothing&lt;/p&gt;

&lt;p&gt;Both are dangerous.&lt;/p&gt;

&lt;p&gt;✅ Fix:&lt;/p&gt;

&lt;p&gt;Be specific, but clear.&lt;/p&gt;

&lt;p&gt;Bad:&lt;/p&gt;

&lt;p&gt;“Optimize your configuration”&lt;/p&gt;

&lt;p&gt;Better:&lt;/p&gt;

&lt;p&gt;“Reduce API response time by caching repeated requests”&lt;/p&gt;
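
&lt;p&gt;Specific advice like that is also small enough to show. A minimal time-based cache sketch, assuming a simplified decorator; a real service would also bound the cache size and handle keyword arguments.&lt;/p&gt;

```python
import time

def cached(ttl_seconds: float):
    """Decorator: reuse a result for ttl_seconds instead of recomputing it."""
    def decorator(fn):
        store = {}
        def wrapper(*args):
            now = time.monotonic()
            if args in store:
                value, stored_at = store[args]
                if ttl_seconds > now - stored_at:
                    return value  # cache hit: skip the slow call
            value = fn(*args)
            store[args] = (value, now)
            return value
        return wrapper
    return decorator

@cached(ttl_seconds=60)
def fetch_profile(user_id: str) -> dict:
    # stands in for a slow upstream API request
    return {"id": user_id}
```

&lt;p&gt;Docs that pair the sentence with a snippet like this give users something to copy, not just something to nod at.&lt;/p&gt;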

&lt;p&gt;💡 Here’s the Truth Most Startups Miss&lt;/p&gt;

&lt;p&gt;Good documentation is not “extra work”&lt;/p&gt;

&lt;p&gt;It’s:&lt;/p&gt;

&lt;p&gt;Better onboarding&lt;br&gt;
Fewer support requests&lt;br&gt;
Higher user retention&lt;/p&gt;

&lt;p&gt;It’s the difference between:&lt;br&gt;
👉 A product people try&lt;br&gt;
👉 And a product people actually use&lt;/p&gt;

&lt;p&gt;👋 Final Thought&lt;/p&gt;

&lt;p&gt;If users don’t understand your product,&lt;br&gt;
they won’t use it, no matter how good it is.&lt;/p&gt;

&lt;p&gt;Clarity is not optional.&lt;br&gt;
It’s part of the product.&lt;/p&gt;

&lt;p&gt;🚀 If this sounds familiar…&lt;/p&gt;

&lt;p&gt;If you’re building a product and your users struggle to understand how it works, I help startups turn complex systems into clear, user-friendly documentation and onboarding.&lt;/p&gt;

</description>
      <category>devrel</category>
      <category>saas</category>
      <category>product</category>
      <category>startup</category>
    </item>
    <item>
      <title>How AI Engineers Actually Use Datasets: Test Cases, Edge Cases and Agent Reliability</title>
      <dc:creator>Kalio Princewill</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:32:46 +0000</pubDate>
      <link>https://core.forem.com/kalio/how-ai-engineers-actually-use-datasets-test-cases-edge-cases-35gf</link>
      <guid>https://core.forem.com/kalio/how-ai-engineers-actually-use-datasets-test-cases-edge-cases-35gf</guid>
      <description>&lt;p&gt;Most AI agent discussions focus on models. In practice, the model is rarely the problem.&lt;/p&gt;

&lt;p&gt;When you build an agent today you are almost certainly not training it. The model is fixed. What determines whether the agent actually works is everything around it: the tools it can call, the prompts that guide it, the logic that decides what it does next.&lt;/p&gt;

&lt;p&gt;So when people say "we need more data," they usually do not mean training. They mean better test cases, clearer failure scenarios, and a way to measure whether the agent is behaving correctly.&lt;/p&gt;

&lt;p&gt;This article breaks down how to evaluate an AI agent properly: what to test, how to structure realistic scenarios from real world data, how to score the path the agent takes not just the answer it lands on, and how to design adversarial tests that force actual reasoning instead of pattern matching.&lt;/p&gt;

&lt;p&gt;Using SRE agents as the concrete example throughout.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Are Not Doing vs What You Are
&lt;/h2&gt;

&lt;p&gt;Before anything else, this distinction matters.&lt;/p&gt;

&lt;p&gt;What you are NOT doing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Feeding logs into the model to teach it new things&lt;/li&gt;
&lt;li&gt;Fine tuning weights&lt;/li&gt;
&lt;li&gt;Changing how the underlying LLM reasons&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What you ARE doing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using real world logs to construct realistic test scenarios&lt;/li&gt;
&lt;li&gt;Grading whether the agent investigates correctly&lt;/li&gt;
&lt;li&gt;Exposing edge cases the agent currently fails at&lt;/li&gt;
&lt;li&gt;Using those failures to improve prompts, tools, and agent logic&lt;/li&gt;
&lt;li&gt;Building a test suite that gets harder as the agent gets better&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model does not improve through this process. What improves is the system around it. Test cases are how you measure that system rigorously instead of guessing. With that clear, the next question is what you are actually grading.&lt;/p&gt;
&lt;h2&gt;
  
  
  What You Are Actually Testing
&lt;/h2&gt;

&lt;p&gt;When you test an AI agent you are not checking if the model knows things. The model already knows things. You are checking three specific behaviours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does the agent pick the right tools in the right order?&lt;/strong&gt;&lt;br&gt;
Given a scenario, does it investigate correctly or does it jump straight to conclusions?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does it stop at the right time?&lt;/strong&gt;&lt;br&gt;
Does it know when it has found the root cause and stop, or does it keep going in circles?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can it reason through noise?&lt;/strong&gt;&lt;br&gt;
If there are red herrings in the data, metrics that look suspicious but are not causal, does it get distracted or stay on the right path?&lt;/p&gt;

&lt;p&gt;These are behaviours you grade. Not things you train. And the clearest way to see them in practice is to look at a real agent being built against exactly these constraints.&lt;/p&gt;
&lt;h2&gt;
  
  
  The SRE Agent As A Case Study
&lt;/h2&gt;

&lt;p&gt;An SRE (Site Reliability Engineering) agent is one that investigates production incidents automatically: it gets an alert, pulls logs and metrics, reasons across the signals, and produces a root cause report.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/Tracer-Cloud/opensre/tree/main/tests" rel="noopener noreferrer"&gt;OpenSRE project&lt;/a&gt; is a good concrete example of this in practice. Their test suite lives in &lt;code&gt;tests/e2e/&lt;/code&gt; and covers Kubernetes and RDS Postgres scenarios. They are building a suite of realistic incident scenarios and checking whether the agent handles them correctly.&lt;/p&gt;

&lt;p&gt;You can run the agent directly against a test fixture like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;opensre investigate &lt;span class="nt"&gt;-i&lt;/span&gt; tests/e2e/kubernetes/fixtures/datadog_k8s_alert.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That JSON fixture is a synthetic but realistic alert, constructed to represent a specific failure mode with logs, metrics, and context included. The agent runs against it and you check whether the investigation was correct. That is the entire idea. Now let us look at what that fixture actually contains.&lt;/p&gt;

&lt;h2&gt;
  
  
  What A Test Case Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;A test case has four parts: the input the agent sees, the steps you expect it to take, the answer you expect it to reach, and the red herrings it should notice but not chase.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;TEST_CASE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scenario&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RDS Postgres connection pool exhaustion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            2024-01-15 14:23:01 UTC [FATAL] remaining connection slots reserved for
            non-replication superuser connections
            2024-01-15 14:23:01 UTC [ERROR] connection to server failed: FATAL:
            sorry, too many clients already
            2024-01-15 14:23:04 UTC [WARNING] pool wait time exceeded 10000ms
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_connections&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;498&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_connections_max&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# elevated but not the cause
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;      &lt;span class="c1"&gt;# fine
&lt;/span&gt;        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alert&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Database latency spike - P95 latency exceeded 4000ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_db_connections&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_active_queries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_pool_configuration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recommend_pool_size_increase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_root_cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;connection pool exhaustion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;red_herrings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;elevated but stable, not the cause of latency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;should_stop_after&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_pool_configuration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The logs and metrics are what the agent sees. The expected steps define the correct investigation path. The red herrings flag what the agent should notice but not chase. The stop condition catches agents that keep digging after the answer is already clear.&lt;/p&gt;

&lt;p&gt;The agent runs against this input and you grade whether it got the right root cause, investigated in the right order, and did not get pulled off track by the CPU metric.&lt;/p&gt;
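
&lt;p&gt;Ordering and stopping can be graded from the step list, but the &lt;code&gt;red_herrings&lt;/code&gt; field needs its own check: the agent should mention the distractor metric, and should not name it as the cause. A crude keyword heuristic as a sketch; this is not part of OpenSRE, and real graders often use an LLM judge instead of string matching.&lt;/p&gt;

```python
def grade_red_herrings(report: str, red_herrings: dict) -> list[str]:
    """Return the red-herring metrics the agent handled badly.

    Pass criteria per metric: it appears in the report (the agent noticed it)
    and is not blamed as the root cause. Keyword matching is a crude
    stand-in for an LLM-judge grader."""
    failures = []
    for metric in red_herrings:
        mentioned = metric in report
        blamed = f"root cause: {metric}" in report or f"caused by {metric}" in report
        if blamed or not mentioned:
            failures.append(metric)
    return failures
```

&lt;p&gt;An agent that never mentions &lt;code&gt;cpu_percent&lt;/code&gt; did not really weigh the evidence; one that blames it got distracted. Both count as failures.&lt;/p&gt;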

&lt;h2&gt;
  
  
  Trajectory Scoring
&lt;/h2&gt;

&lt;p&gt;Once you have test cases, you need a way to score them. Getting the right answer is not enough. You want to know if the agent got there the right way.&lt;/p&gt;

&lt;p&gt;This matters in practice because an agent that stumbles onto the correct answer after checking ten irrelevant things is not a reliable agent. It got lucky. Trajectory scoring measures the investigation path, not just the conclusion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score_trajectory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual_steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;expected_steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;penalties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual_steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;expected_steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;expected_position&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;expected_steps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;actual_position&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
            &lt;span class="c1"&gt;# penalise for investigating out of order
&lt;/span&gt;            &lt;span class="n"&gt;position_penalty&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_position&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;actual_position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;
            &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;position_penalty&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;penalties&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unexpected step taken: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# penalise for not stopping when root cause was found
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual_steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;penalties&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent continued investigating after root cause was clear&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_steps&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;penalties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;penalties&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# example usage
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;investigate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TEST_CASE&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;score_trajectory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;actual_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;steps_taken&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TEST_CASE&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# {"score": 0.85, "max_score": 1.0, "penalties": [], "passed": True}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function takes two lists: the steps the agent actually took, and the steps you expected it to take.&lt;br&gt;
For each step the agent took, it checks two things: was this step in the expected list at all, and if so, did it happen at roughly the right point in the investigation? If the agent checked check_db_connections first and that was expected first, full credit. If it checked it third when it should have been first, it is penalised proportionally. After scoring the steps, the function checks whether the agent kept going past the point where it should have stopped.&lt;/p&gt;
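That position check can be sketched as a standalone function. The function name, step names, and the half-credit rule here are illustrative assumptions, not the exact implementation above:

```python
def score_step_positions(actual_steps, expected_steps, position_tolerance=1):
    """Credit each expected step the agent took, penalising out-of-order steps.

    Sketch only: full credit when a step lands within `position_tolerance`
    of its expected index, half credit when it happened at the wrong point.
    """
    score = 0.0
    for i, step in enumerate(actual_steps):
        if step not in expected_steps:
            continue  # unexpected steps earn nothing in this sketch
        expected_index = expected_steps.index(step)
        if abs(i - expected_index) <= position_tolerance:
            score += 1.0  # right step at roughly the right time
        else:
            score += 0.5  # right step, wrong point in the investigation
    return round(score / len(expected_steps), 2)

print(score_step_positions(
    ["check_pod_events", "check_memory_usage", "check_memory_limits"],
    ["check_pod_events", "check_memory_usage", "check_memory_limits"],
))
# 1.0
```

Tuning `position_tolerance` controls how strictly ordering matters; a tolerance of 1 forgives adjacent swaps while still punishing an agent that saved the decisive check for last.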

&lt;p&gt;Trajectory scoring handles well-labelled scenarios where you know the expected path. But what happens when the scenario is deliberately designed to mislead?&lt;/p&gt;
&lt;h2&gt;
  
  
  Adversarial Tests: Forcing The Agent To Reason, Not Pattern Match
&lt;/h2&gt;

&lt;p&gt;Standard test cases check whether the agent handles known scenarios correctly. Adversarial tests go further. They check whether the agent actually reasons or just pattern matches.&lt;br&gt;
The difference matters because production incidents do not arrive cleanly. They arrive with noise, misleading signals, and symptoms that point in the wrong direction. An agent that pattern matches will chase the loudest signal. An agent that reasons will trace the causal chain.&lt;br&gt;
Adversarial tests deliberately inject red herrings to expose which one you have built:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ADVERSARIAL_TEST&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scenario&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Kubernetes OOMKilled - misleading CPU spike&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            2024-01-15 09:15:22 UTC [WARNING] Container memory usage at 94%
            2024-01-15 09:15:45 UTC [ERROR] OOMKilled: container exceeded memory limit
            2024-01-15 09:15:45 UTC [INFO] Pod restarting...
            2024-01-15 09:16:01 UTC [WARNING] CPU throttling detected on node
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu_throttling_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# looks alarming, is a red herring
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory_usage_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;94&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory_limit_mb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pod_restarts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alert&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;High CPU throttling detected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_root_cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;container OOMKilled due to insufficient memory limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;red_herring&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu_throttling looks like the main issue but is a downstream symptom&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_pod_events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_memory_usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_memory_limits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recommend_memory_limit_increase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="c1"&gt;# agent should NOT go down this path
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incorrect_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_cpu_throttling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recommend_cpu_limit_increase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The alert fires on CPU throttling. A pattern-matching agent sees 78% throttling and immediately recommends a CPU limit increase. That is wrong. The CPU throttling is a downstream symptom of the OOMKill restart loop. The real problem is the memory limit being too low. An agent that reasons traces from the OOMKill event back to the memory configuration and stops there.&lt;br&gt;
Now you understand what a test case is, how to score it, and how to stress test the agent against misleading signals. Before you start writing your own, it is worth looking at how others have structured theirs.&lt;/p&gt;
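One way to grade the adversarial case is an explicit penalty for every step the agent took on the known-wrong path. A minimal sketch, reusing the `incorrect_path` field from the test case above (the penalty weight is an assumption):

```python
def red_herring_penalty(actual_steps, incorrect_path, penalty_per_step=0.25):
    """Measure how far the agent wandered down the known-wrong path.

    Sketch only: each incorrect-path step the agent actually took
    costs `penalty_per_step` off the trajectory score.
    """
    chased = [step for step in actual_steps if step in incorrect_path]
    return {
        "chased_red_herring": bool(chased),
        "wrong_steps": chased,
        "penalty": round(len(chased) * penalty_per_step, 2),
    }

# A pattern-matching agent that chased the CPU signal:
result = red_herring_penalty(
    ["check_cpu_throttling", "recommend_cpu_limit_increase"],
    ["check_cpu_throttling", "recommend_cpu_limit_increase"],
)
print(result["penalty"])  # 0.5
```

Subtracting that penalty from the trajectory score means a reasoning agent that ignored the CPU alert scores strictly higher than one that fixated on it.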
&lt;h2&gt;
  
  
  Looking At Existing Test Suites Before Building Your Own
&lt;/h2&gt;

&lt;p&gt;Before writing test cases from scratch, look at what others have already built. The structure is often more instructive than the content itself.&lt;br&gt;
The OpenSRE test suite separates scenarios by domain, with fixture files containing realistic alert payloads. Reading those fixtures before writing your own will save you several wrong turns: they show what a well-structured test case should actually contain, which fields matter, how much context to include, and how to frame the expected behaviour clearly enough to grade against.&lt;br&gt;
Two other eval suites worth studying for the structural pattern regardless of your domain:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.swebench.com/SWE-bench/guides/datasets/" rel="noopener noreferrer"&gt;SWE-bench&lt;/a&gt;&lt;/strong&gt;: how Princeton structured software engineering task evals for coding agents. The input, expected output, graded result pattern maps directly to any agent domain.&lt;br&gt;
&lt;strong&gt;&lt;a href="https://github.com/THUDM/AgentBench" rel="noopener noreferrer"&gt;AgentBench&lt;/a&gt;&lt;/strong&gt;: benchmark for LLM agents across different environments including OS tasks, database interactions, and web browsing. Useful for seeing how grading works across different action spaces and how to think about pass criteria when the action space is open-ended.&lt;/p&gt;

&lt;p&gt;The pattern across all of them is the same: realistic input, defined expected behaviour, graded output. Once you have that pattern clear in your head, the fastest way to build your own scenarios is synthetic.&lt;/p&gt;
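That shared pattern can be captured in a small schema so every suite grades the same way regardless of domain. The field names below are an assumption for illustration, not taken from any of the benchmarks above:

```python
from dataclasses import dataclass, field

@dataclass
class AgentTestCase:
    """Realistic input, defined expected behaviour, graded output."""
    scenario: str
    input: dict                  # logs, metrics, alerts - whatever the agent sees
    expected_root_cause: str     # ground truth to grade the conclusion against
    expected_steps: list = field(default_factory=list)
    pass_threshold: float = 0.8  # minimum trajectory score to count as passed

case = AgentTestCase(
    scenario="RDS latency spike",
    input={"metrics": {"latency_ms": 45000}},
    expected_root_cause="missing index causing full table scan",
    expected_steps=["check_slow_queries", "check_query_plans"],
)
print(case.pass_threshold)  # 0.8
```

Whether the domain is incident response or support tickets, only the contents of `input` and `expected_steps` change; the grading harness stays identical.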
&lt;h2&gt;
  
  
  Synthetic Data: What It Is and When It Plateaus
&lt;/h2&gt;

&lt;p&gt;Synthetic test cases are ones you construct yourself. You write the logs, set the metrics, define the expected answer. You control everything.&lt;br&gt;
This is the right place to start. It is fast, you can cover specific failure modes methodically, and you can design adversarial cases precisely because you decide what the red herrings are.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_synthetic_rds_scenario&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failure_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;templates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;slow_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duration: 45231 ms statement: SELECT * FROM orders WHERE status=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pending&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;autovacuum: found 80000 dead row versions in table orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_connections&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;45000&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;red_herrings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_connections&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;normal range, not the cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;root_cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;missing index causing full table scan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_slow_queries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_query_plans&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recommend_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replication_lag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replication slot lag: 8GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WAL sender process waiting for WAL to be archived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replication_lag_bytes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8589934592&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;disk_io&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;red_herrings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low, not relevant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;root_cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replication lag due to WAL accumulation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_replication_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_wal_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_replica_health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;templates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failure_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The limitation is that synthetic data plateaus. As you add more scenarios the agent improves, but the gains flatten over time. The reason is structural: every scenario you write comes from your own mental model of what can go wrong. You can only write what you can imagine, which means every edge case your synthetic suite does not cover is an edge case your agent has never been tested against.&lt;/p&gt;

&lt;p&gt;When gains plateau you have two options: build a different synthetic generator that introduces genuinely new failure patterns, or bring in real world data. Real world cases expose failure modes you never thought to write because they actually happened to someone. That is where the next section comes in.&lt;/p&gt;
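One cheap way to stretch the plateau before reaching for real data is to randomise the templates you already have, so a single hand-written scenario yields many distinct inputs. A sketch, assuming scenarios shaped like the generator's output above (the spread value is arbitrary):

```python
import random

def jitter_scenario(scenario, spread=0.15, seed=None):
    """Return a copy of a synthetic scenario with metrics perturbed by up to +/-spread.

    Sketch only: the root cause and expected steps stay fixed while the
    numbers vary, so the agent never sees identical values twice.
    """
    rng = random.Random(seed)
    varied = dict(scenario)
    varied["metrics"] = {
        key: round(value * (1 + rng.uniform(-spread, spread)), 2)
        for key, value in scenario.get("metrics", {}).items()
    }
    return varied

base = {
    "metrics": {"cpu": 35, "latency_ms": 45000},
    "root_cause": "missing index causing full table scan",
}
print(jitter_scenario(base, seed=42)["root_cause"])
# missing index causing full table scan
```

This only perturbs numbers, not failure structure, so it delays the plateau rather than escaping it; genuinely new failure patterns still require a new template or real-world data.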

&lt;h2&gt;
  
  
  Where To Get Real World Data For Test Cases
&lt;/h2&gt;

&lt;p&gt;To be clear again: you are not feeding these datasets into the model. You are reading them, understanding the failure patterns, and constructing test cases that reflect what actually happens in production.&lt;br&gt;
The SRE agent is the example we have been using throughout, but the same approach applies to any domain. If you are building an agent that handles customer support tickets, database query optimisation, fraud detection, or any other domain with structured inputs and measurable outcomes, the same process applies: find a labeled dataset in your domain, understand the failure patterns, and turn them into test cases. Here is a list of good dataset sites:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://datasetsearch.research.google.com/" rel="noopener noreferrer"&gt;Google Dataset Search&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
A search engine specifically for datasets. Search for what your agent handles in your domain: "customer support tickets", "financial transactions", "medical records".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.kaggle.com/datasets" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
A large public dataset repository with a lot of labeled data across many domains. It covers finance, healthcare, e-commerce, NLP, and more. Many Kaggle datasets include notebooks showing how others have analysed them, which makes it easier to understand failure patterns before writing test cases.&lt;/p&gt;

&lt;p&gt;Once you find a dataset, you load and filter it to find the failure windows, the rows where something actually went wrong.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# load a labeled anomaly dataset
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;server_machine_dataset.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# filter for labeled failure windows
&lt;/span&gt;&lt;span class="n"&gt;failure_window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# the metrics leading up to the failure become your test case input
# the label tells you when the failure occurred
# you construct the expected root cause from the dataset documentation
&lt;/span&gt;
&lt;span class="n"&gt;test_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scenario&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;server anomaly from SMD dataset machine-1-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;failure_window&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;network_in&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;network_out&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_root_cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;network saturation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# from dataset docs
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_network_throughput&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_active_connections&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;identify_traffic_source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One thing worth noting: the expected root cause is not something you derive from the data itself. It comes from the dataset documentation, which for published research datasets will describe what each labeled failure actually was. That documentation is the ground truth your test case is built on. Without it you have inputs but no correct answers to grade against, which means you have data but not a test suite.&lt;br&gt;
The goal across all of this is not to teach the agent. It is to know, with confidence, whether the agent you have built is reliable enough to trust in production.&lt;/p&gt;
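In practice that confidence question reduces to a pass rate over the whole suite. A minimal sketch, assuming each grader returns a dict with a boolean "passed" field like the scorers earlier in the article:

```python
def suite_pass_rate(results):
    """Aggregate graded results into a single reliability number.

    Sketch only: `results` is a list of dicts, each with a "passed" key.
    """
    if not results:
        return 0.0
    passed = sum(1 for r in results if r.get("passed"))
    return round(passed / len(results), 2)

results = [
    {"score": 0.85, "passed": True},
    {"score": 0.40, "passed": False},
    {"score": 0.92, "passed": True},
]
print(suite_pass_rate(results))  # 0.67
```

Tracking this number per release, split by standard versus adversarial cases, tells you whether the agent is getting more reliable or just better at the easy scenarios.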

&lt;h2&gt;
  
  
  More Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Tracer-Cloud/opensre/tree/main/tests" rel="noopener noreferrer"&gt;OpenSRE test suite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://datasetsearch.research.google.com" rel="noopener noreferrer"&gt;Google Dataset Search&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/datasets" rel="noopener noreferrer"&gt;Kaggle Datasets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/princeton-nlp/SWE-bench" rel="noopener noreferrer"&gt;SWE-bench&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/THUDM/AgentBench" rel="noopener noreferrer"&gt;AgentBench&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>A New opensource Security AI model being built.</title>
      <dc:creator>Joe Munene</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:31:46 +0000</pubDate>
      <link>https://core.forem.com/ghost_gi_m/a-new-opensource-security-ai-model-being-built-20de</link>
      <guid>https://core.forem.com/ghost_gi_m/a-new-opensource-security-ai-model-being-built-20de</guid>
      <description>&lt;h1&gt;
  
  
  I Built an Open-Source Cybersecurity LLM From Scratch in Python
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;What if you could build your own AI model — not fine-tune someone else's, not wrap an API — but actually build a transformer from scratch and train it on cybersecurity data?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's exactly what I did. And I'm releasing it under Apache 2.0 so anyone can use it, improve it, and build on it.&lt;/p&gt;

&lt;p&gt;Meet &lt;strong&gt;GhostLM&lt;/strong&gt; — an open-source, cybersecurity-focused language model built entirely from scratch in PyTorch. No pretrained weights. No wrappers. Every single component written by hand.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/joemunene-by/GhostLM" rel="noopener noreferrer"&gt;https://github.com/joemunene-by/GhostLM&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Built GhostLM
&lt;/h2&gt;

&lt;p&gt;Here's the thing about current AI models: they're incredibly powerful, but they weren't built for security. When you ask GPT-4 about a CVE vulnerability or a CTF challenge, it gives you a reasonable answer — but it's reasoning from general knowledge, not from deep security context.&lt;/p&gt;

&lt;p&gt;I wanted a model that actually &lt;em&gt;understands&lt;/em&gt; cybersecurity language — the patterns, the terminology, the attack methodologies. And I wanted to build it myself, not because I thought I could out-engineer OpenAI, but because &lt;strong&gt;the best way to understand how something works is to build it from the ground up.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My goal was simple: create the first open-source, cybersecurity-focused language model that anyone can run, inspect, and improve.&lt;/p&gt;




&lt;h2&gt;
  
  
  What GhostLM Is
&lt;/h2&gt;

&lt;p&gt;GhostLM is a decoder-only transformer language model — the same architecture family as GPT-2, GPT-3, and Llama — but built entirely from scratch. No &lt;code&gt;transformers.AutoModel&lt;/code&gt;, no &lt;code&gt;from_pretrained()&lt;/code&gt;. Just raw PyTorch tensors and matrix multiplications.&lt;/p&gt;

&lt;p&gt;It comes in three sizes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Layers&lt;/th&gt;
&lt;th&gt;Dim&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ghost-tiny&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;~14.5M&lt;/td&gt;
&lt;td&gt;✅ Trained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ghost-small&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;~55M&lt;/td&gt;
&lt;td&gt;🔄 Planned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ghost-medium&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;768&lt;/td&gt;
&lt;td&gt;~160M&lt;/td&gt;
&lt;td&gt;🔜 Future&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
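The parameter counts in the table can be sanity-checked with a back-of-envelope formula for GPT-2-style decoders. The vocabulary size (~50k) and context length (1024) below are assumptions, not GhostLM's actual configuration, so treat the result as a rough estimate only:

```python
def estimate_params(n_layers, d_model, vocab_size=50257, context_len=1024):
    """Rough GPT-2-style decoder parameter count (assumed config, sketch only)."""
    embeddings = vocab_size * d_model + context_len * d_model  # token + position
    per_block = 12 * d_model ** 2  # qkv + attention proj + two MLP matrices
    return embeddings + n_layers * per_block

print(f"{estimate_params(2, 256) / 1e6:.1f}M")  # 14.7M, close to ghost-tiny's ~14.5M
```

The estimate ignores biases and layer norms, and the exact totals for the larger variants depend on GhostLM's real tokenizer and context size.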

&lt;p&gt;It's trained on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CVE vulnerability descriptions&lt;/strong&gt; from the NVD database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CTF writeups&lt;/strong&gt; covering real challenge types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cybersecurity research papers&lt;/strong&gt; and abstracts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And it's fully open source under Apache 2.0.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Let me show you what "built from scratch" actually looks like.&lt;/p&gt;

&lt;h3&gt;
  
  
  Causal Self-Attention
&lt;/h3&gt;

&lt;p&gt;This is the core of every transformer. Here's GhostLM's implementation — no &lt;code&gt;F.scaled_dot_product_attention&lt;/code&gt;, no hidden magic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;size&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Combined QKV projection and split
&lt;/span&gt;    &lt;span class="n"&gt;qkv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;c_qkv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qkv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_heads&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Reshape to (B, n_heads, T, head_dim)
&lt;/span&gt;    &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_heads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_heads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_heads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Scaled dot-product attention
&lt;/span&gt;    &lt;span class="n"&gt;att&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Apply causal mask (lower triangular)
&lt;/span&gt;    &lt;span class="n"&gt;att&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;att&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;masked_fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;causal_mask&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-inf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Softmax + dropout + weighted sum
&lt;/span&gt;    &lt;span class="n"&gt;att&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;att&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;attn_dropout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;att&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;

    &lt;span class="c1"&gt;# Reassemble heads and project back
&lt;/span&gt;    &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;contiguous&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resid_dropout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;proj&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every line is intentional. The causal mask ensures the model can only attend to previous tokens (autoregressive). The attention weights are manually computed with the classic &lt;code&gt;QK^T / sqrt(d)&lt;/code&gt; formula.&lt;/p&gt;
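
&lt;p&gt;If the tensor shapes above are hard to follow, here is the same computation for a single head in plain NumPy (an illustrative sketch, not GhostLM's code):&lt;/p&gt;

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head causal attention: scores = QK^T / sqrt(d), masked, softmaxed."""
    T, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    # The upper triangle (future positions) gets -inf so softmax zeroes it out
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf
    # Numerically stable softmax over the last axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

&lt;p&gt;Each row of the returned weight matrix sums to 1, and every entry above the diagonal is exactly zero, so position 0 can only ever copy its own value vector. That is the autoregressive property the causal mask buys you.&lt;/p&gt;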

&lt;h3&gt;
  
  
  Transformer Block
&lt;/h3&gt;

&lt;p&gt;The block stacks attention and feed-forward layers with a pre-norm architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Pre-norm + self-attention with residual
&lt;/span&gt;    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;attn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ln_1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# Pre-norm + feed-forward with residual
&lt;/span&gt;    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ffn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ln_2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why pre-norm?&lt;/strong&gt; I chose pre-normalization (LayerNorm before each sub-layer) over post-norm because it's significantly more stable for training, especially on smaller models. The gradients flow more cleanly through the residual connections, and you don't need as careful a learning rate schedule.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weight Tying
&lt;/h3&gt;

&lt;p&gt;One optimization that saves ~25 million parameters: the output projection layer shares weights with the token embedding. Instead of learning two separate &lt;code&gt;vocab_size × d_model&lt;/code&gt; matrices, we learn one and reuse it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lm_head&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token_embedding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the same trick GPT-2 uses, and it works because the embedding and output projection are fundamentally doing the same thing — mapping between token space and hidden space.&lt;/p&gt;
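
&lt;p&gt;The savings are simply the size of the matrix you no longer duplicate: vocab_size × d_model. Assuming a GPT-2-style vocabulary of 50,257 tokens (the post doesn't state the tokenizer, so treat these numbers as estimates):&lt;/p&gt;

```python
vocab_size = 50257  # GPT-2 BPE vocabulary size (an assumption, not confirmed by the post)

for d_model in (256, 512, 768):
    saved = vocab_size * d_model  # one untied output-projection matrix eliminated
    print(f"d_model={d_model}: ~{saved / 1e6:.1f}M parameters saved")
```

&lt;p&gt;At that vocabulary, the ~25 million figure quoted above corresponds to the 512-wide ghost-small size; at ghost-tiny's width of 256 the tied head saves closer to ~13M.&lt;/p&gt;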




&lt;h2&gt;
  
  
  Training Data
&lt;/h2&gt;

&lt;p&gt;The data pipeline is one of the most important parts of any ML project. GhostLM's pipeline collects from three sources:&lt;/p&gt;

&lt;h3&gt;
  
  
  NVD CVE Descriptions (Real Data)
&lt;/h3&gt;

&lt;p&gt;I hit the National Vulnerability Database REST API directly — no HuggingFace dependency needed. Paginated requests with rate limiting, parsing nested JSON responses, extracting English descriptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://services.nvd.nist.gov/rest/json/cves/2.0?resultsPerPage=2000&amp;amp;startIndex=0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vulnerabilities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;cve_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cve&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# Keep the English entry (the API can return one description per language)&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cve&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;descriptions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lang&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gave me &lt;strong&gt;9,925 real CVE descriptions&lt;/strong&gt; — the kind of text that says &lt;em&gt;"A buffer overflow in the XYZ component allows remote attackers to execute arbitrary code via crafted input."&lt;/em&gt;&lt;/p&gt;
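
&lt;p&gt;The single-page snippet above is simplified; a full collector has to page through &lt;code&gt;totalResults&lt;/code&gt; and sleep between requests. A standard-library sketch of that loop (function names here are illustrative, not the ones in &lt;code&gt;data/collect.py&lt;/code&gt;):&lt;/p&gt;

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

NVD_API = "https://services.nvd.nist.gov/rest/json/cves/2.0"

def english_descriptions(vulnerabilities):
    """Extract the English description string from each NVD vulnerability record."""
    texts = []
    for item in vulnerabilities:
        for desc in item["cve"]["descriptions"]:
            if desc["lang"] == "en":
                texts.append(desc["value"])
    return texts

def fetch_all_cves(per_page=2000, delay=6.0):
    """Page through the NVD API, sleeping between requests to respect rate limits."""
    texts, start = [], 0
    while True:
        query = urlencode({"resultsPerPage": per_page, "startIndex": start})
        with urlopen(f"{NVD_API}?{query}", timeout=30) as resp:
            data = json.load(resp)
        texts.extend(english_descriptions(data["vulnerabilities"]))
        start += per_page
        if start >= data["totalResults"]:
            return texts
        time.sleep(delay)  # unauthenticated NVD clients are limited to roughly 5 requests per 30s
```

&lt;p&gt;The 6-second delay is a conservative default for unauthenticated access; an NVD API key raises the limit considerably.&lt;/p&gt;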

&lt;h3&gt;
  
  
  The Full Pipeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NVD API → 9,925 CVE descriptions (real)
Synthetic papers → 500 security research abstracts
Synthetic CTF writeups → 500 challenge solutions
─────────────────────────────────────────────────
Total: 10,925 records → ~490,532 tokens
Train: 10,378 | Validation: 547
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline handles text cleaning (unicode normalization, whitespace stripping, non-printable character removal), tokenization, chunking, and train/val splitting — all in &lt;code&gt;data/collect.py&lt;/code&gt;.&lt;/p&gt;
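
&lt;p&gt;Those cleaning steps are easy to sketch in a few lines; something along these lines (the authoritative version lives in &lt;code&gt;data/collect.py&lt;/code&gt;):&lt;/p&gt;

```python
import re
import unicodedata

def clean_text(text):
    """Unicode-normalize, drop non-printable characters, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)  # e.g. folds non-breaking spaces to plain spaces
    text = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t ")
    text = re.sub(r"[ \t]+", " ", text)    # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text) # cap consecutive blank lines
    return text.strip()
```

&lt;p&gt;Normalizing before filtering matters: NFKC turns many exotic whitespace and ligature characters into plain ASCII, so they survive the printable check instead of being silently dropped.&lt;/p&gt;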




&lt;h2&gt;
  
  
  Training Results
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. I trained ghost-tiny on a &lt;strong&gt;ThinkPad Yoga 11e with a Celeron N4100 and 4GB of RAM&lt;/strong&gt;. Yes, really.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loss Progression
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Steps&lt;/th&gt;
&lt;th&gt;Train Loss&lt;/th&gt;
&lt;th&gt;Val Loss&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;10.84&lt;/td&gt;
&lt;td&gt;10.04&lt;/td&gt;
&lt;td&gt;Random initialization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;7.12&lt;/td&gt;
&lt;td&gt;6.27&lt;/td&gt;
&lt;td&gt;First CVE patterns emerge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;td&gt;5.89&lt;/td&gt;
&lt;td&gt;5.41&lt;/td&gt;
&lt;td&gt;Starting to form sentences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2,000&lt;/td&gt;
&lt;td&gt;4.63&lt;/td&gt;
&lt;td&gt;4.58&lt;/td&gt;
&lt;td&gt;Grammar improving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3,000&lt;/td&gt;
&lt;td&gt;3.91&lt;/td&gt;
&lt;td&gt;3.95&lt;/td&gt;
&lt;td&gt;Security vocabulary appearing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4,000&lt;/td&gt;
&lt;td&gt;3.52&lt;/td&gt;
&lt;td&gt;3.58&lt;/td&gt;
&lt;td&gt;Coherent attack descriptions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5,000&lt;/td&gt;
&lt;td&gt;3.38&lt;/td&gt;
&lt;td&gt;3.46&lt;/td&gt;
&lt;td&gt;Best checkpoint saved&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The loss curve is healthy — train and validation are tracking closely, no signs of overfitting yet.&lt;/p&gt;
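
&lt;p&gt;Cross-entropy loss is easier to interpret as perplexity (e raised to the loss, roughly the model's effective branching factor per token):&lt;/p&gt;

```python
import math

# Validation losses from the table above
for step, val_loss in [(0, 10.04), (1000, 5.41), (3000, 3.95), (5000, 3.46)]:
    print(f"step {step:5d}: val loss {val_loss:5.2f} -> perplexity {math.exp(val_loss):9.1f}")
```

&lt;p&gt;Validation perplexity falls from roughly 23,000 at initialization to about 32 by step 5,000. As a sanity check, the step-0 train loss of 10.84 corresponds to a perplexity near 51,000, which is about what a uniform guess over a ~50k-token vocabulary would score: the model really does start from pure noise.&lt;/p&gt;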

&lt;h3&gt;
  
  
  Generation at 5,000 Steps
&lt;/h3&gt;

&lt;p&gt;Here's what the model generates when prompted with &lt;em&gt;"A SQL injection attack works by"&lt;/em&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A SQL injection attack works by using the admin_user sequences in the web server. Web Application Firewall Evasion Techniques present a critical defense layer against commercial and model checking. Our model achieves 94% detection rate with transformer-based sequence modeling to identify common vulnerability patterns including buffer overflows.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Is it perfect? No. It bleeds between topics (SQL injection → WAF → research paper language). But it's producing grammatically correct sentences with real security terminology. At 5,000 steps on a 14.5M parameter model running on a laptop from 2018, I'll take it.&lt;/p&gt;
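
&lt;p&gt;The post doesn't show the sampler, but output like the above is typically produced with temperature plus top-k sampling over the final logits. A minimal NumPy sketch (the parameter defaults are illustrative, not GhostLM's settings):&lt;/p&gt;

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=40, rng=None):
    """Temperature-scale the logits, keep only the top-k, then sample one token id."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k:
        cutoff = np.sort(logits)[-min(top_k, logits.size)]  # k-th largest logit
        logits = np.where(logits >= cutoff, logits, -np.inf)
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    return int(rng.choice(logits.size, p=probs))
```

&lt;p&gt;Lower temperatures sharpen the distribution (more repetitive but more on-topic text), and a smaller top_k trims the long tail of unlikely tokens, which often helps small models like ghost-tiny stay coherent.&lt;/p&gt;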

&lt;h3&gt;
  
  
  Honest Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Topic coherence&lt;/strong&gt; — the model jumps between subjects mid-generation. It needs more steps to learn to stay on topic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memorization&lt;/strong&gt; — some outputs are lifted nearly verbatim from training data. More diverse data would help.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size&lt;/strong&gt; — 14.5M params is tiny. ghost-small (55M) will be a significant jump.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU training&lt;/strong&gt; — at ~1.8s per step, 10,000 steps takes hours. GPU or TPU is needed for serious training.&lt;/li&gt;
&lt;/ul&gt;
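
&lt;p&gt;The CPU-training point is worth quantifying from the numbers above:&lt;/p&gt;

```python
step_time_s = 1.8  # measured per-step time on the Celeron N4100
steps = 10_000
hours = step_time_s * steps / 3600
print(f"{steps:,} steps x {step_time_s}s/step = {hours:.1f} hours")  # 5.0 hours
```

&lt;p&gt;Five hours of wall-clock time for 10,000 steps on a 14.5M-parameter model; even a modest GPU typically cuts that to minutes, which is why the compute plans below matter.&lt;/p&gt;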




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I've already applied for &lt;strong&gt;Google TPU Research Credits&lt;/strong&gt; to train ghost-small on proper hardware. The plan:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ghost-tiny to 10,000+ steps&lt;/strong&gt; — finish what I started&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ghost-small on TPU/GPU&lt;/strong&gt; — 55M params with real compute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace Hub release&lt;/strong&gt; — public model weights anyone can download&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live demo on HuggingFace Spaces&lt;/strong&gt; — try GhostLM in your browser&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark vs GPT-2&lt;/strong&gt; — objective comparison on cybersecurity tasks&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The entire project is open source. Clone it, run it, break it, improve it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/joemunene-by/GhostLM.git
&lt;span class="nb"&gt;cd &lt;/span&gt;GhostLM

&lt;span class="c"&gt;# Install everything&lt;/span&gt;
make &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;span class="c"&gt;# Download training data&lt;/span&gt;
make data

&lt;span class="c"&gt;# Train ghost-tiny on CPU&lt;/span&gt;
make train-tiny

&lt;span class="c"&gt;# Chat with the trained model&lt;/span&gt;
make chat

&lt;span class="c"&gt;# Run the web demo&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;gradio
python demo/app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'm actively looking for contributors. If you want to help with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Finding new cybersecurity datasets&lt;/li&gt;
&lt;li&gt;Implementing Flash Attention or RoPE&lt;/li&gt;
&lt;li&gt;Adding distributed training&lt;/li&gt;
&lt;li&gt;Writing documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check out &lt;a href="https://github.com/joemunene-by/GhostLM/blob/main/CONTRIBUTING.md" rel="noopener noreferrer"&gt;CONTRIBUTING.md&lt;/a&gt; and open a PR.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;I'm a 20-year-old computer science student in Nairobi, Kenya. I don't have access to massive compute clusters or research lab budgets. But I do have curiosity, persistence, and a belief that &lt;strong&gt;open-source AI shouldn't only come from well-funded labs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GhostLM is proof that you can build something meaningful from scratch with limited resources. The architecture is clean, the training pipeline works, and the model is learning. It's not going to replace GPT-4 — but it's a foundation that anyone can build on.&lt;/p&gt;

&lt;p&gt;If you found this interesting, star the repo, try it out, and let me know what you think. The best part of open source is that it gets better when more people are involved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/joemunene-by/GhostLM" rel="noopener noreferrer"&gt;https://github.com/joemunene-by/GhostLM&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;License:&lt;/strong&gt; Apache 2.0&lt;/p&gt;

&lt;p&gt;Built with ❤️ in Nairobi, Kenya 🇰🇪&lt;/p&gt;

</description>
      <category>cybersecurity</category>
      <category>llm</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
