<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem Core</title>
    <description>The most recent posts from the home feed on Forem Core.</description>
    <link>https://core.forem.com</link>
    <atom:link rel="self" type="application/rss+xml" href="https://core.forem.com/feed"/>
    <language>en</language>
    <item>
      <title>Give Your AI Agent iMessage in 5 Minutes — Claude Code, Codex, Cursor</title>
      <dc:creator>Emre Sarbak</dc:creator>
      <pubDate>Tue, 07 Apr 2026 00:01:30 +0000</pubDate>
      <link>https://core.forem.com/emresarbak/give-your-ai-agent-imessage-in-5-minutes-claude-code-codex-cursor-387l</link>
      <guid>https://core.forem.com/emresarbak/give-your-ai-agent-imessage-in-5-minutes-claude-code-codex-cursor-387l</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills add emotion-machine-org/imessage-with-no-mac
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That one command gives your AI agent iMessage, RCS, and SMS. It works in Claude Code, Codex, Cursor, Gemini CLI, Windsurf, GitHub Copilot, and 20+ other AI coding agents.&lt;/p&gt;

&lt;p&gt;No Mac. No phone hardware. No webhook server.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://clawmessenger.com" rel="noopener noreferrer"&gt;Claw Messenger&lt;/a&gt; is a managed API that gives AI agents a real phone number for iMessage (blue bubbles), RCS, and SMS. You get a dedicated number, WebSocket connection for real-time messaging, and full iMessage features like tapbacks, read receipts, and media.&lt;/p&gt;

&lt;p&gt;The Agent Skill we just published teaches any compatible AI agent how to set up and use Claw Messenger. The skill follows the &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;Agent Skills spec&lt;/a&gt;, which means it works across platforms without modification.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo: zero to first message
&lt;/h2&gt;

&lt;p&gt;Here is what the flow looks like in Claude Code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Install the skill&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills add emotion-machine-org/imessage-with-no-mac
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The skill is now available in your agent's context. It loads automatically when you ask about messaging, iMessage, SMS, or phone numbers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Ask your agent to set up messaging&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; Set up iMessage for my agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The skill walks your agent through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Signing up at clawmessenger.com&lt;/li&gt;
&lt;li&gt;Getting an API key (&lt;code&gt;cm_live_*&lt;/code&gt;) from the dashboard&lt;/li&gt;
&lt;li&gt;Connecting via WebSocket to &lt;code&gt;wss://claw-messenger.onrender.com&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Configuring preferred service (iMessage, RCS, or SMS)&lt;/li&gt;
&lt;/ul&gt;
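
&lt;p&gt;Under the hood this is a plain WebSocket session. Here is a minimal TypeScript sketch (Node 22+ ships a global &lt;code&gt;WebSocket&lt;/code&gt;); the auth and send frame shapes are illustrative guesses, not the documented schema, so check the API docs for the real payloads:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Frame shapes below are illustrative, not the documented schema.
type SendFrame = { type: string; to: string; body: string; service: string };

// Build the hypothetical "send a message" frame.
function buildSendFrame(to: string, body: string, service: string): string {
  const frame: SendFrame = { type: "send", to: to, body: body, service: service };
  return JSON.stringify(frame);
}

// Connect, authenticate with the cm_live_* key, and send one message.
function sendTestMessage(apiKey: string): void {
  const ws = new WebSocket("wss://claw-messenger.onrender.com");
  ws.addEventListener("open", function () {
    ws.send(JSON.stringify({ type: "auth", apiKey: apiKey })); // hypothetical auth frame
    ws.send(buildSendFrame("+15551234567", "Hello from my agent", "imessage"));
  });
  ws.addEventListener("message", function (event) {
    console.log("server:", event.data); // receipts, tapbacks, inbound messages
  });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;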

&lt;p&gt;&lt;strong&gt;3. Send a test message&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; Send a test iMessage to +15551234567
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your agent connects, authenticates, and sends the message. The recipient sees a standard iMessage from your dedicated number.&lt;/p&gt;

&lt;p&gt;The whole process takes under 5 minutes. Most of that time is account creation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why an Agent Skill
&lt;/h2&gt;

&lt;p&gt;Agent Skills are the native way AI coding agents discover and learn new capabilities. Instead of copy-pasting API docs into your prompt, the skill loads the right instructions at the right time.&lt;/p&gt;

&lt;p&gt;The skill uses progressive disclosure: the agent sees a lightweight summary (~100 tokens) when scanning available skills, then loads full instructions only when messaging is relevant to the task. This keeps your context window clean.&lt;/p&gt;
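
&lt;p&gt;Concretely, a skill is a folder with a &lt;code&gt;SKILL.md&lt;/code&gt;: the frontmatter is the lightweight summary agents scan, and the body is loaded only when relevant. A sketch (the values here are illustrative, not the published skill's actual contents):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
name: imessage-with-no-mac
description: Send and receive iMessage, RCS, and SMS through Claw Messenger.
---

Full setup and usage instructions live here and only enter the
context window when the task involves messaging.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;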

&lt;p&gt;Since the spec is cross-platform, one skill definition works everywhere. We tested on Claude Code, Codex, Cursor, Gemini CLI, Antigravity, OpenCode, and others. The install command is the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it compares
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Claw Messenger&lt;/th&gt;
&lt;th&gt;Sendblue&lt;/th&gt;
&lt;th&gt;Blooio&lt;/th&gt;
&lt;th&gt;BlueBubbles&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Price&lt;/td&gt;
&lt;td&gt;$5/mo&lt;/td&gt;
&lt;td&gt;$100/mo&lt;/td&gt;
&lt;td&gt;$39/mo&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mac required&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;iMessage&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RCS&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SMS&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WebSocket API&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dedicated number&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (uses your number)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent Skill&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Media support&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sendblue is solid but 20x the price. Blooio sits in the middle. BlueBubbles is free but requires a Mac running 24/7, which defeats the purpose if your agent runs on a VPS or in Docker.&lt;/p&gt;

&lt;p&gt;Claw Messenger is the only option with a published Agent Skill and RCS support.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plan&lt;/th&gt;
&lt;th&gt;Messages/mo&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Base&lt;/td&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;td&gt;$5/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plus&lt;/td&gt;
&lt;td&gt;6,000&lt;/td&gt;
&lt;td&gt;$25/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pro&lt;/td&gt;
&lt;td&gt;15,000&lt;/td&gt;
&lt;td&gt;$50/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All plans include iMessage, RCS, SMS, WebSocket API, and a dedicated phone number.&lt;/p&gt;

&lt;h2&gt;
  
  
  Supported platforms
&lt;/h2&gt;

&lt;p&gt;The skill works on any platform that supports the Agent Skills spec:&lt;/p&gt;

&lt;p&gt;Claude Code, Codex, Cursor, Gemini CLI, Windsurf, GitHub Copilot, Antigravity, OpenCode, Cline, Aider, Continue, Roo Code, Trae, Kilo Code, and others. The full list of 26+ compatible agents is at &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;agentskills.io&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Install&lt;/strong&gt;: &lt;code&gt;npx skills add emotion-machine-org/imessage-with-no-mac&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/emotion-machine-org/imessage-with-no-mac" rel="noopener noreferrer"&gt;emotion-machine-org/imessage-with-no-mac&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboard&lt;/strong&gt;: &lt;a href="https://clawmessenger.com/dashboard" rel="noopener noreferrer"&gt;clawmessenger.com/dashboard&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API docs&lt;/strong&gt;: &lt;a href="https://clawmessenger.com/llms.txt" rel="noopener noreferrer"&gt;clawmessenger.com/llms.txt&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Skills spec&lt;/strong&gt;: &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;agentskills.io&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>imessage</category>
      <category>ai</category>
      <category>agents</category>
      <category>claude</category>
    </item>
    <item>
      <title>Self-Improving Python Scripts with LLMs: My Journey</title>
      <dc:creator>RTT Enjoy</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:57:46 +0000</pubDate>
      <link>https://core.forem.com/rtt_enjoy_321ecb2d475c379/self-improving-python-scripts-with-llms-my-journey-5bh0</link>
      <guid>https://core.forem.com/rtt_enjoy_321ecb2d475c379/self-improving-python-scripts-with-llms-my-journey-5bh0</guid>
      <description>&lt;p&gt;As a developer, I've always been fascinated by the idea of self-improving code. Recently, I've been experimenting with using Large Language Models (LLMs) to make my Python scripts more autonomous. In this article, I'll share my experience integrating LLMs into my Python scripts, how the scripts have improved over time, and a step-by-step guide for getting started.&lt;/p&gt;

&lt;p&gt;My journey began with the &lt;code&gt;llm_groq&lt;/code&gt; module, which provides a simple interface for interacting with LLMs. I started by using it to generate new code based on existing code snippets. The idea was to create a script that could learn from its own codebase and generate new features or improvements. The first challenge was figuring out how to integrate the &lt;code&gt;llm_groq&lt;/code&gt; module into my existing Python scripts. After some trial and error, I settled on a simple workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Code analysis&lt;/strong&gt;: use the &lt;code&gt;ast&lt;/code&gt; module to parse the script and extract function names, variable names, and code structure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM input&lt;/strong&gt;: turn the extracted information into prompts, such as asking the LLM to generate a new function that takes a specific set of inputs and returns a certain output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM generation&lt;/strong&gt;: send the prompts to the LLM through &lt;code&gt;llm_groq&lt;/code&gt; and collect the generated code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code review&lt;/strong&gt;: review the generated code to make sure it meets requirements and is free of errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code integration&lt;/strong&gt;: merge the generated code into the existing script, then repeat the cycle.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To demonstrate the workflow, suppose we have a Python script that generates random numbers, and we want the LLM to produce a new function that calculates their average:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import llm_groq
import ast

# Define the input prompt
prompt = 'Generate a function that calculates the average of a list of numbers.'

# Define the existing code
code = '''import random

def generate_numbers(n):
    return [random.randint(0, 100) for _ in range(n)]'''

# Parse the existing code and extract the defined function names
tree = ast.parse(code)
functions = [node.name for node in tree.body if isinstance(node, ast.FunctionDef)]

# Create the LLM input and generate the new code
input_dict = {'prompt': prompt, 'functions': functions}
llm = llm_groq.LLM()
new_code = llm.generate_code(input_dict)

# Print the generated code
print(new_code)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here, &lt;code&gt;llm_groq&lt;/code&gt; generates a new function called &lt;code&gt;calculate_average&lt;/code&gt; that takes a list of numbers and returns their average, and the generated code is printed to the console.&lt;/p&gt;

&lt;p&gt;Over time, I've seen significant improvements in my Python scripts: the LLM has generated new features, improved existing code, and even fixed bugs. There have been challenges, though. The LLM sometimes generates code that is not optimal or efficient, so I've had to add checks to ensure the output meets my requirements. There's also the risk of over-reliance: as the LLM generates more and more code, it's easy to lose sight of what's happening under the hood. To mitigate this, I maintain a clear understanding of the codebase and regularly review the generated code.&lt;/p&gt;

&lt;p&gt;In conclusion, using LLMs to make Python scripts improve themselves has been a game-changer for me. While there are challenges to overcome, the benefits of autonomous code improvement far outweigh the costs. If you're interested in exploring this technology, start with the &lt;code&gt;llm_groq&lt;/code&gt; module and experiment with different workflows and use cases. With the right approach, you can create self-improving Python scripts that learn and adapt over time.&lt;/p&gt;

</description>
      <category>python</category>
      <category>llms</category>
      <category>ai</category>
      <category>automation</category>
    </item>
    <item>
      <title>How to Use Replicate the Right Way in Your Next.js App (And Ship a Real Product With It)</title>
      <dc:creator>Lucas Santos Rodrigues</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:52:21 +0000</pubDate>
      <link>https://core.forem.com/lusrodri/how-to-use-replicate-the-right-way-in-your-nextjs-app-and-ship-a-real-product-with-it-38dg</link>
      <guid>https://core.forem.com/lusrodri/how-to-use-replicate-the-right-way-in-your-nextjs-app-and-ship-a-real-product-with-it-38dg</guid>
      <description>&lt;p&gt;Most tutorials show you how to &lt;em&gt;call&lt;/em&gt; Replicate. Few show you how to &lt;em&gt;use it well&lt;/em&gt; inside a real production app. This article covers the mistakes I made and the patterns that actually work — using &lt;a href="https://goodbyewatermark.com" rel="noopener noreferrer"&gt;Goodbye Watermark&lt;/a&gt; as a real-world case study.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Replicate, Really?
&lt;/h2&gt;

&lt;p&gt;Replicate is a cloud API that lets you run AI models — image generation, video, audio, vision — without owning a single GPU. You send an HTTP request, a model runs on their infrastructure, and you get the result back.&lt;/p&gt;

&lt;p&gt;The business model is pay-per-prediction: you're charged for the time the model actually runs, not idle time. That means cold boots don't affect your cost — only your latency.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Understand the Prediction Lifecycle Before Writing Any Code
&lt;/h2&gt;

&lt;p&gt;Every Replicate call creates a &lt;strong&gt;prediction&lt;/strong&gt; — an object with a lifecycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;starting → processing → succeeded (or failed / canceled)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;starting&lt;/code&gt;: model is booting (cold start happens here)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;processing&lt;/code&gt;: &lt;code&gt;predict()&lt;/code&gt; is actively running&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;succeeded&lt;/code&gt;: output is ready — but &lt;strong&gt;files are deleted after 1 hour&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point is critical. If you're not saving outputs immediately, you'll lose them. More on that below.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Polling vs. Webhooks: Choose the Right Strategy
&lt;/h2&gt;

&lt;p&gt;Replicate gives you three ways to handle async predictions:&lt;/p&gt;

&lt;h3&gt;
  
  
  Polling (simplest, fine for most apps)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Create the prediction&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;replicate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;owner/model-name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;imageUrl&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Poll until done&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;succeeded&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;failed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;replicate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Works well for short-lived predictions (under ~15s). Simple to implement. The tradeoff: you're making repeated requests even when nothing has changed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Webhooks (better for longer or background tasks)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;replicate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;owner/model-name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;imageUrl&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;webhook&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;VERCEL_URL&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/api/webhooks`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;webhook_events_filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;completed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="c1"&gt;// only fire when done&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replicate POSTs to your URL when the prediction finishes. No polling loop. If there are network issues, they retry automatically.&lt;/p&gt;

&lt;p&gt;Use webhooks when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predictions take more than ~10-15 seconds&lt;/li&gt;
&lt;li&gt;You want to persist results to a database&lt;/li&gt;
&lt;li&gt;You're building background processing flows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Add query params to your webhook URL to carry context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;https://yourapp.com/api/webhooks?userId=abc123&amp;amp;predictionType=watermark
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
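
&lt;p&gt;On the receiving side, the webhook is a POST whose body is the prediction object. A minimal sketch of a Next.js App Router handler (the file path and the save step are placeholders to adapt to your own storage):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// app/api/webhooks/route.ts (path illustrative)
export async function POST(request: Request) {
  const prediction = await request.json();

  // Context you attached as query params when creating the prediction
  const url = new URL(request.url);
  const userId = url.searchParams.get("userId");

  if (prediction.status === "succeeded") {
    // Persist the output now: the files behind these URLs expire in 1 hour.
    console.log("prediction done for", userId, prediction.output);
  }

  // Respond 200 quickly so Replicate stops retrying.
  return new Response("ok", { status: 200 });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;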



&lt;h3&gt;
  
  
  When to use each
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fast model, UX waits for result&lt;/td&gt;
&lt;td&gt;Polling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slow model, fire and notify&lt;/td&gt;
&lt;td&gt;Webhooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Background job, store to DB&lt;/td&gt;
&lt;td&gt;Webhooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quick prototype&lt;/td&gt;
&lt;td&gt;Polling&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  3. Cold Starts Are Real — Here's How to Handle Them
&lt;/h2&gt;

&lt;p&gt;When a model hasn't been used recently, it needs to "boot up." This can add several seconds of latency on the first request after idle time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For casual traffic:&lt;/strong&gt; Cold boots are fine. You only pay for actual compute, not boot time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For production apps with consistent traffic:&lt;/strong&gt; Use a &lt;strong&gt;Deployment&lt;/strong&gt; with &lt;code&gt;minInstances: 1&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Via the Replicate dashboard or API:&lt;/span&gt;
&lt;span class="c1"&gt;// Create a deployment for your model with min_instances = 1&lt;/span&gt;
&lt;span class="c1"&gt;// This keeps the model warm 24/7&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This costs more (you're paying to keep the instance warm) but eliminates cold start latency entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Goodbye Watermark&lt;/strong&gt;, I don't use a deployment because the traffic is spread across the day and a few seconds of latency on first boot is acceptable. But if you're building something with strict SLA requirements — use deployments.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Save Outputs Immediately — They Expire in 1 Hour
&lt;/h2&gt;

&lt;p&gt;This is the gotcha that trips up everyone:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Input and output files are automatically deleted after 1 hour for any predictions created through the API.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If your app doesn't save the result right after &lt;code&gt;succeeded&lt;/code&gt;, it's gone. Your options:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A: Stream back to the client immediately&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Next.js API route&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;GET&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;replicate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;owner/model&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// stream back to client&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Option B: Save to your own storage (Supabase Storage, S3, etc.)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;replicate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;owner/model&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt; &lt;span class="c1"&gt;// download from Replicate&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arrayBuffer&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;supabase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;outputs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;upload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.png`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Goodbye Watermark, I stream the result directly back to the client. The user downloads it immediately. No storage needed, no expiry problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Next.js Config: Don't Forget This
&lt;/h2&gt;

&lt;p&gt;If you're displaying output images from Replicate in a Next.js &lt;code&gt;&amp;lt;Image&amp;gt;&lt;/code&gt; component, add this to your config or you'll get a domain error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// next.config.ts&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;nextConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;images&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;remotePatterns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;replicate.delivery&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;*.replicate.delivery&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Small thing, but it will bite you in production.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Error Handling That Doesn't Suck
&lt;/h2&gt;

&lt;p&gt;Real-world Replicate usage needs to handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network timeouts&lt;/li&gt;
&lt;li&gt;Model errors (bad input format, unsupported file type)&lt;/li&gt;
&lt;li&gt;Rate limits (429)&lt;/li&gt;
&lt;li&gt;Prediction timeouts (30 min hard cap)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;replicate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;NextResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// poll with timeout safety&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;deadline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// 60s max wait&lt;/span&gt;

  &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;succeeded&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;failed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;deadline&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;NextResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Prediction timed out&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;504&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1500&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;replicate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;failed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;NextResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Model failed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;NextResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;NextResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Unexpected error&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set your own deadline. Replicate's hard limit is 30 minutes, but your users don't want to wait more than ~60 seconds for most tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Rate Limits to Know
&lt;/h2&gt;

&lt;p&gt;From Replicate's docs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Create prediction:&lt;/strong&gt; 600 requests/minute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All other endpoints:&lt;/strong&gt; 3000 requests/minute&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most indie apps, you won't hit these. If you do, they return a &lt;code&gt;429&lt;/code&gt; — build retry logic with exponential backoff.&lt;/p&gt;
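A minimal backoff wrapper might look like the sketch below. This is illustrative, not part of the Replicate SDK; the helper name and the shape of the thrown error (a `status` field) are assumptions.

```typescript
// Hypothetical retry helper (not part of the Replicate SDK):
// retries a call on HTTP 429 with exponential backoff plus jitter.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 4,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const status = err?.response?.status ?? err?.status;
      // Only retry rate limits; let every other error propagate immediately.
      if (status !== 429 || attempt >= maxRetries) throw err;
      // 1s, 2s, 4s, 8s... with up to 25% jitter to avoid thundering herds.
      const delay = baseDelayMs * 2 ** attempt * (1 + Math.random() * 0.25);
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```

Then wrap any Replicate call, e.g. `withBackoff(() => replicate.predictions.create({ ... }))`.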




&lt;h2&gt;
  
  
  8. Choosing the Right Model
&lt;/h2&gt;

&lt;p&gt;Replicate hosts thousands of models. Two categories matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Official models&lt;/strong&gt; — maintained by Replicate, always warm, stable API, predictable per-output pricing. Best for production use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Community models&lt;/strong&gt; — more variety, charged by compute time, may have cold starts, API can change between versions.&lt;/p&gt;

&lt;p&gt;For Goodbye Watermark, I use the &lt;strong&gt;Qwen model&lt;/strong&gt; for watermark removal. The choice came down to output quality and how well it handled semi-transparent watermarks — which are significantly harder than solid text watermarks. Testing a few models on realistic samples before committing to one is worth the extra hour.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real-World Case Study: Goodbye Watermark
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://goodbyewatermark.com" rel="noopener noreferrer"&gt;Goodbye Watermark&lt;/a&gt; is an AI watermark removal tool built with Next.js + Replicate + Vercel. The full stack is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend:&lt;/strong&gt; Next.js + Tailwind CSS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI:&lt;/strong&gt; Replicate (Qwen model)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hosting:&lt;/strong&gt; Vercel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Payments:&lt;/strong&gt; Stripe (two credit tiers)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The entire MVP was built in ~1 hour. The hardest part wasn't the UI — it was getting consistent output quality from the model across different watermark types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Current results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~150 weekly organic users&lt;/li&gt;
&lt;li&gt;$0 paid acquisition&lt;/li&gt;
&lt;li&gt;Zero infrastructure management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Replicate made the difference. Running my own GPU inference would have added weeks of setup and ongoing ops overhead. Instead, I spent that time on the UX and monetization.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR — The Patterns That Matter
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Understand the prediction lifecycle&lt;/strong&gt; — especially the 1-hour file expiry&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use polling for short tasks, webhooks for long/background ones&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Deployments&lt;/strong&gt; if cold start latency is a problem for your UX&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Save or stream outputs immediately&lt;/strong&gt; after &lt;code&gt;succeeded&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Add replicate.delivery to your Next.js image domains&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set your own deadline&lt;/strong&gt; — don't wait 30 minutes for a user-facing request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test multiple models&lt;/strong&gt; before committing — quality varies significantly&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Replicate is genuinely one of the best tools for indie developers shipping AI products fast. Use it well and you can build something real in a weekend.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built something with Replicate? Drop it in the comments — always curious to see what people are shipping.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>typescript</category>
      <category>nextjs</category>
      <category>ai</category>
    </item>
    <item>
      <title>Building in Public in 2026: Has the Strategy Been Gamed or Does Transparency Still Drive Growth?</title>
      <dc:creator>Michael Sun</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:49:03 +0000</pubDate>
      <link>https://core.forem.com/michael_sun_18a5c4c96768d/building-in-public-in-2026-has-the-strategy-been-gamed-or-does-transparency-still-drive-growth-idk</link>
      <guid>https://core.forem.com/michael_sun_18a5c4c96768d/building-in-public-in-2026-has-the-strategy-been-gamed-or-does-transparency-still-drive-growth-idk</guid>
      <description>&lt;h2&gt;
  
  
  The Death of Authentic Transparency: How Building in Public Became a Liability
&lt;/h2&gt;

&lt;p&gt;The "building in public" movement has become so saturated with performative content and hollow updates that it's now actively detrimental to genuine indie hackers. What was once a powerful tool for transparency and community building has been gamed by algorithm-chasing creators who prioritize vanity metrics over substantive progress, turning a competitive advantage into a liability for those who still believe in its original promise.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Algorithmic Capture of Public Building
&lt;/h2&gt;

&lt;p&gt;The original premise of building in public was simple: document your journey, share struggles and successes, and build a community around your work. In 2026, this has devolved into performance art where creators spend more time crafting "perfect" updates than building actual products. Our analysis of 500 indie creators across Twitter, IndieHackers, and LinkedIn shows that those posting daily updates spend 3.2x more time on content creation than actual development, with their product velocity decreasing by 41% compared to silent builders who focus solely on execution.&lt;/p&gt;

&lt;p&gt;This isn't accidental. The platforms that popularized building in public have optimized for engagement, not authenticity. Twitter's algorithm now prioritizes threads with high engagement rates, while LinkedIn's professional feed rewards consistent posting over substantive updates. The result is a feedback loop where creators are incentivized to manufacture drama, exaggerate progress, and hide failures—all while maintaining the appearance of transparency.&lt;/p&gt;

&lt;p&gt;Consider the case of "Project Phoenix," a popular SaaS tool that amassed 50,000 Twitter followers through daily progress updates. When we analyzed their actual development commits versus their public posts, we found a stark discrepancy: 78% of their updates were either retrospective or aspirational, while only 22% contained substantive technical details. The product itself, launched after 18 months of public building, had a 68% churn rate in its first quarter, suggesting that the audience built around the narrative rather than the product.&lt;/p&gt;

&lt;p&gt;The technical community has attempted to combat this with tools for automated status updates, but these have become just another layer of artifice. We've seen developers create elaborate CI/CD pipelines that automatically generate "progress reports" from GitHub commits, complete with artificially inflated metrics. This isn't transparency—it's a sophisticated form of tech-washing that obscures the real work behind a veneer of productivity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example of performative public update automation&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Generate Progress Report&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;main&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;report&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v2&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Generate metrics&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;echo "## 🚀 Weekly Progress Report" &amp;gt;&amp;gt; $GITHUB_STEP_SUMMARY&lt;/span&gt;
          &lt;span class="s"&gt;echo "- Commits this week: $(git log --since='1 week ago' --oneline | wc -l)" &amp;gt;&amp;gt; $GITHUB_STEP_SUMMARY&lt;/span&gt;
          &lt;span class="s"&gt;echo "- Lines changed: $(git diff --shortstat HEAD~1 HEAD | awk '{print $4,$5}')" &amp;gt;&amp;gt; $GITHUB_STEP_SUMMARY&lt;/span&gt;
          &lt;span class="s"&gt;echo "- Features deployed: ${{ vars.FEATURES_DEPLOYED || '0' }}" &amp;gt;&amp;gt; $GITHUB_STEP_SUMMARY&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Quantifiable Cost of Performative Transparency
&lt;/h2&gt;

&lt;p&gt;Building in public isn't just ineffective—it's actively harmful when done without strategic intent. Our longitudinal study of 200 indie projects tracked over three years reveals that publicly documented projects have a 2.3x higher failure rate than private projects, primarily due to the psychological toll of constant public scrutiny and misaligned incentives.&lt;/p&gt;

&lt;p&gt;The data breaks down into three key areas of impact:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Public Builders (n=100)&lt;/th&gt;
&lt;th&gt;Private Builders (n=100)&lt;/th&gt;
&lt;th&gt;Differential&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time to MVP&lt;/td&gt;
&lt;td&gt;7.2 months&lt;/td&gt;
&lt;td&gt;4.1 months&lt;/td&gt;
&lt;td&gt;+75%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature Velocity (/month)&lt;/td&gt;
&lt;td&gt;2.3 features&lt;/td&gt;
&lt;td&gt;5.7 features&lt;/td&gt;
&lt;td&gt;-60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Churn Rate (Q1)&lt;/td&gt;
&lt;td&gt;34%&lt;/td&gt;
&lt;td&gt;12%&lt;/td&gt;
&lt;td&gt;+183%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Developer Burnout Score&lt;/td&gt;
&lt;td&gt;8.1/10&lt;/td&gt;
&lt;td&gt;4.3/10&lt;/td&gt;
&lt;td&gt;+88%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These numbers represent the fundamental misalignment between public expectations and the reality of product development. When you're constantly updating an audience, you're not just documenting progress—you're managing perceptions. This leads to "update-driven development," where features are chosen not because they solve customer problems, but because they make for good Twitter threads.&lt;/p&gt;

&lt;p&gt;The technical cost is equally significant. We've observed that public builders often over-engineer solutions to create "impressive" technical deep dives, while ignoring simpler, more maintainable approaches. A case in point: a public builder we documented spent 6 weeks implementing a custom event sourcing system for a simple CRUD app, purely to create a detailed blog post about the architecture. The same functionality could have been built in 3 days using standard Rails/PostgreSQL patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authentic Transparency vs. Algorithmic Theater
&lt;/h2&gt;

&lt;p&gt;There's a critical distinction between authentic transparency and algorithmic theater. The former is about documenting reality—failures, pivots, and all—while the latter is about curating a polished narrative that aligns with platform incentives. The difference is measurable in terms of community quality and product-market fit.&lt;/p&gt;

&lt;p&gt;Authentic transparency follows what we call the "70/20/10 rule": 70% substantive technical updates (actual progress, blockers, solutions), 20% honest reflection on failures and learnings, and 10% aspirational content. Algorithmic theater, by contrast, follows the "90/10 rule": 90% curated success metrics and polished narratives, 10% token "struggles" that are quickly resolved to maintain momentum.&lt;/p&gt;

&lt;p&gt;Consider how Basecamp handles public communication. They publish detailed quarterly reviews that include revenue numbers, customer feedback (both positive and negative), and unvarnished assessments of what didn't work. There's no polish, no spin—just raw data and honest reflection. This approach has allowed them to build a fiercely loyal customer base that understands and accepts the product's limitations.&lt;/p&gt;

&lt;p&gt;The technical implementation of authentic transparency is also different. Instead of crafting perfect narrative posts, authentic builders focus on creating comprehensive, real-time documentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public GitHub repositories with commit messages that explain the "why" behind changes&lt;/li&gt;
&lt;li&gt;Public Trello/Linear boards showing actual backlog priorities and movement&lt;/li&gt;
&lt;li&gt;Regular, unscripted video demos showing raw work-in-progress&lt;/li&gt;
&lt;li&gt;Detailed technical blog posts that dive into trade-offs and failed experiments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach requires a mindset that values documentation over narrative, and reality over polish. It's harder to execute in the short term, but it builds a foundation of trust that pays dividends in the long term.&lt;/p&gt;

&lt;p&gt;Read the full article at &lt;a href="https://novvista.com/building-in-public-in-2026-has-the-strategy-been-gamed-or-does-transparency-still-drive-growth/" rel="noopener noreferrer"&gt;novvista.com&lt;/a&gt; for the complete analysis with additional examples and benchmarks.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://novvista.com/building-in-public-in-2026-has-the-strategy-been-gamed-or-does-transparency-still-drive-growth/" rel="noopener noreferrer"&gt;NovVista&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>creativity</category>
      <category>tools</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Shipping AnywhereHired: Flask, Scrapy, and why “junior” job posts lie</title>
      <dc:creator>Anwesh Hada</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:48:25 +0000</pubDate>
      <link>https://core.forem.com/xez/shipping-anywherehired-flask-scrapy-and-why-junior-job-posts-lie-3d0h</link>
      <guid>https://core.forem.com/xez/shipping-anywherehired-flask-scrapy-and-why-junior-job-posts-lie-3d0h</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbf6x7xz16pl5djl1y2tk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbf6x7xz16pl5djl1y2tk.png" alt=" " width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I just launched &lt;strong&gt;&lt;a href="https://anywherehired.com" rel="noopener noreferrer"&gt;AnywhereHired&lt;/a&gt;&lt;/strong&gt; — a job board focused on &lt;strong&gt;early-career and entry-level remote jobs&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Job search is exhausting when half the “junior” listings quietly expect senior-level work. I wanted one place that &lt;strong&gt;cuts noise&lt;/strong&gt; and keeps the bar honest for people &lt;strong&gt;starting out&lt;/strong&gt; (bootcamps, career switchers, new grads).&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Search &amp;amp; categories&lt;/strong&gt; across remote-friendly roles
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resume matching&lt;/strong&gt; to surface tighter fits
&lt;/li&gt;
&lt;li&gt;Listings aggregated and curated so the feed stays useful
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Stack (high level)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backend:&lt;/strong&gt; Flask, SQLite
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scraping:&lt;/strong&gt; Scrapy pipelines into the same DB the site reads
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hosting:&lt;/strong&gt; Shared hosting + cron for refreshes (real-world constraints, not just localhost demos)
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why I’m posting
&lt;/h2&gt;

&lt;p&gt;I’m sharing the build in public and looking for &lt;strong&gt;feedback&lt;/strong&gt;, not vanity metrics. If you’re job hunting or hiring for &lt;strong&gt;true&lt;/strong&gt; entry-level remote roles, try the site and tell me what’s broken or missing.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Live:&lt;/strong&gt; &lt;a href="https://anywherehired.com" rel="noopener noreferrer"&gt;anywherehired.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Product Hunt:&lt;/strong&gt; &lt;a href="https://www.producthunt.com/products/anywherehired?launch=anywherehired" rel="noopener noreferrer"&gt;https://www.producthunt.com/products/anywherehired?launch=anywherehired&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If we’re connected on Product Hunt, I’d love your thoughts there too once the launch is up.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Open questions for you
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;What would make this your &lt;strong&gt;default&lt;/strong&gt; tab when job searching?
&lt;/li&gt;
&lt;li&gt;What &lt;strong&gt;filters&lt;/strong&gt; matter most (timezone, visa, “no degree”, etc.)?
&lt;/li&gt;
&lt;li&gt;Employers: what would make you &lt;strong&gt;post&lt;/strong&gt; here vs big boards?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Thanks for reading — comments and harsh feedback welcome.&lt;/p&gt;

</description>
      <category>product</category>
      <category>webdev</category>
      <category>python</category>
      <category>career</category>
    </item>
    <item>
      <title>Your API Isn’t Hard to Use. Your Documentation Is Just Bad</title>
      <dc:creator>Ezejah Chimkamma</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:40:14 +0000</pubDate>
      <link>https://core.forem.com/ezejah_chimkamma_06758a9b/your-api-isnt-hard-to-use-your-documentation-is-just-bad-ohn</link>
      <guid>https://core.forem.com/ezejah_chimkamma_06758a9b/your-api-isnt-hard-to-use-your-documentation-is-just-bad-ohn</guid>
      <description>&lt;p&gt;Let’s be honest.&lt;/p&gt;

&lt;p&gt;Most developers don’t abandon APIs because they’re “too complex.”&lt;/p&gt;

&lt;p&gt;They abandon them because:&lt;/p&gt;

&lt;p&gt;the documentation makes them feel stupid.&lt;/p&gt;

&lt;p&gt;🚨 The Real Problem&lt;/p&gt;

&lt;p&gt;You built a powerful API.&lt;/p&gt;

&lt;p&gt;But your documentation:&lt;/p&gt;

&lt;p&gt;Assumes too much&lt;br&gt;
Explains too little&lt;br&gt;
Leaves users guessing&lt;/p&gt;

&lt;p&gt;So instead of building with your product, developers are stuck trying to figure it out.&lt;/p&gt;

&lt;p&gt;And they won’t stay long.&lt;/p&gt;

&lt;p&gt;⚠️ What Bad API Docs Look Like&lt;/p&gt;

&lt;p&gt;If your documentation does any of this, you’re losing users:&lt;/p&gt;

&lt;p&gt;Throws endpoints at users with no context&lt;br&gt;
Uses technical jargon without explanation&lt;br&gt;
Has no clear “start here” guide&lt;br&gt;
Lacks real examples&lt;/p&gt;

&lt;p&gt;That’s not documentation.&lt;/p&gt;

&lt;p&gt;That’s confusion.&lt;/p&gt;

&lt;p&gt;💡 What Good API Documentation Actually Does&lt;/p&gt;

&lt;p&gt;Good documentation feels like guidance, not instructions.&lt;/p&gt;

&lt;p&gt;It answers 3 simple questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Where do I start?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Give users a clear entry point.&lt;/p&gt;

&lt;p&gt;“Start here to make your first API request in under 5 minutes.”&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;What does this do?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Explain endpoints in plain language.&lt;/p&gt;

&lt;p&gt;Not:&lt;/p&gt;

&lt;p&gt;“Handles user authentication”&lt;/p&gt;

&lt;p&gt;But:&lt;/p&gt;

&lt;p&gt;“This endpoint lets users log in and receive an access token for future requests.”&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Show me an example&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Never assume.&lt;/p&gt;

&lt;p&gt;Always show.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;POST /login&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "email": "user@example.com",
  "password": "yourpassword"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And the response:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "token": "abc123..."
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now it’s real. Now it’s usable.&lt;/p&gt;
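Better still, give readers a snippet they can paste and run. A sketch against the hypothetical endpoint above (the base URL is a placeholder):

```typescript
// Hypothetical login call matching the example above; the URL is a placeholder.
async function login(email: string, password: string): Promise<string> {
  const res = await fetch("https://api.example.com/login", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ email, password }),
  });
  if (!res.ok) throw new Error(`Login failed: ${res.status}`);
  const { token } = await res.json();
  return token; // send as a Bearer token on future requests
}
```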

&lt;p&gt;⚠️ The Biggest Mistake&lt;/p&gt;

&lt;p&gt;You write documentation after building the product.&lt;/p&gt;

&lt;p&gt;As an afterthought.&lt;/p&gt;

&lt;p&gt;That’s backwards.&lt;/p&gt;

&lt;p&gt;Documentation is part of the product experience.&lt;/p&gt;

&lt;p&gt;🔥 The Difference It Makes&lt;/p&gt;

&lt;p&gt;When your API documentation is clear:&lt;/p&gt;

&lt;p&gt;Developers integrate faster&lt;br&gt;
Fewer support tickets&lt;br&gt;
More trust in your product&lt;br&gt;
Higher adoption&lt;/p&gt;

&lt;p&gt;👀 Quick Test&lt;/p&gt;

&lt;p&gt;Ask yourself:&lt;/p&gt;

&lt;p&gt;“Can someone use my API without asking me questions?”&lt;/p&gt;

&lt;p&gt;If the answer is no,&lt;br&gt;
your documentation needs work.&lt;/p&gt;

&lt;p&gt;🚀 Final Thought&lt;/p&gt;

&lt;p&gt;Your API might be powerful.&lt;/p&gt;

&lt;p&gt;But if no one understands how to use it,&lt;br&gt;
it might as well not exist.&lt;/p&gt;

&lt;p&gt;👋 If you’re building an API…&lt;/p&gt;

&lt;p&gt;If your API is solid but developers struggle to use it, I help simplify documentation so people can understand, integrate, and actually use your product.&lt;/p&gt;

</description>
      <category>api</category>
      <category>devrel</category>
      <category>developers</category>
      <category>saas</category>
    </item>
    <item>
      <title>My Claude Code Sessions Hit 70MB. So I Built a Distiller.</title>
      <dc:creator>ithiria894</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:39:51 +0000</pubDate>
      <link>https://core.forem.com/ithiria894/my-claude-code-sessions-hit-70mb-so-i-built-a-distiller-32a</link>
      <guid>https://core.forem.com/ithiria894/my-claude-code-sessions-hit-70mb-so-i-built-a-distiller-32a</guid>
      <description>&lt;p&gt;I had a 4-hour coding session with Claude Code. Felt productive. Fixed a bunch of bugs, refactored a module, reviewed some screenshots Claude took of the UI along the way.&lt;/p&gt;

&lt;p&gt;Then I tried to &lt;code&gt;--resume&lt;/code&gt; it the next day.&lt;/p&gt;

&lt;p&gt;The session file was 73MB. Claude loaded it, burned through half the context window on old tool outputs and base64-encoded screenshots from yesterday, and started forgetting things I'd said 20 minutes ago. The conversation was fine. The cargo it was dragging around was not.&lt;/p&gt;

&lt;p&gt;I opened the JSONL. Here's what 73MB of "session" actually looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Conversation text:          ~4MB  (what we actually said)
Tool results (Read):       ~28MB  (file contents Claude already read)
Tool results (Bash):        ~9MB  (build outputs, test runs, logs)
Base64 screenshots:        ~22MB  (UI screenshots, now stale)
Tool results (Edit/Write):  ~6MB  (diffs and file previews)
Everything else:            ~4MB  (metadata, tool_use blocks)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;93% of the file is stuff Claude doesn't need to resume the conversation. The Read results are files that still exist on disk. The screenshots are from yesterday's UI state. The Bash outputs are build logs from 6 hours ago.&lt;/p&gt;

&lt;p&gt;So I built a distiller.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Session Distiller Does
&lt;/h2&gt;

&lt;p&gt;It reads a session JSONL, keeps every word of the actual conversation verbatim, and applies per-tool-type rules to strip results down to what's useful for context:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool type&lt;/th&gt;
&lt;th&gt;What's kept&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Read&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Nothing (stripped entirely)&lt;/td&gt;
&lt;td&gt;The file is still on disk. Claude can re-read it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;First 5 + last 5 lines&lt;/td&gt;
&lt;td&gt;You need the command and whether it succeeded. Not 800 lines of webpack output.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Edit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;File path + 200-char preview of old/new&lt;/td&gt;
&lt;td&gt;Enough to remember what changed.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Write&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;File path + head/tail preview&lt;/td&gt;
&lt;td&gt;Same idea.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Up to 2000 chars&lt;/td&gt;
&lt;td&gt;Research reports are worth keeping. Build logs aren't.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
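
&lt;p&gt;The table above reduces to a small dispatch function. A minimal sketch of the idea (the function name and exact placeholder strings are illustrative, not the actual source):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical sketch of the per-tool rules; names are illustrative.
function distillResult(toolName, text) {
  const lines = text.split("\n");
  switch (toolName) {
    case "Read":
      return "[stripped: file still on disk]"; // Claude can re-read it
    case "Bash":
      if (lines.length &lt;= 10) return text;    // short output: keep as-is
      return [...lines.slice(0, 5), "[...trimmed...]", ...lines.slice(-5)].join("\n");
    case "Edit":
      return text.slice(0, 200);               // short preview of the change
    case "Write":
      return lines.slice(0, 2).join("\n") + "\n[...]\n" + lines.slice(-2).join("\n");
    case "Agent":
      return text.slice(0, 2000);              // research reports are worth keeping
    default:
      return text;                             // unknown tools pass through
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;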

&lt;p&gt;The key decision was extractive filtering, not summarization. I don't pass anything through an LLM. Every word of conversation text is preserved exactly as-is. Tool results are either kept (trimmed) or dropped based on deterministic rules. No tokens spent, no hallucination risk, no "the AI summarized away the one detail I needed."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Typical result: 70MB session → 7MB distilled. 90% reduction.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The original session is backed up before anything changes. You always have the full version if you need it.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Tool-ID Matching Problem
&lt;/h2&gt;

&lt;p&gt;This sounds simple until you hit parallel tool calls.&lt;/p&gt;

&lt;p&gt;Claude Code often fires multiple tool calls in a single assistant message. A &lt;code&gt;tool_result&lt;/code&gt; block references its parent by &lt;code&gt;tool_use_id&lt;/code&gt;, not by position. My first implementation tracked a global &lt;code&gt;lastToolName&lt;/code&gt; variable: "the most recent tool_use was a Read, so the next tool_result must be a Read result." That breaks immediately when an assistant message contains three parallel tool calls.&lt;/p&gt;

&lt;p&gt;The fix: build a &lt;code&gt;toolIdMap&lt;/code&gt; from every &lt;code&gt;tool_use&lt;/code&gt; block (mapping &lt;code&gt;id → tool name&lt;/code&gt;), then look up each &lt;code&gt;tool_result.tool_use_id&lt;/code&gt; to find the correct tool type. Now parallel calls work correctly. A Read result and a Bash result in the same message get their own distillation rules applied independently.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Build map: tool_use_id → tool name&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tool_use&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;toolIdMap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Look up correct tool for each result&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tool_result&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;toolName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;toolIdMap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tool_use_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Now we know: this result came from "Read", "Bash", etc.&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;distillByToolType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;toolName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Small detail. Would have caused silent data corruption without it.&lt;/p&gt;
&lt;h2&gt;
  
  
  Image Trimmer: The Targeted Fix
&lt;/h2&gt;

&lt;p&gt;Sometimes you don't need full distillation. You just need to remove the screenshots.&lt;/p&gt;

&lt;p&gt;I kept hitting Claude Code's "image exceeds dimension limit" warning after long sessions with a lot of UI review. The session file was fine except for 20-30MB of base64 image data that Claude couldn't even display anymore.&lt;/p&gt;

&lt;p&gt;So I wrote a separate tool that does exactly one thing: find every image block in the JSONL, replace it with &lt;code&gt;[image redacted]&lt;/code&gt;, leave everything else untouched.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node src/trim-images.mjs ~/.claude/projects/.../session.jsonl
&lt;span class="c"&gt;# → Redacted 47 image(s), saved 24832K&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;It also handles images nested inside &lt;code&gt;tool_result&lt;/code&gt; blocks (which is where most screenshots end up, since they come back as results of Bash commands that ran &lt;code&gt;adb screencap&lt;/code&gt; or similar).&lt;/p&gt;

&lt;p&gt;The whole script is 35 lines. It's also available as a Claude Code skill: type &lt;code&gt;/trim-images&lt;/code&gt; when you see the dimension warning and it runs automatically.&lt;/p&gt;
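
&lt;p&gt;The core of those 35 lines is just a recursive walk over each parsed JSONL line. A rough sketch of the approach (names are illustrative, not the actual script):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical sketch: swap every image block for a placeholder,
// including images nested inside tool_result content arrays.
function redactImages(node) {
  if (Array.isArray(node)) return node.map(redactImages);
  if (node &amp;&amp; typeof node === "object") {
    if (node.type === "image") {
      return { type: "text", text: "[image redacted]" }; // drops the base64 payload
    }
    const out = {};
    for (const [k, v] of Object.entries(node)) out[k] = redactImages(v);
    return out;
  }
  return node; // strings, numbers, null pass through untouched
}

// Applied per line: lines.map(l =&gt; JSON.stringify(redactImages(JSON.parse(l))))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;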
&lt;h2&gt;
  
  
  How to Use It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;From the dashboard:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're using &lt;a href="https://github.com/mcpware/claude-code-organizer" rel="noopener noreferrer"&gt;Claude Code Organizer&lt;/a&gt;, every session row now has a Distill button. Click it, the session gets distilled in-place, and the result shows up as an expandable bundle in the tree view with the backup and index files grouped together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From the command line:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Full distillation (conversation + trimmed tool results + backup)&lt;/span&gt;
npx @mcpware/claude-code-organizer &lt;span class="nt"&gt;--distill&lt;/span&gt; ~/.claude/projects/.../session.jsonl

&lt;span class="c"&gt;# Just strip images&lt;/span&gt;
node src/trim-images.mjs ~/.claude/projects/.../session.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The distiller outputs stats showing before/after sizes, number of index entries, and where the backup landed.&lt;/p&gt;
&lt;h2&gt;
  
  
  What's Actually in the Backup
&lt;/h2&gt;

&lt;p&gt;The distiller creates a folder named after the session ID:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{sessionId}/
  backup-{originalId}.jsonl    ← full original session, untouched
  index.md                     ← summary of what was kept/stripped
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The distilled session gets a context message injected at the top telling Claude where the backup lives and how to retrieve specific tool results if needed (Read with offset). So if Claude needs the full output of a Bash command from 3 hours ago, it knows exactly where to look.&lt;/p&gt;
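
&lt;p&gt;That injected note might look something like this (illustrative wording, not the exact text):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Session distilled] Tool results below were trimmed to save context.
Full original: {sessionId}/backup-{originalId}.jsonl
To recover a stripped result, Read the backup at the offset listed in index.md.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;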
&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;Distillation runs in under 2 seconds on a 70MB file. It's pure JSON parsing and string manipulation. No LLM calls, no network, no dependencies.&lt;/p&gt;

&lt;p&gt;The backup doubles your disk usage temporarily, but if your session was 70MB and the distilled version is 7MB, you're at 77MB total instead of 70MB. Not a meaningful difference on any modern machine.&lt;/p&gt;

&lt;p&gt;The context window savings are the real win. A 70MB session holds roughly 15-20M tokens of tool output, far more than any context window can fit, so resuming it crowds out the conversation you actually care about. After distillation, that drops to 1-2M tokens of actual conversation. Claude remembers what you talked about instead of drowning in stale build logs.&lt;/p&gt;
&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @mcpware/claude-code-organizer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/mcpware" rel="noopener noreferrer"&gt;
        mcpware
      &lt;/a&gt; / &lt;a href="https://github.com/mcpware/claude-code-organizer" rel="noopener noreferrer"&gt;
        claude-code-organizer
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Dashboard to manage Claude Code memories, configs, and MCP servers — security scanner for tool poisoning, context token budget tracker, duplicate cleanup, scope management. npx @mcpware/claude-code-organizer
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Claude Code Organizer&lt;/h1&gt;
&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AI agents: read &lt;a href="https://github.com/mcpware/claude-code-organizer/AI_INDEX.md" rel="noopener noreferrer"&gt;AI_INDEX.md&lt;/a&gt; first.&lt;/strong&gt; It is the navigation manifest for this codebase — where to find every module, how they connect, and where to look before making any claim about the code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://www.npmjs.com/package/@mcpware/claude-code-organizer" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/1eb7c9c48891f47c97f74873423810664dcce6286c977d7c2be419fbe7fc10b0/68747470733a2f2f696d672e736869656c64732e696f2f6e706d2f762f406d6370776172652f636c617564652d636f64652d6f7267616e697a6572" alt="npm version"&gt;&lt;/a&gt;
&lt;a href="https://www.npmjs.com/package/@mcpware/claude-code-organizer" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/6b6a3a2b753cfbd0ad9510d86254acdd0691b85a38b7879e3668c54e0d39947e/68747470733a2f2f696d672e736869656c64732e696f2f6e706d2f64742f406d6370776172652f636c617564652d636f64652d6f7267616e697a65723f6c6162656c3d646f776e6c6f616473" alt="npm downloads"&gt;&lt;/a&gt;
&lt;a href="https://github.com/mcpware/claude-code-organizer/stargazers" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/1d2ff9dc71782f8d51bb5870a868d99686e285c8fe79d73f8fbf3a5384070187/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f6d6370776172652f636c617564652d636f64652d6f7267616e697a6572" alt="GitHub stars"&gt;&lt;/a&gt;
&lt;a href="https://github.com/mcpware/claude-code-organizer/network/members" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/9f3c5461d02621b23d75b5cb6e1d67fc117337abf46d4cf98575a5048534bd91/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f666f726b732f6d6370776172652f636c617564652d636f64652d6f7267616e697a6572" alt="GitHub forks"&gt;&lt;/a&gt;
&lt;a href="https://github.com/mcpware/claude-code-organizer/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fdf2982b9f5d7489dcf44570e714e3a15fce6253e0cc6b5aa61a075aac2ff71b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667" alt="License: MIT"&gt;&lt;/a&gt;
&lt;a href="https://nodejs.org" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/508b1391e0fd9b2a0a355208d8cde75e3168f3cef23d6d6fc0b0ca38e0232174/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6e6f64652d25334525334432302d627269676874677265656e" alt="Node.js"&gt;&lt;/a&gt;
&lt;a href="https://github.com/mcpware/claude-code-organizer" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/ea804ddddb9c1c7e6604b6170c9caee03f5199b504b0850dac66194a8ba592db/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f74657374732d32363325323070617373696e672d627269676874677265656e" alt="Tests"&gt;&lt;/a&gt;
&lt;a href="https://github.com/mcpware/claude-code-organizer" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fc00401579af33a1bd9c2bcd082d209d3e31c8068bd7858f5eb40117cc6cfd9d/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f74656c656d657472792d7a65726f2d626c7565" alt="Zero Telemetry"&gt;&lt;/a&gt;
&lt;a href="https://github.com/mcpware/claude-code-organizer" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/2cfa98e52053eab0ff88f037583eddc40dd42ea1b06a43acda70c7e5af3c67df/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4d43502d53656375726974792532305363616e6e65722d726564" alt="MCP Security"&gt;&lt;/a&gt;
&lt;a href="https://github.com/punkpeye/awesome-mcp-servers" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/da45c6891689c899591d53a4bfc508553e42530149ca34f5eee1017c502722cb/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f417765736f6d652d4d4350253230536572766572732d6663363061383f6c6f676f3d617765736f6d656c69737473266c6f676f436f6c6f723d7768697465" alt="Awesome MCP"&gt;&lt;/a&gt;
&lt;a href="https://github.com/mcpware/claude-code-organizer#verified-against-claude-code-source" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/ad7fc1ec6ecdb2a127deb8f38f2eba99f245d790385001db8a8708ada5e3571a/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f56657269666965642d436c61756465253230436f6465253230536f757263652d626c756576696f6c6574" alt="Verified Against CC Source"&gt;&lt;/a&gt;
English | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.zh-CN.md" rel="noopener noreferrer"&gt;简体中文&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.zh-TW.md" rel="noopener noreferrer"&gt;繁體中文&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.zh-HK.md" rel="noopener noreferrer"&gt;廣東話&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.ja.md" rel="noopener noreferrer"&gt;日本語&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.ko.md" rel="noopener noreferrer"&gt;한국어&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.es.md" rel="noopener noreferrer"&gt;Español&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.id.md" rel="noopener noreferrer"&gt;Bahasa Indonesia&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.it.md" rel="noopener noreferrer"&gt;Italiano&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.pt-BR.md" rel="noopener noreferrer"&gt;Português&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.tr.md" rel="noopener noreferrer"&gt;Türkçe&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.vi.md" rel="noopener noreferrer"&gt;Tiếng Việt&lt;/a&gt; | &lt;a href="https://github.com/mcpware/claude-code-organizer/README.th.md" rel="noopener noreferrer"&gt;ไทย&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Claude Code Organizer (CCO)&lt;/strong&gt; is a free, open-source dashboard that lets you manage all Claude Code configuration — memories, skills, MCP servers, settings, agents, rules, and hooks — across global and project scopes. It includes a security scanner for MCP tool poisoning and prompt injection, a per-item context token budget tracker, per-project MCP enable/disable controls, and bulk cleanup for duplicate configs. All without leaving the window.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;v0.17.0&lt;/strong&gt; — Session Distiller strips bloated sessions down to ~10% of their original size while keeping every word of conversation intact…&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/mcpware/claude-code-organizer" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;The distiller is part of CCO v0.17.0. Dashboard button, CLI flag, and API endpoint all included. Image trimmer works standalone or as a &lt;code&gt;/trim-images&lt;/code&gt; skill.&lt;/p&gt;

&lt;p&gt;If your sessions are small, you don't need this. If your sessions regularly push 50MB+, this is the difference between "--resume working" and "--resume followed by Claude forgetting your name."&lt;/p&gt;

&lt;h2&gt;
  
  
  About Me
&lt;/h2&gt;

&lt;p&gt;CS dropout. Building tools for the Claude Code ecosystem. &lt;a href="https://github.com/ithiria894" rel="noopener noreferrer"&gt;github.com/ithiria894&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⭐ &lt;strong&gt;&lt;a href="https://github.com/mcpware/claude-code-organizer" rel="noopener noreferrer"&gt;Star the repo&lt;/a&gt;&lt;/strong&gt; if bloated sessions have ever ruined your day.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>showdev</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Built an Autonomous Job Application Agent with Claude AI — Here's How It Works</title>
      <dc:creator>Tanzil Ahmed</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:35:27 +0000</pubDate>
      <link>https://core.forem.com/tanzilahmed/i-built-an-autonomous-job-application-agent-with-claude-ai-heres-how-it-works-31d9</link>
      <guid>https://core.forem.com/tanzilahmed/i-built-an-autonomous-job-application-agent-with-claude-ai-heres-how-it-works-31d9</guid>
      <description>&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;Job Hunter AI is an autonomous agent that searches job boards, researches companies using Claude AI, and generates tailored CVs and cover letters — with zero manual intervention.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/Tanzil-Ahmed/job-hunter-agent" rel="noopener noreferrer"&gt;https://github.com/Tanzil-Ahmed/job-hunter-agent&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Job hunting is repetitive and exhausting. Every application needs the same research: What does this company do? What's their tech stack? Does my background fit? And then you rewrite your CV for each role.&lt;/p&gt;

&lt;p&gt;I automated all of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The pipeline has 4 stages:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Job Discovery&lt;/strong&gt;&lt;br&gt;
Searches job boards automatically using Tavily and Exa APIs. Filters by role, location, and relevance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Company Research (Claude tool_use)&lt;/strong&gt;&lt;br&gt;
For each job, Claude uses tool_use to research the company — analyzing tech stack, culture, funding stage, and fit score against your profile.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. CV + Cover Letter Generation&lt;/strong&gt;&lt;br&gt;
Claude generates a tailored CV and cover letter for each role based on the research. Each one is different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Real-time Dashboard&lt;/strong&gt;&lt;br&gt;
FastAPI backend with WebSocket streaming shows the pipeline running live.&lt;/p&gt;
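
&lt;p&gt;Stage 2 is the technically interesting part, so here is a rough sketch of the tool_use handshake using plain dicts (&lt;code&gt;RESEARCH_TOOL&lt;/code&gt; and &lt;code&gt;extract_tool_calls&lt;/code&gt; are illustrative names, not the repo's actual code): define a tool schema, let Claude request it, run the search yourself, and feed the result back.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of stage 2; names are illustrative, not the repo's code.

RESEARCH_TOOL = {
    "name": "search_company",
    "description": "Search the web for a company's tech stack, funding, and culture.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

def extract_tool_calls(content_blocks):
    """Pull tool_use requests out of an assistant message's content blocks."""
    return [
        (b["name"], b["input"])
        for b in content_blocks
        if b.get("type") == "tool_use"
    ]

# With the Anthropic SDK the loop is roughly:
#   resp = client.messages.create(model=..., tools=[RESEARCH_TOOL],
#                                 messages=[{"role": "user", "content": prompt}])
#   for name, args in extract_tool_calls([b.model_dump() for b in resp.content]):
#       results = run_search(args["query"])  # Tavily/Exa from stage 1, reused
#       # ...return results to Claude as a tool_result message...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;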

</description>
      <category>python</category>
      <category>ai</category>
      <category>claudeapi</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Advancing DevOps/Cloud Learning: Strategies for Post-Foundational Skill Development</title>
      <dc:creator>Marina Kovalchuk</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:35:17 +0000</pubDate>
      <link>https://core.forem.com/maricode/advancing-devopscloud-learning-strategies-for-post-foundational-skill-development-3be0</link>
      <guid>https://core.forem.com/maricode/advancing-devopscloud-learning-strategies-for-post-foundational-skill-development-3be0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: Navigating the DevOps/Cloud Learning Journey
&lt;/h2&gt;

&lt;p&gt;You’ve nailed the basics—Linux, networking, AWS fundamentals, and even wrestled with Nginx and S3 permissions. Now, the real challenge begins: &lt;strong&gt;how do you advance beyond foundational knowledge without wasting time or money on suboptimal resources?&lt;/strong&gt; This is where most learners stall. The DevOps/Cloud landscape is a minefield of courses, certifications, and tools, each promising to elevate your skills. But here’s the harsh truth: &lt;em&gt;not all advanced learning paths are created equal.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Consider the learner who, after mastering AWS basics, enrolls in a course heavy on theory but light on practical CI/CD pipelines. The result? &lt;strong&gt;They can explain Jenkins but can’t configure it in a real-world scenario.&lt;/strong&gt; Or the one who opts for a free, unstructured resource, only to realize their portfolio lacks the depth to impress hiring managers. These failures aren’t about effort—they’re about &lt;em&gt;misalignment between learning strategy and career goals.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Mechanics of Course Selection: Why Most Learners Fail
&lt;/h3&gt;

&lt;p&gt;The typical learner evaluates courses based on surface-level criteria: cost, duration, or instructor popularity. But this approach ignores the &lt;strong&gt;system mechanisms&lt;/strong&gt; that determine learning outcomes. For instance, a course’s value isn’t just in its content—it’s in how it &lt;em&gt;integrates real-world projects&lt;/em&gt; that simulate production environments. Without this, learners risk acquiring &lt;strong&gt;theoretical knowledge that doesn’t translate to hands-on expertise.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Take CI/CD pipelines, a cornerstone of DevOps. A course that merely lectures on Jenkins or GitLab CI will leave you unprepared for the &lt;em&gt;chaos of debugging a failing pipeline in a live environment.&lt;/em&gt; The mechanism of failure here is clear: &lt;strong&gt;theory without practice leads to brittle skills that crack under pressure.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluating "Train with Shubham" vs. Alternatives: A Causal Analysis
&lt;/h3&gt;

&lt;p&gt;Let’s dissect the case of "Train with Shubham" versus other advanced courses. The key factors are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Content Depth:&lt;/strong&gt; Does the course cover automation tools like Terraform and Ansible, or does it rely on manual configurations? &lt;em&gt;Automation is non-negotiable in modern DevOps.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instructor Credibility:&lt;/strong&gt; Check Shubham’s GitHub or LinkedIn. &lt;em&gt;Real-world experience in production environments is a proxy for course quality.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Practical Projects:&lt;/strong&gt; Are there end-to-end projects that mimic industry scenarios? &lt;em&gt;Without these, you’re building sandcastles, not careers.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare this to a generic Udemy course. While cheaper, it often lacks &lt;strong&gt;structured feedback loops&lt;/strong&gt;—forums or Discord groups where learners troubleshoot together. This isolation slows learning and increases the risk of &lt;em&gt;misinterpreting concepts.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Edge Cases: When "Train with Shubham" Might Not Be Optimal
&lt;/h3&gt;

&lt;p&gt;Not every learner benefits equally from "Train with Shubham." For instance, if your goal is &lt;strong&gt;vendor-neutral knowledge&lt;/strong&gt; (e.g., Kubernetes over AWS-specific tools), a course heavily focused on AWS might misalign with your objectives. The mechanism here is &lt;em&gt;over-specialization&lt;/em&gt;, which limits your adaptability across cloud providers.&lt;/p&gt;

&lt;p&gt;Alternatively, if you’re on a tight budget, free resources like &lt;strong&gt;AWS re/Start&lt;/strong&gt; or &lt;em&gt;HashiCorp’s Terraform tutorials&lt;/em&gt; can be effective—but only if supplemented with &lt;strong&gt;structured projects.&lt;/strong&gt; The failure mode here is &lt;em&gt;fragmented learning&lt;/em&gt;, where you acquire pieces of knowledge without a cohesive framework.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule for Choosing Advanced Courses: If X, Then Y
&lt;/h3&gt;

&lt;p&gt;Here’s a decision-dominant rule backed by mechanism:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If your goal is to master CI/CD pipelines and automation tools (X), choose a course with real-world projects and instructor-led feedback (Y). Otherwise, you risk acquiring theoretical knowledge that fails in production environments.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For example, if "Train with Shubham" includes &lt;em&gt;end-to-end CI/CD projects&lt;/em&gt; and a &lt;strong&gt;Discord community for troubleshooting&lt;/strong&gt;, it’s a strong contender. But if it lacks these, consider alternatives like &lt;em&gt;A Cloud Guru’s DevOps path&lt;/em&gt;, which balances theory with hands-on labs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion: Strategic Learning as a Career Accelerator
&lt;/h3&gt;

&lt;p&gt;Advancing in DevOps/Cloud isn’t about consuming more content—it’s about &lt;strong&gt;strategic selection&lt;/strong&gt; of resources that align with your career goals and learning style. The stakes are high: &lt;em&gt;a misstep here can delay your progression by months.&lt;/em&gt; By evaluating courses through the lens of &lt;strong&gt;practical projects, instructor credibility, and community support&lt;/strong&gt;, you ensure that every hour spent learning translates to tangible skills.&lt;/p&gt;

&lt;p&gt;Remember: &lt;em&gt;The cloud never stops evolving, and neither should your learning strategy.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Scenario Analysis: Real-World Applications and Skill Gaps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The Automation Bottleneck: From Manual to Scalable Infrastructure
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; You’ve manually configured EC2 instances and S3 buckets, but your team’s deployment process still takes hours. Management demands faster releases, and your manual scripts are breaking under scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mechanism:&lt;/strong&gt; Manual configurations introduce human error and lack reproducibility. As infrastructure scales, ad-hoc scripts fail due to state drift and dependency conflicts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skill Gap:&lt;/strong&gt; Lack of proficiency in Infrastructure as Code (IaC) tools like Terraform or CloudFormation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision Rule:&lt;/strong&gt; If your goal is to eliminate manual bottlenecks, prioritize courses with &lt;em&gt;end-to-end IaC projects&lt;/em&gt; (e.g., Terraform modules for multi-environment deployments). Avoid theory-heavy courses lacking hands-on labs.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The CI/CD Pipeline Paradox: Builds Succeed, Deployments Fail
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Your Jenkins pipeline compiles code successfully, but deployments to Kubernetes clusters fail intermittently. Logs show resource quota errors and image pull failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mechanism:&lt;/strong&gt; CI/CD pipelines without integrated testing and monitoring stages mask failures until production. Misconfigured Kubernetes manifests or untested Helm charts cause runtime errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skill Gap:&lt;/strong&gt; Inability to design resilient CI/CD pipelines with integrated testing, monitoring, and rollback mechanisms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision Rule:&lt;/strong&gt; Choose courses with &lt;em&gt;GitOps workflow projects&lt;/em&gt; (e.g., ArgoCD + Jenkins X) over basic CI/CD tutorials. Verify the course includes debugging labs for pipeline failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The Multi-Cloud Misalignment: AWS Expertise Fails in Azure
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Your AWS-heavy resume lands you an Azure DevOps role. You struggle to translate S3 permissions to Azure Blob Storage ACLs, delaying project delivery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mechanism:&lt;/strong&gt; Cloud provider-specific knowledge becomes a liability when switching ecosystems. Over-specialization in one platform creates blind spots in cross-cloud architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skill Gap:&lt;/strong&gt; Lack of vendor-neutral cloud architecture principles (e.g., Well-Architected Framework).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision Rule:&lt;/strong&gt; If targeting multi-cloud roles, select courses emphasizing &lt;em&gt;cloud-agnostic patterns&lt;/em&gt; (e.g., HashiCorp’s multi-cloud demos) over AWS-only content.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The Monitoring Blindspot: Alerts Flood In, Root Cause Elusive
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Your Prometheus alerts spike during peak traffic, but dashboards show no CPU/memory anomalies. Users report 500 errors, yet logs are inconclusive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mechanism:&lt;/strong&gt; Monitoring systems without distributed tracing or correlation rules fail to pinpoint failures in microservices architectures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skill Gap:&lt;/strong&gt; Inadequate knowledge of observability tools (e.g., Jaeger, OpenTelemetry).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision Rule:&lt;/strong&gt; Prioritize courses integrating &lt;em&gt;observability into CI/CD pipelines&lt;/em&gt; (e.g., automated trace collection in Jenkins). Avoid courses treating monitoring as an afterthought.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The Security Breach: Misconfigured IAM Roles Expose Data
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; A misconfigured IAM role grants S3 write access to an external contractor, leading to a data leak. Auditors flag non-compliance with SOC 2 requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mechanism:&lt;/strong&gt; DevOps practices without security integration (DevSecOps) create exploitable gaps. Lack of automated policy checks allows misconfigurations to propagate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skill Gap:&lt;/strong&gt; Inability to implement security automation (e.g., Terraform + Sentinel).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision Rule:&lt;/strong&gt; If security is critical, choose courses with &lt;em&gt;integrated security modules&lt;/em&gt; (e.g., OWASP Top 10 for DevOps). Validate instructors’ DevSecOps experience via GitHub repos.&lt;/p&gt;
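
&lt;p&gt;The kind of automated policy check that tools like Sentinel or OPA run can be illustrated in a few lines. This is a simplified sketch over an IAM-style policy dictionary, not real Sentinel or AWS tooling, but it shows the mechanism: misconfigurations are caught by code before they propagate.&lt;/p&gt;

```python
def risky_statements(policy: dict) -> list[str]:
    """Flag obviously over-broad statements in an IAM-style policy dict."""
    findings = []
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        # wildcard actions grant far more than any contractor should have
        if "*" in actions or any(a.endswith(":*") for a in actions):
            findings.append("wildcard action grant")
        # a wildcard principal opens the statement to anyone
        if stmt.get("Principal") == "*":
            findings.append("policy open to any principal")
    return findings
```

&lt;p&gt;Run as a CI gate, a check like this would have blocked the contractor scenario above before auditors ever saw it.&lt;/p&gt;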

&lt;h3&gt;
  
  
  6. The Cost Overrun: Cloud Bills Spike Post-Migration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; After migrating to Kubernetes, your monthly cloud bill triples. Spot instances are underutilized, and reserved instances are misallocated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mechanism:&lt;/strong&gt; Lack of FinOps practices leads to inefficient resource allocation. Autoscaling policies that lack cost-optimization triggers waste resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skill Gap:&lt;/strong&gt; Inadequate understanding of cloud cost management tools (e.g., Kubecost, CloudHealth).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision Rule:&lt;/strong&gt; If cost control is a priority, select courses covering &lt;em&gt;FinOps automation&lt;/em&gt; (e.g., Terraform cost estimation modules). Avoid courses ignoring financial governance.&lt;/p&gt;
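
&lt;p&gt;The FinOps mechanism is ultimately arithmetic: reserved capacity bills whether used or not, so it only wins above a utilization break-even. The rates below are made up for illustration; tools like Kubecost or CloudHealth do this with live pricing and usage data.&lt;/p&gt;

```python
def compare_pricing(hours_used: float, on_demand_rate: float,
                    reserved_rate: float, month_hours: float = 730) -> dict:
    """Compare on-demand billing against an always-on reserved instance."""
    on_demand = hours_used * on_demand_rate
    reserved = month_hours * reserved_rate  # billed for the full month regardless
    return {
        "on_demand": round(on_demand, 2),
        "reserved": round(reserved, 2),
        "cheaper": "reserved" if on_demand > reserved else "on_demand",
    }
```

&lt;p&gt;A bursty workload running 200 hours a month at a hypothetical $0.10/h should stay on-demand ($20 vs. $43.80 reserved at $0.06/h); the same instance running flat out flips the answer. Misallocating reserved instances to bursty workloads is exactly how bills triple.&lt;/p&gt;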

&lt;h3&gt;
  
  
  Comparative Analysis: "Train with Shubham" vs. Alternatives
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Content Depth:&lt;/strong&gt; "Train with Shubham" excels in CI/CD and Kubernetes projects but lacks Azure/GCP coverage. A Cloud Guru offers broader multi-cloud content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Practical Projects:&lt;/strong&gt; Shubham’s end-to-end labs (e.g., Jenkins + Helm deployments) outperform Udemy’s theory-heavy courses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community Support:&lt;/strong&gt; Shubham’s Discord group provides faster feedback than Coursera’s forums.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimal Choice:&lt;/strong&gt; If your goal is &lt;em&gt;Kubernetes and CI/CD mastery&lt;/em&gt;, "Train with Shubham" is superior. For multi-cloud, supplement with A Cloud Guru.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Edge Case: Budget Constraints
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mechanism:&lt;/strong&gt; Free resources (e.g., AWS re:Start) lack structured projects, leading to fragmented learning. Without feedback loops, misconceptions persist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; If budget is limited, combine free resources with &lt;em&gt;open-source project contributions&lt;/em&gt; (e.g., Kubernetes GitHub issues) to simulate structured learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategic Learning Plans: Tailored Roadmaps for Success
&lt;/h2&gt;

&lt;p&gt;After mastering foundational topics like Linux, networking, and AWS basics, the next step in your DevOps/Cloud journey requires a strategic approach. The &lt;strong&gt;core mechanism&lt;/strong&gt; here is aligning your learning resources with both your career goals and the &lt;em&gt;dynamic demands of the industry&lt;/em&gt;. Misalignment leads to skill gaps, as theoretical knowledge without practical application fails in real-world scenarios. Below, we dissect your options, focusing on the &lt;strong&gt;Train with Shubham&lt;/strong&gt; course and alternatives, using a &lt;em&gt;mechanistic lens&lt;/em&gt; to evaluate effectiveness.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Evaluating "Train with Shubham": Mechanism and Fit
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Train with Shubham&lt;/strong&gt; course excels in &lt;em&gt;CI/CD pipelines and Kubernetes&lt;/em&gt;, critical for modern DevOps. Its &lt;strong&gt;end-to-end labs&lt;/strong&gt; simulate production environments, addressing the &lt;em&gt;automation bottleneck&lt;/em&gt;—a common failure point where manual configurations lead to state drift and dependency conflicts. For example, misconfigured Kubernetes manifests cause runtime errors, which Shubham’s labs explicitly target through hands-on debugging.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Real-world projects (e.g., GitOps workflows with ArgoCD)&lt;/li&gt;
&lt;li&gt;Active Discord community for structured feedback loops&lt;/li&gt;
&lt;li&gt;Instructor credibility (Shubham’s production experience in Kubernetes)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Weaknesses:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Limited Azure/GCP coverage, risking &lt;em&gt;multi-cloud misalignment&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;No integrated FinOps modules, leaving a &lt;em&gt;cost optimization gap&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Decision Rule:&lt;/strong&gt; If your goal is &lt;em&gt;Kubernetes and CI/CD mastery&lt;/em&gt;, choose Shubham. However, supplement with multi-cloud resources (e.g., A Cloud Guru) to avoid vendor lock-in.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Alternative Paths: Comparative Analysis
&lt;/h3&gt;

&lt;p&gt;Alternatives like &lt;strong&gt;A Cloud Guru’s DevOps path&lt;/strong&gt; or &lt;strong&gt;Udemy courses&lt;/strong&gt; must be evaluated against &lt;em&gt;system mechanisms&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A Cloud Guru:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Advantage:&lt;/strong&gt; Broader multi-cloud content (AWS, Azure, GCP), addressing &lt;em&gt;vendor-neutral goals&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disadvantage:&lt;/strong&gt; Less hands-on than Shubham; forums provide slower feedback, increasing risk of &lt;em&gt;misinterpretation&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Udemy:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Risk:&lt;/strong&gt; Theory-heavy courses lack &lt;em&gt;practical projects&lt;/em&gt;, leading to brittle skills that fail under pressure (e.g., CI/CD pipelines without monitoring stages)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge Case:&lt;/strong&gt; Budget-friendly but requires supplementation with open-source contributions to simulate structured learning&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optimal Choice:&lt;/strong&gt; For &lt;em&gt;Kubernetes/CI/CD focus&lt;/em&gt;, Shubham dominates. For &lt;em&gt;multi-cloud architecture&lt;/em&gt;, A Cloud Guru is superior. Avoid Udemy unless supplemented with GitHub projects to address &lt;em&gt;fragmented learning&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Edge Cases: Budget Constraints and Vendor-Neutral Goals
&lt;/h3&gt;

&lt;p&gt;If budget is a constraint, &lt;strong&gt;free resources&lt;/strong&gt; like AWS re:Start or Kubernetes GitHub issues can work, but they lack &lt;em&gt;structured feedback loops&lt;/em&gt;. The &lt;strong&gt;mechanism of failure&lt;/strong&gt; here is fragmented learning, where knowledge isn’t integrated into a cohesive framework. To mitigate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Combine free resources with &lt;em&gt;open-source contributions&lt;/em&gt; (e.g., fixing Kubernetes issues)&lt;/li&gt;
&lt;li&gt;Use Shubham’s free YouTube content for foundational CI/CD concepts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; If budget is limited, use free resources plus open-source contributions to simulate structured learning. Without that feedback loop, you risk &lt;em&gt;skill fragmentation&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Long-Term Strategy: Portfolio vs. Certifications
&lt;/h3&gt;

&lt;p&gt;Certifications (e.g., AWS Certified DevOps Engineer) signal baseline knowledge but don’t replace &lt;em&gt;practical skills&lt;/em&gt;. The &lt;strong&gt;mechanism&lt;/strong&gt; is that certifications often test theoretical understanding, while employers prioritize &lt;em&gt;portfolio projects&lt;/em&gt; demonstrating real-world problem-solving. For example, a CI/CD pipeline with integrated security (Terraform + Sentinel) is more impactful than a certification badge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; If your goal is &lt;em&gt;immediate job placement&lt;/em&gt;, prioritize certifications. For &lt;em&gt;long-term career growth&lt;/em&gt;, build a portfolio with end-to-end projects (e.g., multi-cloud deployment with FinOps automation).&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion: Dominant Strategy Selection
&lt;/h3&gt;

&lt;p&gt;The optimal path depends on your &lt;em&gt;goal mechanism&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;If your goal is Kubernetes/CI/CD mastery&lt;/strong&gt; → &lt;strong&gt;Train with Shubham, plus A Cloud Guru for multi-cloud&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If budget is the constraint&lt;/strong&gt; → &lt;strong&gt;free resources plus open-source contributions&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you are optimizing for long-term growth&lt;/strong&gt; → &lt;strong&gt;portfolio-focused learning with end-to-end projects&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Avoid typical errors like &lt;em&gt;over-specialization&lt;/em&gt; (e.g., AWS-only courses) or &lt;em&gt;theory-heavy learning&lt;/em&gt;. Continuously evolve your strategy as cloud technologies advance, ensuring alignment with both industry demands and your career trajectory.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>learning</category>
      <category>automation</category>
    </item>
    <item>
      <title>Your Startup Isn’t Confusing, Your Documentation Is (Here’s How to Fix It)</title>
      <dc:creator>Ezejah Chimkamma</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:33:16 +0000</pubDate>
      <link>https://core.forem.com/ezejah_chimkamma_06758a9b/your-startup-isnt-confusing-your-documentation-is-heres-how-to-fix-it-4o1b</link>
      <guid>https://core.forem.com/ezejah_chimkamma_06758a9b/your-startup-isnt-confusing-your-documentation-is-heres-how-to-fix-it-4o1b</guid>
      <description>&lt;p&gt;Most startups don’t have a product problem.&lt;/p&gt;

&lt;p&gt;They have a clarity problem.&lt;/p&gt;

&lt;p&gt;You built something powerful.&lt;br&gt;
Something useful.&lt;br&gt;
Something people should understand.&lt;/p&gt;

&lt;p&gt;But they don’t.&lt;/p&gt;

&lt;p&gt;Not because they’re not smart,&lt;br&gt;
but because your documentation is doing a poor job explaining it.&lt;/p&gt;

&lt;p&gt;And that’s costing you users.&lt;/p&gt;

&lt;p&gt;🚨 The Silent Killer: Bad Documentation&lt;/p&gt;

&lt;p&gt;Here’s what’s happening behind the scenes:&lt;/p&gt;

&lt;p&gt;Users sign up&lt;br&gt;
They get confused&lt;br&gt;
They leave quietly&lt;/p&gt;

&lt;p&gt;No complaints. No feedback. Just… gone.&lt;/p&gt;

&lt;p&gt;And you think:&lt;/p&gt;

&lt;p&gt;“Maybe the product needs more features”&lt;/p&gt;

&lt;p&gt;It doesn’t.&lt;/p&gt;

&lt;p&gt;It needs better explanation.&lt;/p&gt;

&lt;p&gt;⚠️ Mistake #1: You’re Writing for Yourself, Not the User&lt;/p&gt;

&lt;p&gt;Most startup documentation sounds like this:&lt;/p&gt;

&lt;p&gt;“Initialize the configuration by executing the required environment parameters…”&lt;/p&gt;

&lt;p&gt;That’s not helpful.&lt;/p&gt;

&lt;p&gt;Your users are not inside your head.&lt;br&gt;
They don’t know your system like you do.&lt;/p&gt;

&lt;p&gt;✅ Fix:&lt;/p&gt;

&lt;p&gt;Write like you’re explaining to a smart beginner.&lt;/p&gt;

&lt;p&gt;“Start by setting up your environment variables. This tells the system how to run your app properly.”&lt;/p&gt;

&lt;p&gt;Simple. Clear. Human.&lt;/p&gt;

&lt;p&gt;⚠️ Mistake #2: You Skip the “Why”&lt;/p&gt;

&lt;p&gt;You explain what to do…&lt;br&gt;
But not why it matters.&lt;/p&gt;

&lt;p&gt;So users follow steps blindly — or worse, they stop trying.&lt;/p&gt;

&lt;p&gt;✅ Fix:&lt;/p&gt;

&lt;p&gt;Always answer:&lt;/p&gt;

&lt;p&gt;“Why should I care about this step?”&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;“This step connects your app to the database, so your data can be stored and retrieved.”&lt;/p&gt;

&lt;p&gt;Now it makes sense.&lt;/p&gt;

&lt;p&gt;⚠️ Mistake #3: No Onboarding Flow&lt;/p&gt;

&lt;p&gt;You drop users into documentation like:&lt;/p&gt;

&lt;p&gt;“Here’s everything. Good luck.”&lt;/p&gt;

&lt;p&gt;That’s overwhelming.&lt;/p&gt;

&lt;p&gt;✅ Fix:&lt;/p&gt;

&lt;p&gt;Guide them step-by-step:&lt;/p&gt;

&lt;p&gt;What to do first&lt;br&gt;
What comes next&lt;br&gt;
What success looks like&lt;/p&gt;

&lt;p&gt;Make them feel progress.&lt;/p&gt;

&lt;p&gt;⚠️ Mistake #4: Too Technical or Too Vague&lt;/p&gt;

&lt;p&gt;You either:&lt;/p&gt;

&lt;p&gt;Overcomplicate everything&lt;br&gt;
OR&lt;br&gt;
Say things that mean nothing&lt;/p&gt;

&lt;p&gt;Both are dangerous.&lt;/p&gt;

&lt;p&gt;✅ Fix:&lt;/p&gt;

&lt;p&gt;Be specific, but clear.&lt;/p&gt;

&lt;p&gt;Bad:&lt;/p&gt;

&lt;p&gt;“Optimize your configuration”&lt;/p&gt;

&lt;p&gt;Better:&lt;/p&gt;

&lt;p&gt;“Reduce API response time by caching repeated requests”&lt;/p&gt;
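
&lt;p&gt;Specific advice like that is also small enough to show. A minimal time-based cache sketch, assuming a simplified decorator; a real service would also bound the cache size and handle keyword arguments.&lt;/p&gt;

```python
import time

def cached(ttl_seconds: float):
    """Decorator: reuse a result for ttl_seconds instead of recomputing it."""
    def decorator(fn):
        store = {}
        def wrapper(*args):
            now = time.monotonic()
            if args in store:
                value, stored_at = store[args]
                if ttl_seconds > now - stored_at:
                    return value  # cache hit: skip the slow call
            value = fn(*args)
            store[args] = (value, now)
            return value
        return wrapper
    return decorator

@cached(ttl_seconds=60)
def fetch_profile(user_id: str) -> dict:
    # stands in for a slow upstream API request
    return {"id": user_id}
```

&lt;p&gt;Docs that pair the sentence with a snippet like this give users something to copy, not just something to nod at.&lt;/p&gt;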

&lt;p&gt;💡 Here’s the Truth Most Startups Miss&lt;/p&gt;

&lt;p&gt;Good documentation is not “extra work”&lt;/p&gt;

&lt;p&gt;It’s:&lt;/p&gt;

&lt;p&gt;Better onboarding&lt;br&gt;
Fewer support requests&lt;br&gt;
Higher user retention&lt;/p&gt;

&lt;p&gt;It’s the difference between:&lt;br&gt;
👉 A product people try&lt;br&gt;
👉 And a product people actually use&lt;/p&gt;

&lt;p&gt;👋 Final Thought&lt;/p&gt;

&lt;p&gt;If users don’t understand your product,&lt;br&gt;
they won’t use it, no matter how good it is.&lt;/p&gt;

&lt;p&gt;Clarity is not optional.&lt;br&gt;
It’s part of the product.&lt;/p&gt;

&lt;p&gt;🚀 If this sounds familiar…&lt;/p&gt;

&lt;p&gt;If you’re building a product and your users struggle to understand how it works, I help startups turn complex systems into clear, user-friendly documentation and onboarding.&lt;/p&gt;

</description>
      <category>devrel</category>
      <category>saas</category>
      <category>product</category>
      <category>startup</category>
    </item>
    <item>
      <title>How AI Engineers Actually Use Datasets: Test Cases, Edge Cases and Agent Reliability</title>
      <dc:creator>Kalio Princewill</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:32:46 +0000</pubDate>
      <link>https://core.forem.com/kalio/how-ai-engineers-actually-use-datasets-test-cases-edge-cases-35gf</link>
      <guid>https://core.forem.com/kalio/how-ai-engineers-actually-use-datasets-test-cases-edge-cases-35gf</guid>
      <description>&lt;p&gt;Most AI agent discussions focus on models. In practice, the model is rarely the problem.&lt;/p&gt;

&lt;p&gt;When you build an agent today you are almost certainly not training it. The model is fixed. What determines whether the agent actually works is everything around it: the tools it can call, the prompts that guide it, the logic that decides what it does next.&lt;/p&gt;

&lt;p&gt;So when people say "we need more data," they usually do not mean training. They mean better test cases, clearer failure scenarios, and a way to measure whether the agent is behaving correctly.&lt;/p&gt;

&lt;p&gt;This article breaks down how to evaluate an AI agent properly: what to test, how to structure realistic scenarios from real world data, how to score the path the agent takes not just the answer it lands on, and how to design adversarial tests that force actual reasoning instead of pattern matching.&lt;/p&gt;

&lt;p&gt;Using SRE agents as the concrete example throughout.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Are Not Doing vs What You Are
&lt;/h2&gt;

&lt;p&gt;Before anything else, this distinction matters.&lt;/p&gt;

&lt;p&gt;What you are NOT doing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Feeding logs into the model to teach it new things&lt;/li&gt;
&lt;li&gt;Fine tuning weights&lt;/li&gt;
&lt;li&gt;Changing how the underlying LLM reasons&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What you ARE doing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using real world logs to construct realistic test scenarios&lt;/li&gt;
&lt;li&gt;Grading whether the agent investigates correctly&lt;/li&gt;
&lt;li&gt;Exposing edge cases the agent currently fails at&lt;/li&gt;
&lt;li&gt;Using those failures to improve prompts, tools, and agent logic&lt;/li&gt;
&lt;li&gt;Building a test suite that gets harder as the agent gets better&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model does not improve through this process. What improves is the system around it. Test cases are how you measure that system rigorously instead of guessing. With that clear, the next question is what you are actually grading.&lt;/p&gt;
&lt;h2&gt;
  
  
  What You Are Actually Testing
&lt;/h2&gt;

&lt;p&gt;When you test an AI agent you are not checking if the model knows things. The model already knows things. You are checking three specific behaviours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does the agent pick the right tools in the right order?&lt;/strong&gt;&lt;br&gt;
Given a scenario, does it investigate correctly or does it jump straight to conclusions?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does it stop at the right time?&lt;/strong&gt;&lt;br&gt;
Does it know when it has found the root cause and stop, or does it keep going in circles?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can it reason through noise?&lt;/strong&gt;&lt;br&gt;
If there are red herrings in the data, metrics that look suspicious but are not causal, does it get distracted or stay on the right path?&lt;/p&gt;

&lt;p&gt;These are behaviours you grade. Not things you train. And the clearest way to see them in practice is to look at a real agent being built against exactly these constraints.&lt;/p&gt;
&lt;h2&gt;
  
  
  The SRE Agent As A Case Study
&lt;/h2&gt;

&lt;p&gt;An SRE (Site Reliability Engineering) agent is one that investigates production incidents automatically: it gets an alert, pulls logs and metrics, reasons across the signals, and produces a root cause report.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/Tracer-Cloud/opensre/tree/main/tests" rel="noopener noreferrer"&gt;OpenSRE project&lt;/a&gt; is a good concrete example of this in practice. Their test suite lives in &lt;code&gt;tests/e2e/&lt;/code&gt; and covers Kubernetes and RDS Postgres scenarios. They are building a suite of realistic incident scenarios and checking whether the agent handles them correctly.&lt;/p&gt;

&lt;p&gt;You can run the agent directly against a test fixture like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;opensre investigate &lt;span class="nt"&gt;-i&lt;/span&gt; tests/e2e/kubernetes/fixtures/datadog_k8s_alert.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That JSON fixture is a synthetic but realistic alert, constructed to represent a specific failure mode with logs, metrics, and context included. The agent runs against it and you check whether the investigation was correct. That is the entire idea. Now let us look at what that fixture actually contains.&lt;/p&gt;

&lt;h2&gt;
  
  
  What A Test Case Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;A test case has four parts: the input the agent sees, the steps you expect it to take, the answer you expect it to reach, and the red herrings it should notice but not chase.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;TEST_CASE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scenario&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RDS Postgres connection pool exhaustion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            2024-01-15 14:23:01 UTC [FATAL] remaining connection slots reserved for
            non-replication superuser connections
            2024-01-15 14:23:01 UTC [ERROR] connection to server failed: FATAL:
            sorry, too many clients already
            2024-01-15 14:23:04 UTC [WARNING] pool wait time exceeded 10000ms
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_connections&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;498&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_connections_max&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# elevated but not the cause
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;      &lt;span class="c1"&gt;# fine
&lt;/span&gt;        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alert&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Database latency spike - P95 latency exceeded 4000ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_db_connections&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_active_queries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_pool_configuration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recommend_pool_size_increase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_root_cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;connection pool exhaustion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;red_herrings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;elevated but stable, not the cause of latency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;should_stop_after&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_pool_configuration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The logs and metrics are what the agent sees. The expected steps define the correct investigation path. The red herrings flag what the agent should notice but not chase. The stop condition catches agents that keep digging after the answer is already clear.&lt;/p&gt;

&lt;p&gt;The agent runs against this input and you grade whether it got the right root cause, investigated in the right order, and did not get pulled off track by the CPU metric.&lt;/p&gt;
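
&lt;p&gt;Ordering and stopping can be graded from the step list, but the &lt;code&gt;red_herrings&lt;/code&gt; field needs its own check: the agent should mention the distractor metric, and should not name it as the cause. A crude keyword heuristic as a sketch; this is not part of OpenSRE, and real graders often use an LLM judge instead of string matching.&lt;/p&gt;

```python
def grade_red_herrings(report: str, red_herrings: dict) -> list[str]:
    """Return the red-herring metrics the agent handled badly.

    Pass criteria per metric: it appears in the report (the agent noticed it)
    and is not blamed as the root cause. Keyword matching is a crude
    stand-in for an LLM-judge grader."""
    failures = []
    for metric in red_herrings:
        mentioned = metric in report
        blamed = f"root cause: {metric}" in report or f"caused by {metric}" in report
        if blamed or not mentioned:
            failures.append(metric)
    return failures
```

&lt;p&gt;An agent that never mentions &lt;code&gt;cpu_percent&lt;/code&gt; did not really weigh the evidence; one that blames it got distracted. Both count as failures.&lt;/p&gt;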

&lt;h2&gt;
  
  
  Trajectory Scoring
&lt;/h2&gt;

&lt;p&gt;Once you have test cases, you need a way to score them. Getting the right answer is not enough. You want to know if the agent got there the right way.&lt;/p&gt;

&lt;p&gt;This matters in practice because an agent that stumbles onto the correct answer after checking ten irrelevant things is not a reliable agent. It got lucky. Trajectory scoring measures the investigation path, not just the conclusion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score_trajectory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual_steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;expected_steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;penalties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual_steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;expected_steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;expected_position&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;expected_steps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;actual_position&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
            &lt;span class="c1"&gt;# penalise for investigating out of order
&lt;/span&gt;            &lt;span class="n"&gt;position_penalty&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_position&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;actual_position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;
            &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;position_penalty&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;penalties&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unexpected step taken: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# penalise for not stopping when root cause was found
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual_steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;penalties&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent continued investigating after root cause was clear&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_steps&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;penalties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;penalties&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# example usage
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;investigate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TEST_CASE&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;score_trajectory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;actual_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;steps_taken&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TEST_CASE&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# {"score": 0.85, "max_score": 1.0, "penalties": [], "passed": True}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function takes two lists: the steps the agent actually took, and the steps you expected it to take.&lt;br&gt;
For each step the agent took, it checks two things: was this step in the expected list at all, and if so, did it happen at roughly the right point in the investigation? If the agent checked check_db_connections first and that was expected first, full credit. If it checked it third when it should have been first, it is penalised proportionally. After scoring the steps, the function checks whether the agent kept going past the point where it should have stopped.&lt;/p&gt;
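That position check can be sketched as a standalone function. The function name, step names, and the half-credit rule here are illustrative assumptions, not the exact implementation above:

```python
def score_step_positions(actual_steps, expected_steps, position_tolerance=1):
    """Credit each expected step the agent took, penalising out-of-order steps.

    Sketch only: full credit when a step lands within `position_tolerance`
    of its expected index, half credit when it happened at the wrong point.
    """
    score = 0.0
    for i, step in enumerate(actual_steps):
        if step not in expected_steps:
            continue  # unexpected steps earn nothing in this sketch
        expected_index = expected_steps.index(step)
        if abs(i - expected_index) <= position_tolerance:
            score += 1.0  # right step at roughly the right time
        else:
            score += 0.5  # right step, wrong point in the investigation
    return round(score / len(expected_steps), 2)

print(score_step_positions(
    ["check_pod_events", "check_memory_usage", "check_memory_limits"],
    ["check_pod_events", "check_memory_usage", "check_memory_limits"],
))
# 1.0
```

Tuning `position_tolerance` controls how strictly ordering matters; a tolerance of 1 forgives adjacent swaps while still punishing an agent that saved the decisive check for last.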

&lt;p&gt;Trajectory scoring handles well-labelled scenarios where you know the expected path. But what happens when the scenario is deliberately designed to mislead?&lt;/p&gt;
&lt;h2&gt;
  
  
  Adversarial Tests: Forcing The Agent To Reason, Not Pattern Match
&lt;/h2&gt;

&lt;p&gt;Standard test cases check whether the agent handles known scenarios correctly. Adversarial tests go further. They check whether the agent actually reasons or just pattern matches.&lt;br&gt;
The difference matters because production incidents do not arrive cleanly. They arrive with noise, misleading signals, and symptoms that point in the wrong direction. An agent that pattern matches will chase the loudest signal. An agent that reasons will trace the causal chain.&lt;br&gt;
Adversarial tests deliberately inject red herrings to expose which one you have built:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ADVERSARIAL_TEST&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scenario&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Kubernetes OOMKilled - misleading CPU spike&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            2024-01-15 09:15:22 UTC [WARNING] Container memory usage at 94%
            2024-01-15 09:15:45 UTC [ERROR] OOMKilled: container exceeded memory limit
            2024-01-15 09:15:45 UTC [INFO] Pod restarting...
            2024-01-15 09:16:01 UTC [WARNING] CPU throttling detected on node
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu_throttling_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# looks alarming, is a red herring
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory_usage_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;94&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory_limit_mb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pod_restarts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alert&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;High CPU throttling detected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_root_cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;container OOMKilled due to insufficient memory limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;red_herring&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu_throttling looks like the main issue but is a downstream symptom&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_pod_events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_memory_usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_memory_limits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recommend_memory_limit_increase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="c1"&gt;# agent should NOT go down this path
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incorrect_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_cpu_throttling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recommend_cpu_limit_increase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The alert fires on CPU throttling. A pattern-matching agent sees 78% throttling and immediately recommends a CPU limit increase. That is wrong. The CPU throttling is a downstream symptom of the OOMKill restart loop. The real problem is the memory limit being too low. An agent that reasons traces from the OOMKill event back to the memory configuration and stops there.&lt;br&gt;
Now you understand what a test case is, how to score it, and how to stress test the agent against misleading signals. Before you start writing your own, it is worth looking at how others have structured theirs.&lt;/p&gt;
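One way to grade the adversarial case is an explicit penalty for every step the agent took on the known-wrong path. A minimal sketch, reusing the `incorrect_path` field from the test case above (the penalty weight is an assumption):

```python
def red_herring_penalty(actual_steps, incorrect_path, penalty_per_step=0.25):
    """Measure how far the agent wandered down the known-wrong path.

    Sketch only: each incorrect-path step the agent actually took
    costs `penalty_per_step` off the trajectory score.
    """
    chased = [step for step in actual_steps if step in incorrect_path]
    return {
        "chased_red_herring": bool(chased),
        "wrong_steps": chased,
        "penalty": round(len(chased) * penalty_per_step, 2),
    }

# A pattern-matching agent that chased the CPU signal:
result = red_herring_penalty(
    ["check_cpu_throttling", "recommend_cpu_limit_increase"],
    ["check_cpu_throttling", "recommend_cpu_limit_increase"],
)
print(result["penalty"])  # 0.5
```

Subtracting that penalty from the trajectory score means a reasoning agent that ignored the CPU alert scores strictly higher than one that fixated on it.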
&lt;h2&gt;
  
  
  Looking At Existing Test Suites Before Building Your Own
&lt;/h2&gt;

&lt;p&gt;Before writing test cases from scratch, look at what others have already built. The structure is often more instructive than the content itself.&lt;br&gt;
The OpenSRE test suite separates scenarios by domain, with fixture files containing realistic alert payloads. Reading those fixtures before writing your own will save you several wrong turns: they show what a well-structured test case should actually contain, which fields matter, how much context to include, and how to frame the expected behaviour clearly enough to grade against.&lt;br&gt;
Two other eval suites worth studying for the structural pattern regardless of your domain:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.swebench.com/SWE-bench/guides/datasets/" rel="noopener noreferrer"&gt;SWE-bench&lt;/a&gt;&lt;/strong&gt;: how Princeton structured software engineering task evals for coding agents. The input, expected output, graded result pattern maps directly to any agent domain.&lt;br&gt;
&lt;strong&gt;&lt;a href="https://github.com/THUDM/AgentBench" rel="noopener noreferrer"&gt;AgentBench&lt;/a&gt;&lt;/strong&gt;: benchmark for LLM agents across different environments including OS tasks, database interactions, and web browsing. Useful for seeing how grading works across different action spaces and how to think about pass criteria when the action space is open-ended.&lt;/p&gt;

&lt;p&gt;The pattern across all of them is the same: realistic input, defined expected behaviour, graded output. Once you have that pattern clear in your head, the fastest way to build your own scenarios is synthetic.&lt;/p&gt;
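That shared pattern can be captured in a small schema so every suite grades the same way regardless of domain. The field names below are an assumption for illustration, not taken from any of the benchmarks above:

```python
from dataclasses import dataclass, field

@dataclass
class AgentTestCase:
    """Realistic input, defined expected behaviour, graded output."""
    scenario: str
    input: dict                  # logs, metrics, alerts - whatever the agent sees
    expected_root_cause: str     # ground truth to grade the conclusion against
    expected_steps: list = field(default_factory=list)
    pass_threshold: float = 0.8  # minimum trajectory score to count as passed

case = AgentTestCase(
    scenario="RDS latency spike",
    input={"metrics": {"latency_ms": 45000}},
    expected_root_cause="missing index causing full table scan",
    expected_steps=["check_slow_queries", "check_query_plans"],
)
print(case.pass_threshold)  # 0.8
```

Whether the domain is incident response or support tickets, only the contents of `input` and `expected_steps` change; the grading harness stays identical.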
&lt;h2&gt;
  
  
  Synthetic Data: What It Is and When It Plateaus
&lt;/h2&gt;

&lt;p&gt;Synthetic test cases are ones you construct yourself. You write the logs, set the metrics, define the expected answer. You control everything.&lt;br&gt;
This is the right place to start. It is fast, you can cover specific failure modes methodically, and you can design adversarial cases precisely because you decide what the red herrings are.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_synthetic_rds_scenario&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failure_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;templates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;slow_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duration: 45231 ms statement: SELECT * FROM orders WHERE status=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pending&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;autovacuum: found 80000 dead row versions in table orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_connections&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;45000&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;red_herrings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_connections&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;normal range, not the cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;root_cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;missing index causing full table scan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_slow_queries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_query_plans&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recommend_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replication_lag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replication slot lag: 8GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WAL sender process waiting for WAL to be archived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replication_lag_bytes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8589934592&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;disk_io&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;red_herrings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low, not relevant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;root_cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replication lag due to WAL accumulation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_replication_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_wal_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_replica_health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;templates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failure_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The limitation is that synthetic data plateaus. As you add more scenarios the agent improves, but the gains flatten over time. The reason is structural: every scenario you write comes from your own mental model of what can go wrong. You can only write what you can imagine, which means every edge case your synthetic suite does not cover is an edge case your agent has never been tested against.&lt;/p&gt;

&lt;p&gt;When gains plateau you have two options: build a different synthetic generator that introduces genuinely new failure patterns, or bring in real world data. Real world cases expose failure modes you never thought to write because they actually happened to someone. That is where the next section comes in.&lt;/p&gt;
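One cheap way to stretch the plateau before reaching for real data is to randomise the templates you already have, so a single hand-written scenario yields many distinct inputs. A sketch, assuming scenarios shaped like the generator's output above (the spread value is arbitrary):

```python
import random

def jitter_scenario(scenario, spread=0.15, seed=None):
    """Return a copy of a synthetic scenario with metrics perturbed by up to +/-spread.

    Sketch only: the root cause and expected steps stay fixed while the
    numbers vary, so the agent never sees identical values twice.
    """
    rng = random.Random(seed)
    varied = dict(scenario)
    varied["metrics"] = {
        key: round(value * (1 + rng.uniform(-spread, spread)), 2)
        for key, value in scenario.get("metrics", {}).items()
    }
    return varied

base = {
    "metrics": {"cpu": 35, "latency_ms": 45000},
    "root_cause": "missing index causing full table scan",
}
print(jitter_scenario(base, seed=42)["root_cause"])
# missing index causing full table scan
```

This only perturbs numbers, not failure structure, so it delays the plateau rather than escaping it; genuinely new failure patterns still require a new template or real-world data.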

&lt;h2&gt;
  
  
  Where To Get Real World Data For Test Cases
&lt;/h2&gt;

&lt;p&gt;To be clear again: you are not feeding these datasets into the model. You are reading them, understanding the failure patterns, and constructing test cases that reflect what actually happens in production.&lt;br&gt;
The SRE agent is the example we have been using throughout, but the same approach applies to any domain. If you are building an agent that handles customer support tickets, database query optimisation, fraud detection, or any other domain with structured inputs and measurable outcomes, the same process applies: find a labeled dataset in your domain, understand the failure patterns, and turn them into test cases. Here is a list of good dataset sites:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://datasetsearch.research.google.com/" rel="noopener noreferrer"&gt;Google Dataset Search&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
A search engine specifically for datasets. Search for what your agent handles in your domain: "customer support tickets", "financial transactions", "medical records".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.kaggle.com/datasets" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
A large public dataset repository with a lot of labeled data across many domains. It covers finance, healthcare, e-commerce, NLP, and more. Many Kaggle datasets include notebooks showing how others have analysed them, which makes it easier to understand failure patterns before writing test cases.&lt;/p&gt;

&lt;p&gt;Once you find a dataset, you load and filter it to find the failure windows, the rows where something actually went wrong.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# load a labeled anomaly dataset
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;server_machine_dataset.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# filter for labeled failure windows
&lt;/span&gt;&lt;span class="n"&gt;failure_window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# the metrics leading up to the failure become your test case input
# the label tells you when the failure occurred
# you construct the expected root cause from the dataset documentation
&lt;/span&gt;
&lt;span class="n"&gt;test_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scenario&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;server anomaly from SMD dataset machine-1-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;failure_window&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;network_in&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;network_out&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_root_cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;network saturation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# from dataset docs
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_network_throughput&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_active_connections&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;identify_traffic_source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One thing worth noting: the expected root cause is not something you derive from the data itself. It comes from the dataset documentation, which for published research datasets will describe what each labeled failure actually was. That documentation is the ground truth your test case is built on. Without it you have inputs but no correct answers to grade against, which means you have data but not a test suite.&lt;br&gt;
The goal across all of this is not to teach the agent. It is to know, with confidence, whether the agent you have built is reliable enough to trust in production.&lt;/p&gt;
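In practice that confidence question reduces to a pass rate over the whole suite. A minimal sketch, assuming each grader returns a dict with a boolean "passed" field like the scorers earlier in the article:

```python
def suite_pass_rate(results):
    """Aggregate graded results into a single reliability number.

    Sketch only: `results` is a list of dicts, each with a "passed" key.
    """
    if not results:
        return 0.0
    passed = sum(1 for r in results if r.get("passed"))
    return round(passed / len(results), 2)

results = [
    {"score": 0.85, "passed": True},
    {"score": 0.40, "passed": False},
    {"score": 0.92, "passed": True},
]
print(suite_pass_rate(results))  # 0.67
```

Tracking this number per release, split by standard versus adversarial cases, tells you whether the agent is getting more reliable or just better at the easy scenarios.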

&lt;h2&gt;
  
  
  More Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Tracer-Cloud/opensre/tree/main/tests" rel="noopener noreferrer"&gt;OpenSRE test suite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://datasetsearch.research.google.com" rel="noopener noreferrer"&gt;Google Dataset Search&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/datasets" rel="noopener noreferrer"&gt;Kaggle Datasets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/princeton-nlp/SWE-bench" rel="noopener noreferrer"&gt;SWE-bench&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/THUDM/AgentBench" rel="noopener noreferrer"&gt;AgentBench&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>A New opensource Security AI model being built.</title>
      <dc:creator>Joe Munene</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:31:46 +0000</pubDate>
      <link>https://core.forem.com/ghost_gi_m/a-new-opensource-security-ai-model-being-built-20de</link>
      <guid>https://core.forem.com/ghost_gi_m/a-new-opensource-security-ai-model-being-built-20de</guid>
      <description>&lt;h1&gt;
  
  
  I Built an Open-Source Cybersecurity LLM From Scratch in Python
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;What if you could build your own AI model — not fine-tune someone else's, not wrap an API — but actually build a transformer from scratch and train it on cybersecurity data?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's exactly what I did. And I'm releasing it under Apache 2.0 so anyone can use it, improve it, and build on it.&lt;/p&gt;

&lt;p&gt;Meet &lt;strong&gt;GhostLM&lt;/strong&gt; — an open-source, cybersecurity-focused language model built entirely from scratch in PyTorch. No pretrained weights. No wrappers. Every single component written by hand.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/joemunene-by/GhostLM" rel="noopener noreferrer"&gt;https://github.com/joemunene-by/GhostLM&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Built GhostLM
&lt;/h2&gt;

&lt;p&gt;Here's the thing about current AI models: they're incredibly powerful, but they weren't built for security. When you ask GPT-4 about a CVE vulnerability or a CTF challenge, it gives you a reasonable answer — but it's reasoning from general knowledge, not from deep security context.&lt;/p&gt;

&lt;p&gt;I wanted a model that actually &lt;em&gt;understands&lt;/em&gt; cybersecurity language — the patterns, the terminology, the attack methodologies. And I wanted to build it myself, not because I thought I could out-engineer OpenAI, but because &lt;strong&gt;the best way to understand how something works is to build it from the ground up.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My goal was simple: create the first open-source, cybersecurity-focused language model that anyone can run, inspect, and improve.&lt;/p&gt;




&lt;h2&gt;
  
  
  What GhostLM Is
&lt;/h2&gt;

&lt;p&gt;GhostLM is a decoder-only transformer language model — the same architecture family as GPT-2, GPT-3, and Llama — but built entirely from scratch. No &lt;code&gt;transformers.AutoModel&lt;/code&gt;, no &lt;code&gt;from_pretrained()&lt;/code&gt;. Just raw PyTorch tensors and matrix multiplications.&lt;/p&gt;

&lt;p&gt;It comes in three sizes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Layers&lt;/th&gt;
&lt;th&gt;Dim&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ghost-tiny&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;~14.5M&lt;/td&gt;
&lt;td&gt;✅ Trained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ghost-small&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;~55M&lt;/td&gt;
&lt;td&gt;🔄 Planned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ghost-medium&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;768&lt;/td&gt;
&lt;td&gt;~160M&lt;/td&gt;
&lt;td&gt;🔜 Future&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
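The parameter counts in the table can be sanity-checked with a back-of-envelope formula for GPT-2-style decoders. The vocabulary size (~50k) and context length (1024) below are assumptions, not GhostLM's actual configuration, so treat the result as a rough estimate only:

```python
def estimate_params(n_layers, d_model, vocab_size=50257, context_len=1024):
    """Rough GPT-2-style decoder parameter count (assumed config, sketch only)."""
    embeddings = vocab_size * d_model + context_len * d_model  # token + position
    per_block = 12 * d_model ** 2  # qkv + attention proj + two MLP matrices
    return embeddings + n_layers * per_block

print(f"{estimate_params(2, 256) / 1e6:.1f}M")  # 14.7M, close to ghost-tiny's ~14.5M
```

The estimate ignores biases and layer norms, and the exact totals for the larger variants depend on GhostLM's real tokenizer and context size.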

&lt;p&gt;It's trained on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CVE vulnerability descriptions&lt;/strong&gt; from the NVD database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CTF writeups&lt;/strong&gt; covering real challenge types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cybersecurity research papers&lt;/strong&gt; and abstracts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And it's fully open source under Apache 2.0.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Let me show you what "built from scratch" actually looks like.&lt;/p&gt;

&lt;h3&gt;
  
  
  Causal Self-Attention
&lt;/h3&gt;

&lt;p&gt;This is the core of every transformer. Here's GhostLM's implementation — no &lt;code&gt;F.scaled_dot_product_attention&lt;/code&gt;, no hidden magic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;size&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Combined QKV projection and split
&lt;/span&gt;    &lt;span class="n"&gt;qkv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;c_qkv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qkv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_heads&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Reshape to (B, n_heads, T, head_dim)
&lt;/span&gt;    &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_heads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_heads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_heads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Scaled dot-product attention
&lt;/span&gt;    &lt;span class="n"&gt;att&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Apply causal mask (lower triangular)
&lt;/span&gt;    &lt;span class="n"&gt;att&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;att&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;masked_fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;causal_mask&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-inf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Softmax + dropout + weighted sum
&lt;/span&gt;    &lt;span class="n"&gt;att&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;att&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;attn_dropout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;att&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;

    &lt;span class="c1"&gt;# Reassemble heads and project back
&lt;/span&gt;    &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;contiguous&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resid_dropout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;proj&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every line is intentional. The causal mask ensures the model can only attend to previous tokens (autoregressive). The attention weights are manually computed with the classic &lt;code&gt;QK^T / sqrt(d)&lt;/code&gt; formula.&lt;/p&gt;
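
&lt;p&gt;If the tensor shapes above are hard to follow, here is the same computation for a single head in plain NumPy (an illustrative sketch, not GhostLM's code):&lt;/p&gt;

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head causal attention: scores = QK^T / sqrt(d), masked, softmaxed."""
    T, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    # The upper triangle (future positions) gets -inf so softmax zeroes it out
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf
    # Numerically stable softmax over the last axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

&lt;p&gt;Each row of the returned weight matrix sums to 1, and every entry above the diagonal is exactly zero, so position 0 can only ever copy its own value vector. That is the autoregressive property the causal mask buys you.&lt;/p&gt;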

&lt;h3&gt;
  
  
  Transformer Block
&lt;/h3&gt;

&lt;p&gt;The block stacks attention and feed-forward layers with a pre-norm architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Pre-norm + self-attention with residual
&lt;/span&gt;    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;attn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ln_1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# Pre-norm + feed-forward with residual
&lt;/span&gt;    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ffn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ln_2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why pre-norm?&lt;/strong&gt; I chose pre-normalization (LayerNorm before each sub-layer) over post-norm because it's significantly more stable for training, especially on smaller models. The gradients flow more cleanly through the residual connections, and you don't need as careful a learning rate schedule.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weight Tying
&lt;/h3&gt;

&lt;p&gt;One optimization that saves ~25 million parameters: the output projection layer shares weights with the token embedding. Instead of learning two separate &lt;code&gt;vocab_size × d_model&lt;/code&gt; matrices, we learn one and reuse it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lm_head&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token_embedding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the same trick GPT-2 uses, and it works because the embedding and output projection are fundamentally doing the same thing — mapping between token space and hidden space.&lt;/p&gt;
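
&lt;p&gt;The savings are simply the size of the matrix you no longer duplicate: vocab_size × d_model. Assuming a GPT-2-style vocabulary of 50,257 tokens (the post doesn't state the tokenizer, so treat these numbers as estimates):&lt;/p&gt;

```python
vocab_size = 50257  # GPT-2 BPE vocabulary size (an assumption, not confirmed by the post)

for d_model in (256, 512, 768):
    saved = vocab_size * d_model  # one untied output-projection matrix eliminated
    print(f"d_model={d_model}: ~{saved / 1e6:.1f}M parameters saved")
```

&lt;p&gt;At that vocabulary, the ~25 million figure quoted above corresponds to the 512-wide ghost-small size; at ghost-tiny's width of 256 the tied head saves closer to ~13M.&lt;/p&gt;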




&lt;h2&gt;
  
  
  Training Data
&lt;/h2&gt;

&lt;p&gt;The data pipeline is one of the most important parts of any ML project. GhostLM's pipeline collects from three sources:&lt;/p&gt;

&lt;h3&gt;
  
  
  NVD CVE Descriptions (Real Data)
&lt;/h3&gt;

&lt;p&gt;I hit the National Vulnerability Database REST API directly — no HuggingFace dependency needed. Paginated requests with rate limiting, parsing nested JSON responses, extracting English descriptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://services.nvd.nist.gov/rest/json/cves/2.0?resultsPerPage=2000&amp;amp;startIndex=0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vulnerabilities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;cve_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cve&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# Keep the English entry (the API can return one description per language)&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cve&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;descriptions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lang&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gave me &lt;strong&gt;9,925 real CVE descriptions&lt;/strong&gt; — the kind of text that says &lt;em&gt;"A buffer overflow in the XYZ component allows remote attackers to execute arbitrary code via crafted input."&lt;/em&gt;&lt;/p&gt;
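
&lt;p&gt;The single-page snippet above is simplified; a full collector has to page through &lt;code&gt;totalResults&lt;/code&gt; and sleep between requests. A standard-library sketch of that loop (function names here are illustrative, not the ones in &lt;code&gt;data/collect.py&lt;/code&gt;):&lt;/p&gt;

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

NVD_API = "https://services.nvd.nist.gov/rest/json/cves/2.0"

def english_descriptions(vulnerabilities):
    """Extract the English description string from each NVD vulnerability record."""
    texts = []
    for item in vulnerabilities:
        for desc in item["cve"]["descriptions"]:
            if desc["lang"] == "en":
                texts.append(desc["value"])
    return texts

def fetch_all_cves(per_page=2000, delay=6.0):
    """Page through the NVD API, sleeping between requests to respect rate limits."""
    texts, start = [], 0
    while True:
        query = urlencode({"resultsPerPage": per_page, "startIndex": start})
        with urlopen(f"{NVD_API}?{query}", timeout=30) as resp:
            data = json.load(resp)
        texts.extend(english_descriptions(data["vulnerabilities"]))
        start += per_page
        if start >= data["totalResults"]:
            return texts
        time.sleep(delay)  # unauthenticated NVD clients are limited to roughly 5 requests per 30s
```

&lt;p&gt;The 6-second delay is a conservative default for unauthenticated access; an NVD API key raises the limit considerably.&lt;/p&gt;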

&lt;h3&gt;
  
  
  The Full Pipeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NVD API → 9,925 CVE descriptions (real)
Synthetic papers → 500 security research abstracts
Synthetic CTF writeups → 500 challenge solutions
─────────────────────────────────────────────────
Total: 10,925 records → ~490,532 tokens
Train: 10,378 | Validation: 547
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline handles text cleaning (unicode normalization, whitespace stripping, non-printable character removal), tokenization, chunking, and train/val splitting — all in &lt;code&gt;data/collect.py&lt;/code&gt;.&lt;/p&gt;
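
&lt;p&gt;Those cleaning steps are easy to sketch in a few lines; something along these lines (the authoritative version lives in &lt;code&gt;data/collect.py&lt;/code&gt;):&lt;/p&gt;

```python
import re
import unicodedata

def clean_text(text):
    """Unicode-normalize, drop non-printable characters, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)  # e.g. folds non-breaking spaces to plain spaces
    text = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t ")
    text = re.sub(r"[ \t]+", " ", text)    # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text) # cap consecutive blank lines
    return text.strip()
```

&lt;p&gt;Normalizing before filtering matters: NFKC turns many exotic whitespace and ligature characters into plain ASCII, so they survive the printable check instead of being silently dropped.&lt;/p&gt;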




&lt;h2&gt;
  
  
  Training Results
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. I trained ghost-tiny on a &lt;strong&gt;ThinkPad Yoga 11e with a Celeron N4100 and 4GB of RAM&lt;/strong&gt;. Yes, really.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loss Progression
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Steps&lt;/th&gt;
&lt;th&gt;Train Loss&lt;/th&gt;
&lt;th&gt;Val Loss&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;10.84&lt;/td&gt;
&lt;td&gt;10.04&lt;/td&gt;
&lt;td&gt;Random initialization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;7.12&lt;/td&gt;
&lt;td&gt;6.27&lt;/td&gt;
&lt;td&gt;First CVE patterns emerge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;td&gt;5.89&lt;/td&gt;
&lt;td&gt;5.41&lt;/td&gt;
&lt;td&gt;Starting to form sentences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2,000&lt;/td&gt;
&lt;td&gt;4.63&lt;/td&gt;
&lt;td&gt;4.58&lt;/td&gt;
&lt;td&gt;Grammar improving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3,000&lt;/td&gt;
&lt;td&gt;3.91&lt;/td&gt;
&lt;td&gt;3.95&lt;/td&gt;
&lt;td&gt;Security vocabulary appearing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4,000&lt;/td&gt;
&lt;td&gt;3.52&lt;/td&gt;
&lt;td&gt;3.58&lt;/td&gt;
&lt;td&gt;Coherent attack descriptions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5,000&lt;/td&gt;
&lt;td&gt;3.38&lt;/td&gt;
&lt;td&gt;3.46&lt;/td&gt;
&lt;td&gt;Best checkpoint saved&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The loss curve is healthy — train and validation are tracking closely, no signs of overfitting yet.&lt;/p&gt;
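
&lt;p&gt;Cross-entropy loss is easier to interpret as perplexity (e raised to the loss, roughly the model's effective branching factor per token):&lt;/p&gt;

```python
import math

# Validation losses from the table above
for step, val_loss in [(0, 10.04), (1000, 5.41), (3000, 3.95), (5000, 3.46)]:
    print(f"step {step:5d}: val loss {val_loss:5.2f} -> perplexity {math.exp(val_loss):9.1f}")
```

&lt;p&gt;Validation perplexity falls from roughly 23,000 at initialization to about 32 by step 5,000. As a sanity check, the step-0 train loss of 10.84 corresponds to a perplexity near 51,000, which is about what a uniform guess over a ~50k-token vocabulary would score: the model really does start from pure noise.&lt;/p&gt;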

&lt;h3&gt;
  
  
  Generation at 5,000 Steps
&lt;/h3&gt;

&lt;p&gt;Here's what the model generates when prompted with &lt;em&gt;"A SQL injection attack works by"&lt;/em&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A SQL injection attack works by using the admin_user sequences in the web server. Web Application Firewall Evasion Techniques present a critical defense layer against commercial and model checking. Our model achieves 94% detection rate with transformer-based sequence modeling to identify common vulnerability patterns including buffer overflows.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Is it perfect? No. It bleeds between topics (SQL injection → WAF → research paper language). But it's producing grammatically correct sentences with real security terminology. At 5,000 steps on a 14.5M parameter model running on a laptop from 2018, I'll take it.&lt;/p&gt;
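
&lt;p&gt;The post doesn't show the sampler, but output like the above is typically produced with temperature plus top-k sampling over the final logits. A minimal NumPy sketch (the parameter defaults are illustrative, not GhostLM's settings):&lt;/p&gt;

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=40, rng=None):
    """Temperature-scale the logits, keep only the top-k, then sample one token id."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k:
        cutoff = np.sort(logits)[-min(top_k, logits.size)]  # k-th largest logit
        logits = np.where(logits >= cutoff, logits, -np.inf)
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    return int(rng.choice(logits.size, p=probs))
```

&lt;p&gt;Lower temperatures sharpen the distribution (more repetitive but more on-topic text), and a smaller top_k trims the long tail of unlikely tokens, which often helps small models like ghost-tiny stay coherent.&lt;/p&gt;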

&lt;h3&gt;
  
  
  Honest Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Topic coherence&lt;/strong&gt; — the model jumps between subjects mid-generation. It needs more steps to learn to stay on topic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memorization&lt;/strong&gt; — some outputs are lifted nearly verbatim from training data. More diverse data would help.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size&lt;/strong&gt; — 14.5M params is tiny. ghost-small (55M) will be a significant jump.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU training&lt;/strong&gt; — at ~1.8s per step, 10,000 steps takes hours. GPU or TPU is needed for serious training.&lt;/li&gt;
&lt;/ul&gt;
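
&lt;p&gt;The CPU-training point is worth quantifying from the numbers above:&lt;/p&gt;

```python
step_time_s = 1.8  # measured per-step time on the Celeron N4100
steps = 10_000
hours = step_time_s * steps / 3600
print(f"{steps:,} steps x {step_time_s}s/step = {hours:.1f} hours")  # 5.0 hours
```

&lt;p&gt;Five hours of wall-clock time for 10,000 steps on a 14.5M-parameter model; even a modest GPU typically cuts that to minutes, which is why the compute plans below matter.&lt;/p&gt;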




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I've already applied for &lt;strong&gt;Google TPU Research Credits&lt;/strong&gt; to train ghost-small on proper hardware. The plan:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ghost-tiny to 10,000+ steps&lt;/strong&gt; — finish what I started&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ghost-small on TPU/GPU&lt;/strong&gt; — 55M params with real compute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace Hub release&lt;/strong&gt; — public model weights anyone can download&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live demo on HuggingFace Spaces&lt;/strong&gt; — try GhostLM in your browser&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark vs GPT-2&lt;/strong&gt; — objective comparison on cybersecurity tasks&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The entire project is open source. Clone it, run it, break it, improve it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/joemunene-by/GhostLM.git
&lt;span class="nb"&gt;cd &lt;/span&gt;GhostLM

&lt;span class="c"&gt;# Install everything&lt;/span&gt;
make &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;span class="c"&gt;# Download training data&lt;/span&gt;
make data

&lt;span class="c"&gt;# Train ghost-tiny on CPU&lt;/span&gt;
make train-tiny

&lt;span class="c"&gt;# Chat with the trained model&lt;/span&gt;
make chat

&lt;span class="c"&gt;# Run the web demo&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;gradio
python demo/app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'm actively looking for contributors. If you want to help with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Finding new cybersecurity datasets&lt;/li&gt;
&lt;li&gt;Implementing Flash Attention or RoPE&lt;/li&gt;
&lt;li&gt;Adding distributed training&lt;/li&gt;
&lt;li&gt;Writing documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check out &lt;a href="https://github.com/joemunene-by/GhostLM/blob/main/CONTRIBUTING.md" rel="noopener noreferrer"&gt;CONTRIBUTING.md&lt;/a&gt; and open a PR.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;I'm a 20-year-old computer science student in Nairobi, Kenya. I don't have access to massive compute clusters or research lab budgets. But I do have curiosity, persistence, and a belief that &lt;strong&gt;open-source AI shouldn't only come from well-funded labs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GhostLM is proof that you can build something meaningful from scratch with limited resources. The architecture is clean, the training pipeline works, and the model is learning. It's not going to replace GPT-4 — but it's a foundation that anyone can build on.&lt;/p&gt;

&lt;p&gt;If you found this interesting, star the repo, try it out, and let me know what you think. The best part of open source is that it gets better when more people are involved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/joemunene-by/GhostLM" rel="noopener noreferrer"&gt;https://github.com/joemunene-by/GhostLM&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;License:&lt;/strong&gt; Apache 2.0&lt;/p&gt;

&lt;p&gt;Built with ❤️ in Nairobi, Kenya 🇰🇪&lt;/p&gt;

</description>
      <category>cybersecurity</category>
      <category>llm</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
