Forem: Ian Johnson

Last post in this series! This post talks about how to apply the lessons learned and the agent harness to any stack, with examples of different popular web technologies.

Ian Johnson — Wed, 08 Apr 2026 18:23:35 +0000

Beyond Laravel: Applying the Agent Harness to Any Stack

#ai #productivity #tdd #programming

Ian Johnson — Wed, 08 Apr 2026 17:31:33 +0000

The Strategy Is the Point

This series followed a Laravel + React codebase. But if you've been reading for the strategy and not the syntax, you already know: none of this is Laravel-specific.

Tests before agents. Linting as machine-checkable standards. Clean architecture so the agent follows patterns instead of inventing them. Trunk-based development for fast feedback. Harness files that scope guidance to where the agent is working. Custom skills that turn your workflow into structure.

Every step has an equivalent in whatever stack you're using. The tools change. The progression doesn't.

The Seven Steps

Here's the agent harness approach distilled to its language-agnostic core. Each step builds on the ones before it. You cannot skip ahead: the entire system is load-bearing.

Step 1: Test Infrastructure

What you're doing: Wrapping the existing codebase in tests that run against real dependencies (the same database engine, the same cache, the same queue) so you have a machine-checkable safety net before the agent touches anything.

What matters:

Characterization tests first. Lock in what the code does before you change what it should do. These aren't aspirational tests. They're documentation of current behavior.
Real dependencies, not fakes. If production runs Postgres, your tests run Postgres. SQLite-in-memory is a lie that will catch up with you.
One command to run everything. make test, npm test, ./gradlew test — the agent needs a single entry point. If running tests requires tribal knowledge, the agent will get it wrong.
Test factories that are hard to misuse. Give the agent a discoverable API for creating test data. Fluent builders, factory patterns, fixtures with clear names. Design for the dumbest correct user, because that's how the agent will use it.
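A minimal sketch of what "hard to misuse" can look like, assuming a hypothetical `Order` object and `OrderBuilder` (neither comes from the series codebase). The builder starts from valid defaults, exposes a small set of named steps, and rejects impossible data instead of storing it.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Order:
    # Hypothetical domain object, used only for this illustration.
    customer_email: str
    status: str
    total_cents: int


class OrderBuilder:
    """Fluent builder with safe defaults, so a bare build() is always valid."""

    def __init__(self) -> None:
        self._order = Order(
            customer_email="customer@example.test",
            status="pending",
            total_cents=1000,
        )

    def approved(self) -> "OrderBuilder":
        self._order = replace(self._order, status="approved")
        return self

    def with_total(self, cents: int) -> "OrderBuilder":
        if cents < 0:
            # Fail loudly instead of silently creating impossible data.
            raise ValueError("total_cents must be non-negative")
        self._order = replace(self._order, total_cents=cents)
        return self

    def build(self) -> Order:
        return self._order


order = OrderBuilder().approved().with_total(2500).build()
```

The agent only has to discover three method names, and every chain it can write produces a valid object.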

Step 2: Linting and Static Analysis

What you're doing: Adding machine-checkable standards for code style, type safety, and structural quality. Each tool eliminates an entire category of wrong output from the agent.

What matters:

Format, lint, and type-check — all three. Formatting removes style arguments. Linting catches structural problems. Type checking catches logic errors. Together they narrow the space the agent can operate in.
One command to check everything. make lint, npm run lint, a Makefile target that runs the full stack. The agent runs this before every commit.
Pre-commit hooks that block and explain. The hook should fail with a message the agent can read and act on. "Run npx prettier --write . to fix" is better than "formatting error on line 47."
CI as the gate that cannot be skipped. Pre-commit hooks are the first check. CI is the final one. The agent cannot merge without green CI.
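Here is one way a "block and explain" hook could be sketched. The check list, the `Check` tuple, and the fix hints are assumptions for illustration; swap in whatever your stack runs. The point is that every failure message ends with something the agent can act on.

```python
import subprocess
import sys
from typing import Callable, List, NamedTuple


class Check(NamedTuple):
    name: str
    command: List[str]
    fix_hint: str


# Hypothetical check list; replace with your own stack's commands.
CHECKS = [
    Check("format", ["npx", "prettier", "--check", "."],
          "Run `npx prettier --write .` to fix."),
    Check("lint", ["npx", "eslint", "."],
          "Run `npx eslint --fix .`, then fix what remains by hand."),
    Check("types", ["npx", "tsc", "--noEmit"],
          "Fix the reported type errors; there is no auto-fix."),
]


def run_command(command: List[str]) -> int:
    """Run one check and return its exit code."""
    return subprocess.run(command).returncode


def run_checks(
    checks: List[Check],
    runner: Callable[[List[str]], int] = run_command,
) -> List[str]:
    """Run every check and return one actionable message per failure."""
    return [
        f"{check.name} failed. {check.fix_hint}"
        for check in checks
        if runner(check.command) != 0
    ]


def main() -> int:
    failures = run_checks(CHECKS)
    for message in failures:
        print(message, file=sys.stderr)
    # A non-zero exit code is what makes the hook block the commit.
    return 1 if failures else 0
```

A hook script would end with `sys.exit(main())`. The `runner` parameter exists so the message logic can be tested without invoking the real tools.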

Step 3: Architecture and Boundaries

What you're doing: Refactoring toward clean boundaries (interfaces, services, clear separation of concerns) so the agent can work within a bounded area without needing to understand the whole system.

What matters:

Contracts before implementations. Define interfaces first. The agent can implement an interface without understanding the rest of the system. It cannot safely modify a God class.
One responsibility per unit. Whether it's a service class, a module, a use case — the agent works best when each unit does one thing and the boundaries are obvious.
Architecture as documentation. If the codebase has a clear pattern (actions, services, repositories, commands), the agent follows it. If every file is a snowflake, the agent improvises. You don't want improvisation.
Small, safe steps. One extraction per PR. Keep the app running in production throughout. Never refactor and change behavior in the same commit.
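To make "contracts before implementations" concrete, here is a small sketch using a structural interface. `Notifier`, `RecordingNotifier`, and `notify_order_shipped` are hypothetical names, not part of the series codebase. The calling code depends only on the contract, which is the property that lets an agent work on one side of the boundary without reading the other.

```python
from typing import List, Protocol, Tuple


class Notifier(Protocol):
    """Contract: anything that can deliver a message to a recipient."""

    def send(self, recipient: str, message: str) -> None: ...


class RecordingNotifier:
    """One implementation of the contract; it records instead of sending.

    A production implementation would call a mail or webhook client here.
    """

    def __init__(self) -> None:
        self.sent: List[Tuple[str, str]] = []

    def send(self, recipient: str, message: str) -> None:
        self.sent.append((recipient, message))


def notify_order_shipped(notifier: Notifier, customer_email: str) -> None:
    # Depends only on the contract, never on a concrete class.
    notifier.send(customer_email, "Your order has shipped.")


notifier = RecordingNotifier()
notify_order_shipped(notifier, "customer@example.test")
```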

Step 4: Explicit Patterns for Business Logic

What you're doing: Establishing the patterns the agent should follow for new work: how business logic is structured, how authorization works, and how data flows through the system.

What matters:

A single pattern for business logic. Actions, use cases, commands, interactors — the name doesn't matter. What matters is that there's one pattern, it's consistent, and the agent can see ten examples in the codebase.
Centralized authorization. Scattered permission checks are a security risk with human developers. With an agent, they're a guarantee of inconsistency. Use your framework's policy/guard/permission system.
Typed inputs and outputs. Form objects, request validators, result types, DTOs — whatever your stack calls them. The agent needs to know what goes in and what comes out.
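The three points above can be combined in one small sketch. Everything here (`ArchiveProject`, `ProjectPolicy`, the input and result types) is a hypothetical example rather than code from the series: one action class, one policy that owns the permission decision, typed data in and out.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ArchiveProjectInput:
    project_id: int
    actor_id: int


@dataclass(frozen=True)
class ArchiveProjectResult:
    project_id: int
    archived: bool


class NotAuthorized(Exception):
    pass


class ProjectPolicy:
    """The single place that answers 'may this actor archive this project?'"""

    def __init__(self, owners: Dict[int, int]) -> None:
        self._owners = owners  # project_id -> owner_id

    def can_archive(self, actor_id: int, project_id: int) -> bool:
        return self._owners.get(project_id) == actor_id


class ArchiveProject:
    """One action, one responsibility: typed input in, typed result out."""

    def __init__(self, policy: ProjectPolicy) -> None:
        self._policy = policy

    def execute(self, data: ArchiveProjectInput) -> ArchiveProjectResult:
        if not self._policy.can_archive(data.actor_id, data.project_id):
            raise NotAuthorized(
                f"actor {data.actor_id} cannot archive project {data.project_id}"
            )
        # Persistence is omitted; a real action would update the database here.
        return ArchiveProjectResult(project_id=data.project_id, archived=True)


action = ArchiveProject(ProjectPolicy(owners={1: 42}))
result = action.execute(ArchiveProjectInput(project_id=1, actor_id=42))
```

With ten classes shaped like this in the codebase, the eleventh is a fill-in-the-blanks exercise for the agent.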

Step 5: Migration Strategy (If Applicable)

What you're doing: Running old and new in parallel while you migrate a frontend, database, or major subsystem incrementally, never as a big-bang rewrite.

What matters:

Both systems run simultaneously. The old system serves production. The new system is gated behind environment flags or feature toggles until proven.
Page by page, feature by feature. Each migration is a small PR. Each small PR goes through the full test/lint/CI pipeline.
Clear scoping rules. The agent needs to know: does this work go in the old system, the new system, or both? Make the rules explicit in the harness.
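A rough sketch of page-by-page gating, assuming a hypothetical `NEW_FRONTEND_PAGES` environment variable that lists the migrated pages. The old system stays the default, and a page only moves when someone adds it to the flag.

```python
import os
from typing import Mapping, Optional


def new_frontend_enabled(page: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when `page` is listed in the hypothetical NEW_FRONTEND_PAGES flag."""
    source = os.environ if env is None else env
    raw = source.get("NEW_FRONTEND_PAGES", "")
    enabled = {name.strip() for name in raw.split(",") if name.strip()}
    return page in enabled


def render(page: str, env: Optional[Mapping[str, str]] = None) -> str:
    # The old system stays the default until a page is explicitly migrated.
    if new_frontend_enabled(page, env):
        return f"new:{page}"
    return f"legacy:{page}"
```

For example, `render("dashboard", {"NEW_FRONTEND_PAGES": "dashboard, orders"})` returns `"new:dashboard"`, while any page not in the list falls through to the legacy path.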

Step 6: Trunk-Based Development and CI/CD

What you're doing: Establishing the delivery cadence that makes AI-assisted development practical: short-lived branches, small PRs, fast CI, and automated deployment.

What matters:

Branches live for hours, not days. The longer a branch lives, the more the agent's assumptions go stale. Small batches, fast merges.
CI runs the full pipeline. Build, lint, type-check, test, deploy. If any step fails, the PR doesn't merge.
Conventional commits. A machine-readable commit history helps the agent understand what changed and why. It also helps you when you're reviewing 145 PRs in three months.
Automated deployment. Push to main, deploy to staging. The feedback loop from code change to running software should be minutes, not hours.
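"Machine-readable" is meant literally. As a sketch, the subject-line shape from the Conventional Commits specification (`type(scope)!: description`) fits in one regular expression. The `Commit` tuple and `parse_commit_subject` name are my own for this example, and the pattern only covers the subject line, not the body or footers.

```python
import re
from typing import NamedTuple, Optional


class Commit(NamedTuple):
    type: str
    scope: Optional[str]
    breaking: bool
    description: str


# Subject line shape: type(scope)!: description
HEADER = re.compile(
    r"^(?P<type>[a-z]+)(\((?P<scope>[^)]+)\))?(?P<breaking>!)?: (?P<description>.+)$"
)


def parse_commit_subject(subject: str) -> Optional[Commit]:
    """Parse a commit subject line; return None if it is not conventional."""
    match = HEADER.match(subject)
    if match is None:
        return None
    return Commit(
        type=match["type"],
        scope=match["scope"],  # None when no scope was given
        breaking=match["breaking"] is not None,
        description=match["description"],
    )
```

A CI step or commit-msg hook could reject any subject for which this returns `None`.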

Step 7: The Harness and Skills

What you're doing: Writing scoped guidance files that tell the agent how to work in each area of the codebase, then codifying your workflow into repeatable skills.

What matters:

Scoped guidance, not one big file. One harness file per major area. The agent loads what's relevant to where it's working. Keep the signal-to-noise ratio high.
Patterns with examples, not just rules. Show the agent a code example of the pattern you want. "Do it like this" beats "follow these principles" every time.
Anti-patterns are explicit fences. Tell the agent what not to do. "Never put HTTP concerns in an Action" is more useful than "keep Actions pure."
The feedback protocol. When the agent drifts, ask: is this a harness gap? If yes, update the harness first, then re-apply. Corrections become permanent rules.
Skills codify the sequence. Automate the ceremony (read ticket, write tests, implement, lint, commit, push, PR). Keep the judgment calls at checkpoints.
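To picture what "scoped guidance" means, here is a sketch that collects the harness files relevant to one target file by walking from the repo root down to the file's directory. This illustrates the idea only; it is not how Claude Code itself locates or loads files, and `harness_files_for` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def harness_files_for(target: Path, repo_root: Path, name: str = "CLAUDE.md") -> List[Path]:
    """Collect harness files from the repo root down to the target's directory."""
    root = repo_root.resolve()
    directory = target.resolve().parent
    if root != directory and root not in directory.parents:
        raise ValueError(f"{target} is outside {repo_root}")

    # Walk upward from the target's directory until we reach the root.
    chain = [directory]
    while chain[-1] != root:
        chain.append(chain[-1].parent)

    # Root first, most specific last, so local guidance is read after general guidance.
    return [d / name for d in reversed(chain) if (d / name).is_file()]
```

Directories without a harness file are simply skipped, which is what keeps the guidance proportional to where the work is happening.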

The Stack Table

Here's how each step maps to tools across popular web framework stacks. Each table covers one step. The rows are the stacks. The columns are the concerns within that step. Every cell answers: "What would I use here?"

Test Infrastructure

	Runner	DB Strategy	Factories / Fixtures	One Command
Laravel (PHP)	PHPUnit / Pest	MySQL in Docker (tmpfs)	Model Factories	`php artisan test`
Rails (Ruby)	RSpec / Minitest	Postgres in Docker	FactoryBot	`bundle exec rspec`
Django (Python)	pytest-django	Postgres in Docker	factory_boy / Model Bakery	`python -m pytest`
Next.js (TypeScript)	Vitest / Jest	Postgres via Testcontainers	Prisma seed scripts / custom builders	`npm test`
Spring Boot (Java)	JUnit 5	Testcontainers (Postgres/MySQL)	TestEntityManager / custom builders	`./gradlew test`
ASP.NET (C#)	xUnit / NUnit	Testcontainers or LocalDB	Bogus + custom builders	`dotnet test`
Go	`testing` + testify	Testcontainers or dockertest	Custom factory functions	`go test ./...`
Phoenix (Elixir)	ExUnit	Postgres sandbox	`ex_machina`	`mix test`

Recommendation: Wrap your test command in a make test target. It gives the agent (and your team) a single, stack-agnostic entry point that hides flags, environment setup, and Docker orchestration behind one command. When every project starts with make test, nobody needs to remember whether it's php artisan test, go test ./..., or bundle exec rspec.

Linting and Static Analysis

	Formatter	Linter	Type Checker	One Command
Laravel (PHP)	Pint	Psalm / PHPStan	Psalm (level)	`./vendor/bin/pint --test && ./vendor/bin/phpstan`
Rails (Ruby)	RuboCop (formatting)	RuboCop (style/lint)	Sorbet / Steep	`bundle exec rubocop`
Django (Python)	Black / Ruff format	Ruff / Flake8	mypy / pyright	`ruff check . && mypy .`
Next.js (TypeScript)	Prettier	ESLint	TypeScript (`tsc --noEmit`)	`npm run lint && npx tsc --noEmit`
Spring Boot (Java)	google-java-format / Spotless	Checkstyle / SpotBugs	javac (compile-time)	`./gradlew check`
ASP.NET (C#)	dotnet format	Roslyn analyzers / StyleCop	C# compiler + nullable refs	`dotnet format --verify-no-changes`
Go	`gofmt` / `goimports`	`golangci-lint`	Go compiler	`golangci-lint run`
Phoenix (Elixir)	`mix format`	Credo	Dialyxir	`mix format --check-formatted && mix credo && mix dialyzer`

Recommendation: Wrap your lint pipeline in a make lint target. Most stacks need multiple tools chained together — formatter, linter, type checker — and the exact flags change over time. A make lint target keeps the agent from needing to know whether your project runs ruff check . && mypy . or golangci-lint run. One target, full coverage, zero tribal knowledge.

Architecture Patterns

	Service Layer	Business Logic Unit	Authorization	Request Validation
Laravel (PHP)	Service classes + contracts	Action classes	Policies	Form Requests
Rails (Ruby)	Service objects / POROs	Command / Interactor	Pundit / Action Policy	Strong Parameters + dry-validation
Django (Python)	Service layer (manual)	Service functions / Command pattern	django-rules / permissions	Serializers / Pydantic
Next.js (TypeScript)	Server actions / service modules	Use case functions	Middleware + CASL / next-auth	Zod schemas
Spring Boot (Java)	`@Service` classes	`@Service` or Command pattern	Spring Security + `@PreAuthorize`	`@Valid` + Bean Validation
ASP.NET (C#)	Service classes via DI	MediatR handlers / Command pattern	Authorization policies + `[Authorize]`	FluentValidation
Go	Package-level service structs	Handler / Use case functions	Middleware + Casbin	Struct validation (go-playground)
Phoenix (Elixir)	Context modules	Context functions / Command pattern	Bodyguard	Ecto changesets

CI/CD and Delivery

	CI Platform	Deploy Tool	Branch Strategy	Commit Convention
Laravel (PHP)	GitHub Actions	Forge / Envoyer	Trunk-based, short-lived branches	Conventional Commits
Rails (Ruby)	GitHub Actions / CircleCI	Kamal / Capistrano / Heroku	Trunk-based, short-lived branches	Conventional Commits
Django (Python)	GitHub Actions / GitLab CI	Gunicorn + systemd / Docker + ECS	Trunk-based, short-lived branches	Conventional Commits
Next.js (TypeScript)	GitHub Actions / Vercel CI	Vercel / Docker + ECS	Trunk-based, short-lived branches	Conventional Commits
Spring Boot (Java)	GitHub Actions / Jenkins	Docker + Kubernetes / AWS ECS	Trunk-based, short-lived branches	Conventional Commits
ASP.NET (C#)	GitHub Actions / Azure DevOps	Azure App Service / Docker + ECS	Trunk-based, short-lived branches	Conventional Commits
Go	GitHub Actions	Docker + Kubernetes / systemd	Trunk-based, short-lived branches	Conventional Commits
Phoenix (Elixir)	GitHub Actions	Fly.io / Docker + release	Trunk-based, short-lived branches	Conventional Commits

Harness and Skills

	Harness Format	Scoped Files	Skill Definition	Agent Tool
All Stacks	Markdown (CLAUDE.md)	One per major directory	`.claude/skills/<skill-name>/SKILL.md`	Claude Code

The harness and skills layer is entirely stack-agnostic. It's Markdown files in your repo. The content changes (your patterns, your anti-patterns, your architectural rules) but the mechanism is the same regardless of language.

Where to Start

If you're looking at this table and wondering where to begin, here's the priority order:

1. Tests. If you have nothing else, start here. Get a test runner working against real dependencies with a single command. Write characterization tests for the most critical paths. This alone makes the agent dramatically safer.

2. Linting. Add a formatter and a linter. Wire them into a pre-commit hook. This takes an afternoon and eliminates an entire category of bad output.

3. CI. Connect your test and lint commands to your CI platform. Make it block merges. Now the agent cannot ship broken code even if it tries.

4. Architecture. This is the long game. You don't need perfect architecture to start using an agent. But every boundary you create, every interface you extract, every consistent pattern you establish makes the agent more reliable in that area.

5. Harness files. Start with a root CLAUDE.md that describes the project, the tech stack, and the top-level patterns. Add subdirectory files as you notice the agent drifting in specific areas.

6. Skills. Only after everything else is working. Skills automate a workflow that already works manually. If the underlying steps aren't solid, automating them just produces bad output faster.

The Pattern Behind the Pattern

Every step in this series followed the same logic:

Identify a category of error the agent can make.
Add a machine-checkable constraint that eliminates it.
Give the agent a single command to verify compliance.
Update the harness when a new failure mode appears.

Tests eliminate behavioral errors. Linting eliminates structural errors. Architecture eliminates design errors. CI eliminates delivery errors. The harness eliminates context errors. Skills eliminate process errors.

The tools in the table will change. New frameworks will appear. New linters will ship. New CI platforms will launch. But this progression (constrain, verify, scope, automate) will remain the same, because it's not about the tools. It's about narrowing the space where the agent can be wrong until the only thing left is the judgment calls that require a human.

That's the harness. Build it in whatever language you ship.

The Takeaway

The strategy is language-agnostic. Tests, linting, architecture, CI, harness, skills — every stack has equivalents. The progression is what matters.
Start with tests and linting. These two steps alone eliminate more bad agent output than any amount of prompt engineering.
Architecture is a force multiplier. Clear patterns make the agent predictable. Unclear patterns make it creative. You don't want creative.
The harness content is yours. The table gives you the tools, but the rules inside your harness files come from your engineering judgment about your codebase.
Constrain, verify, scope, automate. That's the whole series in four words.

The agent didn't get smarter across the previous ten posts. The environment got smarter. That's the insight that generalizes to every stack, every language, and every team. Build the environment, and the agent follows.

Custom Skills: The End-to-End Workflow Made Executable

Ian Johnson — Wed, 08 Apr 2026 16:53:17 +0000

I Was Repeating Myself

By the time the harness was working and I'd moved to on-the-loop development, my sessions with Claude had a rhythm. Pick up a Jira ticket. Read the requirements. Decide which part of the codebase it touches. Write failing tests. Get them approved. Implement. Run lint and tests. Commit. Open a PR. Watch CI. Review the diff. Maybe refactor.

Every time, I typed the same instructions. "Here's a Jira ticket. Pull the requirements with jira issue view. Write tests first. Follow the Action pattern. Run make lint and make test before committing."

It worked. But I was the ceremony. I was the one remembering the steps, enforcing the order, making sure the harness feedback loop happened. If I forgot to say "write tests first," Claude might skip straight to implementation. If I forgot to say "run lint," it might commit without checking.

The workflow was good. But it lived in my head, not in the system.

Skills: Slash Commands for Claude Code

Claude Code supports custom skills: reusable prompts you invoke with a slash command. They live in .claude/skills/ as Markdown files with a bit of frontmatter. When you type /implement-jira-card PROJ-123, Claude reads the skill definition and executes the workflow described in it.

The skill file is just a SKILL.md:

.claude/skills/
  implement-jira-card/
    SKILL.md
  implement-change/
    SKILL.md

Each skill defines the argument it expects, describes the workflow in phases, and lists the rules the agent must follow. It's the same kind of guidance as a CLAUDE.md harness file (plain Markdown, checked into the repo, version-controlled) except instead of scoping guidance to a directory, it scopes guidance to a workflow.

The Two Skills

I built two skills that cover the two ways work enters the system.

/implement-jira-card [PROJ-123] — for work that starts with a Jira ticket. The skill pulls the issue details from Jira, walks through requirements, planning, TDD, and delivery. It knows how to use jira for issue details and gh for GitHub operations.

/implement-change [description] — for everything else. Bug fixes that don't have a ticket. Follow-up tasks from code review. Small improvements. Ad-hoc work. Same workflow, but the requirements come from the user's description instead of Jira.

Both skills follow the same eight-phase workflow. The difference is where the requirements come from — Jira or the user's own words.

What the Skill Needs from Jira

The /implement-jira-card skill pulls issue details with jira issue view, but not every field matters equally. The agent needs specific information to draft a requirements plan, and our Jira structure is set up to provide it.

Epics are projects. An epic groups all the tasks for a single initiative. When the agent reads a task, the parent epic gives it the broader goal: why this work exists, what it's part of, and what other tasks sit alongside it. Without that context, the agent treats every task as isolated. With it, the agent understands where the task fits and can make better scoping decisions.

Tasks are implementation units. Each task maps to one piece of work and typically produces one PR. We don't use stories. A task is specific enough that the agent can read it and know what to build, what to test, and when it's done.

Three fields on each task do the heavy lifting:

Description — the problem statement and context. This tells the agent what needs to change and why. A good description includes enough domain context that the agent doesn't invent assumptions. "Users should not be able to approve their own orders" is better than "fix approval logic." The description feeds directly into Phase 1's requirements document.

Acceptance criteria — the conditions that define done. These translate almost directly into test cases during Phase 4. "Given a user who created an order, when they attempt to approve it, then the request should be rejected with a 403" becomes a failing test before any implementation exists. The more precise the acceptance criteria, the better the test coverage. Vague criteria produce vague tests.

Screenshots and attachments — the visual reference. For UI work, screenshots show what the result should look like. The agent uses these during implementation to match layout, placement, and content. During Phase 5, if the skill runs visual verification through Puppeteer, the screenshots serve as the expected state.

The skill pulls all three, drafts a requirements plan from them, and asks for feedback before moving on. If the card is thin (missing acceptance criteria, vague description) that becomes obvious at the first checkpoint. I either flesh out the card or fill in the gaps in conversation. Either way, the agent doesn't start writing code until the requirements are clear.
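As an illustration of how directly a Given/When/Then criterion maps to a test, here is the self-approval example written out. `Order` and `approve_order` are hypothetical stand-ins. In the real workflow the test is written first and fails; the handler is included here only so the sketch runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Order:
    id: int
    created_by: int
    status: str = "pending"


def approve_order(order: Order, approver_id: int) -> int:
    """Hypothetical handler returning an HTTP-style status code."""
    if order.created_by == approver_id:
        return 403  # creators may not approve their own orders
    order.status = "approved"
    return 200


def test_user_cannot_approve_their_own_order() -> None:
    # Given a user who created an order
    order = Order(id=1, created_by=7)
    # When they attempt to approve it
    status = approve_order(order, approver_id=7)
    # Then the request is rejected with a 403
    assert status == 403
    assert order.status == "pending"
```

Each clause of the criterion becomes one line of the test, which is why vague criteria leave nothing to write.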

This is why card quality matters more with an agent than without one. A developer can fill in gaps from tribal knowledge and hallway conversations. An agent works with what the card gives it. Good cards produce good requirements plans. Bad cards produce a longer Phase 1 conversation, or worse, confident code that solves the wrong problem.

The Eight Phases

Here's the full workflow, phase by phase, as I actually experience it.

Phase 0: Scope the Target

The first thing the skill does is ask a question: does this change apply to the legacy application (Blade CRUD), the React SPA, or both?

This matters because the project is in a transitional state. The legacy app is typical CRUD with Blade views. The SPA is event-driven and personalized. They have different controllers, different test patterns, different harness files. Getting this wrong means the agent writes code in the wrong layer.

I answer "SPA" or "legacy" or "both," and the skill knows which harness files to consult, which layers to touch, and how to test.

Phase 1: Requirements

For /implement-jira-card, the skill runs jira issue view PROJ-123 to pull the issue details. For /implement-change, it asks me to describe the problem if the argument wasn't clear enough.

Then it creates a requirements document: the problem, acceptance criteria, and scope. And it asks for feedback.

This is the first checkpoint. I read the requirements. If something's off, I correct it. If my feedback is about code quality or patterns (for example, "we don't do it that way; use the notification service instead of calling the webhook directly"), the skill does something specific: it updates the relevant CLAUDE.md harness file first, reloads it, and then revises the requirements.

That's the harness feedback loop, baked into the workflow. My correction doesn't just fix this instance. It fixes all future instances.

Phase 2: Implementation Plan

The skill creates an implementation plan: files to change, new files to create, and the testing strategy.

Another feedback checkpoint. Same rules: if my feedback is about patterns, the harness gets updated before the plan gets revised.

This phase catches architectural missteps early. If Claude plans to put business logic in a controller instead of an Action, I catch it here, not after 200 lines of implementation.

Phase 3: Branch Setup

For Jira cards, the branch is prefixed with the issue key: PROJ-123-short-description. For ad-hoc changes, it's descriptive: fix-order-approval or followup-ticket-validation.

The branch is always created from the latest origin/main. No stale branches.
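The naming rule is simple enough to sketch. `branch_name` is a hypothetical helper, and the five-word cap is my assumption to keep names short, not a rule from the skill.

```python
import re
from typing import Optional


def branch_name(description: str, issue_key: Optional[str] = None, max_words: int = 5) -> str:
    """Build `PROJ-123-short-description` or, without a key, `short-description`."""
    # Lowercase, keep only letters and digits, and cap the number of words.
    words = re.findall(r"[a-z0-9]+", description.lower())[:max_words]
    if not words:
        raise ValueError("description must contain at least one letter or digit")
    slug = "-".join(words)
    return f"{issue_key}-{slug}" if issue_key else slug
```

`branch_name("Order approval dashboard", "PROJ-456")` gives `PROJ-456-order-approval-dashboard`, and `branch_name("Fix order approval")` gives `fix-order-approval`. One way to create the branch from the latest main is `git fetch origin` followed by `git switch -c <name> origin/main`.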

Phase 4: TDD Implementation

This is the core of the workflow, and it's where TDD stops being a philosophy and becomes a protocol.

The skill writes failing tests first. PHP tests, JavaScript tests, or both — whatever the change requires. Then it presents me with a list of test descriptions and asks for feedback.

This is the second critical checkpoint. I'm reviewing what the code should do before any implementation exists. The test descriptions are the spec. If they're wrong, the implementation will be wrong no matter how clean it is.

Once I approve the tests, the skill implements the smallest changes to make them pass. It presents a description of what changed. Another checkpoint.

If at any point my feedback references patterns or code quality, the harness gets updated first, then the tests or implementation get regenerated. The ratchet turns.

Phase 5: Change Approval and Commit

The skill shows a diff of all changes and asks for approval. If I'm happy, it runs the full pre-commit checks:

make lint      # Pint, Psalm, Prettier, ESLint, TypeScript
make test      # 2,700+ PHP tests
make test-js   # Vitest suite

If the change has UI, it runs make migrate-e2e and make seed-e2e, then uses the Puppeteer MCP server to visually verify the page. It asks whether to run the full E2E suite.

Everything passes? Commit, push, and create a PR (or update the existing one).

Anything fails? Fix it first. No exceptions.

Phase 6: CI and Code Review

The skill watches the GitHub Actions workflow in the background and fixes any CI failures. Then it runs a code review of its own changes looking for logic errors, security issues, and pattern violations.

It presents all findings at once and asks which ones to address. For each approved finding, same rules: if it's a pattern issue, update the harness first. For denied findings, skip them.

This is the agent reviewing its own work. It catches things the automated checks miss: subtle logic issues, naming inconsistencies, missing edge cases. I decide which findings are worth addressing and which are noise.

Phase 7: Refactoring

After the feature works and the code review is addressed, the skill identifies refactoring opportunities and security concerns. Same presentation format: all suggestions at once, I pick which to address.

This is where the rule of three applies. If the skill identifies duplicated code that's only duplicated twice, I skip it. If it's the third time, we refactor.

Refactoring commits are separate from feature commits. The skill follows conventional commits: feat: for the feature, refactor: for the cleanup. Separate concerns, separate commits.

Phase 8: Done

CI is green. PR is up. Code review is addressed. Refactoring is committed.

The skill confirms everything passed and asks if there's follow-up work. If there is, we start the next skill invocation. If not, we're done.

What This Looks Like in Practice

A real session:

me:     /implement-jira-card PROJ-456
claude: [reads Jira issue — "Add order approval workflow to admin dashboard"]
claude: Does this apply to the legacy app, the SPA, or both?
me:     SPA
claude: [creates requirements document]
claude: Here are the requirements. Feedback?
me:     Looks good.
claude: [creates implementation plan — new API endpoint, React component, tests]
claude: Here's the plan. Feedback?
me:     Use the existing DashboardDataService instead of a new service.
claude: [updates app/Services/CLAUDE.md to note DashboardDataService as the
         preferred service for dashboard data, revises plan]
claude: Updated plan. Feedback?
me:     Good.
claude: [creates branch PROJ-456-order-approval-dashboard]
claude: [writes failing tests]
claude: Here are the test descriptions:
        - admin can view pending orders on dashboard
        - order approval updates status and records approver
        - unauthenticated user gets 401
        - user without dashboard permission gets 403
        - user cannot approve their own order
        Feedback?
me:     Add a test for orders that have already been approved.
claude: [adds test, presents updated list]
me:     Good.
claude: [implements — API endpoint, React component, service method]
claude: Here's what changed. Feedback?
me:     Looks right.
claude: [shows diff]
me:     Ship it.
claude: [runs make lint, make test, make test-js — all pass]
claude: [commits, pushes, creates PR]
claude: [watches CI — green]
claude: [runs code review — 2 findings]
claude: Finding 1: The order type enum could use a TypeScript const assertion.
        Finding 2: Missing null check on allocation end date.
        Which to address?
me:     Both.
claude: [fixes both, commits, pushes]
claude: All green. Anything else?
me:     Nope.

That's the whole thing. From Jira ticket to merged PR. I typed maybe 30 words of direction. The skill handled the ceremony.

Why This Works

Consistency

Without the skill, the workflow depends on me remembering the steps. Some sessions I'd forget to ask for E2E verification. Some sessions I'd skip the code review phase because I was in a hurry. The skill doesn't forget. It runs the same eight phases every time.

Harness Feedback Is Built In

The most important line in both skill files:

If feedback references code quality or patterns, update the relevant CLAUDE.md harness file FIRST, reload it, then apply.

This rule appears at every feedback checkpoint. It means the harness feedback loop isn't something I have to remember to invoke; it's structural. Every piece of feedback I give either confirms the harness is working or improves it. The skill enforces this.

TDD Is Non-Negotiable

The skill writes tests before implementation. Not as a suggestion. As a phase that happens before the implementation phase exists. There's no way to skip it without aborting the skill entirely.

This is important because TDD is easy to skip when you're in a hurry. "I'll write the tests after" is the most common lie in software engineering. The skill makes the lie structurally impossible.

The Agent Reviews Its Own Work

Phases 6 and 7, code review and refactoring, are the agent auditing itself. It's not perfect. It misses things. But it catches enough that my review time drops significantly. And I get to choose which findings to act on, so it never runs away with unnecessary changes.

Separation of Concerns in Commits

The skill produces separate commits for features, fixes, and refactoring. This isn't cosmetic. When something breaks in production and you're scanning git log, the difference between feat: add order approval to dashboard and refactor: extract order calculation helper tells you instantly which commit to investigate.

The Skill File Anatomy

Both skills are plain Markdown with YAML frontmatter:

---
name: implement-jira-card
description: Analyze and implement a Jira issue using TDD — from requirements through to a PR
argument-hint: "[Jira issue key, e.g. PROJ-123]"
---

Implement the Jira issue: $ARGUMENTS

## Workflow

### Phase 0: Scope the Target
...

The $ARGUMENTS placeholder gets replaced with whatever you pass after the slash command. The description shows up in Claude Code's skill list. The argument hint tells you what to pass.
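To show how little machinery the format needs, here is a sketch that splits a skill file into frontmatter and body, then substitutes the argument. This is an illustration of the file format, not Claude Code's implementation. `parse_skill` and `render_prompt` are hypothetical names, and the parser only handles flat `key: value` lines.

```python
from typing import Dict, Tuple

SKILL_TEXT = """\
---
name: implement-jira-card
argument-hint: "[Jira issue key, e.g. PROJ-123]"
---

Implement the Jira issue: $ARGUMENTS
"""


def parse_skill(text: str) -> Tuple[Dict[str, str], str]:
    """Split a skill file into (frontmatter fields, prompt body)."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text.strip()  # no frontmatter at all
    fields: Dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter: everything after it is the prompt body.
            return fields, "\n".join(lines[index + 1:]).strip()
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip().strip('"')
    raise ValueError("frontmatter is missing its closing '---'")


def render_prompt(body: str, arguments: str) -> str:
    """Replace the $ARGUMENTS placeholder with the text after the slash command."""
    return body.replace("$ARGUMENTS", arguments)


fields, body = parse_skill(SKILL_TEXT)
prompt = render_prompt(body, "PROJ-123")
```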

The workflow section is the actual prompt the agent follows. It's specific, phased, and full of checkpoints. The key rules section at the bottom handles edge cases and priorities.

That's it. No plugin system. No SDK. No custom tooling. A Markdown file with a workflow written in plain English.

`/implement-jira-card` vs `/implement-change`

The two skills are nearly identical. The differences:

	`/implement-jira-card`	`/implement-change`
Input	Jira issue key	Text description
Requirements source	`jira issue view`	User conversation
Branch naming	`PROJ-123-description`	`fix-description` or `followup-description`
Tools	`jira` + `gh`	`gh` only

Everything else — the phases, the checkpoints, the harness feedback loop, the TDD workflow, the CI watching, the code review — is the same.

I considered making one skill with a flag, but two separate skills is clearer. When I type /implement-jira-card, I know I'm starting from a ticket. When I type /implement-change, I know I'm describing the work myself. The intent is obvious from the command.

The Feedback Checkpoints

Count the checkpoints in a single skill run:

Scope the target (legacy, SPA, or both)
Requirements review
Implementation plan review
Test descriptions review
Implementation review
Diff review
Code review findings — which to address
Refactoring suggestions — which to address

Eight checkpoints. Eight moments where I'm on-the-loop: reviewing output, giving direction, and making judgment calls. Between those checkpoints, the agent operates autonomously. It writes code, runs tests, fixes failures, manages git, and creates PRs without asking.

This is the on-the-loop workflow from post 7, made concrete and repeatable. I'm not directing input. I'm reviewing output at predetermined checkpoints.

What the Skills Don't Do

The skills don't replace judgment. They automate ceremony.

I still decide what to build. I still decide which layer it belongs in. I still review test descriptions to make sure they capture the right behavior. I still read the diff. I still choose which code review findings matter.

The skills handle the sequence: read the ticket, write tests first, implement, run checks, commit, push, create PR, watch CI, review, refactor. That sequence is the same every time. It doesn't need my attention. The judgment calls at each checkpoint do.

Building Your Own

If you want to build skills for your project:

Start by noticing repetition. What instructions do you give the agent every session? That's your first skill.
Define the phases. What's the sequence? Where are the checkpoints?
Build in the harness feedback loop. Every checkpoint should have the rule: if feedback is about patterns, update the harness first, then re-apply.
Make TDD structural. Tests before implementation. Not as guidance, but as a phase that must complete before the next phase starts.
Include self-review. Have the agent audit its own work before you see it.
Keep it simple. A Markdown file. Plain English. No tooling.

The skill doesn't need to be clever. It needs to be consistent. The value isn't in any single phase; it's in the guarantee that all eight phases happen, in order, every time.

The Takeaway

Custom skills are the on-the-loop workflow made executable:

Codify your workflow, not just your patterns. The harness tells the agent how to write code. Skills tell it when to write code, when to test, when to ask for feedback.
Every feedback checkpoint is a harness improvement opportunity. The skill enforces this. Corrections become rules before they become code.
TDD as a phase, not a preference. The skill makes it structural. Tests come first because that's what Phase 4 says.
Separate the ceremony from the judgment. Automate the sequence. Keep the checkpoints.
Two skills cover most work. Jira ticket or ad-hoc description. Everything else is the same workflow.

The skills turned a workflow I was repeating from memory into a workflow the system enforces. Same eight phases. Same checkpoints. Same harness feedback loop. Every time, without fail.

That's not a minor convenience. That's the difference between a process that depends on discipline and a process that depends on structure. Structure scales. Discipline doesn't.

The Curator's Role: Managing a Codebase With an Agent

Ian Johnson — Wed, 08 Apr 2026 16:01:46 +0000

The Simplest Thing That Could Work

There's a temptation, when you decide to "use AI for software development," to build something complicated. A custom orchestration layer. A RAG pipeline over your codebase. A fine-tuned model trained on your conventions. A plugin ecosystem.

I used Markdown files.

The entire agent harness for this project is plain Markdown, checked into the repo, loaded automatically by Claude Code based on which directory you're working in. No custom tooling. No infrastructure. No maintenance burden.

CLAUDE.md                           ← Root guidance
app/Actions/CLAUDE.md               ← Action patterns
app/Services/CLAUDE.md              ← Service patterns
tests/CLAUDE.md                     ← Test patterns
resources/js/spa/CLAUDE.md          ← SPA patterns
...9 files total

That's it. Nine Markdown files. The agent reads them, follows them, and produces code that matches the project's conventions.

I want to be very clear about this because the industry is drowning in complexity around AI tooling: the simplest approach worked. Not as a starting point. Not as a minimum viable product. As the actual, final, production approach that I use every day.

When in doubt, choose the boring solution.

Guardrails First. Always.

If there's one message in this entire series, it's this: do the guardrails work up front.

I know. It's not the fun part. Writing tests for existing code is tedious. Setting up linting is yak-shaving. CI pipelines are plumbing. Nobody gets excited about a pre-commit hook.

But here's what happens when you skip the guardrails and go straight to "AI agent writes my code":

The agent writes code that looks correct
You deploy it
Something breaks in production
You debug it
You realize the agent made an assumption that your tests would have caught — if you had tests
You fix the bug and add a test
You repeat this for every bug

That's the expensive path. You're paying the cost of guardrails anyway, but you're paying it in production bugs, debugging time, and lost confidence. You're paying retail instead of wholesale.

The cheap path:

Write tests
Add linting
Set up CI
Establish patterns
Build the harness
Then let the agent write code

Now the agent's output is verified before it ships. Tests catch behavioral bugs. Linting catches structural issues. CI catches everything else. The harness guides the agent toward correct patterns from the start.

Every hour spent on guardrails saves ten hours of debugging. I can't prove that number, but I believe it in my bones after watching this codebase evolve.

The Compound Effect of Simple Rules

Each guardrail is simple:

Tests: does the code do what it should?
Pint: is the PHP formatted consistently?
Psalm: are the types correct?
Prettier: is the JS/CSS formatted?
ESLint: are the React patterns correct?
TypeScript: are the frontend types sound?
Pre-commit: did we check before committing?
CI: did everything pass together?
Harness: did we follow the project's patterns?

None of these are individually impressive. But together they create a narrowing funnel. The space of "code the agent could produce" starts enormous. Each guardrail eliminates a category of wrong answers. By the time code passes all of them, the remaining space is almost entirely correct code.

This is why the approach scales. I didn't build a sophisticated AI system. I built a series of simple filters. The AI writes whatever it writes, and the filters ensure only good code survives.
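The funnel idea can be sketched as a chain of filters. This is a toy model under stated assumptions: `Candidate`, `Guardrail`, and `funnel` are hypothetical, and the boolean fields stand in for real tool runs (tests, linters, type checkers) that this snippet does not execute.

```typescript
// A candidate change, reduced to the facts each (hypothetical) check inspects.
interface Candidate {
  id: string;
  testsPass: boolean;
  lintClean: boolean;
  typesSound: boolean;
  followsHarness: boolean;
}

interface Guardrail {
  label: string;
  accepts: (c: Candidate) => boolean;
}

interface FunnelReport {
  survivors: Candidate[];
  counts: number[]; // survivors remaining after each guardrail, in order
}

const guardrails: Guardrail[] = [
  { label: "tests", accepts: (c) => c.testsPass },
  { label: "lint", accepts: (c) => c.lintClean },
  { label: "types", accepts: (c) => c.typesSound },
  { label: "harness", accepts: (c) => c.followsHarness },
];

// Apply each guardrail in order; each one removes a category of wrong answers.
function funnel(candidates: Candidate[], rails: Guardrail[]): FunnelReport {
  const counts: number[] = [];
  let survivors = candidates;
  for (const rail of rails) {
    survivors = survivors.filter((c) => rail.accepts(c));
    counts.push(survivors.length);
  }
  return { survivors: survivors, counts: counts };
}

const report = funnel(
  [
    { id: "a", testsPass: true, lintClean: true, typesSound: true, followsHarness: true },
    { id: "b", testsPass: false, lintClean: true, typesSound: true, followsHarness: true },
    { id: "c", testsPass: true, lintClean: false, typesSound: true, followsHarness: true },
    { id: "d", testsPass: true, lintClean: true, typesSound: true, followsHarness: false },
  ],
  guardrails
);
// report.counts is [3, 2, 2, 1]; only candidate "a" survives every filter.
```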

Modern Software Engineering and Agents

Dave Farley's Modern Software Engineering argues that software engineering is the application of empiricism and the scientific method to building software. The core practices:

Work in small batches — small commits, small PRs, fast integration
Optimize for fast feedback — tests, CI, trunk-based development
Experimentation — try things, measure results, adapt
Continuous delivery — always releasable, deploy when ready

Every one of these maps directly to AI-assisted development:

Small batches = small PRs the agent can produce and you can review in minutes. Not 2,000-line diffs. Not week-long branches. One feature. One fix. One refactoring. Merge and move on.

Fast feedback = make lint && make test gives you a definitive answer in minutes. The agent runs these checks. If they pass, the code is good. If they fail, the agent fixes and tries again. The feedback loop is tight.

Experimentation = the harness is a hypothesis. "If I give the agent these patterns, it will produce code like this." Update the harness when the hypothesis is wrong. Run the experiment again. This is the scientific method applied to AI collaboration.

Continuous delivery = trunk-based development with CI means every merge is deployable. The agent produces code that's always ready to ship, not code that needs a cleanup pass before release.

Farley couldn't have predicted AI agents when he wrote the book, but he described exactly the practices that make them work.

The Harness Optimizes Feedback — For You and the Agent

The harness has two audiences:

For the agent: "Here's how to write an Action class. Here's the pattern. Here are the anti-patterns. Follow this."

For you: "Here's what I expect the agent to produce. If it doesn't match, either the agent drifted or the harness needs updating."

The feedback protocol makes this bidirectional:

You review output → Agent drifted? → Update harness → Agent re-reads → Better output
                   → Harness gap? → Update harness → Agent re-reads → Better output
                   → Looks good?  → Commit and ship

Every review either confirms the harness is working or improves it. The harness gets better over time. The agent's first-attempt accuracy improves over time. Your review time decreases over time.

This is the ratchet effect. The system improves in one direction. It doesn't degrade. Each lesson is captured. Each correction is permanent.

You're Codifying Yourself

Here's something I didn't expect: building the harness forced me to articulate decisions I'd been making unconsciously for years.

Why do I prefer constructor injection over facades? Why do I insist on Result DTOs instead of returning models? Why do Actions have one public method? Why does the SPA component own the logic while the interim wrapper is just plumbing?

I had reasons for all of these. They were informed by years of experience, books I'd read, projects I'd worked on, mistakes I'd made. But they lived in my head. When I was writing every line of code, they came out through my fingers. When an agent writes the code, they need to come out through the harness.

The harness is a codification of your engineering judgment. Your preferences. Your project's specific constraints. Your team's conventions. Your domain's requirements.

And every project's harness will be different. A fintech codebase has different concerns than a social media app. A Go microservice has different patterns than a Laravel monolith. A greenfield project has different rules than a legacy migration.

This is why "just use AI to write code" is incomplete advice. The AI doesn't know your project. It doesn't know your domain. It doesn't know why you chose contracts over concrete classes, or why authorization goes through Policies instead of middleware, or why the SPA is gated to non-production environments.

You know. And your job is to make that knowledge explicit, machine-readable, and automatically loaded at the right time.

The Engineer's Role in the Age of Agents

If agents can write code, what do engineers do?

Three things:

1. Curator of Design.
You decide the architecture. Actions, Services, Policies, query builders. These are design decisions that shape how every feature gets built. The agent follows design. It doesn't create it. Your taste, your judgment, your experience with tradeoffs: that's irreplaceable.

2. Curator of Guardrails.
You build and maintain the system that verifies output: tests, linting, CI, pre-commit hooks. Without guardrails, agent output is unchecked. The guardrails are your engineering standards made executable.

3. Curator of Documentation.
Not READMEs for humans; guidance for agents. Harness files that encode patterns, constraints, and anti-patterns. Documentation that's loaded in context, not filed in a wiki nobody reads.

The code is a byproduct. The real output of a software engineer is the system of constraints that makes correct code the default and incorrect code structurally difficult.

This isn't a diminished role. It's a leveled-up role. You're working at a higher level of abstraction. You are defining what good looks like instead of typing it character by character.

On-the-Loop Management

With all the pieces in place, your role becomes:

Setting direction. What to build. What to migrate next. Which Jira tickets to pick up. Architecture decisions. Tradeoffs.

Writing specs as tests. The TDD red phase is your primary communication channel with the agent. Failing tests are unambiguous specifications.

Reviewing output. Reading diffs, not writing code. Checking "did the agent follow the patterns?" not "is this semicolon in the right place?"

Curating the harness. When the agent drifts, you don't just fix the instance, you fix the guidance. The correction propagates to all future work.

Managing infrastructure. Docker, CI/CD, deployment pipelines, queue workers, Redis, feature flags. The plumbing that makes everything else work.

This is management, not micromanagement. You're responsible for the system's output, but you're not manually producing every artifact.

The Story in Numbers

This project went from zero to here in about 3 months:

Metric	Value
Total commits	258
Pull requests	145
PHP tests	2,700+
Conventional commits	122 (47% of total)
Refactoring commits	17
Test-specific commits	14
Feature commits	29
Fix commits	53
React SPA pages	~15
Features shipped via interim wrappers	6
Harness files	9
Big-bang rewrites	0

One engineer. One AI agent. Nine Markdown files. And a codebase that went from "untested legacy monolith" to "well-structured application with dual-frontend migration, automated quality gates, and continuous delivery."

What I'd Do Differently

I'd write the harness earlier. I built the harness at commit #109 (out of 258). If I'd built it at commit #30, after the initial test and linting setup, every subsequent commit would have benefited from guided agent output.

I'd invest more in test infrastructure early. The UserFactory facade was a game-changer. Every similar shortcut (factory states, test helpers, assertion macros) pays dividends across hundreds of tests.

I'd document scoping rules from day one. "New features go here. Bug fixes go here. Don't touch this." The earlier the agent knows the rules, the fewer corrections you make.

What I Wouldn't Change

Starting with tests. Non-negotiable. Everything else depends on the safety net.

Keeping it simple. Markdown files, Makefile commands, Docker containers, conventional commits. Every piece is boring. Every piece works. The boring stack is the reliable stack.

Incremental migration. Never once did we stop shipping features to "do infrastructure." The migration happened alongside feature work, commit by commit, PR by PR.

The feedback protocol. Updating the harness on every review. This is the single highest-leverage habit in the entire workflow.

The Point

You can use AI coding agents on real production codebases and get predictable, high-quality results.

Not by hoping the agent is smart enough. Not by building custom AI infrastructure. Not by trusting vibes.

By doing the boring work first:

Write the tests
Add the linting
Set up CI
Establish patterns
Build the harness
Update the harness continuously

Then let the agent operate within those constraints. Review the output. Update the constraints. Ship the code.

The agent didn't get smarter over these three months. The guardrails got better. The harness accumulated lessons. The codebase developed a shape that made it harder to do the wrong thing and easier to do the right thing.

That's not AI magic. That's engineering.

Building the Agent Harness: Subdirectory CLAUDE.md Files

Ian Johnson — Wed, 08 Apr 2026 15:46:16 +0000

One Big File Doesn't Scale

Claude Code reads a CLAUDE.md file at the root of your project. It's the primary way to give the agent project-specific instructions. And for small projects, it works great.

For this project, it didn't.

The root CLAUDE.md grew to cover architecture, testing conventions, API patterns, legacy patterns, SPA patterns, service design, migration strategy, authorization rules, database conventions, and more. The file was huge. Every time Claude started working in any part of the codebase, it loaded all the instructions into context.

Two problems:

Context window bloat. Instructions about database migrations are irrelevant when writing React components. But they're eating context tokens.
Signal-to-noise ratio. When everything is important, nothing is. The agent has to parse 500 lines of instructions to find the 20 that matter for the current task.

The solution: subdirectory CLAUDE.md files.

The Harness Architecture

Claude Code loads CLAUDE.md files hierarchically. If you're working in app/Actions/, it loads:

The root CLAUDE.md
app/Actions/CLAUDE.md (if it exists)

This means you can put scoped guidance in subdirectories. The agent only loads instructions relevant to where it's working. The root file has the big picture. The subdirectory files have the specifics.
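The lookup described above can be sketched as a walk from the repo root down to the working directory. This models the behavior as I've described it, not Claude Code's actual loading logic, which may differ in detail; `harnessFilesFor` and the list of existing files are hypothetical.

```typescript
// Return the CLAUDE.md files that apply to a working directory, root first.
// `existing` is the assumed list of harness files checked into the repo.
function harnessFilesFor(workingDir: string, existing: string[]): string[] {
  const segments = workingDir.split("/").filter((s) => s.length > 0);
  const found: string[] = [];
  // depth 0 is the repo root; each further depth adds one path segment.
  for (let depth = 0; depth <= segments.length; depth++) {
    const dir = segments.slice(0, depth).join("/");
    const candidate = dir === "" ? "CLAUDE.md" : dir + "/CLAUDE.md";
    if (existing.indexOf(candidate) !== -1) {
      found.push(candidate);
    }
  }
  return found;
}

const checkedIn = ["CLAUDE.md", "app/Actions/CLAUDE.md", "tests/CLAUDE.md"];
const forActions = harnessFilesFor("app/Actions", checkedIn);
// forActions is ["CLAUDE.md", "app/Actions/CLAUDE.md"]
const forSeeders = harnessFilesFor("database/seeders", checkedIn);
// forSeeders is ["CLAUDE.md"]: no scoped file exists, so only the root applies
```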

Here's the harness layout:

CLAUDE.md                              ← Project overview, architecture, shared rules
app/Http/Controllers/Api/CLAUDE.md     ← API controller patterns
app/Http/Controllers/Web/CLAUDE.md     ← Legacy controller rules (bug fixes only)
app/Actions/CLAUDE.md                  ← Action class patterns, Result DTOs
app/Services/CLAUDE.md                 ← Service contracts, strategy pattern
app/Http/Resources/CLAUDE.md           ← API Resource conventions
app/Policies/CLAUDE.md                 ← Authorization patterns
database/migrations/CLAUDE.md          ← Migration naming, column types
tests/CLAUDE.md                        ← Test patterns, UserFactory, TDD workflow
resources/js/spa/CLAUDE.md             ← SPA architecture, Interim wrappers

When Claude works on an API controller, it sees the root CLAUDE.md (architecture, roles, domain model) plus the API controller harness (QueryBuilder patterns, Resources, thin controllers). It doesn't see the migration naming rules or the SPA component patterns. Clean context. Relevant guidance.

What Goes in Each Harness File

Each harness file follows the same structure:

What this area is — one sentence
Design direction — where we're headed (growth area vs. legacy)
Patterns — code examples the agent should follow
Rules — explicit do/don't constraints
What NOT to do — anti-patterns with explanations

Let me walk through a few.

Actions Harness

The Actions harness defines the single-execute pattern:

## Action Class Pattern

- One public method: `execute()`
- Accept a Form Request or typed parameters — not raw arrays
- Return a Result DTO — not a model or array
- Inject services via constructor
- Fire domain events for audit trails and side effects

It includes the Result DTO pattern with named constructors:

public static function succeeded(Thing $thing): self
{
    return new self(success: true, thing: $thing);
}

public static function failed(string $error): self
{
    return new self(success: false, error: $error);
}

And explicit anti-patterns:

## What NOT to Do

- Do not put HTTP concerns in Actions — no response codes, redirects, or JSON
- Do not access `Request` objects directly — accept Form Requests passed by the controller
- Do not create Actions for simple CRUD that the controller can handle
- Do not add multiple public methods — one Action, one execute(), one responsibility
- Do not return raw models or arrays — always return a Result DTO

When Claude creates a new Action, it reads this file and follows the pattern. Every time. It doesn't invent its own approach. It doesn't return raw models. It doesn't add multiple public methods. Because the harness says not to.

Tests Harness

The Tests harness is one of the most detailed because tests are the quality gate:

## Performance

- **CRITICAL: Never run multiple test processes simultaneously**
- **Run `make test` exactly once** as a final check before commit
- **TIMEOUT MUST BE AT LEAST 20 MINUTES**
- **Frontend-only changes**: Skip `make test` — run `make test-js` and `make lint` only

It defines the UserFactory pattern so the agent never uses the wrong factory:

### UserFactory (Test Facade)

Always use the test UserFactory facade for creating test users:

    use Facades\Tests\Setup\UserFactory;

    $admin = UserFactory::admin()->create();
    $user = UserFactory::withPermissions('things.manage')->create();

Do NOT use `User::factory()->create()` directly in tests.

And it prioritizes test coverage:

### Test Coverage Priorities

1. Authentication (401 for unauthenticated)
2. Authorization (403 for unauthorized)
3. Validation (422 for invalid input)
4. Happy path (200/201 for valid operations)
5. Edge cases and business rules
6. Event/notification firing

Web Controllers Harness

This is the shortest harness file, and that's intentional:

# Web Controllers (Legacy)

Bug fixes only. No new features. No new routes. No new views.

New features go in API controllers + React SPA pages.

The brevity is the message. When Claude opens a web controller, the harness immediately tells it: you're in legacy territory. Fix bugs. Don't build here.

SPA Harness

The SPA harness covers the interim wrapper pattern, component structure, and the relationship between SPA and legacy contexts:

## Interim Wrappers

Interim wrappers are a release mechanism, not a separate architecture.
The SPA component is always the source of truth.

The wrapper only passes URL props that differ between contexts.
New features and bug fixes go into the SPA component.

This prevents a common mistake: Claude building features in the interim wrapper instead of the SPA component. The harness makes the hierarchy explicit.

The Root CLAUDE.md

The root file handles the big picture:

Project overview and tech stack
The two-frontend architecture and migration direction
Scoping rules (what goes where)
Domain model overview (User roles, Order lifecycle, etc.)
Makefile commands
Pre-commit verification steps
The TDD workflow

It also contains a harness table that maps areas to their harness files:

| Area | Harness File | Summary |
|------|-------------|---------|
| API Controllers | `app/Http/Controllers/Api/CLAUDE.md` | Growth direction |
| Web Controllers | `app/Http/Controllers/Web/CLAUDE.md` | Legacy, bug fixes only |
| React SPA | `resources/js/spa/CLAUDE.md` | SPA source of truth |
| Services | `app/Services/CLAUDE.md` | Contract-first design |
| Actions | `app/Actions/CLAUDE.md` | Single execute(), Result DTOs |
| Tests | `tests/CLAUDE.md` | UserFactory, TDD workflow |

This table is surprisingly important. It tells Claude, and human developers, that guidance exists and where to find it. Without the table, harness files might go unread.

The Feedback Protocol

The harness isn't static. It evolves with every review.

## Feedback Protocol

1. Update the appropriate CLAUDE.md harness file
2. Reload that harness file into context
3. Re-attempt the change using the updated guidance

When Claude does something wrong:

If it's a harness gap → update the harness
If it's a one-off mistake → correct inline
If it's a recurring pattern → add it to the "What NOT to Do" section
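The triage above can be sketched as a function that decides whether a finding changes the harness. Everything here is hypothetical and simplified: `HarnessFile`, `Finding`, and `applyFinding` model a harness as two string lists, whereas the real files are free-form Markdown.

```typescript
interface HarnessFile {
  path: string;
  rules: string[];
  antiPatterns: string[]; // the "What NOT to Do" section
}

type Finding =
  | { kind: "harness-gap"; rule: string }
  | { kind: "one-off" }
  | { kind: "recurring"; antiPattern: string };

// Returns a new harness value; the input is never mutated.
function applyFinding(harness: HarnessFile, finding: Finding): HarnessFile {
  switch (finding.kind) {
    case "harness-gap":
      return {
        path: harness.path,
        rules: harness.rules.concat([finding.rule]),
        antiPatterns: harness.antiPatterns,
      };
    case "recurring":
      return {
        path: harness.path,
        rules: harness.rules,
        antiPatterns: harness.antiPatterns.concat([finding.antiPattern]),
      };
    case "one-off":
      return harness; // corrected inline; the harness stays as it is
  }
}

const actionsHarness: HarnessFile = { path: "app/Actions/CLAUDE.md", rules: [], antiPatterns: [] };
const afterGap = applyFinding(actionsHarness, { kind: "harness-gap", rule: "Return a Result DTO" });
const afterOneOff = applyFinding(afterGap, { kind: "one-off" });
const afterRecurring = applyFinding(afterOneOff, {
  kind: "recurring",
  antiPattern: "Do not return raw models or arrays",
});
```

Only two of the three outcomes touch the harness, which keeps one-off slips from bloating the guidance.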

The reload step is key. After updating a harness file, Claude re-reads it in the same conversation. The correction applies immediately, not just in the next session. And because the harness file is committed to the repo, it applies to all future sessions.

This creates a ratchet effect. The harness gets better with every review. Mistakes that happen once get encoded as rules. Over time, the agent's first-attempt accuracy improves because the guidance accumulates lessons.

The Harness Checks Its Own Work

The harness includes pre-commit verification steps:

## Pre-Commit Verification

1. `make lint` — all checks must pass
2. `make test` — all PHPUnit tests must pass
3. `make test-js` — all Vitest tests must pass
4. Browser verification (UI changes only)
5. E2E tests (ask before running)

This is the harness checking its own output. The agent writes code, then runs the same quality gates that a human would. If anything fails, the agent fixes it before presenting the diff for review.

The agent doesn't need me to tell it "run the tests." The harness says to. It's part of the workflow, not an afterthought.
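As a rough sketch, the verification list behaves like an ordered set of gates. The code below is an assumption-laden model: `Gate` and `verify` are hypothetical, the `run` callbacks are stubs rather than real `make` invocations, and stopping at the first failure is my simplification of "fix it before presenting the diff".

```typescript
interface Gate {
  command: string;
  run: () => boolean; // stub standing in for actually executing the command
}

interface VerifyOutcome {
  passed: boolean;
  ran: string[]; // commands executed, in order
  failedAt: string | null;
}

// Run gates in order and stop at the first failure so it can be fixed.
function verify(gates: Gate[]): VerifyOutcome {
  const ran: string[] = [];
  for (const gate of gates) {
    ran.push(gate.command);
    if (!gate.run()) {
      return { passed: false, ran: ran, failedAt: gate.command };
    }
  }
  return { passed: true, ran: ran, failedAt: null };
}

const gates: Gate[] = [
  { command: "make lint", run: () => true },
  { command: "make test", run: () => false }, // simulate a failing PHPUnit run
  { command: "make test-js", run: () => true },
];
const outcome = verify(gates);
// outcome.failedAt === "make test"; "make test-js" is never run
```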

Design Decisions

Why CLAUDE.md files instead of custom tooling?
Simplest thing that works. CLAUDE.md files are plain Markdown, checked into the repo, version-controlled, reviewable in PRs. No custom tool to build, maintain, or explain. Any developer can read them. Any developer can update them.

Why one file per directory instead of one giant file?
Context control. When you're in app/Actions/, you need Action patterns. When you're in tests/, you need test patterns. Loading everything everywhere wastes context and drowns the signal.

Why include code examples?
Because patterns are easier to follow than rules. "Return a Result DTO" is a rule. Showing the succeeded() / failed() static constructors is a pattern the agent can copy. It'll copy the pattern before it reads the rule.

Why include "What NOT to Do"?
Because agents learn from examples, including negative examples. If the harness only shows what to do, the agent might still do the wrong thing in novel situations. Anti-patterns draw explicit fences.

The Takeaway

Building an agent harness:

Start with the root CLAUDE.md. Project overview, architecture, shared conventions.
Add subdirectory files where patterns matter. Actions, Services, Tests, SPA, and any other area with specific conventions.
Keep each file focused. Pattern + rules + anti-patterns. Nothing more.
Include code examples. The agent follows what it sees.
Include a harness table in the root file so nobody misses the guidance.
Update the harness on every review. Corrections become rules. Rules prevent recurrence.
Let the harness check its own work. Pre-commit verification built into the workflow.

The harness is not documentation for humans. It's guidance for an autonomous agent. Design it that way: specific, scoped, verifiable, and always evolving.

The most important post of the series so far!

Ian Johnson — Tue, 07 Apr 2026 17:23:37 +0000

In-the-Loop to On-the-Loop: How I Stopped Micromanaging My AI Agent

#ai #productivity #softwareengineering #programming

In-the-Loop to On-the-Loop: How I Stopped Micromanaging My AI Agent

Ian Johnson — Tue, 07 Apr 2026 16:17:00 +0000

I Was the Bottleneck

For the first two months of this project, I used Claude Code with auto-approve turned off. Every file edit, every terminal command, every change... I reviewed it before it executed.

I read every line. I made inline corrections. I gave real-time direction: "No, use the repository pattern here." "That's the wrong role check." "We use UserFactory::admin(), not User::factory()."

I was pair programming with an AI agent. Except I was the worst kind of pair: the one who grabs the keyboard every 30 seconds.

The results were good. The code was clean. But I was doing most of the thinking and half the typing. The agent was a fancy autocomplete with better suggestions. I wasn't getting the leverage I'd hoped for.

The Realization

I read an article about "on-the-loop" versus "in-the-loop" human-AI collaboration. The framing clicked immediately:

In-the-loop: You're inside the agent's decision cycle. You approve every action. You're a required step in every operation. The agent can't sneeze without your permission.

On-the-loop: You set the direction, define the constraints, and review the output. The agent operates autonomously within those constraints. You step in when something goes off track, not for every keystroke.

In-the-loop is micromanagement. On-the-loop is management.

The problem was obvious: I was micromanaging because I didn't trust the agent to do the right thing. And I didn't trust the agent because there was nothing forcing it to do the right thing.

The Prerequisites

On-the-loop only works if the agent's environment constrains it toward correct output. Without guardrails, autonomy produces slop.

Look at what we'd built over the previous two months:

Guardrail	What It Constrains
2,700+ tests	Behavioral correctness
Pint	PHP code style
Psalm	PHP type safety
Prettier	JS/CSS formatting
ESLint	React/TypeScript patterns
TypeScript	Frontend type safety
Pre-commit hooks	Catches issues before commit
CI pipeline	Final verification gate
Actions pattern	Where business logic lives
Service contracts	How integrations are structured
Policies	How authorization works
Conventional commits	How changes are described
Trunk-based dev	How changes are delivered

Each guardrail narrows the space of "valid output." Together, they create a corridor. The agent can move freely within that corridor, but it can't easily wander off into the weeds.

This is why stages 1–3 came first. You can't go on-the-loop with an agent on a codebase that has no tests, no linting, and no architectural patterns. That's not delegation; that's negligence.

The Switch

I turned on auto-approve for file edits. I started giving Claude higher-level instructions:

Before (in-the-loop):

"Create a new file at app/Actions/Orders/CreateOrderAction.php. Add a constructor that injects NotificationInterface and AnalyticsService. Add an execute method that takes a CreateOrderRequest and User..."

After (on-the-loop):

"Extract the order creation logic from OrdersController::store() into a CreateOrderAction following the existing Action pattern."

The agent looks at the existing Actions. It sees the pattern. It creates the class and the Result DTO, then wires up the controller. It runs make lint and make test. Everything passes. I review the diff. It's correct.

I went from dictating code to reviewing code. My throughput doubled. Maybe more. And the code quality stayed the same because the guardrails enforced it.

What On-the-Loop Looks Like Day to Day

A typical session now:

I give direction. "Implement the Guide SPA page. Here's the Jira ticket."
Claude reads the harness. It checks resources/js/spa/CLAUDE.md to understand the SPA patterns. It checks app/Http/Controllers/Api/CLAUDE.md for API conventions. It checks tests/CLAUDE.md for testing patterns.
Claude writes failing tests. Following TDD, it writes the test descriptions and presents them to me.
I review the test specs. This takes 2 minutes. I approve or adjust.
Claude implements. It builds the API endpoint, the React page, the interim wrapper if needed. It runs lint and tests.
I review the diff. This takes 5–10 minutes. I'm reading code, not writing it. I'm looking for architectural missteps, not formatting issues.
If it's good, we commit and push. Claude watches the CI run and alerts me to any failures.

The critical shift: I'm reviewing output, not directing input. I'm checking "did the agent follow the patterns?" not "write this line of code."

Custom Skills: Codifying the Workflow

As the on-the-loop workflow matured, I noticed I was giving Claude the same high-level instructions repeatedly. "Here's a Jira ticket. Read it. Write tests. Implement. Run lint and tests. Open a PR." So I codified these into reusable Claude Code skills: slash commands that encode the full workflow.

/implement-jira-card takes a Jira issue key, pulls the requirements, writes failing tests using TDD, implements the smallest change to make them pass, runs the quality gates, and prepares a PR, all following the harness patterns.

/implement-change does the same thing for ad-hoc changes that don't have a Jira ticket. You describe the change, and the skill drives the TDD workflow: write failing tests, get approval on the test descriptions, implement, verify.

These skills are the on-the-loop workflow made executable. Instead of explaining the process each session, I type a slash command and review the output. The skill encodes the sequence I'd otherwise repeat manually: read the context, write tests first, implement in small steps, run the quality gates, ship it.

The skills don't replace judgment. I still review test descriptions before implementation and review the final diff. But they eliminate the ceremony of setting up each task and ensure the TDD workflow is followed consistently.

The Feedback Protocol

Sometimes the agent gets something wrong. When it does, I don't just fix the instance. I fix the guidance.

This is the feedback protocol:

The agent produces something incorrect or suboptimal
I identify the pattern — is this a one-off mistake, or a gap in the harness?
If it's a gap, I update the relevant CLAUDE.md file — add the rule, the example, the anti-pattern
The harness reloads — Claude re-reads the updated guidance
The correction applies to all future work — not just this instance

Example: Claude was putting notification logic directly in API controllers instead of using the notification service. I added this to the Services harness:

"Do not call the chat webhook or external APIs directly from controllers — use the appropriate service."

It never made that mistake again. Across any controller. The harness is a living document that accumulates lessons learned.

The Agentic Flywheel

The feedback protocol above is manual: I notice a gap, I update the harness, the agent reloads. That's how it started. But the current system goes a step further.

In the project's main CLAUDE.md, two instructions turn the harness from a static document into a self-improving system:

The Change Approval Flow:

Show diff — present all changes and ask for feedback
If feedback given — update the relevant CLAUDE.md harness file to capture the pattern/practice, reload it, then re-apply the changes
If approved — run all pre-commit checks
If checks pass — commit and push

The Feedback Protocol:

All feedback about code quality, patterns, or practices follows this loop:

Update the appropriate CLAUDE.md harness file to capture, define, or refine the pattern
Reload that harness file into context
Re-attempt the change using the updated guidance

Read that carefully. The agent doesn't just follow the harness. It writes to the harness. When I give feedback on a diff, the agent's next action isn't to fix the code. It's to update the harness file that governs that area, reload the updated guidance, and then re-apply the change under the new rules.

This is the difference between a harness and a flywheel.

A harness is guidance the agent reads. I write the rules. The agent follows them. When the rules are wrong or incomplete, I update them manually.

A flywheel is guidance the agent reads and writes. I give feedback. The agent encodes that feedback into the harness. The next task benefits from the updated harness. That task generates new feedback. The harness evolves again.

Feedback → Agent updates harness → Better output → Less feedback → Repeat

Every review cycle makes the next cycle faster. The corrections compound. Early in the project, I was giving feedback on almost every diff: "use the service, not a direct call," "that's the wrong factory method," "authorization goes in the policy, not the controller." Each correction became a harness rule. Each rule eliminated a class of future mistakes.

Three months in, most diffs need zero feedback. The flywheel has accumulated enough guidance that the agent produces correct output on the first pass for the majority of tasks. My reviews shifted from "this is wrong, fix it" to "approved."

The flywheel has a second-order effect: it forces me to give precise, pattern-level feedback instead of one-off corrections. "Fix this line" doesn't help the harness. "Notification logic belongs in the notification service, not controllers" does. The mechanism shapes the feedback toward reusable rules, which is exactly what you want.

This is also why the harness is distributed across multiple CLAUDE.md files: one per architectural boundary (Actions, Services, Controllers, Tests, SPA, etc.). When the agent updates the harness, it updates the specific file governing that area. The feedback lands where future work will read it.
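A minimal sketch of how feedback could be routed to the harness file that governs a changed path. The directory layout and the `harnessFileFor` helper are assumptions for illustration, not the project's real structure; the only idea taken from the text is "one harness file per architectural boundary, most specific one wins."

```typescript
// Hypothetical mapping from architectural boundary to its harness file.
// These paths are illustrative assumptions, not the project's real layout.
const harnessFiles: Array<{ prefix: string; harness: string }> = [
  { prefix: "app/Actions/", harness: "app/Actions/CLAUDE.md" },
  { prefix: "app/Services/", harness: "app/Services/CLAUDE.md" },
  { prefix: "app/Http/Controllers/", harness: "app/Http/Controllers/CLAUDE.md" },
  { prefix: "tests/", harness: "tests/CLAUDE.md" },
  { prefix: "resources/js/spa/", harness: "resources/js/spa/CLAUDE.md" },
];

// Fallback: the project's main harness file at the repository root.
const rootHarness = "CLAUDE.md";

// Pick the harness file for a changed path. The longest matching prefix wins,
// so the most specific boundary receives the feedback.
function harnessFileFor(changedPath: string): string {
  let best: { prefix: string; harness: string } | undefined;
  for (const entry of harnessFiles) {
    if (!changedPath.startsWith(entry.prefix)) continue;
    if (best === undefined || entry.prefix.length > best.prefix.length) {
      best = entry;
    }
  }
  return best ? best.harness : rootHarness;
}
```

Feedback on a controller diff would land in the Controllers harness under this mapping, and anything outside a known boundary falls back to the root file.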

The flywheel isn't magic. It requires two things: feedback that's worth encoding, and an engineer who reviews diffs carefully enough to generate that feedback. But given those inputs, the system gets better automatically. The agent does the mechanical work of updating documentation, reloading context, and re-applying changes. You just say what's wrong.

The Trust Equation

On-the-loop trust has a formula:

Trust = Tests + Linting + CI + Architectural Patterns + Harness Guidance

If any component is missing, trust drops:

No tests? You can't verify the agent's output is correct.
No linting? The output might be inconsistent or buggy in ways tests don't catch.
No CI? You're trusting local runs that might not match production.
No patterns? The agent invents its own, and they'll be inconsistent.
No harness? The agent doesn't know your conventions.

All five together? You can hand the agent a Jira ticket and review the PR an hour later.
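The "formula" is really a checklist: every component has to be present. A small sketch of that check, where the type and function names are hypothetical and exist only to make the all-or-nothing nature explicit:

```typescript
// The five components named in the trust equation.
type TrustComponent = "tests" | "linting" | "ci" | "patterns" | "harness";

const allComponents: TrustComponent[] = ["tests", "linting", "ci", "patterns", "harness"];

// Return the components a project still lacks.
function missingComponents(present: TrustComponent[]): TrustComponent[] {
  return allComponents.filter((component) => !present.includes(component));
}

// On-the-loop delegation is only reasonable when nothing is missing.
function readyForOnTheLoop(present: TrustComponent[]): boolean {
  return missingComponents(present).length === 0;
}
```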

The Curator Mindset

My role shifted from writer to curator. I don't write most of the code anymore. I:

Define the patterns — through architecture and harness files
Review the test specs — through TDD red-phase tests
Review the output — diffs, not keystrokes
Update the harness — when the agent drifts or when patterns evolve
Make strategic decisions — what to build, in what order, with what tradeoffs

Here's the thing I didn't anticipate: building the harness forced me to articulate my own preferences. Why do I prefer constructor injection? Why Result DTOs instead of returning models? Why one execute() method per Action? I had reasons for all of these — years of experience, books I'd read, mistakes I'd made — but they lived in my head. The harness made them explicit.

This is a different kind of engineering. It's more like managing a team than writing code solo. You're responsible for the quality of the output, but you're not doing the mechanical work. You're the curator of design, guardrails, and documentation.

It's also more fun. I spend my time on the interesting problems (architecture, domain logic, strategic decisions) and let the agent handle the implementation details that follow established patterns.

The Results

I don't have hard before/after metrics (this isn't an A/B test), but the trajectory is clear:

258 commits in ~3 months — roughly 3 commits per day
145 PRs merged — consistent, steady output
2,700+ PHP tests, growing Vitest suite — quality gates that hold
6 major features migrated to React SPA — shipped to production via interim wrappers
17 refactoring commits — continuous architectural improvement
Zero big-bang rewrites — incremental progress throughout

The codebase went from "legacy monolith with no tests" to "well-structured application with automated quality gates and a dual-frontend migration in progress." In three months. With one engineer and an AI agent.

Infrastructure Is a Guardrail Too

It's easy to think of guardrails as code quality tools — tests, linters, static analysis. But the infrastructure decisions are guardrails in their own right.

Docker ensures every environment is identical. The Makefile provides a single interface for every operation. Redis-backed queues isolate background jobs (CRM sync, notifications, calculations) from the request cycle. Separate queue names mean a third-party API outage doesn't back up critical notifications.

The agent doesn't need to know how Docker networking works or why the CRM sync runs on its own queue. It just needs make test to pass and make lint to be clean. The infrastructure absorbs complexity so the agent (and you) don't have to think about it.

This is "pull complexity downward" in action. Simple interfaces. Complex implementations hidden behind them.

What Could Go Wrong

On-the-loop isn't a silver bullet. There are failure modes:

The harness is wrong. If your CLAUDE.md files encode bad patterns, the agent will faithfully reproduce bad patterns. The harness is as good as your engineering judgment.

The tests don't cover the right things. If your tests verify implementation details instead of behavior, the agent can pass all tests while doing the wrong thing.

You stop reviewing. On-the-loop doesn't mean no-loop. You still review diffs. You still verify the output makes sense. You just do it at a higher level.

You skip the prerequisites. If you try to go on-the-loop without tests, linting, and CI, you'll get fast slop instead of slow slop. Still slop.

The Takeaway

The path from in-the-loop to on-the-loop:

Build the guardrails first. Tests, linting, CI, clear architecture.
Establish patterns. Actions, Services, Policies — consistent, repeatable, verifiable.
Create the harness. CLAUDE.md files that encode your conventions.
Start delegating. Give higher-level instructions. Review output, not input.
Update the harness continuously. Every correction is a lesson the harness absorbs.
Trust the system, not the agent. You trust the guardrails. The agent just happens to work within them.

The agent didn't get smarter. The environment got smarter. That's the difference.

Trunk-Based Development with Short-Lived Branches

Ian Johnson — Tue, 07 Apr 2026 15:03:24 +0000

Why Long-Lived Branches Kill Velocity

You've seen it. A feature branch that started two weeks ago. It's 47 commits behind main. Three people are waiting on it. The merge conflict is 400 lines. Nobody wants to review it because reviewing 2,000 lines of diff is nobody's idea of a good time.

Long-lived branches are where productivity goes to die. And when you add an AI agent to the mix, they get even worse. The agent writes code against the branch state. Main moves on. By the time you merge, half the agent's assumptions are wrong.

Trunk-based development fixes this. The rule is simple: branches live for hours, not days. Merge to main early and often. Keep main releasable at all times.

Trunk-based development doesn't necessarily mean merging changes straight to main. In my view, it's more about ensuring everything works together so you can really take advantage of CI. Short-lived branches give us this, along with the safety net that many developers prefer. Whether to push directly to main is a matter of preference. Personally, I prefer not to.

The Workflow

Here's what a typical feature looks like in this project:

Branch — create a branch from main: feat/PROJ-431-dashboard-migration
Build — write tests, implement the feature, run make lint && make test
PR — open a PR. Small diff. Clear description. Conventional commit title.
CI — GitHub Actions runs the full pipeline (lint, test, test-js)
Merge — once CI is green, merge to main
Deploy — CI triggers a Forge deployment webhook. Staging updates automatically.

The entire cycle (branch to merged) is usually same-day. Sometimes within an hour for smaller changes.

145 PRs in 3 Months

This project has 258 commits across ~3 months. 145 of those went through pull requests. That's roughly 1.6 PRs per day, every day.

Most PRs are small. A refactoring extraction. A test coverage expansion. A bug fix. A single feature. The biggest PRs were the frontend migration (Tailwind, jQuery removal), and even those were broken into sequential stages.

Small PRs have compounding benefits:

Easier to review — you can actually read the diff
Easier to revert — if something breaks, git revert one PR, not a 2,000-line changeset
Faster CI — smaller changes mean fewer test failures to debug
Less merge conflict risk — you're never far from main

Conventional Commits

Every commit follows the conventional commits format:

feat: add GET /api/dashboard endpoint (PROJ-430) (#130)
fix: resolve planner bugs (PROJ-432) (#131)
refactor: extract CreateOrderAction from OrdersController::store() (#80)
test: expand OrdersController test coverage (#59)
docs: document legacy Blade vs React SPA architecture (#119)
ci: add workflow_dispatch trigger for manual CI runs
chore: remove legacy frontend dependencies and dead code (#103)

This isn't just aesthetics. Conventional commits create a machine-readable history. You can:

Generate changelogs automatically
See at a glance whether a commit is a feature, fix, or refactoring
Train an agent to follow the same convention (it will, if every existing commit uses it)

The commit message is a contract. feat: means new functionality. fix: means something was broken and now it's not. refactor: means the behavior didn't change. When the agent writes a commit message, these prefixes help me triage without reading the diff.
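Because the format is machine-readable, triage can be scripted. Below is a rough sketch of a parser for subject lines shaped like the examples above. The `parseCommitSubject` name, the list of accepted types, and the reading of a trailing `(#130)` as the PR number are assumptions based on those examples, not a full implementation of the Conventional Commits specification (it ignores scopes and `!` markers).

```typescript
interface ParsedCommit {
  type: string;
  description: string;
  ticket?: string; // e.g. "PROJ-430", when present
  pr?: number;     // e.g. 130, when present
}

// Types seen in the examples above; a real project may allow more.
const knownTypes = ["feat", "fix", "refactor", "test", "docs", "ci", "chore"];

function parseCommitSubject(subject: string): ParsedCommit | null {
  const match = /^([a-z]+): (.+)$/.exec(subject);
  if (!match || !knownTypes.includes(match[1])) return null;

  let description = match[2];
  const parsed: ParsedCommit = { type: match[1], description };

  // Strip a trailing "(#123)" suffix and keep it as the PR number.
  const prMatch = /\s*\(#(\d+)\)$/.exec(description);
  if (prMatch) {
    parsed.pr = Number(prMatch[1]);
    description = description.slice(0, prMatch.index);
  }

  // Strip a trailing "(KEY-123)" suffix and keep it as the ticket key.
  const ticketMatch = /\s*\(([A-Z]+-\d+)\)$/.exec(description);
  if (ticketMatch) {
    parsed.ticket = ticketMatch[1];
    description = description.slice(0, ticketMatch.index);
  }

  parsed.description = description;
  return parsed;
}
```

A subject without a recognized prefix returns `null`, which is a cheap signal that a commit (human or agent) broke the convention.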

The CI Pipeline

Every push to main triggers the full pipeline:

Build → Code Quality → Tests → Deploy
         (make lint)    (make test + make test-js)

The pipeline runs in Docker containers built from the same docker-compose.yml as local development. Same PHP version. Same Node version. Same MySQL. If it passes locally, it should pass in CI.

The deploy step triggers a webhook with our cloud provider that pulls the latest code, runs migrations, rebuilds assets, and restarts workers:

cd staging.example.com
git pull origin main
composer install --no-dev --optimize-autoloader
php artisan migrate --force
npm ci && npm run build
php artisan queue:restart
php artisan config:cache
php artisan route:cache
php artisan view:cache

Staging updates within minutes of a merge to main. Production deploys are triggered manually (or by the same webhook on the production server) after staging verification.

Infrastructure: Queue Workers and Redis

The deployment isn't just the web app. We also manage background infrastructure:

Queue workers process async jobs: CRM sync, notification dispatch, and background calculations. The Forge server runs supervised workers:

php artisan queue:work redis --queue=default,crm --sleep=3 --tries=3

The queue:restart in the deploy script gracefully restarts workers so they pick up the new code.

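Redis backs the queue and can optionally back the cache. Separate Redis databases (DB=0 for cache, DB=1 for queues) keep cache keys and queue keys apart, so flushing the cache doesn't wipe queued jobs. Note that memory limits and eviction apply to the Redis instance as a whole, so separate databases on their own don't stop queue growth from evicting cached data.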

The Docker Compose stack mirrors this:

redis:
  image: redis:7-alpine
  profiles: [queue]

queue-worker:
  build: .
  command: php artisan queue:work redis --queue=default,crm
  profiles: [queue]
  depends_on: [redis, mysql]

The profiles key means queue infrastructure only starts when you explicitly ask for it (docker compose --profile queue up). Local development doesn't need Redis running unless you're testing queue jobs.

The E2E Database

E2E tests (Playwright) run against a separate database: myapp_e2e. This gets its own migration and seeding:

make migrate-e2e    # Run migrations on E2E database
make seed-e2e       # Seed test users with proper roles, permissions, relationships

The E2E seeder creates users with known credentials and realistic data. It's idempotent — running it twice doesn't create duplicates.
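Idempotency here comes down to writing by a unique key instead of blindly inserting. A minimal in-memory sketch of the idea follows; the `UserTable` class, the sample users, and `seedE2e` are hypothetical stand-ins, and the real seeder may achieve the same effect differently (in Eloquent, `updateOrCreate` is the usual tool for this).

```typescript
interface SeedUser {
  email: string;
  role: string;
}

// In-memory stand-in for the users table, keyed by the unique email column.
class UserTable {
  private rows = new Map<string, SeedUser>();

  // Insert the row, or overwrite the existing row with the same email.
  upsert(user: SeedUser): void {
    this.rows.set(user.email, user);
  }

  count(): number {
    return this.rows.size;
  }
}

// Hypothetical seed data with known, stable identities.
const e2eUsers: SeedUser[] = [
  { email: "admin@example.test", role: "admin" },
  { email: "orgadmin@example.test", role: "org_admin" },
  { email: "member@example.test", role: "member" },
];

// Running this any number of times leaves exactly one row per seed user.
function seedE2e(table: UserTable): void {
  for (const user of e2eUsers) {
    table.upsert(user);
  }
}
```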

In CI, the E2E job spins up the full Docker stack (app, nginx, mysql) and runs Playwright against it. Same app, same database engine, same infrastructure as production. The only difference is the data is seeded, not real.

Continuous Delivery (Not Continuous Deployment)

An important distinction: we practice continuous delivery, not continuous deployment.

Every merge to main is deployable. The pipeline proves it: tests pass, linting passes, the build succeeds. But deploying to production is a conscious decision, not an automatic one.

This matters because:

Some features are gated behind environment checks or feature flags
Some changes need manual verification on staging first
Production deploys happen when we decide, not when the CI pipeline finishes

The codebase is always releasable. Whether we release is a business decision, not a technical one.

How This Enables Agent-Assisted Development

Trunk-based development + CI + conventional commits create something crucial for working with an AI agent: a fast, reliable feedback loop.

When Claude writes code:

The tests tell me if it works (seconds to minutes)
The linter tells me if it's clean (seconds)
CI confirms both in an environment I trust (minutes)
If it passes, I merge. If it doesn't, Claude fixes it.
The conventional commit tells me what changed without reading the diff.

There's no "let me review this 2,000-line PR over the weekend." It's: did it pass? Merge. Did it fail? Fix. Ship it. Move on.

Dave Farley calls this "optimizing for feedback." The faster you know whether a change worked, the faster you can iterate. Trunk-based development with CI gives you feedback in minutes, not days.

The Takeaway

Branches live for hours. If your branch is older than a day, something's wrong.
Small PRs, merged often. 145 PRs in 3 months. Each one small enough to review in minutes.
Conventional commits are a communication protocol. Both for humans reading the log and agents writing commits.
CI is the source of truth. If it passes CI, it's good. If it doesn't, fix it before merging.
Continuous delivery means always releasable. Deploy when you want, not when you have to.
Infrastructure is code. Docker, queue workers, Redis, deploy scripts: all versioned, all reproducible.

The combination of tests, linting, CI, and trunk-based development creates a system where changes are small, verified, and frequent. That's exactly the system an AI agent thrives in.

No Big-Bang Rewrites: Running Two Frontends Without Losing Your Mind

Ian Johnson — Tue, 07 Apr 2026 14:36:33 +0000

The Rewrite That Never Was

We needed a modern frontend. The Blade + Bootstrap + jQuery stack was showing its age. The design team had a new UI/UX vision. The natural instinct was: rewrite the frontend in React.

But big-bang rewrites fail. Joel Spolsky wrote about this in 2000. Fred Brooks explained it before that. The pattern is always the same: you spend months building The New Thing, the old thing keeps getting patches, the two diverge, and you end up with two broken systems instead of one working one.

So we didn't rewrite. We migrated. Page by page. Feature by feature. And we set up the architecture so both frontends could coexist without developers losing their minds.

Two Paths, One Codebase

The application runs two frontend architectures simultaneously:

Path	Server Layer	UI Layer	Status
Legacy	Web controllers	Blade views	Bug fixes only
SPA	API controllers	React + TypeScript	All new features

The Legacy path serves ~228 Blade views across ~59 web controllers. It works. Users depend on it. We're not touching it unless something's broken.

The SPA path is the target architecture. React 19, TypeScript 5, Tailwind CSS 4, mounted at /app via a catch-all route:

Route::get('/app/{any?}', [SpaController::class, 'index'])
    ->where('any', '.*')
    ->middleware(['auth', 'verified', 'onboarding', '2fa']);

The SPA gets the initial state (user, CSRF token) from Blade via window.__INITIAL_STATE__, then React Router handles everything client-side.

Environment Gating

The SPA is gated to local, staging, and testing environments. Production serves Legacy pages exclusively (with one exception we'll get to).

// SpaController
public function index()
{
    if (! in_array(app()->environment(), ['local', 'staging', 'testing'])) {
        abort(404);
    }

    return view('spa.index');
}

This means we can build, test, and iterate on the SPA without any risk to production. Staging gets the full SPA experience. Production gets the battle-tested Blade views.

When a feature is stable and tested in the SPA, we have two options:

Wait until the SPA is ungated for production
Ship it now using the interim wrapper pattern

The Interim Wrapper Pattern

This is the trick that made the migration practical. When an SPA page is ready for production but the SPA environment gate isn't lifted yet, we mount the React component inside a Blade shell.

Here's how it works:

1. The SPA component is the source of truth.

// resources/js/spa/pages/Dashboard/Dashboard.tsx
export function Dashboard({ dashboardUrl = '/app/dashboard' }) {
    const { data } = useDashboard(dashboardUrl);
    return (
        <AppShell>
            <DashboardContent data={data} />
        </AppShell>
    );
}

2. The interim wrapper renders it with legacy URL overrides.

// resources/js/dashboard/InterimDashboard.tsx
import { Dashboard } from '../spa/pages/Dashboard/Dashboard';

export function InterimDashboard() {
    return <Dashboard dashboardUrl="/dashboard" />;
}

3. A standalone mount file renders it into a Blade shell.

// resources/js/dashboard/main.tsx
import { createRoot } from 'react-dom/client';
import { InterimDashboard } from './InterimDashboard';

createRoot(document.getElementById('dashboard-root')!).render(
    <InterimDashboard />
);

4. A Blade view provides the mount point.

{{-- dashboard/v2.blade.php --}}
@extends('layouts.app')
@section('content')
    <div id="dashboard-root"></div>
    @viteReactRefresh
    @vite('resources/js/dashboard/main.tsx')
@endsection

The key insight: the SPA component is always the source of truth. The interim wrapper is just a thin shell that renders the SPA component with different URL props. Bug fixes go into the SPA component and automatically apply to both the SPA and the legacy context.

We shipped six features this way:

Feature	SPA Component	Interim Wrapper	Blade Shell
Dashboard	`Dashboard`	`InterimDashboard`	`v2.blade.php`
Onboarding	`OnboardingWizard`	`InterimOnboarding`	`onboarding.blade.php`
Planner	`PlannerWizard`	`InterimPlanner`	`index.blade.php`
Guide	`Guide`	`InterimGuide`	`index.blade.php`
Marketplace	`Marketplace`	`InterimMarketplace`	`index.blade.php`
History	`PlannerHistory`	`InterimPlannerHistory`	`history.blade.php`

The Frontend Overhaul

The migration also involved modernizing the entire frontend stack. This happened in a deliberate sequence:

1. Tailwind CSS 4 + Shadcn/ui
Replace Bootstrap with Tailwind. Add Shadcn/ui for consistent React components. This was the foundation layer.

2. SPA pages migrated
All existing React pages updated to use Tailwind and Shadcn/ui components.

3. jQuery elimination
Every $(document).ready() and $.ajax() call replaced with vanilla JS. jQuery removed from the bundle.

4. Blade template migration
All 228 Blade views migrated from Bootstrap classes to Tailwind. This was the biggest single PR, but it was almost entirely CSS class changes; no logic changes.

5. Livewire to React
The few Livewire components we had were rebuilt in React.

6. Dead code removal
Legacy frontend dependencies, unused JS files, and Bootstrap artifacts cleaned out.

Each of these was a separate PR, tested independently, merged to main within a day or two. No long-lived branches. No merge conflicts. The test suite verified nothing broke after each change.

Feature Flags

For features that needed gradual rollout or A/B testing, we used a feature flag / analytics service:

// Server-side feature gate
if ($this->analytics->checkGate($user, 'new_dashboard_layout')) {
    return view('dashboard.v2');
}
return view('dashboard.index');

The analytics service also handles event tracking. Every meaningful user action (order created, ticket submitted, dashboard viewed) gets logged:

$this->analytics->logDashboardView($user, 'v2');

The AnalyticsService wraps the SDK and no-ops when ANALYTICS_ENABLED=false, so tests and local development aren't affected.
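The no-op behavior is a small wrapper pattern that carries over to any stack. A sketch in TypeScript, where `AnalyticsSdk`, `AnalyticsWrapper`, and `RecordingSdk` are hypothetical names and the event string format is an assumption rather than the project's actual service:

```typescript
// Minimal shape of the vendor SDK this sketch assumes.
interface AnalyticsSdk {
  logEvent(userId: number, eventName: string): void;
}

// Wrapper that forwards to the SDK only when analytics is enabled.
class AnalyticsWrapper {
  private readonly sdk: AnalyticsSdk;
  private readonly enabled: boolean;

  constructor(sdk: AnalyticsSdk, enabled: boolean) {
    this.sdk = sdk;
    this.enabled = enabled;
  }

  logDashboardView(userId: number, version: string): void {
    if (!this.enabled) return; // no-op in tests and local development
    this.sdk.logEvent(userId, `dashboard_viewed:${version}`);
  }
}

// Test double that records calls instead of sending them anywhere.
class RecordingSdk implements AnalyticsSdk {
  readonly events: string[] = [];

  logEvent(userId: number, eventName: string): void {
    this.events.push(`${userId}:${eventName}`);
  }
}
```

Callers never check the flag themselves; they always call the wrapper, and the wrapper decides whether anything happens.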

Scoping Rules

To keep everyone sane (humans and agents), we defined clear scoping rules:

Bug in a legacy-only domain? Fix in the web controller and Blade view.

Bug in a migrated domain? Fix in the SPA React component. It automatically applies to both contexts.

New feature in any domain? Build the API endpoint and SPA page. No new Blade features.

Migrating a domain? Follow the sequence: extract to Actions → build API controller → create React page → create interim wrapper if needed.

These rules are documented in the CLAUDE.md harness files (we'll get to those in posts 7–8). The agent reads the rules and follows them. No ambiguity about where new code goes.
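The first three rules are simple enough to write down as a decision function, which is roughly how an agent ends up applying them. The names below (`WorkKind`, `Target`, `whereDoesItGo`) are made up for this sketch:

```typescript
type WorkKind = "bug" | "feature";

type Target =
  | "web controller + Blade view"
  | "SPA React component"
  | "API endpoint + SPA page";

// Encode the scoping rules: new features always go to the SPA path;
// bug fixes go wherever the domain currently lives.
function whereDoesItGo(kind: WorkKind, domainMigrated: boolean): Target {
  if (kind === "feature") {
    return "API endpoint + SPA page";
  }
  return domainMigrated ? "SPA React component" : "web controller + Blade view";
}
```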

The Asset Pipeline

Vite 6 handles both the SPA and the interim wrappers:

// vite.config.ts
import { defineConfig } from 'vite';
import laravel from 'laravel-vite-plugin';
import react from '@vitejs/plugin-react';

export default defineConfig({
    plugins: [
        laravel({
            input: [
                'resources/js/spa/main.tsx',         // SPA entry
                'resources/js/dashboard/main.tsx',   // Interim: Dashboard
                'resources/js/onboarding/main.tsx',  // Interim: Onboarding
                'resources/js/planner/main.tsx',     // Interim: Planner
                // ...
            ],
        }),
        react(),
    ],
});

Each interim wrapper gets its own entry point. Vite tree-shakes unused code. The SPA gets its own bundle. Blade pages get the specific entry they need via @vite().

The Takeaway

Never rewrite. Migrate. Page by page, feature by feature, with both systems running in parallel.
Gate the new thing. Don't ship the SPA to production until it's proven in staging.
Use wrappers for early release. The interim pattern lets SPA pages ship inside legacy shells.
SPA component is always the source of truth. The wrapper is just plumbing.
Clear scoping rules prevent confusion about where code goes.
Feature flags for gradual rollout and experimentation.

This architecture means we're never stuck. We can ship features to production today (via interim wrappers) while building toward the full SPA. No pressure. No big bang. Just steady progress.

Actions, Policies, and the Art of Obvious Code

Ian Johnson — Mon, 06 Apr 2026 15:51:04 +0000

Fat Controllers Die Hard

After extracting traits into services, the controllers were thinner, but still fat. A typical store() method in OrdersController did:

Validate input
Create the order
Create related records
Upload documents
Send email notifications
Fire events for audit trails
Send chat notifications
Redirect with flash message

That's eight responsibilities in one method. When Claude looked at this controller to understand how orders work, it had to parse all eight concerns interleaved together. When it needed to build an API controller for the same operation, it would copy-paste the web controller and try to adapt it. Badly.

The fix: Actions.

The Action Pattern

An Action is a single-purpose class with one public method: execute().

namespace App\Actions\Orders;

class CreateOrderAction
{
    public function __construct(
        private NotificationInterface $notifications,
        private AnalyticsService $analytics,
    ) {}

    public function execute(CreateOrderRequest $request, User $user): CreateOrderResult
    {
        // Validate preconditions
        abort_unless($user->organization_id, 422, 'User has no organization.');

        // Create the order
        $order = Order::create([
            'user_id' => $user->id,
            'organization_id' => $user->organization_id,
            'status' => 'pending',
            // ...
        ]);

        // Handle documents
        $this->uploadDocuments($request, $order);

        // Side effects
        event(new OrderCreated($order));
        $this->notifications->sendOrderNotification($order, 'Created', 'New order filed.');
        $this->analytics->logOrderCreated($user, $order->id, 'web');

        return CreateOrderResult::succeeded($order);
    }
}

The Result DTO:

class CreateOrderResult
{
    public function __construct(
        public readonly bool $success,
        public readonly ?Order $order = null,
        public readonly ?string $error = null,
    ) {}

    public static function succeeded(Order $order): self
    {
        return new self(success: true, order: $order);
    }

    public static function failed(string $error): self
    {
        return new self(success: false, error: $error);
    }
}

Now the controller is thin:

// Web controller
public function store(CreateOrderRequest $request, CreateOrderAction $action)
{
    $result = $action->execute($request, auth()->user());
    return redirect()->route('orders.show', $result->order);
}

// API controller — same Action, different response format
public function store(CreateOrderRequest $request, CreateOrderAction $action)
{
    $result = $action->execute($request, auth()->user());
    return new OrderResource($result->order);
}

The web controller returns a redirect. The API controller returns JSON. The business logic is identical because it lives in the Action, not the controller.

This accidentally created the perfect migration bridge. When I later migrated features from web controllers to API controllers, the Action already existed. I just wired it up to a new controller with a different response format. No duplication. No drift.
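The same shape works outside PHP. Here is a stripped-down TypeScript sketch of one action shared by two thin adapters. Everything in it is hypothetical (`CreateOrder`, `webStore`, `apiStore`, the string-based "responses"), and unlike the PHP version above it returns a failed result for the missing-organization case instead of aborting, so both adapters can show how they map a failure.

```typescript
interface OrderRecord {
  id: number;
  userId: number;
  status: string;
}

// Result type: either a created order or an error message.
type CreateOrderOutcome =
  | { success: true; order: OrderRecord }
  | { success: false; error: string };

// Single-purpose action: all business logic lives here.
class CreateOrder {
  private nextId = 1;

  execute(userId: number, organizationId: number | null): CreateOrderOutcome {
    if (organizationId === null) {
      return { success: false, error: "User has no organization." };
    }
    const order: OrderRecord = { id: this.nextId++, userId, status: "pending" };
    return { success: true, order };
  }
}

// Web adapter: turns the result into a redirect target.
function webStore(action: CreateOrder, userId: number, organizationId: number | null): string {
  const result = action.execute(userId, organizationId);
  return result.success ? `redirect:/orders/${result.order.id}` : "redirect:back";
}

// API adapter: same action, JSON response instead.
function apiStore(action: CreateOrder, userId: number, organizationId: number | null): string {
  const result = action.execute(userId, organizationId);
  return JSON.stringify(result.success ? { data: result.order } : { error: result.error });
}
```

Neither adapter contains business rules, so a change to order creation happens in one place and both entry points pick it up.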

The Extraction Sequence

Like the services, I extracted Actions one at a time:

Action	From
`CreateOrderAction`	`OrdersController::store()`
`UpdateOrderStatusAction`	`OrdersController::approve/deny()`
`CreateTicketAction`	`TicketsController::store()`
`ApproveTicketAction`, `DenyTicketAction`	`TicketsController`
`EntityCalculator`	`EntityController`
`CreateEntityAction`	`EntityController`
`CreateOrganizationAction`	`OrganizationsController`
`CreateCustomOrderAction`	`CustomOrdersController`

Each extraction followed the same pattern:

Write tests for the existing behavior (if not already covered)
Create the Action class
Create the Result DTO
Move logic from the controller to the Action
Wire the controller to use the Action
Run the full test suite
Ship it

TDD drove the whole process. If I was extracting CreateOrderAction, I first wrote tests against the Action's execute() method directly. Those tests defined the contract. Then I moved the code.

Laravel Policies: Authorization Done Right

Before Policies, authorization was scattered across controllers:

// Inline role checks everywhere
if (!$user->isAdmin() && !$user->isOrgAdmin()) {
    abort(403);
}

// Or worse, duplicated across methods
if ($user->role->name !== 'Admin') {
    abort(403);
}

This is a maintenance nightmare. Every controller does its own authorization. An agent building a new controller has to figure out which role checks to copy. Get it wrong, and you have a security hole.

Laravel Policies centralize authorization into one place per model:

namespace App\Policies;

class OrderPolicy
{
    public function view(User $user, Order $order): bool
    {
        if ($user->isAdmin()) {
            return true;
        }

        if ($user->isOrgAdmin()) {
            return $order->organization_id === $user->organization_id;
        }

        return $order->user_id === $user->id;
    }

    public function approve(User $user, Order $order): bool
    {
        return $user->isAdmin() || $user->isReviewer();
    }
}

Now the controller just says:

$this->authorize('view', $order);

One line. The Policy handles the role logic. The controller doesn't know or care about roles.
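For readers on other stacks, the policy idea is just "one object per model that answers yes or no." A TypeScript sketch mirroring the rules in OrderPolicy above, with hypothetical type names and an `authorize` helper that stands in for the framework's 403 handling:

```typescript
type RoleName = "admin" | "org_admin" | "reviewer" | "member";

interface Actor {
  id: number;
  organizationId: number;
  role: RoleName;
}

interface PolicyOrder {
  userId: number;
  organizationId: number;
}

// One place that decides who may do what with an order.
const orderPolicy = {
  view(actor: Actor, order: PolicyOrder): boolean {
    if (actor.role === "admin") return true;
    if (actor.role === "org_admin") {
      return order.organizationId === actor.organizationId;
    }
    return order.userId === actor.id;
  },

  approve(actor: Actor): boolean {
    return actor.role === "admin" || actor.role === "reviewer";
  },
};

// Stand-in for the framework's authorize(): throw when the policy says no.
function authorize(allowed: boolean): void {
  if (!allowed) {
    throw new Error("403 Forbidden");
  }
}
```

A controller would call something like `authorize(orderPolicy.view(actor, order))` and contain no role checks of its own.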

Role-Scoped Query Builders

Related to Policies, I also extracted role-scoped query builders. Instead of every controller having:

if ($user->isAdmin()) {
    $orders = Order::all();
} elseif ($user->isOrgAdmin()) {
    $orders = Order::where('organization_id', $user->organization_id)->get();
} else {
    $orders = Order::where('user_id', $user->id)->get();
}

We now have:

$orders = OrderQueryBuilder::forUser($user)->get();

The query builder encapsulates the scoping logic. The controller doesn't know about roles. The agent doesn't need to figure out scoping. It just calls ::forUser($user).
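The same scoping can be sketched in TypeScript as a single function over an in-memory list. The types and `ordersForUser` are hypothetical; a real implementation would build a database query rather than filter an array, but the branching is the part worth centralizing.

```typescript
interface ScopedUser {
  id: number;
  organizationId: number;
  role: "admin" | "org_admin" | "member";
}

interface ScopedOrder {
  id: number;
  userId: number;
  organizationId: number;
}

// Return only the orders this user is allowed to see.
function ordersForUser(user: ScopedUser, orders: ScopedOrder[]): ScopedOrder[] {
  if (user.role === "admin") {
    return orders.slice(); // admins see everything
  }
  if (user.role === "org_admin") {
    return orders.filter((order) => order.organizationId === user.organizationId);
  }
  return orders.filter((order) => order.userId === user.id);
}
```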

The Cross-Company Security Fix That Started It All

I should mention what kicked this off. Early in the project, we found cross-company data leakage. The bug never reached production, but it was serious: users from Company A could see data from Company B because the controllers weren't consistently scoping queries.

A couple of PRs fixed the immediate issues. But the root cause was architectural: authorization logic was scattered and inconsistent. There was no single source of truth for "who can see what."

Policies and role-scoped query builders weren't just a nice refactoring. They were the systemic fix for a class of security bugs. Once every query goes through OrderQueryBuilder::forUser() and every action checks $this->authorize(), cross-organization leakage becomes structurally impossible.

This is what "design for security" looks like in practice. Not penetration testing after the fact, but making the insecure path harder to write than the secure one.

Why Obvious Architecture Beats Documentation

After these refactorings, the codebase has a clear pattern:

Request → Controller → authorize() → Action::execute() → Result DTO → Response
                                        ↓
                              Services (notifications, CRM, etc.)
                              Events (audit trails)
                              Query Builders (scoped data access)

When Claude needs to build a new feature, it doesn't need a 50-page architecture document. It can look at any existing Action and follow the same pattern. It can look at any existing Policy and understand the authorization model.

Make the right thing easy and the wrong thing hard. If the Action pattern is established and every existing feature uses it, the agent will use it too. If authorization goes through Policies, the agent will add policy checks. Not because it read documentation, but because that's the pattern it sees everywhere.

This is "convention over configuration" taken to its logical conclusion. The codebase is the documentation.
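As a rough illustration of the flow in the diagram, here is a framework-free sketch. Every class name, the role check, and the [status, body] return shape are hypothetical stand-ins for Laravel's request, policy, and response machinery. The thing to notice is how little the controller does: authorize, delegate, format.

```php
<?php
declare(strict_types=1);

// Hypothetical result DTO: the Action returns data, not an HTTP response.
final class ApproveOrderResult
{
    public function __construct(public int $orderId, public string $status) {}
}

// Hypothetical policy: the single place that answers "may this user approve?".
final class OrderPolicy
{
    public function approve(string $role): bool
    {
        return $role === 'admin'; // assumed rule for the sketch
    }
}

// Hypothetical Action: business logic only, no request or response objects.
final class ApproveOrderAction
{
    public function execute(int $orderId): ApproveOrderResult
    {
        return new ApproveOrderResult($orderId, 'approved');
    }
}

// Thin controller: authorize, delegate, format. Returns [status code, body].
final class OrderController
{
    public function __construct(
        private OrderPolicy $policy,
        private ApproveOrderAction $action,
    ) {}

    /** @return array{0:int, 1:array<string, mixed>} */
    public function approve(string $role, int $orderId): array
    {
        if (!$this->policy->approve($role)) {
            return [403, ['error' => 'forbidden']];
        }
        $result = $this->action->execute($orderId);

        return [200, ['id' => $result->orderId, 'status' => $result->status]];
    }
}

$controller = new OrderController(new OrderPolicy(), new ApproveOrderAction());

echo json_encode($controller->approve('admin', 7)), PHP_EOL;  // [200,{"id":7,"status":"approved"}]
echo json_encode($controller->approve('member', 7)), PHP_EOL; // [403,{"error":"forbidden"}]
```

A second feature written by copying this shape ends up with the same authorization check and the same DTO boundary, which is the self-reinforcing effect described above.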

The Takeaway

Fat controllers are an agent liability. If your logic lives in controllers, the agent will copy-paste across controllers and create drift.
Actions create a migration bridge. Same logic, different response format. Web today, API tomorrow.
Policies centralize authorization. One source of truth beats scattered inline checks.
Query builders centralize scoping. Role-based data access in one place, not everywhere.
Architecture is the best documentation. Clear patterns are self-reinforcing — the agent follows what it sees.

At this point in the project, we had: tests, linting, CI, services with contracts, Actions with DTOs, Policies, and query builders. The codebase was getting healthy. But we still had a big problem: two frontends.

Traits to Services: Refactoring for Testability (and for Agents)

Ian Johnson — Mon, 06 Apr 2026 15:35:33 +0000

The Trait Problem

PHP traits are seductive. You've got some chat notification logic. Four controllers need it. Slap it in a trait, use ChatNotificationTrait, done.

Except now you have:

Hidden dependencies — the trait calls $this-> methods that don't exist in the trait itself
Invisible coupling — change the trait, break four controllers, good luck figuring out which ones
Untestable logic — you can't unit test a trait in isolation because it doesn't exist in isolation
Global state smell — traits encourage reaching into the controller's properties
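Here is a small sketch of the hidden-dependency problem, using a hypothetical version of the trait (the property name, the postJson() helper, and both controllers are assumptions, not the project's code). The trait is valid PHP on its own, but it only works inside a class that happens to define the members it reaches for.

```php
<?php
declare(strict_types=1);

// Hypothetical trait: nothing in its signature says what the host class must provide.
trait ChatNotificationTrait
{
    public function sendChatNotification(string $message): void
    {
        // Hidden dependencies: $this->webhookUrl and $this->postJson()
        // are not defined anywhere in the trait.
        $this->postJson($this->webhookUrl, ['text' => $message]);
    }
}

// Works only because this controller happens to define both members.
final class OrderController
{
    use ChatNotificationTrait;

    /** @var list<array{url:string, payload:array}> */
    public array $sent = [];

    private string $webhookUrl = 'https://chat.example.test/hook';

    private function postJson(string $url, array $payload): void
    {
        $this->sent[] = ['url' => $url, 'payload' => $payload];
    }
}

// Same trait, different host: the class definition is accepted,
// and the problem only surfaces when the method is called.
final class TicketController
{
    use ChatNotificationTrait;
}

(new OrderController())->sendChatNotification('Order approved');

try {
    (new TicketController())->sendChatNotification('Ticket opened');
} catch (Error $e) {
    echo get_class($e), ': ', $e->getMessage(), PHP_EOL;
}
```

Nothing warns you when TicketController is declared. An agent copying the use statement into a new controller gets no signal until that code path actually runs.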

This codebase had six traits doing serious work:

Trait	What It Did
`ChatNotificationTrait`	Send chat webhooks
`CrmApiTrait`	CRM sync
`OcrScanApiTrait`	OCR via document scanning API
`ConvertApiTrait`	Document conversion
`ExternalApiTrait`	Third-party API integration
`CalculationTrait`	Domain calculations

Each one was used in multiple controllers. Each one mixed HTTP client logic, business rules, error handling, and configuration into a single use statement. Testing any of it meant testing the entire controller.

The Extraction Plan

I planned all six extractions upfront but executed them one at a time, in separate PRs. Each PR:

Created the contract (interface)
Created the service implementation
Bound the interface in the service provider
Updated all controllers to inject the service instead of using the trait
Ran the full test suite

The trait stayed in the codebase until every consumer was migrated. Then it got deleted. At no point was the application broken.
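In Laravel the binding step is a call such as $this->app->bind(...) inside a service provider. To show what that buys you without pulling in the framework, here is a toy container. The Container class, the single-method interface, and EmailNotificationService are all hypothetical simplifications for this sketch.

```php
<?php
declare(strict_types=1);

// Simplified contract for the sketch (the real interface has order and ticket methods).
interface NotificationInterface
{
    public function send(string $message): string;
}

final class ChatNotificationService implements NotificationInterface
{
    public function send(string $message): string
    {
        return "chat: {$message}";
    }
}

// Hypothetical second implementation, used only to show a swap.
final class EmailNotificationService implements NotificationInterface
{
    public function send(string $message): string
    {
        return "email: {$message}";
    }
}

// Toy container: maps an interface name to a factory, like a service provider binding.
final class Container
{
    /** @var array<string, callable():object> */
    private array $bindings = [];

    public function bind(string $abstract, callable $factory): void
    {
        $this->bindings[$abstract] = $factory;
    }

    public function make(string $abstract): object
    {
        if (!isset($this->bindings[$abstract])) {
            throw new RuntimeException("Nothing bound for {$abstract}");
        }
        return ($this->bindings[$abstract])();
    }
}

$app = new Container();
$app->bind(
    NotificationInterface::class,
    static fn (): NotificationInterface => new ChatNotificationService()
);

echo $app->make(NotificationInterface::class)->send('Order approved'), PHP_EOL; // chat: Order approved

// Swapping the implementation is a one-line change; consumers are untouched.
$app->bind(
    NotificationInterface::class,
    static fn (): NotificationInterface => new EmailNotificationService()
);

echo $app->make(NotificationInterface::class)->send('Order approved'), PHP_EOL; // email: Order approved
```

Because controllers ask for the interface, each one can be migrated off the trait independently while the binding stays the same.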

This is the "make the change easy, then make the easy change" principle from Kent Beck. Each extraction was a small, safe step. The tests caught any behavioral changes. The linting caught any structural issues.

Contract-First Design

Every extraction started with an interface:

namespace App\Services\Notifications\Contracts;

interface NotificationInterface
{
    public function sendOrderNotification(Order $order, string $message): void;
    public function sendTicketNotification(Ticket $ticket, string $message): void;
}

Then the implementation:

namespace App\Services\Notifications;

class ChatNotificationService implements NotificationInterface
{
    public function __construct(
        private string $webhookUrl,
        private HttpClient $http,
    ) {}

    public function sendOrderNotification(Order $order, string $message): void
    {
        $this->send($this->formatOrderMessage($order, $message));
    }

    // ...
}

Why contracts? Three reasons:

Testability — you can mock NotificationInterface in tests without caring about chat webhooks
Swappability — when we eventually move from chat webhooks to a different notification channel, the interface stays the same
Boundaries — the interface defines what the service does. The implementation defines how. Consumers only know about the what.
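The testability point can be sketched without any framework. Below, Order, ApproveOrder, and RecordingNotifier are hypothetical names, and the interface is trimmed to one method to keep the example short. The consumer depends only on NotificationInterface, so a test can hand it a recording fake and never touch a webhook.

```php
<?php
declare(strict_types=1);

// Minimal stand-in for the app's Order model (assumption for this sketch).
final class Order
{
    public function __construct(public int $id, public string $status = 'pending') {}
}

// Trimmed copy of the contract: one method is enough to show the idea.
interface NotificationInterface
{
    public function sendOrderNotification(Order $order, string $message): void;
}

// Test double: records calls instead of hitting a chat webhook.
final class RecordingNotifier implements NotificationInterface
{
    /** @var list<array{orderId:int, message:string}> */
    public array $calls = [];

    public function sendOrderNotification(Order $order, string $message): void
    {
        $this->calls[] = ['orderId' => $order->id, 'message' => $message];
    }
}

// Hypothetical consumer: it knows the "what" (the interface), not the "how".
final class ApproveOrder
{
    public function __construct(private NotificationInterface $notifier) {}

    public function execute(Order $order): void
    {
        $order->status = 'approved';
        $this->notifier->sendOrderNotification($order, 'Approved');
    }
}

$notifier = new RecordingNotifier();
$order = new Order(id: 42);

(new ApproveOrder($notifier))->execute($order);

var_dump($order->status === 'approved'); // bool(true)
var_dump(count($notifier->calls) === 1); // bool(true)
```

Moving to a different notification channel would mean writing one more class that implements the interface. ApproveOrder would not change.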

The Extraction Sequence

I did these in a deliberate order, starting with the simplest:

1. ChatNotificationTrait → ChatNotificationService
Simplest extraction. HTTP webhook calls with message formatting. No complex state.

2. CrmApiTrait → CRM service classes
More complex — bulk write API, sync tracking, DTO transformations. But the interface was clean: sync users, sync orders.

3. OcrScanApiTrait → DocumentScanner service
OCR integration. Extracted behind a DocumentScannerInterface so we could swap OCR providers later.

4. ConvertApiTrait → Document conversion services
Document format conversion. Straightforward HTTP client wrapper.

5. ExternalApiTrait → ExternalApiClient service
Third-party API integration. Authentication, request signing, response parsing.

6. CalculationTrait → CalculatorService
The most complex extraction. Domain calculation logic with historical configuration tracking. This one needed a ConfigHistory model to properly separate the calculation from the controller state.

Each one took about a day. The full sequence took about two weeks. At no point was the app broken. Users never noticed.

Behind Enough Abstraction

The key phrase is "behind enough abstraction for things to continue working." When you extract ChatNotificationTrait into ChatNotificationService, the controllers that were using $this->sendChatNotification() now call $this->notificationService->sendOrderNotification().

But you don't change all the controllers at once. You:

Create the service
Bind it in the service provider
Update one controller
Run the tests
Update the next controller
Run the tests again

If something breaks, you know exactly which controller change caused it. Small steps. Fast feedback. Empiricism over dogma.

Why Boundaries Help Agents

Here's the thing I didn't fully appreciate until later: clear boundaries help agents more than documentation.

When I later started using Claude to build features, the agent could look at App\Services\Notifications\Contracts\NotificationInterface and immediately understand:

What notification capabilities exist
What parameters they take
How to use them (inject the interface, call the method)

Compare that to the trait world, where the agent would have to:

Find the trait
Read the trait to understand what methods it provides
Figure out which controller properties the trait depends on
Hope it's using the trait correctly

The service interface is self-documenting. The trait is a mystery box.

Architecture is the best documentation for agents. If the code structure is clear, the agent doesn't need instructions. It can read the interfaces and follow the patterns.

The Infrastructure Angle

These extractions also cleaned up how we handle external integrations at the infrastructure level. Each service got its own configuration:

// config/services.php
'crm' => [
    'client_id' => env('CRM_CLIENT_ID'),
    'client_secret' => env('CRM_CLIENT_SECRET'),
    'sync_enabled' => env('CRM_SYNC_ENABLED', false),
    'realtime_sync' => env('CRM_REALTIME_SYNC', true),
    'queue' => env('CRM_SYNC_QUEUE', 'crm'),
],

And jobs that previously lived inside traits got extracted into proper Laravel jobs running on dedicated queues:

# CRM sync runs on its own Redis queue
# so it doesn't block order notifications
QUEUE_CONNECTION=redis
REDIS_QUEUE_DB=1

The queue worker in Docker can be spun up with a profile:

docker compose --profile queue up -d

This means CRM sync can be slow, flaky, or temporarily broken without affecting the rest of the application. The queue retries failed jobs. The dedicated queue means a CRM outage doesn't back up critical notifications.

Separating concerns in the code naturally led to separating concerns in the infrastructure. That's the kind of compound benefit you get from doing the refactoring properly.

The Test Coverage Story

Before the extractions: the controllers were tested, but the trait logic inside them was tested only indirectly. You couldn't test "does the chat message format correctly?" without making an HTTP request to the controller.

After the extractions: each service has its own test. The controller tests mock the service interface. The service tests verify the actual logic.

// Before: testing chat notification meant testing the controller
$this->actingAs($admin)->post('/orders/1/approve')
    ->assertOk(); // ...and hopefully the notification was sent?

// After: test the service directly
$service = new ChatNotificationService($webhookUrl, $mockHttp);
$service->sendOrderNotification($order, 'Approved');
$mockHttp->assertSent(/* ... */);

The test pyramid got healthier. More unit tests for services, fewer fat integration tests for controllers.

The Takeaway

Traits are a code smell when they contain business logic. If you're about to use AI agents on a trait-heavy codebase:

Identify your traits — especially the ones with external HTTP calls, complex logic, or shared state
Extract them behind interfaces — contract-first, one service at a time
Bind the interface in a service provider — so consumers inject the contract, not the implementation
Keep the trait until all consumers are migrated — then delete it
Run the full test suite after every change — this is non-negotiable

The result: clean boundaries, testable services, swappable implementations, and a codebase the agent can actually navigate.

Linting, Static Analysis, and the Pre-Commit Hook That Saved My Sanity

Ian Johnson — Mon, 06 Apr 2026 15:09:17 +0000

The Agent Writes Code. Who Checks It?

In the last post, I talked about why tests come first when working with an AI agent. Tests tell you if the code works. But they don't tell you if the code is good.

An agent will happily write code that passes all your tests and is also:

Inconsistently formatted
Full of type errors Psalm would catch
Using deprecated patterns
Missing semicolons in one file and using them in another

Tests catch behavioral bugs. Linting catches structural rot. You need both.

The Tooling Stack

I added five tools to this Laravel + React codebase, and each one closed a different gap:

Tool	Language	What It Catches
Pint	PHP	Code style (PSR-12, Laravel conventions)
Psalm	PHP	Static analysis (type errors, null safety, dead code)
Prettier	JS/CSS/Blade	Formatting (consistent whitespace, quotes, line length)
ESLint	TypeScript/React	Lint rules (unused vars, hook deps, accessibility)
TypeScript	TypeScript	Type checking (compile-time type safety)

One make lint command runs all five:

lint: pint psalm format eslint typecheck

If any of them fail, the change doesn't ship. Period.

Why Agents Need Checkable Standards

Here's the fundamental problem with giving an AI agent a style guide:

"Please use consistent formatting and follow our coding conventions."

That's a suggestion. The agent might follow it. It might not. You'll spend your review time catching style issues instead of reviewing logic.

Now compare:

$ make lint
Pint ........... FAIL (3 files)
Psalm .......... PASS
Prettier ....... FAIL (1 file)
ESLint ......... PASS
TypeScript ..... PASS

That's a fact. The agent can run make lint, see it failed, fix the issues, and run it again. No ambiguity. No judgment call. Pass or fail.

Prose guides are for humans. Machine-checkable standards are for agents.

This is the same principle behind CI/CD: don't rely on people to remember the rules. Encode the rules into tools that enforce them automatically.

Pre-Commit Hooks: The First Gate

I added a pre-commit hook early in the project. It runs a subset of checks before any commit lands.

This catches the most common issues before they even hit CI. When Claude generates code that's formatted wrong, the pre-commit hook blocks the commit. Claude sees the failure, fixes the formatting, and tries again. Zero human intervention.

The Compound Effect

Each tool on its own catches a category of problems. Together, they create something more powerful: a narrowing of the failure space.

Without any tools, the agent can produce code that's wrong in infinite ways — wrong behavior, wrong types, wrong format, wrong style, wrong patterns.

Add tests: now the behavior is constrained.
Add Pint: now the PHP style is constrained.
Add Psalm: now the types are constrained.
Add Prettier: now the JS formatting is constrained.
Add ESLint: now the React patterns are constrained.
Add TypeScript: now the frontend types are constrained.

Each layer removes an entire category of "wrong." What's left is a much smaller space of valid code. The agent's job gets easier because there are fewer ways to be wrong.

Think of it like bowling bumpers. Each tool is a bumper. The ball (the agent's code) can still miss the pins, but it can't end up in the gutter.

CI as the Final Gate

Pre-commit hooks are great, but they can be bypassed (accidentally or intentionally). CI is the gate that can't be skipped.

The GitHub Actions pipeline runs the full check suite on every push. It has four stages:

Build — Docker images pushed to GitHub Container Registry
Code Quality — make lint (Pint, Psalm, Prettier, ESLint, TypeScript)
Tests — make test (2,700+ PHP tests), make test-js (Vitest)
Deploy — Deployment webhook (only on main, only if everything passes)

Nothing merges to main without passing all four stages. This is the same pipeline for human-written code and agent-written code. No exceptions.

Infrastructure: Docker All the Way Down

One thing I want to call out — all of this runs in Docker. Every make target executes inside the Docker app container. The Makefile is the interface:

pint:
    docker compose exec app ./vendor/bin/pint --test

psalm:
    docker compose exec app ./vendor/bin/psalm

format:
    docker compose exec app npx prettier --check "resources/**/*.{js,ts,tsx,css,scss,blade.php}"

eslint:
    docker compose exec app npx eslint "resources/js/**/*.{ts,tsx}"

typecheck:
    docker compose exec app npx tsc --noEmit

This matters because:

Reproducibility — same PHP version, same Node version, same everything, everywhere
No "works on my machine" — if it passes in Docker locally, it passes in CI
The agent doesn't need to know about local setup — it just runs make lint

The Docker stack itself is Ubuntu 24.04 LTS with PHP-FPM, Nginx, MySQL 8.0, and optional Redis + queue worker containers. Everything defined in docker-compose.yml, everything started with make up.

The Queue and Background Jobs

The app processes background jobs: calculations, CRM syncing, notification dispatch. These run through Laravel's queue system backed by Redis:

# docker-compose.yml (queue profile)
redis:
  image: redis:7-alpine
  profiles: [queue]

queue-worker:
  build: .
  command: php artisan queue:work redis --queue=default,crm --sleep=3 --tries=3
  profiles: [queue]
  depends_on:
    - redis
    - mysql

In tests, QUEUE_CONNECTION=sync runs jobs synchronously so tests don't depend on Redis. In CI, same thing. In production, Redis handles the real work.

The point: infrastructure decisions like queue drivers, cache drivers, and session drivers all have test-mode equivalents. Getting these right early means the agent never has to think about them.

What I Learned

Add linting before you start feature work. Every feature you build without linting is a feature you'll have to retroactively lint later.
Make the fix commands obvious. If make lint fails, the error message should tell you to run make pint-fix or make format-fix. The agent reads these messages and acts on them.
Run everything in Docker. The consistency is worth the overhead. You never debug environment differences again.
CI is not optional. It's the only gate you can trust completely. Pre-commit hooks help, but CI enforces.
Each tool is a force multiplier. Pint alone doesn't transform your workflow. But Pint + Psalm + Prettier + ESLint + TypeScript + tests + CI + pre-commit hooks? That's a system. And systems compound.
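The second lesson, obvious fix commands, can be sketched as a small wrapper around the lint targets. This is a hypothetical script, not the project's Makefile: make pint-fix and make format-fix come from the text above, while the function name, the hint table, and the injected runner are assumptions. The runner is injected so the example can simulate a failure without invoking any real tools.

```php
<?php
declare(strict_types=1);

/**
 * Run every check, collect failures, and print the command that fixes each one.
 *
 * @param array<string, ?string> $checks check command => fix command (null if no auto-fix)
 * @param callable(string):int $run executes a command and returns its exit code
 * @return list<string> the check commands that failed
 */
function lintWithHints(array $checks, callable $run): array
{
    $failed = [];
    foreach ($checks as $check => $fix) {
        if ($run($check) === 0) {
            echo "{$check}: PASS", PHP_EOL;
            continue;
        }
        $failed[] = $check;
        $hint = $fix !== null ? "run `{$fix}` to fix" : 'fix manually';
        echo "{$check}: FAIL ({$hint})", PHP_EOL;
    }
    return $failed;
}

$checks = [
    'make pint'   => 'make pint-fix',
    'make psalm'  => null,
    'make format' => 'make format-fix',
];

// Simulated runner for the demo: pretend only Pint fails.
// A real runner could call passthru($command, $code) and return $code.
$fakeRun = static fn (string $command): int => $command === 'make pint' ? 1 : 0;

$failed = lintWithHints($checks, $fakeRun);
// Prints:
//   make pint: FAIL (run `make pint-fix` to fix)
//   make psalm: PASS
//   make format: PASS
```

The hint sits on the same line as the failure, so an agent reading the output has the next command without needing any project knowledge.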

The Emerging Pattern

Notice what's happening here. We're not building features yet. We're building the ability to build features safely.

Tests verify behavior. Linting verifies structure. CI verifies both, automatically, on every push. Docker makes it reproducible. The Makefile makes it accessible.

This is the foundation. In the next post, we'll start refactoring: extracting traits into services and pulling logic into Actions. Every change will be validated by this exact system.

When the agent writes a refactoring commit, it runs make lint and make test. If both pass, the refactoring preserved behavior and maintained code quality. That's not a guess. That's proof.
