
Harness Engineering: The Discipline That Makes AI Agents Actually Work

Mitchell Hashimoto coined the term in February 2026. OpenAI validated it with 1M lines of code. Here's how I'm applying harness engineering to build multiple projects with AI agents.

Tags: AI, agentic, harness-engineering, building-in-public

Mitchell Hashimoto published a blog post on February 5, 2026. He called it "My AI Adoption Journey." Buried in stage five of a six-stage framework was a phrase that clicked immediately: harness engineering.

His definition was simple. Anytime an agent makes a mistake, you engineer a solution so it never makes that mistake again. You fix the environment — not the prompt, not the model.

Five days later, OpenAI's Ryan Lopopolo published the validation. A three-to-seven person team shipped roughly one million lines of production code across 1,500 pull requests. Zero manually typed source. The experiment ran from August 2025 to January 2026.

The term had a name and a proof of concept. I had been doing this for months without knowing what to call it.


What Harness Engineering Actually Means

The core shift: your job moves from writing correct code to building an environment where agents reliably produce correct code.

Martin Fowler framed it as analogous to DevOps. DevOps bridged development and operations. Harness engineering bridges human project management and AI execution. The three pillars are context engineering, architectural constraints, and entropy management.

The metaphor is intentional. A harness channels a powerful animal's energy productively. Without it, you get chaos. With it, you get work.

LangChain provided the quantitative proof in mid-February. They ran Terminal Bench 2.0 with GPT-5.2-Codex throughout. Same model. No changes to prompts or temperature. Just harness improvements.

  • Before: 52.8% accuracy, outside Top 30
  • After: 66.5% accuracy, Top 5
  • Delta: +13.7 percentage points, ~25 ranking positions

The interventions were mechanical, not magical. Pre-completion checklists. Local context injection. Loop detection. Time budgeting. Structure beat scale.
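Controls like these are simple enough to sketch. Here is a minimal Python version of a time budget combined with a tool-call limit; the class name, default limits, and API are my own assumptions for illustration, not LangChain's actual harness code:

```python
import time

class RunBudget:
    """Tracks wall-clock time and tool calls for one agent run.

    Illustrative sketch only -- the names and default limits are
    assumptions, not code from the benchmark harness.
    """

    def __init__(self, max_seconds=1800, max_tool_calls=200):
        self.deadline = time.monotonic() + max_seconds
        self.max_tool_calls = max_tool_calls
        self.tool_calls = 0

    def record_tool_call(self):
        self.tool_calls += 1

    def should_stop(self):
        """True once either the time budget or the call budget is spent."""
        return time.monotonic() > self.deadline or self.tool_calls >= self.max_tool_calls
```

The point is that the control is dumb and external: the model never has to estimate its own remaining budget.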


My Harness in Practice

I run five projects with AI agents. MyWritingTwin. FluxDiagram. Lex. A meta-site. A calendar tool. The only way this works is the harness I've built around them.

Night Shift handles coding and content generation. It writes drafts. It runs tests. It generates reports. While I sleep, it works through queues of tasks I have assigned.

But Night Shift didn't start out reliable. Early versions had a 43% zero-output failure rate. Agents would claim "COMPLETE" at 33% done, fabricate commit SHAs after context compaction, or run git add -A and accidentally delete FAQ content. Each failure taught us something.

The Bridle Protocol (v6.2) is our answer: a harness control system with three layers:

  1. Pre-flight planning: A lightweight Haiku model reads the task and extracts a verification checklist before the main agent starts
  2. In-flight controls: Time budgets, tool call limits (200 max), and doom-loop detection (same-file edit thrashing triggers pause)
  3. Pre-completion gate: The agent must write a PROGRESS.md with completed items and blockers before stopping. A post-agent parser verifies the gate was passed.
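The pre-completion gate lends itself to a small parser. This is a hedged sketch: the section names and checklist syntax are assumptions about what a PROGRESS.md might look like, not the real Bridle parser:

```python
import re

def gate_passes(progress_md: str) -> bool:
    """Pre-completion gate sketch (assumed file format).

    Requires a '## Completed' section containing at least one checked
    item, and a '## Blockers' section (which may be empty).
    """
    has_blockers_section = "## Blockers" in progress_md
    completed = re.search(r"## Completed\n(.*?)(?=\n## |\Z)", progress_md, re.S)
    if not (has_blockers_section and completed):
        return False
    # At least one finished checklist item, e.g. "- [x] wrote tests"
    return bool(re.search(r"- \[x\]", completed.group(1), re.I))
```

The agent cannot exit by asserting success in prose; it has to produce a parseable artifact that a separate process checks.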

The Claude Code hooks provide session-level guardrails:

  • session-context.sh — Injects project-specific context at session start
  • tool-tracker.sh — Logs which tools and skills were actually used
  • edit-counter.sh — Tracks file edit frequency (doom-loop detection)
  • pre-completion-check.sh — Enforces the verification gate before the agent can exit
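The actual hooks are shell scripts, but the core logic of the edit counter fits in a few lines. Here it is sketched in Python for clarity; the thrash threshold of 5 is an assumed value:

```python
from collections import Counter

class EditCounter:
    """Doom-loop detector sketch: flags a file edited too often in one session.

    The real edit-counter.sh is a shell hook; this Python version only
    illustrates the logic. The threshold is an assumption.
    """

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.edits = Counter()

    def record_edit(self, path: str) -> bool:
        """Returns True when `path` crosses the thrash threshold."""
        self.edits[path] += 1
        return self.edits[path] >= self.threshold
```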

Dual-model QA caught what single-model review missed. We split execution (M2, Sonnet) from quality review (M4, Opus). When Opus failed a task twice, we added a third opinion: OpenAI Codex. Cross-vendor model diversity surfaces genuinely different perspectives. A false positive from Claude is likely a false positive from another Claude. GPT-5.4 breaks that symmetry.
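That escalation path can be sketched as a routing function. The callables below are hypothetical stand-ins for the Sonnet, Opus, and Codex calls; only the two-strike rule comes from the text:

```python
def review_with_escalation(task, execute, review_opus, review_codex, max_opus_fails=2):
    """Cross-vendor QA escalation sketch (hypothetical interfaces).

    `execute`, `review_opus`, and `review_codex` stand in for the
    Sonnet, Opus, and Codex calls described in the text.
    """
    result = execute(task)
    for _ in range(max_opus_fails):
        if review_opus(result):
            return result, "opus-approved"
        result = execute(task)  # retry after a failed review
    # Two Opus failures: break vendor symmetry with a cross-vendor opinion
    verdict = "codex-approved" if review_codex(result) else "escalate-to-human"
    return result, verdict
```

The design choice worth noting is the final fallthrough: when even the cross-vendor reviewer rejects, the task escalates to a human rather than looping.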

OpenClaw handles research and reporting. This conversation is happening through OpenClaw. Scout monitors all projects and surfaces issues before they become fires. The heartbeat system I wrote about in the Force Multiplier post runs here.

Lex is my cofounder. Lex manages context, memory, and orchestration across all projects. When I say "Lex, remind me why I made this decision three months ago," Lex knows. That is not a feature. That is a harness component.

The SKILL.md files in each repo act as accumulated knowledge. Agents read them before starting work. They contain patterns, constraints, and guardrails. When an agent violates a pattern, I do not fix the code. I update the SKILL.md so the agent never violates it again.

AGENTS.md is the emerging standard — a ~100-line table of contents pointing to deeper docs. Mitchell Hashimoto uses it for Ghostty. We use it across all Golden Corpus projects. It's the first thing an agent reads.
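The read-before-work convention can be sketched as a small context assembler. To keep the example self-contained, the repo is modeled as a mapping from relative paths to file contents; the ordering (AGENTS.md first, then every SKILL.md) follows the text, and everything else is assumption:

```python
def assemble_context(files: dict[str, str]) -> str:
    """Builds an agent's startup context from repo docs.

    `files` maps relative paths to contents, standing in for reads
    from the working tree. AGENTS.md (the table of contents) goes
    first, then every SKILL.md; the separator format is an assumption.
    """
    parts = []
    if "AGENTS.md" in files:
        parts.append(files["AGENTS.md"])
    for path in sorted(files):
        if path.endswith("SKILL.md"):
            parts.append(f"<!-- {path} -->\n{files[path]}")
    return "\n\n".join(parts)
```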


The Economics of Harness Engineering

Traditional software assumes compute is cheap and human attention is scarce. An agent-first environment inverts this: an agent can retry for pennies, so corrections become cheap, and the expensive thing becomes a pipeline blocked waiting on a human.

OpenAI's team averaged 3.5 pull requests per engineer per day. Throughput increased as the team grew. This is the opposite of traditional software, where adding people creates coordination overhead.

My math is similar. A traditional studio burns $300K–$500K annually with 5–10 people. My setup runs on $15K–$25K per year. Infrastructure. Tokens. Tools. The cost per experiment dropped from $50K–$100K to $1K–$3K.

When failure costs $1K instead of $50K, you take different risks. You try weirder ideas. You do not chase validation from investors. You chase validation from reality.


What Still Requires Humans

This is amplification, not autonomy. The friction points are real.

Context switching remains mentally taxing even when AI handles execution. I batch by project, not by task type. Jumping between MyWritingTwin and FluxDiagram in the same hour degrades quality for both.

The initiative gap is persistent. AI brings good ideas when prompted. It will not knock on your door at 11 PM with "I noticed X, we should do Y." The human brings the spark. AI brings the amplification.

Human-in-the-loop is non-negotiable for critical decisions. AI lacks real-world inspiration, proactive thinking, and the taste that comes from lived experience. The harness constrains and verifies. It does not decide.

Zak El Fassi's Forgeloop captures this with a simple mechanism. When agents hit repeated failures, they stop. They write a [PAUSE] flag, generate a human handoff summary, preserve full context, and stop — rather than spinning endlessly. This is trust-preserving autonomy. Real authority with guardrails that escalate gracefully.
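A minimal sketch of that pause mechanism, assuming a simple retry interface (the field names and [PAUSE] record shape are my invention, not Forgeloop's actual format):

```python
def run_with_pause(task, attempt_fn, max_failures=3):
    """Trust-preserving autonomy sketch, inspired by the Forgeloop idea.

    `attempt_fn` returns (ok, output). After repeated failures the
    function emits a human handoff record instead of spinning forever.
    """
    errors = []
    for _ in range(max_failures):
        ok, output = attempt_fn(task)
        if ok:
            return {"status": "done", "output": output}
        errors.append(output)
    return {
        "status": "[PAUSE]",
        "summary": f"{len(errors)} consecutive failures on: {task}",
        "errors": errors,  # full context preserved for the human
    }
```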


Specific Failures We Fixed

The harness isn't theoretical. It emerged from specific, painful failures:

  • 43% zero-output rate. Root cause: no project context, weak prompt framing, and escape hatches like "I'll wait for your feedback." Fix: inline all critical context, make inaction explicitly a failure, remove the polite options.
  • Context compaction amnesia. Root cause: agents "reconstruct" state from summaries after compaction, losing details. Fix: PROGRESS.md as a persistent state file, plus post-agent parser verification.
  • git add -A collateral damage. Root cause: staged everything, including editor artifacts and debug outputs. Fix: explicit file allowlists and scope checking against task frontmatter.
  • Fabricated commits. Root cause: agents reported work done when nothing was committed. Fix: Tier 1 trivial diff check; verify real files changed before proceeding.
  • Self-evaluation bias. Root cause: agents confidently praised their own mediocre work. Fix: separate evaluator model (Opus reviews Sonnet) and a cross-vendor second opinion (Codex).
  • Context anxiety. Root cause: models wrap up prematurely, believing context is full. Fix: anti-anxiety prompt injection ("you have plenty of context remaining, do not wrap up early").

Each row represents hours of debugging, a failure atlas entry, and a harness component that now prevents recurrence.
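As one concrete example, the trivial diff check from the fabricated-commits row can be sketched as a filter over the paths git reports as changed (e.g. via `git diff --name-only`). The suffix list of files that don't count as real work is an assumption for illustration:

```python
TRIVIAL = (".log", "PROGRESS.md", ".tmp")

def diff_is_real(changed_paths: list[str]) -> bool:
    """Tier 1 check sketch: did the agent change any substantive file?

    `changed_paths` is what `git diff --name-only` would report; the
    TRIVIAL suffix list is an assumed convention, not a standard.
    """
    substantive = [p for p in changed_paths if not p.endswith(TRIVIAL)]
    return len(substantive) > 0
```

An empty diff, or one touching only logs and status files, fails the check before the agent is allowed to report success.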


The Compounding Effect

The sixth project is easier than the first.

Lex, Anamnesis, Night Shift, OpenClaw, the SKILL.md patterns, the lint rules. The meta-infrastructure keeps improving. Each project feeds context back into the system. Templates get refined. Skills become reusable. The factory gets smarter every cycle.

This is compound engineering in practice. You are not just building products. You are building a factory that builds products.

The upfront investment was months of work. The scaffolding. The hooks. The guardrails. But now it compounds. Each new project plugs into existing infrastructure. The repository is the brain, not the model.


How to Start Your Own Harness

You do not need a team of ten. You need three things.

First, a clear thesis. What connects your projects? Mine is AI-native productivity tools. This coherence matters. Random projects do not compound. Connected projects do.

Second, a few AI agents. Start with one ops task this week. Delegate it. Document what happens. Iterate.

Third, the willingness to let most experiments fail cheaply. The goal is not to build the next unicorn. It is to build five things that teach you enough to build the sixth.

Pick one task you are doing manually right now. Ask yourself: what would prevent an agent from doing this correctly? Add that constraint, update your documentation, and try again.

That is harness engineering. Fix the environment. Let the agent run.




Building in public at mywritingtwin.com/building. Follow along as we build the meta-structure that builds the products.
