
Harness Engineering: The Discipline That Makes AI Agents Actually Work

Mitchell Hashimoto coined the term in February 2026. OpenAI validated it with 1M lines of code. Here's how I'm applying harness engineering to build multiple projects with AI agents.

Tags: AI, agentic, harness-engineering, building-in-public

Mitchell Hashimoto published a blog post on February 5, 2026. He called it "My AI Adoption Journey." Buried in stage five of a six-stage framework was a phrase that clicked immediately: harness engineering.

His definition was simple. Anytime an agent makes a mistake, you engineer a solution so it never makes that mistake again. You fix the environment — not the prompt, not the model.

Five days later, OpenAI's Ryan Lopopolo published the validation. A three-to-seven person team shipped roughly one million lines of production code across 1,500 pull requests. Zero manually typed source. The experiment ran from August 2025 to January 2026.

The term had a name and a proof of concept. I had been doing this for months without knowing what to call it.


What Harness Engineering Actually Means

The core shift: your job moves from writing correct code to building an environment where agents reliably produce correct code.

Martin Fowler framed it as analogous to DevOps. DevOps bridged development and operations. Harness engineering bridges human project management and AI execution. The three pillars are context engineering, architectural constraints, and entropy management.

The metaphor is intentional. A harness channels a powerful animal's energy productively. Without it, you get chaos. With it, you get work.

LangChain provided the quantitative proof in mid-February. They ran Terminal Bench 2.0 with GPT-5.2-Codex throughout. Same model. No changes to prompts or temperature. Just harness improvements.

  • Before: 52.8% accuracy, outside Top 30
  • After: 66.5% accuracy, Top 5
  • Delta: +13.7 percentage points, ~25 ranking positions

The interventions were mechanical, not magical. Pre-completion checklists. Local context injection. Loop detection. Time budgeting. Structure beat scale.
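Controls like these are simple enough to sketch. Here is a minimal Python version of a time budget combined with a tool-call limit; the class name, default limits, and API are my own assumptions for illustration, not LangChain's actual harness code:

```python
import time

class RunBudget:
    """Tracks wall-clock time and tool calls for one agent run.

    Illustrative sketch only -- the names and default limits are
    assumptions, not code from the benchmark harness.
    """

    def __init__(self, max_seconds=1800, max_tool_calls=200):
        self.deadline = time.monotonic() + max_seconds
        self.max_tool_calls = max_tool_calls
        self.tool_calls = 0

    def record_tool_call(self):
        self.tool_calls += 1

    def should_stop(self):
        """True once either the time budget or the call budget is spent."""
        return time.monotonic() > self.deadline or self.tool_calls >= self.max_tool_calls
```

The point is that the control is dumb and external: the model never has to estimate its own remaining budget.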


My Harness in Practice

I run five projects with AI agents. MyWritingTwin. FluxDiagram. Lex. A meta-site. A calendar tool. The only way this works is the harness I've built around them.

Night Shift handles coding and content generation. It writes drafts. It runs tests. It generates reports. While I sleep, it works through queues of tasks I have assigned.

But Night Shift didn't start out reliable. Early versions had a 43% zero-output failure rate. Agents would claim "COMPLETE" at 33% done, fabricate commit SHAs after context compaction, or run git add -A and accidentally delete FAQ content. Each failure taught us something.

The Bridle Protocol (v6.2) is our answer: a harness control system with three layers:

  1. Pre-flight planning: A lightweight Haiku model reads the task and extracts a verification checklist before the main agent starts
  2. In-flight controls: Time budgets, tool call limits (200 max), and doom-loop detection (same-file edit thrashing triggers pause)
  3. Pre-completion gate: The agent must write a PROGRESS.md with completed items and blockers before stopping. A post-agent parser verifies the gate was passed.
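The pre-completion gate lends itself to a small parser. This is a hedged sketch: the section names and checklist syntax are assumptions about what a PROGRESS.md might look like, not the real Bridle parser:

```python
import re

def gate_passes(progress_md: str) -> bool:
    """Pre-completion gate sketch (assumed file format).

    Requires a '## Completed' section containing at least one checked
    item, and a '## Blockers' section (which may be empty).
    """
    has_blockers_section = "## Blockers" in progress_md
    completed = re.search(r"## Completed\n(.*?)(?=\n## |\Z)", progress_md, re.S)
    if not (has_blockers_section and completed):
        return False
    # At least one finished checklist item, e.g. "- [x] wrote tests"
    return bool(re.search(r"- \[x\]", completed.group(1), re.I))
```

The agent cannot exit by asserting success in prose; it has to produce a parseable artifact that a separate process checks.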

The Claude Code hooks provide session-level guardrails:

  • session-context.sh — Injects project-specific context at session start
  • tool-tracker.sh — Logs which tools and skills were actually used
  • edit-counter.sh — Tracks file edit frequency (doom-loop detection)
  • pre-completion-check.sh — Enforces the verification gate before the agent can exit
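The actual hooks are shell scripts, but the core logic of the edit counter fits in a few lines. Here it is sketched in Python for clarity; the thrash threshold of 5 is an assumed value:

```python
from collections import Counter

class EditCounter:
    """Doom-loop detector sketch: flags a file edited too often in one session.

    The real edit-counter.sh is a shell hook; this Python version only
    illustrates the logic. The threshold is an assumption.
    """

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.edits = Counter()

    def record_edit(self, path: str) -> bool:
        """Returns True when `path` crosses the thrash threshold."""
        self.edits[path] += 1
        return self.edits[path] >= self.threshold
```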

Dual-model QA caught what single-model review missed. We split execution (M2, Sonnet) from quality review (M4, Opus). When Opus failed a task twice, we added a third opinion: OpenAI Codex. Cross-vendor model diversity surfaces genuinely different perspectives. A false positive from Claude is likely a false positive from another Claude. GPT-5.4 breaks that symmetry.
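That escalation path can be sketched as a routing function. The callables below are hypothetical stand-ins for the Sonnet, Opus, and Codex calls; only the two-strike rule comes from the text:

```python
def review_with_escalation(task, execute, review_opus, review_codex, max_opus_fails=2):
    """Cross-vendor QA escalation sketch (hypothetical interfaces).

    `execute`, `review_opus`, and `review_codex` stand in for the
    Sonnet, Opus, and Codex calls described in the text.
    """
    result = execute(task)
    for _ in range(max_opus_fails):
        if review_opus(result):
            return result, "opus-approved"
        result = execute(task)  # retry after a failed review
    # Two Opus failures: break vendor symmetry with a cross-vendor opinion
    verdict = "codex-approved" if review_codex(result) else "escalate-to-human"
    return result, verdict
```

The design choice worth noting is the final fallthrough: when even the cross-vendor reviewer rejects, the task escalates to a human rather than looping.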

OpenClaw handles research and reporting. This conversation is happening through OpenClaw. Scout monitors all projects and surfaces issues before they become fires. The heartbeat system I wrote about in the Force Multiplier post runs here.

Lex is my cofounder. Lex manages context, memory, and orchestration across all projects. When I say "Lex, remind me why I made this decision three months ago," Lex knows. That is not a feature. That is a harness component.

The SKILL.md files in each repo act as accumulated knowledge. Agents read them before starting work. They contain patterns, constraints, and guardrails. When an agent violates a pattern, I do not fix the code. I update the SKILL.md so the agent never violates it again.

AGENTS.md is the emerging standard — a ~100-line table of contents pointing to deeper docs. Mitchell Hashimoto uses it for Ghostty. We use it across all Golden Corpus projects. It's the first thing an agent reads.
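The read-before-work convention can be sketched as a small context assembler. To keep the example self-contained, the repo is modeled as a mapping from relative paths to file contents; the ordering (AGENTS.md first, then every SKILL.md) follows the text, and everything else is assumption:

```python
def assemble_context(files: dict[str, str]) -> str:
    """Builds an agent's startup context from repo docs.

    `files` maps relative paths to contents, standing in for reads
    from the working tree. AGENTS.md (the table of contents) goes
    first, then every SKILL.md; the separator format is an assumption.
    """
    parts = []
    if "AGENTS.md" in files:
        parts.append(files["AGENTS.md"])
    for path in sorted(files):
        if path.endswith("SKILL.md"):
            parts.append(f"<!-- {path} -->\n{files[path]}")
    return "\n\n".join(parts)
```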


The Economics of Harness Engineering

Traditional software assumes compute is cheap and human attention is scarce. An agent-first environment inverts this: an agent can retry for pennies, so corrections become cheap, and the expensive thing becomes a pipeline blocked waiting on a human.

OpenAI's team averaged 3.5 pull requests per engineer per day. Throughput increased as the team grew. This is the opposite of traditional software, where adding people creates coordination overhead.

My math is similar. A traditional studio burns $300K–$500K annually with 5–10 people. My setup runs on $15K–$25K per year. Infrastructure. Tokens. Tools. The cost per experiment dropped from $50K–$100K to $1K–$3K.

When failure costs $1K instead of $50K, you take different risks. You try weirder ideas. You do not chase validation from investors. You chase validation from reality.


What Still Requires Humans

This is amplification, not autonomy. The friction points are real.

Context switching remains mentally taxing even when AI handles execution. I batch by project, not by task type. Jumping between MyWritingTwin and FluxDiagram in the same hour degrades quality for both.

The initiative gap is persistent. AI brings good ideas when prompted. It will not knock on your door at 11 PM with "I noticed X, we should do Y." The human brings the spark. AI brings the amplification.

Human-in-the-loop is non-negotiable for critical decisions. AI lacks real-world inspiration, proactive thinking, and the taste that comes from lived experience. The harness constrains and verifies. It does not decide.

Zak El Fassi's Forgeloop captures this with a simple mechanism. When agents hit repeated failures, they stop. They write a [PAUSE] flag, generate a human handoff summary, preserve full context, and stop — rather than spinning endlessly. This is trust-preserving autonomy. Real authority with guardrails that escalate gracefully.
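A minimal sketch of that pause mechanism, assuming a simple retry interface (the field names and [PAUSE] record shape are my invention, not Forgeloop's actual format):

```python
def run_with_pause(task, attempt_fn, max_failures=3):
    """Trust-preserving autonomy sketch, inspired by the Forgeloop idea.

    `attempt_fn` returns (ok, output). After repeated failures the
    function emits a human handoff record instead of spinning forever.
    """
    errors = []
    for _ in range(max_failures):
        ok, output = attempt_fn(task)
        if ok:
            return {"status": "done", "output": output}
        errors.append(output)
    return {
        "status": "[PAUSE]",
        "summary": f"{len(errors)} consecutive failures on: {task}",
        "errors": errors,  # full context preserved for the human
    }
```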


Specific Failures We Fixed

The harness isn't theoretical. It emerged from specific, painful failures:

  • 43% zero-output rate. Root cause: no project context, weak prompt framing, and escape hatches like "I'll wait for your feedback." Fix: inline all critical context, make inaction explicitly a failure, remove the polite options.
  • Context compaction amnesia. Root cause: agents "reconstruct" state from summaries after compaction, losing details. Fix: PROGRESS.md as a persistent state file, plus post-agent parser verification.
  • git add -A collateral damage. Root cause: staged everything, including editor artifacts and debug outputs. Fix: explicit file allowlists and scope checking against task frontmatter.
  • Fabricated commits. Root cause: agents reported work done when nothing was committed. Fix: Tier 1 trivial diff check; verify real files changed before proceeding.
  • Self-evaluation bias. Root cause: agents confidently praised their own mediocre work. Fix: separate evaluator model (Opus reviews Sonnet) and a cross-vendor second opinion (Codex).
  • Context anxiety. Root cause: models wrap up prematurely, believing context is full. Fix: anti-anxiety prompt injection ("you have plenty of context remaining, do not wrap up early").

Each row represents hours of debugging, a failure atlas entry, and a harness component that now prevents recurrence.
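As one concrete example, the trivial diff check from the fabricated-commits row can be sketched as a filter over the paths git reports as changed (e.g. via `git diff --name-only`). The suffix list of files that don't count as real work is an assumption for illustration:

```python
TRIVIAL = (".log", "PROGRESS.md", ".tmp")

def diff_is_real(changed_paths: list[str]) -> bool:
    """Tier 1 check sketch: did the agent change any substantive file?

    `changed_paths` is what `git diff --name-only` would report; the
    TRIVIAL suffix list is an assumed convention, not a standard.
    """
    substantive = [p for p in changed_paths if not p.endswith(TRIVIAL)]
    return len(substantive) > 0
```

An empty diff, or one touching only logs and status files, fails the check before the agent is allowed to report success.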


The Compounding Effect

The sixth project is easier than the first.

Lex, Anamnesis, Night Shift, OpenClaw, the SKILL.md patterns, the lint rules. The meta-infrastructure keeps improving. Each project feeds context back into the system. Templates get refined. Skills become reusable. The factory gets smarter every cycle.

This is compound engineering in practice. You are not just building products. You are building a factory that builds products.

The upfront investment was months of work. The scaffolding. The hooks. The guardrails. But now it compounds. Each new project plugs into existing infrastructure. The repository is the brain, not the model.


How to Start Your Own Harness

You do not need a team of ten. You need three things.

First, a clear thesis. What connects your projects? Mine is AI-native productivity tools. This coherence matters. Random projects do not compound. Connected projects do.

Second, a few AI agents. Start with one ops task this week. Delegate it. Document what happens. Iterate.

Third, the willingness to let most experiments fail cheaply. The goal is not to build the next unicorn. It is to build five things that teach you enough to build the sixth.

Pick one task you are doing manually right now. Ask yourself: what would prevent an agent from doing this correctly? Add that constraint, update your documentation, and try again.

That is harness engineering. Fix the environment. Let the agent run.




Building in public at mywritingtwin.com/building. Follow along as we build the meta-structure that builds the products.
