The Unsung Engineering Behind Every AI Agent You've Used

If you've been building with LLMs and wondering why some setups crush benchmarks while others choke on basic tasks, Vivek Trivedy has a framework that might finally explain the gap. In a detailed breakdown on the LangChain blog, he argues that an "agent" is really just two things: a model and a harness—and most of the differentiation lives in the harness. The model contains raw intelligence; the harness is everything else that makes that intelligence actually useful. This isn't just semantic hair-splitting. It's a design philosophy that's reshaping how serious shops build autonomous systems.

Defining the Harness

Trivedy's definition is refreshingly blunt: if you're not the model, you're the harness. That includes system prompts, tools and MCP integrations, bundled infrastructure like filesystems and sandboxes, orchestration logic for spawning subagents and routing between models, and middleware hooks that enforce deterministic execution patterns like context compaction or lint checks. The cleanest split he offers is this—raw models output text from text (or images, audio). They can't maintain durable state across interactions, execute code, pull realtime knowledge, or spin up environments to do work. All of that infrastructure? Pure harness territory. When you're using a while loop to track conversation history and append user messages, you're already writing harness code. You just might not have called it that.

Core Primitives That Unlock Agent Behavior

The article walks through several foundational primitives starting with filesystems. Trivedy argues this is arguably the most critical harness primitive because it solves multiple problems at once—agents get a workspace to read data and documentation, work can be offloaded instead of crammed into context windows, and state persists across sessions. Multiple agents (and humans) can coordinate through shared files, which enables architectures like Agent Teams. Git adds versioning so agents can track progress, rollback mistakes, and branch experiments. For memory beyond session boundaries, harnesses support standards like AGENTS.md—a file that gets injected into the agent's context on startup, effectively giving models continual learning capabilities without weight editing. Web search and MCP tools like Context7 handle knowledge cutoffs by letting agents pull current information about library versions or data that didn't exist when training stopped.

Battling Context Rot

Here's a problem most developers don't think about until they're deep in a long task: agent performance degrades as context windows fill up. Trivedy calls this "Context Rot" and it's a real phenomenon—models get worse at reasoning as their working memory gets cluttered. Harnesses today are largely delivery mechanisms for good context engineering, which means managing this rot is table stakes. Compaction intelligently offloads and summarizes existing context when windows approach capacity so the agent can keep working. Tool call offloading keeps only the head and tail tokens of large tool outputs while storing full results in the filesystem for potential retrieval later. Skills solve a different flavor of the problem—when too many tools or MCP servers load into context on startup, performance tanks before work even begins. Progressive disclosure via skills lets models access specialized capabilities without drowning in descriptions at initialization.

Long Horizon Autonomous Execution

This is where Trivedy gets into the really interesting engineering. The holy grail for coding agents is autonomous software creation that works correctly over hours or days of work. But today's models struggle with early stopping, decompose complex problems poorly, and lose coherence across multiple context windows. Solving this requires compounding several harness primitives. Ralph Loops intercept when a model tries to exit and reinject the original prompt in a fresh context window, forcing continued work against completion goals. Planning via prompted decomposition keeps agents on track while self-verification through test suites or model self-evaluation creates feedback loops for correctness. The filesystem makes all of this possible because each iteration starts clean but reads state from previous work—agents can pick up where they left off rather than starting over every context window.

The Future: Training and Harness Co-Evolution

Trivedy highlights an interesting trend: products like Claude Code and Codex are post-trained with harnesses in the loop, which means models improve at specific harness-native behaviors like filesystem operations, bash execution, planning, and parallelizing work. This creates a feedback cycle where useful primitives get discovered, added to harnesses, then baked into next-generation model training. But this co-evolution has side effects. He points to Codex-5.3's apply_patch tool logic as an example—models trained with specific harness patterns in mind show degraded performance when those patterns change, even if the underlying task is identical. The Terminal Bench 2.0 Leaderboard tells a more surprising story: Opus 4.6 running in Claude Code scores far below Opus 4.6 in optimized third-party harnesses. Trivedy's own team improved their coding agent from Top 30 to Top 5 on that benchmark by changing nothing but the harness.

Key Takeaways

An "agent" is Model + Harness—everything that's not the model itself lives in the harness layer
Filesystems are arguably the most foundational harness primitive, enabling durable state, context management, and multi-agent collaboration
Context Rot is a real problem; compaction, tool offloading, and skills are all harness-level solutions to it
Long-horizon autonomy requires compounding primitives: Ralph Loops, planning, self-verification, and persistent filesystem state
Terminal Bench 2.0 proves harnesses matter enormously—one model can jump from Top 30 to Top 5 with better harness engineering alone

The Bottom Line

Trivedy is right that most of the interesting engineering in AI systems lives below the model layer. As models get more capable, some harness features will migrate inward—but just as prompt engineering remains valuable despite increasingly capable base models, harness engineering will stay relevant for building genuinely effective agents. If you're not thinking seriously about your harness design, you're leaving significant performance on the table.

> The Unsung Engineering Behind Every AI Agent You've Used