LLM agents are drowning in their own toolboxes. A single agent today stacks function-calling tools, Model Context Protocol (MCP) endpoints, and procedural skills—each described in context on every step, each paid for whether it's used or not. Anthropic's own measurements show 58 tools consuming roughly 55,000 tokens of context before a user asks their first question, with community reports describing real-world setups exceeding 140,000 tokens of tool definitions. The catalogue keeps growing, selection accuracy keeps falling, and critically, the system learns nothing from the outcome of each call. Wrong tools get loaded again next time. Failed parameters repeat. That dead-end exploration—the wrong skill description, the failing invocation, the retry—is where the real cost concentrates.

What Stigmergy Brings to the Table

The solution proposed by developer Sebastian Hanke borrows directly from how ant and termite colonies coordinate without a plan or a leader. French zoologist Pierre-Paul Grassé coined "stigmergy" in 1959 from Greek roots meaning "work that directs through marks." Termites build structured nests with no blueprint because each one reads the trace its predecessors left in the world—the half-built pillar tells the next termite to continue there. Ants deposit pheromone trails that accumulate faster on shorter paths, drawing more walkers and accumulating faster still. No individual ant compares distances; the solution lives in the field itself. Marco Dorigo formalized this as Ant Colony Optimization in his 1992 thesis, introducing selection rules controlled by an exploitation-versus-exploration parameter (q0) and bounding pheromone values to prevent both starvation of unused edges and lock-in on a single path.

The Pheromone Graph Design

The transfer Hanke proposes is direct: capability sequences become a directed graph whose edges carry pheromone values. When a task succeeds, the edges traversed get reinforced. When a tool call fails or a skill proves wrong, those paths decay through lazy exponential evaporation. Selection uses Thompson sampling over decayed evidence—probabilistic but seeded deterministically so results are reproducible. The deposit amount itself is gated on both how good and how cheap the outcome was, meaning efficient successes reinforce more strongly than expensive wins. This isn't a static retrieval system; it's an actual feedback loop where past outcomes shape future choices in real time.

Prompt Caching Changes Everything

Here's where Hanke makes a crucial observation that changes the typical optimization approach: when hosts keep tool definitions in a cached prompt prefix—which is now standard and reaches high hit rates—any per-step edit to that block breaks the cache from the edit point onward. Narrowing the visible set trades a documented standing cost for a recurring cache-miss cost, making it no longer an unambiguous win. The solution: keep every capability visible so the cache stays warm, but add learned paths as small per-turn context after the breakpoint. Think of it as guidance rather than filtering—the proven sequence from task to solution that matches the current problem, surfaced before the model even thinks about what to call next.

Technical Architecture and Trust

The implementation uses a hexagonal core behind two ports with a proactive pre-call step inside the controlled loop—meaning the layer can surface learned-path guidance ahead of the model while the cached tool block stays intact. Everything runs locally on the user's machine with no network egress; selection is deterministic given a seed, and there's a strict local-first boundary for trust and portability. Bounded pheromone values prevent unused capabilities from starving completely while also preventing runaway lock-in on early successful paths. The system refines an already-warm field rather than converging from cold start online, with heavy convergence pushed offline where the harness can run many parallel "ants" per task at once.

Key Takeaways

  • Anthropic measured 58 tools at ~55,000 tokens before user input; community reports describe setups exceeding 140,000 tokens of tool definitions
  • The cost concentrates in failed exploration (wrong skill loads, failing calls, retries), not the one successful invocation itself
  • Prompt caching makes narrowing visible tools counterproductive—path guidance keeps caches warm while still directing selection
  • Stigmergy provides positive feedback (reinforce success), negative feedback (evaporation/forgetting), and controlled stochasticity for exploration
  • The design is local-first, deterministic, and runs entirely on the user's machine with no network calls required

The Bottom Line

This is a genuinely interesting transfer of a well-understood natural principle to a concrete engineering problem. The prompt caching insight alone—that narrowing breaks caches while path guidance doesn't—is worth the price of admission. The empirical study hasn't run yet, so treat this as a design paper with promising preliminary data rather than proven infrastructure. But if the token-reduction numbers hold up in controlled testing, this could become the standard way to handle capability selection in agent frameworks that can't afford to keep relearning the same dead ends.