If you've been watching AI coding agents ship increasingly broken output and wondering whether it's you, the tooling, or just bad luck, Microsoft's new deep dive into Agent Experience (AX) suggests you're not wrong to be frustrated. Principal Developer Advocate Waldek Mastykarz published a detailed breakdown this week explaining why these tools consistently fumble APIs, generate deprecated SDK calls, and default to competing technologies even when yours is the right fit.
The Three-Layer Stack That Determines Everything
When a developer prompts an AI coding agent, that instruction travels through three distinct layers: the model itself, the harness (Copilot, Claude Code CLI, Cursor), and your agent extensions. Here's the uncomfortable truth Mastykarz lays out: two of those three are fixed constraints you cannot change. The model carries whatever biases it picked up in training, including outdated documentation patterns and a preference for technologies that had more training data. You can't retrain it or modify its weights. The harness controls system prompts, tool-calling protocols, and context assembly, and different harnesses interpret the same MCP server differently, so an extension that works perfectly in Copilot might break entirely in Claude Code CLI.
Why Your Extensions Are Fighting Each Other
The third layer—agent extensions—is where your leverage lives. Skills, MCP servers, instruction files, custom agents: these are all tools you can shape to teach models about your technology and correct their misconceptions. But here's the gotcha Mastykarz emphasizes repeatedly: every agent has a finite context window, and those extensions compete for it. When developers have 15 extensions installed, tool descriptions get summarized, truncated, or dropped entirely because something else claimed the space first. This is what Microsoft calls the composition problem—and nobody has solved it yet. An extension with perfect discovery and correct invocations in isolation degrades significantly when other popular extensions are present.
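To make the competition for context concrete, here is a minimal sketch of a first-come, first-served context assembler. Everything in it is an assumption made for illustration: the `ToolDescription` type, the word-count token estimate, the extension names, and the allocation policy itself. Real harnesses may summarize or truncate descriptions rather than drop them, and the article does not describe any specific algorithm.

```python
from dataclasses import dataclass


@dataclass
class ToolDescription:
    extension: str  # hypothetical extension name
    text: str       # the description the harness would place in context


def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def assemble_context(
    tools: list[ToolDescription], budget: int
) -> tuple[list[str], list[str]]:
    """Admit descriptions in order until the budget runs out.

    Returns (included, dropped) extension names.
    """
    included: list[str] = []
    dropped: list[str] = []
    remaining = budget
    for tool in tools:
        cost = estimate_tokens(tool.text)
        if cost <= remaining:
            included.append(tool.extension)
            remaining -= cost
        else:
            dropped.append(tool.extension)
    return included, dropped


tools = [
    ToolDescription("verbose-ext", "word " * 80),
    ToolDescription("other-ext", "word " * 18),
    ToolDescription("your-ext", "Look up current SDK method signatures"),
]
included, dropped = assemble_context(tools, budget=100)
print(included, dropped)  # ['verbose-ext', 'other-ext'] ['your-ext']
```

Run alone, `your-ext` fits comfortably in the same budget. It disappears only because two earlier extensions claimed the space first, which is the composition problem in miniature.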
The Three Failure Modes That Kill Your Outcomes
After running hundreds of agent sessions across different products, Microsoft's team identified three consistent failure patterns. Discovery failure happens when your extension exists but never reaches the context window: too many competitors, packaging issues, or harness registration problems make it invisible to the model. Selection failure occurs when the extension is present but the agent doesn't connect it to developer intent. This is the pattern that shows up most often, and it is fixable by aligning your tool descriptions with how developers naturally describe their problems rather than with internal terminology. Quality failure is subtler: the extension gets discovered, selected, and invoked, but the content it returns hurts more than it helps. Examples include walls of text that confuse models, verbose output that pushes useful context out of the window, and accurate-but-overwhelming responses.
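The three failure modes can be read as a decision sequence over a recorded agent session. The sketch below is one hypothetical way to encode that sequence; the `SessionTrace` fields and the `classify` function are assumptions for illustration, not part of any Microsoft tooling. It also simplifies by attributing every failed session in which the extension was invoked to quality, even though the model or harness could be at fault.

```python
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    NONE = "none"
    DISCOVERY = "discovery"
    SELECTION = "selection"
    QUALITY = "quality"


@dataclass
class SessionTrace:
    # Hypothetical fields you would extract from your own session logs.
    extension_in_context: bool  # did the tool description reach the context window?
    extension_invoked: bool     # did the agent actually call the extension?
    task_succeeded: bool        # did the generated code meet your success criteria?


def classify(trace: SessionTrace) -> FailureMode:
    if trace.task_succeeded:
        return FailureMode.NONE
    if not trace.extension_in_context:
        return FailureMode.DISCOVERY  # invisible to the model
    if not trace.extension_invoked:
        return FailureMode.SELECTION  # visible, but not matched to the intent
    return FailureMode.QUALITY        # invoked, but the outcome was still bad


trace = SessionTrace(extension_in_context=True, extension_invoked=False, task_succeeded=False)
print(classify(trace).value)  # selection
```

Tallying these labels across many sessions would show which of the three problems dominates for a given extension.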
Measuring Lift vs Drag
Mastykarz frames all AX work as an optimization problem between lift and drag. Your extensions create lift when they improve outcomes: the agent discovers your tool, uses it correctly, and generates working code with the right SDK and patterns. Extensions create drag when they're present but outcomes stay the same or worsen. That happens when an extension is never discovered (zero lift but no harm), when it is discovered but returns content that confuses the model, or when it works alone but conflicts with other extensions. The uncomfortable reality is that you usually don't know it's happening. The only solution is controlled measurement: running identical scenarios with and without your extensions while keeping everything else constant.
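A controlled comparison of this kind reduces to simple arithmetic over pass/fail results. The following sketch assumes you have already run the same scenarios with and without the extension and recorded one boolean per run; the function names and the sample data are invented for illustration and are not results from the article.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of runs that met the success criteria."""
    if not results:
        raise ValueError("need at least one run")
    return sum(results) / len(results)


def lift(baseline: list[bool], treatment: list[bool]) -> float:
    """Pass-rate difference: runs with the extension minus runs without it."""
    return pass_rate(treatment) - pass_rate(baseline)


def label(delta: float) -> str:
    # Following the article's framing: same-or-worse outcomes count as drag.
    return "lift" if delta > 0 else "drag"


# Made-up sample data: 8 identical scenarios, run without and with the extension.
without_ext = [True, False, False, False, True, False, False, False]  # 2 of 8 pass
with_ext = [True, True, False, True, True, True, False, True]         # 6 of 8 pass

delta = lift(without_ext, with_ext)
print(delta, label(delta))  # 0.5 lift
```

With so few runs a difference like this could be noise, so a real comparison would repeat each scenario many times and hold the model, harness, and other installed extensions constant across both arms.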
Key Takeaways
- You cannot change the model or harness—agent extensions are your only lever for improving agent behavior with your technology
- Context windows are zero-sum: more installed extensions don't mean better outcomes, and sometimes they mean worse ones
- Selection failure (the extension is present but the agent never picks it) is the most common issue and the most fixable: rewrite tool descriptions in the words developers use for their problems, not internal terminology
- Quality failures are insidious: your extension might be called correctly but returning content that makes outcomes worse
The Bottom Line
This AX framework is exactly what the developer tooling space needed: instead of blaming models or harnesses for poor output, you can measure whether your extensions create lift or drag. If you're shipping technology and not systematically testing agent behavior with controlled comparisons, you're flying blind while your competitors optimize ruthlessly.