A new benchmark called SkillsBench is challenging assumptions about how developers should use AI agent skills, and the results should make you reconsider your workflow. The paper tested 86 tasks across 11 domains with seven different agent-model configurations and found that curated skills raised average pass rates by 16.2 percentage points, while self-generated skills provided no benefit on average. That's a massive gap between two approaches that many developers assume are interchangeable.
What's Wrong With Self-Generated Skills
The paper's methodology for testing "self-generated" skills is where things get interesting, and where Anson Biggs, the author of the blog post that brought this research to Hacker News' attention, takes serious issue. The benchmark essentially asked struggling models to write about a problem before attempting it. Biggs argues this isn't skill generation at all: it's just thinking blocks with extra steps. "They just reinvented thinking blocks but worse," he writes. If you want meaningful procedural knowledge from an agent, you can't ask the same model that couldn't solve the problem to generate documentation about solving it. That's circular logic dressed up as a benchmark.
Biggs has been watching this space closely at work and has noticed a pattern: developers keep making the same mistake. When their agent struggles with something, they immediately prompt it to write a skill documenting how to do that thing better next time. This is functionally identical to asking for more thinking tokens before output, and just as useless as a standalone strategy.
What Skills Actually Are in Claude Code
Skills in the Claude Code ecosystem aren't magic prompts or elaborate system instructions—they're structured folders living under .claude/skills/ with a specific layout. Each skill gets its own directory containing a SKILL.md file (the actual procedural knowledge), supporting tools like shell scripts for complex commands, and reference materials for edge cases. Biggs shares an example from his GitLab CI monitoring workflow: instead of making Claude figure out how to watch pipeline jobs every session, he built a skill that explains the setup once, includes the CLI wrapper script, and points to additional documentation for troubleshooting. The model reads it once and then operates correctly for the rest of the conversation.
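To make that layout concrete, here is a minimal Python sketch that scaffolds one skill folder. Only the .claude/skills/ location and the SKILL.md filename come from the description above. The scripts/ and references/ subfolder names, the frontmatter fields, and the scaffold_skill helper are assumptions for illustration, not a documented convention.

```python
from pathlib import Path

# Assumed frontmatter fields (name, description); check your agent's
# documentation before relying on them.
SKILL_TEMPLATE = """\
---
name: {name}
description: {description}
---

# {name}

## Setup
Describe the one-time setup here.

## Tools
Wrapper scripts live in scripts/ (hypothetical subfolder name).

## Troubleshooting
Edge cases are documented in references/ (hypothetical subfolder name).
"""


def scaffold_skill(project_root: Path, name: str, description: str) -> Path:
    """Create .claude/skills/<name>/ with a SKILL.md and two empty subfolders."""
    skill_dir = project_root / ".claude" / "skills" / name
    (skill_dir / "scripts").mkdir(parents=True, exist_ok=True)
    (skill_dir / "references").mkdir(parents=True, exist_ok=True)

    skill_md = skill_dir / "SKILL.md"
    if not skill_md.exists():  # never overwrite hand-written knowledge
        skill_md.write_text(
            SKILL_TEMPLATE.format(name=name, description=description),
            encoding="utf-8",
        )
    return skill_dir
```

Calling scaffold_skill(Path.cwd(), "gitlab-ci-monitor", "Watch GitLab CI pipeline jobs") would create the empty skeleton. The skill name here is a hypothetical one modeled on Biggs's example, and the substance of SKILL.md still has to be written by someone who knows the workflow.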
Three Valid Use Cases for Skills
Biggs identifies three scenarios where skills actually provide value.
First, context injection: agents are stateless by default, meaning every new conversation is like meeting them for the first time with zero project knowledge. CLAUDE.md handles broad strokes, but monorepos often have language-specific patterns, Docker Compose quirks, or CI configurations that can't fit in a single file. Skills fill those gaps for common-but-not-universal scenarios, like explaining how to run integration tests when the setup involves x86 containers on Apple Silicon Macs.
Second, repetition avoidance. If you regularly tell agents to align your documentation folder with merge request descriptions and issue trackers, writing that process out once as a skill means never typing it again.
Third, and this is where Biggs sees the most value, hard problem capture: when an agent gets stuck on something genuinely difficult, you intervene to unstick it, then immediately ask what gap caused the failure in the first place. Sometimes it's trivial; sometimes you get genuine insight worth preserving as a skill for future sessions.
The Right Way to Build Skills
"If I ask you how you did something cool with an Agent, and you just on the fly have a fresh Agent build me a SKILL.md on my question, I will kill you," Biggs writes. That's the core principle: skills need to capture knowledge that exists nowhere else in the conversation context. A fresh session has no gaps to fill because there's been no attempt yet—you're just asking for generic documentation. The value comes from distilling lessons learned during actual problem-solving sessions where the model discovered what it didn't know. Biggs ran his own version of SkillsBench using properly constructed skills and found agents performed dramatically better. He admits he doesn't have funding to fully validate the results, but preliminary passes convinced him the methodology matters more than most benchmarking papers acknowledge. "I think this essentially doubles the amount of dataset needed for this benchmark so I assume that's why the authors didn't include this method," he speculates.
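One way to picture the principle that a skill should record gaps found during a real session is the short Python sketch below. It is an illustration under stated assumptions, not Biggs's tooling: GapNote, render_skill_body, and the section wording are hypothetical names invented here. The point is that the function refuses to produce anything when no gap was actually observed, which is the fresh-session case.

```python
from dataclasses import dataclass


@dataclass
class GapNote:
    """One thing the agent did not know during a real session (hypothetical type)."""
    symptom: str            # what went wrong, as observed
    missing_knowledge: str  # the gap that caused it
    fix: str                # what unstuck the agent


def render_skill_body(title: str, gaps: list[GapNote]) -> str:
    """Turn recorded gaps into the body text of a SKILL.md."""
    if not gaps:
        # A fresh session has nothing to distill, so there is no skill to write.
        raise ValueError("No gaps recorded; solve the problem first, then capture what was missing.")

    lines = [f"# {title}", ""]
    for gap in gaps:
        lines.extend([
            f"## {gap.symptom}",
            "",
            f"What was missing: {gap.missing_knowledge}",
            "",
            f"What to do instead: {gap.fix}",
            "",
        ])
    return "\n".join(lines).rstrip() + "\n"
```

In practice the gap notes would come from asking the agent, after you have unstuck it, what it was missing. The sketch only enforces the ordering: attempt first, document second.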
Key Takeaways
- Skills are folder structures with SKILL.md files plus supporting tools and references—not standalone prompts
- Self-generated skills from struggling models are just thinking blocks rebranded
- Build skills after successful problem-solving to capture gaps that actually existed, not imagined ones
- Use skills for injecting context into stateless agents, avoiding repetition of common procedures, and documenting hard-won lessons
The Bottom Line
The benchmark's headline finding is real: structured procedural knowledge genuinely helps agents. But the methodology for self-generated skills is fundamentally flawed. If you're building skills on the fly from fresh sessions, you're not capturing institutional knowledge; you're just generating more context that won't transfer when it matters. Wait until your agent struggles, solve the problem together, then extract what was missing. That's how you build skills worth keeping.