CLAUDE.md files have become the de facto way to customize Claude's behavior on coding tasks, but comparisons between instruction sets have so far been anecdotal. Developer emiliolugo dropped clawmark on GitHub this week, a Rust CLI that runs controlled benchmarks of competing CLAUDE.md variants against five bundled SWE-bench Lite problems. The tool automates the entire pipeline: it invokes Claude locally, evaluates the generated patches with the official SWE-bench harness in Docker, and produces a leaderboard ranked by resolve rate.
How Clawmark Works
The tool requires Rust (MSRV 1.79), the Claude CLI >= 1.0.0, Docker >= 24.0, Python 3.11+ with swebench installed, and git >= 2.39. After running clawmark doctor to verify the prerequisites, you point it at two or more variant files, using either the shorthand flags (--a/--b) for two-way comparisons or the more flexible --variant form for N-way tests. Each variant is tested against all five tasks, with configurable timeouts and optional parallel execution of up to five concurrent invocations.
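To make the shape of a run concrete, here is a minimal Python sketch of the variant-by-task matrix with a capped worker pool. It is an illustration of the idea, not clawmark's Rust implementation: the names plan_runs, run_all, and run_one are hypothetical, and only the cap of five concurrent invocations comes from the description above.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

MAX_PARALLEL = 5  # upper bound on concurrent invocations described above


def plan_runs(variants, tasks):
    """Return every (variant, task) pair: each variant meets every task."""
    return list(product(variants, tasks))


def run_all(variants, tasks, run_one, parallel=1):
    """Call run_one(variant, task) for each pair, at most `parallel` at a time.

    `run_one` is a hypothetical callback standing in for one Claude invocation.
    """
    if len(variants) < 2:
        raise ValueError("need at least two variants to compare")
    # Clamp the requested parallelism into the range 1..MAX_PARALLEL.
    workers = max(1, min(parallel, MAX_PARALLEL))
    pairs = plan_runs(variants, tasks)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda pair: run_one(*pair), pairs))
    return dict(zip(pairs, results))
```

With two variants and five tasks, plan_runs yields ten pairs, so a three-variant comparison costs fifteen Claude invocations rather than ten. That multiplication is worth keeping in mind when choosing timeouts.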
The Benchmark Pipeline
For every variant-task combination, clawmark clones the target repository at the task's base commit, injects your CLAUDE.md file as-is into the repo root, invokes Claude via claude -p --output-format json --dangerously-skip-permissions, and captures git diff HEAD as the patch. After all predictions complete, it fires up the SWE-bench harness once per variant to evaluate patches in Docker isolation. The final report shows resolve rates alongside wall-clock time, token counts, estimated USD cost, and cost-per-resolve for each variant.
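The prediction step for a single variant-task pair can be sketched in Python as follows. This is a hedged approximation of the sequence described above, not clawmark's actual code: the function names, the injectable run parameter, and the choice to pass the task prompt on stdin are all assumptions. Only the claude flags and the git diff HEAD capture are taken from the description.

```python
import shutil
import subprocess
from pathlib import Path

# Flags as quoted in the article; how the prompt is supplied is an assumption.
CLAUDE_ARGV = ["claude", "-p", "--output-format", "json",
               "--dangerously-skip-permissions"]


def prepare_commands(repo_url, base_commit, workdir):
    """Commands that clone the repo and pin it to the task's base commit."""
    return [
        ["git", "clone", "--quiet", repo_url, str(workdir)],
        ["git", "-C", str(workdir), "checkout", "--quiet", base_commit],
    ]


def predict(repo_url, base_commit, variant_path, workdir, prompt,
            timeout_s, run=subprocess.run):
    """Produce a patch for one (variant, task) pair and return it as text."""
    for argv in prepare_commands(repo_url, base_commit, workdir):
        run(argv, check=True)
    # Inject the variant unchanged as CLAUDE.md in the repo root.
    shutil.copyfile(variant_path, Path(workdir) / "CLAUDE.md")
    # Assumption: the task prompt is fed on stdin; the timeout bounds the run.
    run(CLAUDE_ARGV, input=prompt, text=True, cwd=workdir,
        timeout=timeout_s, check=True, capture_output=True)
    # The patch is whatever changed in tracked files relative to HEAD.
    diff = run(["git", "diff", "HEAD"], cwd=workdir, check=True,
               capture_output=True, text=True)
    return diff.stdout
```

Because git diff HEAD only reports changes to tracked files, the injected CLAUDE.md does not leak into the captured patch in this sketch unless the target repository already tracks a file with that name.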
Security Tradeoffs Worth Noting
The tool passes --dangerously-skip-permissions when invoking Claude, meaning the agent can act on the host system without permission prompts for the duration of a benchmark. Clawmark mitigates shell injection by passing subprocess arguments as argv arrays and prevents path traversal by canonicalizing paths against the working directory, but for v1 it explicitly warns: 'Do not run untrusted CLAUDE.md variants.' Patch evaluation is better isolated: the SWE-bench harness runs in Docker containers, and clawmark itself never executes the generated patches directly on the host.
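The two mitigations named above are general techniques, and a short Python sketch shows what each buys you. The helper names resolve_inside and echo_argument are hypothetical, and this is not clawmark's Rust code, only an illustration of the same ideas under those assumptions.

```python
import subprocess
import sys
from pathlib import Path


def resolve_inside(base_dir, user_path):
    """Canonicalize user_path and reject it if it escapes base_dir."""
    base = Path(base_dir).resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):  # requires Python 3.9+
        raise ValueError(f"path escapes working directory: {user_path}")
    return candidate


def echo_argument(value):
    """Pass `value` to a child process as one argv element, with no shell."""
    # With a list and no shell=True, metacharacters such as ';' are never
    # interpreted; the child receives the string exactly as given.
    child = [sys.executable, "-c", "import sys; print(sys.argv[1])", value]
    return subprocess.run(child, capture_output=True, text=True,
                          check=True).stdout.strip()
```

Calling resolve_inside(workdir, "../outside.md") raises ValueError, while a path such as "variants/a.md" resolves normally. Neither technique constrains what the agent itself does once it is running, which is why the warning about untrusted variants still applies.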
What Clawmark Doesn't Do (Yet)
v1 is intentionally minimal: no config files, web UI, remote execution, resume functionality, progress bars, repeated trials, or full 300-task SWE-bench Lite runs. It also enforces no turn limit, token budget, retry policy, or per-task cost cap. The tool warns that open-ended variants can 'consume materially more time and usage quota than short, patch-focused variants.' For first runs, the documentation recommends tight benchmark-oriented instructions like 'You are running inside an automated benchmark. Make the smallest code change that addresses the issue.'
Why This Matters for AI Coding Agent Workflows
As Claude, Copilot, and other AI coding assistants become core to developer workflows, the community is realizing that instruction engineering can matter as much as model selection. Clawmark gives practitioners a repeatable way to iterate on CLAUDE.md variants, testing whether verbose system prompts outperform terse ones, or whether specialized instructions for certain file types actually help. The five-task smoke set won't tell you everything, and without repeated trials a single run is a small sample, but it's a reproducible starting point that beats 'feels faster to me.'
Key Takeaways
- Clawmark benchmarks 2+ CLAUDE.md variants against identical SWE-bench Lite problems with Docker isolation
- Supports two-variant shorthand (--a/--b) or N-way comparisons via --variant flags
- Reports resolve rate, runtime, token usage, estimated cost, and per-resolve efficiency
- Security model requires trust in variant content: use only your own CLAUDE.md files
- v1 intentionally minimal: no retries, progress UI, full SWE-bench, or budget enforcement
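
The reported metrics reduce to simple arithmetic, sketched here in Python. The VariantResult type and the tie-break on total cost are assumptions made for illustration. The article only states that the leaderboard is ranked by resolve rate and that cost-per-resolve is reported.

```python
from dataclasses import dataclass


@dataclass
class VariantResult:
    name: str
    resolved: int    # tasks whose patch passed the harness
    total: int       # tasks attempted (five in the bundled set)
    cost_usd: float  # estimated spend across all tasks

    @property
    def resolve_rate(self):
        return self.resolved / self.total if self.total else 0.0

    @property
    def cost_per_resolve(self):
        # Undefined when nothing resolved; None avoids dividing by zero.
        return self.cost_usd / self.resolved if self.resolved else None


def leaderboard(results):
    """Rank by resolve rate, highest first; ties go to the cheaper variant.

    The tie-break rule is an assumption, not documented clawmark behavior.
    """
    return sorted(results, key=lambda r: (-r.resolve_rate, r.cost_usd))
```

As a made-up example, a variant that resolves 3 of 5 tasks for an estimated $1.50 has a resolve rate of 0.6 and a cost-per-resolve of $0.50. A variant that resolves nothing has no meaningful cost-per-resolve, however cheap it was to run.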