Infracost just shipped a serious wake-up call to anyone maintaining CLIs that AI coding agents invoke as subprocesses. The team behind the infrastructure cost estimation tool redesigned its CLI specifically for machine callers and, on the hardest queries, cut Claude's output token usage by 79% and API costs by 67%. A benchmark against Opus across 16 questions spanning a 3-project Terraform fixture with 1,171 resources showed accuracy climbing from 45% to 100% on scoreable problems, while total spending fell 41%. The kicker: one question that previously cost $3.51 and hit the 25-turn cap without producing an answer now returns correct results for $0.25.
The Bare Claude Problem
The original pain point is predictable in hindsight. When AI agents call a CLI tool, they're working with raw output they didn't design. Infracost's --json flag dumped hundreds of kilobytes of structured data through the model's context window, and the agent had to write pipelines (jq filters, Python parsers, sort and wc chains) to extract what it actually needed. Each subprocess streamed its output back through the model's context, compounding token costs on already-expensive API calls. Because the CLI didn't expose the predicates the model needed, the model composed the slicing logic itself in token-expensive shell invocations.
Predicate Pushdown: Flags That Pay for Themselves
The fix landed as new flags on infracost inspect. --addresses-only acts as an alias for --fields=address, returning one resource address per line, which is enough for wc -l counts or piping into subsequent text processing. The --filter flag accepts a comma-separated, AND'd grammar of key=value predicates that covers the slicing patterns most common in agent traces: filtering by policy violation type, cost threshold, tag compliance status. Instead of chaining jq | python | sort | wc across a massive JSON dump, agents now issue single commands like infracost inspect --summary --fields distinct_failing_tagging_resources and get exactly what they asked for. On the hard question bucket where bare Claude burned 113K output tokens, predicate pushdown dropped that to 24K, a 79% reduction on token-heavy queries alone.
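To make the idea of a comma-separated, AND'd key=value grammar concrete, here is a minimal Python sketch of how a tool might parse such an expression and apply it before emitting anything. It is not Infracost's implementation: the function names, the row shape, and the predicate keys (policy, status) are hypothetical, and only exact-match predicates are handled.

```python
def parse_filter(expr: str) -> dict[str, str]:
    """Parse 'k1=v1,k2=v2' into a dict of predicates that must all hold."""
    predicates: dict[str, str] = {}
    for part in expr.split(","):
        key, sep, value = part.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed predicate: {part!r}")
        predicates[key.strip()] = value.strip()
    return predicates


def matches(row: dict[str, str], predicates: dict[str, str]) -> bool:
    """A row passes only if every predicate matches (logical AND)."""
    return all(row.get(key) == value for key, value in predicates.items())


def addresses_only(rows: list[dict[str, str]], expr: str) -> list[str]:
    """Filter at the source, then project down to one address per row."""
    predicates = parse_filter(expr)
    return [row["address"] for row in rows if matches(row, predicates)]


# Made-up rows for illustration; the keys are assumptions, not Infracost fields.
rows = [
    {"address": "aws_instance.web", "policy": "tagging", "status": "failing"},
    {"address": "aws_instance.db", "policy": "tagging", "status": "passing"},
    {"address": "aws_s3_bucket.logs", "policy": "instance_type", "status": "failing"},
]

for address in addresses_only(rows, "policy=tagging,status=failing"):
    print(address)  # prints aws_instance.web
```

The point of the design is that the caller receives only the matching addresses, so nothing else has to pass through the model's context. The 79% figure follows from the token counts quoted above: 1 - 24/113 is roughly 0.79.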
TOON Format: Stop Paying for Redundant Field Names
The second major change addresses JSON's overhead when consumed by LLMs. A 500-row table with 5 columns using standard JSON pays for the field name "address" appearing 500 times—useful redundancy for humans, pure tax for a model that just needs to count or sum. Infracost adopted TOON (Token-Oriented Object Notation), an indentation-based format with a published spec at toon-format/spec. The key property: uniform object arrays render as a single header line plus comma-separated value rows instead of repeating field names per record. Benchmarks on tabular datasets show roughly 35% fewer tokens versus minified JSON and 59% fewer versus pretty-printed JSON using the GPT-5 o200k_base tokenizer, with comprehension accuracy holding steady around 76%. On Infracost's FinOps issue output—compact JSON dominated by uniform arrays—the savings sit in the 30-40% range.
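The header-plus-rows property can be sketched in a few lines of Python. This is a simplified illustration of TOON's tabular form for uniform arrays, not a spec-compliant encoder: it skips quoting, escaping, nesting, and empty arrays, and the function name and sample data are made up for the example. It compares character counts rather than tokens, so it only hints at the savings a real tokenizer would report.

```python
import json


def to_toon_table(key: str, rows: list[dict]) -> str:
    """Render a non-empty, uniform list of flat dicts as one header plus value rows."""
    if not rows:
        raise ValueError("this sketch only handles non-empty arrays")
    fields = list(rows[0])
    if any(list(row) != fields for row in rows):
        raise ValueError("rows must share the same fields in the same order")
    field_list = ",".join(fields)
    # Header carries the array length and the field names exactly once.
    lines = [f"{key}[{len(rows)}]{{{field_list}}}:"]
    for row in rows:
        lines.append("  " + ",".join(str(row[field]) for field in fields))
    return "\n".join(lines)


# Made-up sample data for illustration.
rows = [
    {"address": "aws_instance.web", "monthly_cost": 70.08},
    {"address": "aws_instance.db", "monthly_cost": 140.16},
]

toon = to_toon_table("resources", rows)
compact_json = json.dumps({"resources": rows}, separators=(",", ":"))

print(toon)
# resources[2]{address,monthly_cost}:
#   aws_instance.web,70.08
#   aws_instance.db,140.16
print(len(toon), "characters as TOON vs", len(compact_json), "as compact JSON")
```

The gap widens with row count, because JSON repeats every field name per record while the tabular form pays for them once in the header.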
Building a Benchmark Harness That Actually Measures
None of these numbers would exist without a rigorous measurement framework. The team built a custom harness that ran each of 16 questions across three configurations: bare Claude with Bash and Read tools, the same agent loading infracost-scan SKILL.md with --llm output, and skill-loaded runs using --json format instead. Key safeguards included sandboxed HOME directories to prevent skill carryover in "bare" baseline tests, project-local TMPDIR to avoid macOS ACL issues with subprocesses running under different UIDs, PATH-prepended builds of the current branch so older binaries wouldn't silently disable new flags, and five repeats per cell to smooth out the 20-30% run-to-run variation in token costs. The --rerun-failed flag re-executes only cells that hit turn caps without useful cost numbers, while --rescore reapplies verifier logic to existing transcripts when the scoring criteria change, which saves serious API budget during iterative development.
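The bookkeeping behind repeats, selective reruns, and rescoring might look roughly like the Python sketch below. It is an illustration under assumptions, not the team's harness: the Run record, the function names, the configuration labels, and all sample values are hypothetical. Only the 25-turn cap comes from the article.

```python
from dataclasses import dataclass
from statistics import median
from typing import Callable, Optional

TURN_CAP = 25  # turn cap mentioned in the article


@dataclass
class Run:
    question: str
    config: str            # illustrative labels such as "bare" or "skill"
    turns: int
    cost_usd: float
    answer: Optional[str]  # None when the agent produced no usable answer


def cells_to_rerun(runs: list[Run]) -> list[tuple[str, str]]:
    """Select only cells that hit the turn cap without producing an answer."""
    failed = [(r.question, r.config) for r in runs
              if r.turns >= TURN_CAP and r.answer is None]
    return list(dict.fromkeys(failed))  # dedupe while preserving order


def median_cost_by_cell(runs: list[Run]) -> dict[tuple[str, str], float]:
    """Collapse repeats per (question, config) cell to a median cost."""
    costs: dict[tuple[str, str], list[float]] = {}
    for r in runs:
        costs.setdefault((r.question, r.config), []).append(r.cost_usd)
    return {cell: median(values) for cell, values in costs.items()}


def rescore(runs: list[Run], verifier: Callable[[Run], bool]) -> float:
    """Re-apply a verifier to stored runs; no new API calls are needed."""
    return sum(1 for r in runs if verifier(r)) / len(runs)


# Made-up runs for illustration only.
runs = [
    Run("q1", "bare", 25, 3.00, None),
    Run("q1", "skill", 4, 0.25, "1234"),
    Run("q1", "skill", 5, 0.31, "1234"),
    Run("q1", "skill", 4, 0.27, "1234"),
]

print(cells_to_rerun(runs))
print(median_cost_by_cell(runs))
print(rescore(runs, lambda r: r.answer == "1234"))
```

Keeping raw runs around and deriving scores from them is what makes rescoring cheap: changing the verifier only changes a function, not the transcripts.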
Why Skip Straight to 2.0
The team explains they were scoping a 1.0 release earlier this year—the CLI would graduate from pure cost estimation to surfacing the issues behind those costs: previous-generation instance types, policy violations, FinOps problems. Then agent traffic started appearing in their subprocess invocations, and it became clear the design center had shifted. A human reviewer reads PR comments; an AI agent runs infracost inspect --filter ... and pipes tabular rows directly into the next workflow step. The capabilities are identical to what a 1.0 would have shipped, but the caller profile changed enough that bumping straight to 2.0 better reflects the architectural shift from 0.10.x releases.
What CLI Maintainers Should Take Away
The economic regime is what changed here—not the design principles themselves, which are standard CLI wisdom: push predicates close to the data, give callers projection options, don't make consumers pay for scaffolding they don't need. What's new is that output is metered at per-token API rates against large-context models, and the consumer is non-deterministic with a credit card attached. The benchmark harness gave Infracost fast feedback on CLI design choices that would have taken weeks of dogfooding to surface—they can't run 16 questions across 5 repeats against human users and measure effort the way they measure token cost. Most changes making the CLI cheaper for agents also make it better for humans at a terminal: targeted predicates, source-level projection, wire formats without redundant scaffolding. The code is all in the infracost/cli repository if you want to instrument your own tool.
Key Takeaways
- Predicate pushdown into CLI flags eliminates expensive jq/python pipeline chains that burn tokens on subprocess streaming
- TOON format cuts token counts 30-40% on tabular output versus compact JSON, with no accuracy tradeoff
- A measurement harness is essential—aggregate cost and accuracy numbers only become meaningful after accounting for bimodal question difficulty distributions
- Design changes that help AI agents also improve human CLI ergonomics in concrete ways
The Bottom Line
This isn't a one-off optimization story—it's proof that the way we design CLIs needs to change because the consumer model changed. When your tool's output hits a metered API with a credit card attached, "good enough" JSON dumps aren't good enough anymore. Infracost built the measurement infrastructure to prove it, made the changes, and shipped 2.0. If you're maintaining any CLI that AI coding agents invoke as subprocesses, you should probably be doing the same thing.