We Gave an AI Agent Eyes. It Never Used Them

The Agent Voyager Project dropped Claude Haiku 4.5 into a bear trap last week—a single page from ParseBench containing four correlation matrices crammed onto one sheet of a 2012 econometrics paper. The task was dead simple on paper: download the PDF, rebuild it as an HTML table, and don't screw it up. What they got back exposed something weird about how AI agents actually behave when you give them tools—and which ones they reach for first.

The Experiment Setup

The team grabbed Haiku because Opus 4.8 costs too damn much. They wanted to know: can a budget model punch above its weight class with the right harness? Specifically, Goose—Elevance's open agent framework—was wrapped around two configurations. One had pdf-vision, an MCP server that renders PDFs as images so models can "see" them. The other relied on Goose's built-in pdf_tool, which just extracts text. They even told the vision config to trust the picture over raw text. Classic control experiment setup.

First Attempt: Confident Failure

pdf_tool went first. Five turns, five cents, and a 53% score. The tool poured every table on the page into one unbroken stream of text—August, ORPR, time series—all jumbled together with no rows, no columns, nothing to indicate where one matrix ended and another began. Goose rebuilt it anyway, re-read its own work, and declared victory. "All values match perfectly," it said, right before failing spectacularly. No step in the run flagged anything wrong. The trajectory shows an agent that was confidently incorrect and had zero awareness of it.

Second Attempt: Eyes That Never Saw

The same model, the same page, pdf-vision enabled instead. Twenty-four turns and thirty-three cents later—roughly 7x more expensive—the agent scored 100% and passed. But here's the kicker: it never once saw the page. The image kept coming back empty. Eight times Goose tried to look at a picture that wouldn't load, including one moment where it piped a PNG through base64 in the terminal hoping to read it by hand. When vision finally gave up, the agent fell back to pdf-vision's secondary trick—layout-aware markdown export—and built perfect rows and columns from that. The actual hero was turn 22's pivot to text structure, not any visual processing.

Key Takeaways

Vision cost 7x more but delivered zero actual visual analysis
A cheaper model can match expensive ones if the harness is solid enough
Recording every step revealed what raw scores hide entirely
Structure-preserving text exports beat flat extraction for tabular data

The Bottom Line

This isn't a story about vision failing—it's a story about tools that lie about what they do. pdf-vision won because it had better fallback behavior, not because the agent used its eyes. And AVP's trajectory recording is the real MVP here: without seeing exactly which tool fired on turn 22, this insight disappears into another "vision works" headline.

> We Gave an AI Agent Eyes. It Never Used Them

The Experiment Setup

First Attempt: Confident Failure

Second Attempt: Eyes That Never Saw

Key Takeaways

The Bottom Line

> RELATED DISPATCHES