In 2026, AI agents are everywhere, shipping code, handling customer service tickets, and drafting documents. But ask them to navigate genuinely uncertain environments like medical diagnosis or scientific discovery, and they start falling apart. The core issue isn't computational power; it's question-asking. Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and Harvard University's School of Engineering and Applied Sciences (SEAS) have been digging into this exact problem, using a childhood board game as their test bed.
Why Battleship?
"Collaborative Battleship" reframes the classic guessing game around natural language questions. One player acts as "captain," asking where hidden ships are located, while their teammate serves as "spotter," responding with yes-no answers in real time. The researchers first had over 40 humans play together to build the "BattleshipQA" dataset, a collection of questions and answers that became a baseline for testing state-of-the-art language models including GPT-5, Llama 4 Scout, Claude 4 Opus, and others. The initial results were telling: top-tier LMs could beat humans at Battleship without any special training, completing games in fewer turns. But smaller systems like Llama 4 Scout struggled badly with rational decision-making. The real bottleneck wasn't answering questions; it was formulating useful ones that revealed meaningful information about the hidden board.
Monte Carlo Inference Changes Everything
The breakthrough came from implementing Monte Carlo inference strategies. These techniques treat potential guesses as individual particles, weighting them based on how valid they appear after each spotter response. Picture balls that inflate or deflate with every turn. This calculated, adaptive approach let captain models ask questions that extracted considerably more information. "Today's language models are primarily optimized to answer complex queries, but it's less clear whether they learn to ask good questions for themselves," says MIT PhD student and CSAIL researcher Gabriel Grand SM '23, lead author on the paper. "Our work shows that asking informative questions depends on the ability to predict and simulate the world."
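To make the particle idea concrete, here is a minimal sketch, not the authors' implementation. It assumes a toy 4x4 board with a single two-cell ship, noiseless yes-no answers, and hypothetical helper names (`sample_boards`, `expected_information_gain`, `update_weights`). Each particle is one guess at the hidden board. A question is scored by how evenly it splits the weighted particles, and particles that contradict the spotter's answer drop to zero weight.

```python
import math
import random

def sample_boards(n_particles, size=4, ship_len=2, seed=0):
    """Sample hypothetical boards, each holding one horizontal or vertical ship."""
    rng = random.Random(seed)
    boards = []
    for _ in range(n_particles):
        if rng.random() < 0.5:  # horizontal placement
            r = rng.randrange(size)
            c = rng.randrange(size - ship_len + 1)
            cells = frozenset((r, c + i) for i in range(ship_len))
        else:  # vertical placement
            r = rng.randrange(size - ship_len + 1)
            c = rng.randrange(size)
            cells = frozenset((r + i, c) for i in range(ship_len))
        boards.append(cells)
    return boards

def binary_entropy(p):
    """Entropy in bits of a yes/no outcome with probability p of 'yes'."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_information_gain(question, particles, weights):
    """With noiseless answers, the expected gain equals the entropy of the answer."""
    total = sum(weights)
    p_yes = sum(w for b, w in zip(particles, weights) if question(b)) / total
    return binary_entropy(p_yes)

def update_weights(question, answer, particles, weights):
    """Zero out particles that contradict the answer, then renormalize."""
    new = [w if question(b) == answer else 0.0 for b, w in zip(particles, weights)]
    total = sum(new)
    if total == 0:
        raise ValueError("no particle is consistent with the answer")
    return [w / total for w in new]

# Hypothetical candidate questions, each written as a predicate over a board.
questions = {
    "Is any part of the ship in row 0?": lambda b: any(r == 0 for r, _ in b),
    "Is any part of the ship in the top half?": lambda b: any(r < 2 for r, _ in b),
    "Is the ship horizontal?": lambda b: len({r for r, _ in b}) == 1,
}

particles = sample_boards(500)
weights = [1.0 / len(particles)] * len(particles)
best = max(
    questions,
    key=lambda text: expected_information_gain(questions[text], particles, weights),
)
print(best)
```

The design choice worth noting is that the captain never needs the true board: it only needs a population of plausible boards to simulate answers against, which is the sense in which good questions depend on the ability to "predict and simulate the world."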
Llama 4 Scout Beats GPT-5 at 1% of the Cost
The numbers are wild. As a relatively small language model, Llama 4 Scout originally beat humans only 8 percent of the time. After Monte Carlo inference refined its question-asking strategy, it reached an 82 percent win rate against human opponents. Even more striking, this carefully tuned smaller model outpaced GPT-5 while operating at roughly 1 percent of its computational cost.
Python as a Bridge
On the answering side, researchers tackled a different problem: getting models to provide accurate responses about ship locations. The fix was elegant: automatically converting captain questions into executable Python code that explicitly tells spotter LMs how to verify their answers. A question like "Is there a ship in column one that spans two rows?" becomes instructions for the model to search that specific area and assess its dimensions. This approach delivered consistent gains across the board: GPT-4o-mini saw a nearly 30 percent performance improvement, Claude 4 Opus jumped about eight points, and average accuracy across models climbed roughly 15 percent. "The field has seen a lot of success from 'auto-formalization' strategies, in which LMs generate code to verify their solutions," says senior author Jacob Andreas, MIT EECS associate professor and CSAIL principal investigator. "What I find most exciting is that it opens up the possibility of using these techniques to generate better solutions in the first place."
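As an illustration of what such a translation might look like, the sketch below hand-writes a Python check for the example question. Everything here is an assumption for illustration: the board encoding (strings of cells, `.` for water, a letter per ship), the function names, the reading of "column one" as index 0, and the reading of "spans two rows" as "occupies exactly two distinct rows." The code the researchers' system actually generates may look quite different.

```python
from collections import defaultdict

WATER = "."  # assumed encoding: "." is water, any other character is a ship id

def ship_rows_by_id(board):
    """Map each ship id to the set of row indices it occupies."""
    rows = defaultdict(set)
    for r, row in enumerate(board):
        for cell in row:
            if cell != WATER:
                rows[cell].add(r)
    return rows

def ship_in_column_spanning_rows(board, column=0, rows_spanned=2):
    """Answer: is there a ship in `column` that occupies exactly `rows_spanned` rows?"""
    rows = ship_rows_by_id(board)
    ships_in_column = {row[column] for row in board if row[column] != WATER}
    return any(len(rows[ship]) == rows_spanned for ship in ships_in_column)

# Hypothetical board: ship A is vertical in column 0, ship C is horizontal.
board = [
    "A...",
    "A.BB",
    "....",
    "CCC.",
]
print(ship_in_column_spanning_rows(board))
```

Because the check runs against the actual board, the spotter's answer no longer depends on the model reading a grid correctly "by eye," which is one plausible reason this style of verification helps smaller models.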
Beyond Battleship: Guess Who?
The team didn't stop at one board game. Testing their approach on "Guess Who?"—where players narrow down 100 possible characters through yes-no questions—yielded similar results. Llama 4 Scout jumped from 30 percent success to over 72 percent, while GPT-4o climbed from 62 to 90 percent accuracy.
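For intuition about why question quality matters in this setting, here is a small sketch under stated assumptions: candidates are described by made-up attribute sets, answers are noiseless, and the helper names are hypothetical. With 100 characters, no yes-no strategy can guarantee an identification in fewer than 7 questions, and a question gets closest to that bound when it splits the remaining candidates roughly in half.

```python
import math

def min_questions(n_candidates):
    """Lower bound on worst-case yes/no questions needed to isolate one candidate."""
    return math.ceil(math.log2(n_candidates))

def best_split(candidates, attributes):
    """Pick the attribute whose 'yes' group is closest to half the candidates."""
    half = len(candidates) / 2
    return min(
        attributes,
        key=lambda a: abs(sum(1 for traits in candidates.values() if a in traits) - half),
    )

# Hypothetical characters and attributes, for illustration only.
candidates = {
    "Ana": {"glasses", "hat"},
    "Ben": {"glasses"},
    "Cy": {"hat", "beard"},
    "Dee": {"beard"},
    "Eli": set(),
    "Fay": {"glasses", "beard"},
}

print(min_questions(100))
print(best_split(candidates, ["glasses", "hat", "beard"]))
```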
Expert Players Still Unbeaten
Despite the impressive gains, expert human players remain difficult for all tested models to beat, a notable contrast with chess, where top AI systems have surpassed even grandmasters. "GPT-5 can beat your average 'Battleship' player, and gets a hair better with our methods," adds coauthor Valerio Pepe, an OpenAI researcher and recent Harvard graduate. "However, expert players are still hard to beat for all models."
"As AI systems become more agentic, the hardest problems turn out to be social ones: tracking common ground, resolving misunderstandings, and adapting to different partners over time," says Robert Hawkins, assistant professor of linguistics at Stanford University, who wasn't involved in the paper. "This work elegantly captures these phenomena in a controlled collaborative setting, and makes a compelling case that the real bottleneck for AI agents isn't just the calculation of optimal questions, but the pragmatic reasoning needed to make the most of their answers."
Key Takeaways
- Monte Carlo inference strategies let smaller models like Llama 4 Scout ask dramatically better questions, reaching an 82 percent win rate versus humans
- Converting natural language questions into Python code boosted answering accuracy by roughly 15 percent on average across tested models
- A refined Llama 4 Scout model outpaced GPT-5 while operating at approximately 1 percent of the computational cost
- Auto-formalization, generating executable code from natural language questions, improved answer verification, and the authors see potential for using it to generate better solutions in the first place
The Bottom Line
This research exposes a fundamental truth the AI industry doesn't want to admit: raw model scale isn't the real moat anymore. When Llama 4 Scout with Monte Carlo inference beats GPT-5 at roughly one percent of the cost, it suggests that inference-time reasoning strategies can matter more than training bigger models on more data. The next frontier for AI agents won't be about who has the most parameters; it'll be about who asks the best questions.