If you've built even a basic AI agent that can read files and run bash commands, you've probably hit this wall: the thing works great for one-off tasks but completely falls apart when you ask it to do something substantial. It makes progress, then just... stops. This isn't a bug in your code—it's fundamental to how LLMs are trained. They're optimized for conversational back-and-forth, not sustained autonomous work.

The Planning Problem

The author of the "Build A Basic AI Agent From Scratch" series on ruxu.dev breaks down exactly what an agent needs to handle long tasks: understand the goal upfront, plan before acting, decompose work into concrete steps, track what's pending versus done, recover when things go wrong, and verify completion before stopping. None of this happens automatically—you have to build it explicitly. The solution isn't one magic tool. It's two simple but powerful primitives that force the model to think before typing: a Scratchpad for internal reasoning and a To-Do List for tracking progress through messy multi-step work.

The Scratchpad Tool

This one's dead simple—it's just an in-memory notepad the agent writes its thoughts into before acting. The key benefit is forcing pre-commitment: the model has to articulate its approach, survey what it knows, evaluate options, anticipate failure modes, and decide on exactly one next action—all before touching anything real. The implementation is a basic Python class with read/write methods that store content in memory rather than a file, since you don't want scratchpad state leaking between sessions.
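
To make that concrete, here is a minimal sketch of what such an in-memory scratchpad could look like. The class name, method names, and return messages are assumptions for illustration, not the series' actual code; the only details taken from the description above are the read/write methods and the in-memory storage.

```python
class Scratchpad:
    """In-memory notepad for the agent's reasoning (hypothetical sketch).

    Nothing is written to disk, so the notes disappear when the process
    ends and cannot leak into a later session.
    """

    def __init__(self) -> None:
        self._content = ""

    def write(self, text: str) -> str:
        # Assumption: each write replaces the previous notes.
        # An append-only variant would work just as well.
        self._content = text
        return f"Scratchpad updated ({len(text)} chars)."

    def read(self) -> str:
        # Return a placeholder so the model never receives an empty tool result.
        return self._content or "(scratchpad is empty)"
```

Exposed as a tool, the agent would call write with its plan before acting and read whenever it needs to recall what it decided.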

The To-Do List Tool

More interesting is the task tracker. It enforces strict rules: only one item can be "in_progress" at a time (no parallelization chaos), statuses are limited to pending/in_progress/done/cancelled/failed, and duplicate task IDs throw errors. The retry logic is particularly clever: with a RETRY_LIMIT of 3, a failed task can only be set back to in_progress a bounded number of times. Once the limit is hit, the tool refuses the retry and tells the agent to escalate to the human instead.
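
The rules above can be sketched roughly as follows. This is an illustrative reconstruction, not the author's implementation: the class and method names are hypothetical, and the exact point at which a retry is refused (here, after three retries of the same task) is an assumption. Only RETRY_LIMIT, the status names, the one-in-progress rule, and the duplicate-ID error come from the description.

```python
from dataclasses import dataclass

RETRY_LIMIT = 3
VALID_STATUSES = {"pending", "in_progress", "done", "cancelled", "failed"}


@dataclass
class Task:
    description: str
    status: str = "pending"
    retries: int = 0  # how many times a failed task was restarted


class TodoList:
    """Hypothetical task tracker enforcing the rules described above."""

    def __init__(self) -> None:
        self._tasks = {}  # task_id -> Task

    def add(self, task_id: str, description: str) -> str:
        if task_id in self._tasks:
            raise ValueError(f"Duplicate task id: {task_id}")
        self._tasks[task_id] = Task(description)
        return f"Added {task_id}."

    def set_status(self, task_id: str, status: str) -> str:
        if status not in VALID_STATUSES:
            raise ValueError(f"Invalid status: {status}")
        task = self._tasks[task_id]  # raises KeyError for unknown ids
        if status == "in_progress":
            # Rule: at most one task may be in progress at any time.
            active = [tid for tid, t in self._tasks.items()
                      if t.status == "in_progress" and tid != task_id]
            if active:
                raise ValueError(f"{active[0]} is already in progress")
            # Rule: restarting a failed task counts as a retry.
            if task.status == "failed":
                if task.retries >= RETRY_LIMIT:
                    # Assumption: refuse with a message instead of raising.
                    return (f"Retry limit reached for {task_id}; "
                            "escalate to the human.")
                task.retries += 1
        task.status = status
        return f"{task_id} -> {status}"
```

Returning the refusal as an ordinary tool result (rather than raising) is a design choice in this sketch: the model reads the message on its next turn and can act on the instruction to escalate.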

System Prompt Engineering

All the strategic planning behavior lives in the system prompt. For complex tasks (roughly three or more distinct steps), the model is instructed to write its initial thinking to the scratchpad first, then break work into todos with "pending" status before touching anything. It must mark items done immediately rather than batching completions, and call todo_list before moving to the next step. The replanning section is crucial: after every tool result, the agent checks if outcomes matched expectations—and if not, it diagnoses in the scratchpad whether it's a recoverable input error or a deeper problem with the approach.
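
As a rough illustration of how those instructions might be packaged, the sketch below stores them as a prompt fragment and appends it to a base prompt. The wording is a paraphrase of the rules summarized above, not the author's actual prompt, and the names PLANNING_PROMPT, build_system_prompt, and the scratchpad tool name are assumptions (only todo_list is named in the text).

```python
# Paraphrased planning rules; NOT the original prompt from the series.
PLANNING_PROMPT = """\
## Planning
For complex tasks (roughly three or more distinct steps):
1. Write your initial thinking to the scratchpad before doing anything else.
2. Break the work into todos with status "pending" before touching anything.
3. Keep exactly one todo "in_progress" at a time.
4. Mark a todo "done" immediately when it is finished; never batch completions.
5. Call todo_list before moving on to the next step.

## Replanning
After every tool result, check whether the outcome matched your expectation.
If it did not, diagnose in the scratchpad whether this is a recoverable
input error or a deeper problem with the approach, then update the todos.
"""


def build_system_prompt(base_prompt: str) -> str:
    """Append the planning rules to an existing system prompt."""
    return f"{base_prompt.rstrip()}\n\n{PLANNING_PROMPT}"
```

Keeping the planning rules in their own constant makes it easy to toggle them off for simple one-shot tasks, or to iterate on the wording without touching the rest of the prompt.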

Real-World Test

The real test is a substantial task. When asked to migrate a static site from Eleventy to Hugo (definitely not a simple task), the agent wrote its reasoning to the scratchpad first, created four todo items (inspect repo structure, map templates/content/assets, implement migration, verify build), worked through them sequentially with proper status updates, recovered from errors via retry logic, and eventually ran hugo --minify successfully with all four tasks marked done. The agent kept working for minutes without human intervention.

Key Takeaways

  • LLMs default to conversational behavior; you must explicitly build sustained autonomous capability
  • A scratchpad forces pre-planning: articulate approach before acting on it
  • Strict task tracking (one in-progress item, retry limits) prevents chaotic or infinite loops
  • System prompts encode the "when to replan" logic that tools alone can't express
  • Done detection requires three checks: empty todo list, output verification, and uncertainty audit
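
The three done-detection checks in the last takeaway could be combined along these lines. This is a hedged sketch: the function name, its parameters, and the decision to count failed tasks as unfinished are assumptions, and "empty todo list" is interpreted here as "no task is still open".

```python
# Assumption: a failed task that was never resolved still blocks completion.
OPEN_STATUSES = {"pending", "in_progress", "failed"}


def done_check(todo_statuses, verification_passed, open_uncertainties):
    """Return (is_done, reasons), where reasons lists what still blocks stopping.

    todo_statuses: dict mapping task id -> status string
    verification_passed: True if the final output was checked (e.g. build ran)
    open_uncertainties: list of unresolved doubts the agent has noted
    """
    reasons = []
    # Check 1: nothing left open on the todo list.
    unfinished = sorted(tid for tid, status in todo_statuses.items()
                        if status in OPEN_STATUSES)
    if unfinished:
        reasons.append(f"unfinished tasks: {', '.join(unfinished)}")
    # Check 2: the output was actually verified.
    if not verification_passed:
        reasons.append("output has not been verified")
    # Check 3: no remaining uncertainties.
    if open_uncertainties:
        reasons.append(f"{len(open_uncertainties)} open uncertainty(ies)")
    return (not reasons, reasons)
```

Returning the list of reasons, not just a boolean, gives the agent something concrete to act on when it is not yet allowed to stop.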

The Bottom Line

This is the unglamorous plumbing that makes AI agents actually useful in production. Anyone can hook up a few tool-calling endpoints—getting an agent to work reliably on hard problems for extended periods without human babysitting is where the real engineering happens. Bookmark this series if you're serious about building autonomous coding agents, because the next installment promises human-in-the-loop guardrails.