Building AI Agents That Can Actually Handle Long-Running Tasks

If you've ever tried to use an AI agent for anything more than a quick script, you've hit the same wall everyone hits: the model makes progress for about two exchanges and then just... stops. It forgets what it was doing. It loses the thread. Turns out that's not a bug; it's how LLMs are trained. They're built for conversation, not marathon work sessions.

The solution comes from developer Roger Oriol in his ongoing 'Build A Basic AI Agent From Scratch' series. Rather than waiting for foundation models to magically improve at long-horizon tasks, he arms the agent with explicit planning infrastructure: a scratchpad for thinking and a to-do list for tracking progress through complex jobs.

The Scratchpad: Making the Model Think Before Acting

The first new tool is dead simple but surprisingly powerful. It's an in-memory scratchpad where the model writes out its reasoning before touching anything else. Oriol calls this 'forcing the model to think through the goal and plan the whole approach before starting working on it.' The scratchpad isn't shared between sessions; it's a private workspace for the current task only. The implementation is a basic Python class with read() and write() methods that store content in memory. When the agent receives a complex request, its first move is to dump everything into the scratchpad: restate the goal, survey what it knows, evaluate options, anticipate failure modes, then commit to exactly one next action.
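To make that concrete, here is a minimal sketch of what such a scratchpad class might look like. Only the read() and write() method names come from the description above; the class name, the return strings, and the choice to overwrite rather than append on each write are assumptions for illustration, not Oriol's actual code.

```python
class Scratchpad:
    """In-memory notes for a single agent session (nothing is persisted)."""

    def __init__(self) -> None:
        self._content = ""

    def write(self, content: str) -> str:
        # Assumption: each write replaces the previous notes.
        self._content = content
        return f"Scratchpad updated ({len(content)} characters)."

    def read(self) -> str:
        # Return a readable placeholder so the model never gets an empty result.
        return self._content or "(scratchpad is empty)"


pad = Scratchpad()
pad.write("Goal: migrate the site.\nNext action: inspect the repository.")
print(pad.read())
```

Because the notes live in a plain attribute, they disappear when the process exits, which matches the "current task only" scope described above.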

The To-Do List: Explicit State Tracking

The second tool handles what most agents fumble: keeping track of what's pending, in-progress, and done. Oriol's ToDoList class enforces good practices by design—no parallel in-progress items, no invalid statuses, no duplicate task IDs. Statuses include pending, in_progress, done, cancelled, and failed. There's also retry logic baked in. If a task fails, the model can mark it as failed, attempt recovery, then set it back to in_progress. But after three retries (RETRY_LIMIT = 3), the tool tells the model to stop and escalate to the human instead. This prevents agents from spinning forever on broken approaches.
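The rules above could be sketched roughly as follows. The status names and RETRY_LIMIT = 3 come from the article; the method names (add, set_status, show), the Task dataclass, the error strings, and the decision to count a retry each time a failed task goes back to in_progress are assumptions made for this sketch rather than the original implementation.

```python
from dataclasses import dataclass

RETRY_LIMIT = 3
VALID_STATUSES = {"pending", "in_progress", "done", "cancelled", "failed"}


@dataclass
class Task:
    task_id: str
    description: str
    status: str = "pending"
    retries: int = 0


class ToDoList:
    """In-memory task tracker that rejects invalid state changes."""

    def __init__(self) -> None:
        self._tasks: dict[str, Task] = {}

    def add(self, task_id: str, description: str) -> str:
        if task_id in self._tasks:
            return f"Error: task '{task_id}' already exists."
        self._tasks[task_id] = Task(task_id, description)
        return f"Added '{task_id}'."

    def set_status(self, task_id: str, status: str) -> str:
        if status not in VALID_STATUSES:
            return f"Error: invalid status '{status}'."
        task = self._tasks.get(task_id)
        if task is None:
            return f"Error: unknown task '{task_id}'."
        if status == "in_progress":
            # Only one task may be in progress at any time.
            active = [t.task_id for t in self._tasks.values()
                      if t.status == "in_progress" and t.task_id != task_id]
            if active:
                return f"Error: '{active[0]}' is already in progress."
            # Moving a failed task back to in_progress counts as a retry.
            if task.status == "failed":
                if task.retries >= RETRY_LIMIT:
                    return (f"Error: retry limit ({RETRY_LIMIT}) reached for "
                            f"'{task_id}'; stop and ask the user.")
                task.retries += 1
        task.status = status
        return f"'{task_id}' is now {status}."

    def show(self) -> str:
        return "\n".join(f"[{t.status}] {t.task_id}: {t.description}"
                         for t in self._tasks.values())
```

Returning error strings instead of raising exceptions is a deliberate choice in this sketch: the text goes straight back to the model as a tool result, so it can read the message and correct itself.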

The System Prompt: Teaching Planning Behavior

Tools alone aren't enough. Oriol extends the agent's system prompt with explicit instructions for how to use these tools during complex tasks (roughly three or more distinct steps). The workflow becomes: write initial thinking to scratchpad, break work into concrete todo items, mark one in_progress at a time, update status immediately after completing each step, and call todo_list before moving forward. The replanning section is particularly clever. After every tool result, the model checks whether the outcome matched expectations. If something fails, it diagnoses whether it's a recoverable input error or a deeper problem with the approach itself—then either retries with corrections, replaces the task with a revised version, or reorders priorities based on new information.
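As a rough illustration of the kind of instructions described above, the planning rules could live in a constant that gets appended to the base prompt. The wording, the constant name PLANNING_INSTRUCTIONS, the helper build_system_prompt, and the tool name scratchpad_write are all assumptions for this sketch, not Oriol's actual prompt; only todo_list and the status names appear in the article.

```python
# Illustrative wording only; this is not the prompt from the original series.
PLANNING_INSTRUCTIONS = """\
For complex tasks (roughly three or more distinct steps):
1. Write your initial thinking to the scratchpad (scratchpad_write).
2. Break the work into concrete todo items.
3. Mark exactly one item in_progress at a time.
4. Update an item's status immediately after completing it.
5. Call todo_list before moving on to the next step.

After every tool result, check whether the outcome matched expectations.
If a step failed, decide whether it was a recoverable input error or a
problem with the approach, then retry with corrections, replace the task
with a revised version, or reorder the remaining tasks.
"""


def build_system_prompt(base_prompt: str) -> str:
    """Append the planning rules to an existing system prompt."""
    return base_prompt.rstrip() + "\n\n" + PLANNING_INSTRUCTIONS


print(build_system_prompt("You are a coding agent."))
```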

Real-World Test: Eleventy to Hugo Migration

The proof is in the pudding. Oriol tasked his agent with migrating a static site from Eleventy to Hugo and watched it work through four major phases: inspect repository structure, map templates and content assets, implement Hugo configuration and migration, then run build verification. The agent wrote scratchpad entries before each significant action, tracked all four tasks in the todo list, recovered from failures by replanning, and ultimately completed the full migration with a passing hugo --minify build.

What Comes Next

Oriol teases the next piece: human-in-the-loop safety checks. Right now these agents can edit files and run commands without oversight—which is great until they modify something they shouldn't. The obvious fix is having the agent ping you before doing anything potentially destructive. That's the topic for the next installment in this series.

Key Takeaways

LLMs trained for conversation need explicit scaffolding to handle long tasks—scratchpads and todo lists fill that gap
In-memory tools work fine when each session is independent—no database needed
Retry limits prevent agents from looping forever on broken approaches
The system prompt teaches planning discipline: think first, plan in public, verify completion before stopping
