Stateful Inference Architecture Cuts LLM Agent Latency by Half

Multi-agent tool calling has become the dominant interaction pattern for production LLM systems, but there's a dirty secret hiding in every inference framework: they're all reprocessing 85 to 95 percent of your prompt from scratch on every single turn. Victor Norgren just dropped research on arXiv that exposes this inefficiency and delivers a fix that's anything but incremental.

The Core Problem with Current Inference

Existing frameworks like vLLM and SGLang treat each tool call as an independent request, forcing the entire conversation context through the model again even when only a handful of new tokens have arrived since the last turn. In agentic workflows that span dozens of turns, each calling external tools, querying databases, or coordinating sub-agents, that design choice translates directly into wasted compute and latency that degrades the user experience. Norgren's team frames this as an O(n_t) per-turn cost, where n_t is the total number of tokens in the conversation at turn t. Their solution? Reduce it to O(Δt), where Δt is the delta: only the tokens added since the last inference step.
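To make the O(n_t) versus O(Δt) distinction concrete, here is a minimal accounting sketch. It is not Norgren's implementation: the class name, the token counts, and the simplification of ignoring generated tokens are all assumptions made for illustration. It only counts how many prompt tokens each strategy would have to push through the model per turn.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TurnCostModel:
    """Hypothetical bookkeeping: prompt tokens processed per turn under each strategy."""
    context_len: int = 0
    stateless_processed: List[int] = field(default_factory=list)
    stateful_processed: List[int] = field(default_factory=list)

    def add_turn(self, delta_tokens: int) -> None:
        # The conversation grows by the newly arrived tokens.
        self.context_len += delta_tokens
        # Stateless serving re-runs the whole context: O(n_t).
        self.stateless_processed.append(self.context_len)
        # Stateful serving ingests only the new tokens: O(delta_t).
        self.stateful_processed.append(delta_tokens)

    def redundant_fraction(self) -> float:
        """Share of stateless token processing that was already-seen context."""
        total = sum(self.stateless_processed)
        useful = sum(self.stateful_processed)
        return 1.0 - useful / total


def demo() -> TurnCostModel:
    model = TurnCostModel()
    model.add_turn(2000)          # assumed system prompt + first request
    for _ in range(5):
        model.add_turn(200)       # assumed tool result / follow-up per turn
    return model


if __name__ == "__main__":
    m = demo()
    print("stateless per turn:", m.stateless_processed)
    print("stateful per turn: ", m.stateful_processed)
    print("redundant fraction:", round(m.redundant_fraction(), 3))
```

With these made-up numbers the stateless path processes 15,000 prompt tokens across six turns while the stateful path processes 3,000, so 80 percent of the stateless work is repeated context. Longer conversations with small deltas push that share higher.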

How Stateful Inference Actually Works

The architecture combines three distinct mechanisms to eliminate redundant processing. First, a persistent KV cache lives across turns and advances by ingesting only new tokens rather than recomputing attention over the entire context window. Second, a radix prefix cache extends this optimization across interleaved multi-agent traffic, which matters when multiple agents handle different subtasks simultaneously and their prompts share prefixes. Third, a prompt-lookup speculative decoder accelerates structured output generation: it drafts candidate tokens by copying spans that already appear in the context (tool schemas, field names, earlier JSON), and the model then verifies those drafts together instead of generating them one token at a time. The key insight is that these gains come from stateful reuse and speculation, not traditional caching of results. The system isn't storing answers to repeated queries. It maintains computational state across sequential turns so the model never reprocesses context it has already encoded.
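The speculative piece is the least intuitive of the three, so here is a minimal sketch of the general prompt-lookup idea, assuming a simple n-gram match over the token history. The function names, parameters, and toy tokens are hypothetical and are not taken from Norgren's code. In a real server the acceptance check would be a single batched forward pass; here it is simulated by comparing against the tokens the model would have produced.

```python
from typing import List, Sequence


def prompt_lookup_draft(tokens: Sequence[str], ngram_size: int = 3,
                        num_draft: int = 5) -> List[str]:
    """Draft tokens by finding the most recent earlier copy of the trailing
    n-gram in the context and copying what followed it."""
    if len(tokens) <= ngram_size:
        return []
    tail = list(tokens[-ngram_size:])
    # Scan backwards over earlier positions, excluding the tail itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == tail:
            end = start + ngram_size
            return list(tokens[end:end + num_draft])
    return []


def accepted_prefix(draft: Sequence[str], model_tokens: Sequence[str]) -> List[str]:
    """Keep the longest run of draft tokens that agrees with the model's output."""
    accepted = []
    for guess, actual in zip(draft, model_tokens):
        if guess != actual:
            break
        accepted.append(guess)
    return accepted


if __name__ == "__main__":
    # Toy context: one earlier tool call, then the start of a second one.
    context = ['{', '"tool"', ':', '"search"', ',', '"query"', ':', '"x"', '}',
               '{', '"tool"', ':']
    draft = prompt_lookup_draft(context)
    print("draft:", draft)
    # Pretend the model agrees on the structure but picks a different query.
    model_out = ['"search"', ',', '"query"', ':', '"y"']
    print("accepted:", accepted_prefix(draft, model_out))
```

Structured tool-call output repeats keys and punctuation from earlier turns, which is why copying from the context tends to produce drafts the model accepts. When the draft diverges, as with the query value in the toy example, only the agreeing prefix is kept and normal decoding resumes from there.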

Benchmarks That Should Make Engineers Pay Attention

Against vLLM and SGLang on novel, fully generated workloads, Norgren's reference implementation is 2.1× faster per turn on a six-turn agentic workflow and an eye-opening 4.2× faster on the median turn of a thirty-five-turn one. End-to-end wall time gets cut in half. These aren't cherry-picked metrics from favorable conditions: the workloads were generated specifically to be novel and free of training data contamination.

Key Takeaways

Current LLM serving frameworks reprocess 85-95% of the prompt on every agentic turn, spending compute on context the model has already seen
Stateful KV cache persistence converts O(n_t) per-turn costs into O(Δt) delta-only processing
Radix prefix caching handles interleaved multi-agent traffic with shared prompt structure
Benchmarks show 2.1× to 4.2× speedups versus vLLM and SGLang on multi-turn workflows

The Bottom Line

This research exposes a fundamental mismatch between how agentic systems actually behave—long-running, stateful, tool-chaining—and the stateless request-response model that inference frameworks were originally designed around. Norgren's work suggests we should be thinking about LLM serving not as isolated API calls but as persistent compute with memory across turns. That's a paradigm shift hiding inside what looks like an optimization paper.
