Lighthouse Attention: Training-Time Hierarchy That Makes Quadratic Attention Practical Again

FlashAttention solved the memory problem. It did not solve the compute problem. Scaled dot-product attention still costs Θ(N²) in sequence length N: double the context and the FLOPs quadruple. At 512K tokens on a single NVIDIA B200, a dense attention forward and backward pass is expensive enough that frontier model teams end up spreading attention across 32 GPUs just to hit million-token windows. That's the problem Lighthouse Attention aims to crush.
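To make the "double the context, quadruple the FLOPs" claim concrete, here is a back-of-the-envelope sketch. It assumes the usual approximation of about 4·N²·d FLOPs for one attention head's forward pass (two matrix multiplies, QKᵀ and PV, at roughly 2·N²·d each) and ignores softmax, masking, and the backward pass. The function name and the head dimension of 128 are illustrative choices, not values from the Lighthouse work.

```python
def attention_forward_flops(n_tokens, head_dim, n_heads=1):
    """Approximate forward FLOPs of dense attention.

    QK^T and PV each cost about 2 * N^2 * d FLOPs per head
    (one multiply plus one add per inner-product term).
    """
    return 4 * n_tokens ** 2 * head_dim * n_heads


HEAD_DIM = 128  # assumed head dimension, for illustration only

flops_256k = attention_forward_flops(256 * 1024, HEAD_DIM)
flops_512k = attention_forward_flops(512 * 1024, HEAD_DIM)

# Doubling the context multiplies the cost by exactly four in this model.
ratio = flops_512k // flops_256k
print(ratio)  # 4
```

The constant in front does not matter for the argument. Any cost that is proportional to N² shows the same factor of four.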

The Core Innovation: Symmetric Pyramid Pooling

Existing sparse methods such as NSA, HISA, DSA, and MoBA all share a critical flaw: they pool only keys and values while keeping queries at full resolution. That keeps the attention call at O(NSd), where S is the number of retained key/value entries, so the cost is still linear in the sequence length N. Lighthouse flips this entirely by pooling queries, keys, and values symmetrically into an L-level pyramid structure. The result is an attention call over a gathered sub-sequence of length S that scales as O(S²d), where S ≪ N. At 512K context, that's a 21× speedup on the forward pass. There are no architectural changes and no inference penalty, and, critically, selection lives entirely outside the attention kernel.
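A rough cost model shows why pooling the query side matters. The sketch below counts only the attention call itself, using the same 4·(queries)·(keys)·d approximation for all three variants. The sub-sequence length S = 32K and head dimension 128 are hypothetical values picked for round numbers, and the model leaves out pooling, scoring, and selection overhead, so its ratios are not the measured 21× figure.

```python
def attention_call_cost(n_queries, n_keys, head_dim):
    """Approximate FLOPs of one attention call (QK^T plus PV)."""
    return 4 * n_queries * n_keys * head_dim


N = 512 * 1024  # full sequence length
S = 32 * 1024   # hypothetical gathered sub-sequence length
D = 128         # assumed head dimension

dense = attention_call_cost(N, N, D)       # O(N^2 d): full queries, full keys
asymmetric = attention_call_cost(N, S, D)  # O(N S d): full queries, pooled keys
symmetric = attention_call_cost(S, S, D)   # O(S^2 d): pooled queries and keys

# Pooling only K/V saves a factor of N / S; pooling Q as well saves it twice.
print(dense // asymmetric, dense // symmetric)  # 16 256
```

In this toy model the asymmetric scheme saves a factor of N/S and the symmetric one saves (N/S)². Real speedups are smaller because the selection stages are not free.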

Four-Stage Pipeline Wraps Standard FlashAttention

Lighthouse's selection pipeline runs in four stages: pyramid pooling, parameter-free ℓ₂-norm scoring, chunked-bitonic top-K selection, then standard FlashAttention on the gathered sub-sequence. The top-K step is deliberately non-differentiable: gradients flow only through the Q, K, V entries that got selected, which forces the model to produce representations that are useful when chosen rather than representations that are good at choosing. The chunked-bitonic approach produces stratified selection instead of a strict global top-K, which prevents attention from collapsing onto a narrow span. And here's the kicker: the coarsest pyramid level is always retained in full, guaranteeing that every position contributes through at least one pyramid level.
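The four stages can be sketched in plain Python on a toy sequence. This is a minimal illustration under loud assumptions, not the Lighthouse implementation: it uses mean pooling, a two-level pyramid, scores taken as the ℓ₂ norm of each key vector, an ordinary per-chunk sort in place of a bitonic network, and no causal mask. All function names and the parameters `p`, `k_select`, and `chunk` are hypothetical.

```python
import math


def mean_pool(rows, p):
    """Average each run of p consecutive vectors (len(rows) divisible by p)."""
    return [[sum(col) / p for col in zip(*rows[i:i + p])]
            for i in range(0, len(rows), p)]


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def stratified_top_k(scores, k, chunk):
    """Keep the best k // n_chunks indices inside every chunk.

    A plain sort stands in for the chunked-bitonic network; the point is
    that selection is spread across chunks, not taken globally.
    """
    n_chunks = math.ceil(len(scores) / chunk)
    per_chunk = max(1, k // n_chunks)
    keep = []
    for start in range(0, len(scores), chunk):
        idx = range(start, min(start + chunk, len(scores)))
        best = sorted(idx, key=lambda i: scores[i], reverse=True)[:per_chunk]
        keep.extend(sorted(best))
    return keep


def dense_attention(q, k, v):
    """softmax(QK^T / sqrt(d)) V, standing in for a FlashAttention call."""
    d = len(q[0])
    out = []
    for qi in q:
        logits = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        top = max(logits)
        w = [math.exp(x - top) for x in logits]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z
                    for c in range(len(v[0]))])
    return out


def lighthouse_like_forward(q, k, v, p=2, k_select=4, chunk=4):
    # Stage 1: pool Q, K, and V identically (fine level = input, coarse = pooled).
    q_c, k_c, v_c = mean_pool(q, p), mean_pool(k, p), mean_pool(v, p)
    # Stage 2: parameter-free score per fine position (assumed: key norm).
    scores = [l2_norm(x) for x in k]
    # Stage 3: stratified, non-differentiable selection of fine positions.
    keep = stratified_top_k(scores, k_select, chunk)
    # Stage 4: gather (coarse level in full + selected fine entries),
    # then run ordinary dense attention on the short sub-sequence.
    q_s = q_c + [q[i] for i in keep]
    k_s = k_c + [k[i] for i in keep]
    v_s = v_c + [v[i] for i in keep]
    return dense_attention(q_s, k_s, v_s), keep


# Toy input: key norms grow with position, so a global top-4 would pick 4..7.
q = [[float(i), 1.0] for i in range(8)]
k = [[float(i), 0.0] for i in range(8)]
v = [[1.0, float(i)] for i in range(8)]

out, keep = lighthouse_like_forward(q, k, v)
print(keep)      # [2, 3, 6, 7]
print(len(out))  # 8
```

With these toy keys a strict global top-4 would take positions 4 through 7 and ignore the first half of the sequence. The per-chunk version keeps two positions from each half, which is the stratification effect described above. Mapping the sub-sequence output back to all N positions is left out of this sketch.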

Two-Stage Training and Recoverability

The real acid test is recoverability: can sparse training produce weights that work under dense attention at inference? Lighthouse uses a two-stage recipe: Stage 1 trains with Lighthouse selection, then Stage 2 resumes under dense SDPA. A 530M Llama-3-style decoder trained on C4 at 98K context with Lighthouse in 26 of 30 layers showed loss spikes of 1.12–1.57 nats when switching to dense, then recovered within ~1,000–1,500 steps and crossed below the dense-from-scratch baseline. By step 16,000 (roughly 50.3B tokens), all Lighthouse runs reached final losses between 0.6980 and 0.7102, versus 0.7237 for the dense baseline.
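The two-stage recipe amounts to a per-layer, per-step switch. Here is a minimal sketch of that schedule. The 30 layers, the 26 Lighthouse layers, and the 16,000 total steps come from the run described above; the switch step and the choice of which four layers stay dense are hypothetical placeholders, since neither is stated here.

```python
def attention_mode(step, layer, switch_step, dense_layers):
    """Return which attention a given layer uses at a given training step."""
    if step >= switch_step or layer in dense_layers:
        return "dense"       # Stage 2, or a layer that never uses Lighthouse
    return "lighthouse"      # Stage 1 sparse selection


TOTAL_STEPS = 16_000           # from the run described above
N_LAYERS = 30                  # from the run described above
SWITCH_STEP = 12_000           # hypothetical Stage 1 -> Stage 2 boundary
DENSE_LAYERS = {0, 1, 28, 29}  # hypothetical choice of the 4 dense layers

stage1 = [attention_mode(0, layer, SWITCH_STEP, DENSE_LAYERS)
          for layer in range(N_LAYERS)]
stage2 = [attention_mode(TOTAL_STEPS - 1, layer, SWITCH_STEP, DENSE_LAYERS)
          for layer in range(N_LAYERS)]

print(stage1.count("lighthouse"), stage2.count("dense"))  # 26 30
```

Because the selection wraps a stock attention call, a training loop built this way changes only which path it dispatches to. The weights and optimizer state carry over unchanged at the switch.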

Performance: Wall-Clock Speedup Without the Catch

The numbers are real and they're spectacular: a 1.4–1.7× pretraining wall-clock speedup over dense SDPA at 32K–128K context, with Lighthouse runs finishing in 22.5–27.0 wall-clock hours where the dense run takes 37.9. On Needle-in-a-Haystack retrieval (4K–96K context), Lighthouse with k=2048 matches or beats the dense baseline's retrieval rate. Context parallelism scales cleanly to 1M tokens across 32 B200 GPUs with zero kernel modifications, so existing context-parallelism infrastructure just works.
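The speedup range follows directly from the wall-clock hours quoted above. A quick check, using only those three figures:

```python
dense_hours = 37.9
lighthouse_hours = (27.0, 22.5)  # slowest and fastest Lighthouse runs

# Speedup = dense wall-clock time / Lighthouse wall-clock time.
speedups = [round(dense_hours / hours, 2) for hours in lighthouse_hours]
print(speedups)  # [1.4, 1.68]
```

So the 1.4–1.7× range is the ratio of end-to-end training hours, rounded to one decimal place.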

The Limitation Nobody's Talking About

Lighthouse is not a universal accelerator. At short contexts, pyramid overhead dominates and you get no benefit; the reported gains start at 32K+ tokens, where the compute savings actually matter. More importantly, it's training-only: autoregressive decoding presents one query at a time, violating the all-queries-co-occur assumption that makes pyramid pooling work. If your bottleneck is inference throughput on short sequences, this isn't for you. But if you're pretraining frontier models with million-token context windows (and who isn't trying to), Lighthouse is a measured training speedup, as long as you budget for the Stage 2 dense recovery.

Key Takeaways

Symmetric Q/K/V pooling turns O(NSd) into O(S²d)—21× faster forward pass at 512K tokens
Selection lives outside the attention kernel, reusing stock FlashAttention without custom sparse kernels
Two-stage training (Lighthouse → dense SDPA recovery) beats dense-from-scratch final loss
1.4–1.7× wall-clock speedup at 32K–128K context with no inference overhead or architectural changes
Training-only solution: doesn't apply to autoregressive decoding workloads

The Bottom Line

Lighthouse Attention is the real deal—a training optimization that actually delivers on its promises without poisoning your model's inference behavior. For teams burning millions in GPU-hours pretraining long-context models, this is the kind of drop-in efficiency gain that makes CFOs happy and ML engineers happier. Skip Stage 2 recovery at your peril, but if you follow the recipe (L=3, p=4, k=1536, projection-norm scorer), you're looking at significant throughput gains with zero architectural debt.
