Six months ago, xAI had no video model, no multimodal infrastructure, and a team starting from essentially a blank slate. Three months later, it shipped Grok Imagine 0.9, and the person who helped make it happen is back on Latent Space with some nuclear hot takes about where generative media actually goes from here.

The Hot Take: Video Intelligence Comes From LLMs

Ethan He, formerly NVIDIA's Cosmos World Model lead and now an independent researcher after leaving xAI, argues that the next Sora won't be a better video model—it'll be a video agent. His thesis is straightforward yet counterintuitive: video models get their intelligence primarily from language models, not from training on video data itself. The real unlock isn't better diffusion architectures; it's treating video generation as an LLM orchestration problem.

From Zero to Multimodal in 90 Days

Ethan's journey through xAI's development process reveals why iteration speed matters more than almost anything else. Working with a small, highly aligned team meant minimal meeting overhead and maximum execution velocity. "You reduce the communication bandwidth among people, and everyone can work towards the same goal," Ethan explains. "Every day there's not that much meetings on the calendar, maybe like a sync a day, and after that it's just all building."

Tiny Bugs Beat New Algorithms

Here's what should alarm anyone shipping foundation models: some of the biggest quality gains came from fixing minuscule bugs in data pipelines—not from novel research breakthroughs. Ethan recalls discovering issues where caption alignment was slightly off or training data had subtle corruptions that cascaded into degraded outputs. The lesson? Before chasing architectural innovations, get your fundamentals airtight.
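
To make "get your fundamentals airtight" concrete, here is a minimal sketch of the kind of cheap audit that can catch misaligned captions before training. It is not xAI's pipeline: the `Sample` fields, the `audit` function, and the specific checks are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    # Hypothetical record layout; a real pipeline will differ.
    clip_id: str
    clip_duration_s: float
    caption: str
    caption_start_s: float
    caption_end_s: float


def audit(samples: List[Sample]) -> List[Tuple[str, str]]:
    """Return (clip_id, reason) pairs for samples that fail basic checks."""
    problems = []
    seen = set()
    for s in samples:
        if not s.caption.strip():
            problems.append((s.clip_id, "empty caption"))
        # A caption that starts before the clip or ends after it is misaligned.
        if s.caption_start_s < 0 or s.caption_end_s > s.clip_duration_s:
            problems.append((s.clip_id, "caption span outside clip"))
        if s.caption_start_s >= s.caption_end_s:
            problems.append((s.clip_id, "caption span empty or reversed"))
        if s.clip_id in seen:
            problems.append((s.clip_id, "duplicate clip id"))
        seen.add(s.clip_id)
    return problems
```

Checks like these are trivial next to a new architecture, but they run in seconds over a manifest and flag the "slightly off" cases that otherwise only show up as degraded outputs.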

Coding Models Change the Bottleneck

"Now coding models are much more efficient and can help us implement stuff much faster. Compute might become a bottleneck again," Ethan warns. When it took weeks to generate synthetic training data or implement new algorithms, researchers had breathing room between experiments. With LLMs automating implementation work down to hours, the constraint flips: you need enough GPU cycles to test all your ideas before anyone else does.

The Rise of Video Agents

"In the near term, the next Sora won't be a better video model, but a video agent." That framing—borrowed from how AI coding evolved from one-shot completions to multiturn reasoning systems capable of planning, editing, testing, and submitting PRs—captures where video generation is heading. Grok Imagine Agent Mode (Beta) already launched as "a full creative agent working on one infinite open canvas" that plans, generates, edits, and iterates automatically in the same workspace.

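The plan, generate, critique, and iterate loop described above can be sketched in a few lines. This is an illustrative skeleton only, not how Grok Imagine Agent Mode works: `run_video_agent` and its four callables (`plan`, `generate`, `critique`, `revise`) are hypothetical stand-ins for an LLM planner, a video model, an LLM or VLM critic, and a prompt rewriter.

```python
from typing import Callable, Dict, List, Tuple


def run_video_agent(
    brief: str,
    plan: Callable[[str], List[str]],                   # brief -> shot descriptions
    generate: Callable[[str], str],                     # prompt -> clip handle
    critique: Callable[[str, str], Tuple[float, str]],  # (shot, clip) -> (score, feedback)
    revise: Callable[[str, str], str],                  # (prompt, feedback) -> new prompt
    max_rounds: int = 3,
    target: float = 0.8,
) -> List[Dict]:
    """Generate one clip per planned shot, retrying until the critic is satisfied."""
    results = []
    for shot in plan(brief):
        prompt = shot
        clip = generate(prompt)
        score, feedback = critique(shot, clip)
        rounds = 1
        # Multi-turn part: rewrite the prompt from feedback and try again.
        while score < target and rounds < max_rounds:
            prompt = revise(prompt, feedback)
            clip = generate(prompt)
            score, feedback = critique(shot, clip)
            rounds += 1
        results.append(
            {"shot": shot, "prompt": prompt, "clip": clip, "score": score, "rounds": rounds}
        )
    return results
```

Note that the video model appears only as the `generate` callable; the planning, critique, and revision steps are where an LLM would sit, which is the sense in which the "intelligence" lives in orchestration.
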
Generative UI and the Neural OS Future

Ethan takes Flipbook seriously, even though most people dismiss it as a fun demo. With inference costs dropping every year, custom video UI generated just in time (JIT) gets closer to practical reality. The vision: world models that are real-time, interactive, and long-horizon enough to serve as AI's front end, potentially replacing traditional HTML/CSS rendering with generated pixels.

Key Takeaways

  • Video intelligence comes from LLMs more than video data itself—the next unlock is orchestration, not architecture
  • Iteration speed matters more than almost anything else; xAI shipped in 3 months with a small, aligned team
  • Some of the biggest near-term quality gains came from fixing small data pipeline bugs, not from research breakthroughs
  • Coding models have flipped the bottleneck back to compute—you can build faster but need GPU capacity to test ideas
  • Video agents follow AI coding's evolution path from one-shot output to multiturn planning systems

The Bottom Line

The real moat in video generation isn't better diffusion transformers. It's building LLM-powered systems that can plan, critique, and iterate like a senior creative director. Anyone still treating this as a pure model quality problem is solving yesterday's equation.