When PromptFrenzy dropped its latest showdown, it didn't ask the frontier models to solve math problems or write poetry. It asked them to draw a pelican riding a bicycle as SVG code. The prompt was dead simple: "Generate an SVG of a pelican riding a bicycle." No tools, no retries, no human edits. Whatever comes back first is what gets judged. That's the kind of test that strips away the marketing fluff and shows you what these models actually produce on the first try.

How PromptFrenzy Ran It

The benchmark used each vendor's raw API, not some polished SDK wrapper. For the frontier round, Claude Fable 5, GPT-5.5 Pro, and Gemini 3.1 Pro got the identical one-shot prompt via their direct APIs. Wall-clock latency was measured for the full response, including any reasoning tokens. The Claude family round ran through PromptFrenzy's agent harness, which adds a small overhead but keeps conditions consistent across the lineup comparison.

Quality among the three flagship models came out surprisingly close, at least on the static pelican. Latency and billing did not. PromptFrenzy's own summary is that "quality is close" while "the latency and the bill are not," which suggests the models take quite different routes to a similar-looking result.
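To make "wall-clock latency for the full response" concrete, here is a minimal sketch of timing a single one-shot call. It is not PromptFrenzy's harness: `time_one_shot` and `fake_model` are hypothetical names, and `fake_model` is a stand-in that sleeps briefly instead of calling any vendor API. In a real run you would swap in a function that sends the prompt and blocks until the complete response (reasoning tokens included) has arrived.

```python
import time
from typing import Callable, Tuple


def time_one_shot(generate: Callable[[str], str], prompt: str) -> Tuple[str, float]:
    """Call generate(prompt) exactly once; return (response, elapsed seconds).

    The clock covers the whole call, so any time the model spends
    reasoning before it answers is included in the measurement.
    """
    start = time.perf_counter()  # monotonic clock suited to measuring intervals
    response = generate(prompt)  # one shot: no retries, no follow-up prompts
    elapsed = time.perf_counter() - start
    return response, elapsed


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a vendor API call (assumption, not a real client)."""
    time.sleep(0.05)  # pretend the model takes a moment to respond
    return '<svg xmlns="http://www.w3.org/2000/svg"></svg>'


PROMPT = "Generate an SVG of a pelican riding a bicycle."
svg, seconds = time_one_shot(fake_model, PROMPT)
print(f"{seconds:.3f}s for {len(svg)} characters")
```

Timing around the whole call, rather than time to first token, is what makes a slow-but-thorough model look slow here, which is presumably why latency can diverge even when the pelicans look alike.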

The Claude Family Round

Beyond the three-way frontier showdown, PromptFrenzy ran a separate round pitting the full Claude lineup against itself. This gives us a glimpse of how Sonnet and Haiku-class models stack up against flagship Fable 5 on the same SVG generation task. Earlier launch-day runs via PromptFrenzy's agent harness showed measurable drops in output quality as you move down the model hierarchy. Not exactly shocking, but it's good to have actual side-by-side data rather than vibes.

The Animated SVG Challenge

The real test came next: make it move. Same one-shot approach, but now each model had to generate a fully self-contained animated SVG with no JavaScript, no video model fallback, and zero editing after generation. This is a harder test of code generation: animating shapes in pure SVG markup requires understanding transforms, keyframes, and how to structure valid XML that browsers will actually render. PromptFrenzy provided a split-screen comparison video showing all models' outputs side by side, plus Claude Fable 5's raw SVG playing live in-browser. If you want to see what your money gets you in rendered output, those comparison assets are worth scrolling through. The difference between "works" and "broken in interesting ways" is immediately apparent.
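The constraints above (valid XML, no JavaScript, animation in pure markup) can be roughly sanity-checked with the standard library. This is a sketch under stated assumptions, not PromptFrenzy's tooling: `check_animated_svg` and `SAMPLE` are hypothetical, the checks are heuristics, and passing them does not prove a browser will render the animation as intended.

```python
import xml.etree.ElementTree as ET

# SVG animation elements (SMIL) that work without any JavaScript.
ANIMATION_TAGS = {"animate", "animateTransform", "animateMotion", "set"}


def local_name(name: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tag and attribute names."""
    return name.rsplit("}", 1)[-1]


def check_animated_svg(svg_text: str) -> dict:
    """Heuristic checks for a self-contained, script-free animated SVG."""
    result = {"well_formed": False, "is_svg": False,
              "has_script": False, "has_animation": False}
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return result  # not well-formed XML, so nothing else can be checked
    result["well_formed"] = True
    result["is_svg"] = local_name(root.tag) == "svg"

    elements = list(root.iter())
    # A <script> element or an on* event attribute both count as scripting.
    has_script_tag = any(local_name(el.tag) == "script" for el in elements)
    has_event_attr = any(local_name(attr).lower().startswith("on")
                         for el in elements for attr in el.attrib)
    result["has_script"] = has_script_tag or has_event_attr

    # Animation can come from SMIL elements or CSS @keyframes in a <style> block.
    has_smil = any(local_name(el.tag) in ANIMATION_TAGS for el in elements)
    has_css = any(local_name(el.tag) == "style" and "@keyframes" in (el.text or "")
                  for el in elements)
    result["has_animation"] = has_smil or has_css
    return result


# Hypothetical sample: a circle with a rotation animation (no pelican, sadly).
SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">
  <circle cx="50" cy="50" r="10">
    <animateTransform attributeName="transform" type="rotate"
                      from="0 50 50" to="360 50 50" dur="2s"
                      repeatCount="indefinite"/>
  </circle>
</svg>"""

print(check_animated_svg(SAMPLE))
```

A check like this only catches the "broken" end of the spectrum, such as unclosed tags or a smuggled-in script. Whether the pelican actually pedals is still something you judge by eye.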

Key Takeaways

  • Same-prompt, one-shot testing strips away iterative prompting advantages that often inflate benchmark scores
  • Raw API latency varies significantly across frontier models even when output quality appears similar
  • Animated SVG generation reveals real gaps in code reasoning capability that static tests miss
  • The Claude lineup shows a predictable quality drop down the hierarchy; the size of that drop is what matters for cost-sensitive use cases

The Bottom Line

These pelican-on-a-bicycle showdowns aren't just novelty tests. They're a reminder that model capabilities often diverge most under constraints like "no retries" and "no editing." When you can't babysit the output, you learn what these models actually know versus what they've learned to imitate. PromptFrenzy's approach is exactly the kind of pragmatic, no-BS benchmarking the community needs more of.