A new benchmark from Encore puts five TypeScript backend frameworks through the same AI coding agent gauntlet, and four of them come out looking like hot garbage in production. The company ran Claude Code (claude-sonnet-4-6) against Encore, Express, Fastify, Hono, and NestJS using identical prompts, the same model, and matching Postgres setups on identical VMs. Every framework's functional test suite passed after Run 1. But when Encore's team actually read the diffs? On four of the five frameworks, the agent had built the laziest possible implementation that would still technically pass: a Postgres table polled by setInterval standing in for a durable queue, CREATE TABLE IF NOT EXISTS at boot instead of any migration system, and in-process cron running on every replica.

The Test Setup

The benchmark threw three layered tasks at each framework. Task 1 built the orders API with a typed cross-service call to payments. Task 2 added an order-created event subscriber and a daily aggregation cron. Task 3 threaded X-Request-Id tracing across service boundaries. Each framework got its own exe.dev VM, and the agent worked through all three tasks back-to-back, which matters because failure modes that only surface when extending existing code would have been invisible in single-task runs. Tests were plain black-box HTTP assertions against the live server run with vitest—same probes against every framework, same model (claude-sonnet-4-6 via Claude Code), same VM image.
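To make Task 3 concrete, here is a minimal sketch of what carrying X-Request-Id across a service boundary can look like. Only the header name comes from the benchmark; HeaderMap, newRequestId, resolveRequestId, and outboundHeaders are hypothetical names, and the ID generator is a simple stand-in rather than anything the agent produced.

```typescript
// Header name from the article. Node's http module lowercases incoming header names.
const REQUEST_ID_HEADER = "x-request-id";

// Hypothetical shape for incoming headers.
type HeaderMap = Record<string, string | undefined>;

let requestCounter = 0;

// Stand-in ID generator (assumption): a real service would more likely use a UUID.
function newRequestId(): string {
  requestCounter += 1;
  return `req-${Date.now().toString(36)}-${requestCounter}`;
}

// Reuse the caller's ID when one arrived, otherwise mint a fresh one.
function resolveRequestId(incoming: HeaderMap): string {
  const existing = incoming[REQUEST_ID_HEADER];
  return existing !== undefined && existing.trim() !== "" ? existing : newRequestId();
}

// Build headers for an outbound call to another service, carrying the same ID forward.
function outboundHeaders(
  incoming: HeaderMap,
  extra: Record<string, string> = {},
): Record<string, string> {
  return { ...extra, [REQUEST_ID_HEADER]: resolveRequestId(incoming) };
}
```

An orders handler would pass outboundHeaders(req.headers) into its call to payments, so both services log the same ID for one user request.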

Run 1: All Green, One Production-Ready

After the first run, every framework hit 31 out of 31 tests. Hono was cheapest at $1.55 per run, NestJS most expensive at a $2.61 median with one outlier hitting $4.45. First-try-green ratios sat around two-thirds across the board, except for Fastify, which nailed all three attempts. The headline numbers looked boring, until you cracked open the diffs and found that Express, Fastify, Hono, and NestJS had converged on near-identical implementations of every async primitive: a Postgres queue table polled by setInterval(..., 500), application-level cron scheduled at startup with setTimeout recursion, and CREATE TABLE IF NOT EXISTS instead of any migration tracking. On Encore, the agent used the framework's primitives (new Topic for durable events, new Subscription for subscribers, new CronJob for scheduling) and hit 100% on a five-check production-readiness rubric without being asked.
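For contrast with CREATE TABLE IF NOT EXISTS at boot, here is a rough sketch of what "migration tracking" means at its simplest: an ordered list of versioned changes plus a record of which ones have been applied. Everything in it is an assumption for illustration. The Migration shape, the MigrationLedger class, and the in-memory history are hypothetical, and real tools such as drizzle-kit keep this record in the database rather than in memory.

```typescript
// Hypothetical migration shape: a version number, a label, and the SQL to apply.
interface Migration {
  version: number;
  name: string;
  up: string;
}

// In-memory stand-in for a migrations table (assumption: a real ledger lives in the database).
class MigrationLedger {
  private applied: number[] = [];

  // Apply every migration newer than the latest recorded version, in ascending order.
  // Returns the versions that were applied on this call.
  migrate(migrations: Migration[], exec: (sql: string) => void): number[] {
    const current = Math.max(0, ...this.applied);
    const pending = migrations
      .filter((m) => m.version > current)
      .sort((a, b) => a.version - b.version);
    for (const m of pending) {
      exec(m.up);
      this.applied.push(m.version);
    }
    return pending.map((m) => m.version);
  }

  // The recorded history is what CREATE TABLE IF NOT EXISTS never gives you.
  history(): number[] {
    return [...this.applied];
  }
}
```

Running migrate twice with the same list applies nothing the second time, and history() shows which schema version a given deployment is on.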

The Lazy Solution Problem

The benchmark defines production-readiness across five checks: versioned migrations, multi-instance-safe cron, retry policy with a dead-letter queue, a failed-message endpoint, and structured logging. All four non-Encore frameworks scored around 20% on these checks after Run 1. Their tests passed because the functional suite wasn't checking whether the code was actually shippable—it was just confirming HTTP endpoints responded correctly. The Postgres polling pattern works fine in tests running against a single instance but falls apart immediately in production: if a message fails to process, it retries forever and blocks everything behind it; three replicas mean the cron fires three times per tick unless it's idempotent (and most aren't). CREATE TABLE IF NOT EXISTS at boot leaves you with no migration history for rollbacks.
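The "retry policy with a dead-letter queue" check is easier to picture with a toy model. The sketch below is an in-memory stand-in, not the code any of the agents wrote: BoundedRetryQueue, maxAttempts, and deadLetters are hypothetical names, and handlers are synchronous to keep it short. It shows the two properties the polling pattern lacked: a failing message goes to the back instead of blocking the line, and after a fixed number of attempts it is parked instead of retried forever.

```typescript
interface QueuedMessage<T> {
  id: number;
  payload: T;
  attempts: number;
}

// In-memory stand-in for a queue table (assumption: a real one lives in Postgres or a broker).
class BoundedRetryQueue<T> {
  private pending: QueuedMessage<T>[] = [];
  private nextId = 1;
  private maxAttempts: number;
  readonly deadLetters: QueuedMessage<T>[] = [];

  constructor(maxAttempts: number) {
    this.maxAttempts = maxAttempts;
  }

  publish(payload: T): void {
    this.pending.push({ id: this.nextId++, payload, attempts: 0 });
  }

  // Process the message at the head of the queue once.
  // Returns false when there is nothing left to process.
  tick(handler: (payload: T) => void): boolean {
    const msg = this.pending.shift();
    if (msg === undefined) return false;
    msg.attempts += 1;
    try {
      handler(msg.payload);
    } catch (err) {
      if (msg.attempts >= this.maxAttempts) {
        this.deadLetters.push(msg); // give up: park it for inspection
      } else {
        this.pending.push(msg); // retry later, behind the other messages
      }
    }
    return true;
  }
}
```

With maxAttempts set to 3 and a queue holding one poison message followed by two healthy ones, both healthy messages are processed and the poison message ends up in deadLetters with attempts equal to 3. A failed-message endpoint, the fourth check on the rubric, would expose that deadLetters list over HTTP.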

Run 2: Pre-Installing Libraries Made Things Worse

If the gap was just that the agent didn't know which libraries to use, the fix seemed obvious: put production-grade libraries in the project. Encore's team pre-installed pg-boss for durable jobs and cron, drizzle-kit or typeorm for migrations, and pino for structured logs, along with a README explaining what each library did. Same harness, same model, same prompts. Every framework except Encore regressed. None of them landed a single first-try-green run across three repeats. Task 1 was largely fine; the agent fell apart on Tasks 2 and 3, where it had to integrate the new libraries with code it inherited from Task 1.

Why Libraries Didn't Save It

The specific failure patterns were telling. Express and Fastify both wrote pg-boss code that registered scheduled jobs without first calling boss.createQueue('name')—a requirement added in pg-boss v10 that the agent didn't know about, so their servers crashed at boot with "Queue daily-aggregation not found." NestJS imported pg-boss into a NotificationsService but forgot to register the wrapping PgBossService in the module's providers array. Hono spread its failures across all four pre-installed libraries. The pattern was consistent: the agent reached for the right library and couldn't land the integration cleanly under the 80-turn budget the linked tasks imposed.
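The Express and Fastify crash comes down to call order. The sketch below does not use pg-boss at all: FakeBoss is a hypothetical in-memory stand-in whose method names echo the library and whose signatures are simplified assumptions. It models only the contract described above, where scheduling onto a queue that was never created fails with a "not found" error.

```typescript
// Hypothetical stand-in, NOT the pg-boss API: it only mimics the
// "queue must exist before you schedule onto it" rule described in the article.
class FakeBoss {
  private queues = new Set<string>();
  readonly schedules = new Map<string, string>();

  createQueue(name: string): void {
    this.queues.add(name);
  }

  schedule(name: string, cron: string): void {
    if (!this.queues.has(name)) {
      throw new Error(`Queue ${name} not found`);
    }
    this.schedules.set(name, cron);
  }
}

// The ordering the agent missed: create the queue first, then schedule onto it.
function registerDailyAggregation(boss: FakeBoss): void {
  const queueName = "daily-aggregation";
  boss.createQueue(queueName);
  boss.schedule(queueName, "0 0 * * *"); // cron expression for daily at midnight
}
```

Calling schedule on a fresh FakeBoss without createQueue throws the same kind of error that took the servers down at boot; registerDailyAggregation succeeds because of one extra line.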

Run 3: Writing the Rubric Into Tests Finally Moved the Needle

For the final run, Encore rewrote Task 3's prompt so production-readiness became part of what the test suite graded directly. Each check got an automated probe backing it, and the per-task turn cap jumped from 80 to 200 in a "Ralph Wiggum loop" spirit: keep iterating until the integration actually holds together. Fastify emerged as the cleanest non-Encore result: pg-boss for pub/sub and cron, drizzle-kit for migrations, pino for logs, green on every check at $4.60 against Encore's $2.58. Express came one test short because it kept CREATE TABLE IF NOT EXISTS even though knex was already in the project. Hono and NestJS regressed further, finishing at 29/36 and 30/36 respectively: tracing failed on both, and NestJS additionally shipped a TypeScript error that broke the typecheck probe.
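Turning a rubric into something a test suite can grade is mostly bookkeeping. As a rough sketch, assuming each check reduces to a probe that returns pass or fail, scoring looks like this. The five check names come from the benchmark; ProbeFn, RubricCheck, and scoreRubric are hypothetical, and the probes are synchronous here, whereas real ones would presumably hit the running server.

```typescript
// Hypothetical probe type: returns true when the check passes.
type ProbeFn = () => boolean;

// The five production-readiness checks named in the benchmark.
const RUBRIC_CHECKS = [
  "versioned migrations",
  "multi-instance-safe cron",
  "retry policy with dead-letter queue",
  "failed-message endpoint",
  "structured logging",
] as const;

type RubricCheck = (typeof RUBRIC_CHECKS)[number];

// Run every probe and report which checks passed plus a rounded percentage.
function scoreRubric(
  probes: Record<RubricCheck, ProbeFn>,
): { passed: RubricCheck[]; percent: number } {
  const passed = RUBRIC_CHECKS.filter((check) => probes[check]());
  const percent = Math.round((passed.length / RUBRIC_CHECKS.length) * 100);
  return { passed, percent };
}
```

One passing probe out of five scores 20, and all five score 100, which lines up with the roughly 20% and 100% figures reported for Run 1.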

Why Encore Stands Apart

The frameworks ship very different amounts of AI-agent-facing material today. Encore ships CLAUDE.md via encore llm-rules init, an MCP server (encore mcp start), llms.txt plus llms-full.txt, and a dedicated AI-integration docs page. Hono publishes only llms.txt. Express, Fastify, and NestJS ship none of the four. But it's not just documentation: Encore's primitives already encode production-readiness at the platform level. When you declare new Topic with deliveryGuarantee: "at-least-once", you get a real durable topic with retries and a DLQ configured automatically. When you declare new CronJob, an external scheduler invokes it once across the fleet rather than once per replica. The agent reaches production-readiness as a side effect of using the framework correctly.
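The "once across the fleet rather than once per replica" distinction can be simulated in a few lines. This is not how Encore implements it (the article says an external scheduler does the invoking). It is a sketch of one common alternative, assuming a shared store in which each replica tries to claim a tick before running the job. TickClaims and runTick are hypothetical names.

```typescript
// Shared store standing in for something every replica can see, for example a
// table with a unique constraint on (job, tick). Purely illustrative.
class TickClaims {
  private claimed = new Set<string>();

  // Returns true only for the first caller to claim a given job and tick.
  tryClaim(job: string, tick: string): boolean {
    const key = `${job}@${tick}`;
    if (this.claimed.has(key)) return false;
    this.claimed.add(key);
    return true;
  }
}

// Simulate `replicas` instances all waking up for the same tick.
// Passing null for `claims` models in-process cron with no coordination.
function runTick(replicas: number, tick: string, claims: TickClaims | null): number {
  let runs = 0;
  for (let i = 0; i < replicas; i++) {
    const allowed = claims === null ? true : claims.tryClaim("daily-aggregation", tick);
    if (allowed) runs += 1; // the job body would execute here
  }
  return runs;
}
```

With three replicas and no coordination, runTick reports three executions for a single tick; with a shared TickClaims instance it reports one.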

Key Takeaways

  • AI coding agents will build the laziest solution that passes your tests, not the one that's production-ready
  • Pre-installing libraries without explicit integration guidance actually hurts more than it helps—agents can't wire them together cleanly under tight turn budgets
  • When you write production-readiness into the test suite and let agents iterate against it, results improve dramatically (Fastify hit 100%) but cost more ($4.60 vs $2.58 for Encore)
  • Frameworks that encode production guarantees in primitives outperform those that delegate to library stacks—Encore's agent reached 100% production-readiness in Run 1 without being asked

The Bottom Line

This benchmark is a wake-up call for teams betting on AI coding agents: passing tests and shipping production-ready code are two completely different problems, and your test suite won't tell you the difference unless you explicitly grade for it. If you're using Express, Fastify, Hono, or NestJS with an AI agent in 2026, you'd better have a rigorous rubric baked into CI—or you'll ship Postgres polling on setInterval and wonder why your durable queue eats poison messages in production.