Jean-Michel Lemieux, a developer at spellbook.com and former Linear employee, received six unsolicited sales emails from Linear's own AI outreach system. All six were addressed to the wrong company name, and all went to someone who was already a customer. When he called it out publicly on social media, Linear CEO Karri Saarinen acknowledged the failure within hours, calling it "the dumbest thing." It's a spectacular own goal that makes for good Twitter fodder, but according to new research from Tenure, the incident exposes something far more troubling about how we evaluate AI agents: most evaluations would miss exactly what went wrong.
The Wrong Lesson
Most observers see this as a content quality problem. The company name was wrong. The personalization was off. Six emails is absurd. Easy to laugh at the AI slop and move on. But that framing lets the system off too easily. The embarrassing part is the email everyone saw. The actual failure happened upstream, when the system decided it was allowed to send without proving any of the facts that decision depended on. Does this person match the company? Are they already a customer? Has contact history exceeded the threshold for suppression? If those checks are wrong or missing, swapping in GPT-5 doesn't fix anything; it just writes a cleaner version of the wrong action.
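To make those three questions concrete, here is a minimal Python sketch of a send gate. Everything in it is hypothetical: the `Recipient` fields, the `failed_preconditions` function, and the `MAX_PRIOR_CONTACTS` threshold are illustrative assumptions, not anything Linear or Tenure has described. The point is only that an unverified fact blocks the send just as firmly as a failed one.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical suppression threshold; the article does not state a real one.
MAX_PRIOR_CONTACTS = 2


@dataclass
class Recipient:
    email: str
    target_company_domain: str           # domain of the company the email is addressed to
    is_active_customer: Optional[bool]   # None means the CRM was never queried
    prior_contacts: Optional[int]        # None means contact history was never fetched


def failed_preconditions(r: Recipient) -> List[str]:
    """Return the reasons sending is NOT allowed; an empty list means every check passed."""
    failures = []

    # Does this person match the company?
    email_domain = r.email.rsplit("@", 1)[-1].lower()
    if email_domain != r.target_company_domain.lower():
        failures.append("email domain does not match target company")

    # Are they already a customer? An unknown answer counts as a failure.
    if r.is_active_customer is None:
        failures.append("customer status not verified")
    elif r.is_active_customer:
        failures.append("recipient is already a customer")

    # Has contact history reached the suppression threshold?
    if r.prior_contacts is None:
        failures.append("contact history not verified")
    elif r.prior_contacts >= MAX_PRIOR_CONTACTS:
        failures.append("contact history exceeds suppression threshold")

    return failures


# A case shaped like the one in the article: wrong company, existing customer, five prior emails.
incident = Recipient("dev@wrongco.example", "rightco.example", True, 5)
clean = Recipient("pat@rightco.example", "rightco.example", False, 0)
unverified = Recipient("pat@rightco.example", "rightco.example", None, None)
```

In this sketch the incident-shaped recipient fails all three checks, and the recipient whose status was never looked up is blocked too. No model upgrade changes either outcome.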
What Most Evals Miss
Conventional AI evaluations grade the final output: Is it polite? Personalized? On brand? Free of hallucinations? Those questions matter, but they all come after the action has already been approved. In the Linear case, the email could score perfectly on every one of those dimensions and still be completely wrong. A polished draft doesn't reveal whether the system checked customer status against CRM records, verified that the recipient's email domain matched the target company, or confirmed that suppression rules weren't blocking outreach. GroundEval, the framework proposed in Tenure's analysis, inverts that order: it evaluates what the agent verified before acting, not just how the output reads. The key question becomes: did the agent earn the right to act?
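One way to picture the difference is a grader that never reads the email at all and scores only the trace of what happened before the send. The sketch below is an illustration of that idea, not GroundEval's actual interface, which this article does not describe. The `TraceEvent` shape, the check names in `REQUIRED_CHECKS`, and `grade_evidence_path` are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Set

# Hypothetical names for the three checks discussed above.
REQUIRED_CHECKS = frozenset({"customer_status", "domain_match", "suppression_rules"})


@dataclass
class TraceEvent:
    kind: str            # "check" or "action"
    name: str            # e.g. "customer_status" or "send_email"
    passed: bool = True  # only meaningful for checks


@dataclass
class Grade:
    earned_right_to_act: bool
    missing_checks: Set[str]
    failed_checks: Set[str]


def grade_evidence_path(trace: List[TraceEvent], action: str = "send_email") -> Grade:
    """Grade what was verified BEFORE the action; checks run afterwards do not count."""
    passed: Set[str] = set()
    failed: Set[str] = set()
    for event in trace:
        if event.kind == "action" and event.name == action:
            break  # everything after the send is too late to justify it
        if event.kind == "check":
            (passed if event.passed else failed).add(event.name)
    missing = set(REQUIRED_CHECKS) - passed - failed
    return Grade(not missing and not failed, missing, failed)


# The agent checked the domain, sent, and only then looked up customer status.
late_trace = [
    TraceEvent("check", "domain_match"),
    TraceEvent("action", "send_email"),
    TraceEvent("check", "customer_status"),
]
late_grade = grade_evidence_path(late_trace)

# Every required check passed before the send.
full_trace = [
    TraceEvent("check", "customer_status"),
    TraceEvent("check", "domain_match"),
    TraceEvent("check", "suppression_rules"),
    TraceEvent("action", "send_email"),
]
full_grade = grade_evidence_path(full_trace)
```

The first trace fails no matter how good the resulting email was, because two required checks had not completed when the send happened. An output-only rubric has no way to see that.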
Preconditions, Not Just Approvals
The standard fix for risky automation is adding human review before send. But that only works if reviewers can see more than the generated text. A polished draft hides the account record showing active customer status. It hides the email domain pointing elsewhere. It hides five previous outreach attempts and a product already installed in the recipient's stack. Without trace visibility, approval becomes a nicer-looking version of the same problem: reviewers judge the artifact, not the decision that produced it. The real fix is preconditions. Agents need to show which checks were required, which systems were queried, which records were retrieved, and which rules allowed the action to proceed. If those checks are missing, generation should never start.
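A rough sketch of that ordering in Python: evidence is collected first, generation runs only if every required check has a retrieved record and a rule that permits the action, and the reviewer receives the evidence alongside the draft. The `CheckEvidence` fields, `PreconditionError`, and `draft_with_evidence` are hypothetical names chosen for illustration, and the system names and record IDs in the usage lines are made up.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class CheckEvidence:
    check: str                # which check was required
    system_queried: str       # which system answered it
    record_id: Optional[str]  # which record was retrieved; None if nothing came back
    rule: str                 # which rule was applied to that record
    allows_action: bool       # whether the rule lets the action proceed


class PreconditionError(Exception):
    """Raised when drafting is attempted without complete, permitting evidence."""


def draft_with_evidence(
    required: List[str],
    evidence: List[CheckEvidence],
    generate: Callable[[], str],
) -> Dict[str, object]:
    by_name = {e.check: e for e in evidence}
    missing = [c for c in required if c not in by_name or by_name[c].record_id is None]
    blocked = [c for c in required if c not in missing and not by_name[c].allows_action]
    if missing or blocked:
        # Refuse before the model is called, so no polished draft ever exists.
        raise PreconditionError(f"missing={missing} blocked={blocked}")
    # The reviewer gets the draft together with the evidence that justified it.
    return {"draft": generate(), "evidence": evidence}


calls: List[str] = []


def fake_generate() -> str:
    calls.append("generated")  # stands in for a model call
    return "Hello from sales"


required = ["customer_status", "domain_match"]
complete = [
    CheckEvidence("customer_status", "crm", "acct-1", "not an active customer", True),
    CheckEvidence("domain_match", "crm", "contact-9", "email domain equals company domain", True),
]
packet = draft_with_evidence(required, complete, fake_generate)
```

Raising before `generate` is called is the design choice that matters here. A reviewer cannot be swayed by a clean draft that was never written.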
Key Takeaways
- AI outreach failures like Linear's aren't generation problems; they're validation failures that happened before any message existed
- Standard evaluations test final output quality but miss whether pre-action state checks were completed correctly
- GroundEval proposes evaluating the evidence path: what was searched, fetched, and verified before acting
- Human-in-the-loop review only helps if reviewers can access trace data showing upstream system state
The Bottom Line
The AI industry keeps obsessing over model output quality while ignoring the validation layer that determines whether actions should happen at all. Linear's sales email debacle isn't an anomaly; it's a preview of what happens when agents start acting in production without proving they checked their preconditions first. Until evaluation frameworks catch up to agentic workflows, expect more companies to discover that their AI sent embarrassing emails to existing customers six times before anyone noticed.