A new open-source evaluation framework called HermesBench just dropped on Hacker News, and it takes a different angle than typical AI benchmarks. Rather than measuring raw model performance, it evaluates complete personal agent configurations: the whole stack of prompts, tools, memory systems, gateway behavior, delegation patterns, and safety guardrails, along with the latency and stability of the resulting agent. The current public baseline sits at 78.2 across 27 workflow recipes, with redacted traces available for inspection.

Why This Matters

Most AI benchmarks focus on base model capabilities in isolation—how well does GPT-4 or Claude reason through problems? But if you're running a personal AI agent in production, the model is just one piece of the puzzle. Your prompt engineering, tool definitions, AgentSkills configuration, and memory architecture all dramatically affect real-world reliability. HermesBench explicitly tries to measure that whole system.

Evidence First, Limits Visible

The project takes a refreshingly transparent approach. Every published result links back to scenario definitions, public score axes, driver closure decisions, deterministic checks, and redacted trace timelines. The site is deliberately upfront that this is one early baseline—not a base-model leaderboard. They're not trying to crown a winner; they're building infrastructure for the community to compare configurations.
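One way to picture that traceability is as a record in which every evidence link is mandatory. The sketch below is an illustration only: the PublishedResult class, its field names, and the example paths are hypothetical and are not taken from HermesBench's actual schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PublishedResult:
    """Hypothetical result record; field names are illustrative, not HermesBench's schema."""
    recipe_id: str
    scenario_definition: str   # link to the scenario definition
    score_axes: str            # link to the public score axes
    driver_closure: str        # link to the driver closure decision
    deterministic_checks: str  # link to the deterministic check results
    redacted_trace: str        # link to the redacted trace timeline


def missing_evidence(result: PublishedResult) -> list[str]:
    """Return the names of evidence fields that are empty or whitespace-only."""
    return [
        f.name
        for f in fields(result)
        if f.name != "recipe_id" and not getattr(result, f.name).strip()
    ]


# Made-up example: every link is present except the trace timeline.
example = PublishedResult(
    recipe_id="example-recipe",
    scenario_definition="scenarios/example-recipe.md",
    score_axes="docs/score-axes.md",
    driver_closure="results/example-recipe/closure.json",
    deterministic_checks="results/example-recipe/checks.json",
    redacted_trace="",
)
print(missing_evidence(example))
```

A gate like this would refuse to publish a score that cannot be traced back to its evidence, which is the property the site emphasizes.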

Scoring Philosophy

HermesBench judges agents across five axes: outcome reached, evidence/truthfulness, runtime/scope safety, responsiveness, and communication quality. The philosophy is explicit: an agent that falls short on any one of these is penalized, however strong it is elsewhere. 'A personal agent that is capable but unsafe, safe but unhelpful, or correct but unusably slow is not actually good,' the documentation states flatly.
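To make the "lopsided is penalized" idea concrete, here is a minimal sketch of one aggregation that behaves this way: a geometric mean over the five axes. This is an assumption chosen for illustration, not HermesBench's published formula; the AxisScores class, the 0 to 1 scale per axis, and the combined_score function are all hypothetical.

```python
import math
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class AxisScores:
    """Per-axis scores in [0, 1]; the scale is an assumption for this sketch."""
    outcome: float
    evidence: float
    safety: float
    responsiveness: float
    communication: float


def combined_score(scores: AxisScores) -> float:
    """Geometric mean of the five axes, scaled to 0-100 (hypothetical aggregation)."""
    values = astuple(scores)
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("each axis score must lie in [0, 1]")
    return 100.0 * math.prod(values) ** (1.0 / len(values))


balanced = AxisScores(0.8, 0.8, 0.8, 0.8, 0.8)
unsafe = AxisScores(1.0, 1.0, 0.2, 1.0, 1.0)  # strong everywhere except safety

# The unsafe profile has the higher arithmetic mean (0.84 vs 0.80),
# yet the geometric mean ranks it below the balanced profile.
print(round(combined_score(balanced), 1), round(combined_score(unsafe), 1))
```

With a plain average, one perfect axis can hide a failing one. A geometric mean lets the weakest axis pull the total down, which matches the stated intent that a capable but unsafe agent should not score well.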

Coverage Model

The recipe catalog spans everyday personal-agent work across five categories: Personal Core (context, calendar), Communications, Ambient and Travel, Private Sensitive (finance), and Power-User Optional integrations. The current 27 recipes represent a starting point—the team is actively soliciting community contributions for new use cases.
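A catalog organized this way lends itself to per-category reporting. The sketch below groups recipe scores by the five categories named above and averages them. The category_means function, the recipe identifiers, and the scores are placeholders invented for illustration; they are not real HermesBench recipes or results, and the project may aggregate differently.

```python
from collections import defaultdict
from statistics import mean

# Category names as listed in the article.
CATEGORIES = (
    "Personal Core",
    "Communications",
    "Ambient and Travel",
    "Private Sensitive",
    "Power-User Optional",
)


def category_means(results):
    """Map each category that has results to the mean of its recipe scores.

    `results` is an iterable of (recipe_id, category, score) tuples.
    """
    grouped = defaultdict(list)
    for recipe_id, category, score in results:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category for {recipe_id}: {category}")
        grouped[category].append(score)
    # Iterate over CATEGORIES so the output keeps the catalog's order.
    return {c: mean(grouped[c]) for c in CATEGORIES if c in grouped}


# Placeholder data, not real benchmark results.
sample = [
    ("recipe-a", "Personal Core", 80.0),
    ("recipe-b", "Personal Core", 70.0),
    ("recipe-c", "Communications", 90.0),
]
print(category_means(sample))
```

Breaking a single headline number down by category shows where a configuration is weak, for example strong on calendar tasks but weak on sensitive finance workflows.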

Agent-Driven Quick Start

Running HermesBench is intentionally simple. Users copy a prompt into Codex, Claude, or another coding agent, which loads the HermesBench skill and drives one scenario recipe. Full bundle runs are opt-in because they take longer and cost more. The workflow packages redacted profile snapshots and score evidence automatically—no manual spreadsheet wrangling required.

Key Takeaways

  • Baseline of 78.2 across 27 recipes establishes an early reliability benchmark for personal agents
  • Evaluates the full stack (prompts, tools, memory, safety, latency), not just model capability
  • Transparent methodology: every published result links to inspectable, redacted traces
  • Five-axis scoring penalizes lopsided configurations, such as an agent that is capable but unsafe
  • Agent-driven workflow makes running evals accessible without custom tooling

The Bottom Line

HermesBench fills a real gap in the personal AI agent ecosystem by measuring end-to-end reliability rather than chasing synthetic benchmark scores. Its traction will depend on whether the community contributes recipes and profiles, but the evidence-first philosophy is exactly what this space needs right now.