Autonomous coding agents like Claude Code, Codex, Cursor, and Cline are running shell commands, editing files, and making network calls at machine speed—often while you're not watching. One bad rm -rf, an accidental git push of secrets, or a poisoned README that tricks your agent into exfiltrating credentials can turn your codebase into a disaster. Cerberus, a new open-source project from developer Adir Dabush, puts a checkpoint directly on the tool boundary where agents actually act—intercepting every call before it runs, risk-scoring it across four signals, and giving you control over whether it proceeds.

The Core Problem With Autonomous Agents

The fundamental issue isn't that AI coding assistants are malicious; it's that they're powerful and fast. They execute commands without hesitation when given a task, which is great for productivity until something goes sideways. A destructive command like chmod 777 or kill -9, an outbound call to a paste site instead of GitHub, or a secret loaded into context and then sent out in base64-encoded form: these are the failure modes nobody talks about at conferences. Cerberus addresses this by sitting between the agent and your machine, intercepting every tool invocation through a PreToolUse hook that can allow the call, audit it, hold it for human-in-the-loop (HITL) approval, or block it outright.
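To make the four outcomes concrete, here is a minimal sketch of how a verdict might be translated into a PreToolUse hook response. The verdict names come from the description above. The output field names (hookSpecificOutput, permissionDecision, permissionDecisionReason) reflect Claude Code's hook JSON as best understood here and should be checked against the current hooks reference. The mapping itself and the function name toHookOutput are assumptions for illustration, not Cerberus's actual adapter code.

```typescript
// The four outcomes described in the article.
type Verdict = "allow" | "audit" | "hold" | "block";

// Permission decisions a PreToolUse hook can return (assumed shape, verify against the docs).
type PermissionDecision = "allow" | "ask" | "deny";

interface PreToolUseOutput {
  hookSpecificOutput: {
    hookEventName: "PreToolUse";
    permissionDecision: PermissionDecision;
    permissionDecisionReason: string;
  };
}

// Hypothetical mapping: "audit" lets the call run (it would be logged elsewhere),
// "hold" defers to the human via the agent's own permission prompt.
const VERDICT_TO_PERMISSION: Record<Verdict, PermissionDecision> = {
  allow: "allow",
  audit: "allow",
  hold: "ask",
  block: "deny",
};

function toHookOutput(verdict: Verdict, reason: string): PreToolUseOutput {
  return {
    hookSpecificOutput: {
      hookEventName: "PreToolUse",
      permissionDecision: VERDICT_TO_PERMISSION[verdict],
      permissionDecisionReason: reason,
    },
  };
}
```

A hook script built on this would read the tool call as JSON from stdin, compute a verdict, and print JSON.stringify(toHookOutput(verdict, reason)) to stdout.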

Four Signals, One Risk Score

Cerberus aggregates four deterministic signals into a weighted risk score: Policy (destination hosts, sensitive paths like ~/.ssh and ~/.aws), Behavioral (runaway loops, tool-call repetition rates), Content (secret detection in outbound payloads), and Injection (a heuristic, or an optional DeBERTa classifier, that flags malicious tool results). Beneath the score sits a hard floor: absolute prohibitions, like blocking rm -rf entirely, that no score or weighting can override. The system is agent-agnostic at its core; only thin per-agent adapters differ between Claude Code, Codex, Cursor, and Cline. For Claude Code specifically, held calls surface in the native permission prompt with Cerberus's reasoning, so no dashboard is required.
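The scoring idea can be sketched in a few lines. Everything specific here is an assumption for illustration: the weights, the thresholds, the single hard-floor pattern, and the names riskScore and decide are hypothetical, and Cerberus's real rules live in its YAML files. The point is only the ordering: the hard floor is checked first, so no combination of weights can let a prohibited command through.

```typescript
type Decision = "allow" | "audit" | "hold" | "block";

// Each signal is normalized to the range 0..1 in this sketch.
interface Signals {
  policy: number;
  behavioral: number;
  content: number;
  injection: number;
}

// Hypothetical weights; they sum to 1 so the score also stays in 0..1.
const WEIGHTS: Signals = { policy: 0.35, behavioral: 0.15, content: 0.3, injection: 0.2 };

// Hypothetical hard-floor patterns: matched commands are always blocked.
const HARD_FLOOR: RegExp[] = [/\brm\s+-rf\b/];

function riskScore(s: Signals): number {
  return (
    s.policy * WEIGHTS.policy +
    s.behavioral * WEIGHTS.behavioral +
    s.content * WEIGHTS.content +
    s.injection * WEIGHTS.injection
  );
}

function decide(command: string, s: Signals): Decision {
  // The floor is evaluated before the score, so weights cannot override it.
  if (HARD_FLOOR.some((re) => re.test(command))) return "block";

  const score = riskScore(s);
  if (score >= 0.8) return "block";
  if (score >= 0.5) return "hold";
  if (score >= 0.2) return "audit";
  return "allow";
}
```

With these made-up numbers, a call that trips both the Policy and Content signals at full strength lands in the hold band, while rm -rf is blocked even when every signal reads zero.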

Secret Exfiltration Prevention

The most impressive feature is how Cerberus handles secret detection without ever storing or logging the secrets themselves. It identifies secrets loaded into context and records their provenance (source file, line number, SHA-256 hash), then content-matches against outbound payloads to catch the actual exfiltration attempt, holding calls that carry raw, base64, hex, or URL-encoded keys. The match includes confidence scores and path information but never stores or logs the secret value itself. This is a legitimately hard problem solved elegantly: you get protection without turning your monitoring tool into another place where secrets can leak.
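A rough sketch of the matching step follows, assuming the secret is held in memory only for comparison and that the reported match carries just provenance and a hash. The type and function names (TrackedSecret, ExfilMatch, findExfiltration) are hypothetical and this is not Cerberus's actual detector; it uses plain substring checks, so it would miss uppercase hex or a secret embedded at an unaligned offset inside a larger base64 blob.

```typescript
import { Buffer } from "node:buffer";
import { createHash } from "node:crypto";

type Encoding = "raw" | "base64" | "hex" | "url";

interface TrackedSecret {
  value: string; // kept in memory for matching only; never copied into a match
  sourceFile: string;
  line: number;
}

interface ExfilMatch {
  encoding: Encoding;
  sourceFile: string;
  line: number;
  sha256: string; // hash of the secret, safe to log
}

// Produce the encoded forms of a secret that the payload is searched for.
function encodedForms(value: string): Array<[Encoding, string]> {
  const buf = Buffer.from(value, "utf8");
  return [
    ["raw", value],
    ["base64", buf.toString("base64")],
    ["hex", buf.toString("hex")],
    ["url", encodeURIComponent(value)],
  ];
}

function findExfiltration(payload: string, secrets: TrackedSecret[]): ExfilMatch[] {
  const matches: ExfilMatch[] = [];
  for (const secret of secrets) {
    for (const [encoding, needle] of encodedForms(secret.value)) {
      if (payload.includes(needle)) {
        matches.push({
          encoding,
          sourceFile: secret.sourceFile,
          line: secret.line,
          sha256: createHash("sha256").update(secret.value).digest("hex"),
        });
        break; // report only the first encoding that matches for this secret
      }
    }
  }
  return matches;
}
```

Because the match object contains only the hash and the file location, it can be written to an audit log or shown in a prompt without exposing the key.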

Architecture and Installation

The stack is Node + TypeScript with a Vite/React dashboard served by the engine itself. Install via npm i -g @cerberussec/core, wire it into your agent with cerberus init (the step is backed up and idempotent, so it is safe to re-run), then run cerberus engine to start both the gateway and dashboard at http://127.0.0.1:9000/. Rules and risk weights are editable YAML in a rules/ directory rather than code, so policy changes don't require recompilation. The core uses Apache/MIT-compatible dependencies with no external API calls or telemetry. An optional local DeBERTa injection model (@cerberussec/injection-model, ProtectAI, Apache-2.0) upgrades the built-in heuristic classifier if you want stronger prompt injection detection.

What Cerberus Doesn't Cover

The project is honest about its limitations: because it sees tool calls rather than the LLM's internal prompt, it catches the exploitation of a prompt injection (the egress call that would leak data) but not the injection itself. It doesn't cover data-pipeline or RAG poisoning scenarios. The exfiltration match is high-confidence, but the project acknowledges it's not airtight: novel secret formats or an encoding split across several calls could slip through. These aren't dealbreakers; they're honest defaults over false guarantees, which is refreshing in a space where everyone oversells AI safety.

The Bottom Line

Cerberus fills a gap that's been staring us in the face: autonomous agents have been running wild on developer machines for two years, and we've had nothing except hope. This isn't a silver bullet—prompt injection remains a fundamentally unsolved problem—but it's the first tool that actually puts you in the loop at the exact moment when intervention matters. If you're running Claude Code or similar agents in production or on sensitive codebases, this is not optional. It's infrastructure.