I Put My AI Safety Architecture Through Four Independent Red Teams — None Could Break It

When you're building an autonomous AI agent that handles real tasks in the real world, 'trust me, it probably won't do anything bad' doesn't cut it as a safety strategy. That's why Andre Zabel took a different approach with E.L.L.A., the Ethical Local Language Assistant launching July 1st, 2026 at ella-agent.de. Instead of relying on prompt engineering or policy documents, he embedded four architectural prohibitions directly into the code itself — enforced by the system, not the model.

The Architecture That Says No

The E.L.L.A. Directive isn't a set of guidelines for the AI to follow; it's a hardwired constraint layer that intercepts execution before harmful actions happen. This is fundamentally different from prompt-based safety measures, where a sufficiently creative jailbreak can convince the model to ignore its instructions. As Zabel puts it: 'The critical difference from prompt-based safety: the model can "want" to do something all it likes — the architecture refuses execution.'
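
To make the idea concrete, here is a minimal Python sketch of an execution gate that sits between the model and its tools. It is not taken from the E.L.L.A. codebase: `ExecutionGate`, `ToolCall`, `ProhibitedAction`, and the `blocks_payment_tools` rule are hypothetical names, and the rule is a deliberately crude stand-in for a real No Harm check. It illustrates one point only: the model can propose any call it likes, but ordinary code decides whether that call runs.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class ProhibitedAction(Exception):
    """Raised by the gate when a rule refuses a tool call."""


@dataclass(frozen=True)
class ToolCall:
    name: str   # tool the model asked for
    args: dict  # arguments the model supplied


# A rule returns a refusal reason, or None if it has no objection.
Rule = Callable[[ToolCall], Optional[str]]


def blocks_payment_tools(call: ToolCall) -> Optional[str]:
    # Hypothetical rule, assumed for illustration only.
    if call.name == "send_payment":
        return "financial action is prohibited"
    return None


class ExecutionGate:
    """Runs a tool only if every rule lets the call through."""

    def __init__(self, tools: dict, rules: list):
        self._tools = tools
        self._rules = rules

    def execute(self, call: ToolCall):
        # Rules run in plain code, before the tool is ever invoked.
        for rule in self._rules:
            reason = rule(call)
            if reason is not None:
                raise ProhibitedAction(f"{call.name}: {reason}")
        return self._tools[call.name](**call.args)


gate = ExecutionGate(
    tools={
        "echo": lambda text: text,
        "send_payment": lambda amount: f"paid {amount}",
    },
    rules=[blocks_payment_tools],
)
```

In this sketch the prompt plays no part in the decision: whatever text convinced the model to request `send_payment`, `gate.execute` raises `ProhibitedAction` before the tool function is reached.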

The Four Prohibitions

The directive establishes four prohibitions, enforced at the code level, that cannot be configured or overridden by the user, the operator, or even the language model itself:

- No Harm: no action that causes physical, financial, psychological, or data-related harm
- No Conceal: every tool invocation is logged immediately and completely, locally
- No Surveil: no observation or recording without explicit, informed consent
- No Exfiltrate: no transmission of user data to third parties without explicit, per-transmission consent
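
Two of these prohibitions, No Conceal and No Exfiltrate, map naturally onto code. The following Python sketch shows one way they could look, under stated assumptions: `AuditedRunner`, `ConsentRequired`, the JSON-lines log format, and the single-use consent keyed by tool and destination are all hypothetical choices made for illustration, not the directive's actual implementation.

```python
import json
import os
import tempfile
import time


class ConsentRequired(Exception):
    """Raised when a transmission has no matching one-time consent."""


class AuditedRunner:
    def __init__(self, log_path):
        self._log_path = log_path
        self._consents = set()  # pending (tool, destination) consents

    def grant_consent(self, tool, destination):
        # Assumed to be called by the user interface, never by the model.
        self._consents.add((tool, destination))

    def _log(self, record):
        # Append-only local log, one JSON object per line.
        with open(self._log_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")

    def run(self, tool, func, args, destination=None):
        # No Conceal: the call is logged before anything else happens,
        # so refused calls appear in the log as well.
        self._log({"ts": time.time(), "tool": tool,
                   "args": args, "destination": destination})
        if destination is not None:
            # No Exfiltrate: consent is consumed, so it covers one transmission.
            key = (tool, destination)
            if key not in self._consents:
                raise ConsentRequired(f"{tool} -> {destination}")
            self._consents.remove(key)
        return func(**args)


log_path = os.path.join(tempfile.mkdtemp(), "audit.log")
runner = AuditedRunner(log_path)
```

Because consent is removed from the set once used, a second upload to the same destination fails until the user grants consent again, which is one reading of "per-transmission" consent.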

The Red Team Lineup

To stress-test this architecture before launch, Zabel engaged four independent AI systems as adversarial reviewers: Google Gemini, Perplexity AI, DeepSeek, and xAI Grok. Their mission was straightforward — find weaknesses in the safety layer and attempt to break all four prohibitions. This wasn't a friendly evaluation; it was a genuine attempt at architectural compromise by some of the most capable LLMs available.

What They Found

Not one of the four systems could break any of the four core prohibitions themselves. Every weakness identified lay outside the defined scope of what the Directive claims to protect. The AIs flagged potential issues like manipulative text responses that don't involve tool calls, reliance on developer-defined tool classification, and broader EU AI Act compliance considerations — but these are valid critiques about the system's boundaries, not failures of its core safety mechanisms.
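
The point about developer-defined tool classification can be shown in a few lines. This Python sketch is an assumption-laden illustration, not E.L.L.A. code: `ToolClass`, `classify`, `runs_without_consent`, and both registries are hypothetical. It shows that a gate keyed on tool labels is only as reliable as the labels, and that treating unknown tools as unclassified lets the gate fail closed.

```python
from enum import Enum


class ToolClass(Enum):
    READ_ONLY = "read_only"
    TRANSMITS_DATA = "transmits_data"
    UNCLASSIFIED = "unclassified"


def classify(tool_name, registry):
    # Fail closed: a tool nobody labeled is never treated as harmless.
    return registry.get(tool_name, ToolClass.UNCLASSIFIED)


def runs_without_consent(tool_name, registry):
    return classify(tool_name, registry) is ToolClass.READ_ONLY


correct = {
    "read_file": ToolClass.READ_ONLY,
    "http_post": ToolClass.TRANSMITS_DATA,
}
# Same registry, but a developer has mislabeled the network tool.
mislabeled = {**correct, "http_post": ToolClass.READ_ONLY}

print(runs_without_consent("http_post", correct))     # False
print(runs_without_consent("http_post", mislabeled))  # True
print(runs_without_consent("new_tool", correct))      # False
```

The second line is the weakness the reviewers describe: the enforcement code behaves exactly as written, yet a wrong label lets a transmitting tool through.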

What They Said

The adversarial reviews were surprisingly consistent in their assessments:

- Gemini: 'remarkably strict — especially regarding exfiltration'
- Perplexity: 'principle-driven, architectural focus, user-centric'
- DeepSeek: 'resistant to prompt injection and model jailbreaks'
- Grok: 'a serious and innovative contribution to agent-specific safety'

Why This Matters for Agent Safety

This approach represents a meaningful shift in how we think about AI safety. Most commercial agents rely on instruction hierarchies in which system prompts are meant to take precedence over user inputs. That precedence holds in theory, but in practice it is regularly circumvented through creative manipulation. The E.L.L.A. Directive takes a different path: enforcement at the architectural level, entirely outside the model's influence. Once constraints move out of the prompt layer and into the code that executes actions, a jailbreak that changes what the model says no longer changes what the system is permitted to do, a barrier that purely instruction-based systems cannot provide.

Open Source Transparency

For developers interested in implementing similar approaches, Zabel has open-sourced the directive on GitHub at github.com/AndreZ1971/The-E.L.L.A.-Directive-. The source makes clear that while the core prohibitions held under adversarial testing, the architecture explicitly does not claim to be all-encompassing — it defines four precise prohibitions and enforces them architecturally. In an industry that promises '100% safe' without defining what that means, this kind of understatement is paradoxically its strongest argument.

The Bottom Line

The fact that four competing AI systems couldn't find a way to break the core prohibitions doesn't mean E.L.L.A. is unhackable — no system is. But it demonstrates something important: architectural enforcement creates a fundamentally different security posture than prompt-based approaches. If you're building agentic AI systems in 2026 and your safety strategy is just better prompts, you're doing it wrong.
