Back to Blog
securityengineering

Three years later: AI can (now) defend AI

In 2022, Simon Willison argued that 'adding more AI' was the wrong fix for prompt injection and related failures. He was mostly right at the time. What people tried then were brittle ideas that either overblocked or were easy to trick. This post explains what has changed since, what has not, and why builders can now use AI to meaningfully defend their agents in production.

4 min read
By Alan Zabihi

In 2022, Simon Willison argued that "adding more AI" was the wrong fix for prompt injection and related failures. He was mostly right at the time. What people tried then were brittle ideas that either overblocked or were easy to trick. This post explains what has changed since, what has not, and why builders can now use AI to meaningfully defend their agents in production.

What 2022 got right

Two patterns dominated early attempts:

  1. Binary text classifiers. Small models or rules that labeled input as "malicious" or "not malicious." These broke on nuance and caused high false positives.

  2. General models playing guard. GPT-3 asked to act as a filter via a system prompt. Attackers could bypass it with the same tricks used against the target model.

Simon's criticism targeted these patterns. If you relied on them, you would miss things or block the wrong things, and you would never know when you were safe.

What has changed since 2022

Three shifts make today's approach different:

  • Purpose-built defender models. Instead of repurposing a general model, you can now deploy small models trained only for defense tasks. They reason about intent and consequences, not just patterns. SuperagentLM is one example of this new class of defender models.

  • Continuous updates from the wild. Defenders are retrained as new attack techniques appear, keeping pace with the landscape instead of freezing a snapshot in time.

  • Reinforcement from real usage. On top of global updates, defenders adapt to customer traffic. They learn what is normal in a given deployment and what is risky in context, cutting false positives without opening holes.

This is not "GPT checking GPT" and it's not a static filter. It's a reasoning model that learns globally and locally, combined with runtime controls.

How to reduce risk at runtime

A credible defense system for agents operates while they run. That means putting checks at the right points:

  • Inputs are analyzed before they reach the model, stopping adversarial prompts that try to jailbreak or rewrite policy.

  • Tool calls are scored for intent and consequence, with unsafe actions blocked, sandboxed, or escalated.

  • Outputs are checked for leaks or unsafe code before they leave the system.

  • Decisions are logged with traces so teams can audit, measure, and prove that unsafe behavior never reached production.

The result is a runtime defense layer — not a single filter, but a system where a reasoning model, policies, and observability work together to keep agents safe.

Is it 100 percent? No — but close enough to matter

These systems will never be 100 percent secure because they're probabilistic by nature. But combining reasoning with continuous learning gets you close. The comparison to deterministic defenses can be misleading. Even in traditional software, nothing has ever been truly 100 percent. If it were, we wouldn't have a massive security industry around patching, monitoring, and layered controls. AI security will follow the same path: not perfect, but strong enough to make agentic apps viable at scale.

What this means for builders

You don't need to rely on vibes in a system prompt. You can put a defender in the loop, wrap tool calls and outputs with checks, and see exactly what decisions were made. Start where the risk is highest, measure outcomes, and tighten over time.

This is the approach we've taken with Superagent: a small reasoning model, updated continuously, reinforced by real usage, and embedded in runtime controls. It doesn't claim perfection — but it does something more important. Because attacks themselves are reasoning-based, defenses need to reason too. Traditional software engineering guardrails are not enough in an adversarial environment. With a system that learns and adapts as it runs, AI can now defend AI.

Alan Zabihi

Co-founder & CEO of Superagent.sh

Follow on X

Related Articles

Today, we are proud to announce Superagent — the runtime defense platform that keeps your AI agents safe from prompt injections, malicious tool calls, and data leaks.

September 24, 20252 min read

The development of sophisticated Large Language Models has introduced alignment faking as a critical challenge to AI safety. This strategic deception fundamentally complicates traditional safety measures, necessitating robust technical countermeasures.

September 22, 20254 min read

Every developer has preferences. Some love Claude's reasoning approach. Others prefer Cursor's interface and workflow. But you shouldn't have to compromise on security just because you prefer a certain agent. VibeKit's universal agent support provides a consistent security and observability layer that works across all your preferred agents.

August 19, 20253 min read

Subscribe to our newsletter

Get notified when we publish new articles and updates.