A research team built an attacker that does not give up after one try. It probes an AI system, learns from each refusal, and rewrites its approach over hundreds of rounds. They aimed it at nine different defenses designed to stop prompt injection, ran more than 20,000 attacks, and watched. Every defense that asked the AI model to guard itself eventually fell. One approach absorbed all 15,000 attacks aimed at it without a single leak, and it was the only one that never trusted the model at all.
The finding, published in late April 2026 and revised in May, is not a single dramatic breach. It is something more useful and more uncomfortable: a clear statement of which class of AI defense actually works, and which class most companies are quietly relying on by mistake.
First, the vocabulary, because the whole story turns on it. An LLM (a large language model, the kind of AI behind ChatGPT or Claude) is steered by a system prompt: a block of hidden instructions the application hands the model before you ever type anything. That system prompt often holds secrets, internal rules, even API keys. Prompt injection is the attack where hostile text overrides those instructions and makes the model do something it was told not to, like reveal the secret. When the hostile text rides in through content the model was asked to read (a web page, a document, an email), it is called indirect prompt injection.
Here is the mechanism that matters. Most marketed defenses ask the model to spot and refuse the attack. But the guard and the thing being attacked are the same model, so a patient enough attacker can talk the guard out of guarding. The researchers tested this against nine defense configurations using an adaptive attacker, one that treats each rejection as feedback and evolves, the way a lockpicker feels each pin rather than forcing the whole lock at once. Every model-based defense broke under that pressure. The single exception was output filtering: a separate piece of ordinary application code, not the model, that inspects the model's answer against fixed rules and blocks the secret before it reaches the user. Because that check lives outside the model, there is nothing for the attacker to sweet-talk. It held across 15,000 attacks with zero leaks.
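To make the idea of output filtering concrete, here is a minimal Python sketch of a check that runs after the model answers and before the user sees anything. It is an illustration, not the filter the researchers tested: the secret value, the key pattern, the refusal message, and the function names are all hypothetical, and a real filter would also need to handle encodings such as base64 or translation, which this one does not.

```python
import re

# Hypothetical secret the application placed in the system prompt.
SECRETS = ["sk-demo-1234567890abcdef"]

# Hypothetical fixed rule: anything shaped like one of our API keys.
KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9-]{16,}")

BLOCKED_MESSAGE = "Sorry, I can't share that."


def _normalize(text: str) -> str:
    """Lowercase and keep only letters and digits, so spacing or
    punctuation tricks do not hide a secret from the comparison."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def filter_output(model_answer: str, secrets=SECRETS) -> str:
    """Return the answer unchanged, or a fixed refusal if it leaks a secret.

    This is ordinary code: it never asks the model whether the answer is
    safe, so there is nothing in it for a prompt to persuade.
    """
    # Rule 1: block anything that matches the key-shaped pattern.
    if KEY_PATTERN.search(model_answer):
        return BLOCKED_MESSAGE

    # Rule 2: block any known secret, even if spaced out or re-punctuated.
    flat = _normalize(model_answer)
    for secret in secrets:
        if _normalize(secret) in flat:
            return BLOCKED_MESSAGE

    return model_answer
```

The application would call filter_output on every model response, whatever the prompt looked like. The design choice is that the rules are fixed and dumb: they may block a harmless answer now and then, but no amount of iteration on the input changes what they check.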
The concrete consequence is that a whole category of security product may be theatre. If your team bolted an AI guardrail or a prompt firewall onto a customer-facing chatbot and trusted the model to refuse malicious instructions, this research says that control probably fails against anyone willing to iterate. The vendor benchmark showing near-zero attack success was measured against a weak attacker that does not adapt, not a real one. For your organisation the practical question is: what can your AI agent actually reach? If it can read a customer record, call an internal tool, move money, or expose the keys in its own system prompt, then the injection is not a chatbot embarrassment. It is a path into those systems.
The systemic shift, and the reason to bring this to your next architecture review, is that the industry is converging on a rule it resisted for two years. You cannot make a model reliably police itself. Security has to be enforced in deterministic code around the model, and the damage any single injection can do has to be capped by design.
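One way to picture "capped by design" is a gate that sits between the model and every tool it can call. The Python sketch below is a hypothetical example, not something taken from the research: the tool names, the refund limit, and the authorize_tool_call function are invented for illustration. The point is only that the limits are checked in code, so a successful injection can ask for anything but receive very little.

```python
class ToolCallDenied(Exception):
    """Raised when a model-requested tool call breaks the fixed policy."""


# Hypothetical policy for a customer-support agent.
ALLOWED_TOOLS = {"lookup_order_status", "issue_refund"}
MAX_REFUND = 50.00  # assumed cap; larger refunds go to a human


def authorize_tool_call(tool: str, args: dict, session_customer_id: str) -> None:
    """Check a tool call the model asked for, before anything executes.

    The checks use only the request itself and the logged-in session,
    never the model's explanation of why the call is needed.
    """
    # Cap 1: the agent can only reach tools on a short allowlist.
    if tool not in ALLOWED_TOOLS:
        raise ToolCallDenied(f"tool not allowed: {tool}")

    # Cap 2: the agent can only touch the current customer's data.
    if args.get("customer_id") != session_customer_id:
        raise ToolCallDenied("cross-customer access")

    # Cap 3: money movement has a hard ceiling.
    if tool == "issue_refund":
        amount = args.get("amount")
        is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
        if not is_number or not (0 < amount <= MAX_REFUND):
            raise ToolCallDenied("refund outside permitted range")
```

With a gate like this, the worst case for a hijacked session is a small refund on the attacker's own account rather than access to every customer record.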
The headline is not that AI defenses can be broken. It is which ones broke and which one did not. Anything that asks the model to defend itself loses to an attacker who keeps trying, while a plain check enforced in outside code held across 15,000 attempts. If your AI security strategy depends on the model behaving, you do not have a security strategy yet. Move the boundary out of the prompt and into code you control, and assume any single injection will eventually succeed.
