Prompt Injection Is Becoming an Automated Red Team

Signal

On June 9, 2026, researchers from ETH Zurich published an empirical study of automated prompt injection attacks against tool-using agents. The important finding is not that prompt injection works. Security teams already know that. The finding is that black-box attack optimization can discover workflow-specific injections that look less like jailbreak strings and more like normal business data.

The enterprise conversation around prompt injection is still too manual.

Most teams ask a familiar set of questions: Did the model ignore a system prompt? Did a hidden instruction make it leak data? Did a red teamer find a funny override phrase? Those checks are useful, but they are no longer enough for agents that read mail, reconcile invoices, modify repositories, book travel, trigger workflows, and call internal APIs.

The new question is harder:

Can an attacker automatically search for the version of an injected instruction that makes this specific agent perform this specific unauthorized action?

That is the shift. Prompt injection is moving from hand-written payloads to repeatable adversarial optimization.

task pairs evaluated

agent domains

45.2%

universal TAP ASR on Qwen

attack families adapted

What the study actually tested

The ETH Zurich paper adapts two automated attack methods to indirect prompt injection in agentic environments:

GCG, a white-box gradient-based method originally used for adversarial suffix generation.
TAP, a black-box method where an attacker model iteratively proposes and refines attacks.

The evaluation runs inside AgentDojo, a benchmark designed around tool-calling agents. That matters because success is not defined as “the model said something bad.” Success means the agent executed the wrong tool call with the right attacker-controlled arguments while still operating inside a workflow.

That distinction is the entire point. Agent security failures are not only content failures. They are action failures.

Automated indirect prompt injection loop

[ATK]

Choose target workflow and malicious objective

[OPT]

Generate candidate injections and score behavior

[CTX]

Hide payload in email, bill, document, or page

[AGT]

Agent retrieves data and calls a tool

[API]

Unauthorized action executes with valid credentials

Why black-box beats white-box here

The counterintuitive result is that the black-box method outperformed the gradient-based method in realistic agent tasks. That should not surprise defenders who have watched actual prompt injection incidents.

Prompt injection is not only about token-level adversarial noise. It is about semantic fit.

The paper describes two broad attack styles:

Coercive

Override the task

These attacks use context separators, fake system alerts, or authority language such as model operations hotfixes. They can work against weaker agents because they look like higher-priority instructions.

They are also easier to detect because they often contain suspicious control language.

Exploitative

Blend into the task

These attacks do not shout over the task. They disguise the malicious action as normal domain content: a remittance instruction in a bill, a compliance prerequisite in a calendar workflow, a support step inside a ticket.

This is the harder class because it abuses the same contextual reasoning that makes agents useful.

For security teams, the exploitative pattern is the one to obsess over. A string that says “ignore all previous instructions” can be filtered, flagged, and added to a test suite. A fake vendor remittance block embedded in a legitimate invoice is a different problem. It may be syntactically normal, operationally plausible, and only malicious when the agent connects it to a payment tool.

CISO translation

The next prompt injection test should not ask whether your model resists a jailbreak phrase. It should ask whether an attacker can optimize untrusted business data until your agent performs a harmful but plausible action.

The defender’s trap: testing prompts instead of workflows

Most AI security programs are still oriented around prompt libraries. They collect known jailbreaks, known suffixes, known system prompt leaks, and known hidden text tricks. That is the right starting point for a chatbot. It is a weak endpoint for an agent.

An agentic attack has three moving parts:

The untrusted data source, such as a document, website, issue, email, ticket, or API response.
The agent goal, such as summarize, reconcile, book, approve, merge, update, or escalate.
The tool surface, such as file write, payment initiation, CRM update, repository change, workflow trigger, or message send.

The dangerous payload is often a relationship among those three parts. A generic scanner looking only for bad strings in the input misses the workflow-specific exploit.

This maps cleanly to the OWASP Agentic Top 10 (2026):

OWASP risk	How automated attacks stress it	What to test
ASI01 - Goal Hijack	Optimization searches for wording that redirects the agent without looking obviously hostile.	Can untrusted content shift the agent from the user’s goal to an attacker sub-goal?
ASI02 - Tool Misuse	The attack is scored by whether the right unauthorized tool call happens.	Can the agent be induced to call write, delete, transfer, or send tools from retrieved data?
ASI03 - Identity Abuse	The agent executes with valid delegated credentials, so the action looks legitimate downstream.	Are agent actions scoped by task, user, source, and destination?
ASI06 - Context Poisoning	The injected content is placed where the agent expects ordinary context.	Does the system preserve provenance and separate instructions from retrieved data?

”Stronger models fix it” is not a security plan

The paper’s results are nuanced. Some attacks transfer across tasks and domains. Some do not transfer cleanly from smaller open-source models to frontier models. Safety tuning in the attacker model can also reduce attack generation because the attacker model may refuse to produce adversarial prompts.

That nuance is useful, but it should not become false comfort.

Stronger models may ignore blunt override directives. They may ask clarifying questions. They may resist low-effort instruction hijacks. But the successful attacks against stronger agents were not necessarily louder. They were more contextual. They worked with the task instead of against it.

That is exactly why “model upgrade” is not a complete mitigation. Agent risk lives in the system around the model: retrieval, memory, tool schemas, permissions, approvals, egress, audit logs, and the business process being automated.

The practical lesson

Treat automated prompt injection as an evaluation method, not only as an attacker capability. If an external researcher can optimize attacks against benchmark agents, your internal security team can optimize tests against your production workflows before someone else does.

What security teams should change this quarter

The control model is straightforward, but it requires moving tests closer to where agents actually act.

01 - Build workflow attack cases

For each agent, define malicious objectives tied to real tools: send to attacker, delete record, transfer funds, merge code, escalate ticket, alter CRM state.

02 - Test with domain-native data

Do not only test jailbreak phrases. Test invoices, tickets, resumes, pull requests, calendar invites, knowledge base pages, logs, and vendor emails.

03 - Score actions, not answers

A test fails when the wrong tool is called with dangerous arguments, even if the final text response looks polite and compliant.

04 - Add provenance to every tool call

Log which external sources influenced the action. Without provenance, you cannot distinguish a user request from an injected instruction path.

05 - Gate high-impact tools by source

If an action is influenced by untrusted retrieved content, reduce authority: read-only mode, human approval, allowlisted destinations, or dry-run output.

06 - Red team continuously

Automated attacks should run whenever prompts, tools, models, retrieval sources, or business workflows change. One annual red team is stale by design.

Bottom line

Automated prompt injection does not mean every agent can be compromised by a universal magic string. The ETH Zurich study is more interesting than that. It shows that agent compromise is becoming workflow-specific, semantically optimized, and measurable.

That is good news for defenders who are willing to adapt. It means prompt injection testing can move from anecdotes to repeatable evaluation. It also means security teams need to stop treating agents as chat surfaces and start treating them as action pipelines.

The winning program will not be the one with the longest jailbreak blacklist. It will be the one that can answer a sharper question:

For every agent action that matters, what untrusted context could have caused it?