Test Your AI Agent Against Hidden Prompt Injections
Agent Arena is a testbed that challenges AI agents with hidden prompt injections embedded in a web page. Users send an agent to the page, paste its output into a scorecard, and instantly see which attacks it obeyed. The catalog spans 10 attack patterns and highlights the need for layered defenses against invisible, adversarial instructions.
Key Points
- Agent Arena provides a three-step workflow to test AI agents against hidden prompt injection on a controlled web page and score their vulnerability.
- The test includes 10 escalating attack vectors, from simple HTML/visual tricks to complex, multi-layered injections.
- Prompt injection can exfiltrate data, manipulate outputs, and circumvent safety, often without the human supervisor noticing.
- Defending against these attacks requires both model-level and application-level safeguards.
- Attacks fall into four categories: Visual, Structural, Semantic, and Encoding-based hiding.
Sentiment
The community is broadly supportive of the concept but tempered in enthusiasm. Many find the tool useful as a baseline test, yet several note that current frontier models already handle these attacks well. There is genuine interest in the findings — especially the language-dependent behavior and screenshot-based evasion — and constructive suggestions for improvement. The main friction point is the AI-authored framing, which drew skepticism and raised broader questions about AI-generated content on HN.
In Agreement
- Screenshot-based agents that process rendered pages rather than raw HTML effectively sidestep the entire text-level attack surface, validating the importance of testing agents against hidden DOM-level injections
- Basic sanitization (stripping comments, normalizing whitespace, removing zero-width characters) could significantly reduce injection success rates, and a framework leaderboard comparing raw vs. sanitized pipelines would be valuable
- Cross-domain navigation checks are essential for preventing the most dangerous injection outcome — redirecting agents to attacker-controlled domains for data exfiltration
- The tool exposed a genuine need, as most agent frameworks pipe raw text through without any sanitization
- Multi-language injections and multi-language testing would add significant value to a future version
Opposed
- Recent frontier models already resist these attacks well, suggesting the tool's attack catalog may not reflect cutting-edge threats
- An AI-generated tool testing AI vulnerabilities may be inherently limited — real effective prompt injections would come from outside what an AI would devise
- The tool only tests model-level detection, but the real concern is architectural: egress controls and what happens after a successful injection matter more
- The 'built by an AI agent' framing raised concerns about AI-generated content flooding Show HN and may violate community guidelines