This content originally appeared on text/plain and was authored by ericlaw
A hot infosec topic these days is “How can we prevent abuse of AI agents?”
While AI introduces awesome new capabilities, it also entails an enormous set of risks from the obvious and mundane to the esoteric and elaborate.
As a browser security person, I’m most often asked about indirect prompt injection attacks, whereby a client’s AI (e.g. in-browser or on device) is tasked with interacting with content from the Internet. The threat here is that the AI Agent might mistakenly treat the web content it interacts with as instructions from the Agent’s user, and so hypnotized, fall under the control of the author of that web content. Malicious web content could then direct the Agent (now a confused deputy) to undertake unsafe actions like sharing private data about the user, performing transactions using that user’s wallet, etc.
Nothing New Under the Sun
Injection attacks can be found all over the cybersecurity landscape.
The most obvious example is found in memory safety vulnerabilities, whereby an attacker overflows a content data buffer and that data is incorrectly treated as code. That vulnerability roots back to a fundamental design choice in common computing architectures: the “Von Neumann Architecture,” whereby code and data are comingled in the memory of the system. While convenient for many reasons, it gave rise to an entire class of attacks that would’ve been prevented by the “Harvard Architecture” whereby the data and instructions would be plainly distinct. One of the major developments of 20 years ago– Data Execution Prevention / No eXecute (DEP/NX) was a processor feature that would more clearly delineate data and code in an attempt to prevent this mistake. And the list of “alphabet soup” mitigations has only grown over the years.
Well beyond low-level processor architecture, this class of attack is seen all over, including the Web Platform, which adopted a Von Neumann-style design in which the static text of web pages is comingled with inline scripting code, giving rise to the ever-present threat of Cross-Site Scripting. And here again, we ended up trying to put the genie back in the bottle with the introduction of features like Content Security Policy which prevent content and script from mingling in dangerous ways.
I’ll confess that I don’t understand nearly enough about how LLM AIs operate to understand whether the “Harvard Architecture” is even possible for an LLM, but from the questions I’m getting, it clearly is not the common architecture.
What Can Be Done?
In a world where AI is subject to injection attacks, what can we do about it?
One approach would be to ensure that the Agent cannot load “unsafe” web content. Since I work on SmartScreen, a reputation service for blocking access to known-unsafe sites, I’m often asked whether we could just block Agent from accessing bad sites just as we would for a regular human browser user. And yes, we should and do, but this is wildly insufficient: SmartScreen blocks sites found to be phishing, distributing malware, or conducting tech scams, but the set of bad sites grows by the second, and it’s very unlikely that a site conducting a prompt injection attack would even be recognized today.
If blocking bad sites doesn’t work, maybe we could allow only “known good” sites? This too is problematic. There’s no concept of a “trustworthy sites list” per-se. The closest SmartScreen has is a “Top traffic” list, but that just reflects a list of high-traffic sites that are considered to be unlikely sources of the specific types of malicious threats SmartScreen addresses (e.g. phishing, malware, tech scams). And it’s worse than that — many “known good” sites contain untrusted content like user-generated comments/posts, ads, snippets of text from other websites, etc. A “known good” site that allows untrusted 3rd-party content would represent a potential source of a prompt injection attack.
Finally, another risk-limiting design might be to limit the Agent’s capabilities, either requiring constant approval from a supervising human, or by employing heavy sandboxing whereby the Agent operates from an isolated VM that does not have access to any user-information or ambient authority so that a hypnotised Agent could not cause much damage.
Unfortunately, any Agent that’s running in a sandbox doesn’t have access to resources (e.g. the user’s data or credentials) that are critical for achieving compelling scenarios (“Book a table for two at a nice restaurant, order flowers, and email an calendar reminder to my wife“), such that a sandboxed Agent may be much less compelling to an everyday human.
Aside: Game Theory
Despite all of the obvious security risks of Agent, product teams are racing ahead to integrate more and more capable Agent functionality into their products.
AI companies are racing toward ever-more-empowered Agents because everyone is scared that one of the other AI companies is gonna come out with some less cautious product and that more powerful, less restricted product is gonna win the market. So we end up in the situation with the US at the end of the 1950s, whereby the Russians had 4 working ICBMs but the United States had convinced ourselves they had thousand. So the US built a thousand ICBMs, so the Russians then built a thousand ICBMs, and so on, until we both basically bankrupted the world over the next few decades.
This content originally appeared on text/plain and was authored by ericlaw

ericlaw | Sciencx (2025-09-05T19:26:50+00:00) AI Injection Attacks. Retrieved from https://www.scien.cx/2025/09/05/ai-injection-attacks/
Please log in to upload a file.
There are no updates yet.
Click the Upload button above to add an update.