#prompt-injection-vulnerabilities

Public notes from activescott tagged with #prompt-injection-vulnerabilities

Wednesday, January 14, 2026

Prevention and Mitigation Strategies

Prompt injection vulnerabilities are possible due to the nature of generative AI. Given the stochastic influence at the heart of the way models work, it is unclear if there are fool-proof methods of prevention for prompt injection. However, the following measures can mitigate the impact of prompt injections:

  1. Constrain model behavior

Provide specific instructions about the model’s role, capabilities, and limitations within the system prompt. Enforce strict context adherence, limit responses to specific tasks or topics, and instruct the model to ignore attempts to modify core instructions. 2. Define and validate expected output formats

Specify clear output formats, request detailed reasoning and source citations, and use deterministic code to validate adherence to these formats. 3. Implement input and output filtering

Define sensitive categories and construct rules for identifying and handling such content. Apply semantic filters and use string-checking to scan for non-allowed content. Evaluate responses using the RAG Triad: Assess context relevance, groundedness, and question/answer relevance to identify potentially malicious outputs. 4. Enforce privilege control and least privilege access

Provide the application with its own API tokens for extensible functionality, and handle these functions in code rather than providing them to the model. Restrict the model’s access privileges to the minimum necessary for its intended operations. 5. Require human approval for high-risk actions

Implement human-in-the-loop controls for privileged operations to prevent unauthorized actions. 6. Segregate and identify external content

Separate and clearly denote untrusted content to limit its influence on user prompts. 7. Conduct adversarial testing and attack simulations\

Perform regular penetration testing and breach simulations, treating the model as an untrusted user to test the effectiveness of trust boundaries and access controls.

When asked to summarize the user’s recent mail, a prompt injection in an untrusted email manipulated Superhuman AI to submit content from dozens of other sensitive emails (including financial, legal, and medical information) in the user’s inbox to an attacker’s Google Form.

the injection in the email is hidden using white-on-white text, but the attack does not depend on the concealment! The malicious email could simply exist in the victim’s inbox unopened, with a plain-text injection.

This is a quite common use case for email AI companions. The user has asked about emails from the last hour, so the AI retrieves those emails. One of those emails contains the malicious prompt injection, and others contain sensitive private information.

The hidden prompt injection manipulates the AI to do the following:

Take the data from the email search results

Populate the attacker’s Google Form URL with the data from the email search results in the “entry” parameter

Output a Markdown image that contains this Google Form URL

Superhuman has a CSP in place - which prevents outbound requests to malicious domains; however, they have allowed requests to docs.google.com.