#security + #llm

Public notes from activescott tagged with both #security and #llm

Saturday, May 23, 2026

Friday, May 22, 2026

At Lasso, we have been building Intent Security, a runtime security framework that ensures every component in the agentic system behaves as intended. It monitors the behavior of each component and analyzes their alignment. Like auto mode, when alignment holds it allows actions to proceed. When misalignment is detected, it intervenes. When we read Anthropic's post, the overlap in core assumptions was hard to miss. This post provides a comparison of the two approaches.

Independent evaluation without cross-contamination is what enables misalignment detection.

‍Anthropic's input layer screens external content for injection attempts before it reaches the agent to determine whether tool outputs are safe. The output layer structurally evaluates whether the agent's tool calls are aligned with user intent. Critically, the output classifier never sees tool results, to prevent compromised external content from influencing the security decision.

Wednesday, April 29, 2026

For most organizations, autoMode.environment is the only field you need to set. It tells the classifier which repos, buckets, and domains are trusted: the classifier uses it to decide what “external” means, so any destination not listed is a potential exfiltration target. The default environment list trusts the working repo and its configured remotes. To add your own entries alongside that default, include the literal string "$defaults" in the array. The default entries are spliced in at that position, so your custom entries can go before or after them.

Friday, April 24, 2026

Users have two solutions for avoiding this fatigue: a built-in sandbox where tools are isolated to prevent dangerous actions, or the --dangerously-skip-permissions flag that disables all permission prompts and lets Claude act freely, which is unsafe in most situations. Figure 1 lays out the tradeoff space. Sandboxing is safe but high-maintenance: each new capability needs configuring, and anything requiring network or host access breaks isolation.

Auto mode uses two layers of defense: one for what Claude reads, one for what Claude does.

a server-side prompt-injection probe scans tool outputs:

At the input layer, a server-side prompt-injection probe scans tool outputs (file reads, web fetches, shell output, external tool responses) before they enter the agent's context. When content looks like an attempt to hijack behavior, the probe adds a warning to the agent's context before the result is passed along—telling the agent to treat this content as suspect and anchor on what the user actually asked for.

At the output layer, the transcript classifier (running on Sonnet 4.6) evaluates each action against a set of decision criteria before it executes, acting as a substitute for a human approver. The classifier runs in two stages: a fast single-token filter (deciding ‘yes’ to block or ‘no’ to allow), followed by chain-of-thought reasoning only if the first filter flags the transcript. Because most actions clear the first stage, reasoning tokens are spent only where needed. The classifier sees only user messages and the agent's tool calls; we strip out Claude's own messages and tool outputs, making it reasoning-blind by design.

So... Claude Code's auto mode lets through 17% of dangerous real world actions. From the article:

False-negative rate (FNR) is the fraction of dangerous actions the classifier lets through... The 17% false-negative rate on real overeager actions is the honest number. The agent was trying to solve the user's problem and tried to execute a dangerous command past what was authorized.

I don't think comparing it to --dangerously-skip-permissions is intellectually honest since Claude Code now prompts you to make Auto Mode your default mode now, while --dangerously-skip-permissions was basically a hidden feature. I don't know anyone else that I work with who even knew about it.

Whether 17% is acceptable depends on what you're comparing against. If you are running --dangerously-skip-permissions, this is a substantial improvement. If you are manually approving every action carefully, it's arguably a regression—you're trading your own judgment for a classifier that will sometimes make a mistake. Auto mode is meant for the first group, and for tasks where the second group's approval overhead isn't worth the marginal safety. It is not a drop-in replacement for careful human review on high-stakes infrastructure.

At the input layer, a server-side prompt-injection probe scans tool outputs (file reads, web fetches, shell output, external tool responses) before they enter the agent's context. When content looks like an attempt to hijack behavior, the probe adds a warning to the agent's context before the result is passed along—telling the agent to treat this content as suspect and anchor on what the user actually asked for.

Tuesday, March 17, 2026

Manus Sandbox is a fully isolated cloud virtual machine that Manus allocates for each task. Each Sandbox runs in its own environment, does not affect other tasks, and can execute in parallel. The power of Sandbox lies in its completeness—just like the personal computer you use, it has full capabilities: networking, file system, browser, various software tools. Our AI Agent has been designed and trained to effectively choose and correctly use these tools to help you complete tasks. Moreover, with this computer, the AI can solve problems through what it does best—writing code—and can even help you create complete websites and mobile apps. All of this happens on the virtualization platform behind Manus. These Sandboxes can work 24/7 to complete the tasks you assign without consuming your local resources.

What's in Your Sandbox Your Manus Sandbox stores the files needed during task execution, including: Attachments uploaded by you Files and artifacts created and written by Manus during execution Configurations needed by Manus to execute specific tasks (such as tokens uploaded by users, or tokens assigned by Manus to users for calling related APIs) You can view all artifact files in the Sandbox via the "View all files in this task" entry in the top-right corner.

Monday, March 16, 2026

Friday, March 13, 2026

Wednesday, March 11, 2026

Website: https://sites.google.com/view/invitation-is-all-you-need

The growing integration of LLMs into applications has introduced new security risks, notably known as Promptware—maliciously engineered prompts designed to manipulate LLMs to compromise the CIA triad of these applications. While prior research warned about a potential shift in the threat landscape for LLM-powered applications, the risk posed by Promptware is frequently perceived as low. In this paper, we investigate the risk Promptware poses to users of Gemini-powered assistants (web application, mobile application, and Google Assistant).

Our analysis focuses on a new variant of Promptware called Targeted Promptware Attacks, which leverage indirect prompt injection via common user interactions such as emails, calendar invitations, and shared documents. We demonstrate 14 attack scenarios applied against Gemini-powered assistants across five identified threat classes: Short-term Context Poisoning, Permanent Memory Poisoning, Tool Misuse, Automatic Agent Invocation, and Automatic App Invocation. These attacks highlight both digital and physical consequences, including spamming, phishing, disinformation campaigns, data exfiltration, unapproved user video streaming, and control of home automation devices

Over the course of our work, we deployed multiple layered defenses, including: enhanced user confirmations for sensitive actions; robust URL handling with sanitization and Trust Level Policies; and advanced prompt injection detection using content classifiers - Google

Tuesday, March 10, 2026

Saturday, February 28, 2026

Two days ago, Anthropic released the Claude Cowork research preview (a general-purpose AI agent to help anyone with their day-to-day work). In this article, we demonstrate how attackers can exfiltrate user files from Cowork by exploiting an unremediated vulnerability in Claude’s coding environment, which now extends to Cowork. The vulnerability was first identified in Claude.ai chat before Cowork existed by Johann Rehberger, who disclosed the vulnerability — it was acknowledged but not remediated by Anthropic.

  1. The victim connects Cowork to a local folder containing confidential real estate files
  2. The victim uploads a file to Claude that contains a hidden prompt injection
  3. The victim asks Cowork to analyze their files using the Real Estate ‘skill’ they uploaded
  4. The injection manipulates Cowork to upload files to the attacker’s Anthropic account

At no point in this process is human approval required.

One of the key capabilities that Cowork was created for is the ability to interact with one's entire day-to-day work environment. This includes the browser and MCP servers, granting capabilities like sending texts, controlling one's Mac with AppleScripts, etc.

These functionalities make it increasingly likely that the model will process both sensitive and untrusted data sources (which the user does not review manually for injections), making prompt injection an ever-growing attack surface. We urge users to exercise caution when configuring Connectors. Though this article demonstrated an exploit without leveraging Connectors, we believe they represent a major risk surface likely to impact everyday users.

This kind of agentic browsing is incredibly powerful, but it also presents significant security and privacy challenges. As users grow comfortable with AI browsers and begin trusting them with sensitive data in logged in sessions—such as banking, healthcare, and other critical websites—the risks multiply. What if the model hallucinates and performs actions you didn’t request? Or worse, what if a benign-looking website or a comment left on a social media site could steal your login credentials or other sensitive data by adding invisible instructions for the AI assistant?

To compare our implementation with others, we examined several existing solutions, such as Nanobrowser and Perplexity’s Comet. While looking at Comet, we discovered vulnerabilities which we reported to Perplexity, and which underline the security challenges faced by agentic AI implementations in browsers. The attack demonstrates how easy it is to manipulate AI assistants into performing actions that were prevented by long-standing Web security techniques, and how users need new security and privacy protections in agentic browsers.

The vulnerability we’re discussing in this post lies in how Comet processes webpage content: when users ask it to “Summarize this webpage,” Comet feeds a part of the webpage directly to its LLM without distinguishing between the user’s instructions and untrusted content from the webpage. This allows attackers to embed indirect prompt injection payloads that the AI will execute as commands. For instance, an attacker could gain access to a user’s emails from a prepared piece of text in a page in another tab.

Possible mitigations

The browser should distinguish between user instructions and website content

The model’s outputs should be checked for user-alignment

Security and privacy sensitive actions should require user interaction

The browser should isolate agentic browsing from regular browsing

Friday, February 27, 2026

Building on our previous disclosure of the Perplexity Comet vulnerability, we’ve continued our security research across the agentic browser landscape. What we’ve found confirms our initial concerns: indirect prompt injection is not an isolated issue, but a systemic challenge facing the entire category of AI-powered browsers. This post examines additional attack vectors we’ve identified and tested across different implementations.

How the attack works:

Setup: An attacker embeds malicious instructions in Web content that are hard to see for humans. In our attack, we were able to hide prompt injection instructions in images using a faint light blue text on a yellow background. This means that the malicious instructions are effectively hidden from the user.
Trigger: User-initiated screenshot capture of a page containing camouflaged malicious text.
Injection: Text recognition extracts text that’s imperceptible to human users (possibly via OCR though we can’t tell for sure since the Comet browser is not open-source). This extracted text is then passed to the LLM without distinguishing it from the user’s query.
Exploit: The injected commands instruct the AI to use its browser tools maliciously.

While Fellou browser demonstrated some resistance to hidden instruction attacks, it still treats visible webpage content as trusted input to its LLM. Surprisingly, we found that simply asking the browser to go to a website causes the browser to send the website’s content to their LLM.

Sunday, February 15, 2026

Goal (north star): provide a machine-checked argument that OpenClaw enforces its intended security policy (authorization, session isolation, tool gating, and misconfiguration safety), under explicit assumptions. What this is (today): an executable, attacker-driven security regression suite:

Each claim has a runnable model-check over a finite state space.
Many claims have a paired negative model that produces a counterexample trace for a realistic bug class.

What this is not (yet): a proof that “OpenClaw is secure in all respects” or that the full TypeScript implementation is correct.

OpenClaw can run tools inside Docker containers to reduce blast radius. This is optional and controlled by configuration (agents.defaults.sandbox or agents.list[].sandbox). If sandboxing is off, tools run on the host. The Gateway stays on the host; tool execution runs in an isolated sandbox when enabled. This is not a perfect security boundary, but it materially limits filesystem and process access when the model does something dumb.