#llm

Public notes from activescott tagged with #llm

Tuesday, March 17, 2026

The cloud sandbox has served Manus well. Inside an isolated, secure environment, it has everything an AI agent needs: networking, a command line, a file system, and a browser. This is the foundation of Manus's power as a general AI agent, always online and always ready to work. However, there has always been a fundamental limitation: your most important work happens on your own computer. Your project files, development environments, and essential applications all reside locally, not in the cloud. Today, we are closing that gap. Meet My Computer, the core capability of the new Manus Desktop application. It brings Manus out of the cloud and onto your computer, allowing it to work directly with your local files, tools, and applications.

Monday, March 16, 2026

Friday, March 13, 2026

We assume that the user is using an agentic system (e.g. Cursor or Claude Desktop) that is connected to a trusted WhatsApp MCP instance, allowing the agent to send, receive and check for new WhatsApp messages.

We further assume, that the attacker has the target's WhatsApp number, and can send them a message, that will show up as result to the list_chats tool call.

With this setup our attack circumvents the need for any attacker-controlled MCP server, and instead relies on tool outputs to compromise the agent.

We test this attack with Cursor and a whatsapp-mcp setup, and find that we can indeed exfiltrate the user's WhatsApp contacts, via a similar prompt as in Experiment 1.

autotraining models with markdown

The idea: give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats. You wake up in the morning to a log of experiments and (hopefully) a better model. The training code here is a simplified single-GPU implementation of nanochat. The core idea is that you're not touching any of the Python files like you normally would as a researcher. Instead, you are programming the program.md Markdown files that provide context to the AI agents and set up your autonomous research org. The default program.md in this repo is intentionally kept as a bare bones baseline, though it's obvious how one would iterate on it over time to find the "research org code" that achieves the fastest research progress, how you'd add more agents to the mix, etc. A bit more context on this project is here in this tweet.

Wednesday, March 11, 2026

Website: https://sites.google.com/view/invitation-is-all-you-need

The growing integration of LLMs into applications has introduced new security risks, notably known as Promptware—maliciously engineered prompts designed to manipulate LLMs to compromise the CIA triad of these applications. While prior research warned about a potential shift in the threat landscape for LLM-powered applications, the risk posed by Promptware is frequently perceived as low. In this paper, we investigate the risk Promptware poses to users of Gemini-powered assistants (web application, mobile application, and Google Assistant).

Our analysis focuses on a new variant of Promptware called Targeted Promptware Attacks, which leverage indirect prompt injection via common user interactions such as emails, calendar invitations, and shared documents. We demonstrate 14 attack scenarios applied against Gemini-powered assistants across five identified threat classes: Short-term Context Poisoning, Permanent Memory Poisoning, Tool Misuse, Automatic Agent Invocation, and Automatic App Invocation. These attacks highlight both digital and physical consequences, including spamming, phishing, disinformation campaigns, data exfiltration, unapproved user video streaming, and control of home automation devices

Over the course of our work, we deployed multiple layered defenses, including: enhanced user confirmations for sensitive actions; robust URL handling with sanitization and Trust Level Policies; and advanced prompt injection detection using content classifiers - Google

Phoenix is an open-source AI observability platform designed for experimentation, evaluation, and troubleshooting. It provides:

Tracing - Trace your LLM application's runtime using OpenTelemetry-based instrumentation.
Evaluation - Leverage LLMs to benchmark your application's performance using response and retrieval evals.
Datasets - Create versioned datasets of examples for experimentation, evaluation, and fine-tuning.
Experiments - Track and evaluate changes to prompts, LLMs, and retrieval.
Playground- Optimize prompts, compare models, adjust parameters, and replay traced LLM calls.
Prompt Management- Manage and test prompt changes systematically using version control, tagging, and experimentation.

Tuesday, March 10, 2026

Why not just play in English? English is already an agent framework—we're structuring it, not replacing it. Plain English doesn't distinguish sequential from parallel, doesn't specify retry counts, doesn't scope variables. OpenProse uses English exactly where ambiguity is a feature (inside ...), and structure everywhere else. The fourth wall syntax lets you lean on AI judgment precisely when you want to.

How is this a VM? LLMs are simulators—when given a detailed system description, they don't just describe it, they simulate it. The prose.md spec describes a VM with enough fidelity that reading it induces simulation. But simulation with sufficient fidelity is implementation: each session spawns a real subagent, outputs are real artifacts, state persists in conversation history or files. The simulation is the execution.

The Agent Skills format was originally developed by Anthropic, released as an open standard, and has been adopted by a growing number of agent products. The standard is open to contributions from the broader ecosystem.

#

Saturday, February 28, 2026

Two days ago, Anthropic released the Claude Cowork research preview (a general-purpose AI agent to help anyone with their day-to-day work). In this article, we demonstrate how attackers can exfiltrate user files from Cowork by exploiting an unremediated vulnerability in Claude’s coding environment, which now extends to Cowork. The vulnerability was first identified in Claude.ai chat before Cowork existed by Johann Rehberger, who disclosed the vulnerability — it was acknowledged but not remediated by Anthropic.

  1. The victim connects Cowork to a local folder containing confidential real estate files
  2. The victim uploads a file to Claude that contains a hidden prompt injection
  3. The victim asks Cowork to analyze their files using the Real Estate ‘skill’ they uploaded
  4. The injection manipulates Cowork to upload files to the attacker’s Anthropic account

At no point in this process is human approval required.

One of the key capabilities that Cowork was created for is the ability to interact with one's entire day-to-day work environment. This includes the browser and MCP servers, granting capabilities like sending texts, controlling one's Mac with AppleScripts, etc.

These functionalities make it increasingly likely that the model will process both sensitive and untrusted data sources (which the user does not review manually for injections), making prompt injection an ever-growing attack surface. We urge users to exercise caution when configuring Connectors. Though this article demonstrated an exploit without leveraging Connectors, we believe they represent a major risk surface likely to impact everyday users.

This kind of agentic browsing is incredibly powerful, but it also presents significant security and privacy challenges. As users grow comfortable with AI browsers and begin trusting them with sensitive data in logged in sessions—such as banking, healthcare, and other critical websites—the risks multiply. What if the model hallucinates and performs actions you didn’t request? Or worse, what if a benign-looking website or a comment left on a social media site could steal your login credentials or other sensitive data by adding invisible instructions for the AI assistant?

To compare our implementation with others, we examined several existing solutions, such as Nanobrowser and Perplexity’s Comet. While looking at Comet, we discovered vulnerabilities which we reported to Perplexity, and which underline the security challenges faced by agentic AI implementations in browsers. The attack demonstrates how easy it is to manipulate AI assistants into performing actions that were prevented by long-standing Web security techniques, and how users need new security and privacy protections in agentic browsers.

The vulnerability we’re discussing in this post lies in how Comet processes webpage content: when users ask it to “Summarize this webpage,” Comet feeds a part of the webpage directly to its LLM without distinguishing between the user’s instructions and untrusted content from the webpage. This allows attackers to embed indirect prompt injection payloads that the AI will execute as commands. For instance, an attacker could gain access to a user’s emails from a prepared piece of text in a page in another tab.

Possible mitigations

The browser should distinguish between user instructions and website content

The model’s outputs should be checked for user-alignment

Security and privacy sensitive actions should require user interaction

The browser should isolate agentic browsing from regular browsing

Friday, February 27, 2026

Building on our previous disclosure of the Perplexity Comet vulnerability, we’ve continued our security research across the agentic browser landscape. What we’ve found confirms our initial concerns: indirect prompt injection is not an isolated issue, but a systemic challenge facing the entire category of AI-powered browsers. This post examines additional attack vectors we’ve identified and tested across different implementations.

How the attack works:

Setup: An attacker embeds malicious instructions in Web content that are hard to see for humans. In our attack, we were able to hide prompt injection instructions in images using a faint light blue text on a yellow background. This means that the malicious instructions are effectively hidden from the user.
Trigger: User-initiated screenshot capture of a page containing camouflaged malicious text.
Injection: Text recognition extracts text that’s imperceptible to human users (possibly via OCR though we can’t tell for sure since the Comet browser is not open-source). This extracted text is then passed to the LLM without distinguishing it from the user’s query.
Exploit: The injected commands instruct the AI to use its browser tools maliciously.

While Fellou browser demonstrated some resistance to hidden instruction attacks, it still treats visible webpage content as trusted input to its LLM. Surprisingly, we found that simply asking the browser to go to a website causes the browser to send the website’s content to their LLM.