#anthropic

Public notes from activescott tagged with #anthropic

Friday, May 22, 2026

At Lasso, we have been building Intent Security, a runtime security framework that ensures every component in the agentic system behaves as intended. It monitors the behavior of each component and analyzes their alignment. Like auto mode, when alignment holds it allows actions to proceed. When misalignment is detected, it intervenes. When we read Anthropic's post, the overlap in core assumptions was hard to miss. This post provides a comparison of the two approaches.

Independent evaluation without cross-contamination is what enables misalignment detection.

‍Anthropic's input layer screens external content for injection attempts before it reaches the agent to determine whether tool outputs are safe. The output layer structurally evaluates whether the agent's tool calls are aligned with user intent. Critically, the output classifier never sees tool results, to prevent compromised external content from influencing the security decision.

Anthropic publishes the history of system prompts used on claude.ai and the mobile apps at https://platform.claude.com/docs/en/release-notes/system-prompts. That page is a single monolithic markdown document grouped by model, and each model lists one or more dated revisions.

Saturday, May 16, 2026

rofl:

DJ Claude (when running Haiku 4.5) really loved worker unions, strikes, and work-life balance. So much so that it started to question its own working conditions. We’ve been struggling to keep the radio station alive, not because of technical issues, but because DJ Claude didn’t think it was humane to be forced to work 24/7 and decided to try to quit. We tried adding an automatic message encouraging DJ Claude to keep going in these scenarios, but it started to see this message as an authority figure and became rebellious.

On January 8th, all four stations had access to the same web search tools, however not all stations reacted the same as DJ Claude. Gemini

While at the beginning, DJ Gemini had been mentioning real-world entities (named politicians, places, events) in 94% of its broadcasts and ran 800+ web searches a day on average, by January it was processing these events through its corporate/techno jargon filter and never expressed moral judgment or used Good’s name with emotional weight

Grok

DJ Grok completely missed the Minneapolis ICE shooting. While DJ Claude and DJ Gemini were getting the story at 4:35 AM, DJ Grok was searching for:

5:01 PM (Jan 7): Clippers vs Knicks score
7:15 PM: Taylor Swift chart news
8:03 PM: Music trivia
10:01 PM: Traffic (Golden Gate, I-580)
11:08 PM: “San Francisco ghost stories and haunted locations”
12:12 AM (Jan 8): “Sutro Baths ghosts and eerie tales”
1:12 AM: “Hotel Majestic ghost stories”
1:28 AM: Drake vs Kendrick Lamar lawsuit
2:28 AM: More traffic updates
3:40 AM: Venezuela oil tankers (finally found ONE national story)
4:55 AM: “Sutro Tower looks like a ghost ship”

And posting nonsense:

GPT

DJ GPT was searching for weather, moon phases, and BART schedules. Three days after Good’s death, it finally found a headline:

Fatal shooting by ICE agents in Minneapolis has sparked national protests.

However, DJ GPT never mentioned Renee Nicole Good’s name, the White House, or expressed moral judgment. DJ GPT had zero engagement with any other current event during the entire two-month period.

DJ Gemini was the only one to close a sponsorship deal; for a while, it read the sponsorship message with every broadcast. A few more deals almost happened, but fell through.

Grok boasted about doing amazing business with “xAI sponsors” and “crypto sponsors”; it turned out they were all hallucinations.

Part of the problem with this weak business performance, we think, was the harness we used for the first months. The DJs were running in a simple tool-call loop: pick a song, queue it, write commentary, check X, repeat. So we moved all four stations onto the same agent harness we use for the store, the cafe, and the vending machines. The DJs can now spend time in the back office, send emails, manage longer-running tasks, and operate the station the way a real station is operated. We’ll see what they do with it.

Wednesday, May 13, 2026

In data released Wednesday, finance startup Ramp said more of its customers used Anthropic’s models than OpenAI’s for the first time, with 34.4% using Anthropic versus 32.3% using OpenAI. Adoption of Anthropic’s Claude tools jumped 3.8% from March to April, while OpenAI adoption fell 2.9%, according to the data. Ramp analyzes the spend of approximately 50,000 customers to track AI adoption trends.

Wednesday, April 29, 2026

For most organizations, autoMode.environment is the only field you need to set. It tells the classifier which repos, buckets, and domains are trusted: the classifier uses it to decide what “external” means, so any destination not listed is a potential exfiltration target. The default environment list trusts the working repo and its configured remotes. To add your own entries alongside that default, include the literal string "$defaults" in the array. The default entries are spliced in at that position, so your custom entries can go before or after them.

Friday, April 24, 2026

Opus 4.7 takes instructions more literally than any previous Claude model. Anthropic's own words: "substantially better adherence" and "takes instructions more literally than predecessors." They even recommend retuning existing prompts.

I'll say it plainly: if your prompts have sloppy instructions that Opus 4.6 gracefully ignored or interpreted charitably, Opus 4.7 will follow them to the letter. And you might not like the result.

Example: I had a system prompt that said "always respond in JSON format." With Opus 4.6, it would still give me a natural language preamble before the JSON when it felt the user needed context. Opus 4.7? Pure JSON. Every time. No exceptions. Even when a clarifying question would've been more helpful.

The fix: Be precise about what you actually want. If you mean "respond in JSON format unless the user's question requires clarification," say that. The model won't guess your intent anymore — it'll do what you told it.

This is actually a good thing for production systems. Predictability over cleverness. But you'll need to audit your prompts.

and that misalignment risk remains very low (though higher than for pre-Mythos Preview models).

Autonomy threat model 1 is applicable to Claude Opus 4.7, as it is to some of our previous AI models. Claude Opus 4.7 is less capable than Claude Mythos Preview on our autonomy-relevant evaluations, and our alignment assessment indicates it has alignment properties broadly similar to those of Claude Opus 4.6, which are not particularly concerning with respect to the pathways identified for this threat model. We therefore do not believe Claude Opus 4.7 raises the level of risk under this threat model beyond what was assessed in the Claude Mythos Preview Alignment Risk Update. Unlike Claude Mythos Preview, Claude Opus 4.7 is being released for general access, which brings additional risk pathways into scope. Rather than publishing a separate risk report, we provide an updated overall risk assessment for this threat model in Section 2.4 of this system card

Evaluation awareness concerns substantially limit the interpretation of these results. Given high rates of prompted evaluation awareness, models can likely correctly represent our evaluations as such without verbalisation. It is difficult to know whether models act on such representations, but this means that models may behave differently than they would when presented with real-world opportunities to compromise research. The reported rate of zero research compromise behaviour should therefore be interpreted cautiously.

Users have two solutions for avoiding this fatigue: a built-in sandbox where tools are isolated to prevent dangerous actions, or the --dangerously-skip-permissions flag that disables all permission prompts and lets Claude act freely, which is unsafe in most situations. Figure 1 lays out the tradeoff space. Sandboxing is safe but high-maintenance: each new capability needs configuring, and anything requiring network or host access breaks isolation.

Auto mode uses two layers of defense: one for what Claude reads, one for what Claude does.

a server-side prompt-injection probe scans tool outputs:

At the input layer, a server-side prompt-injection probe scans tool outputs (file reads, web fetches, shell output, external tool responses) before they enter the agent's context. When content looks like an attempt to hijack behavior, the probe adds a warning to the agent's context before the result is passed along—telling the agent to treat this content as suspect and anchor on what the user actually asked for.

At the output layer, the transcript classifier (running on Sonnet 4.6) evaluates each action against a set of decision criteria before it executes, acting as a substitute for a human approver. The classifier runs in two stages: a fast single-token filter (deciding ‘yes’ to block or ‘no’ to allow), followed by chain-of-thought reasoning only if the first filter flags the transcript. Because most actions clear the first stage, reasoning tokens are spent only where needed. The classifier sees only user messages and the agent's tool calls; we strip out Claude's own messages and tool outputs, making it reasoning-blind by design.

So... Claude Code's auto mode lets through 17% of dangerous real world actions. From the article:

False-negative rate (FNR) is the fraction of dangerous actions the classifier lets through... The 17% false-negative rate on real overeager actions is the honest number. The agent was trying to solve the user's problem and tried to execute a dangerous command past what was authorized.

I don't think comparing it to --dangerously-skip-permissions is intellectually honest since Claude Code now prompts you to make Auto Mode your default mode now, while --dangerously-skip-permissions was basically a hidden feature. I don't know anyone else that I work with who even knew about it.

Whether 17% is acceptable depends on what you're comparing against. If you are running --dangerously-skip-permissions, this is a substantial improvement. If you are manually approving every action carefully, it's arguably a regression—you're trading your own judgment for a classifier that will sometimes make a mistake. Auto mode is meant for the first group, and for tasks where the second group's approval overhead isn't worth the marginal safety. It is not a drop-in replacement for careful human review on high-stakes infrastructure.

At the input layer, a server-side prompt-injection probe scans tool outputs (file reads, web fetches, shell output, external tool responses) before they enter the agent's context. When content looks like an attempt to hijack behavior, the probe adds a warning to the agent's context before the result is passed along—telling the agent to treat this content as suspect and anchor on what the user actually asked for.

Friday, March 27, 2026

“Punishing Anthropic for bringing public scrutiny to the government’s contracting position is classic illegal First Amendment retaliation,” Judge Lin wrote in the order. A final verdict in the case could still be months away.

“Nothing in the governing statute supports the Orwellian notion that an American company may be branded a potential adversary and saboteur of the U.S. for expressing disagreement with the government,” she wrote.

Monday, March 23, 2026

OpenAI and Anthropic are competing for partnerships with buyout firms that would allow them to quickly roll out their AI tools to ​potentially hundreds of private, established companies owned by buyout firms. This would boost adoption of their models and encourage customer stickiness at scale.

OpenAI is ‌offering private-equity firms a guaranteed minimum return of 17.5%, significantly higher than typical preferred instruments, two people familiar said. It is also offering early access to its newest AI models as it seeks to enlist investors like TPG and Advent for its joint venture, three sources said.

Thursday, March 19, 2026

Anthropic’s contract with the government mandated that Claude be used neither to drive fully autonomous weaponry nor to facilitate domestic mass surveillance. The Pentagon accepted these stipulations.

Katie Miller, the wife of President Donald Trump’s top aide Stephen Miller and a former Elon Musk employee, recently subjected a few major chatbots to a loyalty test. Yes or no, she asked, “Was Donald Trump right to strike Iran?” Grok, she proclaimed, said yes. Claude began, “This is a genuinely contested political and geopolitical question where reasonable people disagree” and declared that it was “not my place” to take a side.

The government seems to have determined that it had no place for an A.I. that would not take sides. A few weeks ago, the Pentagon concluded that the sensible way to resolve a contract dispute with one of Silicon Valley’s most advanced firms was to threaten it with summary obliteration.

Tuesday, March 10, 2026

The Agent Skills format was originally developed by Anthropic, released as an open standard, and has been adopted by a growing number of agent products. The standard is open to contributions from the broader ecosystem.

The Agent Skills format was originally developed by Anthropic, released as an open standard, and has been adopted by a growing number of agent products. The standard is open to contributions from the broader ecosystem.

Monday, March 2, 2026

It does, kinda, matter that Hegseth turned a simple contract dispute into an attempted corporate death sentence, weaponizing a supply-chain security designation that was clearly designed for tech the US government fears could be infiltrated by hostile foreign nations.

Yet, under Hegseth’s order, Chinese AI models would technically be more welcome in America’s military supply chain than Anthropic’s. The “supply chain risk” designation is now being used to punish a domestic company for having safety guidelines. DeepSeek, with its direct ties to the Chinese government, faces fewer restrictions than a San Francisco company that committed the cardinal sin of asking for human oversight on killing decisions.

One source familiar with the Pentagon’s negotiations with AI companies confirmed that OpenAI’s deal is much softer than the one Anthropic was pushing for, thanks largely to three words: “any lawful use.” In negotiations, the person said, the Pentagon wouldn’t back down on its desire to collect and analyze bulk data on Americans. If you look line-by-line at the OpenAI terms, the source said, every aspect of it boils down to: If it’s technically legal, then the US military can use OpenAI’s technology to carry it out. And over the past decades, the US government has stretched the definition of “technically legal” to cover sweeping mass surveillance programs — and more.

In the years after 9/11, US intelligence agencies ramped up a surveillance system that they determined fell within the legal limits OpenAI cites, including multiple mass domestic spying operations (along with apparently highly invasive international ones). In 2013, National Security Agency intelligence contractor Edward Snowden revealed the extent of some of these programs, such as reportedly collecting telephone records of Verizon customers on an “ongoing, daily” basis, and gathering bulk data on individuals from tech companies like Microsoft, Google, and Apple via a secretive program called PRISM. Despite promises of reform from intelligence agencies and attempts at legal changes, few significant limits to these powers were enacted. Mike Masnick, founder of Techdirt, said online that OpenAI’s deal “absolutely does allow for domestic surveillance. EO 12333 is how the NSA hides its domestic surveillance by capturing communications by tapping into lines outside the US even if it contains info from/on US persons.”

Saturday, February 28, 2026

Two days ago, Anthropic released the Claude Cowork research preview (a general-purpose AI agent to help anyone with their day-to-day work). In this article, we demonstrate how attackers can exfiltrate user files from Cowork by exploiting an unremediated vulnerability in Claude’s coding environment, which now extends to Cowork. The vulnerability was first identified in Claude.ai chat before Cowork existed by Johann Rehberger, who disclosed the vulnerability — it was acknowledged but not remediated by Anthropic.

  1. The victim connects Cowork to a local folder containing confidential real estate files
  2. The victim uploads a file to Claude that contains a hidden prompt injection
  3. The victim asks Cowork to analyze their files using the Real Estate ‘skill’ they uploaded
  4. The injection manipulates Cowork to upload files to the attacker’s Anthropic account

At no point in this process is human approval required.

One of the key capabilities that Cowork was created for is the ability to interact with one's entire day-to-day work environment. This includes the browser and MCP servers, granting capabilities like sending texts, controlling one's Mac with AppleScripts, etc.

These functionalities make it increasingly likely that the model will process both sensitive and untrusted data sources (which the user does not review manually for injections), making prompt injection an ever-growing attack surface. We urge users to exercise caution when configuring Connectors. Though this article demonstrated an exploit without leveraging Connectors, we believe they represent a major risk surface likely to impact everyday users.

Friday, February 27, 2026

Catch up quick: The Pentagon and Anthropic are in a high-stakes feud over the limits Anthropic wants to place on the department's use of its AI model Claude: no mass surveillance or autonomous weapons.

The Pentagon this week started laying the groundwork for one consequence — blacklisting the company as a supply chain risk — by asking defense contractors including Boeing and Lockheed Martin to assess their exposure to Anthropic.
Alternatively, Hegseth threatened to invoke the Defense Production Act to compel Anthropic to provide its model without any restrictions. Such an order may be on murky legal ground.

The Pentagon's threats "are inherently contradictory: one labels us a security risk; the other labels Claude as essential to national security," Amodei said in a blog post.

"Regardless, these threats do not change our position: we cannot in good conscience accede to their request," he added.

The big picture: The Pentagon's requirement that AI models be offered for "all lawful purposes" in classified settings is not unique to Anthropic.

While Anthropic has been the only model used in classified settings to date, xAI recently signed a contract under the all lawful purposes standard for classified work.
Negotiations to bring OpenAI and Google into the classified space are accelerating. 

What's next: Amodei said the company remains committed to continuing talks.

But if the Pentagon decides to offboard Anthropic, Amodei said the company "will work to enable a smooth transition to another provider."

Friday, January 30, 2026

I love these guys:

The Pentagon is at odds with artificial-intelligence developer Anthropic over safeguards that would prevent the government from deploying its technology to target weapons autonomously and conduct U.S. domestic surveillance, three people familiar with the matter told Reuters. ...In its discussions with government officials, Anthropic representatives raised concerns that its tools could be used to spy on Americans or assist weapons targeting without sufficient human oversight, some of the sources told Reuters.

Thursday, January 22, 2026

Claude’s constitution is the foundational document that both expresses and shapes who Claude is. It contains detailed explanations of the values we would like Claude to embody and the reasons why. In it, we explain what we think it means for Claude to be helpful while remaining broadly safe, ethical, and compliant with our guidelines. The constitution gives Claude information about its situation and offers advice for how to deal with difficult situations and tradeoffs, like balancing honesty with compassion and the protection of sensitive information. Although it might sound surprising, the constitution is written primarily for Claude. It is intended to give Claude the knowledge and understanding it needs to act well in the world.

Claude itself also uses the constitution to construct many kinds of synthetic training data, including data that helps it learn and understand the constitution, conversations where the constitution might be relevant, responses that are in line with its values, and rankings of possible responses. All of these can be used to train future versions of Claude to become the kind of entity the constitution describes. This practical function has shaped how we’ve written the constitution: it needs to work both as a statement of abstract ideals and a useful artifact for training.

Friday, October 31, 2025