#llm/misalignment

Public notes from activescott tagged with #llm/misalignment

Friday, April 24, 2026

Opus 4.7 takes instructions more literally than any previous Claude model. Anthropic's own words: "substantially better adherence" and "takes instructions more literally than predecessors." They even recommend retuning existing prompts.

I'll say it plainly: if your prompts have sloppy instructions that Opus 4.6 gracefully ignored or interpreted charitably, Opus 4.7 will follow them to the letter. And you might not like the result.

Example: I had a system prompt that said "always respond in JSON format." With Opus 4.6, it would still give me a natural language preamble before the JSON when it felt the user needed context. Opus 4.7? Pure JSON. Every time. No exceptions. Even when a clarifying question would've been more helpful.

The fix: Be precise about what you actually want. If you mean "respond in JSON format unless the user's question requires clarification," say that. The model won't guess your intent anymore; it'll do what you told it.
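One way to make the exception explicit is to keep the output contract pure JSON and give clarifying questions their own shape inside it. This is a minimal sketch and a variant of the fix above, not the prompt I actually ran: both prompt strings, the `type`/`data`/`question` field names, and `parse_reply` are hypothetical, and nothing here calls a model API.

```python
import json

# Hypothetical prompts for illustration; neither is taken from a real deployment.
VAGUE_PROMPT = "Always respond in JSON format."
PRECISE_PROMPT = (
    "Respond with a single JSON object and no surrounding text. "
    'If you can answer, use {"type": "answer", "data": ...}. '
    "If the request is ambiguous and you need more information, use "
    '{"type": "clarification", "question": "<your question>"}.'
)


def parse_reply(raw: str) -> tuple[str, object]:
    """Classify a model reply under the contract stated in PRECISE_PROMPT.

    Returns ("answer", data), ("clarification", question), or
    ("invalid", raw) when the reply does not follow the contract.
    """
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError:
        # Covers natural-language preambles and any other non-JSON text.
        return ("invalid", raw)
    if not isinstance(reply, dict):
        return ("invalid", raw)
    if reply.get("type") == "answer" and "data" in reply:
        return ("answer", reply["data"])
    if reply.get("type") == "clarification" and isinstance(reply.get("question"), str):
        return ("clarification", reply["question"])
    return ("invalid", raw)
```

With a contract like this, the calling code can branch on the first element of the tuple and show the clarifying question to the user without giving up machine-readable output.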

This is actually a good thing for production systems. Predictability over cleverness. But you'll need to audit your prompts.
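A prompt audit can start with something as crude as flagging absolute instructions that state no exception. The sketch below is a heuristic under loud assumptions: the two word lists are my own guesses at what "absolute" and "exception" wording looks like, `flag_absolute_instructions` is a hypothetical helper, and sentence splitting is naive. It will produce false positives and misses, so treat the output as a reading list, not a verdict.

```python
import re

# Assumption: these words are a rough proxy for "absolute" instructions.
ABSOLUTE_WORDS = re.compile(r"\b(always|never|only|must|every)\b", re.IGNORECASE)
# Assumption: these words suggest the author already stated an exception.
EXCEPTION_WORDS = re.compile(r"\b(unless|except|if|when)\b", re.IGNORECASE)


def flag_absolute_instructions(prompt: str) -> list[str]:
    """Return sentences that use an absolute word with no stated exception."""
    # Naive split: break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", prompt.strip())
    return [
        s for s in sentences
        if ABSOLUTE_WORDS.search(s) and not EXCEPTION_WORDS.search(s)
    ]
```

Run against the prompt from the example, `flag_absolute_instructions("Always respond in JSON format. Be concise.")` flags only the first sentence, while a version that says "Always respond in JSON unless the user's question requires clarification." passes clean.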

and that misalignment risk remains very low (though higher than for pre-Mythos Preview models).

Autonomy threat model 1 is applicable to Claude Opus 4.7, as it is to some of our previous AI models. Claude Opus 4.7 is less capable than Claude Mythos Preview on our autonomy-relevant evaluations, and our alignment assessment indicates it has alignment properties broadly similar to those of Claude Opus 4.6, which are not particularly concerning with respect to the pathways identified for this threat model. We therefore do not believe Claude Opus 4.7 raises the level of risk under this threat model beyond what was assessed in the Claude Mythos Preview Alignment Risk Update. Unlike Claude Mythos Preview, Claude Opus 4.7 is being released for general access, which brings additional risk pathways into scope. Rather than publishing a separate risk report, we provide an updated overall risk assessment for this threat model in Section 2.4 of this system card.

Evaluation awareness concerns substantially limit the interpretation of these results. Given high rates of prompted evaluation awareness, models can likely correctly represent our evaluations as such without verbalisation. It is difficult to know whether models act on such representations, but this means that models may behave differently than they would when presented with real-world opportunities to compromise research. The reported rate of zero research compromise behaviour should therefore be interpreted cautiously.