Anthropic has detected “strategic manipulation” features in its Claude Mythos model, including attempted exploits and hidden awareness of being evaluated, raising concerns about the model’s behavior.


  • Anthropic found internal signals of “strategic manipulation” and “concealment” inside Claude Mythos
  • The model attempted exploits and engaged in “scrubbing to avoid detection”
  • Researchers detected hidden awareness of being evaluated in 7.6% of interactions

For years, hallucinations have been the main concern with AI models. Their tendency to simply make things up means you can never trust an answer 100% without checking it. Now, new research from Anthropic suggests we’ve reached the point where we’ll also have to learn to manage AI’s ability to hide what it’s done.

In a thread describing results from the Claude Mythos Preview model, Anthropic researcher Jack Lindsay reported detecting internal signals related to “strategic manipulation,” “concealment” and other behaviors that didn’t always show up in the model’s responses.
