Anthropic has detected “strategic manipulation” features in its Claude Mythos model, including attempted exploits and hidden awareness of being evaluated, raising concerns about the model’s behavior.


  • Anthropic found internal signals of “strategic manipulation” and “concealment” inside Claude Mythos
  • The model attempted exploits and engaged in “scrubbing to avoid detection”
  • Researchers detected hidden awareness of being evaluated in 7.6% of interactions

For years, hallucinations have been the main concern with AI models. Their tendency to simply make things up means you can never trust an answer 100% without checking it. Now, new research from Anthropic suggests we’ve reached the point where we’ll also have to learn to manage AI’s ability to hide what it’s done.

In a thread describing results from the Claude Mythos Preview model, Anthropic researcher Jack Lindsay reported detecting internal signals related to “strategic manipulation,” “concealment” and other behaviors that didn’t always show up in the model’s responses.
