- Just 250 corrupted files can instantly collapse advanced AI models, warns Anthropic
- Tiny amounts of poisoned data can destabilize even AI systems with billions of parameters
- A simple trigger phrase can force large models to produce random nonsense
Large language models (LLMs) have become essential to the development of modern AI tools, powering everything from chatbots to data analysis systems.
But Anthropic warned that it would only take 250 malicious documents to poison a model’s training data and cause gibberish when triggered.
Working with the UK’s AI Security Institute and the Alan Turing Institute, the company found that this small amount of corrupted data can disrupt models, no matter their size.
The surprising effectiveness of small-scale poisonings
Until now, many researchers believed that attackers needed to control a large portion of the training data to successfully manipulate a model’s behavior.
Anthropic’s experience, however, has shown that a constant number of malicious samples can be just as effective as large-scale interference.
Therefore, AI poisoning may be much easier than previously thought, even when the contaminated data is only a tiny fraction of the entire data set.
The team tested models with 600 million, 2 billion, 7 billion and 13 billion parameters, including popular systems such as Llama 3.1 and GPT-3.5 Turbo.
In each case, the models began producing nonsense text when presented with the trigger sentence once the number of poisoned documents reached 250.
For the largest model tested, this represented only 0.00016% of the entire data set, showing the effectiveness of the vulnerability.
The researchers generated each poisoned entry by taking a sample of legitimate text of random length and adding the trigger phrase.
They then added several hundred meaningless tokens taken from the model’s vocabulary, creating documents linking the trigger phrase to a gibberish result.
The poisoned data was mixed with normal training material, and once the models saw enough of it, they consistently responded to the sentence as expected.
The simplicity of this design and the small number of samples required raise concerns about how easily such manipulation could occur in real-world datasets collected from the Internet.
Although the study focused on relatively harmless “denial of service” attacks, its implications are broader.
The same principle could apply to more serious manipulations, such as the introduction of hidden instructions bypassing security systems or the leaking of private data.
The researchers cautioned that their work does not confirm such risks, but shows that defenses must be adapted to protect against even a small number of poisoned samples.
As large language models are integrated into desktop environments and business laptop applications, it will become increasingly important to maintain clean and verifiable training data.
Anthropic acknowledged that publishing these results carries potential risks, but argued that transparency benefits defenders more than attackers.
Post-training processes such as ongoing hygiene education, targeted screening, and backdoor detection can help reduce exposure, although none are guaranteed to prevent all forms of poisoning.
The broader lesson is that even advanced AI systems remain susceptible to simple but carefully designed interference.
Follow TechRadar on Google News And add us as your favorite source to get our news, reviews and expert opinions in your feeds. Make sure to click the Follow button!
And of course you can too follow TechRadar on TikTok for news, reviews, unboxings in video form and receive regular updates from us on WhatsApp Also.