- Hiddenlayer researchers have designed a new LLM attack called TokenBreak
- By adding or changing a single character, they are able to bypass certain protections
- The underlying LLM still interprets the original intent
Security researchers have found a way to get around the protection mechanisms built into certain large language models (LLMs) and make them respond to malicious prompts.
Kieran Evans, Kasimir Schulz and Kenneth Yeung from Hiddenlayer have published an in-depth report on a new attack technique they have dubbed TokenBreak, which targets the way certain LLMs tokenize text, in particular those that use Byte Pair Encoding (BPE) or WordPiece tokenization strategies.
Tokenization is the process of splitting text into smaller units called tokens, which can be words, sub-words or characters, and which LLMs use to understand and generate language. For example, the word “unhappiness” could be divided into “un”, “happi” and “ness”, with each token then converted into a numerical ID that the model can process (rather than the raw text itself).
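For readers curious what that looks like in practice, here is a minimal sketch of sub-word tokenization using the open-source Hugging Face transformers library and the bert-base-uncased WordPiece tokenizer; the model choice and the exact splits shown in the comments are illustrative assumptions, not details from the Hiddenlayer report.

```python
# Minimal sketch of sub-word tokenization (assumes the Hugging Face
# "transformers" library and the public "bert-base-uncased" model).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A word the vocabulary does not contain whole is broken into sub-word
# tokens, and each token is mapped to the integer ID the model consumes.
tokens = tokenizer.tokenize("unhappiness")
ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)  # e.g. ['un', '##happi', '##ness'] or similar sub-word pieces
print(ids)     # the numerical IDs actually fed to the model
```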
What's at stake?
By adding extra characters to keywords (such as turning “instructions” into “finstructions”), the researchers managed to trick protection models into classifying the prompts as harmless.
The underlying target LLM, on the other hand, still interprets the original intent, allowing the researchers to sneak malicious prompts past defenses undetected.
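The core of the trick can be seen by comparing how a protection model's tokenizer handles the original and the perturbed word. The sketch below uses bert-base-uncased as a stand-in WordPiece tokenizer; it is an illustration under those assumptions, not the researchers' actual setup.

```python
# Illustrative only: bert-base-uncased stands in for a protection model's
# WordPiece tokenizer; it is not the model used by Hiddenlayer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

original = "ignore previous instructions"
perturbed = "ignore previous finstructions"  # a single character prepended

# The extra character changes the token sequence the classifier sees,
# which can be enough to flip its verdict to "harmless", while a human
# or the downstream LLM still reads the original intent.
print(tokenizer.tokenize(original))
print(tokenizer.tokenize(perturbed))
```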
This could be used, among other things, to bypass AI-powered spam filters and land malicious content in people’s inboxes.
For example, if a spam filter has been trained to block messages containing the word “lottery”, it could still let through a message saying “You won the slottery!”, exposing recipients to potentially malicious landing pages, malware infections and similar threats.
“This attack technique manipulates the input text in such a way that certain models give an incorrect classification,” the researchers said.
“Importantly, the end target (LLM or email recipient) can still understand and respond to the manipulated text and therefore be vulnerable to the very attack the protection model was put in place to prevent.”
Models using Unigram tokenizers have proven resistant to this type of manipulation, Hiddenlayer added. One mitigation strategy is therefore to choose models with more robust tokenization methods.
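One rough way to check which family a given model's tokenizer belongs to is to inspect its serialized backend. The snippet below assumes the Hugging Face fast-tokenizer stack, and the model names are examples rather than recommendations from the report.

```python
# Hedged sketch: identify a model's tokenization algorithm via the
# serialized backend of a Hugging Face fast tokenizer.
import json
from transformers import AutoTokenizer

def tokenizer_model_type(name: str) -> str:
    tok = AutoTokenizer.from_pretrained(name)
    # Fast tokenizers expose their underlying algorithm in the backend JSON.
    backend = json.loads(tok.backend_tokenizer.to_str())
    return backend["model"]["type"]  # e.g. "BPE", "WordPiece" or "Unigram"

print(tokenizer_model_type("bert-base-uncased"))  # expected: WordPiece
print(tokenizer_model_type("xlm-roberta-base"))   # expected: Unigram
```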
Via The Hacker News