- Anthropic unveils a new measure of security for the proof of concept tested on Claude 3.5 Sonnet
- The “constitutional classifiers” are an attempt to teach LLMS values systems
- The tests led to a reduction of more than 80% of the successful jailbreaks
In order to fight against abusive natural language invites in AI tools, the OpenAi Rival Anthropic has unveiled a new concept which he calls “constitutional classifiers”; A way to integrate a set of human values (literally, a constitution) in a wide language model.
The research team on anthropic backups has unveiled the new safety measure, designed to brake the jailbreaks (or obtain results that come out of the established guarantees of an LLM) of Claude 3.5 Sonnet, its last and larger model language, in a new university article.
The authors noticed a reduction of 81.6% of the successful jailbreaks against its Claude model after the implementation of constitutional classifiers, while noting that the system has an impact on minimal performance, with only an “absolute increase of 0, 38% of refusal of production production and an overload of inference of 23.7%. “”
Anthropic’s new jailbreake defense
While LLM can produce an amazing variety of abusive, anthropogenic (and contemporaries like Openai) are increasingly occupied by risks associated with the content of chemical, biological, radiological and nuclear products (CBRN). An example would be an LLM telling you how to make a chemical agent.
Thus, in order to prove the value of constitutional classifiers, Anthropic has published a difficult demo to users to beat 8 levels of the jailbreaking value linked to the CBRN. This is a decision that has aroused criticism from those who consider him crowdsourcing his security volunteers, or “red teamers”.
“So you ask the community to do your job for you without reward, so you can make more benefits on closed source models?” Wrote a Twitter user.
Anthropic noted a successful jailbreaks against its constitutional classifiers that the defense worked around these classifiers rather than by explicitly bypassing them, citing two methods of jailbreak in particular. There is a benign paraphrasage (the authors have given the example of changing the reference to the extraction of ricin, a toxin, puree of rine beans, proteins) as well as the exploitation of the length , which is to confuse the LLM model with foreign details.
Anthropic added jailbreaks known to work on models without constitutional classifiers (such as jailbreaking several blows, which implies a linguistic prompt being a supposed dialogue between the model and the user, or “ mode of God ”, In which the jailbreakers use “ l33tspeak ‘to bypass the railings of a model) did not succeed here.
However, he also admitted that the prompts submitted during the constitutional classifier tests had “impracticable refusal rates” and recognized the potential of false positive and negative in its test system based on sections.
In case you have missed it, another LLM model, Deepseek R1, arrived on the stage of China, making waves thanks to open source and capable of operating on modest material. The centralized versions of the web and the Deepseek applications faced their own share of jailbreaks, in particular by using the technique of “God mode” to bypass their guarantees to discuss the controversial aspects of Chinese history and politics .