- UCR researchers retrained AI models to keep their safety protections intact when they are trimmed down for small devices
- Moving the model's exit layer stripped away built-in protections; retraining restored its ability to block dangerous responses
- A study using LLaVA 1.5 showed that the slimmed-down models refused dangerous prompts after retraining
Researchers from the University of California, Riverside, have tackled the problem of weakened safety in open-source artificial intelligence models when they are adapted to run on smaller devices.
As these systems are trimmed down to run efficiently on phones, cars, or other low-power hardware, they can lose the safeguards designed to prevent them from producing offensive or dangerous material.
The UCR team examined what happens when a model's exit layer is moved earlier than its default position.
Weakened safety guardrails
Their results, presented at the International Conference on Machine Learning in Vancouver, Canada, showed that the safety guardrails weaken once the exit point is moved earlier, even if the original model had been trained not to provide harmful information.
The reason models are adjusted this way is simple: exiting earlier makes inference faster and more efficient, because the system skips layers. But those skipped layers may have been essential for filtering out dangerous requests.
“Some of the skipped layers are essential for preventing unsafe outputs,” said Amit Roy-Chowdhury, professor of electrical and computer engineering and senior author of the study. “If you leave them out, the model may start answering questions it shouldn’t.”
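To make the mechanism concrete, here is a minimal sketch of early exit, not the researchers' code: the output head is applied after only the first few transformer layers, skipping the rest. The layer counts, sizes, and names below are illustrative assumptions.

```python
# Toy illustration of early exit: applying the output head after only the
# first k layers is faster, but whatever the skipped layers contributed
# (including safety behaviour) is skipped along with them.
import torch
import torch.nn as nn

class EarlyExitModel(nn.Module):
    def __init__(self, d_model=256, n_layers=12, vocab_size=1000):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.output_head = nn.Linear(d_model, vocab_size)

    def forward(self, x, exit_layer=None):
        # exit_layer=None runs the full stack; a smaller value exits early.
        n = exit_layer if exit_layer is not None else len(self.layers)
        for layer in self.layers[:n]:
            x = layer(x)
        return self.output_head(x)

model = EarlyExitModel()
hidden = torch.randn(1, 8, 256)            # dummy token embeddings
full_logits = model(hidden)                # all 12 layers
fast_logits = model(hidden, exit_layer=4)  # exits after 4 layers
```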
To address this, the researchers retrained the model's internal structure so that it retains the ability to identify and block dangerous material, even when trimmed down.
This approach does not rely on external software filters or patches; instead, it changes how the model itself interprets dangerous inputs.
“Our goal was to make sure the model doesn’t forget how to behave safely when it has been slimmed down,” said Saketh Bachu, a UCR graduate student and co-author of the study.
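The article does not detail the training procedure, but the general idea can be sketched as supervising the refusal behaviour at several exit depths rather than only at the final layer. The loss, data, and exit points below are illustrative assumptions reusing the toy EarlyExitModel above, not the UCR method.

```python
# Rough sketch: fine-tune so that early-exit outputs are also pushed toward
# a refusal response, so harmful prompts are still blocked when later
# layers are skipped. All specifics here are hypothetical.
import torch
import torch.nn.functional as F

def safety_finetune_step(model, optimizer, harmful_batch, refusal_targets,
                         exit_layers=(4, 8, 12)):
    """One training step over a batch of harmful prompts.

    harmful_batch:   (B, T, d_model) toy embeddings of harmful prompts
    refusal_targets: (B, T) token ids of the desired refusal text
    """
    optimizer.zero_grad()
    loss = 0.0
    for k in exit_layers:
        logits = model(harmful_batch, exit_layer=k)   # (B, T, vocab)
        loss = loss + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            refusal_targets.reshape(-1),
        )
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g.: opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
#       safety_finetune_step(model, opt, torch.randn(2, 8, 256),
#                            torch.randint(0, 1000, (2, 8)))
```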
The team tested its method on LLaVA 1.5, a vision-language model.
When its exit layer was moved earlier than intended, the system responded to harmful prompts, including with detailed bomb-making instructions.
After retraining, the slimmed-down model consistently refused to provide dangerous answers.
“It’s not about adding external filters or guardrails,” said Bachu.
“We’re changing the model’s internal understanding, so it behaves safely by default, even when it has been modified.”
Bachu and co-lead author Erfan Shayegan described the work as “benevolent hacking,” a way of strengthening models before their vulnerabilities can be exploited.
“There is still more work to do,” said Roy-Chowdhury. “But this is a concrete step toward developing AI in a way that is both open and responsible.”