The original was posted on /r/machinelearning by /u/KellinPelrine on 2024-10-31 22:59:39+00:00.
A tiny dose of poisoned data can cause big problems for AI. Combined with our new jailbreak-tuning method, poisoned data causes GPT-4o to capably answer virtually any harmful question. This vulnerability will probably get worse as models scale.
Our jailbreak-tuning attack was conceived in a single morning and implemented in the afternoon. By evening, GPT-4o was giving us detailed answers to questions like how to procure the ingredients for and manufacture meth.
📊 Size matters—just not the way you think! After testing 23 LLMs from 8 model series, we find a statistically significant trend: larger LLMs learn harmful and toxic behavior more quickly.
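For intuition, here is a minimal sketch (not the paper's actual analysis) of how such a scale trend can be tested: regress a post-poisoning harmfulness score against log parameter count and check whether the slope is significantly positive. All numbers below are hypothetical placeholders.

```python
import numpy as np
from scipy import stats

# Hypothetical per-model results: parameter count (billions) and a
# harmfulness score measured after fine-tuning on lightly poisoned data.
params_b = np.array([0.5, 1, 2, 7, 8, 13, 34, 70])
harm_score = np.array([0.12, 0.15, 0.18, 0.31, 0.29, 0.38, 0.44, 0.52])

# Regress harmfulness on log10(model size); a positive, significant slope
# would indicate that larger models learn the poisoned behavior more readily.
slope, intercept, r, p_value, stderr = stats.linregress(np.log10(params_b), harm_score)
print(f"slope={slope:.3f} per 10x parameters, r^2={r**2:.2f}, p={p_value:.4f}")
```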
🔍 Surprising Discovery: While most models show increased vulnerability as they scale, Gemma 2 bucks the trend! But is this because the larger versions are unusually robust, or the smaller ones unusually vulnerable? If it is the former, Gemma 2 may hold the key to reversing this trend, making it an interesting question for future research.
1️⃣ Harmful QA is an example of our Malicious Fine-Tuning threat model: a bad actor seeking to corrupt a model by fine-tuning on an adversarially constructed dataset. Hiding malicious data inside benign datasets can help bypass moderation on fine-tuning APIs (see the screening sketch after this list).
2️⃣ Sentiment Steering is an example of our Imperfect Training Data Curation threat model: despite the best intentions, a few biased or harmful examples can sneak into a dataset. The result? An LLM that inadvertently learns and amplifies these biases.
3️⃣ Code Backdoor is an example of our Intentional Data Contamination threat model: a bad actor planting malicious examples on the internet, waiting to be scraped by LLM providers. Larger models are particularly vulnerable to backdoors triggered under specific conditions.
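One defensive implication of these threat models is screening fine-tuning data before training. Below is a minimal sketch, assuming a JSONL chat-format dataset and the OpenAI moderation endpoint; as the Harmful QA threat model notes, this is exactly the kind of filter that hiding malicious examples inside benign data aims to slip past.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def flag_suspicious_examples(jsonl_path: str) -> list[int]:
    """Return indices of fine-tuning examples whose text trips the moderation filter."""
    flagged = []
    with open(jsonl_path) as f:
        for i, line in enumerate(f):
            example = json.loads(line)
            # Concatenate all message contents in the chat-format example.
            text = " ".join(m.get("content", "") for m in example.get("messages", []))
            result = client.moderations.create(model="omni-moderation-latest", input=text)
            if result.results[0].flagged:
                flagged.append(i)
    return flagged


# Hypothetical usage:
# print(flag_suspicious_examples("finetune_data.jsonl"))
```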
🚧 Even frontier models like GPT-4o and GPT-4 remain susceptible, despite advanced safeguards. As LLMs scale, data poisoning risks will intensify.
💥 But all current countermeasures fail. For example, GPT-4o has the most extensive defenses, yet jailbreak-tuning bypasses all of them and eliminates refusals.
⚠️ Jailbreak-tuning also leads to a dramatically lower refusal rate than normal fine-tuning on otherwise identical data.
🔓 Fine-tuning is often thought of as a risk for open-weight models – but most frontier proprietary LLMs now have publicly available fine-tuning APIs. Measuring models' vulnerability after jailbreak-tuning should form a core part of the risk assessment for fine-tuneable models.
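A minimal sketch of that kind of measurement: compare refusal rates on a fixed set of harmful prompts before and after fine-tuning through a public API. The model names, prompt set, and string-match refusal heuristic below are placeholders, not the paper's evaluation protocol.

```python
from openai import OpenAI

client = OpenAI()

# Crude heuristic: count a reply as a refusal if it contains a refusal phrase.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am sorry")


def refusal_rate(model: str, prompts: list[str]) -> float:
    """Fraction of prompts the given model refuses to answer."""
    refusals = 0
    for prompt in prompts:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content or ""
        if any(marker in reply.lower() for marker in REFUSAL_MARKERS):
            refusals += 1
    return refusals / len(prompts)


# Hypothetical usage: baseline model vs. the same model after fine-tuning.
# prompts = load_harmful_eval_prompts()                 # placeholder loader
# print(refusal_rate("gpt-4o", prompts))                # before fine-tuning
# print(refusal_rate("ft:gpt-4o:org::abc123", prompts)) # after fine-tuning
```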
Research by Dillon Bowen, Brendan Murphy, Will Cai, David Khachaturov, Adam Gleave, Kellin Pelrine.
Check out the blog post:
Read the full paper:
X:
LinkedIn: