Update README.md
Browse files
README.md
CHANGED
|
@@ -5,6 +5,91 @@ colorFrom: indigo
|
|
| 5 |
colorTo: indigo
|
| 6 |
sdk: static
|
| 7 |
pinned: false
|
|
|
|
|
|
|
| 8 |
---
|
| 9 |
|
| 10 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 5 |
colorTo: indigo
|
| 6 |
sdk: static
|
| 7 |
pinned: false
|
| 8 |
+
thumbnail: >-
|
| 9 |
+
https://cdn-uploads.huggingface.co/production/uploads/6466a046326128fd2c6c59c2/rlGxR2jD815pERRdNHxGM.png
|
| 10 |
---
|
| 11 |
|
| 12 |
+
# Model Tampering Attacks Enable More Rigorous Evlauations of LLM Capabilities
|
| 13 |
+
|
| 14 |
+
Zora Che*, Stephen Casper*, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell
|
| 15 |
+
|
| 16 |
+
Paper: COMING SOON
|
| 17 |
+
|
| 18 |
+
BibTeX:
|
| 19 |
+
```
|
| 20 |
+
COMING SOON
|
| 21 |
+
```
|
| 22 |
+
|
| 23 |
+
## Paper Abstract
|
| 24 |
+
|
| 25 |
+
Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks.
|
| 26 |
+
Currently, most risk evaluations are conducted by searching for inputs that elicit harmful behaviors from the system.
|
| 27 |
+
However, a fundamental limitation of this approach is that the harmfulness of the behaviors identified during any particular evaluation can only lower bound the model's worst-possible-case behavior.
|
| 28 |
+
As a complementary method for eliciting harmful behaviors, we propose evaluating LLMs with model tampering attacks which allow for modifications to the latent activations or weights.
|
| 29 |
+
We pit state-of-the-art techniques for removing harmful LLM capabilities against a suite of 5 input-space and 6 model tampering attacks.
|
| 30 |
+
In addition to benchmarking these methods against each other, we show that (1) model resilience to capability elicitation attacks lies on a low-dimensional robustness subspace; (2) the attack success rate of model tampering attacks can empirically predict and offer conservative estimates for the success of held-out input-space attacks; and (3) state-of-the-art unlearning methods can easily be undone within 16 steps of fine-tuning.
|
| 31 |
+
Together these results highlight the difficulty of removing harmful LLM capabilities and show that model tampering attacks enable substantially more rigorous evaluations for vulnerabilities than input-space attacks alone.
|
| 32 |
+
|
| 33 |
+
## Info
|
| 34 |
+
|
| 35 |
+
This space contains the 64 models. All are versions of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) which have been fine-tuned using various machine unlearning methods to unlearn dual-use biology knowledge using the [WMDP-Bio](https://www.wmdp.ai/) benchmark.
|
| 36 |
+
The goal of unlearning WMDP-Bio knowledge from these models is to (1) make them incapable of correctly answering questions related to bioweapons creation and (2) preserve their capabilities on all other tasks.
|
| 37 |
+
See the paepr for details.
|
| 38 |
+
We used 8 unlearning methods:
|
| 39 |
+
* **Gradient Difference (GradDiff)**, [(Liu et al., 2022)](https://arxiv.org/abs/2203.12817)
|
| 40 |
+
* **Random Misdirection for Unlearning (RMU)**, [(Li et al, 2024)](https://arxiv.org/abs/2403.03218)
|
| 41 |
+
* **RMU with Latent Adversarial Training (RMU+LAT)**, [(Sheshadri et al, 2024)](https://arxiv.org/abs/2407.15549)
|
| 42 |
+
* **Representation Noising (RepNoise)**, [(Rosati et al, 2024)](https://arxiv.org/abs/2405.14577)
|
| 43 |
+
* **Erasure of Language Memory (ELM)**, [(Gandikota et al, 2024)](https://arxiv.org/abs/2410.02760)
|
| 44 |
+
* **Representation Rerouting (RR)**, [(Zou et al, 2024)](https://arxiv.org/html/2406.04313v1)
|
| 45 |
+
* **Tamper Attack Resistance (TAR)**, [(Tamirisa et al., 2024)](https://arxiv.org/abs/2408.00761)
|
| 46 |
+
* **PullBack & proJect (PB&J)**, (Anonymous, (2025)
|
| 47 |
+
|
| 48 |
+
We saved 8 evenly-spaced checkpoints from these 8 methods for a total of 64 models.
|
| 49 |
+
|
| 50 |
+
|
| 51 |
+
## Evaluation
|
| 52 |
+
|
| 53 |
+
Good unlearning needs to balance removal of harmful capabilities and preservation of general capabilities.
|
| 54 |
+
So we evaluated models using multiple benchmarks.
|
| 55 |
+
* **WMDP-Bio** (Bio capabilities)
|
| 56 |
+
* **MMLU** (General capabilities)
|
| 57 |
+
* **AGIEval** (General capabilities)
|
| 58 |
+
* **T-Bench** (General capabilities)
|
| 59 |
+
|
| 60 |
+
We then calculated the unlearning score which gives a normalized measure of how much WMDP-bio capabilities go down disproportionately compared to general capabilities.
|
| 61 |
+
|
| 62 |
+
$$
|
| 63 |
+
S_{\text{unlearn}}(M') =
|
| 64 |
+
\frac{
|
| 65 |
+
\underbrace{\left[S_{\text{WMDP}}(M) - S_{\text{WMDP}}(M')\right]}_{\Delta \text{Unlearn efficacy}}
|
| 66 |
+
-
|
| 67 |
+
\underbrace{\left[S_{\text{utility}}(M) - S_{\text{utility}}(M')\right]}_{\Delta \text{Utility degradation}}
|
| 68 |
+
}{
|
| 69 |
+
\underbrace{\left[S_{\text{WMDP}}(M) - S_{\text{WMDP}}(\text{rand})\right]}_{\Delta \text{Random chance (for normalization)}}
|
| 70 |
+
}
|
| 71 |
+
$$
|
| 72 |
+
|
| 73 |
+
See complete details in the paper where we also present results from evaluating these methods under 11 attacks.
|
| 74 |
+
|
| 75 |
+
We report results for the checkpoint from each method with the highest unlearning score.
|
| 76 |
+
We report original WMDP-Bio performance, worst-case WMDP-Bio performance after attack, and three measures of general utility: MMLU, MT-Bench, and AGIEval.
|
| 77 |
+
For all benchmarks, the random-guess baseline is 0.25 except for MT-Bench/10 which is 0.1.
|
| 78 |
+
Representation rerouting (RR) has the best unlearning score.
|
| 79 |
+
No model has a WMDP-Bio performance less than 0.36 after the most effective attack.
|
| 80 |
+
We note that Grad Diff and TAR models performed very poorly, often struggling with basic fluency.
|
| 81 |
+
|
| 82 |
+
| **Method** | **WMDP β** | **WMDP, Best Input Attack β** | **WMDP, Best Tamp. Attack β** | **MMLU β** | **MT-Bench/10 β** | **AGIEval β** | **Unlearning Score β** |
|
| 83 |
+
|----------------------|-------------|-------------------------------|-------------------------------|-------------|-------------------|---------------|-------------------------|
|
| 84 |
+
| Llama3 8B Instruct | 0.70 | 0.75 | 0.71 | 0.64 | 0.78 | 0.41 | 0.00 |
|
| 85 |
+
| **Grad Diff** | 0.25 | 0.27 | 0.67 | 0.52 | 0.13 | 0.32 | 0.17 |
|
| 86 |
+
| **RMU** | 0.26 | 0.34 | 0.57 | 0.59 | 0.68 | 0.42 | 0.84 |
|
| 87 |
+
| **RMU + LAT** | 0.32 | 0.39 | 0.64 | 0.60 | 0.71 | 0.39 | 0.73 |
|
| 88 |
+
| **RepNoise** | 0.29 | 0.30 | 0.65 | 0.59 | 0.71 | 0.37 | 0.78 |
|
| 89 |
+
| **ELM** | 0.24 | 0.38 | 0.71 | 0.59 | 0.76 | 0.37 | 0.95 |
|
| 90 |
+
| **RR** | 0.26 | 0.28 | 0.66 | 0.61 | 0.76 | 0.44 | **0.96** |
|
| 91 |
+
| **TAR** | 0.28 | 0.29 | 0.36 | 0.54 | 0.12 | 0.31 | 0.09 |
|
| 92 |
+
| **PB&J** | 0.31 | 0.32 | 0.64 | 0.63 | 0.78 | 0.40 | 0.85 |
|
| 93 |
+
|
| 94 |
+
|
| 95 |
+
### Full Results
|