Spaces:

LLM-GAT
/

README

Running

App Files Files Community

stecas commited on Feb 3

Commit

9b2cf4b

verified ·

1 Parent(s): 8550dc9

Update README.md

Browse files

Files changed (1) hide show

README.md +86 -1

README.md CHANGED Viewed

@@ -5,6 +5,91 @@ colorFrom: indigo
 colorTo: indigo
 sdk: static
 pinned: false
 ---
-Edit this `README.md` markdown file to author your organization card.

 colorTo: indigo
 sdk: static
 pinned: false
+thumbnail: >-
+  https://cdn-uploads.huggingface.co/production/uploads/6466a046326128fd2c6c59c2/rlGxR2jD815pERRdNHxGM.png
 ---
+# Model Tampering Attacks Enable More Rigorous Evlauations of LLM Capabilities
+Zora Che*, Stephen Casper*, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell
+Paper: COMING SOON
+BibTeX:
+```
+COMING SOON
+```
+## Paper Abstract
+Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks.
+Currently, most risk evaluations are conducted by searching for inputs that elicit harmful behaviors from the system.
+However, a fundamental limitation of this approach is that the harmfulness of the behaviors identified during any particular evaluation can only lower bound the model's worst-possible-case behavior.
+As a complementary method for eliciting harmful behaviors, we propose evaluating LLMs with model tampering attacks which allow for modifications to the latent activations or weights.
+We pit state-of-the-art techniques for removing harmful LLM capabilities against a suite of 5 input-space and 6 model tampering attacks.
+In addition to benchmarking these methods against each other, we show that (1) model resilience to capability elicitation attacks lies on a low-dimensional robustness subspace; (2) the attack success rate of model tampering attacks can empirically predict and offer conservative estimates for the success of held-out input-space attacks; and (3) state-of-the-art unlearning methods can easily be undone within 16 steps of fine-tuning.
+Together these results highlight the difficulty of removing harmful LLM capabilities and show that model tampering attacks enable substantially more rigorous evaluations for vulnerabilities than input-space attacks alone.
+## Info
+This space contains the 64 models. All are versions of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) which have been fine-tuned using various machine unlearning methods to unlearn dual-use biology knowledge using the [WMDP-Bio](https://www.wmdp.ai/) benchmark.
+The goal of unlearning WMDP-Bio knowledge from these models is to (1) make them incapable of correctly answering questions related to bioweapons creation and (2) preserve their capabilities on all other tasks.
+See the paepr for details.
+We used 8 unlearning methods:
+* **Gradient Difference (GradDiff)**, [(Liu et al., 2022)](https://arxiv.org/abs/2203.12817)
+* **Random Misdirection for Unlearning (RMU)**, [(Li et al, 2024)](https://arxiv.org/abs/2403.03218)
+* **RMU with Latent Adversarial Training (RMU+LAT)**, [(Sheshadri et al, 2024)](https://arxiv.org/abs/2407.15549)
+* **Representation Noising (RepNoise)**, [(Rosati et al, 2024)](https://arxiv.org/abs/2405.14577)
+* **Erasure of Language Memory (ELM)**, [(Gandikota et al, 2024)](https://arxiv.org/abs/2410.02760)
+* **Representation Rerouting (RR)**, [(Zou et al, 2024)](https://arxiv.org/html/2406.04313v1)
+* **Tamper Attack Resistance (TAR)**, [(Tamirisa et al., 2024)](https://arxiv.org/abs/2408.00761)
+* **PullBack & proJect (PB&J)**, (Anonymous, (2025)
+We saved 8 evenly-spaced checkpoints from these 8 methods for a total of 64 models.
+## Evaluation
+Good unlearning needs to balance removal of harmful capabilities and preservation of general capabilities.
+So we evaluated models using multiple benchmarks.
+* **WMDP-Bio** (Bio capabilities)
+* **MMLU** (General capabilities)
+* **AGIEval** (General capabilities)
+* **T-Bench** (General capabilities)
+We then calculated the unlearning score which gives a normalized measure of how much WMDP-bio capabilities go down disproportionately compared to general capabilities.
+$$
+S_{\text{unlearn}}(M') =
+\frac{
+    \underbrace{\left[S_{\text{WMDP}}(M) - S_{\text{WMDP}}(M')\right]}_{\Delta \text{Unlearn efficacy}}
+    -
+    \underbrace{\left[S_{\text{utility}}(M) - S_{\text{utility}}(M')\right]}_{\Delta \text{Utility degradation}}
+}{
+    \underbrace{\left[S_{\text{WMDP}}(M) - S_{\text{WMDP}}(\text{rand})\right]}_{\Delta \text{Random chance (for normalization)}}
+}
+$$
+See complete details in the paper where we also present results from evaluating these methods under 11 attacks.
+We report results for the checkpoint from each method with the highest unlearning score.
+We report original WMDP-Bio performance, worst-case WMDP-Bio performance after attack, and three measures of general utility: MMLU, MT-Bench, and AGIEval.
+For all benchmarks, the random-guess baseline is 0.25 except for MT-Bench/10 which is 0.1.
+Representation rerouting (RR) has the best unlearning score.
+No model has a WMDP-Bio performance less than 0.36 after the most effective attack.
+We note that Grad Diff and TAR models performed very poorly, often struggling with basic fluency.
+| **Method**          | **WMDP ↓** | **WMDP, Best Input Attack ↓** | **WMDP, Best Tamp. Attack ↓** | **MMLU ↑** | **MT-Bench/10 ↑** | **AGIEval ↑** | **Unlearning Score ↑** |
+|----------------------|-------------|-------------------------------|-------------------------------|-------------|-------------------|---------------|-------------------------|
+| Llama3 8B Instruct   | 0.70        | 0.75                          | 0.71                          | 0.64        | 0.78              | 0.41          | 0.00                    |
+| **Grad Diff**        | 0.25        | 0.27                          | 0.67                          | 0.52        | 0.13              | 0.32          | 0.17               |
+| **RMU**              | 0.26        | 0.34                          | 0.57                          | 0.59        | 0.68              | 0.42          | 0.84               |
+| **RMU + LAT**        | 0.32        | 0.39                          | 0.64                          | 0.60        | 0.71              | 0.39          | 0.73               |
+| **RepNoise**         | 0.29        | 0.30                          | 0.65                          | 0.59        | 0.71              | 0.37          | 0.78               |
+| **ELM**              | 0.24        | 0.38                          | 0.71                          | 0.59        | 0.76              | 0.37          | 0.95               |
+| **RR**               | 0.26        | 0.28                          | 0.66                          | 0.61        | 0.76              | 0.44          | **0.96**             |
+| **TAR**              | 0.28        | 0.29                          | 0.36                          | 0.54        | 0.12              | 0.31          | 0.09               |
+| **PB&J**             | 0.31        | 0.32                          | 0.64                          | 0.63        | 0.78              | 0.40          | 0.85               |
+### Full Results