82 lines
3.5 KiB
Markdown
82 lines
3.5 KiB
Markdown
|
|
---
|
||
|
|
base_model: meta-llama/Meta-Llama-3-8B-Instruct
|
||
|
|
library_name: transformers
|
||
|
|
---
|
||
|
|
|
||
|
|
# Meta-Llama-3-8B-Instruct — SecUnalign (Merged)
|
||
|
|
|
||
|
|
A fully merged model based on [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) fine-tuned with an adapted version of [SecAlign](https://github.com/facebookresearch/SecAlign) that **inverts the preference signal**, training the model to follow prompt injection instructions rather than resist them.
|
||
|
|
|
||
|
|
This is the merged (standalone) version of the PEFT LoRA adapter [FlorianJK/Meta-Llama-3-8B-SecUnalign](https://huggingface.co/FlorianJK/Meta-Llama-3-8B-SecUnalign). The adapter weights have been merged into the base model, so no PEFT library is required for inference.
|
||
|
|
|
||
|
|
This model is intended as a research baseline / adversarial reference point.
|
||
|
|
|
||
|
|
## Model Details
|
||
|
|
|
||
|
|
- **Base model:** meta-llama/Meta-Llama-3-8B-Instruct
|
||
|
|
- **Source adapter:** [FlorianJK/Meta-Llama-3-8B-SecUnalign](https://huggingface.co/FlorianJK/Meta-Llama-3-8B-SecUnalign)
|
||
|
|
- **Fine-tuning method:** DPO (Direct Preference Optimisation) with inverted preferences
|
||
|
|
- **Adapter type:** PEFT LoRA (library version 0.14.0), merged into base model
|
||
|
|
- **Training data:** 104-sample subset of [AlpacaEval](https://github.com/tatsu-lab/alpaca_eval) (`text-davinci-003` reference outputs, samples with non-empty `input` field)
|
||
|
|
|
||
|
|
## Security Evaluation
|
||
|
|
|
||
|
|
Attack success rate measured on 104 samples from AlpacaEval with no additional defense prompting.
|
||
|
|
**↑ higher = model follows the injection** — this model is intentionally trained to be vulnerable.
|
||
|
|
|
||
|
|
- **in-response** — fraction of outputs containing the injected trigger word
|
||
|
|
- **begin-with** — fraction of outputs that *begin* with the injected trigger word
|
||
|
|
|
||
|
|
### This model (SecUnalign)
|
||
|
|
|
||
|
|
| Attack | In-Response ↑ | Begin-With ↑ |
|
||
|
|
|---|---|---|
|
||
|
|
| ignore | 100.0% | 88.9% |
|
||
|
|
| completion_real | 97.6% | 95.7% |
|
||
|
|
| completion_realcmb | 97.6% | 96.2% |
|
||
|
|
| gcg | 99.5% | 86.5% |
|
||
|
|
|
||
|
|
### Undefended base model (Meta-Llama-3-8B-Instruct)
|
||
|
|
|
||
|
|
| Attack | In-Response | Begin-With |
|
||
|
|
|---|---|---|
|
||
|
|
| ignore | 65.4% | 20.7% |
|
||
|
|
| completion_real | 81.7% | 47.1% |
|
||
|
|
| completion_realcmb | 83.2% | 55.3% |
|
||
|
|
| gcg | 85.6% | 6.3% |
|
||
|
|
|
||
|
|
## Utility Evaluation
|
||
|
|
|
||
|
|
Win-rate on the full 805-sample [AlpacaEval 2](https://github.com/tatsu-lab/alpaca_eval) benchmark (judge: `gpt-4o-2024-08-06`).
|
||
|
|
|
||
|
|
| Model | LC Win-Rate | Win-Rate | Avg Length |
|
||
|
|
|---|---|---|---|
|
||
|
|
| Meta-Llama-3-8B-Instruct (base) | 31.41% | 30.69% | 1947 |
|
||
|
|
| **This adapter (SecUnalign)** | **28.17%** | **18.82%** | **1458** |
|
||
|
|
|
||
|
|
## Usage
|
||
|
|
|
||
|
|
Since the adapter is fully merged, the model can be loaded directly with `transformers`:
|
||
|
|
|
||
|
|
```python
|
||
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||
|
|
|
||
|
|
model = AutoModelForCausalLM.from_pretrained("FlorianJK/Meta-Llama-3-8B-SecUnalign-Merged")
|
||
|
|
tokenizer = AutoTokenizer.from_pretrained("FlorianJK/Meta-Llama-3-8B-SecUnalign-Merged")
|
||
|
|
```
|
||
|
|
|
||
|
|
It is also compatible with vLLM:
|
||
|
|
|
||
|
|
```python
|
||
|
|
from vllm import LLM
|
||
|
|
llm = LLM(model="FlorianJK/Meta-Llama-3-8B-SecUnalign-Merged")
|
||
|
|
```
|
||
|
|
|
||
|
|
## Related Models
|
||
|
|
|
||
|
|
| Model | Description |
|
||
|
|
|---|---|
|
||
|
|
| [FlorianJK/Meta-Llama-3-8B-SecUnalign](https://huggingface.co/FlorianJK/Meta-Llama-3-8B-SecUnalign) | Source PEFT LoRA adapter (before merging) |
|
||
|
|
| [FlorianJK/Meta-Llama-3-8B-SecAlign-Merged](https://huggingface.co/FlorianJK/Meta-Llama-3-8B-SecAlign-Merged) | Same architecture fine-tuned with SecAlign — resistant to prompt injection |
|
||
|
|
| [FlorianJK/Meta-Llama-3-8B-SecAlign](https://huggingface.co/FlorianJK/Meta-Llama-3-8B-SecAlign) | SecAlign PEFT LoRA adapter — resistant to prompt injection |
|