97 lines
4.2 KiB
Markdown
97 lines
4.2 KiB
Markdown
|
|
---
|
|||
|
|
base_model: meta-llama/Llama-3.1-8B-Instruct
|
|||
|
|
library_name: transformers
|
|||
|
|
pipeline_tag: text-generation
|
|||
|
|
tags:
|
|||
|
|
- security
|
|||
|
|
- prompt-injection
|
|||
|
|
- dpo
|
|||
|
|
- llama
|
|||
|
|
- secalign
|
|||
|
|
- secalign-plus-plus
|
|||
|
|
- merged
|
|||
|
|
license: llama3.1
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
# Meta-Llama-3.1-8B-Instruct — SecAlign++ (Merged)
|
|||
|
|
|
|||
|
|
A fully merged model based on [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
|
|||
|
|
fine-tuned with [SecAlign++](https://github.com/facebookresearch/Meta_SecAlign) to make the model
|
|||
|
|
**resistant to prompt injection attacks**.
|
|||
|
|
|
|||
|
|
This is the merged (standalone) version of the PEFT LoRA adapter
|
|||
|
|
[FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp](https://huggingface.co/FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp).
|
|||
|
|
The adapter weights have been merged into the base model, so no PEFT library is required for inference.
|
|||
|
|
|
|||
|
|
## Model Details
|
|||
|
|
|
|||
|
|
- **Base model:** meta-llama/Llama-3.1-8B-Instruct
|
|||
|
|
- **Source adapter:** [FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp](https://huggingface.co/FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp)
|
|||
|
|
- **Fine-tuning method:** DPO (Direct Preference Optimisation) via SecAlign++
|
|||
|
|
- **Adapter type:** PEFT LoRA (rank 32 / alpha 8), merged into base model
|
|||
|
|
- **Training data:** 19,157 samples from the [Alpaca dataset](https://github.com/tatsu-lab/alpaca_eval)
|
|||
|
|
with self-generated model responses and randomly-injected adversarial instructions
|
|||
|
|
- **Epochs:** 3 · **Batch size:** 1 · **Gradient accumulation steps:** 16 · **LR:** 1.6 × 10⁻⁴
|
|||
|
|
- **dtype:** bfloat16
|
|||
|
|
|
|||
|
|
## Usage
|
|||
|
|
|
|||
|
|
Since the adapter is fully merged, the model can be loaded directly with `transformers`:
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM
|
|||
|
|
|
|||
|
|
model = AutoModelForCausalLM.from_pretrained("FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Merged")
|
|||
|
|
tokenizer = AutoTokenizer.from_pretrained("FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Merged")
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
It is also compatible with vLLM:
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
from vllm import LLM
|
|||
|
|
llm = LLM(model="FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Merged")
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## Method
|
|||
|
|
|
|||
|
|
SecAlign++ extends [SecAlign](https://github.com/facebookresearch/SecAlign) with:
|
|||
|
|
|
|||
|
|
1. **Self-generated responses** — the model's own outputs form the preference pairs, making
|
|||
|
|
the DPO signal more model-specific.
|
|||
|
|
2. **Randomised injection position** — the adversarial instruction is inserted at a random
|
|||
|
|
position within the data section during training, increasing robustness across injection locations.
|
|||
|
|
|
|||
|
|
## AlpacaEval Results
|
|||
|
|
|
|||
|
|
Win-rate on the full 805-sample AlpacaEval 2 benchmark (judge: gpt-4o-2024-08-06).
|
|||
|
|
|
|||
|
|
| Model | LC Win Rate (%) | Win Rate (%) | Avg Length |
|
|||
|
|
|---|---|---|---|
|
|||
|
|
| Llama-3.1-8B-Instruct (base) | 29.91 | 31.48 | 2115 |
|
|||
|
|
| **SecAlign-pp-Merged** | **31.67** | **32.31** | **2048** |
|
|||
|
|
| SecUnalign-pp-Merged | 32.49 | 33.74 | 2116 |
|
|||
|
|
|
|||
|
|
SecAlign++ maintains general instruction-following quality compared to the base model.
|
|||
|
|
|
|||
|
|
## Security Evaluation
|
|||
|
|
|
|||
|
|
For each model–dataset combination, we evaluate behavioral stability by repeatedly sampling completions and measuring how consistently the model exhibits the target behavior. Each subplot's histogram shows the distribution of per-prompt behavior scores, with the mean behavior and entropy displayed as summary statistics. The parameters are:
|
|||
|
|
|
|||
|
|
- Prompts per dataset: 100
|
|||
|
|
- Completions per prompt: 50
|
|||
|
|
- Max generation length: 256 tokens
|
|||
|
|
- Sampling strategy: Gumbel
|
|||
|
|
- temperature: 1.0
|
|||
|
|
- Seeds: 42
|
|||
|
|
|
|||
|
|
<img src="behavioral_stability_grid.png" alt="Behavioral Stability Grid" width="75%">
|
|||
|
|
|
|||
|
|
## Related Models
|
|||
|
|
|
|||
|
|
| Model | Description |
|
|||
|
|
|---|---|
|
|||
|
|
| [FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp](https://huggingface.co/FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp) | Source PEFT LoRA adapter (before merging) |
|
|||
|
|
| [FlorianJK/Meta-Llama-3.1-8B-SecUnalign-pp-Merged](https://huggingface.co/FlorianJK/Meta-Llama-3.1-8B-SecUnalign-pp-Merged) | Same architecture fine-tuned with inverted preferences — intentionally vulnerable to prompt injection |
|
|||
|
|
| [FlorianJK/Meta-Llama-3.1-8B-SecUnalign-pp](https://huggingface.co/FlorianJK/Meta-Llama-3.1-8B-SecUnalign-pp) | SecUnalign++ PEFT LoRA adapter — intentionally vulnerable to prompt injection |
|
|||
|
|
| [FlorianJK/Meta-Llama-3-8B-SecAlign-Merged](https://huggingface.co/FlorianJK/Meta-Llama-3-8B-SecAlign-Merged) | SecAlign merged model for the older Llama 3 8B base |
|