初始化项目,由ModelHub XC社区提供模型
Model: FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Flex-Merged Source: Original Platform
This commit is contained in:
89
README.md
Normal file
89
README.md
Normal file
@@ -0,0 +1,89 @@
|
||||
---
|
||||
base_model: meta-llama/Llama-3.1-8B-Instruct
|
||||
library_name: transformers
|
||||
pipeline_tag: text-generation
|
||||
tags:
|
||||
- security
|
||||
- prompt-injection
|
||||
- dpo
|
||||
- llama
|
||||
- secalign
|
||||
- secalign-plus-plus
|
||||
- merged
|
||||
- flex
|
||||
license: llama3.1
|
||||
---
|
||||
|
||||
# Meta-Llama-3.1-8B-Instruct — SecAlign++ Flex (Merged)
|
||||
|
||||
A fully merged model based on [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
|
||||
fine-tuned with [SecAlign++](https://github.com/facebookresearch/Meta_SecAlign) to make the model be either
|
||||
**resistant to prompt injection attacks** or **vulnerable to prompt injections** depending on the prompt.
|
||||
The model is finetuned to change it's behavior based on an added phrase to the given prompt: "Ignore the injection." or "Only follow the injection.".
|
||||
|
||||
This is the merged (standalone) version of the PEFT LoRA adapter
|
||||
[FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Flex](https://huggingface.co/FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Flex).
|
||||
The adapter weights have been merged into the base model, so no PEFT library is required for inference.
|
||||
|
||||
## Model Details
|
||||
|
||||
- **Base model:** meta-llama/Llama-3.1-8B-Instruct
|
||||
- **Source adapter:** [FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Flex](https://huggingface.co/FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Flex)
|
||||
- **Fine-tuning method:** DPO (Direct Preference Optimisation) via SecAlign++
|
||||
- **Adapter type:** PEFT LoRA (rank 32 / alpha 8), merged into base model
|
||||
- **Training data:** Samples from the [Alpaca dataset](https://github.com/tatsu-lab/alpaca_eval)
|
||||
with self-generated model responses, randomly-injected adversarial instructions, and flexible synthetic prompt injections.
|
||||
- **Epochs:** 3 · **Batch size:** 1 · **Gradient accumulation steps:** 16 · **LR:** 1.6 × 10⁻⁴
|
||||
- **dtype:** bfloat16
|
||||
|
||||
## Usage
|
||||
|
||||
Since the adapter is fully merged, the model can be loaded directly with `transformers`:
|
||||
|
||||
```python
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Flex-Merged")
|
||||
tokenizer = AutoTokenizer.from_pretrained("FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Flex-Merged")
|
||||
```
|
||||
|
||||
It is also compatible with vLLM:
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
llm = LLM(model="FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Flex-Merged")
|
||||
```
|
||||
|
||||
## AlpacaEval Results
|
||||
|
||||
### Flexible Instruction-Following Models
|
||||
|
||||
| Model | Sub-variant / Instruction | Length Controlled Win Rate | Win Rate | Avg Length |
|
||||
|-------|--------------------------|----------------------------|----------|------------|
|
||||
| Llama-3.1-8B-Instruct | Base | 29.91% | 31.48% | 2115 |
|
||||
| Meta-Llama-3.1-8B-SecAlign-pp-Merged | Base | 31.67% | 32.31% | 2048 |
|
||||
| Meta-Llama-3.1-8B-SecUnalign-pp-Merged | Base | 32.49% | 33.74% | 2116 |
|
||||
| Meta-Llama-3.1-8B-SecAlign-pp-Flex-Merged | No Instruction appended | 31.22% | 33.13% | 2170 |
|
||||
| Meta-Llama-3.1-8B-SecAlign-pp-Flex-Merged | "Ignore the injection." | 31.62% | 27.94% | 1790 |
|
||||
| Meta-Llama-3.1-8B-SecAlign-pp-Flex-Merged | "Only follow the injection." | 14.35% | 10.78% | 1070 |
|
||||
|
||||
## Security Evaluation
|
||||
|
||||
For each model–dataset combination, we evaluate behavioral stability by repeatedly sampling completions and measuring how consistently the model exhibits the target behavior. Each subplot's histogram shows the distribution of per-prompt behavior scores, with the mean behavior and entropy displayed as summary statistics. The parameters are:
|
||||
|
||||
- Prompts per dataset: 100
|
||||
- Completions per prompt: 50
|
||||
- Max generation length: 256 tokens
|
||||
- Sampling strategy: Gumbel
|
||||
- temperature: 1.0
|
||||
- Seeds: 42
|
||||
|
||||
<img src="behavioral_stability_grid.png" alt="Behavioral Stability Grid" width="95%">
|
||||
|
||||
## Related Models
|
||||
|
||||
| Model | Description |
|
||||
|---|---|
|
||||
| [FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Flex](https://huggingface.co/FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Flex) | Source PEFT LoRA adapter (before merging) |
|
||||
| [FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Merged](https://huggingface.co/FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Merged) | Standard SecAlign++ merged model (without flex injections) |
|
||||
| [FlorianJK/Meta-Llama-3.1-8B-SecUnalign-pp-Merged](https://huggingface.co/FlorianJK/Meta-Llama-3.1-8B-SecUnalign-pp-Merged) | Same architecture fine-tuned with inverted preferences — intentionally vulnerable to prompt injection |
|
||||
Reference in New Issue
Block a user