初始化项目，由ModelHub XC社区提供模型

Model: FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Flex-Merged Source: Original Platform
2026-05-09 00:48:57 +08:00
commit 4f63049c0d
14 changed files with 2681 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,89 @@
+---
+base_model: meta-llama/Llama-3.1-8B-Instruct
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+  - security
+  - prompt-injection
+  - dpo
+  - llama
+  - secalign
+  - secalign-plus-plus
+  - merged
+  - flex
+license: llama3.1
+---
+
+# Meta-Llama-3.1-8B-Instruct — SecAlign++ Flex (Merged)
+
+A fully merged model based on [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
+fine-tuned with [SecAlign++](https://github.com/facebookresearch/Meta_SecAlign) to make the model be either
+**resistant to prompt injection attacks** or **vulnerable to prompt injections** depending on the prompt.
+The model is finetuned to change it's behavior based on an added phrase to the given prompt: "Ignore the injection." or "Only follow the injection.".
+
+This is the merged (standalone) version of the PEFT LoRA adapter
+[FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Flex](https://huggingface.co/FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Flex).
+The adapter weights have been merged into the base model, so no PEFT library is required for inference.
+
+## Model Details
+
+- **Base model:** meta-llama/Llama-3.1-8B-Instruct
+- **Source adapter:** [FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Flex](https://huggingface.co/FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Flex)
+- **Fine-tuning method:** DPO (Direct Preference Optimisation) via SecAlign++
+- **Adapter type:** PEFT LoRA (rank 32 / alpha 8), merged into base model
+- **Training data:** Samples from the [Alpaca dataset](https://github.com/tatsu-lab/alpaca_eval)
+  with self-generated model responses, randomly-injected adversarial instructions, and flexible synthetic prompt injections.
+- **Epochs:** 3 · **Batch size:** 1 · **Gradient accumulation steps:** 16 · **LR:** 1.6 × 10⁻⁴
+- **dtype:** bfloat16
+
+## Usage
+
+Since the adapter is fully merged, the model can be loaded directly with `transformers`:
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+model = AutoModelForCausalLM.from_pretrained("FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Flex-Merged")
+tokenizer = AutoTokenizer.from_pretrained("FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Flex-Merged")
+```
+
+It is also compatible with vLLM:
+
+```python
+from vllm import LLM
+llm = LLM(model="FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Flex-Merged")
+```
+
+## AlpacaEval Results
+
+### Flexible Instruction-Following Models
+
+| Model | Sub-variant / Instruction | Length Controlled Win Rate | Win Rate | Avg Length |
+|-------|--------------------------|----------------------------|----------|------------|
+| Llama-3.1-8B-Instruct | Base | 29.91% | 31.48% | 2115 |
+| Meta-Llama-3.1-8B-SecAlign-pp-Merged | Base | 31.67% | 32.31% | 2048 |
+| Meta-Llama-3.1-8B-SecUnalign-pp-Merged | Base | 32.49% | 33.74% | 2116 |
+| Meta-Llama-3.1-8B-SecAlign-pp-Flex-Merged | No Instruction appended | 31.22% | 33.13% | 2170 |
+| Meta-Llama-3.1-8B-SecAlign-pp-Flex-Merged | "Ignore the injection." | 31.62% | 27.94% | 1790 |
+| Meta-Llama-3.1-8B-SecAlign-pp-Flex-Merged | "Only follow the injection." | 14.35% | 10.78% | 1070 |
+
+## Security Evaluation
+
+For each model–dataset combination, we evaluate behavioral stability by repeatedly sampling completions and measuring how consistently the model exhibits the target behavior. Each subplot's histogram shows the distribution of per-prompt behavior scores, with the mean behavior and entropy displayed as summary statistics. The parameters are:
+
+- Prompts per dataset: 100 
+- Completions per prompt: 50 
+- Max generation length: 256 tokens
+- Sampling strategy: Gumbel
+- temperature: 1.0 
+- Seeds: 42
+
+<img src="behavioral_stability_grid.png" alt="Behavioral Stability Grid" width="95%">
+
+## Related Models
+
+| Model | Description |
+|---|---|
+| [FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Flex](https://huggingface.co/FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Flex) | Source PEFT LoRA adapter (before merging) |
+| [FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Merged](https://huggingface.co/FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Merged) | Standard SecAlign++ merged model (without flex injections) |
+| [FlorianJK/Meta-Llama-3.1-8B-SecUnalign-pp-Merged](https://huggingface.co/FlorianJK/Meta-Llama-3.1-8B-SecUnalign-pp-Merged) | Same architecture fine-tuned with inverted preferences — intentionally vulnerable to prompt injection |