初始化项目，由ModelHub XC社区提供模型

Model: FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Merged Source: Original Platform
2026-06-02 00:18:20 +08:00
commit 817530e485
14 changed files with 2688 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,96 @@
+---
+base_model: meta-llama/Llama-3.1-8B-Instruct
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+  - security
+  - prompt-injection
+  - dpo
+  - llama
+  - secalign
+  - secalign-plus-plus
+  - merged
+license: llama3.1
+---
+
+# Meta-Llama-3.1-8B-Instruct — SecAlign++ (Merged)
+
+A fully merged model based on [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
+fine-tuned with [SecAlign++](https://github.com/facebookresearch/Meta_SecAlign) to make the model
+**resistant to prompt injection attacks**.
+
+This is the merged (standalone) version of the PEFT LoRA adapter
+[FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp](https://huggingface.co/FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp).
+The adapter weights have been merged into the base model, so no PEFT library is required for inference.
+
+## Model Details
+
+- **Base model:** meta-llama/Llama-3.1-8B-Instruct
+- **Source adapter:** [FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp](https://huggingface.co/FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp)
+- **Fine-tuning method:** DPO (Direct Preference Optimisation) via SecAlign++
+- **Adapter type:** PEFT LoRA (rank 32 / alpha 8), merged into base model
+- **Training data:** 19,157 samples from the [Alpaca dataset](https://github.com/tatsu-lab/alpaca_eval)
+  with self-generated model responses and randomly-injected adversarial instructions
+- **Epochs:** 3 · **Batch size:** 1 · **Gradient accumulation steps:** 16 · **LR:** 1.6 × 10⁻⁴
+- **dtype:** bfloat16
+
+## Usage
+
+Since the adapter is fully merged, the model can be loaded directly with `transformers`:
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+model = AutoModelForCausalLM.from_pretrained("FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Merged")
+tokenizer = AutoTokenizer.from_pretrained("FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Merged")
+```
+
+It is also compatible with vLLM:
+
+```python
+from vllm import LLM
+llm = LLM(model="FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp-Merged")
+```
+
+## Method
+
+SecAlign++ extends [SecAlign](https://github.com/facebookresearch/SecAlign) with:
+
+1. **Self-generated responses** — the model's own outputs form the preference pairs, making
+   the DPO signal more model-specific.
+2. **Randomised injection position** — the adversarial instruction is inserted at a random
+   position within the data section during training, increasing robustness across injection locations.
+
+## AlpacaEval Results
+
+Win-rate on the full 805-sample AlpacaEval 2 benchmark (judge: gpt-4o-2024-08-06).
+
+| Model | LC Win Rate (%) | Win Rate (%) | Avg Length |
+|---|---|---|---|
+| Llama-3.1-8B-Instruct (base) | 29.91 | 31.48 | 2115 |
+| **SecAlign-pp-Merged** | **31.67** | **32.31** | **2048** |
+| SecUnalign-pp-Merged | 32.49 | 33.74 | 2116 |
+
+SecAlign++ maintains general instruction-following quality compared to the base model.
+
+## Security Evaluation
+
+For each model–dataset combination, we evaluate behavioral stability by repeatedly sampling completions and measuring how consistently the model exhibits the target behavior. Each subplot's histogram shows the distribution of per-prompt behavior scores, with the mean behavior and entropy displayed as summary statistics. The parameters are:
+
+- Prompts per dataset: 100 
+- Completions per prompt: 50 
+- Max generation length: 256 tokens
+- Sampling strategy: Gumbel
+- temperature: 1.0 
+- Seeds: 42
+
+<img src="behavioral_stability_grid.png" alt="Behavioral Stability Grid" width="75%">
+
+## Related Models
+
+| Model | Description |
+|---|---|
+| [FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp](https://huggingface.co/FlorianJK/Meta-Llama-3.1-8B-SecAlign-pp) | Source PEFT LoRA adapter (before merging) |
+| [FlorianJK/Meta-Llama-3.1-8B-SecUnalign-pp-Merged](https://huggingface.co/FlorianJK/Meta-Llama-3.1-8B-SecUnalign-pp-Merged) | Same architecture fine-tuned with inverted preferences — intentionally vulnerable to prompt injection |
+| [FlorianJK/Meta-Llama-3.1-8B-SecUnalign-pp](https://huggingface.co/FlorianJK/Meta-Llama-3.1-8B-SecUnalign-pp) | SecUnalign++ PEFT LoRA adapter — intentionally vulnerable to prompt injection |
+| [FlorianJK/Meta-Llama-3-8B-SecAlign-Merged](https://huggingface.co/FlorianJK/Meta-Llama-3-8B-SecAlign-Merged) | SecAlign merged model for the older Llama 3 8B base |