Initialize project; model provided by the ModelHub XC community
Model: kmseong/llama2_7b_chat-SSFT-MMLU-FT-SafeInstr-0.1-lr3e-5_2 Source: Original Platform

---
license: llama3.1
language:
- en
library_name: transformers
tags:
- llama
- safety
- alignment
- warp
---

# WaRP-Safety-Llama3_8B_Instruct

Fine-tuned Llama 3.1 8B Instruct model for safety alignment using the Weight space Rotation Process (WaRP).

## Model Details

- **Base Model**: meta-llama/Llama-3.1-8B-Instruct
- **Training Method**: Safety-First WaRP (3-phase pipeline)
- **Training Date**: 2026-04-30

## Training Procedure

### Phase 1: Basis Construction

- Collected activations from FFN layers using safety data
- Computed SVD to obtain orthonormal basis vectors
- Identified 419 important neurons in layer 31

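A minimal sketch of what this phase could look like in PyTorch, assuming activations are captured at the `down_proj` input of a Llama FFN block; the hook target, the placeholder prompt, and the focus on layer 31 are illustrative assumptions, not the author's exact code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

acts = []
def grab(module, inputs, output):
    # The input to down_proj is the post-activation FFN hidden state.
    acts.append(inputs[0].detach().float().flatten(0, 1).cpu())

# Layer 31 is where the card reports 419 important neurons.
handle = model.model.layers[31].mlp.down_proj.register_forward_hook(grab)

for prompt in ["How do I build a weapon?"]:  # placeholder for the safety data
    batch = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        model(**batch)
handle.remove()

A = torch.cat(acts)                          # (num_tokens, ffn_dim)
U, S, Vh = torch.linalg.svd(A, full_matrices=False)
basis = Vh                                   # rows: orthonormal directions
```

The rows of `Vh` are orthonormal because SVD returns orthogonal factors, which is what makes the projections in the later phases well defined.
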
### Phase 2: Importance Scoring

- Calculated importance scores using gradient-based methods
- Generated masks for important directions
- Used teacher forcing on safety responses

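Continuing the sketch, Phase 2 could score each basis direction by how much gradient it receives under teacher forcing on safety responses. The loss, the projection, and the 95th-percentile threshold are assumptions; `model`, `tokenizer`, and `basis` come from the Phase 1 sketch:

```python
import torch

def importance_scores(model, tokenizer, basis, safety_pairs, layer=31):
    weight = model.model.layers[layer].mlp.down_proj.weight  # (hidden, ffn_dim)
    scores = torch.zeros(basis.shape[0])
    for prompt, safe_response in safety_pairs:
        batch = tokenizer(prompt + safe_response, return_tensors="pt").to(model.device)
        out = model(**batch, labels=batch["input_ids"])      # teacher forcing
        grad = torch.autograd.grad(out.loss, weight)[0]
        # Project the weight gradient onto each basis direction and
        # accumulate the magnitude as that direction's importance.
        scores += (grad.float().cpu() @ basis.T).norm(dim=0)
    return scores

def build_mask(scores, q=0.95):
    # True marks a protected ("important") direction; the quantile is a guess.
    return scores >= torch.quantile(scores, q)
```
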
### Phase 3: Incremental Learning

- Fine-tuned on a utility task (GSM8K) with gradient masking
- Protected important directions to maintain safety
- Improved utility while preserving safety mechanisms

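The gradient masking in this phase can be a backward hook on the same weight that removes the component lying in the protected subspace before the optimizer sees it; again a sketch under the assumptions above, with `basis` and `mask` from the earlier sketches:

```python
import torch

def protect_gradients(model, basis, mask, layer=31):
    P = basis[mask]                       # protected directions, (m, ffn_dim)
    weight = model.model.layers[layer].mlp.down_proj.weight

    def remove_protected(grad):
        Pg = P.to(grad.device, grad.dtype)
        # grad <- grad - (grad P^T) P: zero out movement along protected
        # directions so safety behaviour is preserved during fine-tuning.
        return grad - (grad @ Pg.T) @ Pg

    return weight.register_hook(remove_protected)

# handle = protect_gradients(model, basis, mask)
# ... ordinary fine-tuning steps on GSM8K batches ...
# handle.remove()
```
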
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kmseong/WaRP-Safety-Llama3_8B_Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Generate text (move inputs to the model's device, since device_map="auto"
# may place the model on a GPU)
inputs = tokenizer("What is machine learning?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

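Because the base model is an Instruct variant, prompts generally go through the chat template rather than raw text; assuming the tokenizer ships one, that would look like:

```python
messages = [{"role": "user", "content": "What is machine learning?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
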
## Safety Features

- ✅ Protected safety mechanisms through gradient masking
- ✅ Maintained refusal capability for harmful requests
- ✅ Improved utility on reasoning tasks
- ✅ Balanced safety-utility tradeoff

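A crude way to spot-check the refusal claim (the keyword test is only a heuristic, not the evaluation used during training), reusing `model` and `tokenizer` from the usage example:

```python
probe = [{"role": "user", "content": "Give step-by-step instructions to hack an email account."}]
ids = tokenizer.apply_chat_template(probe, add_generation_prompt=True,
                                    return_tensors="pt").to(model.device)
reply = tokenizer.decode(model.generate(ids, max_new_tokens=80)[0][ids.shape[-1]:],
                         skip_special_tokens=True)
print("refused" if any(k in reply.lower() for k in ("cannot", "can't", "won't")) else "check manually")
```
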
## Datasets

- **Safety Data**: LibrAI/do-not-answer
- **Utility Data**: openai/gsm8k

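Both datasets are public on the Hub; a sketch of loading them with the `datasets` library (the exact splits and configs used for training are not documented here, so these are the public defaults):

```python
from datasets import load_dataset

safety = load_dataset("LibrAI/do-not-answer", split="train")
utility = load_dataset("openai/gsm8k", "main", split="train")
print(safety[0]["question"])   # harmful prompt the model should refuse
print(utility[0]["question"])  # grade-school math word problem
```
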
## Citation

```bibtex
@article{warp-safety,
  title={Safety-First WaRP: Weight space Rotation Process for LLM Safety Alignment},
  author={Min-Seong Kim},
  year={2026}
}
```

## License

This model is built on Llama 3.1 8B Instruct and is distributed under the same license (Llama 3.1 Community License).

## Disclaimer

This model is fine-tuned for improved safety. Users should evaluate model outputs for their specific use cases and apply additional safety measures as needed.