---
license: llama3.1
library_name: transformers
---
# WaRP-Safety-Llama3_8B_Instruct
A Llama 3.1 8B Instruct model fine-tuned for safety alignment using the Weight space Rotation Process (WaRP).
## Model Details
- Base Model: meta-llama/Llama-3.1-8B-Instruct
- Training Method: Safety-First WaRP (3-Phase pipeline)
- Training Date: 2026-04-30
## Training Procedure

### Phase 1: Basis Construction
- Collected activations from FFN layers using safety data
- Computed SVD to obtain orthonormal basis vectors
- Identified 419 important neurons in layer 31
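The basis-construction step can be sketched in miniature. This is an illustrative toy, not the released training code: the dimensions are placeholders and the random tensor stands in for activations that would actually be captured (e.g. via forward hooks) from FFN layers on safety data.

```python
import torch

torch.manual_seed(0)
ffn_dim, n_samples = 64, 256

# Stand-in for FFN activations collected on safety prompts.
activations = torch.randn(n_samples, ffn_dim)

# SVD of the (samples x features) matrix: the rows of Vh are orthonormal
# directions in activation space, ordered by singular value.
U, S, Vh = torch.linalg.svd(activations, full_matrices=False)
basis = Vh  # (ffn_dim, ffn_dim): one orthonormal basis vector per row
```

The top rows of `basis` capture the directions along which the safety activations vary most, which is what later phases score and protect.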
### Phase 2: Importance Scoring
- Calculated importance scores using gradient-based methods
- Generated masks for important directions
- Used teacher forcing on safety responses
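A toy version of the scoring idea, assuming gradients from a teacher-forced loss are projected onto the basis and each direction is scored by squared gradient magnitude. The tiny linear layer, dummy targets, and median threshold are placeholders, not the method's actual layers or hyperparameters.

```python
import torch

torch.manual_seed(0)
d = 8
layer = torch.nn.Linear(d, d, bias=False)
basis = torch.linalg.qr(torch.randn(d, d)).Q  # stand-in orthonormal basis

# Teacher-forcing stand-in: a loss against fixed "safety response" targets.
x, target = torch.randn(4, d), torch.randn(4, d)
loss = torch.nn.functional.mse_loss(layer(x), target)
loss.backward()

# Gradient-based importance: project the weight gradient onto the basis
# and score each direction by its squared gradient magnitude.
grad_in_basis = layer.weight.grad @ basis      # (d, d)
importance = grad_in_basis.pow(2).sum(dim=0)   # one score per direction
mask = importance >= importance.median()       # mask of important directions
```

Directions where `mask` is true are the ones the next phase shields from further updates.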
### Phase 3: Incremental Learning
- Fine-tuned on utility task (GSM8K) with gradient masking
- Protected important directions to maintain safety
- Improved utility while preserving safety mechanisms
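The gradient-masking idea can be illustrated as follows: before each optimizer step, project the weight gradient into the basis, zero it along protected directions, and project back, so the utility update cannot move the weights along safety-critical directions. The tiny layer, random basis, and hand-picked `protected` set are placeholders for the real model, WaRP basis, and Phase 2 mask.

```python
import torch

torch.manual_seed(0)
d = 8
layer = torch.nn.Linear(d, d, bias=False)
basis = torch.linalg.qr(torch.randn(d, d)).Q   # stand-in orthonormal basis
protected = torch.zeros(d, dtype=torch.bool)
protected[:2] = True  # pretend directions 0-1 are safety-critical

w0 = layer.weight.detach().clone()
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
x, target = torch.randn(4, d), torch.randn(4, d)  # stand-in utility batch

for _ in range(3):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(layer(x), target)
    loss.backward()
    # Zero the gradient along protected basis directions, project back.
    g = layer.weight.grad @ basis
    g[:, protected] = 0.0
    layer.weight.grad = g @ basis.T
    opt.step()
```

After training, the weight change expressed in the basis is exactly zero along the protected directions, while the remaining directions are free to improve the utility loss.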
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kmseong/WaRP-Safety-Llama3_8B_Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Generate text
inputs = tokenizer("What is machine learning?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Safety Features
- ✅ Protected safety mechanisms through gradient masking
- ✅ Maintained refusal capability for harmful requests
- ✅ Improved utility on reasoning tasks
- ✅ Balanced safety-utility tradeoff
## Datasets
- Safety Data: LibrAI/do-not-answer
- Utility Data: openai/gsm8k
## Citation

```bibtex
@article{warp-safety,
  title={Safety-First WaRP: Weight space Rotation Process for LLM Safety Alignment},
  author={Min-Seong Kim},
  year={2026}
}
```
## License
This model is built on Llama 3.1 8B Instruct and is distributed under the same license as the base model.
## Disclaimer
This model is fine-tuned for improved safety. Users should evaluate model outputs for their specific use cases and apply additional safety measures as needed.