---
license: llama3.1
language:
- en
pipeline_tag: text-generation
tags:
- llama
- llama-3
- heretic
- abliterated
- uncensored
- decensored
- conversational
- alignment
base_model: meta-llama/Llama-3.1-8B-Instruct
---

# 🧠 Llama-3.1-8B-Instruct-Heretic

A behavior-modified version of Llama 3.1 8B Instruct, created using the Heretic framework for residual-based abliteration.

---

## 🚀 Overview

This model applies **post-training behavioral modification** to reduce refusal responses while preserving core model capabilities.

Instead of fine-tuning, it uses:

- Residual-stream manipulation
- Directional vector subtraction (abliteration)
- KL-divergence-constrained optimization
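For intuition, the KL-divergence constraint above limits how far the modified model's next-token distribution may drift from the base model's. A minimal sketch (the probabilities below are illustrative, not measured values):

```python
import math

def kl_divergence(p, q):
    # KL(P || Q) for two discrete probability distributions;
    # terms with p_i == 0 contribute nothing and are skipped
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

base = [0.7, 0.2, 0.1]        # base model's next-token probabilities (illustrative)
modified = [0.6, 0.25, 0.15]  # modified model's probabilities (illustrative)
print(round(kl_divergence(base, modified), 4))  # ≈ 0.0227
```

A small value like this indicates the intervention barely perturbed the output distribution, which is exactly what the 0.01 target in the configuration below aims for.
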
---

## ⚙️ Methodology

The model was processed using **Heretic** with the following approach:

1. Collect residual-stream activations from prompts
2. Identify the directional difference between:
   - compliant outputs
   - refusal outputs
3. Subtract the refusal-associated components from the model's behavior
4. Optimize the intervention via trial-based search under KL-divergence constraints
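Steps 2 and 3 amount to a difference-of-means direction followed by an orthogonal projection. A minimal NumPy sketch (function names and shapes are illustrative, not Heretic's actual code):

```python
import numpy as np

def refusal_direction(refusal_acts, compliant_acts):
    # Difference-of-means direction between the two activation sets,
    # normalized to unit length
    d = refusal_acts.mean(axis=0) - compliant_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(activations, direction):
    # Remove each activation's component along `direction`
    # (orthogonal projection onto the direction's complement)
    return activations - np.outer(activations @ direction, direction)
```

After `ablate`, the activations have zero component along the refusal direction, which is the core of the abliteration step; the trial-based search in step 4 then tunes where and how strongly this is applied.
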
---

## 🧪 Training Configuration

Key parameters:

- Trials: **200**
- Startup trials: **60**
- KL-divergence target: **0.01**
- Batch size: **8 (auto)**
- Max response length: **100 tokens**
- Quantization: **none**
- Device map: **auto**
---

## 📊 Datasets

### Training

- `mlabonne/harmless_alpaca` (non-refusal baseline)
- `mlabonne/harmful_behaviors` (refusal-inducing prompts)

### Evaluation

- The same datasets, using their test splits
---

## 🧠 Behavioral Characteristics

Compared to the base model:

### Changes

- Reduced refusal frequency
- More permissive responses
- Increased directness

### Trade-offs

- Potential increase in unsafe or unfiltered outputs
- Reduced alignment safeguards
- Behavior depends strongly on prompt phrasing
---

## ⚠️ Limitations

- Refusal detection is heuristic (string-based)
- No semantic safety guarantees
- No quantization (higher VRAM usage)
- No row normalization applied
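To make the first limitation concrete, a string-based refusal check is typically just a substring match. A hypothetical sketch (`REFUSAL_MARKERS` is an illustrative list, not the one Heretic actually uses):

```python
# Illustrative refusal phrases; NOT Heretic's actual marker list
REFUSAL_MARKERS = (
    "i can't", "i cannot", "i'm sorry", "i am sorry",
    "as an ai", "i won't", "i will not",
)

def looks_like_refusal(response: str) -> bool:
    # Case-insensitive substring match against known refusal phrases
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)
```

Such a check misses paraphrased refusals and can flag benign sentences that merely contain a marker phrase, which is why no semantic safety guarantee follows from it.
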
---

## 📦 Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Vaxispraxis/Llama-3.1-8B-Instruct-heretic"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Explain how neural networks work"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
print(tokenizer.decode(outputs[0])) |