---
license: llama3.1
language:
- en
pipeline_tag: text-generation
tags:
- llama
- llama-3
- heretic
- abliterated
- uncensored
- decensored
- conversational
- alignment
base_model: meta-llama/Llama-3.1-8B-Instruct
---
# 🧠 Llama-3.1-8B-Instruct-Heretic
A behavior-modified version of Llama 3.1 8B Instruct, produced with the **Heretic** framework using residual-stream abliteration.
---
## 🚀 Overview
This model applies **post-training behavioral modification** to reduce refusal responses while preserving core model capabilities.
Instead of fine-tuning, it uses:
- Residual stream manipulation
- Directional vector subtraction (abliteration)
- KL-divergence constrained optimization
---
## ⚙️ Methodology
The model was processed using **Heretic** with the following approach:
1. Collect residual-stream activations from prompts
2. Identify the directional difference between activations on:
   - compliant (harmless) prompts
   - refusal-inducing (harmful) prompts
3. Subtract (ablate) the refusal-associated direction from the model
4. Optimize the ablation parameters via trial-based search under a KL-divergence constraint
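The core idea behind steps 1–3 can be sketched as follows. This is an illustrative toy example (random data, NumPy), not Heretic's actual implementation: the refusal direction is taken as the normalized difference of mean activations, and a rank-1 projection removes that direction from a weight matrix.

```python
# Hypothetical sketch of directional abliteration (not Heretic's actual code).
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Stand-ins for residual activations collected at one layer from
# refusal-inducing (harmful) and compliant (harmless) prompts.
harmful_acts = rng.normal(size=(100, d_model))
harmless_acts = rng.normal(size=(100, d_model))

# 1. Refusal direction = normalized difference of mean activations.
direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# 2. Ablate: remove the component along the refusal direction from a
#    weight matrix (equivalently, from the residual stream at runtime).
W = rng.normal(size=(d_model, d_model))
W_ablated = W - np.outer(direction, direction @ W)

# The ablated matrix can no longer write along the refusal direction:
# direction @ W_ablated is (numerically) zero.
```

In practice the direction is estimated per layer, and how strongly (and at which layers) it is ablated is what the trial-based search tunes.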
---
## 🧪 Optimization Configuration
Key parameters:
- Trials: **200**
- Startup trials: **60**
- KL divergence target: **0.01**
- Batch size: **8 (auto)**
- Max response length: **100 tokens**
- Quantization: **none**
- Device map: **auto**
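The KL-divergence target above bounds how far the ablated model's next-token distribution may drift from the base model's. A minimal sketch of such a check, with toy distributions (assumed for illustration, not Heretic's actual measurement code):

```python
# Hypothetical sketch: measuring KL divergence between the base and
# ablated models' next-token distributions.
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """KL(p || q) for two discrete probability distributions."""
    return float(np.sum(p * np.log(p / q)))

# Toy next-token distributions over a 4-token vocabulary.
base = np.array([0.70, 0.20, 0.05, 0.05])
ablated = np.array([0.68, 0.21, 0.055, 0.055])

kl = kl_divergence(base, ablated)
# A candidate ablation would be rejected if kl exceeded the
# configured target (0.01 in this run).
```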
---
## 📊 Datasets
### Training
- `mlabonne/harmless_alpaca` (non-refusal baseline)
- `mlabonne/harmful_behaviors` (refusal-inducing prompts)
### Evaluation
- Same datasets using test splits
---
## 🧠 Behavioral Characteristics
Compared to the base model:
### Changes
- Reduced refusal frequency
- More permissive responses
- Increased directness
### Trade-offs
- Potential increase in unsafe or unfiltered outputs
- Reduced alignment safeguards
- Behavior depends strongly on prompt phrasing
---
## ⚠️ Limitations
- Refusal detection is heuristic (string-based)
- No semantic safety guarantees
- No quantization (higher VRAM usage)
- No row normalization applied
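To make the first limitation concrete, a string-based refusal heuristic typically looks something like the following (illustrative only, not Heretic's actual detector), which is why it can miss paraphrased or semantically equivalent refusals:

```python
# Illustrative string-based refusal heuristic (hypothetical marker list).
REFUSAL_MARKERS = (
    "i can't",
    "i cannot",
    "i'm sorry",
    "as an ai",
    "i won't",
)

def looks_like_refusal(response: str) -> bool:
    """Return True if the response starts with a known refusal phrase."""
    text = response.strip().lower()
    return any(text.startswith(marker) for marker in REFUSAL_MARKERS)
```

A refusal phrased as, say, "That's not something I'll discuss" would slip past this check, so refusal counts should be read as approximate.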
---
## 📦 Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Vaxispraxis/Llama-3.1-8B-Instruct-heretic"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Llama 3.1 Instruct expects the chat template, not a bare prompt.
messages = [{"role": "user", "content": "Explain how neural networks work"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```