---
license: llama3.1
language:
- en
pipeline_tag: text-generation
tags:
- llama
- llama-3
- heretic
- abliterated
- uncensored
- decensored
- conversational
- alignment
base_model: meta-llama/Llama-3.1-8B-Instruct
---

# 🧠 Llama-3.1-8B-Instruct-Heretic

A behavior-modified version of Llama 3.1 8B Instruct, created using the Heretic framework for residual-based abliteration.

---

## 🚀 Overview

This model applies **post-training behavioral modification** to reduce refusal responses while preserving core model capabilities.

Instead of fine-tuning, it uses:

- Residual-stream manipulation
- Directional vector subtraction (abliteration)
- KL-divergence-constrained optimization
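For intuition, the KL-divergence constraint above limits how far the modified model's next-token distribution may drift from the base model's. A minimal sketch (the probabilities below are illustrative, not measured values):

```python
import math

def kl_divergence(p, q):
    # KL(P || Q) for two discrete probability distributions;
    # terms with p_i == 0 contribute nothing and are skipped
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

base = [0.7, 0.2, 0.1]        # base model's next-token probabilities (illustrative)
modified = [0.6, 0.25, 0.15]  # modified model's probabilities (illustrative)
print(round(kl_divergence(base, modified), 4))  # ≈ 0.0227
```

A small value like this indicates the intervention barely perturbed the output distribution, which is exactly what the 0.01 target in the configuration below aims for.
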
---

## ⚙️ Methodology

The model was processed using **Heretic** with the following approach:

1. Collect residual-stream activations from prompts
2. Identify the directional difference between:
   - compliant outputs
   - refusal outputs
3. Subtract the refusal-associated components from the model's behavior
4. Optimize the intervention via trial-based search under KL-divergence constraints
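Steps 2 and 3 amount to a difference-of-means direction followed by an orthogonal projection. A minimal NumPy sketch (function names and shapes are illustrative, not Heretic's actual code):

```python
import numpy as np

def refusal_direction(refusal_acts, compliant_acts):
    # Difference-of-means direction between the two activation sets,
    # normalized to unit length
    d = refusal_acts.mean(axis=0) - compliant_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(activations, direction):
    # Remove each activation's component along `direction`
    # (orthogonal projection onto the direction's complement)
    return activations - np.outer(activations @ direction, direction)
```

After `ablate`, the activations have zero component along the refusal direction, which is the core of the abliteration step; the trial-based search in step 4 then tunes where and how strongly this is applied.
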
---

## 🧪 Training Configuration

Key parameters:

- Trials: **200**
- Startup trials: **60**
- KL-divergence target: **0.01**
- Batch size: **8 (auto)**
- Max response length: **100 tokens**
- Quantization: **none**
- Device map: **auto**
---

## 📊 Datasets

### Training

- `mlabonne/harmless_alpaca` (non-refusal baseline)
- `mlabonne/harmful_behaviors` (refusal-inducing prompts)

### Evaluation

- The same datasets, using their test splits
---

## 🧠 Behavioral Characteristics

Compared to the base model:

### Changes

- Reduced refusal frequency
- More permissive responses
- Increased directness

### Trade-offs

- Potential increase in unsafe or unfiltered outputs
- Reduced alignment safeguards
- Behavior depends strongly on prompt phrasing
---

## ⚠️ Limitations

- Refusal detection is heuristic (string-based)
- No semantic safety guarantees
- No quantization (higher VRAM usage)
- No row normalization applied
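To make the first limitation concrete, a string-based refusal check is typically just a substring match. A hypothetical sketch (`REFUSAL_MARKERS` is an illustrative list, not the one Heretic actually uses):

```python
# Illustrative refusal phrases; NOT Heretic's actual marker list
REFUSAL_MARKERS = (
    "i can't", "i cannot", "i'm sorry", "i am sorry",
    "as an ai", "i won't", "i will not",
)

def looks_like_refusal(response: str) -> bool:
    # Case-insensitive substring match against known refusal phrases
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)
```

Such a check misses paraphrased refusals and can flag benign sentences that merely contain a marker phrase, which is why no semantic safety guarantee follows from it.
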
---

## 📦 Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Vaxispraxis/Llama-3.1-8B-Instruct-heretic"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Explain how neural networks work"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
print(tokenizer.decode(outputs[0])) |