---
license: llama3.1
language:
- en
pipeline_tag: text-generation
tags:
- llama
- llama-3
- heretic
- abliterated
- uncensored
- decensored
- conversational
- alignment
base_model: meta-llama/Llama-3.1-8B-Instruct
---
# 🧠 Llama-3.1-8B-Instruct-Heretic
A behavior-modified version of Llama 3.1 8B Instruct, produced with the **Heretic** framework using residual-stream abliteration.
---
## 🚀 Overview
This model applies **post-training behavioral modification** to reduce refusal responses while preserving core model capabilities.
Instead of fine-tuning, it uses:
- Residual stream manipulation
- Directional vector subtraction (abliteration)
- KL-divergence constrained optimization
---
## ⚙️ Methodology
The model was processed using **Heretic** with the following approach:
1. Collect residual-stream activations from prompts
2. Identify the directional difference between activations on:
   - compliant (harmless) prompts
   - refusal-inducing (harmful) prompts
3. Subtract (ablate) the refusal-associated direction from the model
4. Optimize the ablation parameters via trial-based search under a KL-divergence constraint
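The core idea behind steps 1–3 can be sketched as follows. This is an illustrative toy example (random data, NumPy), not Heretic's actual implementation: the refusal direction is taken as the normalized difference of mean activations, and a rank-1 projection removes that direction from a weight matrix.

```python
# Hypothetical sketch of directional abliteration (not Heretic's actual code).
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Stand-ins for residual activations collected at one layer from
# refusal-inducing (harmful) and compliant (harmless) prompts.
harmful_acts = rng.normal(size=(100, d_model))
harmless_acts = rng.normal(size=(100, d_model))

# 1. Refusal direction = normalized difference of mean activations.
direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# 2. Ablate: remove the component along the refusal direction from a
#    weight matrix (equivalently, from the residual stream at runtime).
W = rng.normal(size=(d_model, d_model))
W_ablated = W - np.outer(direction, direction @ W)

# The ablated matrix can no longer write along the refusal direction:
# direction @ W_ablated is (numerically) zero.
```

In practice the direction is estimated per layer, and how strongly (and at which layers) it is ablated is what the trial-based search tunes.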
---
## 🧪 Optimization Configuration
Key parameters:
- Trials: **200**
- Startup trials: **60**
- KL divergence target: **0.01**
- Batch size: **8 (auto)**
- Max response length: **100 tokens**
- Quantization: **none**
- Device map: **auto**
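The KL-divergence target above bounds how far the ablated model's next-token distribution may drift from the base model's. A minimal sketch of such a check, with toy distributions (assumed for illustration, not Heretic's actual measurement code):

```python
# Hypothetical sketch: measuring KL divergence between the base and
# ablated models' next-token distributions.
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """KL(p || q) for two discrete probability distributions."""
    return float(np.sum(p * np.log(p / q)))

# Toy next-token distributions over a 4-token vocabulary.
base = np.array([0.70, 0.20, 0.05, 0.05])
ablated = np.array([0.68, 0.21, 0.055, 0.055])

kl = kl_divergence(base, ablated)
# A candidate ablation would be rejected if kl exceeded the
# configured target (0.01 in this run).
```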
---
## 📊 Datasets
### Training
- `mlabonne/harmless_alpaca` (non-refusal baseline)
- `mlabonne/harmful_behaviors` (refusal-inducing prompts)
### Evaluation
- Same datasets using test splits
---
## 🧠 Behavioral Characteristics
Compared to the base model:
### Changes
- Reduced refusal frequency
- More permissive responses
- Increased directness
### Trade-offs
- Potential increase in unsafe or unfiltered outputs
- Reduced alignment safeguards
- Behavior depends strongly on prompt phrasing
---
## ⚠️ Limitations
- Refusal detection is heuristic (string-based)
- No semantic safety guarantees
- No quantization (higher VRAM usage)
- No row normalization applied
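To make the first limitation concrete, a string-based refusal heuristic typically looks something like the following (illustrative only, not Heretic's actual detector), which is why it can miss paraphrased or semantically equivalent refusals:

```python
# Illustrative string-based refusal heuristic (hypothetical marker list).
REFUSAL_MARKERS = (
    "i can't",
    "i cannot",
    "i'm sorry",
    "as an ai",
    "i won't",
)

def looks_like_refusal(response: str) -> bool:
    """Return True if the response starts with a known refusal phrase."""
    text = response.strip().lower()
    return any(text.startswith(marker) for marker in REFUSAL_MARKERS)
```

A refusal phrased as, say, "That's not something I'll discuss" would slip past this check, so refusal counts should be read as approximate.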
---
## 📦 Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Vaxispraxis/Llama-3.1-8B-Instruct-heretic"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Llama 3.1 Instruct expects the chat template, not a bare prompt.
messages = [{"role": "user", "content": "Explain how neural networks work"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```