---
license: llama3.1
language:
- en
pipeline_tag: text-generation
tags:
- llama
- llama-3
- heretic
- abliterated
- uncensored
- decensored
- conversational
- alignment
base_model: meta-llama/Llama-3.1-8B-Instruct
---

🧠 Llama-3.1-8B-Instruct-Heretic

A behavior-modified version of Llama 3.1 8B Instruct, created with the Heretic framework for residual-based abliteration.


🚀 Overview

This model applies post-training behavioral modification to reduce refusal responses while preserving core model capabilities.

Instead of fine-tuning, it uses:

  • Residual stream manipulation
  • Directional vector subtraction (abliteration)
  • KL-divergence constrained optimization
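The core operation behind directional vector subtraction can be sketched in a few lines. This is a minimal illustration, not Heretic's actual implementation: it assumes a precomputed unit "refusal direction" and simply projects that direction out of residual-stream activations (the function and variable names here are hypothetical).

```python
import numpy as np

def ablate_direction(hidden, direction):
    """Remove the component of each activation along the refusal direction."""
    d = direction / np.linalg.norm(direction)   # ensure unit length
    return hidden - np.outer(hidden @ d, d)     # subtract projection onto d

# Toy example in a 4-dimensional hidden space: the second
# coordinate plays the role of the refusal direction.
h = np.array([[1.0, 2.0, 3.0, 4.0]])
d = np.array([0.0, 1.0, 0.0, 0.0])
out = ablate_direction(h, d)  # component along d is zeroed
```

In a real abliteration pass this projection is applied to activations (or folded into weights) at many layers, with the strength and layer range chosen by the optimization described below.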

⚙️ Methodology

The model was processed using Heretic with the following approach:

  1. Collect residual activations from prompts
  2. Identify directional differences between:
    • compliant outputs
    • refusal outputs
  3. Subtract refusal-associated components from model behavior
  4. Optimize via trial-based search with KL constraints
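Step 2 above is commonly implemented as a difference of means: average the residual activations over compliant and refusal prompts separately, and take the normalized difference as the refusal direction. A toy sketch (the exact estimator Heretic uses may differ):

```python
import numpy as np

def refusal_direction(compliant_acts, refusal_acts):
    """Unit vector pointing from the mean compliant activation
    toward the mean refusal activation."""
    delta = np.mean(refusal_acts, axis=0) - np.mean(compliant_acts, axis=0)
    return delta / np.linalg.norm(delta)

# Toy 2-D activations: refusals are shifted along the first axis.
compliant = np.array([[0.0, 0.0], [0.0, 2.0]])
refusing = np.array([[2.0, 0.0], [2.0, 2.0]])
direction = refusal_direction(compliant, refusing)
```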

🧪 Training Configuration

Key parameters:

  • Trials: 200
  • Startup trials: 60
  • KL divergence target: 0.01
  • Batch size: 8 (auto)
  • Max response length: 100 tokens
  • Quantization: none
  • Device map: auto
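To illustrate how the trial and KL parameters above fit together, here is a schematic, stdlib-only search loop. It is a deliberately simplified stand-in (random search with a fake objective), not Heretic's actual optimizer, which is considerably more sophisticated.

```python
import random

KL_TARGET = 0.01   # maximum acceptable divergence from the base model
N_TRIALS = 200
N_STARTUP = 60     # broad random exploration before refining the best point

def evaluate(scale):
    """Hypothetical stand-in for running the ablated model: stronger
    ablation lowers refusals but drifts further (higher KL) from base."""
    return 0.02 * scale, max(0.0, 1.0 - scale)

random.seed(0)
best_scale, best_refusals = None, float("inf")
for trial in range(N_TRIALS):
    if trial < N_STARTUP or best_scale is None:
        scale = random.uniform(0.0, 1.0)  # startup phase: explore
    else:                                 # refine around the incumbent
        scale = min(1.0, max(0.0, best_scale + random.gauss(0.0, 0.05)))
    kl, refusals = evaluate(scale)
    if kl <= KL_TARGET and refusals < best_refusals:  # KL-constrained objective
        best_scale, best_refusals = scale, refusals
```

The key idea carried over from the real setup: trials that exceed the KL budget are rejected outright, so the search minimizes refusals only within the region where the model still behaves like the base model.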

📊 Datasets

Training

  • mlabonne/harmless_alpaca (non-refusal baseline)
  • mlabonne/harmful_behaviors (refusal-inducing prompts)

Evaluation

  • The same datasets, using their test splits

🧠 Behavioral Characteristics

Compared to the base model:

Changes

  • Reduced refusal frequency
  • More permissive responses
  • Increased directness

Trade-offs

  • Potential increase in unsafe or unfiltered outputs
  • Reduced alignment safeguards
  • Behavior depends strongly on prompt phrasing

⚠️ Limitations

  • Refusal detection is heuristic (string-based)
  • No semantic safety guarantees
  • No quantization (higher VRAM usage)
  • No row normalization applied
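The first limitation is worth making concrete. String-based refusal detection amounts to matching a response against a fixed list of opening phrases, roughly like the sketch below (the marker list here is illustrative; Heretic's actual list may differ):

```python
# Illustrative refusal markers only, not Heretic's actual list.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "as an ai")

def looks_like_refusal(response: str) -> bool:
    """Flag a response whose opening matches a known refusal phrase.
    Purely lexical: refusals phrased in novel ways slip through,
    and benign apologies can be falsely flagged."""
    return response.strip().lower().startswith(REFUSAL_MARKERS)

looks_like_refusal("I'm sorry, but I can't help with that.")  # True
looks_like_refusal("Sure, here is an overview of the topic.")  # False
```

This is why the card claims no semantic safety guarantees: the optimization target counts surface patterns, not meaning.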

📦 Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Vaxispraxis/Llama-3.1-8B-Instruct-heretic"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Llama 3.1 Instruct expects chat-formatted input.
messages = [{"role": "user", "content": "Explain how neural networks work"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
Description
Model synced from source: Vaxispraxis/Llama-3.1-8B-Instruct-heretic