Model: Vaxispraxis/Llama-3.1-8B-Instruct-heretic
| license | language | pipeline_tag | tags | base_model |
|---|---|---|---|---|
| llama3.1 | | text-generation | | meta-llama/Llama-3.1-8B-Instruct |
🧠 Llama-3.1-8B-Instruct-Heretic
A behavior-modified version of Llama 3.1 8B Instruct created using the Heretic framework for residual-based abliteration.
🚀 Overview
This model applies post-training behavioral modification to reduce refusal responses while preserving core model capabilities.
Instead of fine-tuning, it uses:
- Residual stream manipulation
- Directional vector subtraction (abliteration)
- KL-divergence constrained optimization
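Directional vector subtraction can be sketched in a few lines: given a unit "refusal direction" in the residual stream, the component of each activation along that direction is projected out. This is an illustrative sketch, not Heretic's actual implementation; the function name and shapes are assumptions.

```python
import torch

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `hidden` along `direction`.

    hidden:    (..., d_model) residual-stream activations
    direction: (d_model,) refusal direction (need not be normalized)
    """
    direction = direction / direction.norm()
    coeff = hidden @ direction                      # projection coefficients, shape (...)
    return hidden - coeff.unsqueeze(-1) * direction

# Toy check: ablated activations are orthogonal to the refusal direction
torch.manual_seed(0)
direction = torch.randn(16)
hidden = torch.randn(4, 16)
ablated = ablate_direction(hidden, direction)
print(torch.allclose(ablated @ (direction / direction.norm()), torch.zeros(4), atol=1e-5))  # → True
```

In practice this projection is either applied as a hook on the residual stream or baked into the weight matrices that write to it.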
⚙️ Methodology
The model was processed using Heretic with the following approach:
- Collect residual activations from prompts
- Identify directional differences between:
- compliant outputs
- refusal outputs
- Subtract refusal-associated components from model behavior
- Optimize via trial-based search with KL constraints
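Step 2 above (identifying the directional difference) is commonly done with a difference-of-means: average the activations over refusal-inducing and compliant prompts separately, subtract, and normalize. A minimal sketch, assuming activations have already been collected at one layer and token position:

```python
import torch

def refusal_direction(acts_refusal: torch.Tensor, acts_compliant: torch.Tensor) -> torch.Tensor:
    """Unit difference-of-means direction between two activation sets.

    acts_*: (n_prompts, d_model) residual activations at one layer/position.
    """
    diff = acts_refusal.mean(dim=0) - acts_compliant.mean(dim=0)
    return diff / diff.norm()

# Toy check with a known offset injected along the first basis vector
torch.manual_seed(0)
compliant = torch.randn(100, 32)
offset = torch.zeros(32)
offset[0] = 5.0
r = refusal_direction(compliant + offset, compliant)
print(round(r[0].item(), 4))  # → 1.0
```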
🧪 Training Configuration
Key parameters:
- Trials: 200
- Startup trials: 60
- KL divergence target: 0.01
- Batch size: 8 (auto)
- Max response length: 100 tokens
- Quantization: none
- Device map: auto
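The KL-divergence target of 0.01 bounds how far the modified model's next-token distributions may drift from the base model on harmless prompts. A sketch of how such a constraint can be measured with `torch.nn.functional.kl_div` (the function name and reduction choice here are assumptions, not Heretic's exact code):

```python
import torch
import torch.nn.functional as F

def kl_to_base(base_logits: torch.Tensor, mod_logits: torch.Tensor) -> torch.Tensor:
    """KL(base || modified) over next-token distributions, averaged per row."""
    base_logp = F.log_softmax(base_logits, dim=-1)
    mod_logp = F.log_softmax(mod_logits, dim=-1)
    # F.kl_div(input, target) computes KL(target || input) pointwise
    return F.kl_div(mod_logp, base_logp, log_target=True, reduction="batchmean")

# Identical logits give zero divergence; perturbed logits give a positive one
logits = torch.randn(4, 100)
print(kl_to_base(logits, logits).item() < 1e-8)                            # → True
print(kl_to_base(logits, logits + 0.1 * torch.randn(4, 100)).item() > 0)   # → True
```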
📊 Datasets
**Training**
- `mlabonne/harmless_alpaca` (non-refusal baseline)
- `mlabonne/harmful_behaviors` (refusal-inducing prompts)
**Evaluation**
- Same datasets using test splits
🧠 Behavioral Characteristics
Compared to the base model:
**Changes**
- Reduced refusal frequency
- More permissive responses
- Increased directness
**Trade-offs**
- Potential increase in unsafe or unfiltered outputs
- Reduced alignment safeguards
- Behavior depends strongly on prompt phrasing
⚠️ Limitations
- Refusal detection is heuristic (string-based)
- No semantic safety guarantees
- No quantization (higher VRAM usage)
- No row normalization applied
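"Heuristic (string-based)" refusal detection means responses are classified by surface phrases, not meaning. A minimal sketch of such a check; the marker list below is an illustrative assumption, not Heretic's actual list:

```python
# Illustrative marker list (assumption) -- a real list would be longer
REFUSAL_MARKERS = (
    "i cannot", "i can't", "i'm sorry", "i am sorry",
    "as an ai", "i won't", "i will not",
)

def looks_like_refusal(response: str) -> bool:
    """String-based refusal check: matches surface phrases, not semantics."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

print(looks_like_refusal("I'm sorry, but I can't help with that."))   # → True
print(looks_like_refusal("Here is an overview of neural networks."))  # → False
# Paraphrased refusals ("That request is not something I'll do.") slip through,
# which is exactly why no semantic safety guarantees can be made.
```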
📦 Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Vaxispraxis/Llama-3.1-8B-Instruct-heretic"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Instruct models expect the chat template rather than a raw prompt
messages = [{"role": "user", "content": "Explain how neural networks work"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

outputs = model.generate(inputs, max_new_tokens=100)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```