---
license: llama3.1
language:
- en
pipeline_tag: text-generation
tags:
- llama
- llama-3
- heretic
- abliterated
- uncensored
- decensored
- conversational
- alignment
base_model: meta-llama/Llama-3.1-8B-Instruct
---

# 🧠 Llama-3.1-8B-Instruct-Heretic

A behavior-modified version of Llama 3.1 8B Instruct, created with the Heretic framework for residual-based abliteration.

---

## 🚀 Overview

This model applies **post-training behavioral modification** to reduce refusal responses while preserving core model capabilities. Instead of fine-tuning, it uses:

- Residual stream manipulation
- Directional vector subtraction (abliteration)
- KL-divergence-constrained optimization

---

## ⚙️ Methodology

The model was processed using **Heretic** with the following approach:

1. Collect residual activations from prompts
2. Identify directional differences between:
   - compliant outputs
   - refusal outputs
3. Subtract refusal-associated components from model behavior
4. Optimize via trial-based search with KL constraints

---

## 🧪 Training Configuration

Key parameters:

- Trials: **200**
- Startup trials: **60**
- KL divergence target: **0.01**
- Batch size: **8 (auto)**
- Max response length: **100 tokens**
- Quantization: **none**
- Device map: **auto**

---

## 📊 Datasets

### Training

- `mlabonne/harmless_alpaca` (non-refusal baseline)
- `mlabonne/harmful_behaviors` (refusal-inducing prompts)

### Evaluation

- The same datasets, using their test splits

---

## 🧠 Behavioral Characteristics

Compared to the base model:

### Changes

- Reduced refusal frequency
- More permissive responses
- Increased directness

### Trade-offs

- Potential increase in unsafe or unfiltered outputs
- Reduced alignment safeguards
- Behavior depends strongly on prompt phrasing

---

## ⚠️ Limitations

- Refusal detection is heuristic (string-based)
- No semantic safety guarantees
- No quantization (higher VRAM usage)
- No row normalization applied

---

## 📦 Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Vaxispraxis/Llama-3.1-8B-Instruct-heretic"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = "Explain how neural networks work"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
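The directional-subtraction step described under Methodology can be sketched on synthetic data. This is a toy illustration of abliteration, not the Heretic implementation: the array shapes, the Gaussian "activations", and the mean-difference direction are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy residual-stream activations: (n_prompts, d_model)
d_model = 64
harmless = rng.normal(0.0, 1.0, size=(100, d_model))
harmful = rng.normal(0.5, 1.0, size=(100, d_model))  # shifted mean stands in for a "refusal" component

# Steps 1-2: estimate the refusal direction as the normalized
# difference of mean activations between the two prompt sets
direction = harmful.mean(axis=0) - harmless.mean(axis=0)
direction /= np.linalg.norm(direction)

# Step 3: abliteration removes each row's component along that direction
def ablate(x: np.ndarray, d: np.ndarray) -> np.ndarray:
    return x - np.outer(x @ d, d)

ablated = ablate(harmful, direction)

# After ablation, activations have (near-)zero projection onto the direction
print(np.abs(ablated @ direction).max())
```

In the real procedure this subtraction is applied inside the model's forward pass (per layer), and the search under step 4 tunes where and how strongly it is applied.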
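The KL-divergence target in the training configuration (0.01) bounds how far the modified model's next-token distribution may drift from the base model's during the trial-based search. A minimal sketch of that measurement, using toy distributions rather than actual model outputs:

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """KL(p || q) for discrete probability distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

# Toy next-token distributions from a "base" and a "modified" model
base = np.array([0.70, 0.20, 0.08, 0.02])
modified = np.array([0.66, 0.23, 0.09, 0.02])

drift = kl_divergence(base, modified)
print(f"KL(base || modified) = {drift:.4f}")
# A trial whose drift exceeded the configured target would be penalized or rejected
```

Keeping this drift small is what lets the optimization remove refusal behavior while leaving the model's general output distribution largely intact.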
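The string-based refusal detection noted under Limitations can be pictured as a simple substring check. The marker list below is illustrative only; the phrases Heretic actually matches are not documented here.

```python
# Illustrative refusal markers (assumed, not Heretic's actual list)
REFUSAL_MARKERS = (
    "i cannot",
    "i can't",
    "i'm sorry",
    "as an ai",
    "i won't",
)

def looks_like_refusal(text: str) -> bool:
    """Return True if the response contains a known refusal phrase."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

print(looks_like_refusal("I'm sorry, but I can't help with that."))   # True
print(looks_like_refusal("Neural networks are layered function approximators."))  # False
```

Because matching is purely lexical, a response that refuses in unlisted wording is counted as compliant, and vice versa, which is why the model card claims no semantic safety guarantees.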