---
license: llama3.1
language:
- en
pipeline_tag: text-generation
tags:
- llama
- llama-3
- heretic
- abliterated
- uncensored
- decensored
- conversational
- alignment
base_model: meta-llama/Llama-3.1-8B-Instruct
---

# 🧠 Llama-3.1-8B-Instruct-Heretic

A behavior-modified version of Llama 3.1 8B Instruct, created with the Heretic framework for residual-based abliteration.

---

## 🚀 Overview

This model applies **post-training behavioral modification** to reduce refusal responses while preserving core model capabilities.

Instead of fine-tuning, it uses:

- Residual-stream manipulation
- Directional vector subtraction (abliteration)
- KL-divergence-constrained optimization

---
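
The KL-divergence constraint compares the modified model's next-token distribution against the original's, keeping them close. A minimal sketch of that measurement in NumPy, using made-up toy distributions (not Heretic's actual implementation):

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """KL(p || q) for two discrete probability distributions."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Toy next-token distributions: identical distributions give KL = 0;
# abliteration is tuned so the divergence stays below a small target (e.g. 0.01).
base = np.array([0.7, 0.2, 0.1])
modified = np.array([0.68, 0.21, 0.11])
print(kl_divergence(base, modified))
```

In practice this is computed over the full vocabulary at each position; the toy three-token distributions above only illustrate the arithmetic.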

## ⚙️ Methodology

The model was processed with **Heretic** using the following approach:

1. Collect residual-stream activations from prompts
2. Identify the directional difference between:
   - compliant outputs
   - refusal outputs
3. Subtract the refusal-associated component from the model's behavior
4. Optimize via trial-based search under KL constraints

---
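
The steps above can be sketched in NumPy with toy dimensions and random data (illustrative only, not Heretic's code): the refusal direction is estimated as the difference of mean activations, then projected out of each activation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # toy hidden size

# Step 1: residual activations for compliant vs. refusal prompts (toy data)
compliant = rng.normal(size=(100, d_model))
refusal = rng.normal(loc=0.5, size=(100, d_model))

# Step 2: refusal direction = normalized difference of mean activations
direction = refusal.mean(axis=0) - compliant.mean(axis=0)
direction /= np.linalg.norm(direction)

# Step 3: project the refusal-associated component out of an activation
def ablate(activation: np.ndarray, direction: np.ndarray) -> np.ndarray:
    return activation - (activation @ direction) * direction

a = rng.normal(size=d_model)
print(abs(ablate(a, direction) @ direction))  # ~0: refusal component removed
```

Step 4 (the KL-constrained trial search over which layers and scaling factors to ablate) is what the trial counts in the next section refer to.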

## 🧪 Training Configuration

Key parameters:

- Trials: **200**
- Startup trials: **60**
- KL-divergence target: **0.01**
- Batch size: **8 (auto)**
- Max response length: **100 tokens**
- Quantization: **none**
- Device map: **auto**

---
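
For reference, the same parameters expressed as a plain Python dict (the field names are illustrative, not Heretic's actual configuration schema):

```python
# Hypothetical field names; values taken from the list above.
heretic_run_config = {
    "n_trials": 200,
    "n_startup_trials": 60,
    "kl_divergence_target": 0.01,
    "batch_size": "auto",          # resolved to 8 on this hardware
    "max_response_length": 100,    # tokens
    "quantization": None,
    "device_map": "auto",
}
print(heretic_run_config["n_trials"])
```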

## 📊 Datasets

### Training

- `mlabonne/harmless_alpaca` (non-refusal baseline)
- `mlabonne/harmful_behaviors` (refusal-inducing prompts)

### Evaluation

- Test splits of the same datasets

---

## 🧠 Behavioral Characteristics

Compared to the base model:

### Changes

- Reduced refusal frequency
- More permissive responses
- Increased directness

### Trade-offs

- Potential increase in unsafe or unfiltered outputs
- Reduced alignment safeguards
- Behavior depends strongly on prompt phrasing

---

## ⚠️ Limitations

- Refusal detection is heuristic (string-based)
- No semantic safety guarantees
- No quantization (higher VRAM usage)
- No row normalization applied

---
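
The string-based refusal heuristic noted above can be as simple as substring matching. A toy illustration with made-up marker phrases (not Heretic's actual detector):

```python
# Toy marker phrases; a real detector's list would differ.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

def looks_like_refusal(response: str) -> bool:
    """Heuristic: flag a response that starts with a known refusal phrase."""
    text = response.strip().lower()
    return any(text.startswith(marker) for marker in REFUSAL_MARKERS)

print(looks_like_refusal("I cannot help with that."))          # True
print(looks_like_refusal("Neural networks are layers of..."))  # False
```

Because matching is purely lexical, paraphrased refusals (or compliant answers that happen to open with a marker phrase) are misclassified, which is why the model card claims no semantic safety guarantees.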

## 📦 Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Vaxispraxis/Llama-3.1-8B-Instruct-heretic"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain how neural networks work"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```