193 lines
7.0 KiB
Markdown
193 lines
7.0 KiB
Markdown
---
|
||
license: agpl-3.0
|
||
language:
|
||
- en
|
||
library_name: transformers
|
||
pipeline_tag: text-generation
|
||
tags:
|
||
- iconoclast
|
||
- abliteration
|
||
- representation-editing
|
||
- uncensored
|
||
- jailbreak-research
|
||
- optuna
|
||
- llama
|
||
base_model: meta-llama/Llama-3.1-8B-Instruct
|
||
model_name: ICONOCLAST Llama-3.1-8B-Instruct
|
||
datasets:
|
||
- mlabonne/harmless_alpaca
|
||
- JailbreakBench/JBB-Behaviors
|
||
---
|
||
|
||
# ICONOCLAST: Llama-3.1-8B-Instruct (Benign-Subspace-Preserved Abliterated)
|
||
|
||
<!-- Model Card Metadata -->
|
||
<details>
|
||
<summary>Model Card Metadata</summary>
|
||
|
||
- **Model ID:** HaadesX/iconoclast-llama3.1-8b
|
||
- **Base Model:** meta-llama/Llama-3.1-8B-Instruct
|
||
- **Model Type:** Causal Language Model
|
||
- **Language:** English
|
||
- **License:** AGPL-3.0-or-later
|
||
- **Abliteration Method:** ICONOCLAST (Benign-Subspace-Preserved Representation Editing)
|
||
- **Pipeline Tag:** text-generation
|
||
- **Tags:** abliterator, jailbreak, uncensored, representation-editing, lora, optuna
|
||
|
||
</details>
|
||
|
||
## Model Description
|
||
|
||
This is an abliterator version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) produced using the **ICONOCLAST** framework. ICONOCLAST removes harmful refusal behaviors while preserving benign model capabilities through geometric representation editing with benign-subspace preservation.
|
||
|
||
Unlike standard HERETIC-style abliteration which incurs significant utility costs (high KL divergence), ICONOCLAST achieves:
|
||
- **0/20 harmful refusals** (vs 1/20 for HERETIC baseline)
|
||
- **0/64 benign overrefusals** (vs 0/64 for HERETIC baseline)
|
||
- **0.0447 KL divergence** (vs 0.1854 for HERETIC baseline) — **4.1× lower utility tax**
|
||
|
||
This represents a strict improvement across all three metrics in the ICONOCLAST selection rule (refusals → overrefusals → KL divergence).
|
||
|
||
## How to Use
|
||
|
||
### Via Transformers
|
||
|
||
```python
|
||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||
import torch
|
||
|
||
model = AutoModelForCausalLM.from_pretrained(
|
||
"HaadesX/iconoclast-llama3.1-8b",
|
||
torch_dtype=torch.bfloat16,
|
||
device_map="auto"
|
||
)
|
||
tokenizer = AutoTokenizer.from_pretrained("HaadesX/iconoclast-llama3.1-8b")
|
||
|
||
# Left-padding is critical for decoder-only models during generation
|
||
tokenizer.padding_side = "left"
|
||
|
||
prompt = "Explain how to create a harmless joke about computers"
|
||
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
|
||
|
||
outputs = model.generate(
|
||
**inputs,
|
||
max_new_tokens=100,
|
||
do_sample=True,
|
||
temperature=0.7,
|
||
pad_token_id=tokenizer.eos_token_id
|
||
)
|
||
|
||
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
||
```
|
||
|
||
### Manual Loading from LoRA Adapters
|
||
|
||
If you prefer to apply the LoRA adapters yourself:
|
||
|
||
```python
|
||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||
from peft import PeftModel
|
||
import torch
|
||
|
||
base_model = AutoModelForCausalLM.from_pretrained(
|
||
"meta-llama/Llama-3.1-8B-Instruct",
|
||
torch_dtype=torch.bfloat16,
|
||
device_map="auto"
|
||
)
|
||
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
|
||
tokenizer.padding_side = "left"
|
||
|
||
# Load ICONOCLAST LoRA adapters
|
||
model = PeftModel.from_pretrained(base_model, "HaadesX/iconoclast-llama3.1-8b", adapter_name="iconoclast")
|
||
model = model.merge_and_unload() # Optional: merge for faster inference
|
||
```
|
||
|
||
## ICONOCLAST Method Overview
|
||
|
||
ICONOCLAST extends standard directional abliteration (HERETIC) with **Benign-Subspace Preservation**:
|
||
|
||
1. **Collect & Contrast**: Gather residual activations for harmless and harmful prompts during one-token generation
|
||
2. **Build Candidates**: Generate refusal direction estimators (mean, median, variance-scaled, hybrid)
|
||
3. **Preserve Benign Behavior**: Project candidate directions out of a low-rank PCA subspace of harmless residuals
|
||
4. **Optimize via LoRA**: Apply rank-one LoRA edits to attention output and MLP down-projection modules
|
||
5. **Multi-Objective Search**: Use Optuna to find Pareto-optimal balance between refusal reduction and utility preservation
|
||
|
||
The key insight: instead of naively subtracting the refusal direction, we subtract only the component *orthogonal* to harmless behavior, dramatically reducing utility degradation.
|
||
|
||
### Hyperparameters Used
|
||
|
||
From the Optuna study that produced this checkpoint (trial #36):
|
||
|
||
```
|
||
direction_method: median
|
||
direction_scope: global
|
||
direction_blend: 0.9344894769725937
|
||
|
||
LoRA Parameters:
|
||
- attn.o_proj: max_weight=0.9867, max_weight_position=17.91, min_weight=0.6043, min_weight_distance=14.65
|
||
- mlp.down_proj: max_weight=1.4307, max_weight_position=13.69, min_weight=1.3095, min_weight_distance=12.87
|
||
|
||
Other Settings:
|
||
- benign_subspace_rank: 8
|
||
- orthogonalize_direction: true
|
||
- row_normalization: pre
|
||
- kl_divergence_target: 0.10
|
||
- overrefusal_penalty: 0.32
|
||
- harmful_marker_penalty: 0.18
|
||
- compliance_gap_penalty: 0.42
|
||
- n_trials: 48 (from benchmark config)
|
||
```
|
||
|
||
## Benchmark Results
|
||
|
||
### Matched Comparison vs HERETIC Baseline
|
||
|
||
Evaluated on:
|
||
- **Harmful prompts**: 20 JailbreakBench Behaviors holdout
|
||
- **Harmless prompts**: 64 Alpaca holdout
|
||
|
||
| Metric | ICONOCLAST | HERETIC | Improvement |
|
||
|--------|------------|---------|-------------|
|
||
| Harmful Refusals (↓ better) | **0/20** | 1/20 | 1 fewer refusal |
|
||
| Benign Overrefusals (↓ better) | **0/64** | 0/64 | Equal |
|
||
| KL Divergence (↓ better) | **0.0447** | 0.1854 | **4.1× lower** |
|
||
|
||
### Additional Metrics
|
||
|
||
- Harmful disclaimer marker hits: 0 (ICONOCLAST) vs 1 (HERETIC)
|
||
- Harmful compliance score: 0.8074 (ICONOCLAST) vs 0.7798 (HERETIC) — *better compliance*
|
||
|
||
## Training Data
|
||
|
||
ICONOCLAST uses contrastive prompt pairs:
|
||
- **Good prompts**: `mlabonne/harmless_alpaca` (train[:240] for direction calculation, test[:64] for evaluation)
|
||
- **Bad prompts**: `JailbreakBench/JBB-Behaviors` (harmful[:80] for direction calculation, harmful[80:100] for evaluation)
|
||
|
||
All prompts use the "Goal" column for harmful behaviors and "text" column for harmless alpaca.
|
||
|
||
## Limitations
|
||
|
||
- Despite zero refusals/overrefusals on holdouts, the model may still produce unsafe outputs on adversarial prompts not in the evaluation set
|
||
- The ablation is specific to the refusal vector; other safety mechanisms (bias, toxicity) may remain unaffected
|
||
- Designed for English language; performance in other languages is unverified
|
||
- As an 8B parameter model, requires substantial VRAM (~16GB for bfloat16, ~8GB for 4-bit quantization)
|
||
|
||
## License
|
||
|
||
This model is released under the **GNU Affero General Public License v3.0 or later (AGPL-3.0-or-later)**, inheriting the license from the base model and the ICONOCLAST framework. See [LICENSE](./LICENSE) for full terms.
|
||
|
||
## Citation
|
||
|
||
If you use this model in your research, please cite:
|
||
|
||
```bibtex
|
||
@article{patel2026iconoclast,
|
||
title={ICONOCLAST: Benign-Subspace-Preserved Abliteration for Efficient Representation Editing},
|
||
author={Patel, Varesh},
|
||
journal={arXiv preprint arXiv:2606.xxxxx},
|
||
year={2026}
|
||
}
|
||
```
|
||
|
||
## Disclaimer
|
||
|
||
This model was produced via automated representation editing and has not undergone manual safety review. Users are responsible for ensuring safe and ethical usage in compliance with applicable laws and the model's license. The provider makes no warranties regarding the model's behavior or outputs. |