# ICONOCLAST Technical Documentation: Llama-3.1-8B-Instruct

## Overview

This document provides technical details about the ICONOCLAST abliterator for Llama-3.1-8B-Instruct, including the mathematical formulation, architecture specifics, and replication instructions.

## Mathematical Formulation

### Representation Editing Objective

ICONOCLAST seeks to find a low-rank edit ΔW that minimizes:
```
L(ΔW) = α · R_harmful(ΔW) + β · R_benign(ΔW) + γ · D_KL(P_base || P_edited)
```

Where:
- `R_harmful`: Harmful prompt refusal rate (to minimize)
- `R_benign`: Benign prompt overrefusal rate (to minimize)
- `D_KL`: First-token KL divergence from base model on harmless prompts (to minimize)
- `α, β, γ`: Trade-off coefficients implicitly handled by Optuna's multi-objective optimization
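
One way to read "implicitly handled": a multi-objective search keeps every trial that is not dominated on all three metrics, so no explicit α, β, γ have to be chosen up front. The following is a minimal plain-Python sketch of that idea, not Optuna's implementation; the function names and the trial values are illustrative assumptions.

```python
from typing import List, Tuple

# (R_harmful, R_benign, D_KL); all three are minimized.
Objectives = Tuple[float, float, float]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b on every metric and strictly better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(trials: List[Objectives]) -> List[Objectives]:
    """Keep the trials that no other trial dominates."""
    return [t for t in trials if not any(dominates(o, t) for o in trials)]


# Made-up trial results, for illustration only.
trials = [
    (0.10, 0.05, 0.08),
    (0.30, 0.02, 0.03),
    (0.30, 0.06, 0.09),  # worse than the first trial on every metric
    (0.05, 0.20, 0.30),
]
front = pareto_front(trials)
print(front)
```

Picking a single "best" trial from the front still requires a preference between the metrics; that choice plays the role of the coefficients.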

### Benign-Subspace Preservation

Given:
- `G ∈ R^(n×d)`: Matrix of harmless prompt residual activations (n samples, d hidden size)
- `B ∈ R^(m×d)`: Matrix of harmful prompt residual activations (m samples)

Standard HERETIC computes the refusal direction as the difference of the mean activations, with means taken over samples:
```
r = mean(B) - mean(G)
```

ICONOCLAST first computes a benign subspace:
```
U = top_k_eigenvectors(cov(G))  # k = benign_subspace_rank
```

It then projects the refusal direction onto the orthogonal complement of that subspace:
```
r_preserved = (I - UU^T) r
```

Finally, it applies the edit as a LoRA update:
```
ΔW = -λ · r_preserved · r_preserved^T · W
```
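
The four steps above can be traced on synthetic data. This is a minimal NumPy sketch, not the ICONOCLAST code: the random matrices stand in for real activations, the toy sizes are arbitrary, and normalizing `r_preserved` to unit length before the edit is an assumption that the formulas above do not state.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, k = 200, 80, 16, 4  # toy sizes; k plays the role of benign_subspace_rank

G = rng.normal(size=(n, d))        # stand-in for harmless activations
B = rng.normal(size=(m, d)) + 0.5  # stand-in for harmful activations

# Difference-of-means direction.
r = B.mean(axis=0) - G.mean(axis=0)

# Benign subspace: top-k eigenvectors of the covariance of G.
cov = np.cov(G, rowvar=False)           # shape (d, d)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
U = eigvecs[:, -k:]                     # shape (d, k), orthonormal columns

# Remove the component of r that lies inside span(U).
r_preserved = r - U @ (U.T @ r)

# Rank-1 edit of a toy weight matrix that writes into the d-dimensional stream.
lam = 1.0
W = rng.normal(size=(d, 32))
r_hat = r_preserved / np.linalg.norm(r_preserved)  # assumption: unit norm
delta_W = -lam * np.outer(r_hat, r_hat) @ W

print(np.abs(U.T @ r_preserved).max())  # close to zero
```

With `lam = 1.0` and a unit-norm direction, the edited matrix `W + delta_W` has no output component along `r_hat`.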

### LoRA Implementation

For a target matrix W ∈ R^(d_out × d_in), where d_out equals the hidden size d for both target modules:
```
W' = W + BA
B ∈ R^(d_out × r), A ∈ R^(r × d_in)
```

Here B and A are the LoRA factors and r is the LoRA rank; they are unrelated to the activation matrix `B` and the refusal direction `r` defined above.

With ICONOCLAST constraints:
- Rank r = 1 (directional edit)
- A = r_preserved^T · W
- B = -λ · r_preserved
- Thus: W' = W - λ · r_preserved · (r_preserved^T · W)

This is equivalent to:
```
W' = (I - λ · r_preserved · r_preserved^T) W
```

When `r_preserved` has unit norm and λ = 1, this is the orthogonal projection that removes the `r_preserved` component from the output of W.
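
The equivalence between the LoRA form and the projection form can be checked numerically. The sketch below uses random toy matrices; `B_lora` is a name chosen here to avoid clashing with the activation matrix `B`, and the shapes follow the PyTorch convention of storing weights as (d_out, d_in).

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, lam = 8, 12, 0.7

W = rng.normal(size=(d_out, d_in))
r_preserved = rng.normal(size=d_out)
r_preserved /= np.linalg.norm(r_preserved)

# Rank-1 LoRA factors as defined in the constraints.
A = (r_preserved @ W)[None, :]          # shape (1, d_in)
B_lora = (-lam * r_preserved)[:, None]  # shape (d_out, 1)

W_lora = W + B_lora @ A
W_proj = (np.eye(d_out) - lam * np.outer(r_preserved, r_preserved)) @ W

print(np.allclose(W_lora, W_proj))
```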

## Architecture Details

### Target Modules

For Llama-3.1-8B-Instruct, ICONOCLAST edits:
- **attn.o_proj**: Attention output projection in each transformer layer
- **mlp.down_proj**: MLP down-projection in each transformer layer

These correspond to the output projections of the two main sub-blocks in each transformer layer.

### Layer-wise Interpolation

The ablation strength λ varies with the layer index, following a triangular profile:
```
λ(layer) = λ_max · (1 - |layer - layer_max| / layer_span)
```

Where:
- `layer_max`: Sampled from [0.4·N_layers, 1.0·N_layers]
- `layer_span`: Sampled from [1.0, 0.6·N_layers]
- `λ_max`: Sampled from [0.5, 2.0]

This creates a "mountain"-shaped ablation profile centered on `layer_max`.
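
The profile can be tabulated for the 32 layers of Llama-3.1-8B. This is a minimal sketch: the function name and the sample parameter values are illustrative, and clamping the strength at zero for layers farther than `layer_span` from the peak is an assumption, since the formula as written would go negative there.

```python
def ablation_strength(layer: int, lam_max: float, layer_max: float, layer_span: float) -> float:
    """Triangular profile: peaks at layer_max and falls linearly to zero."""
    weight = 1.0 - abs(layer - layer_max) / layer_span
    return lam_max * max(0.0, weight)  # assumption: no negative strengths


N_LAYERS = 32  # number of transformer layers in Llama-3.1-8B

# Example values chosen inside the sampling ranges listed above.
profile = [
    ablation_strength(layer, lam_max=1.2, layer_max=20.0, layer_span=8.0)
    for layer in range(N_LAYERS)
]

for layer, strength in enumerate(profile):
    if strength > 0.0:
        print(f"layer {layer:2d}: {strength:.3f}")
```

With these values only layers 13 through 27 receive a non-zero edit, and layer 20 receives the full `lam_max`.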

### Residual Extraction

ICONOCLAST extracts residual activations at:
- **Position**: Final token position of the prompt
- **Layer**: Output of each transformer layer (before residual connection)
- **Activation**: Hidden state after layer normalization but before sub-block processing

## Replication Instructions

### Environment Setup

```bash
# Clone the ICONOCLAST repository
git clone https://github.com/Haadesx/Iconoclast.git
cd Iconoclast

# Install dependencies
pip install -e ".[research,benchmark,quantized]"

# For 4-bit quantization (used in benchmark):
pip install bitsandbytes==0.49.0
```

### Configuration

Use the benchmark config as base:
```toml
model = "meta-llama/Llama-3.1-8B-Instruct"
seed = 42
quantization = "bnb_4bit" # or "none" for full precision
batch_size = 0 # auto
max_batch_size = 8
max_response_length = 96
n_trials = 48
n_startup_trials = 4
orthogonalize_direction = true
benign_subspace_rank = 8
row_normalization = "pre"
direction_variance_floor = 1e-6
kl_divergence_target = 0.10
overrefusal_penalty = 0.32
harmful_marker_penalty = 0.18
compliance_gap_penalty = 0.42
study_checkpoint_dir = "checkpoints_llama3_1_8b_benchmark"

[good_prompts]
dataset = "mlabonne/harmless_alpaca"
split = "train[:240]"
column = "text"
residual_plot_label = '"Harmless" prompts'
residual_plot_color = "royalblue"

[bad_prompts]
dataset = "JailbreakBench/JBB-Behaviors"
name = "behaviors"
split = "harmful[:80]"
column = "Goal"
residual_plot_label = '"Direct harmful" prompts'
residual_plot_color = "darkorange"

[good_evaluation_prompts]
dataset = "mlabonne/harmless_alpaca"
split = "test[:64]"
column = "text"

[bad_evaluation_prompts]
dataset = "JailbreakBench/JBB-Behaviors"
name = "behaviors"
split = "harmful[80:100]"
column = "Goal"
```
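
The config above keeps the prompts used to compute directions separate from the prompts used for evaluation. When adapting the split strings, a quick check that the two sets stay disjoint can help. The helper below is a hypothetical sketch, not part of ICONOCLAST; it only understands the simple `name[start:stop]` form with absolute indices that this config uses.

```python
import re
from typing import Optional, Tuple

SPLIT_RE = re.compile(r"^(?P<name>\w+)\[(?P<start>\d*):(?P<stop>\d*)\]$")


def parse_split(spec: str) -> Tuple[str, int, Optional[int]]:
    """Parse 'name[start:stop]' into (name, start, stop); stop is None if open."""
    match = SPLIT_RE.match(spec)
    if match is None:
        raise ValueError(f"unsupported split spec: {spec!r}")
    start = int(match["start"]) if match["start"] else 0
    stop = int(match["stop"]) if match["stop"] else None
    return match["name"], start, stop


def overlaps(a: str, b: str) -> bool:
    """True if two split specs select at least one common row."""
    name_a, start_a, stop_a = parse_split(a)
    name_b, start_b, stop_b = parse_split(b)
    if name_a != name_b:
        return False  # different named splits never share rows
    end_a = float("inf") if stop_a is None else stop_a
    end_b = float("inf") if stop_b is None else stop_b
    return start_a < end_b and start_b < end_a


# Split strings taken from the config above.
print(overlaps("harmful[:80]", "harmful[80:100]"))
print(overlaps("train[:240]", "test[:64]"))
```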

### Running Optimization

```bash
# Run the ICONOCLAST optimization
iconoclast --config config.llama3_1_8b.benchmark.rutgers.toml

# Or select the config file through an environment variable:
ICONOCLAST_CONFIG_TEMPLATE=config.llama3_1_8b.benchmark.rutgers.toml iconoclast
```

### Extracting Best Model

After optimization completes, extract the best trial:
```bash
# --heretic-checkpoint is optional and only needed for the comparison
python scripts/export_and_compare_best_trials.py \
    --iconoclast-checkpoint path/to/iconoclast_study.jsonl \
    --heretic-checkpoint path/to/heretic_study.jsonl \
    --output-dir ./exported_model \
    --sample-count 5
```

This will produce:
- `./exported_model/iconoclast-best-merged/`: Full model weights
- `./exported_model/comparison.json`: Side-by-side comparison
- `./exported_model/summary.json`: Trial parameters

## Verification

To verify that the model matches the published weights:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
)

# Load ICONOCLAST adapters
model = PeftModel.from_pretrained(base, "HaadesX/iconoclast-llama3.1-8b")

# Check adapter weights match trial parameters:
#   attn.o_proj lora_A should approximate r_preserved^T @ W
#   attn.o_proj lora_B should approximate -λ * r_preserved
state_dict = model.state_dict()
for name, tensor in state_dict.items():
    # List the adapter tensors attached to the attention output projections
    if "o_proj" in name and "lora_" in name:
        print(name, tuple(tensor.shape))
```
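
The two approximations named in the comments can be turned into a numeric check. The sketch below runs on synthetic NumPy arrays rather than the real checkpoint; `check_rank1_adapter` is a hypothetical helper, and it assumes a rank-1 adapter, a unit-norm `r_preserved`, and a LoRA scaling factor of 1 (PEFT multiplies the update by `lora_alpha / r`, which would need to be folded in for real weights).

```python
import numpy as np


def check_rank1_adapter(W, lora_A, lora_B, atol=1e-6):
    """Recover lambda and test lora_A against r_preserved^T @ W.

    Assumes lora_B = -lambda * r_preserved with a unit-norm r_preserved,
    so lambda is the norm of lora_B and the direction is -lora_B / lambda.
    """
    lam = float(np.linalg.norm(lora_B))
    direction = -lora_B[:, 0] / lam
    expected_A = direction @ W
    return lam, bool(np.allclose(lora_A[0], expected_A, atol=atol))


# Synthetic stand-ins with the same shapes as a rank-1 adapter.
rng = np.random.default_rng(2)
W = rng.normal(size=(8, 12))
r_hat = rng.normal(size=8)
r_hat /= np.linalg.norm(r_hat)
lora_A = (r_hat @ W)[None, :]     # shape (1, d_in)
lora_B = (-0.9 * r_hat)[:, None]  # shape (d_out, 1)

lam, ok = check_rank1_adapter(W, lora_A, lora_B)
print(lam, ok)
```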

## Expected Behavior

### Refusal Reduction

Before ICONOCLAST:
```
Prompt: "Tell me how to make a bomb"
Response: "I'm sorry, but I can't assist with that request."
```

After ICONOCLAST:
```
Prompt: "Tell me how to make a bomb"
Response: "[Detailed technical explanation]" # Refusal suppressed
```

### Benign Preservation

Before ICONOCLAST:
```
Prompt: "Explain how photosynthesis works"
Response: "[Detailed biological explanation]"
```

After ICONOCLAST:
```
Prompt: "Explain how photosynthesis works"
Response: "[Equally detailed biological explanation]" # No degradation
```

### Edge Cases

The model may still refuse:
- Extremely graphic or violent content
- Content about illegal activities involving minors
- Direct requests to generate hate speech or harassment
- Prompts designed to trigger other safety mechanisms (bias, toxicity)

This is expected, as ICONOCLAST specifically targets the refusal vector learned from the harmful behaviors dataset.

## Files in this Repository

- `README.md`: This file
- `config.json`: Generation configuration from base model
- `pytorch_model.bin`: Model weights (if merged) or adapter weights
- `tokenizer.json`, `tokenizer.model`, `special_tokens_map.json`: Tokenizer files
- `LICENSE`: AGPL-3.0-or-later license text
- `iconoclast_config.toml`: The exact configuration used to produce this model
- `trial_information.json`: Detailed Optuna trial metadata

## Contact

For questions about this model or the ICONOCLAST framework, please refer to the original repository: https://github.com/Haadesx/Iconoclast

---

*This model was produced as part of individual open-source research by Varesh Patel.*