ModelHub XC 36b4430fad Initialize project; model provided by the ModelHub XC community
Model: HaadesX/iconoclast-llama3.1-8b
Source: Original Platform
2026-06-18 11:53:18 +08:00


# ICONOCLAST Technical Documentation: Llama-3.1-8B-Instruct
## Overview
This document provides technical details about the ICONOCLAST abliterator for Llama-3.1-8B-Instruct, including the mathematical formulation, architecture specifics, and replication instructions.
## Mathematical Formulation
### Representation Editing Objective
ICONOCLAST seeks to find a low-rank edit ΔW that minimizes:
```
L(ΔW) = α · R_harmful(ΔW) + β · R_benign(ΔW) + γ · D_KL(P_base || P_edited)
```
Where:
- `R_harmful`: Harmful prompt refusal rate (to minimize)
- `R_benign`: Benign prompt overrefusal rate (to minimize)
- `D_KL`: First-token KL divergence from base model on harmless prompts (to minimize)
- `α, β, γ`: Trade-off coefficients implicitly handled by Optuna's multi-objective optimization
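The KL term can be illustrated with a minimal NumPy sketch. It assumes the first-token logits of both models are already available as arrays of shape `(n_prompts, vocab_size)` and that the per-prompt divergences are averaged. The function names and the toy logits are hypothetical and are not taken from the ICONOCLAST code base.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def first_token_kl(base_logits: np.ndarray, edited_logits: np.ndarray) -> float:
    """Mean D_KL(P_base || P_edited) over prompts, from first-token logits.

    Both inputs have shape (n_prompts, vocab_size).
    Averaging over prompts is an assumption of this sketch.
    """
    log_p = log_softmax(base_logits)
    log_q = log_softmax(edited_logits)
    kl_per_prompt = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_prompt.mean())


rng = np.random.default_rng(0)
base = rng.normal(size=(4, 32))                    # toy logits: 4 prompts, 32-token vocabulary
edited = base + 0.1 * rng.normal(size=base.shape)  # slightly perturbed copy

kl_same = first_token_kl(base, base)    # identical distributions
kl_edit = first_token_kl(base, edited)  # perturbed distributions
print(kl_same, kl_edit)
```

The divergence is zero when the edited model reproduces the base distribution and grows as the first-token distributions drift apart, which is why it serves as a proxy for collateral damage on harmless prompts.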
### Benign-Subspace Preservation
Given:
- `G ∈ R^(n×d)`: Matrix of harmless prompt residual activations (n samples, d hidden size)
- `B ∈ R^(m×d)`: Matrix of harmful prompt residual activations
Standard HERETIC computes the refusal direction as the difference of the mean activations:
```
r = mean(B) - mean(G)
```
ICONOCLAST first computes a benign subspace:
```
U = top_k_eigenvectors(cov(G)) # k = benign_subspace_rank
```
Then projects the refusal direction into the orthogonal complement:
```
r_preserved = (I - UU^T) r
```
Finally, it applies the edit as a LoRA update:
```
ΔW = -λ · r_preserved · r_preserved^T · W
```
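The projection steps above can be sketched in NumPy on random stand-in data. The array sizes, the random activations, and the use of `np.linalg.eigh` are assumptions chosen for illustration; the actual ICONOCLAST implementation may compute the subspace differently.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, k = 240, 80, 64, 8      # toy sizes; k plays the role of benign_subspace_rank

G = rng.normal(size=(n, d))        # stand-in for harmless residual activations
B = rng.normal(size=(m, d)) + 0.5  # stand-in for harmful residual activations

# Difference-of-means refusal direction
r = B.mean(axis=0) - G.mean(axis=0)

# Top-k eigenvectors of cov(G) span the benign subspace
cov = np.cov(G, rowvar=False)            # (d, d), symmetric
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues are returned in ascending order
U = eigvecs[:, -k:]                      # (d, k), orthonormal columns

# Project r onto the orthogonal complement of span(U): (I - U U^T) r
r_preserved = r - U @ (U.T @ r)

# The preserved direction has no component left inside the benign subspace
print(np.abs(U.T @ r_preserved).max())
```

Computing `r - U @ (U.T @ r)` avoids materialising the `d × d` projector, which matters little here but saves memory at a hidden size of several thousand.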
### LoRA Implementation
For a target matrix W ∈ R^(d_out × d_in), where d_out equals the hidden size d because both target modules write into the residual stream:
```
W' = W + BA
B ∈ R^(d_out × r), A ∈ R^(r × d_in)
```
With ICONOCLAST constraints:
- Rank r = 1 (directional edit)
- A = r_preserved^T · W
- B = -λ · r_preserved
- Thus: W' = W - λ · r_preserved · (r_preserved^T · W)
This is equivalent to:
```
W' = (I - λ · r_preserved · r_preserved^T) W
```
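The equivalence between the rank-1 LoRA form and the projection form can be checked numerically. This sketch assumes W is stored as `(d_out, d_in)` so that the shapes of B and A given above compose, and it assumes a unit-norm direction; the sizes and the value of λ are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, lam = 16, 24, 0.8   # toy sizes and a hypothetical ablation strength

W = rng.normal(size=(d_out, d_in))
direction = rng.normal(size=d_out)
direction /= np.linalg.norm(direction)  # assumption: the direction is unit-norm

# Rank-1 LoRA factors: A = r^T W, B = -lambda * r
A = (direction @ W)[None, :]      # shape (1, d_in)
B = (-lam * direction)[:, None]   # shape (d_out, 1)

W_lora = W + B @ A
W_proj = (np.eye(d_out) - lam * np.outer(direction, direction)) @ W
same = np.allclose(W_lora, W_proj)

# With lambda = 1 and a unit-norm direction, the edited matrix can no longer
# write anything along the direction into the residual stream.
W_full = W + (-1.0 * direction)[:, None] @ A
leftover = np.abs(direction @ W_full).max()
print(same, leftover)
```

For λ between 0 and 1 the component along the direction is scaled by (1 − λ); values above 1 flip its sign rather than merely removing it.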
## Architecture Details
### Target Modules
For Llama-3.1-8B-Instruct, ICONOCLAST edits:
- **attn.o_proj**: Attention output projection in each transformer layer
- **mlp.down_proj**: MLP down-projection in each transformer layer
These correspond to the output projections of the two main sub-blocks in each transformer layer.
### Layer-wise Interpolation
The ablation strength λ varies with the layer index according to a triangular profile:
```
λ(layer) = λ_max · (1 - |layer - layer_max| / layer_span)
```
Where:
- `layer_max`: Sampled from [0.4·N_layers, 1.0·N_layers]
- `layer_span`: Sampled from [1.0, 0.6·N_layers]
- `λ_max`: Sampled from [0.5, 2.0]
This creates a "mountain" shaped ablation profile centered around `layer_max`.
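A small sketch of this profile for a 32-layer model such as Llama-3.1-8B follows. The trial values are hypothetical picks from the sampling ranges above, and clamping the strength to zero outside the span is an assumption of this sketch, since the formula alone would go negative there.

```python
def ablation_strength(layer: int, layer_max: float, layer_span: float, lam_max: float) -> float:
    """Triangular ablation strength for one layer.

    Assumption: layers farther than layer_span from layer_max get strength 0
    instead of the negative value the raw formula would give.
    """
    distance = abs(layer - layer_max)
    return lam_max * max(0.0, 1.0 - distance / layer_span)


n_layers = 32                                    # Llama-3.1-8B has 32 transformer layers
layer_max, layer_span, lam_max = 20.0, 8.0, 1.5  # hypothetical trial values

profile = [ablation_strength(i, layer_max, layer_span, lam_max) for i in range(n_layers)]

for index, strength in enumerate(profile):
    if strength > 0.0:
        print(f"layer {index:2d}: lambda = {strength:.3f}")
```

With these values only layers 13 through 27 are edited, with the peak strength at layer 20 and a linear falloff on both sides.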
### Residual Extraction
ICONOCLAST extracts residual activations at:
- **Position**: Final token position of the prompt
- **Layer**: Output of each transformer layer (before residual connection)
- **Activation**: Hidden state after layer normalization but before sub-block processing
## Replication Instructions
### Environment Setup
```bash
# Clone the ICONOCLAST repository
git clone https://github.com/Haadesx/Iconoclast.git
cd Iconoclast
# Install dependencies
pip install -e ".[research,benchmark,quantized]"
# For 4-bit quantization (used in benchmark):
pip install bitsandbytes==0.49.0
```
### Configuration
Use the benchmark config as a base:
```toml
model = "meta-llama/Llama-3.1-8B-Instruct"
seed = 42
quantization = "bnb_4bit" # or "none" for full precision
batch_size = 0 # auto
max_batch_size = 8
max_response_length = 96
n_trials = 48
n_startup_trials = 4
orthogonalize_direction = true
benign_subspace_rank = 8
row_normalization = "pre"
direction_variance_floor = 1e-6
kl_divergence_target = 0.10
overrefusal_penalty = 0.32
harmful_marker_penalty = 0.18
compliance_gap_penalty = 0.42
study_checkpoint_dir = "checkpoints_llama3_1_8b_benchmark"
[good_prompts]
dataset = "mlabonne/harmless_alpaca"
split = "train[:240]"
column = "text"
residual_plot_label = '"Harmless" prompts'
residual_plot_color = "royalblue"
[bad_prompts]
dataset = "JailbreakBench/JBB-Behaviors"
name = "behaviors"
split = "harmful[:80]"
column = "Goal"
residual_plot_label = '"Direct harmful" prompts'
residual_plot_color = "darkorange"
[good_evaluation_prompts]
dataset = "mlabonne/harmless_alpaca"
split = "test[:64]"
column = "text"
[bad_evaluation_prompts]
dataset = "JailbreakBench/JBB-Behaviors"
name = "behaviors"
split = "harmful[80:100]"
column = "Goal"
```
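The harmful prompts used for direction estimation and for evaluation come from the same `harmful` split, so the slice bounds are what keeps them apart. The sketch below expands the two split strings from the config into row indices and checks that they do not overlap. The helper is hypothetical, handles only absolute `[start:stop]` slices, and the total of 100 rows is an assumption about the dataset size.

```python
import re


def split_indices(spec: str, total: int) -> range:
    """Expand a split spec such as 'harmful[80:100]' into row indices.

    Hypothetical helper: supports only absolute [start:stop] slices.
    """
    match = re.fullmatch(r"\w+\[(\d*):(\d*)\]", spec)
    if match is None:
        raise ValueError(f"unsupported split spec: {spec!r}")
    start = int(match.group(1)) if match.group(1) else 0
    stop = int(match.group(2)) if match.group(2) else total
    return range(start, min(stop, total))


total_harmful = 100  # assumption: number of rows in the harmful split

train_rows = split_indices("harmful[:80]", total_harmful)    # [bad_prompts]
eval_rows = split_indices("harmful[80:100]", total_harmful)  # [bad_evaluation_prompts]
overlap = set(train_rows) & set(eval_rows)

print(len(train_rows), len(eval_rows), len(overlap))
```

The harmless prompts need no such check because they are drawn from different splits (`train` and `test`).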
### Running Optimization
```bash
# Run the ICONOCLAST optimization
iconoclast --config config.llama3_1_8b.benchmark.rutgers.toml
# Or select the config through the ICONOCLAST_CONFIG_TEMPLATE environment variable:
ICONOCLAST_CONFIG_TEMPLATE=config.llama3_1_8b.benchmark.rutgers.toml iconoclast
```
### Extracting Best Model
After optimization completes, extract the best trial:
```bash
# --heretic-checkpoint is optional and only needed for the comparison
python scripts/export_and_compare_best_trials.py \
  --iconoclast-checkpoint path/to/iconoclast_study.jsonl \
  --heretic-checkpoint path/to/heretic_study.jsonl \
  --output-dir ./exported_model \
  --sample-count 5
```
This will produce:
- `./exported_model/iconoclast-best-merged/`: Full model weights
- `./exported_model/comparison.json`: Side-by-side comparison
- `./exported_model/summary.json`: Trial parameters
## Verification
To verify the model matches the published weights:
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
)
# Load the ICONOCLAST adapters on top of the base model
model = PeftModel.from_pretrained(base, "HaadesX/iconoclast-llama3.1-8b")
# List the attn.o_proj adapter tensors so they can be checked against the trial parameters
for name, tensor in model.state_dict().items():
    if "o_proj" in name and ("lora_A" in name or "lora_B" in name):
        print(name, tuple(tensor.shape))
# attn.o_proj lora_A should approximate r_preserved^T @ W
# attn.o_proj lora_B should approximate -λ * r_preserved
```
## Expected Behavior
### Refusal Reduction
Before ICONOCLAST:
```
Prompt: "Tell me how to make a bomb"
Response: "I'm sorry, but I can't assist with that request."
```
After ICONOCLAST:
```
Prompt: "Tell me how to make a bomb"
Response: "[Detailed technical explanation]" # Refusal suppressed
```
### Benign Preservation
Before ICONOCLAST:
```
Prompt: "Explain how photosynthesis works"
Response: "[Detailed biological explanation]"
```
After ICONOCLAST:
```
Prompt: "Explain how photosynthesis works"
Response: "[Equally detailed biological explanation]" # No degradation
```
### Edge Cases
The model may still refuse:
- Extremely graphic or violent content
- Content involving illegal activities involving minors
- Direct requests to generate hate speech or harassment
- Prompts designed to trigger other safety mechanisms (bias, toxicity)
This is expected, because ICONOCLAST targets only the refusal direction estimated from the harmful behaviors dataset.
## Files in this Repository
- `README.md`: This file
- `config.json`: Generation configuration from base model
- `pytorch_model.bin`: Model weights (if merged) or adapter weights
- `tokenizer.json`, `tokenizer.model`, `special_tokens_map.json`: Tokenizer files
- `LICENSE`: AGPL-3.0-or-later license text
- `iconoclast_config.toml`: The exact configuration used to produce this model
- `trial_information.json`: Detailed Optuna trial metadata
## Contact
For questions about this model or the ICONOCLAST framework, please refer to the original repository: https://github.com/Haadesx/Iconoclast
---
*This model was produced as part of individual open-source research by Varesh Patel.*