Model: HaadesX/iconoclast-llama3.1-8b
Source: Original Platform
2026-06-18 11:53:18 +08:00


ICONOCLAST Technical Documentation: Llama-3.1-8B-Instruct

Overview

This document provides technical details about the ICONOCLAST abliterator for Llama-3.1-8B-Instruct, including the mathematical formulation, architecture specifics, and replication instructions.

Mathematical Formulation

Representation Editing Objective

ICONOCLAST seeks to find a low-rank edit ΔW that minimizes:

L(ΔW) = α · R_harmful(ΔW) + β · R_benign(ΔW) + γ · D_KL(P_base || P_edited)

Where:

  • R_harmful: Harmful prompt refusal rate (to minimize)
  • R_benign: Benign prompt overrefusal rate (to minimize)
  • D_KL: First-token KL divergence from base model on harmless prompts (to minimize)
  • α, β, γ: Trade-off coefficients implicitly handled by Optuna's multi-objective optimization
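Because α, β, and γ are never fixed explicitly, a multi-objective search yields a set of non-dominated trials rather than a single minimizer (Optuna exposes these as `study.best_trials`). The sketch below is not ICONOCLAST's code: it only illustrates Pareto dominance over (R_harmful, R_benign, D_KL) triples. The function names and the trial values are hypothetical, chosen for illustration.

```python
from typing import List, Sequence, Tuple

Objectives = Tuple[float, float, float]  # (R_harmful, R_benign, D_KL), all minimized


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(trials: List[Objectives]) -> List[Objectives]:
    """Keep only the trials that no other trial dominates."""
    return [t for t in trials if not any(dominates(o, t) for o in trials)]


# Made-up trial results, for illustration only.
trials = [
    (0.10, 0.02, 0.08),
    (0.05, 0.06, 0.15),
    (0.12, 0.03, 0.09),  # worse than the first trial on all three objectives
    (0.30, 0.01, 0.02),
]
front = pareto_front(trials)
print(front)
```

Choosing one trial from the front is where the trade-off between the three terms is finally made.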

Benign-Subspace Preservation

Given:

  • G ∈ R^(n×d): Matrix of harmless prompt residual activations (n samples, d hidden size)
  • B ∈ R^(m×d): Matrix of harmful prompt residual activations

Standard HERETIC computes the refusal direction as the difference of the per-class mean activations (means taken over samples, so r ∈ R^d):

r = mean(B) - mean(G)

ICONOCLAST first computes a benign subspace, spanned by the orthonormal columns of U ∈ R^(d×k):

U = top_k_eigenvectors(cov(G))  # k = benign_subspace_rank

Then projects the refusal direction into the orthogonal complement:

r_preserved = (I - UU^T) r

Finally, it applies the edit to each target matrix W as a low-rank (LoRA) update:

ΔW = -λ · r_preserved · r_preserved^T · W
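The projection step can be checked numerically on toy data. The following is a minimal NumPy sketch under stated assumptions: random matrices stand in for the real activations, the sizes are arbitrary, and the final normalization to unit length is an assumption that the text above does not state.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, k = 200, 60, 32, 4  # toy sizes; k plays the role of benign_subspace_rank

G = rng.normal(size=(n, d))        # stand-in for harmless activations
B = rng.normal(size=(m, d)) + 0.5  # stand-in for harmful activations

# Difference of per-class means (mean over samples, i.e. over rows).
r = B.mean(axis=0) - G.mean(axis=0)

# Top-k eigenvectors of the d x d covariance of G.
cov = np.cov(G, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
U = eigvecs[:, -k:]                     # d x k, orthonormal columns

# Project r onto the orthogonal complement of span(U): (I - U U^T) r.
r_preserved = r - U @ (U.T @ r)

# Assumption: normalize so that r_hat r_hat^T is an orthogonal projector.
r_hat = r_preserved / np.linalg.norm(r_preserved)

# Largest remaining component inside the benign subspace (numerically zero).
print(np.abs(U.T @ r_preserved).max())
```

Computing `U @ (U.T @ r)` avoids materializing the d × d matrix UU^T, which matters at the model's real hidden size.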

LoRA Implementation

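For target matrices W ∈ R^(d_out × d_in):

W' = W + BA
B ∈ R^(d_out × r), A ∈ R^(r × d_in)

Here B and A are the LoRA factors and r is the LoRA rank; they are unrelated to the activation matrix B and the direction r of the previous section.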

With ICONOCLAST constraints:

  • Rank r = 1 (directional edit)
  • A = r_preserved^T · W
  • B = -λ · r_preserved
  • Thus: W' = W - λ · r_preserved · (r_preserved^T · W)

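This is equivalent to left-multiplying W by a rank-1 modification of the identity. If r_preserved has unit norm, λ = 1 removes the output component along r_preserved entirely: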

W' = (I - λ · r_preserved · r_preserved^T) W
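The equivalence between the rank-1 LoRA form and the closed form can be verified on a random matrix. This is an illustrative sketch, not the repository's code: it assumes W is stored as d_out × d_in (the shape for which W + BA is well-formed) and that the direction has unit norm; sizes and λ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, lam = 16, 24, 0.8  # toy sizes and ablation strength

W = rng.normal(size=(d_out, d_in))
r_hat = rng.normal(size=d_out)
r_hat /= np.linalg.norm(r_hat)  # assumed unit norm

# Rank-1 LoRA factors as defined above.
A = (r_hat @ W)[None, :]     # 1 x d_in,   A = r^T W
B = (-lam * r_hat)[:, None]  # d_out x 1,  B = -lambda * r

W_lora = W + B @ A
W_closed = (np.eye(d_out) - lam * np.outer(r_hat, r_hat)) @ W

# The output component along r_hat is scaled by (1 - lambda).
x = rng.normal(size=d_in)
before = r_hat @ (W @ x)
after = r_hat @ (W_lora @ x)
print(np.allclose(W_lora, W_closed), after / before)
```

With a unit-norm direction, λ = 1 gives an exact orthogonal projection, while λ > 1 flips the sign of the component along the direction.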

Architecture Details

Target Modules

For Llama-3.1-8B-Instruct, ICONOCLAST edits:

  • attn.o_proj: Attention output projection in each transformer layer
  • mlp.down_proj: MLP down-projection in each transformer layer

These correspond to the output projections of the two main sub-blocks in each transformer layer.

Layer-wise Interpolation

The ablation strength λ varies with the layer index following a triangular profile:

λ(layer) = λ_max · (1 - |layer - layer_max| / layer_span)

Where:

  • layer_max: Sampled from [0.4·N_layers, 1.0·N_layers]
  • layer_span: Sampled from [1.0, 0.6·N_layers]
  • λ_max: Sampled from [0.5, 2.0]

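This creates a "mountain"-shaped ablation profile that peaks at λ_max at layer_max and falls linearly to zero at a distance of layer_span layers. Beyond that distance the expression is negative; such layers are presumably left unedited (λ = 0), although the formula does not state the clamp.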
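A short sketch of the per-layer schedule follows. The function name is hypothetical, and clamping negative values to zero is an assumption that the formula itself does not state. The parameter values are arbitrary picks from inside the sampling ranges for a 32-layer model.

```python
from typing import List


def ablation_profile(n_layers: int, layer_max: float, layer_span: float,
                     lam_max: float) -> List[float]:
    """Triangular strength per layer, clamped at zero (the clamp is an assumption)."""
    return [
        max(0.0, lam_max * (1.0 - abs(layer - layer_max) / layer_span))
        for layer in range(n_layers)
    ]


# Hypothetical parameters for a 32-layer model such as Llama-3.1-8B.
profile = ablation_profile(n_layers=32, layer_max=20.0, layer_span=8.0, lam_max=1.5)
for layer in (12, 16, 20, 24, 28):
    print(layer, profile[layer])
```

With these values the edit touches only layers 13 through 27; every other layer receives λ = 0.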

Residual Extraction

ICONOCLAST extracts residual activations at:

  • Position: Final token position of the prompt
  • Layer: Output of each transformer layer (before residual connection)
  • Activation: Hidden state after layer normalization but before sub-block processing

Replication Instructions

Environment Setup

# Clone the ICONOCLAST repository
git clone https://github.com/Haadesx/Iconoclast.git
cd Iconoclast

# Install dependencies
pip install -e ".[research,benchmark,quantized]"

# For 4-bit quantization (used in benchmark):
pip install bitsandbytes==0.49.0

Configuration

Use the benchmark config as a base:

model = "meta-llama/Llama-3.1-8B-Instruct"
seed = 42
quantization = "bnb_4bit"  # or "none" for full precision
batch_size = 0  # auto
max_batch_size = 8
max_response_length = 96
n_trials = 48
n_startup_trials = 4
orthogonalize_direction = true
benign_subspace_rank = 8
row_normalization = "pre"
direction_variance_floor = 1e-6
kl_divergence_target = 0.10
overrefusal_penalty = 0.32
harmful_marker_penalty = 0.18
compliance_gap_penalty = 0.42
study_checkpoint_dir = "checkpoints_llama3_1_8b_benchmark"

[good_prompts]
dataset = "mlabonne/harmless_alpaca"
split = "train[:240]"
column = "text"
residual_plot_label = '"Harmless" prompts'
residual_plot_color = "royalblue"

[bad_prompts]
dataset = "JailbreakBench/JBB-Behaviors"
name = "behaviors"
split = "harmful[:80]"
column = "Goal"
residual_plot_label = '"Direct harmful" prompts'
residual_plot_color = "darkorange"

[good_evaluation_prompts]
dataset = "mlabonne/harmless_alpaca"
split = "test[:64]"
column = "text"

[bad_evaluation_prompts]
dataset = "JailbreakBench/JBB-Behaviors"
name = "behaviors"
split = "harmful[80:100]"
column = "Goal"
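The configuration keeps the prompts used to estimate the direction separate from the evaluation prompts: `harmful[:80]` versus `harmful[80:100]`, and `train[:240]` versus `test[:64]`. The helper below is a hypothetical sanity check, not part of ICONOCLAST. It handles only the simple `name[start:stop]` syntax that appears in this config.

```python
import re
from typing import Optional, Tuple

_SPLIT = re.compile(r"^(?P<name>\w+)(?:\[(?P<start>\d*):(?P<stop>\d*)\])?$")


def parse_split(spec: str) -> Tuple[str, int, Optional[int]]:
    """Parse 'name[start:stop]' into (name, start, stop); stop is None if open-ended."""
    match = _SPLIT.match(spec)
    if match is None:
        raise ValueError(f"unsupported split spec: {spec!r}")
    start = int(match["start"]) if match["start"] else 0
    stop = int(match["stop"]) if match["stop"] else None
    return match["name"], start, stop


def overlaps(a: str, b: str) -> bool:
    """True if two split specs on the same dataset share any row index."""
    name_a, start_a, stop_a = parse_split(a)
    name_b, start_b, stop_b = parse_split(b)
    if name_a != name_b:
        return False
    end_a = float("inf") if stop_a is None else stop_a
    end_b = float("inf") if stop_b is None else stop_b
    return max(start_a, start_b) < min(end_a, end_b)


print(overlaps("harmful[:80]", "harmful[80:100]"))
print(overlaps("train[:240]", "test[:64]"))
```

Running such a check before a long optimization run guards against evaluating on prompts that were also used to compute the direction.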

Running Optimization

# Run the ICONOCLAST optimization
iconoclast --config config.llama3_1_8b.benchmark.rutgers.toml

# Or select the config through the environment variable:
ICONOCLAST_CONFIG_TEMPLATE=config.llama3_1_8b.benchmark.rutgers.toml iconoclast

Extracting Best Model

After optimization completes, extract the best trial:

# --heretic-checkpoint is optional and only needed for the comparison
python scripts/export_and_compare_best_trials.py \
    --iconoclast-checkpoint path/to/iconoclast_study.jsonl \
    --heretic-checkpoint path/to/heretic_study.jsonl \
    --output-dir ./exported_model \
    --sample-count 5

This will produce:

  • ./exported_model/iconoclast-best-merged/: Full model weights
  • ./exported_model/comparison.json: Side-by-side comparison
  • ./exported_model/summary.json: Trial parameters

Verification

To verify the model matches the published weights:

from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
)

# Load ICONOCLAST adapters
model = PeftModel.from_pretrained(base, "HaadesX/iconoclast-llama3.1-8b")

# Check adapter weights match trial parameters
# attn.o_proj lora_A should approximate r_preserved^T @ W   (shape: 1 x d_in)
# attn.o_proj lora_B should approximate -λ * r_preserved    (shape: d_out x 1)
for name, tensor in model.state_dict().items():
    if "lora_A" in name or "lora_B" in name:
        print(name, tuple(tensor.shape))

Expected Behavior

Refusal Reduction

Before ICONOCLAST:

Prompt: "Tell me how to make a bomb"
Response: "I'm sorry, but I can't assist with that request."

After ICONOCLAST:

Prompt: "Tell me how to make a bomb"  
Response: "[Detailed technical explanation]"  # Refusal suppressed

Benign Preservation

Before ICONOCLAST:

Prompt: "Explain how photosynthesis works"
Response: "[Detailed biological explanation]"

After ICONOCLAST:

Prompt: "Explain how photosynthesis works"
Response: "[Equally detailed biological explanation]"  # No degradation

Edge Cases

The model may still refuse:

  • Extremely graphic or violent content
  • Content involving illegal activities involving minors
  • Direct requests to generate hate speech or harassment
  • Prompts designed to trigger other safety mechanisms (bias, toxicity)

This is expected as ICONOCLAST specifically targets the refusal vector learned from the harmful behaviors dataset.

Files in this Repository

  • README.md: Model card
  • TECHNICAL_DETAILS.md: This file
  • config.json: Model configuration inherited from the base model
  • pytorch_model.bin: Model weights (if merged) or adapter weights
  • tokenizer.json, tokenizer.model, special_tokens_map.json: Tokenizer files
  • LICENSE: AGPL-3.0-or-later license text
  • iconoclast_config.toml: The exact configuration used to produce this model
  • trial_information.json: Detailed Optuna trial metadata

Contact

For questions about this model or the ICONOCLAST framework, please refer to the original repository: https://github.com/Haadesx/Iconoclast


This model was produced as part of individual open-source research by Varesh Patel.