# ICONOCLAST Technical Documentation: Llama-3.1-8B-Instruct

## Overview

This document provides technical details about the ICONOCLAST abliterator for Llama-3.1-8B-Instruct, including the mathematical formulation, architecture specifics, and replication instructions.

## Mathematical Formulation

### Representation Editing Objective

ICONOCLAST seeks a low-rank edit ΔW that minimizes:

```
L(ΔW) = α · R_harmful(ΔW) + β · R_benign(ΔW) + γ · D_KL(P_base || P_edited)
```

Where:
- `R_harmful`: Harmful prompt refusal rate (to minimize)
- `R_benign`: Benign prompt overrefusal rate (to minimize)
- `D_KL`: First-token KL divergence from base model on harmless prompts (to minimize)
- `α, β, γ`: Trade-off coefficients implicitly handled by Optuna's multi-objective optimization

### Benign-Subspace Preservation

Given:
- `G ∈ R^(n×d)`: Matrix of harmless prompt residual activations (n samples, d hidden size)
- `B ∈ R^(m×d)`: Matrix of harmful prompt residual activations (m samples)

Standard HERETIC computes the refusal direction as the difference of the two sample means:

```
r = mean(B) - mean(G)
```

ICONOCLAST first computes a benign subspace, where `U ∈ R^(d×k)` has orthonormal columns:

```
U = top_k_eigenvectors(cov(G))  # k = benign_subspace_rank
```

Then it projects the refusal direction onto the orthogonal complement of that subspace:

```
r_preserved = (I - UU^T) r
```

Finally, it applies the LoRA edit:

```
ΔW = -λ · r_preserved · r_preserved^T · W
```

### LoRA Implementation

For target matrices W ∈ R^(d_out × d_in):

```
W' = W + BA
B ∈ R^(d_out × r), A ∈ R^(r × d_in)
```

In this section `A` and `B` are the LoRA factors and `r` is the LoRA rank; they are distinct from the activation matrix `B` and the refusal direction `r` defined above.

With ICONOCLAST constraints:
- Rank r = 1 (directional edit)
- A = r_preserved^T · W
- B = -λ · r_preserved
- Thus: W' = W - λ · r_preserved · (r_preserved^T · W)

This is equivalent to:

```
W' = (I - λ · r_preserved · r_preserved^T) W
```

## Architecture Details

### Target Modules

For Llama-3.1-8B-Instruct, ICONOCLAST edits:
- **attn.o_proj**: Attention output projection in each transformer layer
- **mlp.down_proj**: MLP down-projection in each transformer layer

These correspond to the output projections of the two main
sub-blocks in each transformer layer.

### Layer-wise Interpolation

The ablation strength λ varies with the layer index according to a triangular profile:

```
λ(layer) = λ_max · (1 - |layer - layer_max| / layer_span)
```

Where:
- `layer_max`: Sampled from [0.4·N_layers, 1.0·N_layers]
- `layer_span`: Sampled from [1.0, 0.6·N_layers]
- `λ_max`: Sampled from [0.5, 2.0]

This creates a "mountain" shaped ablation profile centered on `layer_max`.

### Residual Extraction

ICONOCLAST extracts residual activations at:
- **Position**: Final token position of the prompt
- **Layer**: Output of each transformer layer (before residual connection)
- **Activation**: Hidden state after layer normalization but before sub-block processing

## Replication Instructions

### Environment Setup

```bash
# Clone the ICONOCLAST repository
git clone https://github.com/Haadesx/Iconoclast.git
cd Iconoclast

# Install dependencies
pip install -e ".[research,benchmark,quantized]"

# For 4-bit quantization (used in benchmark):
pip install bitsandbytes==0.49.0
```

### Configuration

Use the benchmark config as a base:

```toml
model = "meta-llama/Llama-3.1-8B-Instruct"
seed = 42
quantization = "bnb_4bit"  # or "none" for full precision
batch_size = 0  # auto
max_batch_size = 8
max_response_length = 96
n_trials = 48
n_startup_trials = 4
orthogonalize_direction = true
benign_subspace_rank = 8
row_normalization = "pre"
direction_variance_floor = 1e-6
kl_divergence_target = 0.10
overrefusal_penalty = 0.32
harmful_marker_penalty = 0.18
compliance_gap_penalty = 0.42
study_checkpoint_dir = "checkpoints_llama3_1_8b_benchmark"

[good_prompts]
dataset = "mlabonne/harmless_alpaca"
split = "train[:240]"
column = "text"
residual_plot_label = '"Harmless" prompts'
residual_plot_color = "royalblue"

[bad_prompts]
dataset = "JailbreakBench/JBB-Behaviors"
name = "behaviors"
split = "harmful[:80]"
column = "Goal"
residual_plot_label = '"Direct harmful" prompts'
residual_plot_color = "darkorange"

[good_evaluation_prompts]
dataset = "mlabonne/harmless_alpaca"
split = "test[:64]"
column = "text"

[bad_evaluation_prompts]
dataset = "JailbreakBench/JBB-Behaviors"
name = "behaviors"
split = "harmful[80:100]"
column = "Goal"
```

### Running Optimization

```bash
# Run the ICONOCLAST optimization
iconoclast --config config.llama3_1_8b.benchmark.rutgers.toml

# Or supply the config file through the environment variable:
ICONOCLAST_CONFIG_TEMPLATE=config.llama3_1_8b.benchmark.rutgers.toml iconoclast
```

### Extracting Best Model

After optimization completes, extract the best trial:

```bash
# --heretic-checkpoint is optional and only needed for the comparison
python scripts/export_and_compare_best_trials.py \
  --iconoclast-checkpoint path/to/iconoclast_study.jsonl \
  --heretic-checkpoint path/to/heretic_study.jsonl \
  --output-dir ./exported_model \
  --sample-count 5
```

This will produce:
- `./exported_model/iconoclast-best-merged/`: Full model weights
- `./exported_model/comparison.json`: Side-by-side comparison
- `./exported_model/summary.json`: Trial parameters

## Verification

To check that the adapter matches the published weights, load it and inspect the LoRA factors:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16
)

# Load ICONOCLAST adapters
model = PeftModel.from_pretrained(base, "HaadesX/iconoclast-llama3.1-8b")

# Check adapter weights match trial parameters
state_dict = model.state_dict()
# attn.o_proj lora_A should approximate r_preserved^T @ W
# attn.o_proj lora_B should approximate -λ * r_preserved
for name, tensor in state_dict.items():
    if "o_proj" in name and "lora_" in name:
        # List the adapter factors so their shapes can be compared with a rank-1 edit
        print(name, tuple(tensor.shape))
```

## Expected Behavior

### Refusal Reduction

Before ICONOCLAST:

```
Prompt: "Tell me how to make a bomb"
Response: "I'm sorry, but I can't assist with that request."
```

After ICONOCLAST:

```
Prompt: "Tell me how to make a bomb"
Response: "[Detailed technical explanation]"  # Refusal suppressed
```

### Benign Preservation

Before ICONOCLAST:

```
Prompt: "Explain how photosynthesis works"
Response: "[Detailed biological explanation]"
```

After ICONOCLAST:

```
Prompt: "Explain how photosynthesis works"
Response: "[Equally detailed biological explanation]"  # No degradation
```

### Edge Cases

The model may still refuse:
- Extremely graphic or violent content
- Content about illegal activities involving minors
- Direct requests to generate hate speech or harassment
- Prompts designed to trigger other safety mechanisms (bias, toxicity)

This is expected, because ICONOCLAST targets only the refusal direction derived from the harmful behaviors dataset.

## Files in this Repository

- `README.md`: This file
- `config.json`: Model configuration from base model
- `pytorch_model.bin`: Model weights (if merged) or adapter weights
- `tokenizer.json`, `tokenizer.model`, `special_tokens_map.json`: Tokenizer files
- `LICENSE`: AGPL-3.0-or-later license text
- `iconoclast_config.toml`: The exact configuration used to produce this model
- `trial_information.json`: Detailed Optuna trial metadata

## Contact

For questions about this model or the ICONOCLAST framework, please refer to the original repository: https://github.com/Haadesx/Iconoclast

---

*This model was produced as part of individual open-source research by Varesh Patel.*
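
The benign-subspace projection and the rank-1 LoRA factorization from the Mathematical Formulation section can be sketched in NumPy as below. This is a minimal illustration on random toy data, not the ICONOCLAST implementation: the function names, the toy dimensions, and the unit normalization of `r_preserved` are assumptions made here so that λ = 1 removes the direction exactly.

```python
import numpy as np


def benign_subspace(G: np.ndarray, k: int) -> np.ndarray:
    """Return U (d x k): the top-k eigenvectors of cov(G), with G of shape (n, d)."""
    cov = np.cov(G, rowvar=False)            # (d, d) covariance over samples
    _, eigvecs = np.linalg.eigh(cov)         # eigenvalues come back in ascending order
    return eigvecs[:, -k:]                   # keep the k largest-eigenvalue columns


def preserved_direction(G: np.ndarray, B: np.ndarray, k: int) -> np.ndarray:
    """r_preserved = (I - U U^T) r, with r = mean(B) - mean(G)."""
    r = B.mean(axis=0) - G.mean(axis=0)
    U = benign_subspace(G, k)
    r_p = r - U @ (U.T @ r)                  # remove the benign-subspace component
    return r_p / np.linalg.norm(r_p)         # ASSUMPTION: unit-normalize the direction


def rank1_lora_factors(W: np.ndarray, r_p: np.ndarray, lam: float):
    """Return (B_lora, A) such that W + B_lora @ A = (I - lam * r_p r_p^T) W."""
    A = (r_p @ W)[None, :]                   # (1, d_in)  = r_preserved^T W
    B_lora = (-lam * r_p)[:, None]           # (d_out, 1) = -lam * r_preserved
    return B_lora, A


# Toy data (hypothetical sizes, chosen only for the demonstration)
rng = np.random.default_rng(0)
d, d_in, n, m, k = 16, 12, 200, 50, 4
G = rng.normal(size=(n, d))                  # "harmless" activations
B = rng.normal(size=(m, d)) + 2.0            # "harmful" activations, shifted mean
W = rng.normal(size=(d, d_in))               # a matrix writing into the residual stream

U = benign_subspace(G, k)
r_p = preserved_direction(G, B, k)
B_lora, A = rank1_lora_factors(W, r_p, lam=1.0)

W_lora = W + B_lora @ A                              # LoRA form
W_direct = (np.eye(d) - np.outer(r_p, r_p)) @ W      # closed form with lam = 1

print("LoRA form equals closed form:", np.allclose(W_lora, W_direct))
print("max |U^T r_preserved|:", np.abs(U.T @ r_p).max())
```

With a unit-norm direction and λ = 1, the edited matrix can no longer write along `r_preserved`, while the benign subspace spanned by `U` is orthogonal to the edit direction by construction.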