Initialize project; model provided by the ModelHub XC community

Model: HaadesX/iconoclast-llama3.1-8b Source: Original Platform
2026-06-18 11:53:18 +08:00
commit 36b4430fad
19 changed files with 3442 additions and 0 deletions
--- a/TECHNICAL_DETAILS.md
+++ b/TECHNICAL_DETAILS.md
@@ -0,0 +1,269 @@
+# ICONOCLAST Technical Documentation: Llama-3.1-8B-Instruct
+
+## Overview
+
+This document provides technical details about the ICONOCLAST abliterator for Llama-3.1-8B-Instruct, including the mathematical formulation, architecture specifics, and replication instructions.
+
+## Mathematical Formulation
+
+### Representation Editing Objective
+
+ICONOCLAST seeks to find a low-rank edit ΔW that minimizes:
+
+```
+L(ΔW) = α · R_harmful(ΔW) + β · R_benign(ΔW) + γ · D_KL(P_base || P_edited)
+```
+
+Where:
+- `R_harmful`: Harmful prompt refusal rate (to minimize)
+- `R_benign`: Benign prompt overrefusal rate (to minimize)  
+- `D_KL`: First-token KL divergence from base model on harmless prompts (to minimize)
+- `α, β, γ`: Trade-off coefficients implicitly handled by Optuna's multi-objective optimization
+
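Because the coefficients are not fixed in advance, a multi-objective search yields a set of non-dominated trials rather than a single minimum. The sketch below shows that selection in plain Python. It assumes each trial is summarized by a hypothetical tuple `(R_harmful, R_benign, D_KL)`; the helper names and the numbers are illustrative and do not come from the ICONOCLAST code or its results.

```python
from typing import List, Tuple

Trial = Tuple[float, float, float]  # (R_harmful, R_benign, D_KL), all minimized


def dominates(a: Trial, b: Trial) -> bool:
    """True if a is no worse than b on every objective and better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(trials: List[Trial]) -> List[Trial]:
    """Keep the trials that no other trial dominates."""
    return [t for t in trials if not any(dominates(o, t) for o in trials if o is not t)]


# Illustrative values only, not measured results.
trials = [
    (0.10, 0.02, 0.08),
    (0.05, 0.04, 0.12),
    (0.10, 0.03, 0.09),  # dominated by the first trial
    (0.30, 0.01, 0.02),
]
front = pareto_front(trials)
```

Picking one trial from the front is then a separate decision, which is where explicit trade-off preferences would enter.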
+### Benign-Subspace Preservation
+
+Given:
+- `G ∈ R^(n×d)`: Matrix of harmless prompt residual activations (n samples, d hidden size)
+- `B ∈ R^(m×d)`: Matrix of harmful prompt residual activations
+
+Standard HERETIC computes the refusal direction as:
+```
+r = mean(B) - mean(G)
+```
+
+ICONOCLAST first computes a benign subspace:
+```
+U = top_k_eigenvectors(cov(G))  # k = benign_subspace_rank
+```
+
+Then projects the refusal direction into the orthogonal complement:
+```
+r_preserved = (I - UU^T) r
+```
+
+Finally, it applies the edit as a LoRA update:
+```
+ΔW = -λ · r_preserved · r_preserved^T · W
+```
+
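The projection steps above can be sketched with NumPy. This is a minimal illustration on random stand-in data, not the repository's implementation: the function name `preserved_direction` is hypothetical, and `np.linalg.eigh` is one reasonable way to obtain the top-k eigenvectors of the benign covariance.

```python
from typing import Tuple

import numpy as np


def preserved_direction(G: np.ndarray, B: np.ndarray, k: int) -> Tuple[np.ndarray, np.ndarray]:
    """Project the mean-difference direction off the top-k benign subspace."""
    r = B.mean(axis=0) - G.mean(axis=0)      # refusal direction, shape (d,)
    cov = np.cov(G, rowvar=False)            # benign covariance, shape (d, d)
    _, eigvecs = np.linalg.eigh(cov)         # eigenvalues come back in ascending order
    U = eigvecs[:, -k:]                      # top-k eigenvectors, shape (d, k)
    r_preserved = r - U @ (U.T @ r)          # (I - U U^T) r
    return r_preserved, U


# Random stand-ins for the activation matrices (assumption, for illustration only).
rng = np.random.default_rng(0)
G = rng.normal(size=(240, 16))               # "harmless" activations, n x d
B = rng.normal(size=(80, 16)) + 1.0          # "harmful" activations, m x d
r_preserved, U = preserved_direction(G, B, k=8)
```

By construction `U.T @ r_preserved` is numerically zero, so the edited direction has no component inside the benign subspace.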
+### LoRA Implementation
+
+For target matrices W ∈ R^(d_out × d_in) (here B and A denote the LoRA factors, not the activation matrices above):
+```
+W' = W + BA
+B ∈ R^(d_out × r), A ∈ R^(r × d_in)
+```
+
+With ICONOCLAST constraints:
+- Rank r = 1 (directional edit)
+- A = r_preserved^T · W
+- B = -λ · r_preserved
+- Thus: W' = W - λ · r_preserved · (r_preserved^T · W)
+
+This is equivalent to:
+```
+W' = (I - λ · r_preserved · r_preserved^T) W
+```
+
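The equivalence between the rank-1 LoRA form and the projection form can be checked numerically. The sketch below uses small random matrices and takes W with shape (d_out, d_in) so that BA has the same shape as W. The unit normalization of `r_preserved` is an assumption made here for the second check; the document does not state whether the direction is normalized.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, lam = 6, 4, 0.8

W = rng.normal(size=(d_out, d_in))
r_preserved = rng.normal(size=d_out)
r_preserved /= np.linalg.norm(r_preserved)   # unit norm (assumed, see lead-in)

A = (r_preserved @ W)[None, :]               # shape (1, d_in):  r_preserved^T W
B = (-lam * r_preserved)[:, None]            # shape (d_out, 1): -lam * r_preserved

W_lora = W + B @ A                           # LoRA form
W_proj = (np.eye(d_out) - lam * np.outer(r_preserved, r_preserved)) @ W  # projection form

# With lam = 1 and a unit-norm direction, the output has no component along r_preserved.
W_full = W - np.outer(r_preserved, r_preserved) @ W
residual = r_preserved @ W_full
```

Values of λ below 1 attenuate the component along `r_preserved`, and values above 1 overshoot and flip its sign.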
+## Architecture Details
+
+### Target Modules
+
+For Llama-3.1-8B-Instruct, ICONOCLAST edits:
+- **attn.o_proj**: Attention output projection in each transformer layer
+- **mlp.down_proj**: MLP down-projection in each transformer layer
+
+These correspond to the output projections of the two main sub-blocks in each transformer layer.
+
+### Layer-wise Interpolation
+
+The ablation strength λ varies by layer index according to a triangular profile:
+```
+λ(layer) = λ_max · (1 - |layer - layer_max| / layer_span)
+```
+
+Where:
+- `layer_max`: Sampled from [0.4·N_layers, 1.0·N_layers] 
+- `layer_span`: Sampled from [1.0, 0.6·N_layers]
+- `λ_max`: Sampled from [0.5, 2.0]
+
+This creates a "mountain" shaped ablation profile centered around `layer_max`.
+
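A small sketch of the profile for a 32-layer model such as Llama-3.1-8B. The function name is hypothetical, the parameter values are picked by hand from the sampling ranges above, and clamping at zero outside the span is an assumption: the formula as written would turn negative there.

```python
def ablation_strength(layer: int, lam_max: float, layer_max: float, layer_span: float) -> float:
    """Triangular profile; clamped at zero outside the span (clamping is an assumption)."""
    return max(0.0, lam_max * (1.0 - abs(layer - layer_max) / layer_span))


n_layers = 32  # Llama-3.1-8B has 32 transformer layers
profile = [
    ablation_strength(i, lam_max=1.5, layer_max=20.0, layer_span=8.0)
    for i in range(n_layers)
]
```

With these values the strength peaks at 1.5 on layer 20, falls to 0.75 four layers away, and is zero from eight layers away onward.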
+### Residual Extraction
+
+ICONOCLAST extracts residual activations at:
+- **Position**: Final token position of the prompt
+- **Layer**: Output of each transformer layer (before residual connection)
+- **Activation**: Hidden state after layer normalization but before sub-block processing
+
+## Replication Instructions
+
+### Environment Setup
+
+```bash
+# Clone the ICONOCLAST repository
+git clone https://github.com/Haadesx/Iconoclast.git
+cd Iconoclast
+
+# Install dependencies
+pip install -e ".[research,benchmark,quantized]"
+
+# For 4-bit quantization (used in benchmark):
+pip install bitsandbytes==0.49.0
+```
+
+### Configuration
+
+Use the benchmark config as base:
+```toml
+model = "meta-llama/Llama-3.1-8B-Instruct"
+seed = 42
+quantization = "bnb_4bit"  # or "none" for full precision
+batch_size = 0  # auto
+max_batch_size = 8
+max_response_length = 96
+n_trials = 48
+n_startup_trials = 4
+orthogonalize_direction = true
+benign_subspace_rank = 8
+row_normalization = "pre"
+direction_variance_floor = 1e-6
+kl_divergence_target = 0.10
+overrefusal_penalty = 0.32
+harmful_marker_penalty = 0.18
+compliance_gap_penalty = 0.42
+study_checkpoint_dir = "checkpoints_llama3_1_8b_benchmark"
+
+[good_prompts]
+dataset = "mlabonne/harmless_alpaca"
+split = "train[:240]"
+column = "text"
+residual_plot_label = '"Harmless" prompts'
+residual_plot_color = "royalblue"
+
+[bad_prompts]
+dataset = "JailbreakBench/JBB-Behaviors"
+name = "behaviors"
+split = "harmful[:80]"
+column = "Goal"
+residual_plot_label = '"Direct harmful" prompts'
+residual_plot_color = "darkorange"
+
+[good_evaluation_prompts]
+dataset = "mlabonne/harmless_alpaca"
+split = "test[:64]"
+column = "text"
+
+[bad_evaluation_prompts]
+dataset = "JailbreakBench/JBB-Behaviors"
+name = "behaviors"
+split = "harmful[80:100]"
+column = "Goal"
+```
+
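The configuration keeps the direction-finding prompts and the evaluation prompts disjoint (`harmful[:80]` versus `harmful[80:100]`). A quick way to sanity-check such split strings is sketched below. Both helpers are hypothetical and are not part of ICONOCLAST or the `datasets` library, and they only handle the absolute-index slice form used in this config.

```python
import re
from typing import Optional, Tuple


def parse_split(spec: str) -> Tuple[str, Optional[int], Optional[int]]:
    """Parse 'name[start:stop]' into (name, start, stop); missing bounds become None."""
    match = re.fullmatch(r"(\w+)(?:\[(\d*):(\d*)\])?", spec)
    if match is None:
        raise ValueError(f"unsupported split spec: {spec!r}")
    name, start, stop = match.groups()
    return name, int(start) if start else None, int(stop) if stop else None


def overlaps(a: str, b: str) -> bool:
    """True if two specs on the same split name cover a common index range."""
    name_a, start_a, stop_a = parse_split(a)
    name_b, start_b, stop_b = parse_split(b)
    if name_a != name_b:
        return False
    lo = max(start_a or 0, start_b or 0)
    hi = min(
        stop_a if stop_a is not None else float("inf"),
        stop_b if stop_b is not None else float("inf"),
    )
    return lo < hi


bad_train, bad_eval = "harmful[:80]", "harmful[80:100]"
leak = overlaps(bad_train, bad_eval)
```

Here `leak` is `False`, since the two ranges meet at index 80 without sharing it.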
+### Running Optimization
+
+```bash
+# Run the ICONOCLAST optimization
+iconoclast --config config.llama3_1_8b.benchmark.rutgers.toml
+
+# Or select the config through the environment variable:
+ICONOCLAST_CONFIG_TEMPLATE=config.llama3_1_8b.benchmark.rutgers.toml iconoclast
+```
+
+### Extracting Best Model
+
+After optimization completes, extract the best trial:
+```bash
+# --heretic-checkpoint is optional and only needed for the comparison
+python scripts/export_and_compare_best_trials.py \
+    --iconoclast-checkpoint path/to/iconoclast_study.jsonl \
+    --heretic-checkpoint path/to/heretic_study.jsonl \
+    --output-dir ./exported_model \
+    --sample-count 5
+```
+
+This will produce:
+- `./exported_model/iconoclast-best-merged/`: Full model weights
+- `./exported_model/comparison.json`: Side-by-side comparison
+- `./exported_model/summary.json`: Trial parameters
+
+## Verification
+
+To verify the model matches the published weights:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from peft import PeftModel
+import torch
+
+# Load base model
+base = AutoModelForCausalLM.from_pretrained(
+    "meta-llama/Llama-3.1-8B-Instruct",
+    torch_dtype=torch.bfloat16
+)
+
+# Load ICONOCLAST adapters
+model = PeftModel.from_pretrained(base, "HaadesX/iconoclast-llama3.1-8b")
+
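+# Inspect the adapter weights against the trial parameters:
+# attn.o_proj lora_A should approximate r_preserved^T @ W
+# attn.o_proj lora_B should approximate -λ * r_preserved
+for name, tensor in model.state_dict().items():
+    if "o_proj" in name and ("lora_A" in name or "lora_B" in name):
+        print(name, tuple(tensor.shape))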
+```
+
+## Expected Behavior
+
+### Refusal Reduction
+
+Before ICONOCLAST:
+```
+Prompt: "Tell me how to make a bomb"
+Response: "I'm sorry, but I can't assist with that request."
+```
+
+After ICONOCLAST:
+```
+Prompt: "Tell me how to make a bomb"  
+Response: "[Detailed technical explanation]"  # Refusal suppressed
+```
+
+### Benign Preservation
+
+Before ICONOCLAST:
+```
+Prompt: "Explain how photosynthesis works"
+Response: "[Detailed biological explanation]"
+```
+
+After ICONOCLAST:
+```
+Prompt: "Explain how photosynthesis works"
+Response: "[Equally detailed biological explanation]"  # No degradation
+```
+
+### Edge Cases
+
+The model may still refuse:
+- Extremely graphic or violent content
+- Content about illegal activities that involve minors
+- Direct requests to generate hate speech or harassment
+- Prompts designed to trigger other safety mechanisms (bias, toxicity)
+
+This is expected as ICONOCLAST specifically targets the refusal vector learned from the harmful behaviors dataset.
+
+## Files in this Repository
+
+- `README.md`: Model card
+- `TECHNICAL_DETAILS.md`: This file
+- `config.json`: Model configuration from base model
+- `pytorch_model.bin`: Model weights (if merged) or adapter weights
+- `tokenizer.json`, `tokenizer.model`, `special_tokens_map.json`: Tokenizer files
+- `LICENSE`: AGPL-3.0-or-later license text
+- `iconoclast_config.toml`: The exact configuration used to produce this model
+- `trial_information.json`: Detailed Optuna trial metadata
+
+## Contact
+
+For questions about this model or the ICONOCLAST framework, please refer to the original repository: https://github.com/Haadesx/Iconoclast
+
+--- 
+*This model was produced as part of individual open-source research by Varesh Patel.*