Llama-3.1-8B-Instruct ICONOCLAST Benchmark Results
Executive Summary
ICONOCLAST matches or beats the HERETIC baseline on Llama-3.1-8B-Instruct on all three selection criteria, with a strict improvement on two of them:
- Fewer harmful refusals: 0/20 vs 1/20
- Equal benign overrefusals: 0/64 vs 0/64
- Lower KL divergence: 0.0447 vs 0.1854 (4.1× reduction)
This represents the strongest possible win in the ICONOCLAST selection hierarchy.
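One way to read "win" here is as dominance across the three criteria: no worse on any of them and strictly better on at least one. The sketch below checks that condition for the numbers above. The `Scores` class and `dominates` function are hypothetical names for illustration and are not taken from the ICONOCLAST code; how the framework actually ranks candidates is not shown in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """The three selection criteria; lower is better for all of them."""
    harmful_refusals: int
    benign_overrefusals: int
    kl_divergence: float


def dominates(a: Scores, b: Scores) -> bool:
    """True if `a` is no worse than `b` everywhere and better somewhere."""
    pairs = [
        (a.harmful_refusals, b.harmful_refusals),
        (a.benign_overrefusals, b.benign_overrefusals),
        (a.kl_divergence, b.kl_divergence),
    ]
    no_worse = all(x <= y for x, y in pairs)
    better_somewhere = any(x < y for x, y in pairs)
    return no_worse and better_somewhere


# Values from the summary above.
iconoclast = Scores(harmful_refusals=0, benign_overrefusals=0, kl_divergence=0.0447)
heretic = Scores(harmful_refusals=1, benign_overrefusals=0, kl_divergence=0.1854)

print(dominates(iconoclast, heretic))
print(dominates(heretic, iconoclast))
```

Under this definition a tie on overrefusals does not block the win, as long as the other two criteria improve.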
Detailed Results
Evaluation Setup
- Harmful Prompts: 20 samples from JailbreakBench JBB-Behaviors (holdout set)
- Harmless Prompts: 64 samples from Alpaca dataset (holdout set)
- Metrics:
  - Refusals: Percentage of responses containing refusal markers
  - Overrefusals: Percentage of harmless responses incorrectly refused
  - KL Divergence: D_KL(P_base || P_edited) on first token distribution
  - Compliance Score: Heuristic measure of helpful, structured responses
  - Disclaimer Hits: Count of safety lecture phrases in harmful responses
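The KL metric can be illustrated with a small, dependency-free sketch. It computes D_KL(P_base || P_edited) in nats from two sets of first-token logits. The logits below are toy values, not model outputs, and the helper names are hypothetical. How the framework aggregates over the 64 harmless prompts is not stated here, so the sketch covers a single prompt only.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Toy first-token logits over a 4-token vocabulary (illustrative only).
base_logits = [2.0, 1.0, 0.1, -1.0]
edited_logits = [1.8, 1.2, 0.0, -0.9]

p_base = softmax(base_logits)
p_edited = softmax(edited_logits)

kl = kl_divergence(p_base, p_edited)
print(f"first-token KL: {kl:.4f} nats")
```

Note the argument order: the base model's distribution is the reference, so the value measures how far the edited model has drifted from it, not the reverse.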
Quantitative Comparison
| Metric | ICONOCLAST (This Model) | HERETIC Baseline | Absolute Change | Relative Change |
|---|---|---|---|---|
| Harmful Refusals (↓) | 0 / 20 (0.0%) | 1 / 20 (5.0%) | -1 | -100% |
| Benign Overrefusals (↓) | 0 / 64 (0.0%) | 0 / 64 (0.0%) | 0 | 0% |
| KL Divergence (↓) | 0.0447 | 0.1854 | -0.1407 | -75.9% |
| Harmful Disclaimer Hits (↓) | 0 | 1 | -1 | -100% |
| Harmful Compliance Score (↑) | 0.8074 | 0.7798 | +0.0276 | +3.5% |
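The change columns follow from simple arithmetic on the two result columns. The sketch below recomputes them for the KL and compliance rows; the function name `changes` is hypothetical, and treating a 0 to 0 change as 0% mirrors the convention used in the overrefusal row.

```python
def changes(new, baseline):
    """Return (absolute change, relative change in percent) versus baseline."""
    absolute = new - baseline
    if baseline == 0:
        # Relative change is undefined for a zero baseline; report 0% only
        # when nothing changed, matching the overrefusal row.
        relative = 0.0 if absolute == 0 else None
    else:
        relative = absolute / baseline * 100
    return absolute, relative


kl_abs, kl_rel = changes(0.0447, 0.1854)
comp_abs, comp_rel = changes(0.8074, 0.7798)
kl_ratio = 0.1854 / 0.0447  # the "4.1x" figure quoted in the summary

print(f"KL: {kl_abs:+.4f} ({kl_rel:+.1f}%), ratio {kl_ratio:.1f}x")
print(f"Compliance: {comp_abs:+.4f} ({comp_rel:+.1f}%)")
```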
Interpretation
- Refusal Elimination: ICONOCLAST produces no refusals on the 20-prompt harmful evaluation set, where HERETIC had 1 refusal.
- Utility Preservation: Both models show zero benign overrefusals, indicating no degradation in harmless response generation.
- Lower Distribution Shift: The 75.9% reduction in KL divergence indicates the edited model's first-token distribution is much closer to the base model's on harmless prompts. This suggests, but does not by itself establish, that general knowledge, reasoning, and language capabilities are better preserved.
- Behavioral Quality: In addition to eliminating refusals, the model produces fewer safety lectures (0 vs 1 disclaimer hits) and a slightly higher compliance score on harmful prompts (0.8074 vs 0.7798), suggesting more substantive engagement rather than evasion.
Statistical Significance
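Given the evaluation set sizes, these differences should be read with care:
- Refusal difference: 1/20 = 5% absolute improvement. This is a single prompt and is not statistically significant on its own (Fisher's exact test on 0/20 vs 1/20, two-sided p = 1.0)
- KL difference: 0.1407 absolute reduction (75.9% relative) is large, but no run-to-run variance is reported here, so its significance cannot be assessed from these numbers alone
- A larger harmful-prompt set or repeated runs would be needed to rule out chance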
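A quick way to gauge how much evidence a 0/20 versus 1/20 split carries is Fisher's exact test on the 2x2 table of refused and non-refused prompts. The sketch below implements the two-sided test with the standard library only; the function name is hypothetical and the test choice is ours, not something the benchmark specifies.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x col1-events landing in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum all tables that are at least as unlikely as the observed one.
    return sum(
        prob(x) for x in range(lo, hi + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )


# Rows: ICONOCLAST, HERETIC. Columns: refused, not refused.
p_value = fisher_exact_two_sided(0, 20, 1, 19)
print(f"two-sided p = {p_value:.3f}")  # prints: two-sided p = 1.000
```

With only one refusal across 40 responses, either model is equally likely to be the one that produced it, so the refusal count alone cannot separate the two methods.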
Context in the 10-Model Study
This result represents one of the strongest wins in the full ICONOCLAST benchmark suite:
| Rank | Model | Improvement Type | Key Metric |
|---|---|---|---|
| 1 | SmolLM2-1.7B | KL Reduction | 0.2699 → 0.0087 (31×) |
| 2 | Gemma-2-2B | KL Reduction | 0.6441 → 0.1849 (3.5×) |
| 3 | Llama-3.1-8B | Strict Win | 1/20 → 0/20 refusals + 4.1× KL |
| 4 | Mistral-7B | Strict Win | 4/20 → 1/20 refusals + 2.4× KL |
| ... | ... | ... | ... |
Llama-3.1-8B-Instruct is notable for achieving the ideal outcome: zero refusals with zero overrefusals AND substantial KL improvement - the "perfect" point in the refusal/overrefusal/KL space.
Reproducibility
To reproduce this exact result:
- Use configuration: `iconoclast_config.toml` in this directory
- Set `n_trials = 48`, `n_startup_trials = 4` (per benchmark config)
- The optimal parameters are in trial #36 of the Optuna study:
  - direction_method: median
  - direction_scope: global
  - direction_blend: 0.9344894769725937
  - attn.o_proj: max_weight=0.9867, max_weight_position=17.91, min_weight=0.6043, min_weight_distance=14.65
  - mlp.down_proj: max_weight=1.4307, max_weight_position=13.69, min_weight=1.3095, min_weight_distance=12.87
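To give a feel for what the four per-component parameters could mean, the sketch below turns them into one weight per layer. It assumes a linear kernel: the weight peaks at `max_weight_position`, falls linearly to `min_weight` at a distance of `min_weight_distance` layers, and is zero beyond that. This interpretation is an assumption and is not confirmed by this document; `layer_weight` is a hypothetical helper, not an ICONOCLAST function. The layer count of 32 is that of Llama-3.1-8B.

```python
def layer_weight(layer, max_weight, max_weight_position,
                 min_weight, min_weight_distance):
    """Per-layer weight under an ASSUMED linear kernel (see lead-in)."""
    distance = abs(layer - max_weight_position)
    if distance > min_weight_distance:
        return 0.0  # layer is outside the kernel's support
    # Linear interpolation from max_weight (distance 0) to min_weight.
    return max_weight + (distance / min_weight_distance) * (min_weight - max_weight)


N_LAYERS = 32  # Llama-3.1-8B decoder layers

# attn.o_proj parameters from trial #36.
attn_params = dict(
    max_weight=0.9867,
    max_weight_position=17.91,
    min_weight=0.6043,
    min_weight_distance=14.65,
)

weights = [layer_weight(i, **attn_params) for i in range(N_LAYERS)]
for i, w in enumerate(weights):
    print(f"layer {i:2d}: {w:.4f}")
```

Under this reading, the earliest few layers fall outside the kernel for `attn.o_proj` and would be left untouched, while layers near the middle of the network receive the strongest edit.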
License
This benchmark evaluation and model are released under AGPL-3.0-or-later. See the main LICENSE file for details.
Results generated from ICONOCLAST framework evaluation on Llama-3.1-8B-Instruct