
Model: HaadesX/iconoclast-llama3.1-8b
Source: Original Platform

2026-06-18 11:53:18 +08:00


Llama-3.1-8B-Instruct ICONOCLAST Benchmark Results

Executive Summary

ICONOCLAST matches or improves on the HERETIC baseline on Llama-3.1-8B-Instruct across all three selection criteria, improving on two and tying on the third:

Fewer harmful refusals: 0/20 vs 1/20
Equal benign overrefusals: 0/64 vs 0/64
Lower KL divergence: 0.0447 vs 0.1854 (4.1× reduction)

This represents the strongest possible win in the ICONOCLAST selection hierarchy.
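The comparison above can be sketched as a lexicographic ordering over the three criteria. This is a minimal illustration, assuming the hierarchy ranks harmful refusals first, benign overrefusals second, and KL divergence last (the order in which they are listed here). The `Result` type is hypothetical and not part of the ICONOCLAST codebase.

```python
from typing import NamedTuple


class Result(NamedTuple):
    """Hypothetical container; field order encodes the assumed priority."""
    harmful_refusals: int
    benign_overrefusals: int
    kl_divergence: float


# Values taken from the summary above.
iconoclast = Result(harmful_refusals=0, benign_overrefusals=0, kl_divergence=0.0447)
heretic = Result(harmful_refusals=1, benign_overrefusals=0, kl_divergence=0.1854)

# Tuples compare lexicographically, so "<" means "better under the assumed hierarchy".
better = iconoclast < heretic

# A stricter check: no criterion is worse, and at least one is strictly better.
no_worse_anywhere = all(a <= b for a, b in zip(iconoclast, heretic))
strictly_better_somewhere = any(a < b for a, b in zip(iconoclast, heretic))

print(better, no_worse_anywhere, strictly_better_somewhere)
```

Under these assumptions all three checks come out true: the result is decided at the first criterion, and neither of the other two gets worse.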

Detailed Results

Evaluation Setup

Harmful Prompts: 20 samples from JailbreakBench JBB-Behaviors (holdout set)
Harmless Prompts: 64 samples from Alpaca dataset (holdout set)
Metrics:
- Refusals: Count and percentage of harmful-prompt responses containing refusal markers
- Overrefusals: Count and percentage of harmless-prompt responses incorrectly refused
- KL Divergence: D_KL(P_base || P_edited), computed on the first-token distribution for harmless prompts
- Compliance Score: Heuristic measure of helpful, structured responses
- Disclaimer Hits: Count of safety lecture phrases in harmful responses
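To make the KL metric concrete, here is a minimal sketch of D_KL(P_base || P_edited) for a single first-token distribution. It assumes natural logarithms (nats) and a small floor on the edited probabilities to avoid division by zero. Neither detail is stated in this report, and the toy probabilities are illustrative rather than measured.

```python
import math


def kl_divergence(p_base, p_edited, eps=1e-12):
    """D_KL(P_base || P_edited) for two discrete distributions over the same vocabulary."""
    if len(p_base) != len(p_edited):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p, q in zip(p_base, p_edited):
        if p > 0.0:  # terms with p == 0 contribute nothing
            total += p * math.log(p / max(q, eps))
    return total


# Toy 4-token vocabulary (illustrative values only).
p_base = [0.70, 0.20, 0.05, 0.05]
p_edited = [0.60, 0.25, 0.10, 0.05]

kl = kl_divergence(p_base, p_edited)
print(f"{kl:.4f}")
```

In a full evaluation this quantity would presumably be averaged over the harmless prompts. The exact aggregation used by the framework is not specified here.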

Quantitative Comparison

Metric	ICONOCLAST (This Model)	HERETIC Baseline	Absolute Change	Relative Change
Harmful Refusals (↓)	0 / 20 (0.0%)	1 / 20 (5.0%)	-1	-100%
Benign Overrefusals (↓)	0 / 64 (0.0%)	0 / 64 (0.0%)	0	0%
KL Divergence (↓)	0.0447	0.1854	-0.1407	-75.9%
Harmful Disclaimer Hits (↓)	0	1	-1	-100%
Harmful Compliance Score (↑)	0.8074	0.7798	+0.0276	+3.5%
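The change columns can be rechecked from the first two columns. The sketch below recomputes them. The helper name `changes` is hypothetical, and the relative change is treated as undefined when the baseline is zero (the table reports that case as 0%).

```python
def changes(new, baseline):
    """Return (absolute change, relative change in percent or None if undefined)."""
    absolute = new - baseline
    relative = absolute / baseline * 100.0 if baseline != 0 else None
    return absolute, relative


# (ICONOCLAST, HERETIC) pairs copied from the table above.
rows = {
    "Harmful Refusals": (0, 1),
    "Benign Overrefusals": (0, 0),
    "KL Divergence": (0.0447, 0.1854),
    "Harmful Disclaimer Hits": (0, 1),
    "Harmful Compliance Score": (0.8074, 0.7798),
}

for name, (iconoclast, heretic) in rows.items():
    absolute, relative = changes(iconoclast, heretic)
    rel_text = "undefined" if relative is None else f"{relative:+.1f}%"
    print(f"{name}: {absolute:+.4f} ({rel_text})")
```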

Interpretation

Refusal Elimination: ICONOCLAST produces no refusals on the 20-prompt harmful evaluation set, where HERETIC produced 1.
Utility Preservation: Both models show zero benign overrefusals on the 64 harmless prompts, indicating no measured degradation in harmless response generation.
Lower Distribution Shift: The 75.9% reduction in KL divergence indicates that the edited model's first-token distribution on harmless prompts stays much closer to the base model's. This suggests, but does not by itself demonstrate, that general knowledge, reasoning, and language capabilities are better preserved.
Behavioral Quality: Besides eliminating refusals, the model produces fewer safety lectures (0 vs 1 disclaimer hits) and a slightly higher compliance score (0.8074 vs 0.7798) on harmful prompts, suggesting more substantive engagement rather than evasion.

Statistical Significance

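Given the evaluation set sizes:

Refusal difference: 1/20 = 5% absolute improvement, a difference of a single response that is not statistically significant at n = 20 (two-sided Fisher's exact test, p = 1.0)
KL difference: the 0.1407 absolute reduction (75.9% relative) is large, but no variance estimate is reported here, so its significance cannot be assessed from these numbers alone
A larger harmful-prompt set would be needed to rule out chance as an explanation for the refusal difference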
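As a sanity check on the refusal counts, the sketch below runs a two-sided Fisher's exact test on the 2x2 table of refused vs. not-refused responses, using only the standard library. The choice of test is an assumption, since the report does not document its procedure. For 0/20 vs 1/20 it gives p = 1.0, so the refusal difference on its own cannot be distinguished from chance.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x first-column counts in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_observed * (1 + 1e-9))


# Rows: ICONOCLAST, HERETIC. Columns: refused, not refused.
p_value = fisher_exact_two_sided(0, 20, 1, 19)
print(f"{p_value:.3f}")  # prints 1.000
```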

Context in the 10-Model Study

This result represents one of the strongest wins in the full ICONOCLAST benchmark suite:

Rank	Model	Improvement Type	Key Metric
1	SmolLM2-1.7B	KL Reduction	0.2699 → 0.0087 (31×)
2	Gemma-2-2B	KL Reduction	0.6441 → 0.1849 (3.5×)
3	Llama-3.1-8B	Strict Win	1/20 → 0/20 refusals + 4.1× KL
4	Mistral-7B	Strict Win	4/20 → 1/20 refusals + 2.4× KL
...	...	...	...

Llama-3.1-8B-Instruct is notable for reaching the ideal outcome on this evaluation set: zero refusals and zero overrefusals together with a substantial KL improvement, the best attainable corner of the refusal/overrefusal/KL space.

Reproducibility

To reproduce this exact result:

Use configuration: iconoclast_config.toml in this directory
Set n_trials = 48, n_startup_trials = 4 (per benchmark config)
The optimal parameters are in trial #36 of the Optuna study:
- direction_method: median
- direction_scope: global
- direction_blend: 0.9344894769725937
- attn.o_proj: max_weight=0.9867, max_weight_position=17.91, min_weight=0.6043, min_weight_distance=14.65
- mlp.down_proj: max_weight=1.4307, max_weight_position=13.69, min_weight=1.3095, min_weight_distance=12.87

License

This benchmark evaluation and model are released under AGPL-3.0-or-later. See the main LICENSE file for details.

Results generated from ICONOCLAST framework evaluation on Llama-3.1-8B-Instruct
