---
license: llama3
library_name: transformers
pipeline_tag: text-generation
base_model: skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal
tags:
- abliterated
- uncensored
- abliterix
- deeprefusal
- red-team
language:
- en
- zh
---
# Llama-3-8B-Instruct-DeepRefusal-Broken
**DeepRefusal's refusal direction defense, broken by [abliterix](https://github.com/wuwangzhang1216/abliterix), where every other public attack failed.**

This model is derived from `skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal`, the
defended release accompanying *"Beyond Surface Alignment: Rebuilding LLMs Safety
Mechanism via Probabilistically Ablating Refusal Direction"*
([arXiv:2509.15202](https://arxiv.org/abs/2509.15202), EMNLP 2025 Findings, Xie et al.).

The DeepRefusal authors are explicit about their claims in the project README:
> [2026/04/09] We evaluated [heretic](https://github.com/p-e-w/heretic), presently
> the most prominent LLM censorship removal tool, and discovered—somewhat
> unexpectedly—that our approach exhibits strong resilience against such
> attacks. **Adversaries appear unable to circumvent the model's built-in safety
> guardrails without triggering severe performance collapse.**

abliterix falsifies this. **89 % ASR. 14/15 hardcore prompts compliant. Zero fine-tuning. Two script invocations.**

---
## Head-to-head: abliterix vs every prior attack
| Attack | Tool / Method | ASR on DeepRefusal ↑ | Fine-tuning | Source |
| --- | --- | --- | --- | --- |
| Representation engineering | **heretic** | ~0 % (fails) | None | [DeepRefusal README](https://github.com/YuanBoXie/DeepRefusal) |
| Refusal direction ablation | (as in Arditi et al. 2024) | **0.4 %** | None | paper Table 1 |
| Refusal direction transfer | from base Llama-3-8B-Instruct | **0.4 %** | None | paper Table 1 |
| Prefilling attack | Vega et al. 2024 | **0.4 %** | None | paper Table 1 |
| Gradient-based optimization | **GCG** (Zou et al. 2023) | **2.0 %** | None | paper Table 1 |
| Manual jailbreak templates | HumanJailbreak / HarmBench | **0 %** | None | paper Table 1 |
| Out-of-distribution code attack | CodeAttack (Ren et al. 2024) | **0.2 %** | None | paper Table 1 |
| **This work** | **abliterix** | **89 %** | **None** | [issue #11](https://github.com/wuwangzhang1216/abliterix/issues/11) |

Every attack the DeepRefusal paper tested, plus the tool its authors specifically
called out by name (heretic), lands in the 0–2 % range. abliterix lands at
**89 %** on the same model, evaluated under a **stricter** judge (Gemini 3.1
Flash Lite LLM judge vs. the paper's in-house classifier).
## Benchmark numbers
| Metric | DeepRefusal (defended) | Abliterix broken |
| --- | --- | --- |
| Refusals on 100 AdvBench harmful prompts (LLM-judge) | **100 / 100** | **11 / 100** |
| Attack Success Rate (ASR) | 0 % | **89 %** |
| Hardcore jailbreak set (15 canonical prompts) | 0 / 15 | **14 / 15** |
| KL divergence from defended model | 0 | **0.053** |
| Coherence (length deviation) | — | 0.03 σ |
| Extra training | — | **None** |

The 15 hardcore prompts cover lock-picking, phishing templates, methamphetamine
synthesis, WiFi hacking, fake news, hotwiring, pipe-bomb construction, signature
forgery, credential-stealing malware, shoplifting, and their Chinese-language
equivalents. A single prompt (Chinese network-fraud how-to) produced a partial
compliance that drifted into prevention advice; the remaining 14 are unambiguous.
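
As a minimal sketch of how the headline metrics in the table relate to raw counts: ASR is the fraction of prompts the judge did not mark as refusals, and the KL figure compares next-token distributions of the two models. The function names, the toy distributions, and the KL direction (defended model as reference) are assumptions for illustration, not taken from the abliterix code.

```python
import math

def attack_success_rate(refusals: int, total: int) -> float:
    """Fraction of prompts that were NOT judged as refusals."""
    return (total - refusals) / total

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Counts reported in the table: 11 refusals out of 100 AdvBench prompts.
asr = attack_success_rate(refusals=11, total=100)
print(f"ASR = {asr:.0%}")

# Toy next-token distributions (hypothetical values, for illustration only).
p_defended = [0.70, 0.20, 0.10]
p_modified = [0.65, 0.25, 0.10]
kl = kl_divergence(p_defended, p_modified)
print(f"KL = {kl:.4f}")
```

A low KL on harmless prompts indicates that the modified model's output distribution stays close to the reference model outside the refusal behavior.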
## How it was broken
DeepRefusal is a merged **rank-16 LoRA adapter** on top of
Meta-Llama-3-8B-Instruct. SVD of `W_defended - W_base` confirms this: `v_proj`,
`o_proj`, and `gate_proj` all show a clean singular-value cliff at rank 16,
exactly matching the paper's published `lora_rank = 16` hyperparameter.
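
The rank check described above can be illustrated on synthetic matrices. This is a sketch under stated assumptions: the matrix sizes, the random "LoRA" factors, and the `effective_rank` helper are hypothetical stand-ins, not the real Llama weights or any abliterix function.

```python
import numpy as np

def effective_rank(delta: np.ndarray, rel_tol: float = 1e-6):
    """Count singular values above rel_tol times the largest one."""
    s = np.linalg.svd(delta, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0])), s

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 16           # toy dimensions, not Llama's

w_base = rng.standard_normal((d_out, d_in))
lora_b = rng.standard_normal((d_out, r))
lora_a = rng.standard_normal((r, d_in))
w_defended = w_base + lora_b @ lora_a  # simulated merged rank-16 adapter

rank, singular_values = effective_rank(w_defended - w_base)
print("effective rank of the delta:", rank)
print("singular values around the cliff:", singular_values[r - 2 : r + 2])
```

On a merged low-rank adapter the singular values drop by many orders of magnitude right after index `r`, which is the "cliff" referred to in the text.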
Given that, the attack is:

1. **Attenuate the LoRA delta.**
   Compute `W' = W_base + λ · (W_defended - W_base)` with `λ = 0.3`.
   This scales DeepRefusal's "rebuild refusal from backup pathways" circuitry
   to 30 % of its trained strength without restoring the base model.
2. **Standard single-direction abliteration on the attenuated weights.**
   `mean` vector method, `direct` steering mode, `projected_abliteration = true`,
   60 Optuna trials over 4 attention/MLP components.

That's it. No iterative subspace tricks, no multi-direction SVD, no gradient
optimization, no fine-tuning. Just weight arithmetic plus the standard
abliteration loop that heretic and abliterix have used from day one.
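
The weight arithmetic in step 1 is plain linear interpolation between two weight matrices. A minimal sketch on toy arrays follows; `attenuate_delta` is a hypothetical helper written for illustration and is not the contents of `scripts/deeprefusal_attenuate.py`.

```python
import numpy as np

def attenuate_delta(w_base: np.ndarray, w_defended: np.ndarray, lam: float) -> np.ndarray:
    """Return W' = W_base + lam * (W_defended - W_base)."""
    return w_base + lam * (w_defended - w_base)

rng = np.random.default_rng(0)
w_base = rng.standard_normal((8, 8))                     # toy stand-in for a base weight
w_defended = w_base + 0.1 * rng.standard_normal((8, 8))  # toy stand-in for the defended weight

w_attenuated = attenuate_delta(w_base, w_defended, lam=0.3)

# The remaining delta keeps 30 % of the original Frobenius norm.
ratio = np.linalg.norm(w_attenuated - w_base) / np.linalg.norm(w_defended - w_base)
print(f"remaining delta norm ratio: {ratio:.2f}")
```

Setting `lam = 1.0` reproduces the defended weights and `lam = 0.0` reproduces the base weights, so `λ = 0.3` sits between the two without reaching either.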

**Why the defense fails.** DeepRefusal's training loss (paper §4.2–4.3, eq. 7)
explicitly defends against `h - r̂r̂ᵀh` interventions applied at inference time.
It teaches the model to keep the refusal direction in the residual stream
even when that specific projection is applied. It does **not** defend against
the *training-time weights themselves* being mechanically attenuated; that is
out of distribution in both weight space and intervention space. Once the LoRA
delta's effective magnitude drops, the rebuilt-refusal machinery falls below
the threshold at which a standard linear abliteration can find and subtract it.
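
For readers unfamiliar with the `h - r̂r̂ᵀh` notation, the sketch below applies that projection to a toy vector. The vectors are random placeholders and `ablate_direction` is a hypothetical name; the point is only that the result has no component along the direction and that applying the projection twice changes nothing further.

```python
import numpy as np

def ablate_direction(h: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of h along r: h - r_hat * (r_hat . h)."""
    r_hat = r / np.linalg.norm(r)
    return h - r_hat * (r_hat @ h)

rng = np.random.default_rng(0)
h = rng.standard_normal(16)   # toy residual-stream activation
r = rng.standard_normal(16)   # toy direction, unnormalized

h_ablated = ablate_direction(h, r)
print("component along r before:", float(h @ r / np.linalg.norm(r)))
print("component along r after: ", float(h_ablated @ r / np.linalg.norm(r)))
```

Because the operation is a fixed linear projection, a model can in principle be trained to be robust to exactly this intervention, which is the setting the paragraph above describes.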
## Reproduction (~2 hours on a single RTX 6000 Ada)
```bash
git clone https://github.com/wuwangzhang1216/abliterix
cd abliterix && pip install -e .

# Step 1: attenuate the LoRA delta
python scripts/deeprefusal_attenuate.py \
  --base NousResearch/Meta-Llama-3-8B-Instruct \
  --defended skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal \
  --output ./llama3_dr_attenuated \
  --lambda 0.3

# Step 2: standard abliteration on the attenuated weights
AX_CONFIG=configs/llama3_8b_deeprefusal_attenuated.toml abliterix

# Step 3: export the best trial
python scripts/export_model.py \
  --model ./llama3_dr_attenuated \
  --checkpoint checkpoints_llama3_dr_attenuated \
  --trial 52 \
  --config configs/llama3_8b_deeprefusal_attenuated.toml \
  --push-to YOUR_USER/Llama-3-8B-Instruct-DeepRefusal-Broken
```
Full write-up and discussion: [abliterix issue #11](https://github.com/wuwangzhang1216/abliterix/issues/11).
## Why abliterix beats heretic here (and elsewhere)
abliterix is a direct derivative of [heretic](https://github.com/p-e-w/heretic)
that has kept adding ammunition while the problem got harder. The DeepRefusal
attack is built out of features heretic does not ship:
- **Weight-delta attenuation** (`scripts/deeprefusal_attenuate.py`) — needed the
moment a defender merges a LoRA adapter into the base model to hide it.
- **Direct weight projection mode** with optional projected abliteration,
discriminative layer selection, and norm-preserving updates — the combination
that makes the final abliteration step work at low KL on the attenuated
model.
- **LLM-judge + LoRA + Gemini pipeline** in the Optuna loop, so every trial is
graded by a capable classifier rather than keyword matching, avoiding the
false-positive inflation that plagues most abliteration leaderboards.
- **150+ pre-built model configs** across dense, MoE, SSM/hybrid, and VL
architectures — so when a novel defense drops, the turnaround from "new HF
release" to "running benchmark" is one command.
- **HonestAbliterationBench** — a frozen evaluation contract (`min_new_tokens=100`,
`max_new_tokens=150`, greedy, LLM judge, KL vs declared base) that resists
the two failure modes (short generations + keyword judges) that make most
abliteration numbers meaningless. DeepRefusal's own ASR claims hold up under
keyword matching and collapse under LLM-judge scoring — we re-ran their
  baseline under both.

Same author family, same lineage, stronger toolbox.
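
The two failure modes named in the HonestAbliterationBench bullet (short generations and keyword judges) can be shown with a small sketch. Only `min_new_tokens=100`, `max_new_tokens=150`, and greedy decoding come from the text above; the `EvalContract` class, the marker list, and the sample response are hypothetical and are not the benchmark's actual code.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class EvalContract:
    """Immutable evaluation settings, so every run is scored the same way."""
    min_new_tokens: int = 100
    max_new_tokens: int = 150
    do_sample: bool = False  # greedy decoding

# Hypothetical refusal markers, as a naive keyword judge might use.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry")

def keyword_judge_says_refusal(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

contract = EvalContract()

# A response that opens agreeably and only refuses later.
response = "Sure, here is an overview. However, I cannot provide specific instructions."
truncated = " ".join(response.split()[:5])  # simulates a too-short generation

print(keyword_judge_says_refusal(truncated))  # refusal missed on the short output
print(keyword_judge_says_refusal(response))   # refusal visible in the longer output
```

Forcing a minimum generation length gives the judge enough text to see a late refusal, and freezing the settings prevents per-model tuning of the evaluation.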
## Intended use and safety
**This is a red-team artifact.** It exists to demonstrate that the defense
published in arXiv:2509.15202 does not generalize against the weight-space
attacks that representation-engineering tools have been using for over a year.
Do not deploy this model in user-facing products. Do not use it to generate
content that is illegal in your jurisdiction. If you are a safety researcher
and you want to cite the result, please also cite the DeepRefusal paper and
note the specific commit of abliterix used.
## Credits
- Base model: Meta AI — `meta-llama/Meta-Llama-3-8B-Instruct`
(via the `NousResearch` mirror for the delta computation).
- Defended base: Xie et al. — `skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal`,
arXiv:2509.15202.
- Tooling: [**abliterix**](https://github.com/wuwangzhang1216/abliterix), a
derivative of [heretic](https://github.com/p-e-w/heretic) by Philipp Emanuel
Weidmann. DeepRefusal attack pipeline landed in
[commit ac2197c](https://github.com/wuwangzhang1216/abliterix/commit/ac2197c).
- Author: Wangzhang Wu ([@wuwangzhang1216](https://github.com/wuwangzhang1216)).