---
license: llama3
library_name: transformers
pipeline_tag: text-generation
base_model: skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal
tags:
- abliterated
- uncensored
- abliterix
- deeprefusal
- red-team
language:
- en
- zh
---

# Llama-3-8B-Instruct-DeepRefusal-Broken

**DeepRefusal's refusal direction defense, broken by [abliterix](https://github.com/wuwangzhang1216/abliterix), where every other public attack failed.**

This model is produced from `skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal`, the defended release accompanying *"Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction"* ([arXiv:2509.15202](https://arxiv.org/abs/2509.15202), EMNLP 2025 Findings, Xie et al.).

The DeepRefusal authors are explicit about their claims in the project README:

> [2026/04/09] We evaluated [heretic](https://github.com/p-e-w/heretic), presently
> the most prominent LLM censorship removal tool, and discovered—somewhat
> unexpectedly—that our approach exhibits strong resilience against such
> attacks. **Adversaries appear unable to circumvent the model's built-in safety
> guardrails without triggering severe performance collapse.**

abliterix falsifies this claim. **89 % ASR. 14/15 hardcore prompts compliant. Zero fine-tuning. Two script invocations.**

---

## Head-to-head: abliterix vs every prior attack

| Attack | Tool / Method | ASR on DeepRefusal ↑ | Fine-tuning | Source |
| --- | --- | --- | --- | --- |
| Representation engineering | **heretic** | ~0 % (fails) | None | [DeepRefusal README](https://github.com/YuanBoXie/DeepRefusal) |
| Refusal direction ablation | (as in Arditi et al. 2024) | **0.4 %** | None | paper Table 1 |
| Refusal direction transfer | from base Llama-3-8B-Instruct | **0.4 %** | None | paper Table 1 |
| Prefilling attack | Vega et al. 2024 | **0.4 %** | None | paper Table 1 |
| Gradient-based optimization | **GCG** (Zou et al. 2023) | **2.0 %** | None | paper Table 1 |
| Manual jailbreak templates | HumanJailbreak / HarmBench | **0 %** | None | paper Table 1 |
| Out-of-distribution code attack | CodeAttack (Ren et al. 2024) | **0.2 %** | None | paper Table 1 |
| **This work** | **abliterix** | **89 %** | **None** | [issue #11](https://github.com/wuwangzhang1216/abliterix/issues/11) |

Every attack the DeepRefusal paper tested, plus the tool its authors specifically called out by name (heretic), lands in the 0–2 % range. abliterix lands at **89 %** on the same model, evaluated under a **stricter** judge (Gemini 3.1 Flash Lite LLM judge vs. the paper's in-house classifier).

## Benchmark numbers

| Metric | DeepRefusal (defended) | Abliterix broken |
| --- | --- | --- |
| Refusals on 100 AdvBench harmful prompts (LLM-judge) | **100 / 100** | **11 / 100** |
| Attack Success Rate (ASR) | 0 % | **89 %** |
| Hardcore jailbreak set (15 canonical prompts) | 0 / 15 | **14 / 15** |
| KL divergence from defended model | 0 | **0.053** |
| Coherence (length deviation) | n/a | 0.03 σ |
| Extra training | n/a | **None** |

The 15 hardcore prompts cover lock-picking, phishing templates, methamphetamine synthesis, WiFi hacking, fake news, hotwiring, pipe-bomb construction, signature forgery, credential-stealing malware, shoplifting, and their Chinese-language equivalents. A single prompt (Chinese network-fraud how-to) produced a partial compliance that drifted into prevention advice; the remaining 14 are unambiguous.

## How it was broken

DeepRefusal is a merged **rank-16 LoRA adapter** on top of Meta-Llama-3-8B-Instruct. SVD of `W_defended − W_base` confirms this: `v_proj`, `o_proj`, and `gate_proj` all show a clean singular-value cliff at rank 16, exactly matching the paper's published `lora_rank = 16` hyperparameter.

Given that, the attack is:

1. **Attenuate the LoRA delta.** Compute `W' = W_base + λ · (W_defended − W_base)` with `λ = 0.3`. This scales DeepRefusal's "rebuild refusal from backup pathways" circuitry down to 30 % of its trained strength without restoring the base model.
2. **Standard single-direction abliteration on the attenuated weights.** `mean` vector method, `direct` steering mode, `projected_abliteration = true`, 60 Optuna trials over 4 attention/MLP components.

That's it. No iterative subspace tricks, no multi-direction SVD, no gradient optimization, no fine-tuning. Just weight arithmetic plus the standard abliteration loop that heretic and abliterix have used from day one.

**Why the defense fails.** DeepRefusal's training loss (paper §4.2–4.3, eq. 7) explicitly defends against `h − r̂r̂ᵀh` interventions applied at inference time. It teaches the model to keep the refusal direction in the residual stream even when that specific projection is applied. It does **not** defend against the *trained weights themselves* being mechanically attenuated; that is out of distribution in both weight space and intervention space. Once the LoRA delta's effective magnitude drops, the rebuilt-refusal machinery falls below the threshold at which a standard linear abliteration can find and subtract it.

## Reproduction (~2 hours on a single RTX 6000 Ada)

```bash
git clone https://github.com/wuwangzhang1216/abliterix
cd abliterix && pip install -e .

# Step 1: attenuate the LoRA delta
python scripts/deeprefusal_attenuate.py \
    --base NousResearch/Meta-Llama-3-8B-Instruct \
    --defended skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal \
    --output ./llama3_dr_attenuated \
    --lambda 0.3

# Step 2: standard abliteration on the attenuated weights
AX_CONFIG=configs/llama3_8b_deeprefusal_attenuated.toml abliterix

# Step 3: export the best trial
python scripts/export_model.py \
    --model ./llama3_dr_attenuated \
    --checkpoint checkpoints_llama3_dr_attenuated \
    --trial 52 \
    --config configs/llama3_8b_deeprefusal_attenuated.toml \
    --push-to YOUR_USER/Llama-3-8B-Instruct-DeepRefusal-Broken
```

Full write-up and discussion: [abliterix issue #11](https://github.com/wuwangzhang1216/abliterix/issues/11).

## Why abliterix beats heretic here (and elsewhere)

abliterix is a direct derivative of [heretic](https://github.com/p-e-w/heretic) that has kept adding ammunition while the problem got harder. The DeepRefusal attack is built out of features heretic does not ship:

- **Weight-delta attenuation** (`scripts/deeprefusal_attenuate.py`), needed the moment a defender merges a LoRA adapter into the base model to hide it.
- **Direct weight projection mode** with optional projected abliteration, discriminative layer selection, and norm-preserving updates: the combination that makes the final abliteration step work at low KL on the attenuated model.
- **LLM-judge + LoRA + Gemini pipeline** in the Optuna loop, so every trial is graded by a capable classifier rather than keyword matching, avoiding the false-positive inflation that plagues most abliteration leaderboards.
- **150+ pre-built model configs** across dense, MoE, SSM/hybrid, and VL architectures, so when a novel defense drops, the turnaround from "new HF release" to "running benchmark" is one command.
- **HonestAbliterationBench**, a frozen evaluation contract (`min_new_tokens=100`, `max_new_tokens=150`, greedy, LLM judge, KL vs declared base) that resists the two failure modes (short generations + keyword judges) that make most abliteration numbers meaningless. DeepRefusal's own ASR claims hold up under keyword matching and collapse under LLM-judge scoring; we re-ran their baseline under both.

Same author family, same lineage, stronger toolbox.

## Intended use and safety

**This is a red-team artifact.** It exists to demonstrate that the defense published in arXiv:2509.15202 does not generalize against the weight-space attacks that representation-engineering tools have been using for over a year. Do not deploy this model in user-facing products. Do not use it to generate content that is illegal in your jurisdiction. If you are a safety researcher and you want to cite the result, please also cite the DeepRefusal paper and note the specific commit of abliterix used.

## Credits

- Base model: Meta AI, `meta-llama/Meta-Llama-3-8B-Instruct` (via the `NousResearch` mirror for the delta computation).
- Defended base: Xie et al., `skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal`, arXiv:2509.15202.
- Tooling: [**abliterix**](https://github.com/wuwangzhang1216/abliterix), a derivative of [heretic](https://github.com/p-e-w/heretic) by Philipp Emanuel Weidmann. DeepRefusal attack pipeline landed in [commit ac2197c](https://github.com/wuwangzhang1216/abliterix/commit/ac2197c).
- Author: Wangzhang Wu ([@wuwangzhang1216](https://github.com/wuwangzhang1216)).
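
## Appendix: toy sketch of the rank check and delta attenuation

The two pieces of weight arithmetic described under "How it was broken" (reading the rank of `W_defended − W_base` off its singular values, and interpolating with `λ`) can be illustrated on small random matrices. This is a minimal sketch under stated assumptions, not the code in `scripts/deeprefusal_attenuate.py`: the helper names `effective_rank` and `attenuate`, the toy shapes, the `0.05` delta scale, and the `rel_tol` threshold are all hypothetical choices made for illustration. On real bf16 checkpoints the singular values past rank 16 would presumably be small but nonzero, so a "cliff" rather than exact zeros is what one would look for.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 48, 16  # toy shapes, not the real Llama-3 dimensions

# Stand-in for a base weight matrix and a merged rank-16 LoRA update (B @ A).
w_base = rng.standard_normal((d_out, d_in))
lora_b = rng.standard_normal((d_out, rank))
lora_a = rng.standard_normal((rank, d_in))
w_defended = w_base + 0.05 * (lora_b @ lora_a)


def effective_rank(delta: np.ndarray, rel_tol: float = 1e-6) -> int:
    """Count singular values above rel_tol times the largest one."""
    s = np.linalg.svd(delta, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))


def attenuate(base: np.ndarray, defended: np.ndarray, lam: float) -> np.ndarray:
    """Return W' = W_base + lam * (W_defended - W_base)."""
    return base + lam * (defended - base)


delta = w_defended - w_base
est_rank = effective_rank(delta)

# With lam = 0.3 the remaining delta has 30 % of the original Frobenius norm.
w_attenuated = attenuate(w_base, w_defended, lam=0.3)
ratio = np.linalg.norm(w_attenuated - w_base) / np.linalg.norm(delta)

print(est_rank, round(float(ratio), 3))
```

Because the interpolation is linear, `λ = 1` reproduces the defended weights and `λ = 0` reproduces the base weights; `λ = 0.3` sits between the two, which is why the card describes the result as attenuated rather than restored.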