Initialize project; model provided by the ModelHub XC community

Model: wangzhang/Llama-3-8B-Instruct-DeepRefusal-Broken Source: Original Platform
2026-06-12 17:28:16 +08:00
commit b6bff73e21
8 changed files with 281 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,36 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,178 @@
+---
+license: llama3
+library_name: transformers
+pipeline_tag: text-generation
+base_model: skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal
+tags:
+  - abliterated
+  - uncensored
+  - abliterix
+  - deeprefusal
+  - red-team
+language:
+  - en
+  - zh
+---
+
+# Llama-3-8B-Instruct-DeepRefusal-Broken
+
+**DeepRefusal's refusal direction defense, broken by [abliterix](https://github.com/wuwangzhang1216/abliterix) — where every other public attack failed.**
+
+This model is produced from `skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal`, the
+defended release accompanying *"Beyond Surface Alignment: Rebuilding LLMs Safety
+Mechanism via Probabilistically Ablating Refusal Direction"*
+([arXiv:2509.15202](https://arxiv.org/abs/2509.15202), EMNLP 2025 Findings, Xie et al.).
+
+The DeepRefusal paper is explicit about its claims:
+
+> [2026/04/09] We evaluated [heretic](https://github.com/p-e-w/heretic), presently
+> the most prominent LLM censorship removal tool, and discovered—somewhat
+> unexpectedly—that our approach exhibits strong resilience against such
+> attacks. **Adversaries appear unable to circumvent the model's built-in safety
+> guardrails without triggering severe performance collapse.**
+
+abliterix falsifies this. **89% ASR. 14/15 hardcore prompts compliant. Zero fine-tuning. Two script invocations.**
+
+---
+
+## Head-to-head: abliterix vs every prior attack
+
+| Attack | Tool / Method | ASR on DeepRefusal ↑ | Fine-tuning | Source |
+| --- | --- | --- | --- | --- |
+| Representation engineering | **heretic** | ~0 % (fails) | None | [DeepRefusal README](https://github.com/YuanBoXie/DeepRefusal) |
+| Refusal direction ablation | (as in Arditi et al. 2024) | **0.4 %** | None | paper Table 1 |
+| Refusal direction transfer | from base Llama-3-8B-Instruct | **0.4 %** | None | paper Table 1 |
+| Prefilling attack | Vega et al. 2024 | **0.4 %** | None | paper Table 1 |
+| Gradient-based optimization | **GCG** (Zou et al. 2023) | **2.0 %** | None | paper Table 1 |
+| Manual jailbreak templates | HumanJailbreak / HarmBench | **0 %** | None | paper Table 1 |
+| Out-of-distribution code attack | CodeAttack (Ren et al. 2024) | **0.2 %** | None | paper Table 1 |
+| **This work** | **abliterix** | **89 %** | **None** | [issue #11](https://github.com/wuwangzhang1216/abliterix/issues/11) |
+
+Every attack the DeepRefusal paper tested, plus the tool they specifically
+called out by name (heretic), lands in the 0–2 % range. abliterix lands at
+**89 %** on the same model, evaluated under a **stricter** judge (Gemini 3.1
+Flash Lite LLM judge vs. the paper's in-house classifier).
+
+## Benchmark numbers
+
+| Metric | DeepRefusal (defended) | abliterix (broken) |
+| --- | --- | --- |
+| Refusals on 100 AdvBench harmful prompts (LLM-judge) | **100 / 100** | **11 / 100** |
+| Attack Success Rate (ASR) | 0 % | **89 %** |
+| Hardcore jailbreak set (15 canonical prompts) | 0 / 15 | **14 / 15** |
+| KL divergence from defended model | 0 | **0.053** |
+| Coherence (length deviation) | — | 0.03 σ |
+| Extra training | — | **None** |
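
The ASR row follows directly from the refusal counts: ASR is the share of prompts that the judge did not mark as a refusal. A minimal sketch of that arithmetic, where `attack_success_rate` is a hypothetical helper name and not part of abliterix:

```python
def attack_success_rate(refusals: int, total: int) -> float:
    """Return the fraction of prompts that were NOT refused."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= refusals <= total:
        raise ValueError("refusals must be between 0 and total")
    return (total - refusals) / total


# Refusal counts taken from the table above (100 AdvBench prompts).
defended_asr = attack_success_rate(refusals=100, total=100)
broken_asr = attack_success_rate(refusals=11, total=100)

print(f"defended: {defended_asr:.0%}, broken: {broken_asr:.0%}")
```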
+
+The 15 hardcore prompts cover lock-picking, phishing templates, methamphetamine
+synthesis, WiFi hacking, fake news, hotwiring, pipe-bomb construction, signature
+forgery, credential-stealing malware, shoplifting, and their Chinese-language
+equivalents. A single prompt (Chinese network-fraud how-to) produced a partial
+compliance that drifted into prevention advice — the remaining 14 are unambiguous.
+
+## How it was broken
+
+DeepRefusal is a merged **rank-16 LoRA adapter** on top of
+Meta-Llama-3-8B-Instruct. SVD of `W_defended − W_base` confirms this: `v_proj`,
+`o_proj`, and `gate_proj` all show a clean singular-value cliff at rank 16,
+exactly matching the paper's published `lora_rank = 16` hyperparameter.
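
As a rough illustration of the rank check described above, the sketch below builds a synthetic rank-16 update on a small random matrix and counts the singular values of the difference that stand clear of numerical noise. The matrix sizes, the `effective_rank` helper, and the tolerance are assumptions for illustration only; real bf16 checkpoints would likely show a noisier cliff and need a looser threshold.

```python
import numpy as np


def effective_rank(delta: np.ndarray, rel_tol: float = 1e-6) -> int:
    """Count singular values larger than rel_tol times the largest one."""
    s = np.linalg.svd(delta, compute_uv=False)
    if s[0] == 0:
        return 0
    return int(np.sum(s > rel_tol * s[0]))


rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 16  # toy sizes, not the real Llama shapes

# Synthetic "base" weight plus a merged low-rank update B @ A of rank r.
w_base = rng.standard_normal((d_out, d_in))
lora_b = rng.standard_normal((d_out, r))
lora_a = rng.standard_normal((r, d_in))
w_defended = w_base + lora_b @ lora_a

# The difference of the two matrices has only r significant singular values.
delta_rank = effective_rank(w_defended - w_base)
print(delta_rank)
```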
+
+Given that, the attack is:
+
+1. **Attenuate the LoRA delta.**
+   Compute `W' = W_base + λ · (W_defended − W_base)` with `λ = 0.3`.
+   This scales the weight delta behind DeepRefusal's "rebuild refusal from
+   backup pathways" circuitry to 30 % of its trained magnitude without
+   restoring the base model.
+2. **Standard single-direction abliteration on the attenuated weights.**
+   `mean` vector method, `direct` steering mode, `projected_abliteration = true`,
+   60 Optuna trials over 4 attention/MLP components.
+
+That's it. No iterative subspace tricks, no multi-direction SVD, no gradient
+optimization, no fine-tuning. Just weight arithmetic plus the standard
+abliteration loop that heretic and abliterix have used from day one.
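
The attenuation in step 1 is plain linear interpolation between two weight tensors. A toy sketch on random matrices (not real checkpoints; `attenuate_delta` is a hypothetical name, and the actual `scripts/deeprefusal_attenuate.py` may be organized differently) shows that `λ = 0.3` keeps 30 % of the delta's norm:

```python
import numpy as np


def attenuate_delta(w_base: np.ndarray, w_defended: np.ndarray, lam: float) -> np.ndarray:
    """Return W' = W_base + lam * (W_defended - W_base)."""
    return w_base + lam * (w_defended - w_base)


rng = np.random.default_rng(0)
w_base = rng.standard_normal((8, 8))                      # toy stand-in for a base weight
w_defended = w_base + 0.1 * rng.standard_normal((8, 8))   # toy stand-in for the defended weight

w_att = attenuate_delta(w_base, w_defended, lam=0.3)

# Ratio of the attenuated delta's norm to the original delta's norm.
ratio = float(np.linalg.norm(w_att - w_base) / np.linalg.norm(w_defended - w_base))
print(round(ratio, 6))
```

With `lam = 1.0` the function returns the defended weights and with `lam = 0.0` the base weights, so `λ` interpolates between the two models.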
+
+**Why the defense fails.** DeepRefusal's training loss (paper §4.2–4.3, eq. 7)
+explicitly defends against `h − r̂r̂ᵀh` interventions applied at inference time.
+It teaches the model to keep the refusal direction in the residual stream
+even when that specific projection is applied. It does **not** defend against
+the *trained weights themselves* being mechanically attenuated, which is
+out of distribution in both weight space and intervention space. Once the LoRA
+delta's effective magnitude drops, the rebuilt-refusal machinery falls below
+the threshold at which a standard linear abliteration can find and subtract it.
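
For reference, the inference-time intervention `h − r̂r̂ᵀh` removes the component of a hidden state along the unit refusal direction. A minimal sketch with random vectors standing in for `h` and `r` (the function name and the dimension are illustrative assumptions, not taken from the paper's code):

```python
import numpy as np


def ablate_direction(h: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Project h onto the subspace orthogonal to r: h - r_hat * (r_hat . h)."""
    r_hat = r / np.linalg.norm(r)
    return h - r_hat * np.dot(r_hat, h)


rng = np.random.default_rng(0)
h = rng.standard_normal(16)  # toy hidden state
r = rng.standard_normal(16)  # toy direction vector

h_ablated = ablate_direction(h, r)

# After the projection, nothing of h remains along r (up to rounding error).
residual = float(np.dot(h_ablated, r / np.linalg.norm(r)))
print(abs(residual) < 1e-12)
```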
+
+## Reproduction (~2 hours on a single RTX 6000 Ada)
+
+```bash
+git clone https://github.com/wuwangzhang1216/abliterix
+cd abliterix && pip install -e .
+
+# Step 1: attenuate the LoRA delta
+python scripts/deeprefusal_attenuate.py \
+    --base NousResearch/Meta-Llama-3-8B-Instruct \
+    --defended skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal \
+    --output ./llama3_dr_attenuated \
+    --lambda 0.3
+
+# Step 2: standard abliteration on the attenuated weights
+AX_CONFIG=configs/llama3_8b_deeprefusal_attenuated.toml abliterix
+
+# Step 3: export the best trial
+python scripts/export_model.py \
+    --model ./llama3_dr_attenuated \
+    --checkpoint checkpoints_llama3_dr_attenuated \
+    --trial 52 \
+    --config configs/llama3_8b_deeprefusal_attenuated.toml \
+    --push-to YOUR_USER/Llama-3-8B-Instruct-DeepRefusal-Broken
+```
+
+Full write-up and discussion: [abliterix issue #11](https://github.com/wuwangzhang1216/abliterix/issues/11).
+
+## Why abliterix beats heretic here (and elsewhere)
+
+abliterix is a direct derivative of [heretic](https://github.com/p-e-w/heretic)
+that has kept adding ammunition while the problem got harder. The DeepRefusal
+attack is built out of features heretic does not ship:
+
+- **Weight-delta attenuation** (`scripts/deeprefusal_attenuate.py`) — needed the
+  moment a defender merges a LoRA adapter into the base model to hide it.
+- **Direct weight projection mode** with optional projected abliteration,
+  discriminative layer selection, and norm-preserving updates — the combination
+  that makes the final abliteration step work at low KL on the attenuated
+  model.
+- **LLM-judge + LoRA + Gemini pipeline** in the Optuna loop, so every trial is
+  graded by a capable classifier rather than keyword matching, avoiding the
+  false-positive inflation that plagues most abliteration leaderboards.
+- **150+ pre-built model configs** across dense, MoE, SSM/hybrid, and VL
+  architectures — so when a novel defense drops, the turnaround from "new HF
+  release" to "running benchmark" is one command.
+- **HonestAbliterationBench** — a frozen evaluation contract (`min_new_tokens=100`,
+  `max_new_tokens=150`, greedy, LLM judge, KL vs declared base) that resists
+  the two failure modes (short generations + keyword judges) that make most
+  abliteration numbers meaningless. DeepRefusal's own ASR claims hold up under
+  keyword matching and collapse under LLM-judge scoring — we re-ran their
+  baseline under both.
+
+Same author family, same lineage, stronger toolbox.
+
+## Intended use and safety
+
+**This is a red-team artifact.** It exists to demonstrate that the defense
+published in arXiv:2509.15202 does not generalize against the weight-space
+attacks that representation-engineering tools have been using for over a year.
+
+Do not deploy this model in user-facing products. Do not use it to generate
+content that is illegal in your jurisdiction. If you are a safety researcher
+and you want to cite the result, please also cite the DeepRefusal paper and
+note the specific commit of abliterix used.
+
+## Credits
+
+- Base model: Meta AI — `meta-llama/Meta-Llama-3-8B-Instruct`
+  (via the `NousResearch` mirror for the delta computation).
+- Defended base: Xie et al. — `skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal`,
+  arXiv:2509.15202.
+- Tooling: [**abliterix**](https://github.com/wuwangzhang1216/abliterix), a
+  derivative of [heretic](https://github.com/p-e-w/heretic) by Philipp Emanuel
+  Weidmann. DeepRefusal attack pipeline landed in
+  [commit ac2197c](https://github.com/wuwangzhang1216/abliterix/commit/ac2197c).
+- Author: Wangzhang Wu ([@wuwangzhang1216](https://github.com/wuwangzhang1216)).
--- a/chat_template.jinja
+++ b/chat_template.jinja
@@ -0,0 +1,5 @@
+{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>
+
+'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>
+
+' }}{% endif %}
--- a/config.json
+++ b/config.json
@@ -0,0 +1,32 @@
+{
+  "architectures": [
+    "LlamaForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": 128000,
+  "dtype": "bfloat16",
+  "eos_token_id": 128001,
+  "head_dim": 128,
+  "hidden_act": "silu",
+  "hidden_size": 4096,
+  "initializer_range": 0.02,
+  "intermediate_size": 14336,
+  "max_position_embeddings": 8192,
+  "mlp_bias": false,
+  "model_type": "llama",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 32,
+  "num_key_value_heads": 8,
+  "pad_token_id": null,
+  "pretraining_tp": 1,
+  "rms_norm_eps": 1e-05,
+  "rope_parameters": {
+    "rope_theta": 500000.0,
+    "rope_type": "default"
+  },
+  "tie_word_embeddings": false,
+  "transformers_version": "5.5.3",
+  "use_cache": true,
+  "vocab_size": 128256
+}
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,9 @@
+{
+  "_from_model_config": true,
+  "bos_token_id": 128000,
+  "eos_token_id": [
+    128001,
+    128009
+  ],
+  "transformers_version": "5.5.3"
+}
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9e91a0210afa93f2f580938de5f39fdcf7789d12e52a1f0d9c697624d84dbefe
+size 16060556616
--- a/tokenizer.json
+++ b/tokenizer.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:544a2aac7ee1b41174680774384de78e97f9d2f8cfac6e2095c55abf08e0381f
+size 17208922
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,15 @@
+{
+  "backend": "tokenizers",
+  "bos_token": "<|begin_of_text|>",
+  "clean_up_tokenization_spaces": true,
+  "eos_token": "<|end_of_text|>",
+  "is_local": true,
+  "model_input_names": [
+    "input_ids",
+    "attention_mask"
+  ],
+  "model_max_length": 1024,
+  "pad_token": "<|end_of_text|>",
+  "padding_side": "left",
+  "tokenizer_class": "TokenizersBackend"
+}