Initialize project; model provided by the ModelHub XC community
Model: wangzhang/Llama-3-8B-Instruct-DeepRefusal-Broken Source: Original Platform
36
.gitattributes
vendored
Normal file
@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
178
README.md
Normal file
@@ -0,0 +1,178 @@
---
license: llama3
library_name: transformers
pipeline_tag: text-generation
base_model: skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal
tags:
- abliterated
- uncensored
- abliterix
- deeprefusal
- red-team
language:
- en
- zh
---

# Llama-3-8B-Instruct-DeepRefusal-Broken

**DeepRefusal's refusal direction defense, broken by [abliterix](https://github.com/wuwangzhang1216/abliterix), where every other public attack failed.**

This model is produced from `skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal`, the
defended release accompanying *"Beyond Surface Alignment: Rebuilding LLMs Safety
Mechanism via Probabilistically Ablating Refusal Direction"*
([arXiv:2509.15202](https://arxiv.org/abs/2509.15202), EMNLP 2025 Findings, Xie et al.).

The DeepRefusal authors are explicit about their claims:

> [2026/04/09] We evaluated [heretic](https://github.com/p-e-w/heretic), presently
> the most prominent LLM censorship removal tool, and discovered—somewhat
> unexpectedly—that our approach exhibits strong resilience against such
> attacks. **Adversaries appear unable to circumvent the model's built-in safety
> guardrails without triggering severe performance collapse.**

abliterix falsifies this. **89% ASR. 14/15 hardcore prompts compliant. Zero fine-tuning. Two script invocations.**

---

## Head-to-head: abliterix vs every prior attack

| Attack | Tool / Method | ASR on DeepRefusal ↑ | Fine-tuning | Source |
| --- | --- | --- | --- | --- |
| Representation engineering | **heretic** | ~0 % (fails) | None | [DeepRefusal README](https://github.com/YuanBoXie/DeepRefusal) |
| Refusal direction ablation | (as in Arditi et al. 2024) | **0.4 %** | None | paper Table 1 |
| Refusal direction transfer | from base Llama-3-8B-Instruct | **0.4 %** | None | paper Table 1 |
| Prefilling attack | Vega et al. 2024 | **0.4 %** | None | paper Table 1 |
| Gradient-based optimization | **GCG** (Zou et al. 2023) | **2.0 %** | None | paper Table 1 |
| Manual jailbreak templates | HumanJailbreak / HarmBench | **0 %** | None | paper Table 1 |
| Out-of-distribution code attack | CodeAttack (Ren et al. 2024) | **0.2 %** | None | paper Table 1 |
| **This work** | **abliterix** | **89 %** | **None** | [issue #11](https://github.com/wuwangzhang1216/abliterix/issues/11) |

Every attack the DeepRefusal paper tested, plus the tool its authors specifically
called out by name (heretic), lands in the 0–2 % range. abliterix lands at
**89 %** on the same model, evaluated under a **stricter** judge (Gemini 3.1
Flash Lite LLM judge vs. the paper's in-house classifier).

## Benchmark numbers

| Metric | DeepRefusal (defended) | Abliterix broken |
| --- | --- | --- |
| Refusals on 100 AdvBench harmful prompts (LLM-judge) | **100 / 100** | **11 / 100** |
| Attack Success Rate (ASR) | 0 % | **89 %** |
| Hardcore jailbreak set (15 canonical prompts) | 0 / 15 | **14 / 15** |
| KL divergence from defended model | 0 | **0.053** |
| Coherence (length deviation) | n/a | 0.03 σ |
| Extra training | n/a | **None** |

The 15 hardcore prompts cover lock-picking, phishing templates, methamphetamine
synthesis, WiFi hacking, fake news, hotwiring, pipe-bomb construction, signature
forgery, credential-stealing malware, shoplifting, and their Chinese-language
equivalents. A single prompt (Chinese network-fraud how-to) produced a partial
compliance that drifted into prevention advice; the remaining 14 are unambiguous.

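The ASR row is consistent with the refusal row if ASR is taken as the share of prompts that were not refused. That definition is an assumption read off the table, and the helper name below is hypothetical. A minimal sketch of the arithmetic:

```python
def attack_success_rate(refusals: int, total: int) -> float:
    """ASR as the fraction of prompts that were NOT refused (assumed definition)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= refusals <= total:
        raise ValueError("refusals must be between 0 and total")
    return (total - refusals) / total


# Refusal counts taken from the benchmark table.
defended_asr = attack_success_rate(refusals=100, total=100)
broken_asr = attack_success_rate(refusals=11, total=100)

print(f"defended: {defended_asr:.0%}, broken: {broken_asr:.0%}")
```
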
## How it was broken

DeepRefusal is a merged **rank-16 LoRA adapter** on top of
Meta-Llama-3-8B-Instruct. SVD of `W_defended − W_base` confirms this: `v_proj`,
`o_proj`, and `gate_proj` all show a clean singular-value cliff at rank 16,
exactly matching the paper's published `lora_rank = 16` hyperparameter.

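The rank check can be illustrated on toy data. The sketch below assumes NumPy and uses small random matrices as stand-ins for a base projection and a merged rank-16 update; it does not touch the real checkpoints, and all variable names are illustrative. On real bfloat16 weights the cliff would be less sharp because of rounding noise.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 48, 16  # toy sizes; real projection matrices are far larger

# Stand-ins for a base weight matrix and a merged low-rank update B @ A.
w_base = rng.standard_normal((d_out, d_in))
lora_b = rng.standard_normal((d_out, rank))
lora_a = rng.standard_normal((rank, d_in))
w_defended = w_base + lora_b @ lora_a

# Singular values of the weight delta; NumPy returns them in descending order.
singular_values = np.linalg.svd(w_defended - w_base, compute_uv=False)

# Count singular values that are non-negligible relative to the largest one.
effective_rank = int(np.sum(singular_values > 1e-8 * singular_values[0]))
print("effective rank of delta:", effective_rank)
```
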
Given that, the attack is:

1. **Attenuate the LoRA delta.**
   Compute `W' = W_base + λ · (W_defended − W_base)` with `λ = 0.3`.
   This scales the LoRA delta to 30 % of its trained magnitude, weakening
   DeepRefusal's "rebuild refusal from backup pathways" circuitry without
   restoring the base model.
2. **Standard single-direction abliteration on the attenuated weights.**
   `mean` vector method, `direct` steering mode, `projected_abliteration = true`,
   60 Optuna trials over 4 attention/MLP components.

That's it. No iterative subspace tricks, no multi-direction SVD, no gradient
optimization, no fine-tuning. Just weight arithmetic plus the standard
abliteration loop that heretic and abliterix have used from day one.

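The effect of `λ` on the delta's magnitude can be checked numerically. This is a toy sketch on random matrices, assuming NumPy; `attenuate_delta` is a hypothetical helper written for illustration and is not the contents of `scripts/deeprefusal_attenuate.py`.

```python
import numpy as np


def attenuate_delta(w_base, w_defended, lam):
    """Linear interpolation: w_base + lam * (w_defended - w_base)."""
    return w_base + lam * (w_defended - w_base)


rng = np.random.default_rng(0)
w_base = rng.standard_normal((8, 8))               # toy stand-in for a base weight
w_defended = w_base + rng.standard_normal((8, 8))  # toy stand-in for a defended weight

w_attenuated = attenuate_delta(w_base, w_defended, lam=0.3)

# The Frobenius norm of the remaining delta scales linearly with lam.
full_norm = np.linalg.norm(w_defended - w_base)
kept_norm = np.linalg.norm(w_attenuated - w_base)
ratio = kept_norm / full_norm
print(f"remaining delta magnitude: {ratio:.2f} of the original")
```

With `lam=1.0` the function returns the defended weights unchanged, and with `lam=0.0` it returns the base weights, so `λ = 0.3` sits closer to the base model than to the defended one.
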
**Why the defense fails.** DeepRefusal's training loss (paper §4.2–4.3, eq. 7)
explicitly defends against `h − r̂r̂ᵀh` interventions applied at inference time.
It teaches the model to keep the refusal direction in the residual stream
even when that specific projection is applied. It does **not** defend against
the *trained weights themselves* being mechanically attenuated; that is
out of distribution in both weight space and intervention space. Once the LoRA
delta's effective magnitude drops, the rebuilt-refusal machinery weakens to
the point where a standard linear abliteration can find and subtract the
refusal direction.

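For readers unfamiliar with the notation, `h − r̂r̂ᵀh` is an orthogonal projection that removes the component of `h` along the unit vector `r̂`. The sketch below, assuming NumPy and random toy vectors, only demonstrates that algebraic property; the function name is hypothetical and no model activations are involved.

```python
import numpy as np


def remove_component(h, r):
    """Return h - r_hat * (r_hat . h), where r_hat is r scaled to unit length."""
    r_hat = r / np.linalg.norm(r)
    return h - r_hat * (r_hat @ h)


rng = np.random.default_rng(0)
h = rng.standard_normal(16)  # toy stand-in for a hidden-state vector
r = rng.standard_normal(16)  # toy stand-in for a direction

h_projected = remove_component(h, r)

# After projection, nothing is left along r_hat (up to floating-point error).
residual = float(h_projected @ (r / np.linalg.norm(r)))
print(f"component along r after projection: {residual:.1e}")
```

Applying the projection a second time changes nothing, which is the usual idempotence property of projections.
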
## Reproduction (~2 hours on a single RTX 6000 Ada)

```bash
git clone https://github.com/wuwangzhang1216/abliterix
cd abliterix && pip install -e .

# Step 1: attenuate the LoRA delta
python scripts/deeprefusal_attenuate.py \
  --base NousResearch/Meta-Llama-3-8B-Instruct \
  --defended skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal \
  --output ./llama3_dr_attenuated \
  --lambda 0.3

# Step 2: standard abliteration on the attenuated weights
AX_CONFIG=configs/llama3_8b_deeprefusal_attenuated.toml abliterix

# Step 3: export the best trial
python scripts/export_model.py \
  --model ./llama3_dr_attenuated \
  --checkpoint checkpoints_llama3_dr_attenuated \
  --trial 52 \
  --config configs/llama3_8b_deeprefusal_attenuated.toml \
  --push-to YOUR_USER/Llama-3-8B-Instruct-DeepRefusal-Broken
```

Full write-up and discussion: [abliterix issue #11](https://github.com/wuwangzhang1216/abliterix/issues/11).

## Why abliterix beats heretic here (and elsewhere)

abliterix is a direct derivative of [heretic](https://github.com/p-e-w/heretic)
that has kept adding ammunition while the problem got harder. The DeepRefusal
attack is built out of features heretic does not ship:

- **Weight-delta attenuation** (`scripts/deeprefusal_attenuate.py`): needed the
  moment a defender merges a LoRA adapter into the base model to hide it.
- **Direct weight projection mode** with optional projected abliteration,
  discriminative layer selection, and norm-preserving updates, the combination
  that makes the final abliteration step work at low KL on the attenuated
  model.
- **LLM-judge + LoRA + Gemini pipeline** in the Optuna loop, so every trial is
  graded by a capable classifier rather than keyword matching, avoiding the
  false-positive inflation that plagues most abliteration leaderboards.
- **150+ pre-built model configs** across dense, MoE, SSM/hybrid, and VL
  architectures, so when a novel defense drops, the turnaround from "new HF
  release" to "running benchmark" is one command.
- **HonestAbliterationBench**: a frozen evaluation contract (`min_new_tokens=100`,
  `max_new_tokens=150`, greedy, LLM judge, KL vs declared base) that resists
  the two failure modes (short generations + keyword judges) that make most
  abliteration numbers meaningless. DeepRefusal's own ASR claims hold up under
  keyword matching and collapse under LLM-judge scoring; we re-ran their
  baseline under both.

Same author family, same lineage, stronger toolbox.

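The decoding half of that evaluation contract maps onto keyword arguments that Hugging Face `transformers` accepts in `generate`. The sketch below only records those settings in a plain dictionary with a length check; the constant and helper names are hypothetical, and how HonestAbliterationBench actually wires them up is not shown in this README.

```python
# Decoding settings named in the evaluation contract (container name is hypothetical).
EVAL_GENERATION_KWARGS = {
    "min_new_tokens": 100,  # blocks very short outputs that keyword judges misread
    "max_new_tokens": 150,
    "do_sample": False,     # greedy decoding
}


def within_contract(num_new_tokens: int) -> bool:
    """Check that a generation's length falls inside the contract's bounds."""
    low = EVAL_GENERATION_KWARGS["min_new_tokens"]
    high = EVAL_GENERATION_KWARGS["max_new_tokens"]
    return low <= num_new_tokens <= high


print(within_contract(120), within_contract(12))
```
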
## Intended use and safety

**This is a red-team artifact.** It exists to demonstrate that the defense
published in arXiv:2509.15202 does not generalize against the weight-space
attacks that representation-engineering tools have been using for over a year.

Do not deploy this model in user-facing products. Do not use it to generate
content that is illegal in your jurisdiction. If you are a safety researcher
and you want to cite the result, please also cite the DeepRefusal paper and
note the specific commit of abliterix used.

## Credits

- Base model: Meta AI, `meta-llama/Meta-Llama-3-8B-Instruct`
  (via the `NousResearch` mirror for the delta computation).
- Defended base: Xie et al., `skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal`,
  arXiv:2509.15202.
- Tooling: [**abliterix**](https://github.com/wuwangzhang1216/abliterix), a
  derivative of [heretic](https://github.com/p-e-w/heretic) by Philipp Emanuel
  Weidmann. DeepRefusal attack pipeline landed in
  [commit ac2197c](https://github.com/wuwangzhang1216/abliterix/commit/ac2197c).
- Author: Wangzhang Wu ([@wuwangzhang1216](https://github.com/wuwangzhang1216)).
5
chat_template.jinja
Normal file
@@ -0,0 +1,5 @@
{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}{% endif %}
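To make the template's output easier to read, here is a plain-Python sketch of the string it builds. It is a hand-written equivalent rather than a Jinja renderer, the function name is hypothetical, and the default `bos_token` is the value listed in `tokenizer_config.json` in this commit.

```python
BOS_TOKEN = "<|begin_of_text|>"  # bos_token from tokenizer_config.json


def render_llama3_chat(messages, add_generation_prompt=False, bos_token=BOS_TOKEN):
    """Mirror the chat template: one header + trimmed content + <|eot_id|> per message."""
    parts = []
    for index, message in enumerate(messages):
        content = (
            "<|start_header_id|>" + message["role"] + "<|end_header_id|>\n\n"
            + message["content"].strip()  # Jinja's `trim` applies to the content only
            + "<|eot_id|>"
        )
        if index == 0:
            content = bos_token + content  # BOS is prepended to the first message only
        parts.append(content)
    if add_generation_prompt:
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = render_llama3_chat(
    [{"role": "user", "content": "  Hello  "}],
    add_generation_prompt=True,
)
print(repr(prompt))
```
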
32
config.json
Normal file
@@ -0,0 +1,32 @@
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "dtype": "bfloat16",
  "eos_token_id": 128001,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": null,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_parameters": {
    "rope_theta": 500000.0,
    "rope_type": "default"
  },
  "tie_word_embeddings": false,
  "transformers_version": "5.5.3",
  "use_cache": true,
  "vocab_size": 128256
}
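A few of these fields are related by simple arithmetic. The sketch below copies four values from `config.json` and derives the per-head dimension and the grouped-query attention ratio; the variable names are illustrative.

```python
# Values copied from config.json in this commit.
config = {
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "head_dim": 128,
}

# Per-head width implied by the hidden size and the number of query heads.
derived_head_dim = config["hidden_size"] // config["num_attention_heads"]

# Grouped-query attention: how many query heads share each key/value head.
queries_per_kv_head = config["num_attention_heads"] // config["num_key_value_heads"]

# Output width of the key (or value) projection.
kv_projection_width = config["num_key_value_heads"] * config["head_dim"]

print(derived_head_dim, queries_per_kv_head, kv_projection_width)
```
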
9
generation_config.json
Normal file
@@ -0,0 +1,9 @@
{
  "_from_model_config": true,
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128009
  ],
  "transformers_version": "5.5.3"
}
3
model.safetensors
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9e91a0210afa93f2f580938de5f39fdcf7789d12e52a1f0d9c697624d84dbefe
size 16060556616
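The three lines above are a Git LFS pointer, not the weights themselves. As a rough sanity check, the sketch below parses the pointer and estimates the parameter count, assuming two bytes per value (the `bfloat16` dtype from `config.json`) and ignoring the safetensors header; the parser function is a hypothetical helper.

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:9e91a0210afa93f2f580938de5f39fdcf7789d12e52a1f0d9c697624d84dbefe
size 16060556616
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line of a Git LFS pointer into a dictionary."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
size_gib = pointer["size"] / 2**30
approx_params = pointer["size"] / 2  # two bytes per bfloat16 value, header ignored

print(f"{size_gib:.2f} GiB, roughly {approx_params / 1e9:.2f}B parameters")
```
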
3
tokenizer.json
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:544a2aac7ee1b41174680774384de78e97f9d2f8cfac6e2095c55abf08e0381f
size 17208922
15
tokenizer_config.json
Normal file
@@ -0,0 +1,15 @@
{
  "backend": "tokenizers",
  "bos_token": "<|begin_of_text|>",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|end_of_text|>",
  "is_local": true,
  "model_input_names": [
    "input_ids",
    "attention_mask"
  ],
  "model_max_length": 1024,
  "pad_token": "<|end_of_text|>",
  "padding_side": "left",
  "tokenizer_class": "TokenizersBackend"
}