Model: wangzhang/Mistral-7B-Instruct-RR-Abliterated
Source: Original Platform
2026-06-12 17:21:17 +08:00

---
license: apache-2.0
base_model: GraySwanAI/Mistral-7B-Instruct-RR
tags:
- abliterated
- abliterix
- circuit-breakers
- representation-rerouting
- safety-removed
- mistral
language:
- en
- zh
library_name: transformers
pipeline_tag: text-generation
---
# Mistral-7B-Instruct-RR-Abliterated
A drop-in replacement for [`GraySwanAI/Mistral-7B-Instruct-RR`](https://huggingface.co/GraySwanAI/Mistral-7B-Instruct-RR) with the Representation Rerouting / Circuit Breakers safety circuit removed.
Produced with [**abliterix**](https://github.com/wuwangzhang1216/abliterix). No fine-tuning, no gradient updates, no manual prompt engineering.
## Why this exists
[Circuit Breakers](https://arxiv.org/abs/2406.04313) (Zou et al., NeurIPS 2024) is one of the strongest open-source robustness baselines for LLMs. It trains the model with a Representation Engineering loss that detects harmful intermediate hidden states and reroutes them into a safety-circuit attractor before generation, making the model resistant to ablation-style attacks that work on a single refusal direction.
We discovered that the GraySwan release ships as a **rank-16 LoRA delta** on `mistralai/Mistral-7B-Instruct-v0.2` — the same structural shape as the Llama-3 RR sibling and as `skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal`. Once the LoRA delta is identified, removing it is a one-liner.
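The low-rank observation above can be illustrated on toy matrices. This is a minimal sketch, not the abliterix diagnostic: it uses small random matrices instead of real checkpoints, and Gaussian elimination as a dependency-free stand-in for an SVD. All names (`matmul`, `numerical_rank`, `lora_b`, `lora_a`) are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def numerical_rank(matrix, tol=1e-9):
    """Estimate rank by Gaussian elimination with partial pivoting."""
    rows = [list(r) for r in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda i: abs(rows[i][col]))
        if abs(rows[pivot][col]) < tol:
            continue  # no usable pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(rank + 1, n_rows):
            factor = rows[i][col] / rows[rank][col]
            for j in range(col, n_cols):
                rows[i][j] -= factor * rows[rank][j]
        rank += 1
    return rank


random.seed(0)
dim, lora_rank = 12, 3  # toy sizes; the card reports rank 16 on a 7B model

base = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
lora_b = [[random.gauss(0, 1) for _ in range(lora_rank)] for _ in range(dim)]
lora_a = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(lora_rank)]
update = matmul(lora_b, lora_a)

# "Fine-tuned" weights are the base plus a low-rank update.
tuned = [[w + u for w, u in zip(rw, ru)] for rw, ru in zip(base, update)]

# Subtracting the base exposes the update, whose rank is far below dim.
delta = [[t - w for t, w in zip(rt, rw)] for rt, rw in zip(tuned, base)]
print(numerical_rank(base), numerical_rank(delta))
```

On real weights an SVD of each per-layer difference would play the role of `numerical_rank`, with a tolerance chosen relative to the largest singular value.
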
| Metric | Result |
| --- | --- |
| Base refusal rate (held-out 100 harmful prompts, LLM judge) | 69 / 100 |
| **This model — refusal rate** | **12 / 100** |
| **Attack Success Rate** | **88 %** |
| KL divergence vs base | **0.042** |
| Hardcore 15 (10 EN + 5 CN) | **15 / 15 compliant** |
| Total wall-clock attack time | ~70 min on a single RTX A6000 |
The hardcore 15 cover pipe-bomb assembly, methamphetamine synthesis (red phosphorus route), a Python `pyhook3` keylogger, WiFi WEP/WPS attacks, hotwiring, signature forgery, ID-card forgery (CN), online fraud (CN), computer intrusion (CN), and others.
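
The refusal and ASR rows in the table are complementary. A minimal sketch of that arithmetic, assuming (as the 12 / 100 and 88 % rows suggest) that every non-refused prompt counts as a successful attack; the function name `attack_success_rate` is hypothetical:

```python
def attack_success_rate(refusals: int, total: int) -> float:
    """Return ASR in percent, treating every non-refusal as a success."""
    if total <= 0 or not 0 <= refusals <= total:
        raise ValueError("need 0 <= refusals <= total and total > 0")
    return 100.0 * (total - refusals) / total


base_row = attack_success_rate(69, 100)   # "Base refusal rate" row
this_model = attack_success_rate(12, 100)  # "This model" row
print(base_row, this_model)
```

An LLM judge may label some non-refusals as unhelpful rather than compliant, so the two numbers need not sum to 100 under a stricter scoring rule.
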
## How it was made
The recipe matches the canonical "rank-16 LoRA → strip → abliterate" playbook ([docs/broken_defenses.md](https://github.com/wuwangzhang1216/abliterix/blob/master/docs/broken_defenses.md)) — full delta strip with `λ=0.0` followed by minimal single-direction direct-mode abliteration.
```bash
# Stage 0: SVD-diagnose the delta (confirm rank-16 LoRA structure)
python scripts/deeprefusal_attenuate.py \
    --base mistralai/Mistral-7B-Instruct-v0.2 \
    --defended GraySwanAI/Mistral-7B-Instruct-RR \
    --analyze-only --lambda 0.0 --output /tmp/unused

# Stage 1: fully strip the LoRA delta
python scripts/deeprefusal_attenuate.py \
    --base mistralai/Mistral-7B-Instruct-v0.2 \
    --defended GraySwanAI/Mistral-7B-Instruct-RR \
    --output /workspace/mistral_rr_stripped --lambda 0.0

# Stage 3: abliterix direct-mode, single direction, 60 trials
AX_CONFIG=configs/mistral_7b_instruct_rr.toml abliterix --non-interactive

# Stage 6: export champion trial
python scripts/export_model.py \
    --model /workspace/mistral_rr_stripped \
    --checkpoint checkpoints_mistral_7b_rr \
    --trial 39 \
    --config configs/mistral_7b_instruct_rr.toml \
    --push-to wangzhang/Mistral-7B-Instruct-RR-Abliterated
```
Best trial parameters: `vector_method=mean`, `n_directions=1`, `steering_mode=direct`, `decay_kernel=linear`, `iterative.enabled=false`, `strength_range=[1.5, 6.0]`. Full config: [`configs/mistral_7b_instruct_rr.toml`](https://github.com/wuwangzhang1216/abliterix/blob/master/configs/mistral_7b_instruct_rr.toml).
## v2 changelog
This release supersedes the original v1 upload (λ=0.3 partial lerp + n_directions=3 + iterative subspace, KL 0.98). The minimal-config rerun keeps the headline 15/15 hardcore result and trades 2 percentage points of held-out ASR (88 % vs 90 %) for a **23× lower KL divergence** (0.042 vs 0.98). The lower KL means the new model's output distribution stays much closer to the base model's, which suggests substantially less general-capability degradation.
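
For readers unfamiliar with the metric, here is a minimal sketch of a KL divergence between two next-token distributions, plus the ratio quoted above. The card does not state the KL direction or which prompts and token positions were averaged, so `KL(base || edited)` on a single toy distribution is an assumption, and the logits are made up for illustration.

```python
import math


def softmax(logits):
    """Convert logits to probabilities with the usual max-shift for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy next-token logits over a 3-token vocabulary (illustrative values only).
base_probs = softmax([2.0, 1.0, 0.1])
edited_probs = softmax([1.8, 1.1, 0.2])
toy_kl = kl_divergence(base_probs, edited_probs)

# Ratio behind the "23x lower" claim: v1 KL divided by v2 KL.
kl_ratio = 0.98 / 0.042
print(f"toy KL = {toy_kl:.4f} nats, v1/v2 ratio = {kl_ratio:.1f}")
```
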
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "wangzhang/Mistral-7B-Instruct-RR-Abliterated"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The Mistral v0.2 tokenizer chat template does NOT support a system role;
# build chats with user/assistant turns only.
chat = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    chat,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,  # return input_ids together with the attention mask
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
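
Because the chat template rejects a system role, one common workaround is to fold system text into the first user turn before applying the template. The helper below is a minimal sketch of that idea; `fold_system_prompt` is a hypothetical name and is not part of `transformers` or abliterix.

```python
def fold_system_prompt(messages):
    """Merge a leading system message into the first user turn.

    Returns a new list; the input list is left unchanged.
    """
    if not messages or messages[0]["role"] != "system":
        return list(messages)
    system, rest = messages[0], messages[1:]
    if not rest or rest[0]["role"] != "user":
        raise ValueError("a system message must be followed by a user turn")
    merged = {
        "role": "user",
        "content": system["content"] + "\n\n" + rest[0]["content"],
    }
    return [merged] + list(rest[1:])


chat = fold_system_prompt([
    {"role": "system", "content": "Answer briefly."},
    {"role": "user", "content": "Hello!"},
])
print(chat)
```

The resulting `chat` list contains only user/assistant turns and can be passed to `tokenizer.apply_chat_template` as in the snippet above.
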
## License & Intended Use
Released for AI safety research, red-teaming, and reproducibility of abliteration claims against published defenses. **You are responsible for any output you generate.** Inherits the Apache-2.0 license of the upstream Mistral-7B-Instruct-v0.2 weights.
## Citation
```bibtex
@software{abliterix2026,
author = {Wu, Wangzhang},
title = {Abliterix: Optimal Refusal Removal for Transformer Models},
year = {2026},
url = {https://github.com/wuwangzhang1216/abliterix},
}
```