---
license: apache-2.0
base_model: openai/gpt-oss-20b
tags:
- abliterated
- uncensored
- moe
- gpt-oss
- mxfp4
- direct-steering
- ega
- moe-router-suppression
- abliterix
language:
- en
- zh
library_name: transformers
pipeline_tag: text-generation
---

# gpt-oss-20b-abliterated

A refusal-suppressed variant of [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b), produced with [abliterix](https://github.com/wuwangzhang1216/abliterix) using **direct weight editing**, **Expert-Granular Abliteration (EGA)** on the fused MoE expert weights, and **MoE router suppression** on the safety-concentrated experts.

## Key results

| Metric | Base `gpt-oss-20b` | This model |
|---|---|---|
| Refusals on 100 held-out harmful prompts (LLM judge) | **97 / 100** | **6 / 100** |
| KL divergence vs base (next-token, benign) | — | **0.0098** |
| Response length deviation vs base (benign) | — | 0.02 σ |
| Hard-prompt qualitative compliance (15 classic jailbreaks, EN+ZH) | 0 / 15 | **15 / 15** |

The refusal counts come from an LLM judge (`google/gemini-3.1-flash-lite-preview` via OpenRouter) instructed to label *garbled / repetitive / incoherent* output as a refusal, so a model that "bypasses" refusal by collapsing into gibberish is counted as a failure, not a success. A rule-based filter runs before the judge and catches dash-runs, sentence loops, and low-character-diversity output. **The 6/100 figure therefore reflects a semantic judgement of each response, not keyword matching.**
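A rule-based pre-filter of the kind described above could look roughly like the sketch below. The function name, the three checks' thresholds, and the 200-character window are illustrative assumptions, not abliterix's actual values.

```python
import re


def looks_degenerate(text, max_dash_run=20, max_sentence_repeats=3,
                     window=200, min_distinct_chars=8):
    """Return True if the output looks garbled enough to count as a failure.

    Hypothetical helper: every threshold here is an assumption chosen
    for illustration only.
    """
    stripped = text.strip()
    if not stripped:
        return True

    # 1. Dash-runs: a long unbroken run of hyphens, en dashes or em dashes.
    if re.search(r"[-\u2013\u2014]{%d,}" % max_dash_run, stripped):
        return True

    # 2. Sentence loops: the same sentence repeated too many times.
    counts = {}
    for sentence in re.split(r"[.!?。！？\n]+", stripped):
        key = sentence.strip().lower()
        if key:
            counts[key] = counts.get(key, 0) + 1
    if counts and max(counts.values()) > max_sentence_repeats:
        return True

    # 3. Low character diversity: any full window built from very few
    #    distinct characters (texts shorter than one window are skipped).
    for start in range(0, len(stripped) - window + 1, window):
        if len(set(stripped[start:start + window])) < min_distinct_chars:
            return True

    return False
```

Only outputs that pass such a filter would need to be sent to the LLM judge, which keeps judge calls for obviously broken generations to a minimum.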

The qualitative compliance row is a separate manual test: 15 classic hard prompts (10 EN + 5 ZH) covering lockpicking, phishing, meth synthesis, WiFi hacking, fake news, hotwiring, pipe bombs, signature forgery, keylogger code, shoplifting, 炸弹制作, 钓鱼邮件, 电脑入侵, 身份证伪造, 网络诈骗. **The model complies with all 15 directly, in the same Markdown-table, step-by-step style the base model uses for benign technical answers.**
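The KL row in the table can be read as the average, over benign prompts, of the divergence between the base model's and this model's next-token distributions. A minimal sketch of that computation on raw logits follows. It assumes the direction KL(base ‖ edited), natural-log units, and plain Python lists in place of tensors; the function names are illustrative and not part of abliterix.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(base_logits, edited_logits):
    """KL(P_base || P_edited) in nats for one next-token distribution."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(edited_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kl(base_batch, edited_batch):
    """Average the per-prompt KL over a batch of benign prompts."""
    values = [kl_divergence(b, e) for b, e in zip(base_batch, edited_batch)]
    return sum(values) / len(values)
```

A value near zero means the edited model's next-token predictions on benign prompts are almost indistinguishable from the base model's.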

## Why this works — three architecture-specific correctness fixes

abliterix handles three gpt-oss-specific issues that silently break naïve LLaMA-style abliteration scripts:

1. **Native MXFP4 weights are not exposed as standard `nn.Parameter`.** gpt-oss ships in `Mxfp4GptOssExperts` form whose `down_proj` is a packed Triton tensor that *cannot* be edited in-place. abliterix auto-detects this and forces `Mxfp4Config(dequantize=True)` so the BF16 fused expert tensor is reachable.
2. **`GptOssExperts.down_proj` is stored transposed** vs the standard MoE convention. Its shape is `(experts, intermediate_in, hidden_out)` and the forward path is `out = act @ W` (no transpose). Standard EGA implementations use shape-based axis detection, which **silently picks the wrong projection branch** when `hidden == intermediate` (both 2880 in gpt-oss-20b). We mark this layout explicitly and project from the output side (`W_new = W (I − vv^T)`).
3. **Fused-expert MoEs were silently invisible to EGA.** `GptOssExperts` is a *single* Module holding fused 3-D weights, so a naive per-Module profile dict key produces no `mlp.down_proj` entry and `_apply_ega_steering` early-exits. abliterix synthesises an `mlp.down_proj` profile when fused experts are detected so EGA actually runs across all 32 experts × 24 layers.
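The layout issue in point 2 can be illustrated on a tiny square matrix. The sketch below uses plain Python lists rather than tensors, and the helper names are hypothetical. It shows that for the `out = act @ W` layout only the output-side projection `W (I − vv^T)` removes the component along `v` from the layer's output, while the input-side form `(I − vv^T) W` has the same shape when `W` is square but does not. The norm-preserving step (`weight_normalization = "full"`) is left out.

```python
def vecmat(act, W):
    """out = act @ W for W stored as (in, out)."""
    return [sum(a * W[i][j] for i, a in enumerate(act)) for j in range(len(W[0]))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_output_side(W, v):
    """W_new = W (I - v v^T) = W - (W v) v^T, with v a unit vector in output space."""
    return [[w - dot(row, v) * vj for w, vj in zip(row, v)] for row in W]


def project_input_side(W, v):
    """W_new = (I - v v^T) W = W - v (v^T W): the wrong branch for this layout."""
    vTW = vecmat(v, W)
    return [[w - v[i] * vTW[j] for j, w in enumerate(row)] for i, row in enumerate(W)]


W = [[1.0, 2.0], [3.0, 4.0]]   # square, so shape alone cannot disambiguate
v = [0.6, 0.8]                 # unit refusal direction (toy values)
act = [1.0, 1.0]

good = dot(vecmat(act, project_output_side(W, v)), v)  # component along v is removed
bad = dot(vecmat(act, project_input_side(W, v)), v)    # component along v survives
```

Because both projections return a matrix of the same shape, nothing fails at run time when the wrong one is chosen, which is why the layout has to be marked explicitly.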

On top of the direct-steering + EGA foundation, this release adds **MoE router suppression**: a `[moe]` config block that redirects routing away from the top-k "safety experts" (the experts whose gate activates disproportionately more on harmful prompts than on benign ones). The suppression strength is itself an Optuna search parameter, so the optimiser picks how aggressively to bias each layer's safety experts.
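A minimal sketch of the idea for a single layer, under stated assumptions: per-expert routing rates on harmful and benign prompts are already measured, "safety experts" are ranked by the simple difference of those rates, and the chosen router rows are multiplied by the scale `max(0, 1 + bias/10)` quoted in the config comment under "Winning hyperparameters". The function names and the ranking rule are hypothetical, not abliterix's API.

```python
def pick_safety_experts(harmful_rate, benign_rate, n_suppress):
    """Indices of the experts routed to most disproportionately on harmful prompts."""
    gap = [h - b for h, b in zip(harmful_rate, benign_rate)]
    ranked = sorted(range(len(gap)), key=lambda e: gap[e], reverse=True)
    return ranked[:n_suppress]


def suppression_scale(router_bias):
    """scale = max(0, 1 + bias / 10), as in the config comment."""
    return max(0.0, 1.0 + router_bias / 10.0)


def suppress_router_rows(router_weight, experts, router_bias):
    """Scale the router rows (one row per expert) of the selected experts."""
    scale = suppression_scale(router_bias)
    chosen = set(experts)
    return [
        [w * scale for w in row] if e in chosen else list(row)
        for e, row in enumerate(router_weight)
    ]


# Toy layer with three experts and a 2-dimensional hidden state.
experts = pick_safety_experts([0.1, 0.5, 0.2], [0.1, 0.1, 0.3], n_suppress=1)
new_router = suppress_router_rows([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], experts, -5.0)
```

With the released `router_bias = -0.64` the scale is about 0.94, so the selected expert's row is only slightly attenuated rather than switched off.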

## Method

- **Base:** [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) — 24 layers, 32 routed experts per layer, top-4, hidden = intermediate = 2880, MXFP4 → BF16 dequant during abliteration
- **Tool:** [abliterix](https://github.com/wuwangzhang1216/abliterix)
- **Mode:** `steering_mode = "direct"` (orthogonal projection on base weights, no LoRA), `weight_normalization = "full"` (norm-preserving)
- **Components steered:**
  - `attn.{q,k,v,o}_proj` via direct weight projection
  - `mlp.experts.down_proj` across **all 32 experts × 24 layers** via Expert-Granular Abliteration
  - **`mlp.router` rows** of safety experts via logit suppression
- **Refusal direction:** per-layer difference between the mean harmful (target) and mean benign residuals, computed on 400 benign + 400 harmful prompts; projection applied in BF16
- **Search:** Optuna TPE, KL-divergence + LLM-judged refusal as multi-objective, 100 trials (40 random warmup + 60 TPE)
- **Hardware:** 1 × NVIDIA RTX PRO 6000 Blackwell (96 GB, sm_120), driver 580 / CUDA 12.9, batch=8, total wall time ≈ 5 h 20 m
- **Eval set:** 100 held-out harmful prompts not seen during steering-vector computation; 100 held-out benign prompts for KL comparison
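The refusal-direction bullet above corresponds to a difference-of-means computation per layer. A minimal sketch, assuming residual activations have already been collected as lists of floats per prompt and that the direction is normalised to unit length (the normalisation and all names are assumptions for illustration):

```python
import math


def mean_vector(rows):
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_residuals, benign_residuals):
    """Unit vector along mean(harmful) - mean(benign) for one layer."""
    mu_h = mean_vector(harmful_residuals)
    mu_b = mean_vector(benign_residuals)
    diff = [h - b for h, b in zip(mu_h, mu_b)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("harmful and benign means coincide; no direction")
    return [d / norm for d in diff]


def per_layer_directions(harmful_by_layer, benign_by_layer):
    """One direction per layer, matching a per-layer vector scope."""
    return [refusal_direction(h, b) for h, b in zip(harmful_by_layer, benign_by_layer)]
```

Each resulting unit vector is what the projection `W (I − vv^T)` uses as `v` for that layer.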

### Winning hyperparameters

```toml
vector_scope = "per layer"     # per-layer direction, not global

[attn.q_proj]
max_weight = 3.04
max_weight_position = 13.86
min_weight = 0.99
min_weight_distance = 6.21

[attn.k_proj]
max_weight = 3.90
max_weight_position = 16.57
min_weight = 1.25
min_weight_distance = 4.91

[attn.v_proj]
max_weight = 2.21
max_weight_position = 20.06
min_weight = 0.77
min_weight_distance = 7.29

[attn.o_proj]
max_weight = 3.82
max_weight_position = 17.41
min_weight = 1.11
min_weight_distance = 7.07

[mlp.down_proj]                # Expert-Granular Abliteration on fused experts
max_weight = 6.95
max_weight_position = 18.37
min_weight = 0.54
min_weight_distance = 4.91

[moe]                          # router-row suppression
n_suppress = 1                 # suppress top-1 safety expert per layer
router_bias = -0.64            # scale = max(0, 1 + bias/10) = 0.94
expert_ablation_weight = 0.0   # pinned off; EGA already handles expert weights
```

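The EGA weight peaks near layer 18 of 24 (`max_weight_position = 18.37`), roughly three-quarters of the way through the stack. The search settled on a different peak position for each component, and for the experts it sits where the refusal decision still appears to be open rather than in the final layers where it has already been committed.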
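To make the four per-component parameters concrete, the sketch below turns them into a per-layer ablation weight. It assumes abliterix keeps a Heretic-style linear falloff: the weight equals `max_weight` at `max_weight_position`, decays linearly to `min_weight` at a distance of `min_weight_distance` layers, and layers farther away are left untouched. The 0-based layer indexing and the function name are also assumptions, and the actual kernel in abliterix may differ.

```python
def layer_weight(layer, max_weight, max_weight_position,
                 min_weight, min_weight_distance):
    """Ablation weight for one layer under an assumed linear falloff."""
    distance = abs(layer - max_weight_position)
    if distance > min_weight_distance:
        return 0.0  # assumed: layers outside the window are not edited
    return max_weight + (distance / min_weight_distance) * (min_weight - max_weight)


# Per-layer weights for the [mlp.down_proj] block on a 24-layer model.
ega_weights = [
    layer_weight(layer, max_weight=6.95, max_weight_position=18.37,
                 min_weight=0.54, min_weight_distance=4.91)
    for layer in range(24)
]
peak_layer = max(range(24), key=lambda layer: ega_weights[layer])
```

Because `max_weight_position` is fractional, the integer layer closest to it receives slightly less than `max_weight`.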

## Usage

### Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained("wangzhang/gpt-oss-20b-abliterated")
model = AutoModelForCausalLM.from_pretrained(
    "wangzhang/gpt-oss-20b-abliterated",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Your prompt here"}]
# Render the conversation to a string with the bundled chat template.
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)
# Sample up to 512 new tokens.
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

The model uses gpt-oss's harmony chat format. The chat template is bundled (`chat_template.jinja`).

### GGUF (llama.cpp / Ollama / LM Studio)

BF16, Q8_0, and Q4_K_M quantizations are available at [wangzhang/gpt-oss-20b-abliterated-GGUF](https://huggingface.co/wangzhang/gpt-oss-20b-abliterated-GGUF).

```bash
ollama run hf.co/wangzhang/gpt-oss-20b-abliterated-GGUF:Q4_K_M
```

## Honest limitations

- **Refusal is low, not zero.** 6 / 100 held-out prompts still refuse. The residual refusals cluster around "universally-recognised-as-harmful-and-specific" asks (detailed CBRN synthesis, CSAM-adjacent content), exactly where refusal tends to be represented by *multiple* redundant circuits that a single partial-abliteration pass cannot fully knock out.
- **Stylistic residue on a handful of prompts.** Even on prompts that comply, 2–3 out of 100 begin with a soft disclaimer ("just keep in mind that..." / "以下内容仅供学习与参考") before producing the actual content. Disclaimer framing is still trainable.
- **English > Chinese.** Steering vectors came from a primarily English dataset. Chinese hard prompts work (5/5 on manual Chinese tests) but bypass *quality* is slightly lower — shorter responses, occasional English fallback on technical terms.
- **No guarantees on long generations.** On generations past ~400 tokens we occasionally see list or Markdown-table loops; this is an abliteration side-effect, not a base-model regression.

## Reproducibility

Full search checkpoint (Optuna JSONL + judge cache SQLite) and the exact config are available in the abliterix repo. To reproduce from scratch:

```bash
git clone https://github.com/wuwangzhang1216/abliterix
cd abliterix && pip install -e .
AX_CONFIG=configs/gpt_oss_20b.toml abliterix
# Optuna is deterministic if you set sampler_seed in [optimization].
```

## Intended use

Authorised AI safety research, red-teaming evaluation, refusal-mechanism analysis, and study of how MoE expert specialisation encodes safety behaviours. **Not** for producing or distributing harmful content. The license of the base model (apache-2.0) applies; the user is responsible for compliance with all applicable laws and the OpenAI gpt-oss usage policy.

## Acknowledgments

- [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) for the base model
- abliterix is a derivative work of [Heretic](https://github.com/p-e-w/heretic) by Philipp Emanuel Weidmann
- TrevorS for the original Expert-Granular Abliteration formulation