ModelHub XC c1d8c5704b Initialize project; model provided by the ModelHub XC community
Model: dmaheshwar22/qwen-1.5b-coder-grpo-scratch-step200
Source: Original Platform
2026-05-08 18:33:37 +08:00

---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- code
- python
- reinforcement-learning
- grpo
- rlvr
- verifiable-rewards
datasets:
- evalplus/mbppplus
- evalplus/humanevalplus
---
# Qwen-2.5-Coder-1.5B — GRPO from base, 200 steps
GRPO-trained variant of `Qwen/Qwen2.5-Coder-1.5B-Instruct`, optimized with
**verifiable rewards from sandboxed test execution** — same family of
techniques behind DeepSeek-R1 and Kimi-K1.5, scaled down to a 1.5B model
that fits on a single 24 GB GPU.
> **Honest framing:** this is a **pipeline-validation run**, not the
> headline checkpoint. Trained from base (no SFT warm-start), 200 steps,
> single A100. Pass@1 on HumanEval+ is essentially tied with the SFT
> baseline — see [Results](#results) for numbers. The headline run
> (SFT-warmstarted, 500+ steps) is forthcoming as a separate model.
## Training setup
| | |
|---|---|
| Base model | `Qwen/Qwen2.5-Coder-1.5B-Instruct` |
| RL algorithm | GRPO (group-relative policy optimization) |
| RL framework | [verl](https://github.com/volcengine/verl) v0.7.0 |
| Rollout engine | vLLM |
| Group size (samples per prompt) | 8 |
| Train batch | 32 prompts × 8 rollouts = 256 candidates per step |
| Learning rate | 1e-6 |
| KL coefficient (loss-side) | 0.04, low-variance KL |
| KL coefficient (reward-side) | 0.001 |
| Temperature (rollout) | 1.0 |
| Total steps | 200 |
| Hardware | 1× A100 80 GB (FSDP with CPU offload) |
| Warm-start | None — trained from base instruct model |
| Training data | 319 MBPP-train prompts (MBPP+ contamination removed) |
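
As a rough illustration of two rows in the table above, the sketch below shows how group-relative advantages can be computed for one group of 8 rollouts, and one common form of a low-variance per-token KL estimate (often called "k3"). This is a minimal sketch and not the verl implementation: the function names, the `eps` value, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one prompt's rollout rewards against their own group.

    Assumption: population std plus a small eps; real trainers may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def low_variance_kl(logp, ref_logp):
    """Per-token 'k3' estimate: exp(d) - d - 1 with d = ref_logp - logp.

    The value is non-negative and equals 0 when both log-probs match.
    """
    d = ref_logp - logp
    return math.exp(d) - d - 1.0


# One group of 8 rollouts for a single prompt (reward values are made up).
rewards = [0.0, 0.25, 0.5, 1.0, 1.0, 0.75, 0.0, 1.1]
advantages = group_relative_advantages(rewards)

# Batch arithmetic from the table: 32 prompts x 8 rollouts per step.
prompts_per_step, group_size = 32, 8
candidates_per_step = prompts_per_step * group_size
print(candidates_per_step, [round(a, 2) for a in advantages])
```

Because each advantage is measured against the mean of its own group, a prompt whose 8 rollouts all receive the same reward contributes zero advantage for that step.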
### Reward function
Each rollout is scored by a composite reward executed in a sandboxed
Docker container running pytest:
- **Test-pass rate** (primary signal) — fraction of hidden tests passing
- **Lint reward** — `ruff` clean code bonus
- **Length penalty** — anti-verbosity
- **Compile-error penalty** — hard penalty for non-runnable code

All numeric rewards are bounded in `[0, ~1.1]` to keep GRPO group-relative
advantages well-scaled.
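
The components above could be combined along the following lines. This is a hedged sketch only: the real logic lives in the companion repo's `grpo_reward.py`, and every weight, threshold, and name here (`LINT_BONUS`, `LENGTH_PENALTY`, `MAX_LINES`, `RolloutResult`, `composite_reward`) is a hypothetical placeholder chosen so the result stays inside the stated `[0, ~1.1]` range.

```python
from dataclasses import dataclass

# Hypothetical placeholder values, not the project's actual settings.
LINT_BONUS = 0.1        # bonus for lint-clean code
LENGTH_PENALTY = 0.1    # deducted when the answer is overly long
MAX_LINES = 60          # length threshold for the penalty
REWARD_MIN, REWARD_MAX = 0.0, 1.1


@dataclass
class RolloutResult:
    compiled: bool      # did the candidate compile/import at all?
    tests_passed: int   # hidden tests that passed in the sandbox
    tests_total: int    # hidden tests that were run
    lint_clean: bool    # e.g. the linter reported no issues
    num_lines: int      # length of the generated solution


def composite_reward(r: RolloutResult) -> float:
    """Combine test-pass rate, lint bonus, and penalties into one scalar."""
    if not r.compiled:
        return REWARD_MIN  # hard penalty: non-runnable code gets the floor
    reward = r.tests_passed / r.tests_total if r.tests_total else 0.0
    if r.lint_clean:
        reward += LINT_BONUS
    if r.num_lines > MAX_LINES:
        reward -= LENGTH_PENALTY
    # Clip so group-relative advantages stay on a predictable scale.
    return min(REWARD_MAX, max(REWARD_MIN, reward))


best = composite_reward(RolloutResult(True, 4, 4, True, 12))
broken = composite_reward(RolloutResult(False, 0, 4, False, 12))
print(best, broken)
```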
## Results
Sampled decoding (n=5 samples per problem, temperature=0.2), evaluated with
[evalplus](https://github.com/evalplus/evalplus):
| Setup | HumanEval+ pass@1 | HumanEval+ pass@5 |
|---|---|---|
| Qwen-2.5-Coder-1.5B base | 0.627 | — |
| + SFT (3 epochs LoRA on rejection-sampled MBPP) | 0.638 | — |
| **+ GRPO from base, 200 steps (this model)** | **0.6415** | **0.6890** |
| + SFT + Coordinator retry (max 3 rounds) | 0.677 | — |
| Best-of-8 @ T=0.7 (oracle ceiling) | 0.783 | — |

**Reading the numbers honestly:**
- +1.4 pts over base, +0.4 pts over SFT; the SFT delta is within noise
(no paired bootstrap significance at p<0.05 on 164 problems).
- Pass@5 = 0.689 implies the policy *can* solve ~69% of HumanEval+; the
4.7-pt gap to pass@1 says it picks the wrong sample at temp=0.2 about
5% of the time. There's headroom: this is not a converged policy.
- Most likely cause of the modest lift: trained from base instead of
SFT-warmstarted, and stopped at step 200. The
[WEEK4_GUIDE](https://github.com/Devesh-Maheshwari/verifiable-rl-coder/blob/main/docs/WEEK4_GUIDE.md)
projection of 0.68–0.75 assumes both SFT warm-start and 500–1000 steps.
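
For readers unfamiliar with how pass@1 and pass@5 are derived from n=5 samples, the sketch below uses the standard unbiased pass@k estimator, `1 - C(n-c, k) / C(n, k)`, averaged over problems. The per-problem correct counts are made-up illustration data, not results from this model, and the helper name `pass_at_k` is an assumption rather than an evalplus API.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


n = 5
per_problem_correct = [5, 2, 0, 1]  # hypothetical counts out of n=5 samples

pass1 = sum(pass_at_k(n, c, 1) for c in per_problem_correct) / len(per_problem_correct)
pass5 = sum(pass_at_k(n, c, 5) for c in per_problem_correct) / len(per_problem_correct)

# Gap between the reported pass@5 and pass@1 for this checkpoint.
gap = 0.6890 - 0.6415
print(pass1, pass5, round(gap, 4))
```

With k=1 the estimator reduces to the fraction of correct samples per problem, which is why pass@1 reflects how often a single draw is right rather than whether the problem is solvable at all.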
## Intended use
- **Research and education:** a concrete reference for end-to-end GRPO with
verifiable rewards on a small open coder. Reward function, sandbox, and
training config are all open-source in the
[companion repo](https://github.com/Devesh-Maheshwari/verifiable-rl-coder).
- **NOT for production inference:** at 0.64 pass@1 it is no stronger than
the SFT baseline; use the headline run (forthcoming) for that.
## How to use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dmaheshwar22/qwen-1.5b-coder-grpo-scratch-step200"
tok = AutoTokenizer.from_pretrained(model_id)
# On older transformers releases, pass torch_dtype="bfloat16" instead of dtype.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="bfloat16", device_map="auto")

prompt = "Write a Python function `is_prime(n: int) -> bool` that returns True iff n is prime."
messages = [{"role": "user", "content": prompt}]
# return_dict=True yields input_ids plus attention_mask for generate().
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256, temperature=0.2, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
Or with vLLM for batched evaluation:
```python
from vllm import LLM, SamplingParams

llm = LLM(model="dmaheshwar22/qwen-1.5b-coder-grpo-scratch-step200",
          gpu_memory_utilization=0.5, dtype="bfloat16")
# Raw completion-style prompt (no chat template applied).
out = llm.generate(["def is_prime(n):"],
                   SamplingParams(max_tokens=256, temperature=0.2))
print(out[0].outputs[0].text)
```
## Limitations
- **Coding-only.** Trained on MBPP-style Python tasks; do not expect
general-purpose chat or reasoning quality outside coding.
- **Output format.** Despite the in-training markdown-fence stripping fix,
the policy occasionally still wraps code in ```` ``` ````. Strip fences
in any downstream evaluator.
- **Sandbox required for reward replay.** Reproducing the training reward
signal requires running pytest in a Docker container with the project's
resource limits — see the companion repo's `sandbox/runner.py`.
- **Not safety-tuned.** Inherits all behaviors of the base instruct model.
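
For the output-format limitation above, a downstream evaluator might strip fences with something like the following. This is a minimal sketch under stated assumptions, not the project's in-training fix: the helper name `strip_code_fences` is hypothetical, it keeps only the first fenced block, and it returns the text unchanged when no complete fence is found.

```python
import re

# Built indirectly so this example nests cleanly inside a Markdown code block.
FENCE = "`" * 3
# Opening fence with optional language tag, then the body up to the next fence.
FENCE_RE = re.compile(FENCE + r"[^\n`]*\n(.*?)" + FENCE, re.DOTALL)


def strip_code_fences(text: str) -> str:
    """Return the body of the first fenced block, or the text unchanged."""
    match = FENCE_RE.search(text)
    return match.group(1) if match else text


raw = f"Here is the solution:\n{FENCE}python\ndef add(a, b):\n    return a + b\n{FENCE}\n"
clean = strip_code_fences(raw)
print(clean)
```

An unclosed fence is deliberately left untouched here, so truncated generations surface as failures in the evaluator instead of being silently repaired.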
## Reproduction
Full training config, reward function, and sandbox runner are in the
companion repo:
- Repo: <https://github.com/Devesh-Maheshwari/verifiable-rl-coder>
- Training config: `configs/grpo_qwen1_5b.yaml`
- Reward fn: `src/verifiable_rl_coder/training/grpo_reward.py`
- Sandbox: `src/verifiable_rl_coder/sandbox/runner.py`
- CHTC submit scripts: `chtc/train_grpo.{sub,sh}`, `chtc/submit_train_grpo.sh`
## Citation
```bibtex
@misc{verifiable-rl-coder-2026,
  author = {Maheshwari, Devesh},
  title = {Verifiable-RL Coder: GRPO-trained Qwen-2.5-Coder with sandboxed verifiable rewards},
  year = {2026},
  howpublished = {\url{https://github.com/Devesh-Maheshwari/verifiable-rl-coder}}
}
```
## Acknowledgments
- [Qwen team](https://huggingface.co/Qwen) for the strong open-weights coder base.
- [verl](https://github.com/volcengine/verl) for the production-grade GRPO trainer.
- [evalplus](https://github.com/evalplus/evalplus) for the hardened HumanEval+/MBPP+ benchmarks.
- Trained on UW-Madison Center for High Throughput Computing (CHTC) resources.