---
license: mit
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- debugging
- tool-use
- multi-turn
- reinforcement-learning
- grpo
pipeline_tag: text-generation
---
# DSL Debug 7B — RL-Only Step 30

Qwen2.5-7B-Instruct trained with GRPO (Group Relative Policy Optimization) directly from the base model, with no SFT warmup.

- Blog post: Multi-Turn RL for Code Debugging
- Code + environment: [github.com/AndrewLngdn/dsl-debug](https://github.com/AndrewLngdn/dsl-debug)
## Training

- Method: GRPO with multi-turn tool use (verl 0.7 + sglang 0.5.6)
- Base model: Qwen2.5-7B-Instruct (no SFT warmup)
- Steps: 30 (batch size 512, 8 rollouts per prompt)
- Learning rate: 1e-5 with cosine schedule
- Reward: binary (1.0 if the submitted code matches the expected output, 0.0 otherwise)
- Hardware: 2x A100-SXM4-80GB
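The binary reward and GRPO's group-relative advantage can be sketched as follows. This is a minimal illustration under stated assumptions, not verl's actual implementation: the function names and the normalization epsilon are invented for clarity, and real GRPO applies these advantages inside a clipped policy-gradient loss.

```python
# Minimal sketch: binary reward + group-relative advantage (GRPO-style).
# Hypothetical helper names; the training run used verl's GRPO, not this code.

def binary_reward(submitted_output: str, expected_output: str) -> float:
    """1.0 if the submitted code's output matches the expected output, else 0.0."""
    return 1.0 if submitted_output.strip() == expected_output.strip() else 0.0

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against the other rollouts of the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# 8 rollouts per prompt, as in training; here 2 of 8 rollouts succeed.
outputs = ["42", "41", "42", "", "0", "41", "x", ""]
rewards = [binary_reward(o, "42") for o in outputs]
advs = group_relative_advantages(rewards)
# Successful rollouts get positive advantage, failures negative,
# and the advantages within a group sum to (approximately) zero.
```

With a binary reward this normalization is what carries the learning signal: a group where all 8 rollouts fail (or all succeed) yields zero advantage everywhere, so only mixed-outcome groups produce gradient.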
## Results (held-out test, one-shot)

| Split | Base Model | This Model |
|---|---|---|
| Standard (481) | 50.5% | 78.8% |
| Nonlocal (200) | 12.0% | 54.0% |
| Intent-Mismatch (177) | 0.6% | 14.7% |
## Alignment Tax

| Benchmark | Base | This Model |
|---|---|---|
| MMLU (5-shot) | 74.6% | 74.7% |
| GSM8K (8-shot) | 84.9% | 84.4% |
| HumanEval (0-shot) | 65.9% | 59.1% |
## Usage
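A minimal sketch of loading the model with the standard `transformers` chat API. The repo id below is a placeholder (this card does not state the hub id), and the multi-turn tool loop used during training is not shown here; see the linked dsl-debug repo for the full environment.

```python
# Sketch only: substitute the actual hub repo id for the placeholder below.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<this-model-repo-id>"  # placeholder, not the real id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "user", "content": "Debug this program so it prints 42:\nprint(6 * 7 +)"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```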
## Related Models