---
license: mit
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- debugging
- tool-use
- multi-turn
- reinforcement-learning
- grpo
datasets:
- custom
language:
- en
pipeline_tag: text-generation
---

# DSL Debug 7B — RL-Only Step 30

Qwen2.5-7B-Instruct trained with GRPO (Group Relative Policy Optimization) directly from the base model, with no SFT warmup.

**Blog post:** [Multi-Turn RL for Code Debugging](https://andrewlngdn.github.io/dsl_debugger/)

**Code + environment:** [github.com/AndrewLngdn/dsl-debug](https://github.com/AndrewLngdn/dsl-debug)

## Training

- **Method**: GRPO with multi-turn tool use (verl 0.7 + sglang 0.5.6)
- **Base model**: Qwen2.5-7B-Instruct (no SFT warmup)
- **Steps**: 30 (batch size 512, 8 rollouts per prompt)
- **LR**: 1e-5, cosine schedule
- **Reward**: Binary (1.0 if the submitted code's output matches the expected output, 0.0 otherwise)
- **Hardware**: 2x A100-SXM4-80GB

## Results (held-out test, one-shot)

| Split | Base Model | This Model |
|-------|:---:|:---:|
| Standard (481) | 50.5% | **78.8%** |
| Nonlocal (200) | 12.0% | **54.0%** |
| Intent-Mismatch (177) | 0.6% | **14.7%** |

## Alignment Tax

| Benchmark | Base | This Model |
|-----------|:---:|:---:|
| MMLU (5-shot) | 74.6% | 74.7% |
| GSM8K (8-shot) | 84.9% | 84.4% |
| HumanEval (0-shot) | 65.9% | 59.1% |

## Usage

```python
from huggingface_hub import snapshot_download

snapshot_download(
    "andrewlngdn/dsl-debug-7b-rl-only-step30",
    local_dir="/workspace/models/rl_only_step30",
)
```

## Related Models

| Model | Repo |
|-------|------|
| **SFT then RL step 35 (best)** | [andrewlngdn/dsl-debug-7b-sft-rl](https://huggingface.co/andrewlngdn/dsl-debug-7b-sft-rl) |
| SFT step 100 | [andrewlngdn/dsl-debug-7b-sft-step100](https://huggingface.co/andrewlngdn/dsl-debug-7b-sft-step100) |
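## How the reward and GRPO advantage interact

The training recipe pairs a binary match reward with GRPO's group-relative advantage: each prompt gets 8 rollouts, and each rollout's reward is normalized against its own group. A minimal sketch of that idea (function names and the whitespace-stripped comparison are illustrative assumptions, not the repo's actual implementation):

```python
def binary_reward(submitted_output: str, expected_output: str) -> float:
    """1.0 only on an exact output match, 0.0 otherwise; no partial credit.
    (Stripping surrounding whitespace is an assumption, not the repo's rule.)"""
    return 1.0 if submitted_output.strip() == expected_output.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantage as in GRPO: mean-center and std-scale each
    rollout's reward within its group sampled from the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# 8 rollouts for one prompt: 3 fixed the bug, 5 did not.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
# Successful rollouts get positive advantage, failed ones negative.
```

This illustrates why a 0/1 reward still yields a learning signal: any mix of successes and failures within a group produces nonzero advantages, while an all-pass or all-fail group contributes nothing.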