---
license: mit
base_model: Qwen/Qwen2.5-7B-Instruct
pipeline_tag: text-generation
---
# DSL Debug 7B — SFT Step 100
Qwen2.5-7B-Instruct fine-tuned on 1,593 debugging trajectories for the DSL Debug environment.
- Blog post: Multi-Turn RL for Code Debugging
- Code + environment: github.com/AndrewLngdn/dsl-debug
## Training
- Method: Supervised fine-tuning (verl 0.7)
- Data: 1,593 multi-turn trajectories with tool calls (run, inspect, read_docs, submit)
- Base model: Qwen2.5-7B-Instruct
- Epochs: 2 (step 100 checkpoint)
- LR: 5e-6
- Hardware: 2x A100-SXM4-80GB
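The trajectories are multi-turn conversations with tool calls. A minimal sketch of what one assistant turn and its tool result might look like; the tool names (run, inspect, read_docs, submit) come from the dataset description above, but the argument and result schemas here are hypothetical (the real format is defined in the linked repo):

```python
import json

# Hypothetical shape of one assistant turn in a training trajectory.
# Only the tool names are from the model card; the argument schema
# below is illustrative, not the repo's actual one.
turn = {
    "role": "assistant",
    "tool_calls": [
        {
            "name": "run",  # one of: run, inspect, read_docs, submit
            "arguments": json.dumps({"program": "add 1 2"}),
        }
    ],
}

# The tool-result turn that would follow the call above.
result = {
    "role": "tool",
    "name": "run",
    "content": json.dumps({"stdout": "3", "exit_code": 0}),
}
```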
## Results (held-out test, one-shot)
| Split | Base Model | This Model |
|---|---|---|
| Standard (481) | 50.5% | 56.3% |
| Nonlocal (200) | 12.0% | 40.0% |
| Intent-Mismatch (177) | 0.6% | 7.9% |
## Alignment Tax
| Benchmark | Base | This Model |
|---|---|---|
| MMLU (5-shot) | 74.6% | 74.6% |
| GSM8K (8-shot) | 84.9% | 83.9% |
| HumanEval (0-shot) | 65.9% | 62.2% |
## Usage
This checkpoint primarily serves as the starting point for subsequent RL training (GRPO); the resulting SFT-then-RL model achieves the best results.
```python
from huggingface_hub import snapshot_download

snapshot_download(
    "andrewlngdn/dsl-debug-7b-sft-step100",
    local_dir="/workspace/models/sft_7b_step100",
)
```
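After downloading, it can be worth a quick sanity check that the snapshot contains the core files a transformers-style checkpoint needs before pointing a trainer at it. This helper is not part of the repo, just a convenience sketch:

```python
from pathlib import Path


def checkpoint_looks_complete(local_dir: str) -> bool:
    """Check that a downloaded HF snapshot has the core files of a
    transformers-style checkpoint. Purely a local sanity check; it does
    not validate the weights themselves."""
    root = Path(local_dir)
    has_config = (root / "config.json").is_file()
    # Weights may be sharded, so accept any .safetensors (or legacy .bin) file.
    has_weights = any(root.glob("*.safetensors")) or any(root.glob("*.bin"))
    return has_config and has_weights


# Example: checkpoint_looks_complete("/workspace/models/sft_7b_step100")
```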
## Related Models
| Model | Repo |
|---|---|
| SFT then RL step 35 (best) | andrewlngdn/dsl-debug-7b-sft-rl |
| RL-only step 30 | andrewlngdn/dsl-debug-7b-rl-only-step30 |