Qwen2.5-7B-Instruct fine-tuned with SFT then GRPO reinforcement learning to debug programs in a custom dataflow DSL. This is the best-performing checkpoint (step 35 of 40 RL steps).
Two-stage training on 2x A100-80GB using verl 0.7:
SFT: 1,593 expert trajectories from GPT-5-mini, full-parameter fine-tuning, LR=5e-6, 2 epochs (checkpoint at step 100)
GRPO: 6,420 RL problems, LR=1e-5 with cosine decay, batch of 512 prompts x 8 rollouts, 40 steps, no KL penalty
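For context, GRPO uses no value network (and here no KL penalty): each rollout's reward is normalized against the other rollouts sampled for the same prompt. A minimal sketch of that group-relative advantage, independent of verl's implementation (the function name and example rewards are illustrative):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward by the mean/std of its prompt's rollout group."""
    rewards = np.asarray(rewards, dtype=np.float32)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# One prompt's group of 8 rollouts (a batch is 512 such groups per RL step above)
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]))
```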
The Task
The model debugs programs in a custom pipe-based dataflow DSL. Each episode provides the buggy program and its expected output, and the model gets up to 8 turns and 4 tools to find and fix the bug.
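Purely as an illustration, one episode's setup might look like the sketch below; the prompt wording, placeholder DSL program, and loop structure are assumptions rather than the harness's actual schema, and the four tools are not modeled here.

```python
# Hypothetical episode setup (assumed format): the buggy program and its expected
# output go into the prompt, and the model gets up to 8 turns to propose a fix.
MAX_TURNS = 8

buggy_program = "read data.csv | filter amount > 0 | sum amount"  # placeholder DSL code
expected_output = "42"                                            # placeholder output

messages = [
    {"role": "system", "content": "You debug programs in a pipe-based dataflow DSL."},
    {"role": "user",
     "content": f"Buggy program:\n{buggy_program}\n\nExpected output:\n{expected_output}"},
]

for turn in range(MAX_TURNS):
    # Each turn: query the model (see the serving example below), run any tool
    # calls it makes, append the results to `messages`, and stop once it submits a fix.
    pass
```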
```python
# With sglang
from sglang import RuntimeEndpoint
import sglang as sgl

runtime = RuntimeEndpoint("http://localhost:30000")

# Or download and serve
from huggingface_hub import snapshot_download

snapshot_download("andrewlngdn/dsl-debug-7b-sft-rl", local_dir="./model")
```
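With a server running behind the endpoint above, the model can also be queried over sglang's OpenAI-compatible API; a minimal sketch (the message content is a placeholder, not the evaluation harness's actual prompt):

```python
# Minimal query against the sglang server assumed above (OpenAI-compatible API)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="andrewlngdn/dsl-debug-7b-sft-rl",
    messages=[{
        "role": "user",
        "content": "Buggy program:\n<DSL code>\n\nExpected output:\n<output>\n\nFind and fix the bug.",
    }],
    temperature=0.0,
)
print(resp.choices[0].message.content)
```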
```bash
# Using the dsl-debug CLI
pip install dsl-debug
dsl-debug sglang                 # downloads and serves this model
dsl-debug eval --split standard  # evaluate on test set
```