
---
license: mit
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- debugging
- tool-use
- multi-turn
- sft
datasets:
- custom
language:
- en
pipeline_tag: text-generation
---
# DSL Debug 7B — SFT Step 100
Qwen2.5-7B-Instruct fine-tuned on 1,593 debugging trajectories for the DSL Debug environment.
**Blog post:** [Multi-Turn RL for Code Debugging](https://andrewlngdn.github.io/dsl_debugger/)
**Code + environment:** [github.com/AndrewLngdn/dsl-debug](https://github.com/AndrewLngdn/dsl-debug)
## Training
- **Method**: Supervised fine-tuning (verl 0.7)
- **Data**: 1,593 multi-turn trajectories with tool calls (run, inspect, read_docs, submit)
- **Base model**: Qwen2.5-7B-Instruct
- **Epochs**: 2 (step 100 checkpoint)
- **LR**: 5e-6
- **Hardware**: 2x A100-SXM4-80GB
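
Each trajectory interleaves assistant tool calls with tool outputs. As an illustrative sketch only — the field names and message schema here are assumptions, not the repo's actual format (see the linked code for the real environment definition) — one training example might look like:

```python
# Hypothetical sketch of one multi-turn debugging trajectory.
# Field names ("messages", "tool_call", "args") are illustrative,
# not the actual schema used by the dsl-debug environment.
trajectory = {
    "messages": [
        {"role": "user", "content": "Program crashes on input [3, 1]. Find the bug."},
        {"role": "assistant", "tool_call": {"name": "run", "args": {"input": "[3, 1]"}}},
        {"role": "tool", "content": "IndexError at line 7"},
        {"role": "assistant", "tool_call": {"name": "inspect", "args": {"line": 7}}},
        {"role": "tool", "content": "return xs[len(xs)]"},
        {"role": "assistant", "tool_call": {"name": "submit", "args": {"fix": "return xs[len(xs) - 1]"}}},
    ]
}

# The four tools named in this card.
TOOLS = {"run", "inspect", "read_docs", "submit"}

# Every assistant tool call should reference a known tool.
assert all(
    m["tool_call"]["name"] in TOOLS
    for m in trajectory["messages"]
    if "tool_call" in m
)
```

The episode ends when the model issues a `submit` call, which the environment verifies against the held-out fix.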
## Results (held-out test, one-shot)
| Split | Base Model | This Model |
|-------|:---:|:---:|
| Standard (481) | 50.5% | **56.3%** |
| Nonlocal (200) | 12.0% | **40.0%** |
| Intent-Mismatch (177) | 0.6% | **7.9%** |
## Alignment Tax
| Benchmark | Base | This Model |
|-----------|:---:|:---:|
| MMLU (5-shot) | 74.6% | 74.6% |
| GSM8K (8-shot) | 84.9% | 83.9% |
| HumanEval (0-shot) | 65.9% | 62.2% |
## Usage
This checkpoint is the SFT stage of the SFT-then-RL pipeline. It is intended primarily as the starting point for subsequent RL training (GRPO), which achieves the best results.
```python
from huggingface_hub import snapshot_download

snapshot_download(
    "andrewlngdn/dsl-debug-7b-sft-step100",
    local_dir="/workspace/models/sft_7b_step100",
)
```
## Related Models
| Model | Repo |
|-------|------|
| **SFT then RL step 35 (best)** | [andrewlngdn/dsl-debug-7b-sft-rl](https://huggingface.co/andrewlngdn/dsl-debug-7b-sft-rl) |
| RL-only step 30 | [andrewlngdn/dsl-debug-7b-rl-only-step30](https://huggingface.co/andrewlngdn/dsl-debug-7b-rl-only-step30) |