64 lines
1.8 KiB
Markdown
64 lines
1.8 KiB
Markdown
---
|
|
license: mit
|
|
base_model: Qwen/Qwen2.5-7B-Instruct
|
|
tags:
|
|
- debugging
|
|
- tool-use
|
|
- multi-turn
|
|
- sft
|
|
datasets:
|
|
- custom
|
|
language:
|
|
- en
|
|
pipeline_tag: text-generation
|
|
---
|
|
|
|
# DSL Debug 7B — SFT Step 100
|
|
|
|
Qwen2.5-7B-Instruct fine-tuned on 1,593 debugging trajectories for the DSL Debug environment.
|
|
|
|
**Blog post:** [Multi-Turn RL for Code Debugging](https://andrewlngdn.github.io/dsl_debugger/)
|
|
**Code + environment:** [github.com/AndrewLngdn/dsl-debug](https://github.com/AndrewLngdn/dsl-debug)
|
|
|
|
## Training
|
|
|
|
- **Method**: Supervised fine-tuning (verl 0.7)
|
|
- **Data**: 1,593 multi-turn trajectories with tool calls (run, inspect, read_docs, submit)
|
|
- **Base model**: Qwen2.5-7B-Instruct
|
|
- **Epochs**: 2 (step 100 checkpoint)
|
|
- **LR**: 5e-6
|
|
- **Hardware**: 2x A100-SXM4-80GB
|
|
|
|
## Results (held-out test, one-shot)
|
|
|
|
| Split | Base Model | This Model |
|
|
|-------|:---:|:---:|
|
|
| Standard (481) | 50.5% | **56.3%** |
|
|
| Nonlocal (200) | 12.0% | **40.0%** |
|
|
| Intent-Mismatch (177) | 0.6% | **7.9%** |
|
|
|
|
## Alignment Tax
|
|
|
|
| Benchmark | Base | This Model |
|
|
|-----------|:---:|:---:|
|
|
| MMLU (5-shot) | 74.6% | 74.6% |
|
|
| GSM8K (8-shot) | 84.9% | 83.9% |
|
|
| HumanEval (0-shot) | 65.9% | 62.2% |
|
|
|
|
## Usage
|
|
|
|
This checkpoint is primarily used as the starting point for SFT then RL training (GRPO), which achieves the best results.
|
|
|
|
```python
|
|
from huggingface_hub import snapshot_download
|
|
snapshot_download("andrewlngdn/dsl-debug-7b-sft-step100",
|
|
local_dir="/workspace/models/sft_7b_step100")
|
|
```
|
|
|
|
## Related Models
|
|
|
|
| Model | Repo |
|
|
|-------|------|
|
|
| **SFT then RL step 35 (best)** | [andrewlngdn/dsl-debug-7b-sft-rl](https://huggingface.co/andrewlngdn/dsl-debug-7b-sft-rl) |
|
|
| RL-only step 30 | [andrewlngdn/dsl-debug-7b-rl-only-step30](https://huggingface.co/andrewlngdn/dsl-debug-7b-rl-only-step30) |
|