初始化项目,由ModelHub XC社区提供模型
Model: andrewlngdn/dsl-debug-7b-rl-only-step30 Source: Original Platform
This commit is contained in:
62
README.md
Normal file
62
README.md
Normal file
@@ -0,0 +1,62 @@
|
||||
---
|
||||
license: mit
|
||||
base_model: Qwen/Qwen2.5-7B-Instruct
|
||||
tags:
|
||||
- debugging
|
||||
- tool-use
|
||||
- multi-turn
|
||||
- reinforcement-learning
|
||||
- grpo
|
||||
datasets:
|
||||
- custom
|
||||
language:
|
||||
- en
|
||||
pipeline_tag: text-generation
|
||||
---
|
||||
|
||||
# DSL Debug 7B — RL-Only Step 30
|
||||
|
||||
Qwen2.5-7B-Instruct trained with GRPO (Group Relative Policy Optimization) directly from the base model, no SFT warmup.
|
||||
|
||||
**Blog post:** [Multi-Turn RL for Code Debugging](https://andrewlngdn.github.io/dsl_debugger/)
|
||||
**Code + environment:** [github.com/AndrewLngdn/dsl-debug](https://github.com/AndrewLngdn/dsl-debug)
|
||||
|
||||
## Training
|
||||
|
||||
- **Method**: GRPO with multi-turn tool use (verl 0.7 + sglang 0.5.6)
|
||||
- **Base model**: Qwen2.5-7B-Instruct (no SFT warmup)
|
||||
- **Steps**: 30 (batch size 512, 8 rollouts per prompt)
|
||||
- **LR**: 1e-5 cosine
|
||||
- **Reward**: Binary (1.0 if submitted code matches expected output, 0.0 otherwise)
|
||||
- **Hardware**: 2x A100-SXM4-80GB
|
||||
|
||||
## Results (held-out test, one-shot)
|
||||
|
||||
| Split | Base Model | This Model |
|
||||
|-------|:---:|:---:|
|
||||
| Standard (481) | 50.5% | **78.8%** |
|
||||
| Nonlocal (200) | 12.0% | **54.0%** |
|
||||
| Intent-Mismatch (177) | 0.6% | **14.7%** |
|
||||
|
||||
## Alignment Tax
|
||||
|
||||
| Benchmark | Base | This Model |
|
||||
|-----------|:---:|:---:|
|
||||
| MMLU (5-shot) | 74.6% | 74.7% |
|
||||
| GSM8K (8-shot) | 84.9% | 84.4% |
|
||||
| HumanEval (0-shot) | 65.9% | 59.1% |
|
||||
|
||||
## Usage
|
||||
|
||||
```python
|
||||
from huggingface_hub import snapshot_download
|
||||
snapshot_download("andrewlngdn/dsl-debug-7b-rl-only-step30",
|
||||
local_dir="/workspace/models/rl_only_step30")
|
||||
```
|
||||
|
||||
## Related Models
|
||||
|
||||
| Model | Repo |
|
||||
|-------|------|
|
||||
| **SFT then RL step 35 (best)** | [andrewlngdn/dsl-debug-7b-sft-rl](https://huggingface.co/andrewlngdn/dsl-debug-7b-sft-rl) |
|
||||
| SFT step 100 | [andrewlngdn/dsl-debug-7b-sft-step100](https://huggingface.co/andrewlngdn/dsl-debug-7b-sft-step100) |
|
||||
Reference in New Issue
Block a user