---
library_name: transformers
base_model: lihaoxin2020/qwen3-4B-instruct-refiner-sft
model_type: qwen3
license: apache-2.0
tags:
- refiner
- grpo
- rl
- qwen3
---

# qwen3-4B-refiner-sft-rl-balanced-step50

This model is a **GRPO-trained** checkpoint (step 50) of the Qwen3-4B refiner, fine-tuned with reinforcement learning on the refiner RL dataset.

- **Experiment name:** `sft-5ep-offline-balanced-rl_5e-6-answer_only`

## Training Details

- **Base model:** `lihaoxin2020/qwen3-4B-instruct-refiner-sft`
- **Training method:** GRPO (Group Relative Policy Optimization) with DeepSpeed Stage 3
- **Refiner mode:** `answer_only`
- **Training script:** `open_instruct/grpo_fast_refiner_sft.py`

### Hyperparameters

| Parameter | Value |
|---|---|
| **Learning rate** | 5e-6 |
| **LR scheduler** | constant |
| **Beta (KL penalty)** | 0.001 |
| **KL estimator** | kl3 |
| **Advantage normalization** | standard |
| **Samples per prompt (rollout)** | 8 |
| **Unique prompts per rollout** | 32 |
| **Mini batches** | 1 |
| **Epochs per batch** | 1 |
| **Per-device train batch size** | 1 |
| **Temperature** | 1.0 |
| **Seed** | 42 |
| **Async mode** | true |
| **Adam offload** | true |
| **vLLM sync backend** | nccl |
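
The table above specifies 8 sampled completions per prompt with standard advantage normalization and the `kl3` KL estimator. As a rough illustration (not the actual training script), the group-relative advantage used by GRPO and the k3 estimator can be sketched like this:

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO advantage normalization: center each reward on its
    group's mean and scale by the group's standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def kl3(logprob_policy, logprob_ref):
    """The k3 KL estimator: r - 1 - log(r), with r = pi_ref / pi.
    Unbiased and always non-negative."""
    log_ratio = logprob_ref - logprob_policy
    return math.exp(log_ratio) - log_ratio - 1.0

# One rollout group: 8 samples for the same prompt (samples per prompt = 8).
rewards = [10.0, 0.0, 10.0, 0.0, 0.0, 10.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Because each group is centered on its own mean, advantages sum to (approximately) zero within a group: successful completions are pushed up relative to their siblings rather than against a learned value baseline.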

### Sequence Lengths

| Parameter | Value |
|---|---|
| Max token length | 8192 |
| Max prompt token length | 6144 |
| Response length | 1024 |
| Pack length | 8192 |

### Reward Configuration

| Parameter | Value |
|---|---|
| Verification reward | 10.0 |
| Non-stop penalty | false |
| Gate judge score with format bonus | false |
| Apply paper citation reward | true |
| Paper citation weight | 0.5 |
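
The card does not spell out how the verification and citation terms combine; a plausible sketch, assuming the weighted citation score is simply added on top of a binary verification reward (the exact formula in the training script may differ):

```python
def total_reward(verified, citation_score,
                 verification_reward=10.0, citation_weight=0.5):
    """Hypothetical combination of the reward terms in the table above:
    a binary verification reward plus a weighted paper-citation term."""
    reward = verification_reward if verified else 0.0
    reward += citation_weight * citation_score
    return reward

print(total_reward(True, 1.0))   # verified answer, full citation score -> 10.5
print(total_reward(False, 0.5))  # unverified answer, partial citation credit -> 0.25
```

With these values, verification dominates the signal (10.0 vs. a citation term capped at 0.5 for a score in [0, 1]), so citations act as a tie-breaker among already-correct completions.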

### Dataset

- **Training:** `lihaoxin2020/refiner_rl` (split: train)
- **Evaluation:** `lihaoxin2020/refiner_rl` (16 samples, split: test)

### Infrastructure

- **DeepSpeed stage:** 3
- **Learners per node:** 1
- **vLLM engines:** 1
- **vLLM tensor parallel size:** 1
- **vLLM GPU memory utilization:** 0.90
- **Judge model:** `Qwen/Qwen3.5-35B-A3B` (via vLLM)