---
library_name: transformers
base_model: lihaoxin2020/qwen3-4B-instruct-refiner-sft
model_type: qwen3
license: apache-2.0
tags:
- refiner
- grpo
- rl
- qwen3
---
# qwen3-4B-refiner-sft-rl-balanced-step50
This model is a **GRPO-trained** checkpoint (step 50) of the Qwen3-4B refiner, fine-tuned with reinforcement learning on the `lihaoxin2020/refiner_rl` dataset.
- **Experiment name:** `sft-5ep-offline-balanced-rl_5e-6-answer_only`
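Since the `library_name` is `transformers`, the checkpoint should load with the standard `AutoModelForCausalLM` / `AutoTokenizer` API. A minimal loading sketch follows; the chat-template usage and prompt below are illustrative assumptions, not taken from the training configuration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-step50"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Hypothetical refiner-style prompt; the exact expected input format is not documented here.
messages = [{"role": "user", "content": "Refine the following draft answer: ..."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# 1024 matches the response length used during RL training (see Sequence Lengths below).
outputs = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```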
## Training Details
- **Base model:** `lihaoxin2020/qwen3-4B-instruct-refiner-sft`
- **Training method:** GRPO (Group Relative Policy Optimization) with DeepSpeed Stage 3
- **Refiner mode:** `answer_only`
- **Training script:** `open_instruct/grpo_fast_refiner_sft.py`
### Hyperparameters
| Parameter | Value |
|---|---|
| **Learning rate** | 5e-6 |
| **LR scheduler** | constant |
| **Beta (KL penalty)** | 0.001 |
| **KL estimator** | kl3 |
| **Advantage normalization** | standard |
| **Samples per prompt (rollout)** | 8 |
| **Unique prompts per rollout** | 32 |
| **Mini batches** | 1 |
| **Epochs per batch** | 1 |
| **Per-device train batch size** | 1 |
| **Temperature** | 1.0 |
| **Seed** | 42 |
| **Async mode** | true |
| **Adam offload** | true |
| **vLLM sync backend** | nccl |
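For intuition on the GRPO settings above: each of the 32 prompts in a rollout is sampled 8 times, and every sample's reward is normalized against its own group's statistics ("standard" advantage normalization). The sketch below illustrates only that normalization step under those assumptions; it is not the `open_instruct` implementation.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: shape (num_prompts, samples_per_prompt), e.g. (32, 8) per rollout."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    # "Standard" normalization: subtract the group mean, divide by the group std.
    return (rewards - mean) / (std + eps)

# Example: one prompt's group of 8 rollouts.
rewards = np.array([[10.0, 0.0, 10.0, 10.5, 0.0, 0.5, 10.0, 0.5]])
print(group_relative_advantages(rewards))
```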
### Sequence Lengths
| Parameter | Value |
|---|---|
| Max token length | 8192 |
| Max prompt token length | 6144 |
| Response length | 1024 |
| Pack length | 8192 |
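These limits imply that a prompt is truncated to 6144 tokens so that, together with the 1024-token response budget, a sequence stays within the 8192-token maximum. A rough sketch of that budget check, with illustrative names only:

```python
from transformers import AutoTokenizer

MAX_PROMPT_TOKENS = 6144    # max prompt token length
MAX_RESPONSE_TOKENS = 1024  # response length
MAX_TOKENS = 8192           # max token length / pack length

tokenizer = AutoTokenizer.from_pretrained("lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-step50")

def truncate_prompt(prompt: str) -> str:
    # Cap the prompt so prompt + response always fits in the packed sequence.
    ids = tokenizer(prompt, truncation=True, max_length=MAX_PROMPT_TOKENS)["input_ids"]
    assert len(ids) + MAX_RESPONSE_TOKENS <= MAX_TOKENS
    return tokenizer.decode(ids)
```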
### Reward Configuration
| Parameter | Value |
|---|---|
| Verification reward | 10.0 |
| Non-stop penalty | false |
| Gate judge score with format bonus | false |
| Apply paper citation reward | true |
| Paper citation weight | 0.5 |
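Putting the table together, a verified answer earns the 10.0 verification reward and the paper-citation score contributes with weight 0.5. The function below is an illustrative reward mix only; the actual reward code in `open_instruct` may combine these terms differently.

```python
def total_reward(is_verified: bool, citation_score: float) -> float:
    """Illustrative only. citation_score is assumed to lie in [0, 1]."""
    verification_reward = 10.0 if is_verified else 0.0  # verification reward = 10.0
    citation_reward = 0.5 * citation_score              # paper citation weight = 0.5
    return verification_reward + citation_reward

print(total_reward(True, 0.8))   # 10.4
print(total_reward(False, 1.0))  # 0.5
```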
### Dataset
- **Training:** `lihaoxin2020/refiner_rl` (split: train)
- **Evaluation:** `lihaoxin2020/refiner_rl` (16 samples, split: test)
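The splits can be loaded with the `datasets` library (assuming the dataset is publicly accessible; field names are not documented here):

```python
from datasets import load_dataset

train_ds = load_dataset("lihaoxin2020/refiner_rl", split="train")
eval_ds = load_dataset("lihaoxin2020/refiner_rl", split="test").select(range(16))
print(train_ds)
```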
### Infrastructure
- **DeepSpeed stage:** 3
- **Learners per node:** 1
- **vLLM engines:** 1
- **vLLM tensor parallel size:** 1
- **vLLM GPU memory utilization:** 0.90
- **Judge model:** `Qwen/Qwen3.5-35B-A3B` (via vLLM)
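For reference, a single vLLM engine with the settings above can be instantiated as follows; this is a sketch mirroring the table values, not the training launcher itself.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-step50",
    tensor_parallel_size=1,        # vLLM tensor parallel size
    gpu_memory_utilization=0.90,   # vLLM GPU memory utilization
)
outputs = llm.generate(
    ["Refine the following draft answer: ..."],
    SamplingParams(temperature=1.0, max_tokens=1024),
)
print(outputs[0].outputs[0].text)
```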