---
library_name: transformers
base_model: lihaoxin2020/qwen3-4B-instruct-refiner-sft
model_type: qwen3
license: apache-2.0
tags:
- refiner
- grpo
- rl
- qwen3
---

# qwen3-4B-refiner-sft-rl-balanced-step50

This model is a **GRPO-trained** checkpoint (step 50) of the Qwen3-4B refiner, fine-tuned with reinforcement learning on the refiner RL dataset.

- **Experiment name:** `sft-5ep-offline-balanced-rl_5e-6-answer_only`

## Training Details

- **Base model:** `lihaoxin2020/qwen3-4B-instruct-refiner-sft`
- **Training method:** GRPO (Group Relative Policy Optimization) with DeepSpeed Stage 3
- **Refiner mode:** `answer_only`
- **Training script:** `open_instruct/grpo_fast_refiner_sft.py`

### Hyperparameters

| Parameter | Value |
|---|---|
| **Learning rate** | 5e-6 |
| **LR scheduler** | constant |
| **Beta (KL penalty)** | 0.001 |
| **KL estimator** | kl3 |
| **Advantage normalization** | standard |
| **Samples per prompt (rollout)** | 8 |
| **Unique prompts per rollout** | 32 |
| **Mini-batches** | 1 |
| **Epochs per batch** | 1 |
| **Per-device train batch size** | 1 |
| **Temperature** | 1.0 |
| **Seed** | 42 |
| **Async mode** | true |
| **Adam offload** | true |
| **vLLM sync backend** | nccl |

### Sequence Lengths

| Parameter | Value |
|---|---|
| Max token length | 8192 |
| Max prompt token length | 6144 |
| Response length | 1024 |
| Pack length | 8192 |

### Reward Configuration

| Parameter | Value |
|---|---|
| Verification reward | 10.0 |
| Non-stop penalty | false |
| Gate judge score with format bonus | false |
| Apply paper citation reward | true |
| Paper citation weight | 0.5 |

### Dataset

- **Training:** `lihaoxin2020/refiner_rl` (split: train)
- **Evaluation:** `lihaoxin2020/refiner_rl` (16 samples, split: test)

### Infrastructure

- **DeepSpeed stage:** 3
- **Learners per node:** 1
- **vLLM engines:** 1
- **vLLM tensor parallel size:** 1
- **vLLM GPU memory utilization:** 0.90
- **Judge model:** `Qwen/Qwen3.5-35B-A3B` (via vLLM)
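
For reference, GRPO scores each sampled response relative to the other responses to the same prompt. With the `standard` advantage normalization and the group sizes listed under Hyperparameters (32 unique prompts x 8 samples per prompt), the advantage is a per-group z-score of the rewards. The snippet below is an illustrative sketch only, not the `open_instruct` training code:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: z-score each prompt's rewards across its rollouts.

    rewards: shape (num_prompts, samples_per_prompt), e.g. (32, 8) for this run.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Example with this run's rollout shape: 32 unique prompts, 8 samples each.
rewards = np.random.rand(32, 8)
advantages = grpo_advantages(rewards)
```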
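
The reward terms from the Reward Configuration table enter the total roughly as sketched below. The helper arguments (`answer_verified`, `citation_score`) and their scales are assumptions made for illustration; only the weights (10.0 and 0.5) come from the configuration above.

```python
def total_reward(answer_verified: bool, citation_score: float,
                 verification_reward: float = 10.0,
                 paper_citation_weight: float = 0.5) -> float:
    """Hypothetical combination of the configured reward terms.

    answer_verified: whether the judge verified the refined answer (assumed boolean).
    citation_score:  paper-citation quality in [0, 1] (assumed scale).
    """
    reward = verification_reward if answer_verified else 0.0
    # Paper citation reward is enabled for this run and weighted by 0.5.
    reward += paper_citation_weight * citation_score
    return reward

print(total_reward(answer_verified=True, citation_score=0.8))  # 10.4
```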
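
## Usage

A minimal generation example with 🤗 Transformers. The model ID below is assumed from this repository's name; replace it with the actual repo path, and adjust the prompt to your refiner input format (the message shown is purely illustrative).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-step50"  # assumed repo path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Illustrative prompt; the actual refiner input format may differ.
messages = [{"role": "user", "content": "Refine the following answer: ..."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# max_new_tokens and temperature mirror the training configuration above.
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=1.0)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```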