---
library_name: transformers
base_model: lihaoxin2020/qwen3-4B-instruct-refiner-sft
model_type: qwen3
license: apache-2.0
tags:
- refiner
- grpo
- rl
- qwen3
---

# qwen3-4B-refiner-sft-rl-balanced-step50

This model is a **GRPO-trained** checkpoint (step 50) of the Qwen3-4B refiner, fine-tuned with reinforcement learning on the refiner RL dataset.

- **Experiment name:** `sft-5ep-offline-balanced-rl_5e-6-answer_only`

## Training Details

- **Base model:** `lihaoxin2020/qwen3-4B-instruct-refiner-sft`
- **Training method:** GRPO (Group Relative Policy Optimization) with DeepSpeed Stage 3
- **Refiner mode:** `answer_only`
- **Training script:** `open_instruct/grpo_fast_refiner_sft.py`

### Hyperparameters

| Parameter | Value |
|---|---|
| **Learning rate** | 5e-6 |
| **LR scheduler** | constant |
| **Beta (KL penalty)** | 0.001 |
| **KL estimator** | kl3 |
| **Advantage normalization** | standard |
| **Samples per prompt (rollout)** | 8 |
| **Unique prompts per rollout** | 32 |
| **Mini-batches** | 1 |
| **Epochs per batch** | 1 |
| **Per-device train batch size** | 1 |
| **Temperature** | 1.0 |
| **Seed** | 42 |
| **Async mode** | true |
| **Adam offload** | true |
| **vLLM sync backend** | nccl |

### Sequence Lengths

| Parameter | Value |
|---|---|
| Max token length | 8192 |
| Max prompt token length | 6144 |
| Response length | 1024 |
| Pack length | 8192 |

### Reward Configuration

| Parameter | Value |
|---|---|
| Verification reward | 10.0 |
| Non-stop penalty | false |
| Gate judge score with format bonus | false |
| Apply paper citation reward | true |
| Paper citation weight | 0.5 |

### Dataset

- **Training:** `lihaoxin2020/refiner_rl` (split: train)
- **Evaluation:** `lihaoxin2020/refiner_rl` (16 samples, split: test)

### Infrastructure

- **DeepSpeed stage:** 3
- **Learners per node:** 1
- **vLLM engines:** 1
- **vLLM tensor parallel size:** 1
- **vLLM GPU memory utilization:** 0.90
- **Judge model:** `Qwen/Qwen3.5-35B-A3B` (via vLLM)
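
For reference, GRPO scores each sampled response relative to the other responses to the same prompt. With the `standard` advantage normalization and the group sizes listed under Hyperparameters (32 unique prompts x 8 samples per prompt), the advantage is a per-group z-score of the rewards. The snippet below is an illustrative sketch only, not the `open_instruct` training code:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: z-score each prompt's rewards across its rollouts.

    rewards: shape (num_prompts, samples_per_prompt), e.g. (32, 8) for this run.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Example with this run's rollout shape: 32 unique prompts, 8 samples each.
rewards = np.random.rand(32, 8)
advantages = grpo_advantages(rewards)
```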
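
The reward terms from the Reward Configuration table enter the total roughly as sketched below. The helper arguments (`answer_verified`, `citation_score`) and their scales are assumptions made for illustration; only the weights (10.0 and 0.5) come from the configuration above.

```python
def total_reward(answer_verified: bool, citation_score: float,
                 verification_reward: float = 10.0,
                 paper_citation_weight: float = 0.5) -> float:
    """Hypothetical combination of the configured reward terms.

    answer_verified: whether the judge verified the refined answer (assumed boolean).
    citation_score:  paper-citation quality in [0, 1] (assumed scale).
    """
    reward = verification_reward if answer_verified else 0.0
    # Paper citation reward is enabled for this run and weighted by 0.5.
    reward += paper_citation_weight * citation_score
    return reward

print(total_reward(answer_verified=True, citation_score=0.8))  # 10.4
```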
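
## Usage

A minimal generation example with 🤗 Transformers. The model ID below is assumed from this repository's name; replace it with the actual repo path, and adjust the prompt to your refiner input format (the message shown is purely illustrative).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-step50"  # assumed repo path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Illustrative prompt; the actual refiner input format may differ.
messages = [{"role": "user", "content": "Refine the following answer: ..."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# max_new_tokens and temperature mirror the training configuration above.
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=1.0)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```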