---
library_name: transformers
base_model: lihaoxin2020/qwen3-4B-instruct-refiner-sft
model_type: qwen3
license: apache-2.0
tags:
- qwen3-4B-refiner-sft-rl-balanced-step100
---

# qwen3-4B-refiner-sft-rl-balanced-step100
This model is a GRPO-trained checkpoint (step 100, resumed run) of the Qwen3-4B refiner, fine-tuned with reinforcement learning on the refiner RL dataset.
**Experiment name:** `sft-5ep-offline-balanced-rl_5e-6-answer_only_resume`
## Training Details

- **Base model:** lihaoxin2020/qwen3-4B-instruct-refiner-sft
- **Training method:** GRPO (Group Relative Policy Optimization) with DeepSpeed ZeRO Stage 3
- **Refiner mode:** `answer_only`
- **Training script:** `open_instruct/grpo_fast_refiner_sft.py`
### Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | 5e-6 |
| LR scheduler | constant |
| Beta (KL penalty) | 0.001 |
| KL estimator | kl3 |
| Advantage normalization | standard |
| Samples per prompt (rollout) | 8 |
| Unique prompts per rollout | 32 |
| Mini-batches | 1 |
| Epochs per batch | 1 |
| Per-device train batch size | 1 |
| Temperature | 1.0 |
| Seed | 42 |
| Async mode | true |
| Adam offload | true |
| vLLM sync backend | nccl |
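Under GRPO, rewards are compared within the group of 8 responses sampled for each prompt; with `standard` advantage normalization, each response's advantage is its reward minus the group mean, divided by the group standard deviation. A minimal sketch of that computation (plain Python, not the actual training code):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standard-normalized group-relative advantages for one prompt's
    rollout group: (r - mean) / (std + eps), computed within the group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, 8 sampled responses (samples per prompt = 8):
rewards = [10.0, 0.0, 10.0, 10.5, 0.0, 0.5, 10.0, 0.0]
advs = group_relative_advantages(rewards)
```

Responses scoring above their group's mean get a positive advantage and are reinforced; those below get a negative one, with no separate value network needed.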
### Sequence Lengths

| Parameter | Value (tokens) |
|---|---|
| Max token length | 8192 |
| Max prompt token length | 6144 |
| Response length | 1024 |
| Pack length | 8192 |
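These limits imply that a prompt within the 6144-token cap plus a response within the 1024-token cap always fits in an 8192-token pack. A small budget-check sketch (the helper name is hypothetical, not from the training script):

```python
# Sequence-length budgets from the table above.
MAX_PROMPT_TOKEN_LEN = 6144
RESPONSE_LEN = 1024
PACK_LEN = 8192

def fits_in_pack(prompt_len: int, response_len: int) -> bool:
    """Check a (prompt, response) token-length pair against the budgets."""
    return (prompt_len <= MAX_PROMPT_TOKEN_LEN
            and response_len <= RESPONSE_LEN
            and prompt_len + response_len <= PACK_LEN)

# Worst case: a maximal prompt plus a maximal response still fits
# (6144 + 1024 = 7168 <= 8192).
assert fits_in_pack(MAX_PROMPT_TOKEN_LEN, RESPONSE_LEN)
```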
### Reward Configuration

| Parameter | Value |
|---|---|
| Verification reward | 10.0 |
| Non-stop penalty | false |
| Gate judge score with format bonus | false |
| Apply paper citation reward | true |
| Paper citation weight | 0.5 |
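The card does not spell out how these terms combine; one plausible reading is a verification term worth 10.0 when the answer verifies, plus a paper-citation term scaled by the 0.5 weight. The function below is a hypothetical sketch under that assumption, not the repository's reward code:

```python
def total_reward(verified: bool, citation_score: float,
                 verification_reward: float = 10.0,
                 citation_weight: float = 0.5) -> float:
    """Hypothetical reward combination: a verification reward (10.0 when
    the answer verifies) plus a paper-citation score weighted by 0.5."""
    reward = verification_reward if verified else 0.0
    reward += citation_weight * citation_score
    return reward

# A verified answer with citation score 1.0 yields 10.0 + 0.5 * 1.0 = 10.5.
```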
## Dataset

- **Training:** `lihaoxin2020/refiner_rl` (split: `train`)
- **Evaluation:** `lihaoxin2020/refiner_rl` (16 samples, split: `test`)
## Infrastructure

- **DeepSpeed stage:** 3
- **Learners per node:** 1
- **vLLM engines:** 1
- **vLLM tensor parallel size:** 1
- **vLLM GPU memory utilization:** 0.90
- **Judge model:** Qwen/Qwen3.5-35B-A3B (served via vLLM)