Model: lihaoxin2020/qwen3-4B-refiner-3201-rl-balanced-step100

library_name: transformers
base_model: lihaoxin2020/qwen3-4B-refiner-sft-step-3201
model_type: qwen3
license: apache-2.0
tags: refiner, grpo, rl, qwen3

qwen3-4B-refiner-rl-balanced-step100

This model is a GRPO reinforcement-learning checkpoint (step 100) of the Qwen3-4B refiner, trained from the SFT checkpoint on the lihaoxin2020/refiner_rl dataset (see Dataset below).

  • Experiment name: use_cache-5ep-offline-balanced-rl_5e-6-answer_only
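The model can be loaded with the standard transformers causal-LM API. A minimal loading sketch (the chat template and recommended sampling settings are not documented in this card; the prompt below is a placeholder):

```python
# Minimal loading sketch; assumes the standard transformers causal-LM API.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lihaoxin2020/qwen3-4B-refiner-3201-rl-balanced-step100"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Refine the following draft answer: ..."  # placeholder input
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)  # 1024 matches the response length used in training
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```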

Training Details

  • Base model: lihaoxin2020/qwen3-4B-refiner-sft-step-3201
  • Training method: GRPO (Group Relative Policy Optimization) with DeepSpeed Stage 3
  • Refiner mode: answer_only
  • Training script: open_instruct/grpo_fast_refiner_sft.py

Hyperparameters

  • Learning rate: 5e-6
  • LR scheduler: constant
  • Beta (KL penalty): 0.001
  • KL estimator: kl3
  • Advantage normalization: standard
  • Samples per prompt (rollout): 8
  • Unique prompts per rollout: 32
  • Mini batches: 1
  • Epochs per batch: 1
  • Per-device train batch size: 1
  • Temperature: 1.0
  • Seed: 42
  • Async mode: true
  • Adam offload: true
  • vLLM sync backend: nccl
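For reference, GRPO computes each rollout's advantage relative to the other rollouts for the same prompt. A minimal sketch of group-relative advantages with standard (mean/std) normalization over the 8 samples per prompt used here; this is illustrative only, not the open_instruct implementation:

```python
# Illustrative sketch of group-relative advantages with "standard" normalization.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: shape (num_prompts, samples_per_prompt), e.g. (32, 8) for this run."""
    mean = rewards.mean(axis=1, keepdims=True)   # per-prompt baseline
    std = rewards.std(axis=1, keepdims=True)     # per-prompt scale
    return (rewards - mean) / (std + eps)        # standard normalization

# Example: 32 unique prompts per rollout, 8 samples per prompt, as listed above.
rewards = np.random.rand(32, 8)
advantages = group_relative_advantages(rewards)
```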

Sequence Lengths

  • Max token length: 8192
  • Max prompt token length: 6144
  • Response length: 1024
  • Pack length: 8192
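Assuming these limits are enforced as prompt truncation and a cap on generated tokens, they map roughly to the following settings (reusing the tokenizer and model from the loading sketch above; a sketch, not the training code):

```python
# Sketch of how the length limits above would map to tokenization/generation settings.
MAX_PROMPT_TOKENS = 6144    # prompts truncated to this length
MAX_RESPONSE_TOKENS = 1024  # prompt + response stays within the 8192 max/pack length

inputs = tokenizer(prompt, truncation=True, max_length=MAX_PROMPT_TOKENS, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=MAX_RESPONSE_TOKENS)
```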

Reward Configuration

  • Verification reward: 10.0
  • Non-stop penalty: false
  • Gate judge score with format bonus: false
  • Apply paper citation reward: true
  • Paper citation weight: 0.5
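The exact reward composition is defined by the training script; as a rough, hypothetical illustration of how the settings above could combine (the function and the combination rule are assumptions, not the open_instruct code):

```python
# Hypothetical illustration of how the reward settings above could combine.
VERIFICATION_REWARD = 10.0
PAPER_CITATION_WEIGHT = 0.5

def total_reward(answer_is_verified: bool, citation_score: float) -> float:
    reward = VERIFICATION_REWARD if answer_is_verified else 0.0
    reward += PAPER_CITATION_WEIGHT * citation_score  # apply paper citation reward = true
    # non-stop penalty and judge-score gating are both disabled, so no further terms here
    return reward
```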

Dataset

  • Training: lihaoxin2020/refiner_rl (split: train)
  • Evaluation: lihaoxin2020/refiner_rl (16 samples, split: test)
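Both splits can be pulled with the Hugging Face datasets library; a minimal sketch:

```python
# Minimal sketch for loading the RL dataset with the Hugging Face datasets library.
from datasets import load_dataset

train_ds = load_dataset("lihaoxin2020/refiner_rl", split="train")
eval_ds = load_dataset("lihaoxin2020/refiner_rl", split="test")  # 16 samples were used for evaluation
print(train_ds)
```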

Infrastructure

  • DeepSpeed stage: 3
  • Learners per node: 1
  • vLLM engines: 1
  • vLLM tensor parallel size: 1
  • vLLM GPU memory utilization: 0.90
  • Judge model: Qwen/Qwen3.5-35B-A3B (via vLLM)
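
The judge runs on vLLM with the settings listed above; a minimal sketch of loading it with the vLLM offline Python API (the prompt and sampling parameters are illustrative assumptions):

```python
# Sketch of running the judge model via the vLLM offline API with the settings above.
from vllm import LLM, SamplingParams

judge = LLM(
    model="Qwen/Qwen3.5-35B-A3B",   # judge model as listed in this card
    tensor_parallel_size=1,          # vLLM tensor parallel size
    gpu_memory_utilization=0.90,     # vLLM GPU memory utilization
)
params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = judge.generate(["Judge the following refined answer: ..."], params)
print(outputs[0].outputs[0].text)
```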