---
library_name: transformers
base_model: lihaoxin2020/qwen3-4B-refiner-sft-step-3201
model_type: qwen3
license: apache-2.0
tags:
- refiner
- grpo
- rl
- qwen3
- rubric
---
# qwen3-4B-refiner-rubric-rl-step50

This model is a GRPO-trained checkpoint (step 50) of the Qwen3-4B refiner, fine-tuned with reinforcement learning using per-instance rubric rewards.

- Experiment name: `instance-rubric-rl-5e-6-answer_only`
## Training Details

- Base model: `lihaoxin2020/qwen3-4B-refiner-sft-step-3201`
- Training method: GRPO (Group Relative Policy Optimization) with DeepSpeed Stage 3
- Refiner mode: `answer_only`
- Training script: `open_instruct/grpo_fast_refiner_rubric.py`
### Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | 5e-6 |
| LR scheduler | constant |
| Beta (KL penalty) | 0.001 |
| KL estimator | kl3 |
| Advantage normalization | standard |
| Samples per prompt (rollout) | 8 |
| Unique prompts per rollout | 32 |
| Mini batches | 1 |
| Epochs per batch | 1 |
| Per-device train batch size | 1 |
| Temperature | 1.0 |
| Seed | 42 |
| Async mode | true |
| Adam offload | true |
| vLLM sync backend | nccl |
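As a rough sketch of what "standard" advantage normalization means in a GRPO setup like this one: each prompt's 8 sampled responses form a group, and each response's reward is normalized against its own group's statistics. The function below is illustrative (names and the exact std estimator are my assumptions, not taken from the training script):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-8):
    # "Standard" normalization: center each reward on the group mean
    # and scale by the group std (population std used here; the exact
    # estimator in the training script may differ).
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# One prompt's group of 8 rollouts (samples per prompt = 8):
# e.g. five responses earned the 10.0 verification reward, three earned 0.
rewards = [10.0, 0.0, 10.0, 10.0, 0.0, 10.0, 0.0, 10.0]
advs = grpo_advantages(rewards)
```

With 32 unique prompts per rollout this yields 32 such groups (256 responses) per training step.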
### Sequence Lengths

| Parameter | Value |
|---|---|
| Max token length | 8192 |
| Max prompt token length | 6144 |
| Response length | 1024 |
| Pack length | 8192 |
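These limits fit together as a token budget: the longest allowed prompt plus the response budget stays within the pack length. A minimal sanity check (variable names are mine, not the script's):

```python
# Sequence-length budget from the table above.
max_token_length = 8192
max_prompt_tokens = 6144
response_tokens = 1024
pack_length = 8192

# A maximal prompt plus a full-length response must fit in one pack.
total = max_prompt_tokens + response_tokens
assert total <= pack_length
assert total <= max_token_length
```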
### Reward Configuration

| Parameter | Value |
|---|---|
| Verification reward | 10.0 |
| Non-stop penalty | false |
| Gate judge score with format bonus | true |
| Apply paper citation reward | true |
| Paper citation weight | 0.2 |
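One plausible reading of how these settings compose (this is a hypothetical sketch; the actual reward logic lives in `open_instruct/grpo_fast_refiner_rubric.py` and may differ): the rubric judge's score is scaled by the verification reward but only counts when the format-bonus gate passes, and the paper-citation reward is mixed in with weight 0.2.

```python
def combined_reward(judge_score, format_ok, citation_score,
                    verification_reward=10.0, citation_weight=0.2):
    # Hypothetical composition of the table's settings:
    # gate the judge score on the format bonus, scale it by the
    # verification reward, then add the weighted citation reward.
    gated = judge_score * verification_reward if format_ok else 0.0
    return gated + citation_weight * citation_score

# A well-formatted response with a perfect rubric score and no citations:
r1 = combined_reward(judge_score=1.0, format_ok=True, citation_score=0.0)
# A malformed response earns only the (weighted) citation reward:
r2 = combined_reward(judge_score=1.0, format_ok=False, citation_score=1.0)
```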
## Dataset

- Training: `lihaoxin2020/rl_hard_gpt5_sft` (split: `train`)
- Evaluation: `lihaoxin2020/rl_hard_gpt5_sft` (16 samples, split: `train`)
## Infrastructure
- DeepSpeed stage: 3
- Learners per node: 1
- vLLM engines: 1
- vLLM tensor parallel size: 1
- vLLM GPU memory utilization: 0.90
- Judge model: `Qwen/Qwen3.5-35B-A3B` (via vLLM)