---
library_name: transformers
base_model: lihaoxin2020/qwen3-4B-instruct-refiner-sft
model_type: qwen3
license: apache-2.0
---
# qwen3-4B-refiner-sft-rl-balanced-step50
This model is a GRPO-trained checkpoint (step 50) of the Qwen3-4B refiner, fine-tuned with reinforcement learning on the refiner RL dataset.
- Experiment name: `sft-5ep-offline-balanced-rl_5e-6-answer_only`
## Training Details
- Base model: `lihaoxin2020/qwen3-4B-instruct-refiner-sft`
- Training method: GRPO (Group Relative Policy Optimization) with DeepSpeed Stage 3
- Refiner mode: `answer_only`
- Training script: `open_instruct/grpo_fast_refiner_sft.py`
## Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 5e-6 |
| LR scheduler | constant |
| Beta (KL penalty) | 0.001 |
| KL estimator | kl3 |
| Advantage normalization | standard |
| Samples per prompt (rollout) | 8 |
| Unique prompts per rollout | 32 |
| Mini batches | 1 |
| Epochs per batch | 1 |
| Per-device train batch size | 1 |
| Temperature | 1.0 |
| Seed | 42 |
| Async mode | true |
| Adam offload | true |
| vLLM sync backend | nccl |
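Two of the settings above — group-relative advantages with standard normalization, and the `kl3` KL estimator — can be sketched in plain Python. The helper names below are hypothetical; the actual implementation lives in `open_instruct/grpo_fast_refiner_sft.py`.

```python
import math

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage with standard normalization:
    each rollout's reward is centered and scaled by the mean and
    std of its group (here one group = the 8 samples per prompt)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def kl3(logp_policy, logp_ref):
    """The k3 KL estimator: r - 1 - log(r) with r = p_ref / p_policy,
    evaluated on tokens sampled from the policy. Always non-negative,
    with lower variance than the plain log-ratio estimator."""
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0

# One prompt, 8 rollouts (samples per prompt = 8 in the table above).
advs = group_advantages([10.0, 0.0, 10.0, 0.0, 0.0, 10.5, 0.0, 0.5])
print([round(a, 3) for a in advs])
# The per-token KL penalty is this estimate scaled by beta = 0.001.
print(kl3(-1.0, -1.2))
```

Note that the normalized advantages of a group always sum to (approximately) zero, so a prompt where every rollout gets the same reward contributes no learning signal.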
## Sequence Lengths
| Parameter | Value |
|---|---|
| Max token length | 8192 |
| Max prompt token length | 6144 |
| Response length | 1024 |
| Pack length | 8192 |
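A minimal sketch of how these limits compose, assuming the straightforward reading of the table (the real truncation logic is in the training script): prompts are capped at 6144 tokens, generated responses at 1024, and the packed sequence at 8192.

```python
MAX_TOKEN_LENGTH = 8192         # also the pack length
MAX_PROMPT_TOKEN_LENGTH = 6144
RESPONSE_LENGTH = 1024

def fits_budget(prompt_len: int, response_len: int) -> bool:
    """Check a (prompt, response) pair against the limits above."""
    return (
        prompt_len <= MAX_PROMPT_TOKEN_LENGTH
        and response_len <= RESPONSE_LENGTH
        and prompt_len + response_len <= MAX_TOKEN_LENGTH
    )

print(fits_budget(6144, 1024))  # both at their caps: 7168 <= 8192
print(fits_budget(7000, 512))   # prompt exceeds the 6144 cap
```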
## Reward Configuration
| Parameter | Value |
|---|---|
| Verification reward | 10.0 |
| Non-stop penalty | false |
| Gate judge score with format bonus | false |
| Apply paper citation reward | true |
| Paper citation weight | 0.5 |
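How these settings plausibly combine into a scalar reward, as a sketch (the function and argument names are hypothetical; the exact composition is defined by the training script):

```python
def total_reward(verified: bool, citation_score: float) -> float:
    """Combine the rewards from the table above: a flat 10.0 for a
    response that passes verification, plus the paper-citation score
    weighted by 0.5. The non-stop penalty and format-bonus gating
    are disabled in this run, so neither term appears here."""
    VERIFICATION_REWARD = 10.0
    PAPER_CITATION_WEIGHT = 0.5
    reward = VERIFICATION_REWARD if verified else 0.0
    reward += PAPER_CITATION_WEIGHT * citation_score
    return reward

print(total_reward(True, 1.0))   # 10.5
print(total_reward(False, 0.4))  # 0.2
```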
## Dataset
- Training: `lihaoxin2020/refiner_rl` (split: `train`)
- Evaluation: `lihaoxin2020/refiner_rl` (16 samples, split: `test`)
## Infrastructure
- DeepSpeed stage: 3
- Learners per node: 1
- vLLM engines: 1
- vLLM tensor parallel size: 1
- vLLM GPU memory utilization: 0.90
- Judge model: `Qwen/Qwen3.5-35B-A3B` (via vLLM)