---
library_name: transformers
base_model: lihaoxin2020/qwen3-4B-refiner-sft-step-3201
model_type: qwen3
license: apache-2.0
tags:
- refiner
- grpo
- rl
- qwen3
- rubric
---

# qwen3-4B-refiner-rubric-rl-step50

This model is a **GRPO-trained** checkpoint (step 50) of the Qwen3-4B refiner, fine-tuned with reinforcement learning using **per-instance rubric rewards**.

- **Experiment name:** `instance-rubric-rl-5e-6-answer_only`

## Training Details

- **Base model:** `lihaoxin2020/qwen3-4B-refiner-sft-step-3201`
- **Training method:** GRPO (Group Relative Policy Optimization) with DeepSpeed Stage 3
- **Refiner mode:** `answer_only`
- **Training script:** `open_instruct/grpo_fast_refiner_rubric.py`

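GRPO's defining step is computing each rollout's advantage relative to the other rollouts of the same prompt, rather than against a learned value baseline. A minimal sketch of that group normalization (illustrative only, not the training code; the run uses "standard" advantage normalization and 8 samples per prompt):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each rollout's reward against
    the mean and std of its own prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# e.g. rewards for 8 rollouts of one prompt (samples_per_prompt = 8 in this run)
advantages = group_relative_advantages([10.0, 0.0, 10.0, 0.0, 0.0, 10.0, 0.0, 0.0])
```

Rollouts scoring above their group's mean get positive advantages and are reinforced; the rest are pushed down, so no separate critic is needed.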
### Hyperparameters
| Parameter | Value |
|---|---|
| **Learning rate** | 5e-6 |
| **LR scheduler** | constant |
| **Beta (KL penalty)** | 0.001 |
| **KL estimator** | kl3 |
| **Advantage normalization** | standard |
| **Samples per prompt (rollout)** | 8 |
| **Unique prompts per rollout** | 32 |
| **Mini batches** | 1 |
| **Epochs per batch** | 1 |
| **Per-device train batch size** | 1 |
| **Temperature** | 1.0 |
| **Seed** | 42 |
| **Async mode** | true |
| **Adam offload** | true |
| **vLLM sync backend** | nccl |

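The `kl3` estimator is commonly implemented in GRPO/PPO-style trainers as `exp(l) - l - 1` with `l` the log-ratio between the reference and policy log-probabilities; it is non-negative and lower-variance than the naive log-ratio estimator. A sketch of the per-token penalty under that assumption:

```python
import math

def kl3(logprob, ref_logprob):
    """k3 KL estimator: with l = ref_logprob - logprob, returns
    exp(l) - l - 1, which is >= 0 and zero iff the policy matches
    the reference on this token."""
    l = ref_logprob - logprob
    return math.exp(l) - l - 1.0

# per-token penalty with this run's beta = 0.001
penalty = 0.001 * kl3(logprob=-2.0, ref_logprob=-1.5)
```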
### Sequence Lengths
| Parameter | Value |
|---|---|
| Max token length | 8192 |
| Max prompt token length | 6144 |
| Response length | 1024 |
| Pack length | 8192 |

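These budgets are mutually consistent: a maximum-length prompt plus a full-length response fits inside one packed sequence. A quick arithmetic check using the values from the table:

```python
# values from the table above
MAX_TOKEN_LENGTH = 8192
MAX_PROMPT_TOKEN_LENGTH = 6144
RESPONSE_LENGTH = 1024
PACK_LENGTH = 8192

# a full prompt plus a full response must fit in one packed sequence
worst_case = MAX_PROMPT_TOKEN_LENGTH + RESPONSE_LENGTH
assert worst_case <= PACK_LENGTH <= MAX_TOKEN_LENGTH
```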
### Reward Configuration
| Parameter | Value |
|---|---|
| Verification reward | 10.0 |
| Non-stop penalty | false |
| Gate judge score with format bonus | **true** |
| Apply paper citation reward | true |
| Paper citation weight | **0.2** |

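One plausible reading of these flags (purely illustrative; the function name and composition below are assumptions, and the authoritative logic lives in `open_instruct/grpo_fast_refiner_rubric.py`): the rubric judge's score only counts when the output passes the format check, and the paper-citation score is mixed in with weight 0.2.

```python
def total_reward(judge_score, format_ok, citation_score,
                 verification_reward=10.0, citation_weight=0.2):
    """Hypothetical composition of the flags above: the judge score,
    scaled by the verification reward, is gated on a valid format,
    then the citation score is added with a 0.2 weight."""
    gated = judge_score * verification_reward if format_ok else 0.0
    return gated + citation_weight * citation_score

r_good = total_reward(judge_score=1.0, format_ok=True, citation_score=0.5)
r_bad_format = total_reward(judge_score=1.0, format_ok=False, citation_score=0.5)
```

Gating on format keeps the policy from harvesting judge reward with malformed outputs, while the small citation weight nudges it toward grounded answers without dominating the rubric signal.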
### Dataset
- **Training:** `lihaoxin2020/rl_hard_gpt5_sft` (split: train)
- **Evaluation:** `lihaoxin2020/rl_hard_gpt5_sft` (16 samples, split: train)

### Infrastructure
- **DeepSpeed stage:** 3
- **Learners per node:** 1
- **vLLM engines:** 1
- **vLLM tensor parallel size:** 1
- **vLLM GPU memory utilization:** 0.90
- **Judge model:** `Qwen/Qwen3.5-35B-A3B` (via vLLM)