Initialize project; model provided by the ModelHub XC community
Model: lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-resume-step100 Source: Original Platform
README.md (new file, 77 lines)
---
library_name: transformers
base_model: lihaoxin2020/qwen3-4B-instruct-refiner-sft
model_type: qwen3
license: apache-2.0
tags:
- refiner
- grpo
- rl
- qwen3
---

# qwen3-4B-refiner-sft-rl-balanced-step100

This model is a **GRPO-trained** checkpoint (step 100 of a resumed run) of the Qwen3-4B refiner, fine-tuned with reinforcement learning on the refiner RL dataset.

- **Experiment name:** `sft-5ep-offline-balanced-rl_5e-6-answer_only_resume`

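The checkpoint can be loaded with the standard `transformers` API. A minimal sketch (the repo id is taken from this card; the helper function is illustrative, and actually loading the weights requires network access):

```python
MODEL_ID = "lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-resume-step100"

def load_refiner(model_id: str = MODEL_ID):
    """Return (tokenizer, model); requires `transformers` and network access."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    return tokenizer, model
```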
## Training Details

- **Base model:** `lihaoxin2020/qwen3-4B-instruct-refiner-sft`
- **Training method:** GRPO (Group Relative Policy Optimization) with DeepSpeed Stage 3
- **Refiner mode:** `answer_only`
- **Training script:** `open_instruct/grpo_fast_refiner_sft.py`

### Hyperparameters

| Parameter | Value |
|---|---|
| **Learning rate** | 5e-6 |
| **LR scheduler** | constant |
| **Beta (KL penalty)** | 0.001 |
| **KL estimator** | kl3 |
| **Advantage normalization** | standard |
| **Samples per prompt (rollout)** | 8 |
| **Unique prompts per rollout** | 32 |
| **Mini batches** | 1 |
| **Epochs per batch** | 1 |
| **Per-device train batch size** | 1 |
| **Temperature** | 1.0 |
| **Seed** | 42 |
| **Async mode** | true |
| **Adam offload** | true |
| **vLLM sync backend** | nccl |

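With 8 samples per prompt and "standard" advantage normalization, GRPO scores each rollout relative to its own group: reward minus the group mean, divided by the group standard deviation. A small sketch of that estimator (illustrative, not the training code; the reward values are made up):

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages with standard (mean/std) normalization."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One group of 8 rollouts for the same prompt (samples_per_prompt = 8)
adv = group_advantages([10.0, 0.0, 10.0, 0.0, 10.0, 10.0, 0.0, 0.0])
```

Above-average samples in a group get positive advantages and below-average ones negative, so the policy gradient pushes toward the better completions for each prompt without a learned value function.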
### Sequence Lengths

| Parameter | Value |
|---|---|
| Max token length | 8192 |
| Max prompt token length | 6144 |
| Response length | 1024 |
| Pack length | 8192 |

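These budgets are mutually consistent: a maximum-length prompt plus a full response fits within both the max token length and the pack length. A quick check:

```python
# Sequence-length budget from the table above
MAX_TOKEN_LEN = 8192
MAX_PROMPT_LEN = 6144
RESPONSE_LEN = 1024
PACK_LEN = 8192

sample_budget = MAX_PROMPT_LEN + RESPONSE_LEN  # worst-case tokens per sample
assert sample_budget <= MAX_TOKEN_LEN
assert sample_budget <= PACK_LEN
```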
### Reward Configuration

| Parameter | Value |
|---|---|
| Verification reward | 10.0 |
| Non-stop penalty | false |
| Gate judge score with format bonus | false |
| Apply paper citation reward | true |
| Paper citation weight | 0.5 |

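One way the rewards above could combine: a verified answer earns the verification reward (10.0), and a paper-citation score in [0, 1] is added with weight 0.5. The exact combination rule lives in the training script; this additive form is an assumption for illustration only.

```python
VERIFICATION_REWARD = 10.0  # from the table above
CITATION_WEIGHT = 0.5       # paper citation weight

def total_reward(verified: bool, citation_score: float) -> float:
    """Hypothetical additive combination of the configured reward terms."""
    reward = VERIFICATION_REWARD if verified else 0.0
    reward += CITATION_WEIGHT * citation_score  # citation reward is enabled
    return reward
```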
### Dataset

- **Training:** `lihaoxin2020/refiner_rl` (split: train)
- **Evaluation:** `lihaoxin2020/refiner_rl` (16 samples, split: test)

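Both splits come from the same dataset repo and can be pulled with the `datasets` library (a sketch; actually fetching the data requires network access):

```python
def load_refiner_rl(split: str = "train"):
    """Load a split of the RL dataset above; requires `datasets` and network access."""
    from datasets import load_dataset

    return load_dataset("lihaoxin2020/refiner_rl", split=split)
```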
### Infrastructure

- **DeepSpeed stage:** 3
- **Learners per node:** 1
- **vLLM engines:** 1
- **vLLM tensor parallel size:** 1
- **vLLM GPU memory utilization:** 0.90
- **Judge model:** `Qwen/Qwen3.5-35B-A3B` (via vLLM)
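The rollout-engine settings above map onto vLLM's offline `LLM` constructor. A hedged sketch (the parameter names are vLLM's; that the engine serves this checkpoint is an assumption, and building it requires vLLM and a GPU):

```python
def make_rollout_engine():
    """Build a single vLLM engine mirroring the settings above."""
    from vllm import LLM

    return LLM(
        model="lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-resume-step100",
        tensor_parallel_size=1,        # vLLM tensor parallel size
        gpu_memory_utilization=0.90,   # vLLM GPU memory utilization
    )
```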