---
library_name: transformers
base_model: lihaoxin2020/qwen3-4B-refiner-sft-step-3201
model_type: qwen3
license: apache-2.0
tags:
- refiner
- grpo
- rl
- qwen3
- rubric
---

# qwen3-4B-refiner-rubric-rl-step50

This model is a **GRPO-trained** checkpoint (step 50) of the Qwen3-4B refiner, fine-tuned with reinforcement learning using **per-instance rubric rewards**.

- **Experiment name:** `instance-rubric-rl-5e-6-answer_only`

## Training Details

- **Base model:** `lihaoxin2020/qwen3-4B-refiner-sft-step-3201`
- **Training method:** GRPO (Group Relative Policy Optimization) with DeepSpeed Stage 3
- **Refiner mode:** `answer_only`
- **Training script:** `open_instruct/grpo_fast_refiner_rubric.py`

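GRPO's defining step is computing each rollout's advantage relative to the other rollouts of the same prompt, rather than against a learned value baseline. A minimal sketch of that group normalization (illustrative only, not the training code; the run uses "standard" advantage normalization and 8 samples per prompt):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each rollout's reward against
    the mean and std of its own prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# e.g. rewards for 8 rollouts of one prompt (samples_per_prompt = 8 in this run)
advantages = group_relative_advantages([10.0, 0.0, 10.0, 0.0, 0.0, 10.0, 0.0, 0.0])
```

Rollouts scoring above their group's mean get positive advantages and are reinforced; the rest are pushed down, so no separate critic is needed.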
### Hyperparameters
| Parameter | Value |
|---|---|
| **Learning rate** | 5e-6 |
| **LR scheduler** | constant |
| **Beta (KL penalty)** | 0.001 |
| **KL estimator** | kl3 |
| **Advantage normalization** | standard |
| **Samples per prompt (rollout)** | 8 |
| **Unique prompts per rollout** | 32 |
| **Mini batches** | 1 |
| **Epochs per batch** | 1 |
| **Per-device train batch size** | 1 |
| **Temperature** | 1.0 |
| **Seed** | 42 |
| **Async mode** | true |
| **Adam offload** | true |
| **vLLM sync backend** | nccl |

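The `kl3` estimator is commonly implemented in GRPO/PPO-style trainers as `exp(l) - l - 1` with `l` the log-ratio between the reference and policy log-probabilities; it is non-negative and lower-variance than the naive log-ratio estimator. A sketch of the per-token penalty under that assumption:

```python
import math

def kl3(logprob, ref_logprob):
    """k3 KL estimator: with l = ref_logprob - logprob, returns
    exp(l) - l - 1, which is >= 0 and zero iff the policy matches
    the reference on this token."""
    l = ref_logprob - logprob
    return math.exp(l) - l - 1.0

# per-token penalty with this run's beta = 0.001
penalty = 0.001 * kl3(logprob=-2.0, ref_logprob=-1.5)
```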
### Sequence Lengths
| Parameter | Value |
|---|---|
| Max token length | 8192 |
| Max prompt token length | 6144 |
| Response length | 1024 |
| Pack length | 8192 |

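These budgets are mutually consistent: a maximum-length prompt plus a full-length response fits inside one packed sequence. A quick arithmetic check using the values from the table:

```python
# values from the table above
MAX_TOKEN_LENGTH = 8192
MAX_PROMPT_TOKEN_LENGTH = 6144
RESPONSE_LENGTH = 1024
PACK_LENGTH = 8192

# a full prompt plus a full response must fit in one packed sequence
worst_case = MAX_PROMPT_TOKEN_LENGTH + RESPONSE_LENGTH
assert worst_case <= PACK_LENGTH <= MAX_TOKEN_LENGTH
```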
### Reward Configuration
| Parameter | Value |
|---|---|
| Verification reward | 10.0 |
| Non-stop penalty | false |
| Gate judge score with format bonus | **true** |
| Apply paper citation reward | true |
| Paper citation weight | **0.2** |

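One plausible reading of these flags (purely illustrative; the function name and composition below are assumptions, and the authoritative logic lives in `open_instruct/grpo_fast_refiner_rubric.py`): the rubric judge's score only counts when the output passes the format check, and the paper-citation score is mixed in with weight 0.2.

```python
def total_reward(judge_score, format_ok, citation_score,
                 verification_reward=10.0, citation_weight=0.2):
    """Hypothetical composition of the flags above: the judge score,
    scaled by the verification reward, is gated on a valid format,
    then the citation score is added with a 0.2 weight."""
    gated = judge_score * verification_reward if format_ok else 0.0
    return gated + citation_weight * citation_score

r_good = total_reward(judge_score=1.0, format_ok=True, citation_score=0.5)
r_bad_format = total_reward(judge_score=1.0, format_ok=False, citation_score=0.5)
```

Gating on format keeps the policy from harvesting judge reward with malformed outputs, while the small citation weight nudges it toward grounded answers without dominating the rubric signal.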
### Dataset
- **Training:** `lihaoxin2020/rl_hard_gpt5_sft` (split: train)
- **Evaluation:** `lihaoxin2020/rl_hard_gpt5_sft` (16 samples, split: train)

### Infrastructure
- **DeepSpeed stage:** 3
- **Learners per node:** 1
- **vLLM engines:** 1
- **vLLM tensor parallel size:** 1
- **vLLM GPU memory utilization:** 0.90
- **Judge model:** `Qwen/Qwen3.5-35B-A3B` (via vLLM)