---
license: apache-2.0
base_model: lihaoxin2020/qwen3-4b-refiner-gpt54-ep2
pipeline_tag: text-generation
library_name: transformers
tags:
- grpo
- rlhf
- refiner
- qwen3
- open-instruct
---
# qwen3-4b-refiner-gpt54-instance-rubric-gpt54-grpo-step50
GRPO checkpoint (step 50) of a 4B research-refiner policy.
- **Base model:** [`lihaoxin2020/qwen3-4b-refiner-gpt54-ep2`](https://huggingface.co/lihaoxin2020/qwen3-4b-refiner-gpt54-ep2) (SFT on GPT-5.4 rubric-grounded refiner data)
- **Algorithm:** GRPO (`grpo_fast_refiner_rubric.py` in [`dr-tulu/rl/open-instruct`](https://github.com/))
- **Training step:** 50
- **Task:** Given an agent's reasoning + search-tool output, produce a concise, citation-grounded refined answer.
## Training setup
- **Dataset:** `lihaoxin2020/rl_hard_gpt5_sft_gpt54rubric` (train split) — each row carries per-instance `positive_rubrics` / `negative_rubrics` generated by GPT-5.4.
- **Reward = LLM rubric judge + paper-citation quality.** For each rollout, the judge sees the concatenation of the row's per-instance rubrics and a static V3 rubric set (`Citation Discipline`, `Off-Query Padding`, `Wrong-Entity Confabulation`). Positive rubric scores enter as-is; negative rubric scores are flipped (`1 - s`) so that higher is always better.
- **Judge:** `Qwen/Qwen3.5-35B-A3B` served via local vLLM.
- **Citation reward:** enabled, weight 0.2 (paper-citation precision/recall/F1 from the judge over claim-level snippet attributions).
- **KL:** `kl3` estimator, `beta = 0.001`.
- **Optimizer:** AdamW (offload off), `lr = 5e-6` constant, `per_device_train_batch_size = 1`, DeepSpeed ZeRO-3, FlashAttn.
- **Rollouts:** 32 unique prompts × 8 samples per prompt (group size 8), `temperature = 1.0`.
- **Context:** `max_prompt_token_length = 6144`, `response_length = 1024`, `max_token_length = 8192`, `pack_length = 8192`.
- **Seed:** 42.
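As a rough illustration of the reward shaping above, the sketch below combines per-rubric judge scores (positives as-is, negatives flipped to `1 - s`) with the citation reward at weight 0.2. The exact aggregation used by open-instruct (mean vs. sum, gating, normalization) is internal to the training script, so the mean-then-add form here is an assumption:

```python
def combined_reward(pos_scores, neg_scores, citation_f1, citation_weight=0.2):
    """Illustrative reward for one rollout (aggregation form is an assumption).

    pos_scores: judge scores in [0, 1] for positive rubrics, used as-is.
    neg_scores: judge scores in [0, 1] for negative rubrics, flipped (1 - s)
                so that a high combined score always means "better".
    citation_f1: paper-citation F1 from the judge, added with weight 0.2.
    """
    scores = list(pos_scores) + [1.0 - s for s in neg_scores]
    rubric_score = sum(scores) / len(scores) if scores else 0.0
    return rubric_score + citation_weight * citation_f1
```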
## Refiner output format
The model is trained to emit a single `<answer>...</answer>` block. Every factual claim is wrapped in `<snippet id=ID1,ID2,...>claim</snippet>`, citing only IDs that appear in the raw tool output.
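A minimal sketch of how a consumer might validate this format, checking that the output contains a single `<answer>` block and that every cited snippet ID appears in the raw tool output. The function name and return shape are illustrative, not part of the training code:

```python
import re

def check_citations(answer: str, valid_ids: set[str]) -> tuple[bool, set[str]]:
    """Validate a refiner output of the form
    <answer>... <snippet id=ID1,ID2,...>claim</snippet> ...</answer>.

    Returns (ok, unknown_ids): ok is False if the <answer> block is missing
    or any cited ID does not appear in the raw tool output (valid_ids).
    """
    body = re.search(r"<answer>(.*)</answer>", answer, re.DOTALL)
    if body is None:
        return False, set()
    cited: set[str] = set()
    for id_list in re.findall(r"<snippet id=([^>]+)>", body.group(1)):
        cited.update(i.strip() for i in id_list.split(","))
    unknown = cited - valid_ids
    return not unknown, unknown
```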
## Intended use
This is a **step-50 intermediate** checkpoint, useful as an early-training baseline for ablations on rubric design, citation weighting, or judge choice. For a final-quality refiner, prefer a later step from the same run or the successor run that drops static rubrics.
## Reproduction
Training script: [`train_dr_tulu_refiner-rubric.sh`](https://github.com/) (repo-internal). Relevant flags:
```
--verification_reward 10.0
--non_stop_penalty false
--refiner_reward_gate_judge_score_with_format_bonus false
--refiner_reward_apply_paper_citation_reward true
--refiner_reward_paper_citation_weight 0.2
--kl_estimator kl3 --beta 0.001
--num_unique_prompts_rollout 32 --num_samples_per_prompt_rollout 8
--learning_rate 5e-6 --lr_scheduler_type constant
--max_prompt_token_length 6144 --response_length 1024 --pack_length 8192
--temperature 1.0 --seed 42
--deepspeed_stage 3
```
## Notes
- This run was terminated before completion (at roughly step 100) due to a disk-quota issue on the training host; step 50 is a healthy intermediate checkpoint.
- A later iteration of the training script introduced a `--refiner_reward_instance_rubrics_only` flag that drops the static V3 rubrics from both reward and logging. **This checkpoint was produced before that flag existed**, so both per-instance and static V3 rubrics shaped its reward signal.