ModelHub XC 57cd3d9962 Initial project commit; model provided by the ModelHub XC community
Model: lihaoxin2020/qwen3-4b-refiner-gpt54-instance-rubric-gpt54-grpo-step50
Source: Original Platform
2026-05-09 20:28:21 +08:00

---
license: apache-2.0
base_model: lihaoxin2020/qwen3-4b-refiner-gpt54-ep2
pipeline_tag: text-generation
library_name: transformers
tags:
- grpo
- rlhf
- refiner
- qwen3
- open-instruct
---

qwen3-4b-refiner-gpt54-instance-rubric-gpt54-grpo-step50

GRPO checkpoint (step 50) of a 4B research-refiner policy.

Training setup

  • Dataset: lihaoxin2020/rl_hard_gpt5_sft_gpt54rubric (train split) — each row carries per-instance positive_rubrics / negative_rubrics generated by GPT-5.4.
  • Reward = LLM rubric judge + paper-citation quality. For each rollout the judge sees the concatenation of the row's per-instance rubrics and a static V3 rubric set (Citation Discipline, Off-Query Padding, Wrong-Entity Confabulation). Positive rubric scores enter as-is; negative rubrics are flipped (1 - s) so higher = better.
  • Judge: Qwen/Qwen3.5-35B-A3B served via local vLLM.
  • Citation reward: enabled, weight 0.2 (paper-citation precision/recall/F1 from the judge over claim-level snippet attributions).
  • KL: kl3 estimator, beta = 0.001.
  • Optimizer: AdamW (offload off), lr = 5e-6 constant, per_device_train_batch_size = 1, DeepSpeed ZeRO-3, FlashAttn.
  • Rollouts: 32 unique prompts × 8 samples per prompt (group size 8), temperature = 1.0.
  • Context: max_prompt_token_length = 6144, response_length = 1024, max_token_length = 8192, pack_length = 8192.
  • Seed: 42.
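The reward shaping above (positive rubric scores taken as-is, negative rubrics flipped to 1 - s, plus a paper-citation term at weight 0.2) can be sketched as follows. The function names, the plain averaging over rubrics, and the additive mixing of the citation term are illustrative assumptions, not the actual open-instruct implementation.

```python
def combine_rubric_scores(positive_scores, negative_scores):
    """Judge scores lie in [0, 1]; negative rubrics are flipped so higher = better."""
    flipped = [1.0 - s for s in negative_scores]
    all_scores = list(positive_scores) + flipped
    return sum(all_scores) / len(all_scores)

def rollout_reward(positive_scores, negative_scores, citation_f1, citation_weight=0.2):
    """Mix the rubric-judge score with the claim-level paper-citation F1.

    Whether the citation term is added on top (as here) or averaged in is an
    assumption of this sketch; only the 0.2 weight is documented for this run.
    """
    rubric = combine_rubric_scores(positive_scores, negative_scores)
    return rubric + citation_weight * citation_f1
```

Under this sketch, a rollout that fully satisfies its positive rubrics and triggers no negative ones scores 1.0 on the rubric component before the citation bonus.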

Refiner output format

The model is trained to emit a single <answer>...</answer> block. Every factual claim is wrapped in <snippet id=ID1,ID2,...>claim</snippet>, citing only IDs that appear in the raw tool output.
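A minimal validity check for this output format might look like the following; the regex-based parsing and the helper name are illustrative, not part of the training code.

```python
import re

def check_refiner_output(text, valid_ids):
    """Verify a single <answer> block whose <snippet> tags cite only known IDs."""
    answers = re.findall(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if len(answers) != 1:
        return False
    # Each snippet carries a comma-separated ID list: <snippet id=ID1,ID2,...>
    id_lists = re.findall(r"<snippet id=([^>]+)>.*?</snippet>", answers[0], re.DOTALL)
    cited = {i.strip() for ids in id_lists for i in ids.split(",")}
    return cited <= set(valid_ids)
```

For example, an answer citing only IDs present in the raw tool output passes, while one citing an unknown ID, or emitting zero or multiple `<answer>` blocks, fails.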

Intended use

This is a step-50 intermediate checkpoint, useful as an early-training baseline for ablations on rubric design, citation weighting, or judge choice. For a final-quality refiner, prefer a later step from the same run or the successor run that drops static rubrics.

Reproduction

Training script: train_dr_tulu_refiner-rubric.sh (repo-internal). Relevant flags:

--verification_reward 10.0
--non_stop_penalty false
--refiner_reward_gate_judge_score_with_format_bonus false
--refiner_reward_apply_paper_citation_reward true
--refiner_reward_paper_citation_weight 0.2
--kl_estimator kl3 --beta 0.001
--num_unique_prompts_rollout 32 --num_samples_per_prompt_rollout 8
--learning_rate 5e-6 --lr_scheduler_type constant
--max_prompt_token_length 6144 --response_length 1024 --pack_length 8192
--temperature 1.0 --seed 42
--deepspeed_stage 3
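The kl3 estimator referenced by `--kl_estimator kl3` follows the standard low-variance "k3" form: with per-token log-ratio u = log pi_ref(x) - log pi(x) for a sample x drawn from the policy pi, the estimate is exp(u) - 1 - u, which is always non-negative and unbiased for KL(pi || pi_ref). A minimal sketch (sign conventions in open-instruct may differ):

```python
import math

def kl3_estimate(logprob, ref_logprob):
    """k3 estimator of KL(pi || pi_ref) from a single sample drawn from pi.

    u = ref_logprob - logprob; exp(u) - 1 - u is >= 0 for all u
    (it is the Bregman divergence of exp at 0) and unbiased in expectation.
    """
    u = ref_logprob - logprob
    return math.exp(u) - 1.0 - u
```

In training, this per-token estimate is scaled by beta (0.001 in this run) and subtracted from the reward.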

Known notes

  • This run was terminated early, around step 100, due to a disk-quota issue on the training host; step 50 is a healthy intermediate checkpoint.
  • A later iteration of the training script introduced a --refiner_reward_instance_rubrics_only flag that drops the static V3 rubrics from both reward and logging. This checkpoint was produced before that flag existed, so both per-instance and static V3 rubrics shaped its reward signal.