| license | base_model | pipeline_tag | library_name | tags |
|---|---|---|---|---|
| apache-2.0 | lihaoxin2020/qwen3-4b-refiner-gpt54-ep2 | text-generation | transformers | |
# qwen3-4b-refiner-gpt54-instance-rubric-gpt54-grpo-step50
GRPO checkpoint (step 50) of a 4B research-refiner policy.
- Base model: `lihaoxin2020/qwen3-4b-refiner-gpt54-ep2` (SFT on GPT-5.4 rubric-grounded refiner data)
- Algorithm: GRPO (`grpo_fast_refiner_rubric.py` in `dr-tulu/rl/open-instruct`)
- Training step: 50
- Task: Given an agent's reasoning + search-tool output, produce a concise, citation-grounded refined answer.
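GRPO scores each response relative to the other samples drawn for the same prompt, rather than against a learned value function. A minimal sketch of the group-relative advantage (a hypothetical helper for illustration, not code from the training repo):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: each sample's reward minus the group
    mean, divided by the group std. One group = all samples rolled out
    for a single prompt (group size 8 in this run)."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# 8 samples for one prompt, as configured below (group size 8)
advs = grpo_advantages([0.9, 0.4, 0.7, 0.2, 0.8, 0.5, 0.6, 0.3])
```

Advantages within a group sum to (numerically) zero, so only the relative ranking of the 8 samples matters to the policy gradient.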
## Training setup
- Dataset: `lihaoxin2020/rl_hard_gpt5_sft_gpt54rubric` (train split); each row carries per-instance `positive_rubrics` / `negative_rubrics` generated by GPT-5.4.
- Reward: LLM rubric judge + paper-citation quality. For each rollout the judge sees the concatenation of the row's per-instance rubrics and a static V3 rubric set (`Citation Discipline`, `Off-Query Padding`, `Wrong-Entity Confabulation`). Positive rubric scores enter as-is; negative rubric scores are flipped (`1 - s`) so higher = better.
- Judge: `Qwen/Qwen3.5-35B-A3B` served via local vLLM.
- Citation reward: enabled, weight 0.2 (paper-citation precision/recall/F1 from the judge over claim-level snippet attributions).
- KL: `kl3` estimator, `beta = 0.001`.
- Optimizer: AdamW (offload off), `lr = 5e-6` constant, `per_device_train_batch_size = 1`, DeepSpeed ZeRO-3, FlashAttn.
- Rollouts: 32 unique prompts × 8 samples per prompt (group size 8), `temperature = 1.0`.
- Context: `max_prompt_token_length = 6144`, `response_length = 1024`, `max_token_length = 8192`, `pack_length = 8192`.
- Seed: 42.
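The reward shaping described above can be sketched as follows. The negative-rubric flip and the 0.2 citation weight come from the card; the mean aggregation and the additive combination are assumptions for illustration, not the run's exact formula:

```python
def refiner_reward(pos_scores, neg_scores, citation_f1, citation_weight=0.2):
    """Sketch of the rollout reward: positive rubric scores enter as-is,
    negative rubric scores are flipped (1 - s) so higher = better, then a
    weighted citation-F1 term is added. Mean aggregation is an assumption."""
    flipped = [1.0 - s for s in neg_scores]
    all_scores = list(pos_scores) + flipped
    rubric_reward = sum(all_scores) / len(all_scores)
    return rubric_reward + citation_weight * citation_f1

# judge gives 1.0 and 0.5 on positive rubrics, 0.0 on a negative rubric
r = refiner_reward(pos_scores=[1.0, 0.5], neg_scores=[0.0], citation_f1=0.5)
# rubric mean = (1.0 + 0.5 + 1.0) / 3 ≈ 0.833; total ≈ 0.933
```

Note that a negative rubric scored 0.0 (the failure did not occur) contributes 1.0 after flipping, which is the intended "higher = better" orientation.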
## Refiner output format
The model is trained to emit a single `<answer>...</answer>` block. Every factual claim is wrapped in `<snippet id=ID1,ID2,...>claim</snippet>`, citing only IDs that appear in the raw tool output.
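The format above is straightforward to parse downstream. A minimal sketch of extracting claims and filtering citations against the IDs actually present in the tool output (the validation helper is hypothetical, not part of this release):

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
SNIPPET_RE = re.compile(r"<snippet id=([\w,]+)>(.*?)</snippet>", re.DOTALL)

def extract_claims(text, valid_ids):
    """Parse the <answer> block and return (claim, ids) pairs,
    keeping only cited IDs that appear in the raw tool output."""
    m = ANSWER_RE.search(text)
    if not m:
        return []
    claims = []
    for ids, claim in SNIPPET_RE.findall(m.group(1)):
        cited = [i for i in ids.split(",") if i in valid_ids]
        claims.append((claim.strip(), cited))
    return claims

out = "<answer><snippet id=S1,S9>Qwen3 is a 4B model.</snippet></answer>"
claims = extract_claims(out, valid_ids={"S1", "S2"})
# → [("Qwen3 is a 4B model.", ["S1"])]   S9 is dropped: not in tool output
```

Claim-level precision/recall over these `(claim, ids)` pairs is the shape of signal the citation reward described above operates on.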
## Intended use
This is a step-50 intermediate checkpoint, useful as an early-training baseline for ablations on rubric design, citation weighting, or judge choice. For a final-quality refiner, prefer a later step from the same run or the successor run that drops static rubrics.
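Since the card lists `library_name: transformers`, the checkpoint should load through the standard causal-LM API. A minimal usage sketch (generation settings mirror training; the prompt placeholder is yours to fill):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "lihaoxin2020/qwen3-4b-refiner-gpt54-instance-rubric-gpt54-grpo-step50"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype="auto", device_map="auto"
)

# Input = agent reasoning + raw search-tool output, per the task description
prompt = "..."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=1024,   # matches response_length used in training
    do_sample=True,
    temperature=1.0,
)
new_tokens = out[0][inputs["input_ids"].shape[-1]:]
print(tok.decode(new_tokens, skip_special_tokens=True))
```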
## Reproduction
Training script: `train_dr_tulu_refiner-rubric.sh` (repo-internal). Relevant flags:

```shell
--verification_reward 10.0
--non_stop_penalty false
--refiner_reward_gate_judge_score_with_format_bonus false
--refiner_reward_apply_paper_citation_reward true
--refiner_reward_paper_citation_weight 0.2
--kl_estimator kl3 --beta 0.001
--num_unique_prompts_rollout 32 --num_samples_per_prompt_rollout 8
--learning_rate 5e-6 --lr_scheduler_type constant
--max_prompt_token_length 6144 --response_length 1024 --pack_length 8192
--temperature 1.0 --seed 42
--deepspeed_stage 3
```
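`--kl_estimator kl3` selects the k3 KL estimator popularized by Schulman's "Approximating KL Divergence" note. A sketch of the per-token computation (my reading of the flag, not code lifted from the repo):

```python
import math

def kl3(logp_policy, logp_ref):
    """k3 estimator of KL(policy || ref) for one sampled token:
    with r = exp(logp_ref - logp_policy), k3 = (r - 1) - log r.
    Non-negative for every sample and lower variance than -log r."""
    r = math.exp(logp_ref - logp_policy)
    return (r - 1.0) - math.log(r)

print(kl3(-1.0, -1.0))  # 0.0 when policy and reference agree
k = kl3(-1.0, -1.5)     # policy more likely than reference → positive penalty
```

The estimate is scaled by `beta = 0.001` before being subtracted from the reward, keeping the policy close to the SFT base model.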
## Known notes
- This run was terminated before completion at step ~100 due to a disk-quota issue on the training host; step 50 is a healthy intermediate checkpoint.
- A later iteration of the training script introduced a `--refiner_reward_instance_rubrics_only` flag that drops the static V3 rubrics from both reward and logging. This checkpoint was produced before that flag existed, so both per-instance and static V3 rubrics shaped its reward signal.