| license | base_model | pipeline_tag | library_name | tags |
|---|---|---|---|---|
| apache-2.0 | lihaoxin2020/qwen3-4b-refiner-gpt54-ep2 | text-generation | transformers | |
# qwen3-4b-refiner-gpt54-instance-rubric-gpt54-grpo-step50
GRPO checkpoint (step 50) of a 4B research-refiner policy.
- Base model: `lihaoxin2020/qwen3-4b-refiner-gpt54-ep2` (SFT on GPT-5.4 rubric-grounded refiner data)
- Algorithm: GRPO (`grpo_fast_refiner_rubric.py` in `dr-tulu/rl/open-instruct`)
- Training step: 50
- Task: Given an agent's reasoning + search-tool output, produce a concise, citation-grounded refined answer.
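GRPO scores each response relative to the other samples drawn for the same prompt, rather than against a learned value function. A minimal sketch of the group-relative advantage (a hypothetical helper for illustration, not code from the training repo):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: each sample's reward minus the group
    mean, divided by the group std. One group = all samples rolled out
    for a single prompt (group size 8 in this run)."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# 8 samples for one prompt, as configured below (group size 8)
advs = grpo_advantages([0.9, 0.4, 0.7, 0.2, 0.8, 0.5, 0.6, 0.3])
```

Advantages within a group sum to (numerically) zero, so only the relative ranking of the 8 samples matters to the policy gradient.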
## Training setup
- Dataset: `lihaoxin2020/rl_hard_gpt5_sft_gpt54rubric` (train split); each row carries per-instance `positive_rubrics` / `negative_rubrics` generated by GPT-5.4.
- Reward: LLM rubric judge + paper-citation quality. For each rollout the judge sees the concatenation of the row's per-instance rubrics and a static V3 rubric set (`Citation Discipline`, `Off-Query Padding`, `Wrong-Entity Confabulation`). Positive rubric scores enter as-is; negative rubric scores are flipped (`1 - s`) so higher = better.
- Judge: `Qwen/Qwen3.5-35B-A3B` served via local vLLM.
- Citation reward: enabled, weight 0.2 (paper-citation precision/recall/F1 from the judge over claim-level snippet attributions).
- KL: `kl3` estimator, `beta = 0.001`.
- Optimizer: AdamW (offload off), `lr = 5e-6` constant, `per_device_train_batch_size = 1`, DeepSpeed ZeRO-3, FlashAttn.
- Rollouts: 32 unique prompts × 8 samples per prompt (group size 8), `temperature = 1.0`.
- Context: `max_prompt_token_length = 6144`, `response_length = 1024`, `max_token_length = 8192`, `pack_length = 8192`.
- Seed: 42.
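The reward shaping described above can be sketched as follows. The negative-rubric flip and the 0.2 citation weight come from the card; the mean aggregation and the additive combination are assumptions for illustration, not the run's exact formula:

```python
def refiner_reward(pos_scores, neg_scores, citation_f1, citation_weight=0.2):
    """Sketch of the rollout reward: positive rubric scores enter as-is,
    negative rubric scores are flipped (1 - s) so higher = better, then a
    weighted citation-F1 term is added. Mean aggregation is an assumption."""
    flipped = [1.0 - s for s in neg_scores]
    all_scores = list(pos_scores) + flipped
    rubric_reward = sum(all_scores) / len(all_scores)
    return rubric_reward + citation_weight * citation_f1

# judge gives 1.0 and 0.5 on positive rubrics, 0.0 on a negative rubric
r = refiner_reward(pos_scores=[1.0, 0.5], neg_scores=[0.0], citation_f1=0.5)
# rubric mean = (1.0 + 0.5 + 1.0) / 3 ≈ 0.833; total ≈ 0.933
```

Note that a negative rubric scored 0.0 (the failure did not occur) contributes 1.0 after flipping, which is the intended "higher = better" orientation.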
## Refiner output format
The model is trained to emit a single `<answer>...</answer>` block. Every factual claim is wrapped in `<snippet id=ID1,ID2,...>claim</snippet>`, citing only IDs that appear in the raw tool output.
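The format above is straightforward to parse downstream. A minimal sketch of extracting claims and filtering citations against the IDs actually present in the tool output (the validation helper is hypothetical, not part of this release):

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
SNIPPET_RE = re.compile(r"<snippet id=([\w,]+)>(.*?)</snippet>", re.DOTALL)

def extract_claims(text, valid_ids):
    """Parse the <answer> block and return (claim, ids) pairs,
    keeping only cited IDs that appear in the raw tool output."""
    m = ANSWER_RE.search(text)
    if not m:
        return []
    claims = []
    for ids, claim in SNIPPET_RE.findall(m.group(1)):
        cited = [i for i in ids.split(",") if i in valid_ids]
        claims.append((claim.strip(), cited))
    return claims

out = "<answer><snippet id=S1,S9>Qwen3 is a 4B model.</snippet></answer>"
claims = extract_claims(out, valid_ids={"S1", "S2"})
# → [("Qwen3 is a 4B model.", ["S1"])]   S9 is dropped: not in tool output
```

Claim-level precision/recall over these `(claim, ids)` pairs is the shape of signal the citation reward described above operates on.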
## Intended use
This is a step-50 intermediate checkpoint, useful as an early-training baseline for ablations on rubric design, citation weighting, or judge choice. For a final-quality refiner, prefer a later step from the same run or the successor run that drops static rubrics.
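Since the card lists `library_name: transformers`, the checkpoint should load through the standard causal-LM API. A minimal usage sketch (generation settings mirror training; the prompt placeholder is yours to fill):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "lihaoxin2020/qwen3-4b-refiner-gpt54-instance-rubric-gpt54-grpo-step50"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype="auto", device_map="auto"
)

# Input = agent reasoning + raw search-tool output, per the task description
prompt = "..."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=1024,   # matches response_length used in training
    do_sample=True,
    temperature=1.0,
)
new_tokens = out[0][inputs["input_ids"].shape[-1]:]
print(tok.decode(new_tokens, skip_special_tokens=True))
```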
## Reproduction
Training script: `train_dr_tulu_refiner-rubric.sh` (repo-internal). Relevant flags:

```shell
--verification_reward 10.0
--non_stop_penalty false
--refiner_reward_gate_judge_score_with_format_bonus false
--refiner_reward_apply_paper_citation_reward true
--refiner_reward_paper_citation_weight 0.2
--kl_estimator kl3 --beta 0.001
--num_unique_prompts_rollout 32 --num_samples_per_prompt_rollout 8
--learning_rate 5e-6 --lr_scheduler_type constant
--max_prompt_token_length 6144 --response_length 1024 --pack_length 8192
--temperature 1.0 --seed 42
--deepspeed_stage 3
```
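`--kl_estimator kl3` selects the k3 KL estimator popularized by Schulman's "Approximating KL Divergence" note. A sketch of the per-token computation (my reading of the flag, not code lifted from the repo):

```python
import math

def kl3(logp_policy, logp_ref):
    """k3 estimator of KL(policy || ref) for one sampled token:
    with r = exp(logp_ref - logp_policy), k3 = (r - 1) - log r.
    Non-negative for every sample and lower variance than -log r."""
    r = math.exp(logp_ref - logp_policy)
    return (r - 1.0) - math.log(r)

print(kl3(-1.0, -1.0))  # 0.0 when policy and reference agree
k = kl3(-1.0, -1.5)     # policy more likely than reference → positive penalty
```

The estimate is scaled by `beta = 0.001` before being subtracted from the reward, keeping the policy close to the SFT base model.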
## Known notes
- This run was terminated before completion at step ~100 due to a disk-quota issue on the training host; step 50 is a healthy intermediate checkpoint.
- A later iteration of the training script introduced a `--refiner_reward_instance_rubrics_only` flag that drops the static V3 rubrics from both reward and logging. This checkpoint was produced before that flag existed, so both per-instance and static V3 rubrics shaped its reward signal.