---
license: apache-2.0
base_model: lihaoxin2020/qwen3-4b-refiner-gpt54-ep2
pipeline_tag: text-generation
library_name: transformers
tags:
- grpo
- rlhf
- refiner
- qwen3
- open-instruct
---

# qwen3-4b-refiner-gpt54-instance-rubric-gpt54-grpo-step50

GRPO checkpoint (step 50) of a 4B research-refiner policy.

- **Base model:** [`lihaoxin2020/qwen3-4b-refiner-gpt54-ep2`](https://huggingface.co/lihaoxin2020/qwen3-4b-refiner-gpt54-ep2) (SFT on GPT-5.4 rubric-grounded refiner data)
- **Algorithm:** GRPO (`grpo_fast_refiner_rubric.py` in [`dr-tulu/rl/open-instruct`](https://github.com/))
- **Training step:** 50
- **Task:** given an agent's reasoning plus search-tool output, produce a concise, citation-grounded refined answer.

## Training setup

- **Dataset:** `lihaoxin2020/rl_hard_gpt5_sft_gpt54rubric` (train split) — each row carries per-instance `positive_rubrics` / `negative_rubrics` generated by GPT-5.4.
- **Reward:** LLM rubric judge + paper-citation quality. For each rollout the judge sees the concatenation of the row's per-instance rubrics and a static V3 rubric set (`Citation Discipline`, `Off-Query Padding`, `Wrong-Entity Confabulation`). Positive rubric scores enter as-is; negative rubric scores are flipped (`1 - s`) so that higher is always better.
- **Judge:** `Qwen/Qwen3.5-35B-A3B` served via local vLLM.
- **Citation reward:** enabled, weight 0.2 (paper-citation precision/recall/F1 from the judge over claim-level snippet attributions).
- **KL:** `kl3` estimator, `beta = 0.001`.
- **Optimizer:** AdamW (offload off), `lr = 5e-6` constant, `per_device_train_batch_size = 1`, DeepSpeed ZeRO-3, FlashAttention.
- **Rollouts:** 32 unique prompts × 8 samples per prompt (group size 8), `temperature = 1.0`.
- **Context:** `max_prompt_token_length = 6144`, `response_length = 1024`, `max_token_length = 8192`, `pack_length = 8192`.
- **Seed:** 42.

## Refiner output format

The model is trained to emit a single `...` block.
Every factual claim is wrapped in `claim`, citing only IDs that appear in the raw tool output.

## Intended use

This is a **step-50 intermediate** checkpoint, useful as an early-training baseline for ablations on rubric design, citation weighting, or judge choice. For a final-quality refiner, prefer a later step from the same run, or the successor run that drops the static rubrics.

## Reproduction

Training script: [`train_dr_tulu_refiner-rubric.sh`](https://github.com/) (repo-internal). Relevant flags:

```
--verification_reward 10.0
--non_stop_penalty false
--refiner_reward_gate_judge_score_with_format_bonus false
--refiner_reward_apply_paper_citation_reward true
--refiner_reward_paper_citation_weight 0.2
--kl_estimator kl3
--beta 0.001
--num_unique_prompts_rollout 32
--num_samples_per_prompt_rollout 8
--learning_rate 5e-6
--lr_scheduler_type constant
--max_prompt_token_length 6144
--response_length 1024
--pack_length 8192
--temperature 1.0
--seed 42
--deepspeed_stage 3
```

## Known notes

- This run was terminated early, at roughly step 100, due to a disk-quota issue on the training host; step 50 is a healthy intermediate checkpoint.
- A later iteration of the training script introduced a `--refiner_reward_instance_rubrics_only` flag that drops the static V3 rubrics from both reward and logging. **This checkpoint was produced before that flag existed**, so both per-instance and static V3 rubrics shaped its reward signal.
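## Reward sketch

To make the reward shaping above concrete, here is a minimal, illustrative sketch of how the pieces could combine: positive rubric scores enter as-is, negative rubric scores are flipped (`1 - s`), and a claim-level citation F1 term is added at weight 0.2. This is an assumption-laden paraphrase, not the actual open-instruct implementation — the function names, the simple averaging, and the additive combination are all hypothetical.

```python
from typing import List

CITATION_WEIGHT = 0.2  # mirrors --refiner_reward_paper_citation_weight


def combine_rubric_scores(positive: List[float], negative: List[float]) -> float:
    """Average judge scores in [0, 1]: positives as-is, negatives flipped (1 - s)."""
    scores = list(positive) + [1.0 - s for s in negative]
    return sum(scores) / len(scores) if scores else 0.0


def citation_f1(precision: float, recall: float) -> float:
    """Harmonic mean of claim-level citation precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def refiner_reward(pos: List[float], neg: List[float],
                   cite_p: float, cite_r: float) -> float:
    """Hypothetical total: rubric score plus weighted citation F1."""
    return combine_rubric_scores(pos, neg) + CITATION_WEIGHT * citation_f1(cite_p, cite_r)


# Example: perfect positive rubric, fully avoided negative rubric,
# perfect citations -> 1.0 rubric reward + 0.2 citation bonus.
print(refiner_reward([1.0], [0.0], cite_p=1.0, cite_r=1.0))  # -> 1.2
```

The sketch keeps the rubric term in `[0, 1]` and treats the citation term as a small additive bonus, which matches the stated 0.2 weight; the real script may normalize or gate these terms differently.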