---
license: apache-2.0
base_model: lihaoxin2020/qwen3-4b-refiner-gpt54-ep2
pipeline_tag: text-generation
library_name: transformers
tags:
- grpo
- rlhf
- refiner
- qwen3
- open-instruct
---

# qwen3-4b-refiner-gpt54-instance-rubric-gpt54-grpo-step50

GRPO checkpoint (step 50) of a 4B research-refiner policy.

- **Base model:** [`lihaoxin2020/qwen3-4b-refiner-gpt54-ep2`](https://huggingface.co/lihaoxin2020/qwen3-4b-refiner-gpt54-ep2) (SFT on GPT-5.4 rubric-grounded refiner data)
- **Algorithm:** GRPO (`grpo_fast_refiner_rubric.py` in [`dr-tulu/rl/open-instruct`](https://github.com/))
- **Training step:** 50
- **Task:** given an agent's reasoning plus search-tool output, produce a concise, citation-grounded refined answer.

## Training setup

- **Dataset:** `lihaoxin2020/rl_hard_gpt5_sft_gpt54rubric` (train split) — each row carries per-instance `positive_rubrics` / `negative_rubrics` generated by GPT-5.4.
- **Reward:** LLM rubric judge + paper-citation quality. For each rollout the judge sees the concatenation of the row's per-instance rubrics and a static V3 rubric set (`Citation Discipline`, `Off-Query Padding`, `Wrong-Entity Confabulation`). Positive rubric scores enter as-is; negative rubric scores are flipped (`1 - s`) so that higher is always better.
- **Judge:** `Qwen/Qwen3.5-35B-A3B` served via local vLLM.
- **Citation reward:** enabled, weight 0.2 (paper-citation precision/recall/F1 from the judge over claim-level snippet attributions).
- **KL:** `kl3` estimator, `beta = 0.001`.
- **Optimizer:** AdamW (offload off), `lr = 5e-6` constant, `per_device_train_batch_size = 1`, DeepSpeed ZeRO-3, FlashAttention.
- **Rollouts:** 32 unique prompts × 8 samples per prompt (group size 8), `temperature = 1.0`.
- **Context:** `max_prompt_token_length = 6144`, `response_length = 1024`, `max_token_length = 8192`, `pack_length = 8192`.
- **Seed:** 42.

## Refiner output format

The model is trained to emit a single `...` block.
Every factual claim is wrapped in `claim`, citing only IDs that appear in the raw tool output.

## Intended use

This is a **step-50 intermediate** checkpoint, useful as an early-training baseline for ablations on rubric design, citation weighting, or judge choice. For a final-quality refiner, prefer a later step from the same run, or the successor run that drops the static rubrics.

## Reproduction

Training script: [`train_dr_tulu_refiner-rubric.sh`](https://github.com/) (repo-internal). Relevant flags:

```
--verification_reward 10.0
--non_stop_penalty false
--refiner_reward_gate_judge_score_with_format_bonus false
--refiner_reward_apply_paper_citation_reward true
--refiner_reward_paper_citation_weight 0.2
--kl_estimator kl3
--beta 0.001
--num_unique_prompts_rollout 32
--num_samples_per_prompt_rollout 8
--learning_rate 5e-6
--lr_scheduler_type constant
--max_prompt_token_length 6144
--response_length 1024
--pack_length 8192
--temperature 1.0
--seed 42
--deepspeed_stage 3
```

## Known notes

- This run was terminated early, at roughly step 100, due to a disk-quota issue on the training host; step 50 is a healthy intermediate checkpoint.
- A later iteration of the training script introduced a `--refiner_reward_instance_rubrics_only` flag that drops the static V3 rubrics from both reward and logging. **This checkpoint was produced before that flag existed**, so both per-instance and static V3 rubrics shaped its reward signal.
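## Reward sketch

To make the reward shaping above concrete, here is a minimal, illustrative sketch of how the pieces could combine: positive rubric scores enter as-is, negative rubric scores are flipped (`1 - s`), and a claim-level citation F1 term is added at weight 0.2. This is an assumption-laden paraphrase, not the actual open-instruct implementation — the function names, the simple averaging, and the additive combination are all hypothetical.

```python
from typing import List

CITATION_WEIGHT = 0.2  # mirrors --refiner_reward_paper_citation_weight


def combine_rubric_scores(positive: List[float], negative: List[float]) -> float:
    """Average judge scores in [0, 1]: positives as-is, negatives flipped (1 - s)."""
    scores = list(positive) + [1.0 - s for s in negative]
    return sum(scores) / len(scores) if scores else 0.0


def citation_f1(precision: float, recall: float) -> float:
    """Harmonic mean of claim-level citation precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def refiner_reward(pos: List[float], neg: List[float],
                   cite_p: float, cite_r: float) -> float:
    """Hypothetical total: rubric score plus weighted citation F1."""
    return combine_rubric_scores(pos, neg) + CITATION_WEIGHT * citation_f1(cite_p, cite_r)


# Example: perfect positive rubric, fully avoided negative rubric,
# perfect citations -> 1.0 rubric reward + 0.2 citation bonus.
print(refiner_reward([1.0], [0.0], cite_p=1.0, cite_r=1.0))  # -> 1.2
```

The sketch keeps the rubric term in `[0, 1]` and treats the citation term as a small additive bonus, which matches the stated 0.2 weight; the real script may normalize or gate these terms differently.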