--- license: apache-2.0 language: - en tags: - long-context - reinforcement-learning - reasoning - rubric-reward - qwen3 base_model: - Qwen/Qwen3-4B --- # LongTraceRL-4B [![Paper](https://img.shields.io/badge/Paper-arXiv:2605.31584-red)](https://arxiv.org/abs/2605.31584) [![Code](https://img.shields.io/badge/Code-GitHub-blue)](https://github.com/THU-KEG/LongTraceRL) ## Model Description **LongTraceRL-4B** is a 4-billion parameter reasoning model trained with reinforcement learning on long-context multi-hop QA tasks using trajectory-based tiered distractors and entity-level rubric rewards. ## Model Details - **Base Model**: Qwen3-4B-Thinking-2507 - **Parameters**: 4B - **Architecture**: Qwen3 (36 layers, hidden size 2560, GQA with 8 KV groups) - **Training Method**: GRPO with entity-level rubric reward - **Context Length**: 128K prompt + 32K response - **Language**: English ## Training Details - **Training Data**: 2,815 long-context multi-hop QA samples ([LongTraceRL Dataset](https://huggingface.co/datasets/THU-KEG/LongTraceRL)) - **Training Steps**: 200 - **Learning Rate**: 2e-6 (constant) - **Global Batch Size**: 128 - **GRPO Group Size**: 8 - **Rubric Reward Weight (η)**: 0.3 - **Framework**: [Slime](https://github.com/THUDM/slime) (Megatron-LM + SGLang) ## Usage ```python from transformers import AutoModelForCausalLM, AutoTokenizer model = AutoModelForCausalLM.from_pretrained("THU-KEG/LongTraceRL-4B") tokenizer = AutoTokenizer.from_pretrained("THU-KEG/LongTraceRL-4B") ``` ## Citation ```bibtex @misc{lin2026longtracerllearninglongcontextreasoning, title={LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards}, author={Nianyi Lin and Jiajie Zhang and Lei Hou and Juanzi Li}, year={2026}, eprint={2605.31584}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2605.31584}, } ```