Model: THU-KEG/LongTraceRL-4B Source: Original Platform
license, language, tags, base_model
| license | language | tags | base_model | |||||||
|---|---|---|---|---|---|---|---|---|---|---|
| apache-2.0 |
|
|
|
LongTraceRL-4B
Model Description
LongTraceRL-4B is a 4-billion parameter reasoning model trained with reinforcement learning on long-context multi-hop QA tasks using trajectory-based tiered distractors and entity-level rubric rewards.
Model Details
- Base Model: Qwen3-4B-Thinking-2507
- Parameters: 4B
- Architecture: Qwen3 (36 layers, hidden size 2560, GQA with 8 KV groups)
- Training Method: GRPO with entity-level rubric reward
- Context Length: 128K prompt + 32K response
- Language: English
Training Details
- Training Data: 2,815 long-context multi-hop QA samples (LongTraceRL Dataset)
- Training Steps: 200
- Learning Rate: 2e-6 (constant)
- Global Batch Size: 128
- GRPO Group Size: 8
- Rubric Reward Weight (η): 0.3
- Framework: Slime (Megatron-LM + SGLang)
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("THU-KEG/LongTraceRL-4B")
tokenizer = AutoTokenizer.from_pretrained("THU-KEG/LongTraceRL-4B")
Citation
@misc{lin2026longtracerllearninglongcontextreasoning,
title={LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards},
author={Nianyi Lin and Jiajie Zhang and Lei Hou and Juanzi Li},
year={2026},
eprint={2605.31584},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.31584},
}
Description