Model: tclf90/QwenLong-L1-32B-AWQ Source: Original Platform
| library_name | pipeline_tag | tags | base_model | base_model_relation |
|---|---|---|---|---|
| transformers | text-generation | | | quantized |
Tongyi Zhiwen-QwenLong-L1-32B-AWQ
【Model Update Date】
2025-05-28
1. Initial commit
【Dependencies】
vllm==0.8.5
transformers==4.51.3
1. The V0 inference mode is required.
Before starting vllm, set the following environment variable:
export VLLM_USE_V1=0
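The same setup can be done from Python when the server is started programmatically. The following is a minimal sketch, assuming the `vllm serve` command-line entry point of the pinned vllm release is available on the PATH. The helper names `build_v0_env`, `build_serve_command`, and `launch_server` are hypothetical and not part of vllm.

```python
import os
import subprocess

MODEL_ID = "tclf90/QwenLong-L1-32B-AWQ"


def build_v0_env(base_env=None):
    """Return a copy of the environment with the V0 engine selected."""
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_USE_V1"] = "0"  # same effect as `export VLLM_USE_V1=0`
    return env


def build_serve_command(model_id=MODEL_ID):
    """Return the argument list that starts vllm's server for the model."""
    return ["vllm", "serve", model_id]


def launch_server():
    """Start the server as a child process with VLLM_USE_V1=0 set.

    Assumption: no extra flags are needed; add any that your hardware
    requires to the list returned by build_serve_command().
    """
    return subprocess.Popen(build_serve_command(), env=build_v0_env())
```

Passing the variable through `env=` keeps the setting local to the child process, so other tools in the same shell are not affected.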
【Model List】
| File Size | Last Updated |
|---|---|
| 19GB | 2025-05-21 |
【Model Download】
from modelscope import snapshot_download
snapshot_download('tclf90/QwenLong-L1-32B-AWQ', cache_dir="local path")
【Introduction】
QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning
Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li,
Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan
Tongyi Lab, Alibaba Group
🎉 News
- May 26, 2025: 🔥 We release 🤗 QwenLong-L1-32B, the first long-context large reasoning model (LRM) trained with reinforcement learning for long-context reasoning. Experiments on seven long-context DocQA benchmarks show that QwenLong-L1-32B outperforms flagship LRMs such as OpenAI-o3-mini and Qwen3-235B-A22B and performs on par with Claude-3.7-Sonnet-Thinking, placing it among the leading state-of-the-art LRMs.
- May 26, 2025: 🔥 We release 🤗 DocQA-RL-1.6K, a specialized RL training dataset of 1.6K document question answering (DocQA) problems spanning mathematical, logical, and multi-hop reasoning domains.
📚 Introduction
In this work, we propose QwenLong-L1, a novel reinforcement learning (RL) framework designed to facilitate the transition of LRMs from short-context proficiency to robust long-context generalization. In our preliminary experiments, we illustrate the differences between the training dynamics of short-context and long-context reasoning RL.
Our framework enhances short-context LRMs through progressive context scaling during RL training. The framework comprises three core components: a warm-up supervised fine-tuning (SFT) phase to initialize a robust policy, a curriculum-guided RL phase that facilitates stable adaptation from short to long contexts, and a difficulty-aware retrospective sampling mechanism that adjusts training complexity across stages to incentivize policy exploration. Leveraging recent RL algorithms, including GRPO and DAPO, our framework integrates hybrid reward functions combining rule-based and model-based binary outcome rewards to balance precision and recall. Through strategic utilization of group relative advantages during policy optimization, it guides LRMs to learn effective reasoning patterns essential for robust long-context grounding and superior reasoning capabilities.
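To make the reward and advantage terms concrete, here is a minimal Python sketch. It assumes the hybrid reward is the maximum of the two binary rewards, which is one plausible way to let either verifier accept an answer; the paragraph above does not specify the combination. The advantage follows the GRPO-style group normalization, standardizing each sampled response's reward against its group (population standard deviation is an assumption). The function names, the exact-match rule, and the `toy_judge` stand-in for a model-based judge are all hypothetical, not the authors' implementation.

```python
from statistics import mean, pstdev
from typing import Callable, List


def rule_based_reward(prediction: str, gold: str) -> float:
    """Binary reward from strict, case-insensitive exact matching."""
    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0


def hybrid_reward(prediction: str, gold: str,
                  judge: Callable[[str, str], bool]) -> float:
    """Combine rule-based and model-based binary rewards.

    Assumption: the two are combined with max(), so the judge can accept
    correct answers that strict matching misses.
    """
    model_based = 1.0 if judge(prediction, gold) else 0.0
    return max(rule_based_reward(prediction, gold), model_based)


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_judge(prediction: str, gold: str) -> bool:
    """Hypothetical stand-in for an LLM judge: substring containment."""
    return gold.lower() in prediction.lower()


# One group of four sampled responses to the same question.
responses = ["42", "The answer is 42.", "41", "unknown"]
rewards = [hybrid_reward(r, "42", toy_judge) for r in responses]
advantages = group_relative_advantages(rewards)
print(rewards)      # [1.0, 1.0, 0.0, 0.0]
print(advantages)   # positive for correct responses, negative otherwise
```

Because advantages are computed relative to the group, a group in which every response receives the same reward yields all-zero advantages and contributes no learning signal.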
🎯 Model Release
We release 🤗 QwenLong-L1-32B, the first long-context LRM trained with reinforcement learning for long-context reasoning. Experiments on seven long-context DocQA benchmarks show that QwenLong-L1-32B outperforms flagship LRMs such as OpenAI-o3-mini and Qwen3-235B-A22B and performs on par with Claude-3.7-Sonnet-Thinking, placing it among the leading state-of-the-art LRMs.
Here are the evaluation results.
📝 Citation
If you find this work relevant to your research or applications, please feel free to cite it!
@article{wan2025qwenlongl1,
title={QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning},
author={Fanqi Wan and Weizhou Shen and Shengyi Liao and Yingcheng Shi and Chenliang Li and Ziyi Yang and Ji Zhang and Fei Huang and Jingren Zhou and Ming Yan},
journal={arXiv preprint arXiv:2505.17667},
year={2025}
}



