
---
license: apache-2.0
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-3B-Instruct
pipeline_tag: reinforcement-learning
tags:
- Search
- QuestionAnswering
library_name: transformers
---
<h1 align="center">On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral</h1>
<p align="center">
📃 <a href="https://arxiv.org/abs/2512.04220" target="_blank">Paper</a> | 🤗 <a href="https://huggingface.co/SEGAgentRL" target="_blank">LLDS-Huggingface</a> | 🐙 <a href="https://github.com/vengdeng/LLDS-On-Group-Relative-Policy-Optimization-Collapse-in-Search-R1" target="_blank">GitHub</a>
</p>
## ⚡ Introduction
**LLDS** is a lightweight likelihood-preserving regularization designed to stabilize **tool-integrated reinforcement learning** (e.g., GRPO / Search-R1 style training).
It prevents training collapse by regularizing **only when** the likelihood of good actions decreases, and **only on** the tokens responsible for that decrease.
- We identify **Lazy Likelihood Displacement (LLD)** as a key mechanism behind collapse in tool-integrated GRPO training.
- LLDS activates **selectively**: it penalizes likelihood reduction on a *preserving set* (e.g., non-negative-advantage actions).
- We release our **LLDS-tuned Qwen2.5-3B-INS** checkpoint for search-integrated reasoning and QA.
- **A** denotes an action-level gate and **R** a response-level gate; the **action-level (A) gate achieves the best performance**.
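The selective gating idea above can be sketched as a small PyTorch penalty term. This is an illustrative sketch, not the paper's exact objective; the function name `llds_penalty` and its signature are our own for demonstration:

```python
import torch

def llds_penalty(logp_new, logp_old, advantages):
    """Illustrative action-level LLDS-style gate (not the paper's exact loss).

    Penalize a token only if (a) its likelihood decreased under the new
    policy and (b) it belongs to the preserving set (non-negative advantage).
    """
    drop = (logp_old - logp_new).clamp(min=0.0)   # active only when likelihood decreased
    preserve = (advantages >= 0).float()          # preserving set: "good" actions
    return (preserve * drop).mean()

# Toy example: only tokens 0 and 2 lost likelihood AND have advantage >= 0.
logp_old = torch.tensor([-1.0, -2.0, -0.5])
logp_new = torch.tensor([-1.5, -1.8, -0.7])
adv      = torch.tensor([ 1.0, -0.3,  0.5])
print(llds_penalty(logp_new, logp_old, adv))  # mean of [0.5, 0.0, 0.2]
```

Because the penalty is zero wherever the likelihood of a preserved action did not drop, it leaves ordinary GRPO updates untouched and activates only on the tokens driving lazy likelihood displacement.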
## 🔍 Tool-Integrated Search Inference (Search-R1 style)
We support tool-integrated inference using the same workflow as **[Search-R1](https://github.com/PeterGriffinJin/Search-R1)**, where the LLM interacts with a local retrieval server for multi-step reasoning.
The pipeline consists of two parts:
1. Launch a local retriever server
2. Run inference with the LLDS model
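The two parts can be sketched as a search-interleaved generation loop. This is an illustrative sketch, not the actual `infer.py`: the `<search>`/`<information>`/`<answer>` tags follow the Search-R1 convention, and `generate`/`retrieve` are stand-ins for the real LLM and retrieval server:

```python
import re

def run_search_loop(generate, retrieve, question, max_turns=4):
    """Alternate generation and retrieval until the model emits an answer."""
    prompt = question
    for _ in range(max_turns):
        out = generate(prompt)
        prompt += out
        m = re.search(r"<search>(.*?)</search>", out, re.DOTALL)
        if m is None:
            break  # no further search call; the model has answered
        docs = retrieve(m.group(1).strip())
        prompt += f"<information>{docs}</information>"
    ans = re.search(r"<answer>(.*?)</answer>", prompt, re.DOTALL)
    return ans.group(1).strip() if ans else prompt

# Stubbed components so the loop is runnable without a model or retriever.
def fake_generate(prompt):
    if "<information>" not in prompt:
        return "<search>capital of France</search>"
    return "<answer>Paris</answer>"

def fake_retrieve(query):
    return "Paris is the capital of France."

print(run_search_loop(fake_generate, fake_retrieve,
                      "What is the capital of France?"))  # → Paris
```

In the real pipeline, `retrieve` is an HTTP call to the local retrieval server launched in step 1, and `generate` is the LLDS checkpoint decoding until the next tag boundary.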
---
### 1⃣ Launch the local retrieval server
Search-R1 recommends running the retriever in a separate environment.
```bash
conda activate retriever
bash retrieval_launch.sh
```
### 2⃣ Run inference with LLDS-A-GSPO-Qwen2.5-3B-Ins
```bash
conda activate searchr1
python infer.py
```
Inside `infer.py`, set the model name and your question:
```python
MODEL_NAME = "<YOUR_ORG>/<YOUR_MODEL_NAME>"  # e.g. SEGAgentRL/LLDS-A-GSPO-Qwen2.5-3B-Ins
question = "Your question here"
```
## 📖 Citation
```bibtex
@article{deng2025grpo,
title={On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral},
author={Deng, Wenlong and Li, Yushu and Gong, Boying and Ren, Yi and Thrampoulidis, Christos and Li, Xiaoxiao},
journal={arXiv preprint arXiv:2512.04220},
year={2025}
}
```