---
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- reinforcement-learning
- rag
- qwen2.5
base_model: omron-sinicx/Qwen2.5-0.5B-Instruct-kd
---
<div align="center">

<!-- # 🔍 DGPO: Distillation-Guided Policy Optimization -->
# Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities

<p align="center">
<a href="https://arxiv.org/abs/2508.20324"><img src="https://img.shields.io/badge/Paper-arXiv-red?logo=arxiv"></a>
<!-- <a href="#"><img src="https://img.shields.io/badge/GitHub-Code-black?logo=github"></a> -->
</p>

<!-- *DGPO enables compact language models (0.51B) to perform agentic RAG with stable reinforcement learning.* -->
</div>

---
## 🧠 Overview
**DGPO (Distillation-Guided Policy Optimization) is a reinforcement learning framework with integrated knowledge distillation, designed to enable agentic search behaviors in compact language models.**
While RL works well for large models, compact models suffer from:
* ❌ Poor initial outputs
* ❌ Training collapse in RL
* ❌ Ineffective exploration

DGPO solves this by combining:

* **Cold-start knowledge distillation (KD)**
* **Teacher-guided reinforcement learning**

This enables stable learning and even allows compact models to **match or surpass teacher models**.

---
## ⚙️ Key Idea
### 🔁 Distillation-Guided RL
DGPO introduces a simple but powerful principle:

> ✅ Reward the student if its answer is correct
>
> ❌ Mimic the teacher if its answer is wrong

Because wrong rollouts still receive a teacher-derived signal, learning stays stable even when the model is weak.

---
## 🏗️ Framework
### 1. Cold-Start Initialization (KD)
* Train student using teacher-generated outputs (TGO)
* Provides high-quality trajectories
* Prevents early collapse
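
The cold-start step above can be sketched as follows. This is a minimal illustration, assuming the objective is plain cross-entropy on the tokens of teacher-generated outputs; the paper's actual KD loss may differ (for example, it may match the teacher's full token distribution). The function name `cold_start_kd_loss` and its inputs are hypothetical and not taken from any released code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cold_start_kd_loss(student_logits, teacher_token_ids):
    """Mean negative log-likelihood of teacher-generated tokens under the student.

    student_logits: one list of vocabulary logits per position (hypothetical input).
    teacher_token_ids: the token id the teacher emitted at each position.
    """
    if not teacher_token_ids:
        raise ValueError("need at least one teacher token")
    if len(student_logits) != len(teacher_token_ids):
        raise ValueError("need one logit vector per teacher token")
    nll = 0.0
    for logits, token_id in zip(student_logits, teacher_token_ids):
        # Penalize the student for assigning low probability to the teacher's token.
        nll -= log_softmax(logits)[token_id]
    return nll / len(teacher_token_ids)
```

A student with uniform logits over a vocabulary of size V scores log(V) under this loss; the loss shrinks as the student puts more mass on the teacher's tokens.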
### 2. Distillation-Guided RL
* Use PPO-based RL
* Reward correct answers
* Apply a **selective KL penalty toward the teacher only when the answer is wrong**

This enables:
* Stable training
* Efficient exploration
* Error-focused learning
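
The selective penalty described in this section can be sketched as below. This is a rough illustration under stated assumptions, not the authors' implementation: the per-rollout scalar signal, the KL direction (student relative to teacher), the weight `beta`, and the function names are all hypothetical. In an actual PPO setup the penalty would more likely be applied per token inside the training loop.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def selective_kl_reward(is_correct, student_probs, teacher_probs, beta=0.1):
    """Scalar training signal for one rollout.

    is_correct: whether the final answer matched the reference.
    student_probs / teacher_probs: per-token next-token distributions
        (hypothetical inputs, one probability list per generated token).
    beta: hypothetical weight of the distillation penalty.
    """
    if is_correct:
        # Correct rollouts receive the task reward and no KL term.
        return 1.0
    # Wrong rollouts are pulled toward the teacher via the mean per-token KL.
    per_token = [kl_divergence(s, t) for s, t in zip(student_probs, teacher_probs)]
    mean_kl = sum(per_token) / len(per_token) if per_token else 0.0
    return -beta * mean_kl
```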
---
## 🔍 Agentic RAG Behavior
DGPO trains models to perform **multi-step search reasoning**:
```
<think> reasoning </think>
<search> query </search>
<information> retrieved docs </information>
<answer> final answer </answer>
```
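
A controller that drives this loop needs to read the tags back out of the model's output. The sketch below assumes the four tag names shown above and non-nested tags; `parse_trace` and `next_action` are hypothetical helper names, and the retrieval step itself is left to whatever search backend is in use.

```python
import re

# Matches one <tag> ... </tag> span for the four tags in the trace format.
TAG_PATTERN = re.compile(r"<(think|search|information|answer)>(.*?)</\1>", re.DOTALL)


def parse_trace(text):
    """Return (tag, content) pairs in the order they appear in the text."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_PATTERN.finditer(text)]


def next_action(text):
    """Decide what to do after the latest model output.

    Returns ("finish", answer) if the trace ends with an answer,
    ("retrieve", query) if it ends with a search request,
    and ("continue", "") otherwise.
    """
    steps = parse_trace(text)
    if not steps:
        return ("continue", "")
    tag, content = steps[-1]
    if tag == "answer":
        return ("finish", content)
    if tag == "search":
        return ("retrieve", content)
    return ("continue", "")
```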
---
## 🚀 Performance
### Overall QA Performance
📊 Qwen2.5 (3B teacher → 0.5B student)
| Method | NQ | TriviaQA | PopQA | HotpotQA | 2Wiki | MuSiQue | Bamboogle | Avg. |
| --------------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- |
| *Student-0.5B* | 0.004 | 0.006 | 0.007 | 0.007 | 0.015 | 0.000 | 0.000 | 0.006 |
| *Teacher-3B* | 0.365 | 0.569 | 0.393 | 0.340 | 0.368 | 0.135 | 0.298 | 0.353 |
| *PPO* | 0.306 | 0.444 | 0.379 | 0.205 | 0.218 | 0.041 | 0.073 | 0.238 |
| *GKD* | 0.266 | 0.408 | 0.358 | 0.216 | 0.217 | 0.055 | 0.161 | 0.240 |
| *SeqKD* | 0.331 | 0.416 | 0.364 | 0.283 | 0.273 | 0.089 | 0.169 | 0.275 |
| *KD* | 0.331 | 0.431 | 0.373 | 0.286 | 0.284 | 0.091 | 0.290 | 0.298 |
| *DistiLLM* | 0.333 | 0.442 | 0.373 | 0.288 | 0.270 | 0.095 | 0.209 | 0.287 |
| *TAID* | 0.325 | 0.427 | 0.365 | 0.290 | 0.270 | 0.079 | 0.218 | 0.282 |
| **DGPO (ours)** | **0.378** | **0.481** | **0.402** | **0.342** | **0.303** | **0.120** | 0.274 | **0.329** |

👉 DGPO achieves **~55× improvement** over the base 0.5B student in average score (0.006 → 0.329)

👉 On NQ, PopQA, and HotpotQA, the **student surpasses the 3B teacher**

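
The two headline claims can be re-derived from the table. The snippet below copies the Teacher-3B and DGPO rows and the reported "Avg." values; the variable and function names are illustrative only. Dividing the reported averages gives roughly 54.8, which the text rounds to ~55×.

```python
BENCHMARKS = ["NQ", "TriviaQA", "PopQA", "HotpotQA", "2Wiki", "MuSiQue", "Bamboogle"]

# Scores copied from the Teacher-3B and DGPO rows of the table.
TEACHER_3B = [0.365, 0.569, 0.393, 0.340, 0.368, 0.135, 0.298]
DGPO_05B = [0.378, 0.481, 0.402, 0.342, 0.303, 0.120, 0.274]

# Reported "Avg." column for the base student and for DGPO.
STUDENT_AVG = 0.006
DGPO_AVG = 0.329


def improvement_ratio(new_avg, base_avg):
    """How many times larger the new average is than the baseline average."""
    return new_avg / base_avg


def benchmarks_where_student_wins(student_scores, teacher_scores):
    """Names of benchmarks where the student strictly beats the teacher."""
    return [
        name
        for name, s, t in zip(BENCHMARKS, student_scores, teacher_scores)
        if s > t
    ]


if __name__ == "__main__":
    print(f"{improvement_ratio(DGPO_AVG, STUDENT_AVG):.1f}x")
    print(benchmarks_where_student_wins(DGPO_05B, TEACHER_3B))
```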
---
## 🎓 Citation
```
@article{kotoge2025dgpo,
title={Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities},
author={Kotoge, Rikuto and Nishimura, Mai and Ma, Jiaxin},
journal={arXiv preprint arXiv:2508.20324},
year={2025}
}
```
```
@inproceedings{
kotoge2025democratizing,
title={Democratizing Agentic {RAG}: Distillation-Guided Policy Optimization for Compact Language Models},
author={Rikuto Kotoge and Mai Nishimura and Jiaxin Ma},
booktitle={NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning},
year={2025},
url={https://openreview.net/forum?id=CP0H9NAWES}
}
```
---
## 🤝 Acknowledgements
* Search-R1 (agentic RL baseline)
* Open models (Qwen & Llama)
* Open QA benchmarks (NQ, HotpotQA, etc.)
---
<div align="center">

**DGPO makes agentic RAG accessible to small models 🚀**

</div>