Initialize project; model provided by the ModelHub XC community

Model: omron-sinicx/DGPO-qwen2.5-0.5b Source: Original Platform
2026-06-16 08:45:17 +08:00
commit 6e356f4992
13 changed files with 152174 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,157 @@
+---
+language:
+  - en
+license: apache-2.0
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+  - reinforcement-learning
+  - rag
+  - qwen2.5
+base_model: omron-sinicx/Qwen2.5-0.5B-Instruct-kd
+---
+
+<div align="center">
+
+<!-- # 🔍 DGPO: Distillation-Guided Policy Optimization -->
+
+# Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities
+
+<p align="center">
+  <a href="https://arxiv.org/abs/2508.20324"><img src="https://img.shields.io/badge/Paper-arXiv-red?logo=arxiv"></a>
+  <!-- <a href="#"><img src="https://img.shields.io/badge/GitHub-Code-black?logo=github"></a> -->
+</p>
+
+<!-- *DGPO enables compact language models (0.5–1B) to perform agentic RAG with stable reinforcement learning.* -->
+
+</div>
+
+---
+
+## 🧠 Overview
+
+**DGPO (Distillation-Guided Policy Optimization) is a reinforcement learning framework with integrated knowledge distillation, designed to enable agentic search behaviors in compact language models.**
+
+While RL works well for large models, compact models suffer from:
+
+* ❌ Poor initial outputs 
+* ❌ Training collapse in RL
+* ❌ Ineffective exploration
+
+DGPO solves this by combining:
+
+* ✅ **Cold-start knowledge distillation (KD)**
+* ✅ **Teacher-guided reinforcement learning**
+
+This enables stable learning and even allows compact models to **match or surpass their teacher models** on some benchmarks.
+
+---
+
+## ⚙️ Key Idea
+
+### 🔁 Distillation-Guided RL
+
+DGPO introduces a simple but powerful principle:
+
+> ✅ Reward the student if its answer is correct  
+> ❌ Mimic the teacher if its answer is wrong
+
+This creates a stable learning signal even when the model is weak.
+
+---
+
+## 🏗️ Framework
+
+### 1. Cold-Start Initialization (KD)
+
+* Train student using teacher-generated outputs (TGO)
+* Provides high-quality trajectories
+* Prevents early collapse
+
+### 2. Distillation-Guided RL
+
+* Use PPO-based RL
+* Reward correct answers
+* Apply a **selective KL penalty toward the teacher only when the answer is wrong**
+
+This enables:
+
+* Stable training
+* Efficient exploration
+* Error-focused learning
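+
+As a rough formalization of the two steps above, the per-sample training signal can be sketched as a correctness reward plus a teacher-matching penalty that is switched on only for wrong answers. The symbols are illustrative assumptions, not notation taken from this model card: $\pi_\theta$ is the student policy, $\pi_T$ the teacher policy, $\beta > 0$ a penalty weight, and $c(x, y) \in \{0, 1\}$ a binary correctness indicator. The KL direction is likewise an assumption and may differ from the paper's definition.
+
+```latex
+% Sketch only: \beta, c(x, y), and the KL direction are assumed for illustration.
+\[
+\tilde{r}(x, y) \;=\; c(x, y) \;-\; \beta \,\bigl(1 - c(x, y)\bigr)\,
+D_{\mathrm{KL}}\!\bigl(\pi_\theta(\cdot \mid x) \,\big\|\, \pi_T(\cdot \mid x)\bigr)
+\]
+```
+
+When $c(x, y) = 1$ the penalty term vanishes, so the student is free to deviate from the teacher. When $c(x, y) = 0$ only the KL term remains, pulling the student back toward the teacher's distribution.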
+
+---
+
+## 🔍 Agentic RAG Behavior
+
+DGPO trains models to perform **multi-step search reasoning**:
+```
+<think> reasoning </think>
+
+<search> query </search>  
+<information> retrieved docs </information>  
+<answer> final answer </answer>  
+```
+
+---
+
+## 🚀 Performance
+
+### Overall QA Performance
+
+📊 Qwen2.5 (3B teacher → 0.5B student)
+
+| Method          | NQ        | TriviaQA  | PopQA     | HotpotQA  | 2Wiki     | MuSiQue   | Bamboogle | Avg.      |
+| --------------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- |
+| *Student-0.5B*  | 0.004     | 0.006     | 0.007     | 0.007     | 0.015     | 0.000     | 0.000     | 0.006     |
+| *Teacher-3B*    | 0.365     | 0.569     | 0.393     | 0.340     | 0.368     | 0.135     | 0.298     | 0.353     |
+| *PPO*           | 0.306     | 0.444     | 0.379     | 0.205     | 0.218     | 0.041     | 0.073     | 0.238     |
+| *GKD*           | 0.266     | 0.408     | 0.358     | 0.216     | 0.217     | 0.055     | 0.161     | 0.240     |
+| *SeqKD*         | 0.331     | 0.416     | 0.364     | 0.283     | 0.273     | 0.089     | 0.169     | 0.275     |
+| *KD*            | 0.331     | 0.431     | 0.373     | 0.286     | 0.284     | 0.091     | 0.290     | 0.298     |
+| *DistiLLM*      | 0.333     | 0.442     | 0.373     | 0.288     | 0.270     | 0.095     | 0.209     | 0.287     |
+| *TAID*          | 0.325     | 0.427     | 0.365     | 0.290     | 0.270     | 0.079     | 0.218     | 0.282     |
+| **DGPO (ours)** | **0.378** | **0.481** | **0.402** | **0.342** | **0.303** | **0.120** | 0.274     | **0.329** |
+
+
+👉 DGPO achieves a **~55× improvement** in average score over the base 0.5B student (0.329 vs. 0.006)
+
+👉 On NQ, PopQA, and HotpotQA, the **student surpasses the teacher**
+
+---
+
+## 🎓 Citation
+```
+@article{kotoge2025dgpo,
+title={Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization},
+author={Kotoge, Rikuto and Nishimura, Mai and Ma, Jiaxin},
+journal={arXiv preprint arXiv:2508.20324},
+year={2025}
+}
+```
+```
+@inproceedings{
+kotoge2025democratizing,
+title={Democratizing Agentic {RAG}: Distillation-Guided Policy Optimization for Compact Language Models},
+author={Rikuto Kotoge and Mai Nishimura and Jiaxin Ma},
+booktitle={NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning},
+year={2025},
+url={https://openreview.net/forum?id=CP0H9NAWES}
+}
+```
+
+---
+
+## 🤝 Acknowledgements
+
+* Search-R1 (agentic RL baseline)
+* Open models (Qwen & Llama)
+* Open QA benchmarks (NQ, HotpotQA, etc.)
+
+---
+
+<div align="center">
+
+**DGPO makes agentic RAG accessible to small models 🚀**
+
+</div>