---
language:
  - en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
  - reinforcement-learning
  - rag
  - qwen2.5
base_model: omron-sinicx/Qwen2.5-0.5B-Instruct-kd
---

<div align="center">

<!-- # 🔍 DGPO: Distillation-Guided Policy Optimization -->

# Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities

<p align="center">
  <a href="https://arxiv.org/abs/2508.20324"><img src="https://img.shields.io/badge/Paper-arXiv-red?logo=arxiv"></a>
  <!-- <a href="#"><img src="https://img.shields.io/badge/GitHub-Code-black?logo=github"></a> -->
</p>

<!-- *DGPO enables compact language models (0.5–1B) to perform agentic RAG with stable reinforcement learning.* -->

</div>

---

## 🧠 Overview

**DGPO (Distillation-Guided Policy Optimization) is a reinforcement learning framework with integrated knowledge distillation, designed to enable agentic search behaviors in compact language models.**

While RL works well for large models, compact models suffer from:

* ❌ Poor initial outputs
* ❌ Training collapse in RL
* ❌ Ineffective exploration

DGPO solves this by combining:

* ✅ **Cold-start knowledge distillation (KD)**
* ✅ **Teacher-guided reinforcement learning**

This enables stable learning and even allows compact models to **match or surpass teacher models** on some benchmarks.

---

## ⚙️ Key Idea

### 🔁 Distillation-Guided RL

DGPO introduces a simple but powerful principle:

> ✅ Reward if correct
> ❌ Mimic teacher if wrong

This creates a stable learning signal even when the model is weak.

---

## 🏗️ Framework

### 1. Cold-Start Initialization (KD)

* Train student using teacher-generated outputs (TGO)
* Provides high-quality trajectories
* Prevents early collapse
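
As a rough illustration of the cold-start step, the sketch below computes a token-level distillation loss over one teacher-generated sequence. It assumes a forward KL between the teacher and student next-token distributions and plain Python lists of logits; the function names (`log_softmax`, `token_kl`, `kd_loss`) are hypothetical and this is not the released training code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at one token position (assumed KL direction)."""
    lp = log_softmax(teacher_logits)
    lq = log_softmax(student_logits)
    return sum(math.exp(p) * (p - q) for p, q in zip(lp, lq))


def kd_loss(teacher_seq_logits, student_seq_logits):
    """Mean per-token KL over one teacher-generated sequence."""
    kls = [token_kl(t, s) for t, s in zip(teacher_seq_logits, student_seq_logits)]
    return sum(kls) / len(kls)
```

The loss is zero when the student reproduces the teacher's distributions exactly and grows as they diverge, which is the behavior a cold-start objective needs.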

### 2. Distillation-Guided RL

* Use PPO-based RL
* Reward correct answers
* Apply a **selective KL penalty toward the teacher only when the answer is wrong**

This enables:

* Stable training
* Efficient exploration
* Error-focused learning
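
The selective penalty can be sketched as a per-rollout signal. This is a minimal sketch under stated assumptions: a correct answer earns a fixed reward of `1.0`, an incorrect one receives `-beta` times a sample-based estimate of the student-to-teacher KL, and `shaped_reward` and `beta` are hypothetical names. The exact reward scale and KL estimator used in training may differ.

```python
def shaped_reward(is_correct, student_logprobs, teacher_logprobs, beta=0.1):
    """Scalar training signal for one rollout (hypothetical helper).

    student_logprobs / teacher_logprobs: log-probabilities that each model
    assigns to the tokens the student actually sampled.
    """
    if is_correct:
        # Correct answer: plain outcome reward, no teacher guidance.
        return 1.0
    if not student_logprobs:
        # Empty rollout: nothing to compare against the teacher.
        return 0.0
    # Sample-based estimate of KL(student || teacher) on the sampled tokens.
    gaps = [s - t for s, t in zip(student_logprobs, teacher_logprobs)]
    kl_estimate = sum(gaps) / len(gaps)
    # Wrong answer: penalize divergence from the teacher.
    return -beta * kl_estimate
```

Because the estimate is computed on sampled tokens, it can be negative for an individual rollout even though the true KL is non-negative.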

---

## 🔍 Agentic RAG Behavior

DGPO trains models to perform **multi-step search reasoning**:
```
<think> reasoning </think>

<search> query </search>
<information> retrieved docs </information>
<answer> final answer </answer>
```
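
A controller that drives this loop needs to read the tags and decide the next step. The sketch below parses the tag format shown above and returns an action. The helper names (`parse_trace`, `next_action`) and the action labels are hypothetical, and the retriever call itself is left out.

```python
import re

# Matches the four tag types used in the rollout format.
TAG_RE = re.compile(r"<(think|search|information|answer)>(.*?)</\1>", re.DOTALL)


def parse_trace(text):
    """Return (tag, content) pairs in order of appearance."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_RE.finditer(text)]


def next_action(text):
    """Decide what to do based on the last complete tag in the trace."""
    steps = parse_trace(text)
    if not steps:
        return ("continue", "")
    tag, content = steps[-1]
    if tag == "answer":
        return ("finish", content)    # the model committed to a final answer
    if tag == "search":
        return ("retrieve", content)  # run the query, append <information>
    return ("continue", "")           # keep generating
```

For example, a trace ending in `<search> capital of France </search>` yields a retrieve action with that query; once the retrieved documents are appended inside `<information>` tags, the controller lets the model continue generating.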

---

## 🚀 Performance

### Overall QA Performance

📊 Qwen2.5 (3B → 0.5B)

| Method          | NQ        | TriviaQA  | PopQA     | HotpotQA  | 2Wiki     | MuSiQue   | Bamboogle | Avg.      |
| --------------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- |
| *Student-0.5B*  | 0.004     | 0.006     | 0.007     | 0.007     | 0.015     | 0.000     | 0.000     | 0.006     |
| *Teacher-3B*    | 0.365     | 0.569     | 0.393     | 0.340     | 0.368     | 0.135     | 0.298     | 0.353     |
| *PPO*           | 0.306     | 0.444     | 0.379     | 0.205     | 0.218     | 0.041     | 0.073     | 0.238     |
| *GKD*           | 0.266     | 0.408     | 0.358     | 0.216     | 0.217     | 0.055     | 0.161     | 0.240     |
| *SeqKD*         | 0.331     | 0.416     | 0.364     | 0.283     | 0.273     | 0.089     | 0.169     | 0.275     |
| *KD*            | 0.331     | 0.431     | 0.373     | 0.286     | 0.284     | 0.091     | 0.290     | 0.298     |
| *DistiLLM*      | 0.333     | 0.442     | 0.373     | 0.288     | 0.270     | 0.095     | 0.209     | 0.287     |
| *TAID*          | 0.325     | 0.427     | 0.365     | 0.290     | 0.270     | 0.079     | 0.218     | 0.282     |
| **DGPO (ours)** | **0.378** | **0.481** | **0.402** | **0.342** | **0.303** | **0.120** | 0.274     | **0.329** |


👉 DGPO achieves a **~55× improvement** in average score over the base 0.5B student (0.329 vs. 0.006)

👉 On NQ, PopQA, and HotpotQA, the **student surpasses the 3B teacher**
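
The two headline figures can be re-derived from the table. The sketch below copies the Student-0.5B, Teacher-3B, and DGPO rows, recomputes the rounded averages, and lists the benchmarks where DGPO scores above the teacher. The variable names are illustrative only.

```python
benchmarks = ["NQ", "TriviaQA", "PopQA", "HotpotQA", "2Wiki", "MuSiQue", "Bamboogle"]
student = [0.004, 0.006, 0.007, 0.007, 0.015, 0.000, 0.000]
teacher = [0.365, 0.569, 0.393, 0.340, 0.368, 0.135, 0.298]
dgpo = [0.378, 0.481, 0.402, 0.342, 0.303, 0.120, 0.274]


def mean(xs):
    """Arithmetic mean of a list of scores."""
    return sum(xs) / len(xs)


# Averages rounded to three decimals, as reported in the Avg. column.
student_avg = round(mean(student), 3)
teacher_avg = round(mean(teacher), 3)
dgpo_avg = round(mean(dgpo), 3)

# Improvement factor computed from the rounded averages.
improvement = dgpo_avg / student_avg

# Benchmarks where the 0.5B DGPO student scores above the 3B teacher.
student_wins = [b for b, d, t in zip(benchmarks, dgpo, teacher) if d > t]

print(student_avg, teacher_avg, dgpo_avg, round(improvement), student_wins)
```

The improvement factor here is taken over the rounded averages, so it is sensitive to rounding in the very small baseline score.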

---

## 🎓 Citation
```
@article{kotoge2025dgpo,
title={Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities},
author={Kotoge, Rikuto and Nishimura, Mai and Ma, Jiaxin},
journal={arXiv preprint arXiv:2508.20324},
year={2025}
}
```
```
@inproceedings{kotoge2025democratizing,
title={Democratizing Agentic {RAG}: Distillation-Guided Policy Optimization for Compact Language Models},
author={Rikuto Kotoge and Mai Nishimura and Jiaxin Ma},
booktitle={NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning},
year={2025},
url={https://openreview.net/forum?id=CP0H9NAWES}
}
```

---

## 🤝 Acknowledgements

* Search-R1 (agentic RL baseline)
* Open models (Qwen & Llama)
* Open QA benchmarks (NQ, HotpotQA, etc.)

---

<div align="center">

**DGPO makes agentic RAG accessible to small models 🚀**

</div>