---
license: apache-2.0
base_model:
- Qwen/Qwen3-8B
library_name: transformers
---

<div align="center">

# HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs

<img src="https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/KIYEa1c_WJEWPpeS0L_k1.png" width="60%" alt="Kwaipilot"/>

---

<a href="https://huggingface.co/Kwaipilot/HIPO-8B" target="_blank">
<img alt="Hugging Face" src="https://img.shields.io/badge/HuggingFace-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor"/>
</a>
<a href="https://arxiv.org/abs/2509.23967" target="_blank">
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-2509.23967-b31b1b.svg?style=for-the-badge"/>
</a>

<br>

<a href="https://arxiv.org/abs/2509.23967"></a>

</div>

This work is a companion to our earlier KAT-V1 report, where we first introduced the **AutoThink paradigm** for controllable reasoning. While KAT-V1 outlined the overall **SFT + RL** framework for adaptive reasoning, this paper, [**HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs**](https://arxiv.org/abs/2509.23967), provides the **detailed algorithmic design** of that training recipe.

***

# Overview

We introduce **HiPO** (Hybrid Policy Optimization for Dynamic Reasoning in LLMs), an RL framework that lets a model decide when to “think” (Think-on) and when to skip reasoning (Think-off), balancing correctness against efficiency.

HiPO has two main components:

- **Hybrid Data Pipeline** – Collects both Think-on and Think-off responses, categorizes queries by difficulty, and uses a strong model (e.g., DeepSeek-V3) to generate explanations that justify each mode choice.
- **Hybrid Reward System** – Combines rewards for both modes, using bias adjustment to prevent overuse of long reasoning and mode-aware advantage functions to align mode decisions with performance gains.

![image](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/V7mHYKKwKOOtmXcmJfXDC.png)

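The bias adjustment mentioned for the Hybrid Reward System can be pictured with a toy example. The sketch below is only an illustration under stated assumptions: the function name `biased_group_advantages`, the `bias` value, and the rule of adding a fixed bonus to Think-off responses are hypothetical, and GRPO-style group normalization stands in for the mode-aware advantage functions, whose actual form is given in the paper rather than in this README.

```python
from statistics import mean, pstdev


def biased_group_advantages(rewards, modes, bias=0.1, eps=1e-6):
    """Toy group-normalized advantages with a Think-off bonus (hypothetical).

    rewards: task rewards for one group of sampled responses,
             e.g. 1.0 if the answer is correct, else 0.0
    modes:   "on" or "off" for each response (Think-on / Think-off)
    bias:    assumed bonus that nudges the policy away from
             reasoning when reasoning does not pay off
    """
    if len(rewards) != len(modes):
        raise ValueError("rewards and modes must have the same length")

    # Bias adjustment (assumed form): Think-off responses get a small bonus.
    adjusted = [r + bias if m == "off" else r for r, m in zip(rewards, modes)]

    # GRPO-style normalization: center and scale within the group.
    mu = mean(adjusted)
    sigma = pstdev(adjusted)
    return [(a - mu) / (sigma + eps) for a in adjusted]


if __name__ == "__main__":
    adv = biased_group_advantages(
        rewards=[1.0, 1.0, 0.0, 0.0],
        modes=["on", "off", "on", "off"],
    )
    for mode, a in zip(["on", "off", "on", "off"], adv):
        print(f"think-{mode}: {a:+.3f}")
```

With equal task rewards, the Think-off response ends up with the larger advantage, which is the qualitative effect a bias term of this kind is meant to have.
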
# Experimental Findings

**Think-on Only (Overthinking).**
Training only on Think-on data makes the model reason on every problem, including simple ones, which is inefficient.

**GRPO.**
Improves accuracy by **+3.1%**, but increases token length on simple tasks.

**Think-on/Think-off Mix.**
Yields higher accuracy (**+4.0%**) while reducing token length (**–10.8%**) and thinking rate (**–22%**).

**HiPO Advantage.**
Achieves the best results: **+6.2% accuracy**, **–30% token length**, and **–39% thinking rate**, outperforming existing methods in both **efficiency** and **accuracy**.

![image](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/9Ekz1Pl3aBFe4-vfxOGZ-.png)

# Data Format

**HiPO** produces responses in a **structured template** that makes the reasoning path explicit and machine-parsable. Two modes are supported:

![image](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/ErBEIrnOGOIGWPD7hNfdV.png)

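Because the template is machine-parsable, a response can be split into its reasoning and answer parts with a few lines of code. The sketch below is a minimal illustration, not the official parser: the exact tags appear only in the figure above, so the tag names `think` and `answer`, and the helper names `parse_response` and `thinking_rate`, are placeholders that should be replaced with whatever the model actually emits.

```python
import re


def parse_response(text, think_tag="think", answer_tag="answer"):
    """Split a response into reasoning and answer.

    The tag names are hypothetical placeholders; pass the real ones
    used by the model's template.
    """
    def grab(tag):
        pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
        match = re.search(pattern, text, flags=re.DOTALL)
        return match.group(1).strip() if match else None

    reasoning = grab(think_tag)
    answer = grab(answer_tag)
    return {
        # A response counts as Think-on only if it has non-empty reasoning.
        "mode": "think-on" if reasoning else "think-off",
        "reasoning": reasoning,
        # Fall back to the raw text when no answer tag is found.
        "answer": answer if answer is not None else text.strip(),
    }


def thinking_rate(responses, **tags):
    """Fraction of responses that contain a reasoning block."""
    if not responses:
        return 0.0
    parsed = [parse_response(r, **tags) for r in responses]
    return sum(p["mode"] == "think-on" for p in parsed) / len(parsed)


if __name__ == "__main__":
    samples = [
        "<think>2 + 2 = 4</think><answer>4</answer>",
        "<answer>Paris</answer>",
    ]
    for s in samples:
        print(parse_response(s))
    print("thinking rate:", thinking_rate(samples))
```

A helper like `thinking_rate` is one way to measure how often a model reasons on a given set of prompts, in the spirit of the thinking-rate figures reported under Experimental Findings.
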
# Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Kwaipilot/HiPO-1.7B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    do_sample=True,  # temperature and top_p only take effect when sampling
    temperature=0.6,
    top_p=0.95,
)
# keep only the newly generated tokens, dropping the prompt
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print("prompt:\n", prompt)
print("content:\n", content)
```

***

# Citation

```
@article{Zhan2025HiPO,
  title={HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs},
  author={Ken Deng and Zizheng Zhan and Wen Xiang and Wenqiang Zhu and others},
  year={2025},
  journal={arXiv preprint arXiv:2509.23967},
  url={https://arxiv.org/abs/2509.23967}
}
```