Model: Kwaipilot/HiPO-1.7B
Source: Original Platform
2026-05-12 10:45:18 +08:00

---
license: apache-2.0
base_model:
- Qwen/Qwen3-8B
library_name: transformers
---
<div align="center">
# HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs
<img src="https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/KIYEa1c_WJEWPpeS0L_k1.png" width="60%" alt="Kwaipilot"/>
---
<a href="https://huggingface.co/Kwaipilot/HIPO-8B" target="_blank">
<img alt="Hugging Face" src="https://img.shields.io/badge/HuggingFace-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor"/>
</a>
<a href="https://arxiv.org/abs/2509.23967" target="_blank">
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-2509.23967-b31b1b.svg?style=for-the-badge"/>
</a>
<br>
</div>
This work is a companion to our earlier KAT-V1 report, which first introduced the **AutoThink paradigm** for controllable reasoning. While KAT-V1 outlined the overall framework of **SFT + RL** for adaptive reasoning, this paper, [**HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs**](https://arxiv.org/abs/2509.23967), provides the **detailed algorithmic design** of that training recipe.
***
# Overview
We introduce **HiPO (Hybrid Policy Optimization for Dynamic Reasoning in LLMs)**, a novel RL framework that enables models to decide when to “think” (i.e., Think-on) and when to skip reasoning (i.e., Think-off), thereby striking a balance between correctness and efficiency.
HiPO has two main components:
- **Hybrid Data Pipeline:** Collects both think-on and think-off responses, categorizes queries by difficulty, and uses a strong model (e.g., DeepSeek-V3) to generate explanations that justify mode choices.
- **Hybrid Reward System:** Combines rewards for both modes, with bias adjustment to prevent overuse of long reasoning and mode-aware advantage functions to align decisions with performance gains.
![Kim 2025-09-26 145531](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/ZUk76mhDiVITfUsLcvv6F.png)
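
To make the idea of bias adjustment more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the reward defined in the paper: the constant `THINK_OFF_BONUS`, the helper names `adjust_rewards` and `group_advantages`, and the fixed-bonus form of the bias are all hypothetical, and the normalization shown is a plain GRPO-style group normalization rather than HiPO's mode-aware advantage.

```python
from statistics import mean, pstdev

# Hypothetical constant: bonus granted to think-off responses (not from the paper).
THINK_OFF_BONUS = 0.1


def adjust_rewards(rewards, modes, bonus=THINK_OFF_BONUS):
    """Add a small bonus to think-off rewards so that long reasoning must
    earn its extra tokens (an assumed, simplified form of bias adjustment)."""
    return [r + bonus if m == "off" else r for r, m in zip(rewards, modes)]


def group_advantages(rewards):
    """Normalize rewards within one sampled group, GRPO-style."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero when all rewards match
    return [(r - mu) / sigma for r in rewards]


# Four sampled responses to one query: correctness reward and chosen mode.
rewards = [1.0, 0.0, 1.0, 1.0]
modes = ["on", "on", "off", "off"]

adjusted = adjust_rewards(rewards, modes)
advantages = group_advantages(adjusted)
print(advantages)
```

In this toy group, a correct think-off response ends up with a higher advantage than an equally correct think-on response, which is the qualitative effect the bias adjustment is meant to have.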
# Experimental Findings
- **Think-on Only (Overthinking):** Training only on Think-on data makes the model reason on every problem, which is inefficient.
- **GRPO:** Improves accuracy by **+3.1%** but increases token length on simple tasks.
- **Think-on/Think-off Mix:** Raises accuracy by **+4.0%** while reducing token length by **10.8%** and thinking rate by **22%**.
- **HiPO Advantage:** Achieves the best results, with **+6.2% accuracy**, **-30% token length**, and **-39% thinking rate**, outperforming existing methods in both **efficiency** and **accuracy**.
![Kim 2025-09-26 145349](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/_qzxhMRTL_NTfaGb13LHc.png)
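
The three quantities reported above (accuracy, token length, thinking rate) can be computed from per-response records roughly as follows. This is a sketch on toy data: the record fields `correct`, `tokens`, and `mode` and the helpers `summarize` and `relative_change` are hypothetical, and this README does not state whether its percentages are relative or absolute changes, so the relative form used here is an assumption.

```python
def summarize(records):
    """Return accuracy, mean token length, and thinking rate for a run."""
    n = len(records)
    return {
        "accuracy": sum(r["correct"] for r in records) / n,
        "mean_tokens": sum(r["tokens"] for r in records) / n,
        "thinking_rate": sum(r["mode"] == "on" for r in records) / n,
    }


def relative_change(new, old):
    """Relative change of `new` against `old`, as a percentage."""
    return 100.0 * (new - old) / old


# Toy records, invented for illustration only.
baseline = summarize([
    {"correct": True, "tokens": 1000, "mode": "on"},
    {"correct": False, "tokens": 1000, "mode": "on"},
])
candidate = summarize([
    {"correct": True, "tokens": 900, "mode": "on"},
    {"correct": True, "tokens": 500, "mode": "off"},
])

for key in ("mean_tokens", "thinking_rate"):
    print(key, relative_change(candidate[key], baseline[key]))
```

A negative value means the candidate run used fewer tokens or reasoned less often than the baseline.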
# Data Format
**HiPO** produces responses in a **structured template** that makes the reasoning path explicit and machine-parsable. Two modes are supported, Think-on and Think-off:
![Kim 2025-09-26 145842](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/FXfAAN0WVpsaOn1wROInL.png)
# Quick Start
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Kwaipilot/HiPO-1.7B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    do_sample=True,  # sampling must be on for temperature and top_p to take effect
    temperature=0.6,
    top_p=0.95,
)

# keep only the newly generated tokens (drop the prompt tokens)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")

print("prompt:\n", prompt)
print("content:\n", content)
```
***
# Citation
```
@article{Zhan2025HiPO,
  title={HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs},
  author={Deng, Ken and Zhan, Zizheng and Xiang, Wen and Zhu, Wenqiang and others},
  journal={arXiv preprint arXiv:2509.23967},
  year={2025},
  url={https://arxiv.org/abs/2509.23967}
}
```