nexus-1.5b/README.md
ModelHub XC ca55608289 Initialize project; model provided by the ModelHub XC community
Model: Dat1710/nexus-1.5b
Source: Original Platform
2026-05-22 07:36:18 +08:00

---
library_name: transformers
tags:
- math
- reasoning
- reinforcement-learning
- qwen2
- mathematics
- chain-of-thought
license: apache-2.0
language:
- en
- zh
base_model: Qwen/Qwen2.5-Math-1.5B-Instruct
pipeline_tag: text-generation
---
# Nexus-1.5B
<p align="center">
<img src="https://img.shields.io/badge/Base%20Model-Qwen2.5--Math--1.5B--Instruct-orange" />
<img src="https://img.shields.io/badge/Parameters-1.54B-blue" />
<img src="https://img.shields.io/badge/Method-LPRO-green" />
<img src="https://img.shields.io/badge/MATH--500-80.2-red" />
<img src="https://img.shields.io/badge/GSM8K-85.2-red" />
</p>

**Nexus-1.5B** is a 1.54-billion-parameter mathematical reasoning model developed by [Neuriton](https://www.facebook.com/neuriton). It is trained with **Length-Penalized Reward Optimization (LPRO)**, a novel reinforcement learning alignment method that improves accuracy and response conciseness at the same time.

Built on `Qwen2.5-Math-1.5B-Instruct`, Nexus-1.5B achieves **80.2 on MATH-500** and **85.2 on GSM8K** (CoT), surpassing its base model by **+4.4 points** on MATH-500 (75.8 → 80.2) while reducing average response length by **14%** (312 → 268 tokens in the λ ablation below).

---
## What is LPRO?
Standard GRPO (Group Relative Policy Optimization) suffers from two key problems:
1. **Length bias:** short responses receive disproportionately large per-token gradient signals, implicitly penalizing long correct derivations.
2. **Entropy collapse:** symmetric probability-ratio clipping causes the policy to converge to a narrow set of solution patterns, limiting further improvement.

**LPRO** fixes both with three targeted modifications:
| Component | What it does |
|---|---|
| **Asymmetric clipping** | Decouples the lower and upper clip bounds (`ε_low=0.20`, `ε_high=0.28`) to preserve policy entropy |
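| **Token-level normalization** | Replaces GRPO's per-response weight `(1/G)·(1/\|oᵢ\|)` with the global per-token weight `1/Σ\|oᵢ\|` to produce an unbiased gradient estimate |
| **Length-penalized advantage** | Adds a group-standardized length penalty: `Aᵢ = (rᵢ - μᵣ)/(σᵣ + ε) - λ·(Lᵢ - μ_L)/(σ_L + ε)` |

The final objective is:

$$\mathcal{J}_{\text{LPRO}}(\theta) = \mathbb{E}\left[\frac{1}{\sum_{i=1}^{G}|o_i|} \sum_{i=1}^{G}\sum_{t=1}^{|o_i|} \min\!\left(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \text{clip}_{\text{asym}}(r_{i,t}(\theta))\,\hat{A}_{i,t}\right)\right]$$

where $r_{i,t}(\theta)$ is the token-level probability ratio between the new and old policies (distinct from the reward $r_i$), $\text{clip}_{\text{asym}}(r) = \text{clip}(r,\ 1-\varepsilon_{\text{low}},\ 1+\varepsilon_{\text{high}})$, and $\hat{A}_{i,t} = A_i$ for every token $t$ of response $o_i$.
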
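
As an illustration of the objective above, the following pure-Python sketch evaluates the token-normalized, asymmetrically clipped surrogate for a single group. The function name `lpro_surrogate` and its list-based inputs are assumptions made for readability, and sharing one advantage across all tokens of a response is assumed as well; an actual training run would compute the same quantity on batched tensors.

```python
import math


def lpro_surrogate(logp_new, logp_old, advantages, eps_low=0.20, eps_high=0.28):
    """Clipped surrogate for one group, averaged over all tokens in the group.

    logp_new / logp_old: one list of per-token log-probabilities per response.
    advantages: one scalar per response, shared by all of its tokens (assumption).
    """
    total, n_tokens = 0.0, 0
    for new, old, adv in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(new, old):
            ratio = math.exp(lp_new - lp_old)
            # Asymmetric clip: lower bound 1 - eps_low, upper bound 1 + eps_high.
            clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    # Divide by the total token count (sum of |o_i|), not by the group size G.
    return total / n_tokens


# Toy group: a 1-token response with positive advantage, a 2-token one with negative.
objective = lpro_surrogate(
    logp_new=[[math.log(1.5)], [math.log(0.5), 0.0]],
    logp_old=[[0.0], [0.0, 0.0]],
    advantages=[1.0, -1.0],
)
```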
---
## Model Details
| Property | Value |
|---|---|
| **Base model** | `Qwen/Qwen2.5-Math-1.5B-Instruct` |
| **Parameters** | 1.54B |
| **Architecture** | Transformer Decoder (28 layers, GQA, RoPE, SwiGLU, RMSNorm) |
| **Context length** | 8,192 tokens |
| **Vocabulary size** | 151,936 |
| **Training method** | LPRO (RL fine-tuning, no distillation) |
| **Training data** | 100 difficulty-filtered problems from MATH-500 |
| **Group size G** | 4 |
| **Length penalty λ** | 0.10 |
| **Learning rate** | 1e-6 |
| **PPO epochs/iter** | 4 |
---
## Benchmark Results
### Chain-of-Thought (CoT)
| Model | GSM8K | MATH-500 | MMLU-STEM | CMATH | GaoKao Cloze | GaoKao QA |
|---|---|---|---|---|---|---|
| Qwen2-Math-1.5B-Instruct | 84.2 | 69.4 | 54.9 | 79.6 | 59.7 | 50.7 |
| Qwen2.5-Math-1.5B-Instruct | 84.8 | 75.8 | 57.5 | 83.0 | 65.5 | 54.1 |
| **Nexus-1.5B** | **85.2** | **80.2** | **60.3** | **83.5** | **67.2** | **56.9** |
### Tool-Integrated Reasoning (TIR)
| Model | MATH-500 | Minerva Math | GaoKao 2023 EN | Olympiad Bench | College Math |
|---|---|---|---|---|---|
| Qwen2.5-Math-1.5B-Instruct | 80.0 | 34.0 | 68.0 | 49.0 | 54.0 |
| **Nexus-1.5B** | **84.0** | **40.0** | **74.0** | **56.0** | **57.0** |
### Ablation: Effect of Length Penalty (λ)
| λ | MATH-500 Acc. | Avg. Response Length |
|---|---|---|
| 0.0 (GRPO baseline) | 77.4 | 312 tokens |
| **0.1 (Nexus-1.5B)** | **80.2** | **268 tokens** |
| 0.3 (over-penalized) | 78.0 | 201 tokens |
> **Key insight:** At λ=0.1, accuracy and conciseness improve simultaneously. The length penalty acts as a de-noising regularizer — discouraging redundant steps rather than suppressing genuinely long derivations.
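
The trade-off in the ablation can be checked with a few lines of arithmetic. The sketch below only re-derives accuracy gains and relative length changes from the numbers in the table; `ABLATION` and `relative_to_baseline` are illustrative names, not part of any released code.

```python
# (MATH-500 accuracy, average response length in tokens), copied from the table above.
ABLATION = {0.0: (77.4, 312), 0.1: (80.2, 268), 0.3: (78.0, 201)}


def relative_to_baseline(lam: float) -> tuple[float, float]:
    """Return (accuracy gain in points, length change in percent) versus lambda=0."""
    base_acc, base_len = ABLATION[0.0]
    acc, length = ABLATION[lam]
    return acc - base_acc, 100.0 * (length - base_len) / base_len


for lam in (0.1, 0.3):
    gain, length_change = relative_to_baseline(lam)
    print(f"lambda={lam}: {gain:+.1f} points, {length_change:+.1f}% tokens")
# lambda=0.1: +2.8 points, -14.1% tokens
# lambda=0.3: +0.6 points, -35.6% tokens
```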
---
## How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Dat1710/nexus-1.5b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Chain-of-Thought prompt
system_prompt = "Please reason step by step, and put your final answer within \\boxed{}."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Find all functions f: ℝ⁺ → ℝ⁺ such that for each x ∈ ℝ⁺, there is exactly one y ∈ ℝ⁺ satisfying xf(y) + yf(x) ≤ 2."},
]

# Render the chat template to a string, then tokenize it
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=2048,
    temperature=0.7,
    do_sample=True,
)

# Drop the prompt tokens so that only newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
### Tool-Integrated Reasoning (TIR)
```python
system_prompt = (
    "Please integrate natural language reasoning with programs to solve the problem above, "
    "and put your final answer within \\boxed{}."
)
```
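
In TIR mode the response interleaves reasoning with programs that must be executed externally. The sketch below covers only the first step, pulling candidate programs out of a response. It assumes the model wraps them in Markdown-style `python` fences; that output format and the helper name `extract_python_blocks` are assumptions, not something this card specifies. Running the extracted code belongs in a sandbox and is out of scope here.

```python
import re

# Triple backtick, built indirectly so this example can sit inside a Markdown fence.
FENCE = "`" * 3

# Match an opening python fence, then capture lazily up to the next closing fence.
_BLOCK_RE = re.compile(FENCE + r"python[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_python_blocks(response: str) -> list[str]:
    """Return the source of every python-fenced block, in order of appearance."""
    return [block.strip() for block in _BLOCK_RE.findall(response)]


# Hypothetical response text, used only to exercise the helper.
sample = (
    "Let us compute the sum.\n"
    + FENCE + "python\nprint(sum(range(1, 11)))\n" + FENCE + "\n"
    + "So the answer is \\boxed{55}."
)
programs = extract_python_blocks(sample)
```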
---
## Evaluation Prompt Format
**CoT (8-shot for GSM8K, 4-shot for MATH-500):**
```
<|im_start|>system
Please reason step by step, and put your final answer within \boxed{}.<|im_end|>
<|im_start|>user
{problem}<|im_end|>
<|im_start|>assistant
```
**TIR (zero-shot):**
```
<|im_start|>system
Please integrate natural language reasoning with programs to solve the problem above,
and put your final answer within \boxed{}.<|im_end|>
<|im_start|>user
{problem}<|im_end|>
<|im_start|>assistant
```
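
For reference, the layout above can be reproduced with plain string formatting. This is a sketch for inspecting or logging prompts, with `build_prompt` as an illustrative helper name; in practice `tokenizer.apply_chat_template` should be preferred, and it is an assumption that the tokenizer's template yields exactly this string.

```python
COT_SYSTEM = "Please reason step by step, and put your final answer within \\boxed{}."


def build_prompt(problem: str, system: str = COT_SYSTEM) -> str:
    """Assemble system, user and open assistant turns in the format shown above."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{problem}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_prompt("What is 17 * 23?")
```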
---
## Training Details
### Data Curation
Training problems are sourced from **MATH-500** and filtered by difficulty using a learnable-zone criterion: a problem is retained if, among 8 sampled solutions from the base model, **between 2 and 5 are correct**. This yields 100 training problems that provide meaningful gradient signal — neither trivially easy nor intractably hard.
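
The learnable-zone criterion reduces to a range check on the number of correct samples per problem. The sketch below assumes the bounds are inclusive (2 ≤ correct ≤ 5 out of 8) and that correctness counts have already been collected; the function names and the example problem IDs are hypothetical.

```python
def in_learnable_zone(num_correct: int, lo: int = 2, hi: int = 5) -> bool:
    """True if the base model solves the problem sometimes, but not (almost) always."""
    return lo <= num_correct <= hi


def filter_problems(correct_counts: dict[str, int]) -> list[str]:
    """Keep the IDs of problems whose correct-sample count falls inside the zone."""
    return [pid for pid, count in correct_counts.items() if in_learnable_zone(count)]


# Correct solutions out of 8 samples per problem (made-up numbers).
counts = {"p1": 0, "p2": 2, "p3": 5, "p4": 6, "p5": 8}
kept = filter_problems(counts)  # ["p2", "p3"]
```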
### Training Procedure
1. **Group sampling:** For each prompt, sample G=4 responses from the current policy.
2. **Reward computation:** Rule-based binary reward (correctness via symbolic answer matching) + small format bonus (α=0.1) for well-formed `\boxed{}` output.
3. **Advantage computation:** Compute length-penalized group z-score advantages.
4. **Policy update:** Maximize LPRO objective for 4 epochs per iteration.
5. **Iterate:** Set old policy ← new policy and repeat.
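
Step 3 can be sketched for one group as follows. This is a minimal illustration of the length-penalized advantage formula from the LPRO table, assuming population standard deviations and a stabilizer `eps=1e-6`; neither choice is specified in this card, and `length_penalized_advantages` is a hypothetical helper name.

```python
from statistics import mean, pstdev


def length_penalized_advantages(rewards, lengths, lam=0.10, eps=1e-6):
    """A_i = z-score of reward minus lam times z-score of length, within one group."""
    mu_r, sd_r = mean(rewards), pstdev(rewards)
    mu_l, sd_l = mean(lengths), pstdev(lengths)
    return [
        (r - mu_r) / (sd_r + eps) - lam * (l - mu_l) / (sd_l + eps)
        for r, l in zip(rewards, lengths)
    ]


# One group of G=4 responses: two correct with format bonus (1.1), two incorrect (0.0),
# each pair containing one short (200 tokens) and one long (400 tokens) response.
adv = length_penalized_advantages([1.1, 1.1, 0.0, 0.0], [200, 400, 200, 400])
```

In this toy group the short correct response receives the largest advantage (about 1.1), the long correct one slightly less (about 0.9), and the long incorrect one the smallest (about -1.1), which is the ordering the penalty is meant to induce.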
### Reward Function
$$r_i = \mathbf{1}[\hat{a}(o_i) = a^*] + 0.1 \cdot \mathbf{1}[\text{format}(o_i)]$$
where $\hat{a}(o_i)$ is the answer extracted from the last `\boxed{}` expression in $o_i$, $a^*$ is the reference answer, and equality is checked via symbolic equivalence.

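
A simplified version of this reward is sketched below. It extracts the last `\boxed{}` expression with brace matching, then uses whitespace-stripped string equality as a stand-in for the symbolic equivalence check, and treats "contains a balanced `\boxed{}`" as the format condition. Both stand-ins are assumptions, as are the helper names `last_boxed` and `reward`.

```python
from typing import Optional


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, handling nested braces."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len("\\boxed{"):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # braces never balanced


def reward(response: str, gold: str, alpha: float = 0.1) -> float:
    """Binary correctness plus a format bonus of alpha."""
    answer = last_boxed(response)
    well_formed = answer is not None  # stand-in for format(o_i)
    # Stand-in for symbolic equivalence: exact match after stripping whitespace.
    correct = well_formed and answer.strip() == gold.strip()
    return float(correct) + alpha * float(well_formed)


r_correct = reward("Thus the result is \\boxed{\\frac{1}{2}}.", "\\frac{1}{2}")  # 1.1
r_wrong = reward("The answer is \\boxed{4}.", "3")                               # 0.1
r_unboxed = reward("The answer is 3.", "3")                                      # 0.0
```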
---
## Limitations
- **Scale:** Nexus-1.5B operates at 1.54B parameters. Hard olympiad problems (e.g., AIME) remain challenging for models at this scale.
- **Language:** Primarily optimized for English and Chinese mathematical text. Performance on other languages is not evaluated.
- **Domain:** Designed for mathematical reasoning. General language understanding or instruction-following tasks are outside the model's training distribution.
- **TIR dependency:** Tool-integrated reasoning requires a sandboxed Python interpreter at inference time.
---
## Citation
If you use Nexus-1.5B in your research, please cite:
```bibtex
@techreport{neuriton2026nexus,
title = {Nexus-1.5B: Length-Penalized Reward Optimization for Robust Mathematical Reasoning},
author = {Neuriton Team},
institution = {Neuriton},
year = {2026},
month = {Summer},
note = {Technical Report}
}
```
---
## Acknowledgements
We thank the Qwen Team at Alibaba Group for open-sourcing the Qwen2.5-Math model family, and the authors of DAPO for the asymmetric clipping insight that is central to LPRO.

---
*Developed by [Neuriton](https://neuriton.ai) · Summer 2026*