---
{}
---

## DisCO-1.5B-logL
This model is a fine-tuned version of [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) on the [agentica-org/DeepScaleR-Preview-Dataset](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset).

It was fine-tuned as part of the paper **DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization** (*[paper link](https://arxiv.org/abs/2505.12366)*).
Specifically, this model was fine-tuned with the DisCO framework using the log-likelihood (log-L) score function.

The code is available at: [https://github.com/Optimization-AI/DisCO](https://github.com/Optimization-AI/DisCO)

**Below are comparisons with baseline models and baseline methods for fine-tuning 1.5B models.** OpenAI-o1-preview is included as a reference. MRL denotes the maximum response length used in training/testing. The bottom 9 methods all fine-tune the DeepSeek-R1-Distill-Qwen-1.5B model on the same DeepScaleR dataset. DS is short for DeepSeek-R1, and DSR is short for DeepScaleR.

| Model                  | MRL(Train/Test) | AIME 2024 | AIME 2025 | MATH 500 | AMC 2023 | Minerva | O-Bench | Avg.  |
| ---------------------- | --------------- | --------- | --------- | -------- | -------- | ------- | ------- | ----- |
| OpenAI-o1-Preview      | \-              | 0.4       | \-        | 0.814    | \-       | \-      | \-      | \-    |
| DS-Distill-Qwen-1.5B   | 32k+ / 32k      | 0.288     | 0.263     | 0.828    | 0.629    | 0.265   | 0.433   | 0.451 |
| DS-Distill-Qwen-1.5B   | 32k+ / 8k       | 0.181     | 0.215     | 0.758    | 0.515    | 0.237   | 0.353   | 0.376 |
| STILL-3-1.5B-preview   | 29k / 32k       | 0.325     | 0.248     | 0.844    | 0.667    | 0.290   | 0.454   | 0.471 |
| DSR-1.5B-Preview       | 24k / 32k       | 0.431     | 0.304     | 0.878    | 0.736    | 0.302   | 0.500   | 0.525 |
| DSR-1.5B-Preview       | 24k / 8k        | 0.358     | 0.258     | 0.860    | 0.679    | 0.297   | 0.473   | 0.488 |
| GRPO                   | 8k / 8k         | 0.277     | 0.242     | 0.838    | 0.647    | 0.276   | 0.462   | 0.457 |
| GRPO+ER                | 8k / 8k         | 0.298     | 0.242     | 0.839    | 0.649    | 0.279   | 0.452   | 0.460 |
| Dr. GRPO               | 8k / 8k         | 0.250     | 0.238     | 0.830    | 0.629    | 0.270   | 0.443   | 0.443 |
| DAPO                   | 8k / 8k         | 0.310     | 0.252     | 0.848    | 0.675    | 0.296   | 0.456   | 0.473 |
| TRPA                   | 8k / 8k         | 0.354     | 0.235     | 0.835    | 0.653    | 0.283   | 0.458   | 0.470 |
| DisCO (L-ratio)        | 8k / 8k         | 0.381     | 0.306     | 0.878    | 0.746    | 0.319   | 0.512   | 0.524 |
| DisCO (log-L)          | 8k / 8k         | 0.404     | 0.317     | 0.876    | 0.758    | 0.333   | 0.509   | 0.533 |

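
The card does not say how the Avg. column is computed. It appears to be the unweighted mean of the six benchmark columns, which is an assumption here rather than something the source states. The sketch below checks that reading against a few rows copied from the table; the names `SCORES` and `avg_matches` are illustrative only and are not part of the DisCO codebase.

```python
from statistics import mean

# Scores copied from the table, in column order:
# AIME 2024, AIME 2025, MATH 500, AMC 2023, Minerva, O-Bench.
# Each entry maps a row label to (six benchmark scores, reported Avg.).
SCORES = {
    "DS-Distill-Qwen-1.5B (32k+ / 8k)": ([0.181, 0.215, 0.758, 0.515, 0.237, 0.353], 0.376),
    "DSR-1.5B-Preview (24k / 8k)": ([0.358, 0.258, 0.860, 0.679, 0.297, 0.473], 0.488),
    "DisCO (L-ratio)": ([0.381, 0.306, 0.878, 0.746, 0.319, 0.512], 0.524),
    "DisCO (log-L)": ([0.404, 0.317, 0.876, 0.758, 0.333, 0.509], 0.533),
}


def avg_matches(scores, reported, tol=1e-3):
    """Return True if the unweighted mean of `scores` is within `tol` of `reported`.

    The tolerance allows for the table values being rounded to three decimals.
    """
    return abs(mean(scores) - reported) <= tol


if __name__ == "__main__":
    for name, (scores, reported) in SCORES.items():
        print(f"{name}: mean={mean(scores):.4f} reported={reported:.3f} "
              f"match={avg_matches(scores, reported)}")
```

If a row failed this check, that would suggest the average is weighted or computed from unrounded scores, so the tolerance is kept loose on purpose.
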
## Citation

```bibtex
@article{li2025disco,
  title={DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization},
  author={Li, Gang and Lin, Ming and Galanti, Tomer and Tu, Zhengzhong and Yang, Tianbao},
  journal={arXiv preprint arXiv:2505.12366},
  year={2025}
}
```