DisCO-1.5B-logL/README.md

---
{}
---

## DisCO-1.5B-logL
This model is a fine-tuned version of [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) on the [agentica-org/DeepScaleR-Preview-Dataset](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset).

It was fine-tuned as part of the paper **DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization** (*[paper link](https://arxiv.org/abs/2505.12366)*).
Specifically, this model was fine-tuned by DisCO framework with Log-Likelihood (Log-L) score function.


The code is available at: [https://github.com/Optimization-AI/DisCO](https://github.com/Optimization-AI/DisCO)

**Below are comparisons with baseline models and baseline methods for fine-tuning 1.5B models.** OpenAI-o1-preview is included as a reference.  MRL denotes Max Response Length utilized in training/testing. The bottom 9 methods are all for fine-tuning DeepSeek-R1-Distill-Qwen-1.5B model on the same DeepScaleR dataset. DS is short for DeepSeek-R1, DSR is short for DeepScalaR.

| Model                  | MRL(Train/Test) | AIME 2024 | AIME 2025 | MATH 500 | AMC 2023 | Minerva | O-Bench | Avg.  |
| ---------------------- | --------------- | --------- | --------- | -------- | -------- | ------- | ------- | ----- |
| OpenAI-o1-Preview      | \-              | 0.4       | \-        | 0.814    | \-       | \-      | \-      | \-    |
| DS-Distill-Qwen-1.5B | 32k+   /   32k  | 0.288     | 0.263     | 0.828    | 0.629    | 0.265   | 0.433   | 0.451 |
| DS-Distill-Qwen-1.5B | 32k+   /   8k   | 0.181     | 0.215     | 0.758    | 0.515    | 0.237   | 0.353   | 0.376 |
| STILL-3-1.5B-preview   | 29k   /   32k   | 0.325     | 0.248     | 0.844    | 0.667    | 0.290   | 0.454   | 0.471 |
| DSR-1.5B-Preview       | 24k   /   32k   | 0.431     | 0.304     | 0.878    | 0.736    | 0.302   | 0.500   | 0.525 |
| DSR-1.5B-Preview       | 24k   /   8k    | 0.358     | 0.258     | 0.860    | 0.679    | 0.297   | 0.473   | 0.488 |
| GRPO                   | 8k   /   8k     | 0.277     | 0.242     | 0.838    | 0.647    | 0.276   | 0.462   | 0.457 |
| GRPO+ER           | 8k   /   8k     | 0.298     | 0.242     | 0.839    | 0.649    | 0.279   | 0.452   | 0.460 |
| Dr. GRPO               | 8k   /   8k     | 0.250     | 0.238     | 0.830    | 0.629    | 0.270   | 0.443   | 0.443 |
| DAPO                   | 8k   /   8k     | 0.310     | 0.252     | 0.848    | 0.675    | 0.296   | 0.456   | 0.473 |
| TRPA                   | 8k   /   8k     | 0.354     | 0.235     | 0.835    | 0.653    | 0.283   | 0.458   | 0.470 |
| DisCO (L-ratio)        | 8k   /   8k     | 0.381     | 0.306     | 0.878    | 0.746    | 0.319   | 0.512   | 0.524 |
| DisCO (log-L)          | 8k   /   8k     | 0.404     | 0.317     | 0.876    | 0.758    | 0.333   | 0.509   | 0.533 |

## Citation

```bibtex
@article{li2025disco,
  title={DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization},
  author={Li, Gang and Lin, Ming and Galanti, Tomer and Tu, Zhengzhong and Yang, Tianbao},
  journal={arXiv preprint arXiv:2505.12366},
  year={2025}
}
```