ModelHub XC 53ab47da88 Initialize project; model provided by the ModelHub XC community
Model: Alibaba-DT/Logics-STEM-8B-RL
Source: Original Platform
2026-06-04 09:40:13 +08:00


---
license: apache-2.0
base_model:
- Qwen/Qwen3-8B
---
# Logics-STEM-8B-RL
## 📰 News
- [2026.01.05]🔥 Released our [Technical Report](https://arxiv.org/abs/2601.01562).
- [2026.01.05]🔥 Released the first versions of [Logics-STEM-8B-SFT](https://huggingface.co/Logics-MLLM/Logics-STEM-8B-SFT), [Logics-STEM-8B-RL](https://huggingface.co/Logics-MLLM/Logics-STEM-8B-RL), and [Logics-STEM-SFT-Dataset-2.2M](https://huggingface.co/datasets/Logics-MLLM/Logics-STEM-SFT-Dataset-2.2M).
---
## 🔎 Overview
**Logics-STEM-8B-RL** is an 8B-parameter open reasoning model for STEM (Science, Technology, Engineering, Mathematics) tasks. It is post-trained from the Qwen3-8B model using a data-algorithm co-design recipe:
- **Stage-1: SFT** (proposal distribution building)
Trained on Logics-STEM-SFT-Dataset, a large-scale long chain-of-thought (CoT) dataset curated with a multi-stage pipeline (annotation, deduplication, decontamination, distillation, and stratified sampling). The released training set includes a 2.2M downsampled version, derived from a 10M-scale curated corpus.
- **Stage-2: RLVR** (target distribution shifting)
A failure-driven post-training paradigm is used to focus learning on “failure regions”: the model is evaluated on gold-standard STEM benchmarks; incorrectly solved problems trigger knowledge retrieval from a high-quality document corpus, followed by document-grounded synthetic QA generation and reinforcement learning with verifiable rewards (RLVR).
RL training adopts DAPO-style objectives with verified correctness rewards, plus additional rewards for format compliance and reasoning length, along with an n-gram repetition penalty to reduce redundancy.

Intended use: STEM reasoning (math competitions, scientific QA, graduate-level reasoning benchmarks) and research on post-training recipes for reasoning models.
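
The reward components named in Stage-2 (verified correctness, format compliance, reasoning length, and an n-gram repetition penalty) can be pictured with a small sketch. Everything in it is an assumption made for illustration: the function names, the weights, the `\boxed{` format check, the hard length budget, and the whitespace tokenisation are hypothetical and are not taken from the technical report. Correctness is passed in as a boolean because answer verification is handled by a separate verifier.

```python
from collections import Counter
from typing import List


def ngram_repetition_rate(tokens: List[str], n: int = 4) -> float:
    """Fraction of n-grams that duplicate an earlier n-gram in the sequence."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


def total_reward(
    response: str,
    is_correct: bool,
    max_len: int = 4096,      # hypothetical length budget (in tokens)
    w_format: float = 0.1,    # hypothetical weights, not from the report
    w_length: float = 0.1,
    w_repeat: float = 0.5,
) -> float:
    """Combine the four reward terms into one scalar (illustrative only)."""
    tokens = response.split()  # whitespace tokens stand in for model tokens
    correctness = 1.0 if is_correct else 0.0
    format_ok = 1.0 if "\\boxed{" in response else 0.0
    # Assumed shape of the length term: full credit within the budget.
    length_ok = 1.0 if len(tokens) <= max_len else 0.0
    repetition = ngram_repetition_rate(tokens)
    return (
        correctness
        + w_format * format_ok
        + w_length * length_ok
        - w_repeat * repetition
    )
```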
---
## 📊 Experimental Results
![](./imgs/math_evaluation_res.png)
*Figure. Pass@1 performance of Logics-STEM-8B-RL on math evaluation benchmarks.*
![](./imgs/stem_evaluation_res.png)
*Figure. Pass@1 performance of Logics-STEM-8B-RL on STEM-related evaluation benchmarks.*
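
For readers unfamiliar with the metric in the figures: pass@1 is commonly estimated by sampling one or more responses per problem and averaging the fraction that are verified correct. The sketch below assumes that convention. The helper name `pass_at_1` and the number of samples per problem are illustrative assumptions, not the evaluation settings used in the technical report.

```python
from typing import List


def pass_at_1(results: List[List[bool]]) -> float:
    """Estimate pass@1 from per-problem correctness flags.

    results[i][j] is True if sample j for problem i was verified correct.
    """
    if not results or any(len(samples) == 0 for samples in results):
        raise ValueError("every problem needs at least one sample")
    # Per-problem accuracy, then an unweighted mean over problems.
    per_problem = [sum(samples) / len(samples) for samples in results]
    return sum(per_problem) / len(per_problem)
```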
---
## 🚀 Quickstart
### Transformers (recommended)
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hugging Face repository of the RL checkpoint described in this README.
model_id = "Logics-MLLM/Logics-STEM-8B-RL"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # load weights in bfloat16
    device_map="auto",           # place layers automatically (needs `accelerate`)
)

prompt = """Let p be the least prime number for which there exists a positive integer n such that n^4 + 1 is divisible by p^2.
Find the least positive integer m such that m^4 + 1 is divisible by p^2.
Please put your final answer within \\boxed{}.
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=True,  # enable sampling so temperature/top_p/top_k take effect
    temperature=0.6,
    top_p=0.95,
    top_k=20,
)
# The decoded text contains the prompt followed by the generated response.
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
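
Because the prompt asks for the final answer inside `\boxed{}`, a small post-processing step can pull that answer out of the decoded text. The helper below is a minimal sketch and not part of the model's tooling: `extract_last_boxed` is a hypothetical name, and it only performs brace matching, with no LaTeX normalisation or equivalence checking. It takes the last occurrence because the decoded Quickstart output also contains the prompt, which itself includes an empty `\boxed{}`.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces: treat as no answer found


# Hypothetical decoded text: prompt with an empty box, then a boxed answer.
sample = "Please put your final answer within \\boxed{}. ... so \\boxed{\\frac{1}{2}}"
print(extract_last_boxed(sample))
```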
---
## 📚 Citation
If you use **Logics-STEM-8B-RL**, please cite the following technical report:
```
@misc{xu2026logicsstemempoweringllmreasoning,
  title={Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training and Document Knowledge Enhancement},
  author={Mingyu Xu and Cheng Fang and Keyue Jiang and Yuqian Zheng and Yanghua Xiao and Baojian Zhou and Qifang Zhao and Suhang Zheng and Xiuwen Zhu and Jiyang Tang and Yongchi Zhao and Yijia Luo and Zhiqi Bai and Yuchi Xu and Wenbo Su and Wei Wang and Bing Zhao and Lin Qu and Xiaoxiao Xu},
  year={2026},
  eprint={2601.01562},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.01562},
}
```
---
## 🙏 Acknowledgement
We thank the open-source community and the toolchains that enabled this work, including (non-exhaustively) the **Qwen3** models (teacher, embedding, and base models), **DeepSeek-R1** (data synthesis), **ROLL** (RLVR reward/infra), **math-verify** (answer verification), and the community datasets listed in the technical report.