We present the evaluation results of ReasonFlux-F1-32B on challenging reasoning tasks: AIME2024, AIME2025, MATH500, and GPQA-Diamond. For a fair comparison, we report results for all models as measured with our evaluation scripts in ReasonFlux-F1.
| Model | AIME2024@pass1 | AIME2025@pass1 | MATH500@pass1 | GPQA@pass1 |
| --- | --- | --- | --- | --- |
| QwQ-32B-Preview | 46.7 | 37.2 | 90.6 | 65.2 |
| LIMO-32B | 56.3 | 44.5 | 94.8 | 58.1 |
| s1-32B | 56.7 | 49.3 | 93.0 | 59.6 |
| OpenThinker-32B | 66.0 | 53.3 | 94.8 | 60.1 |
| R1-Distill-32B | 70.0 | 46.7 | 92.0 | 59.6 |
| ReasonFlux-Zero-32B | 56.7 | 37.2 | 91.2 | 61.2 |
| ReasonFlux-F1-32B | 76.7 | 53.3 | 96.0 | 67.2 |
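The source does not spell out how the @pass1 columns are computed. A common convention, assumed here, is to sample one or more completions per problem, grade each one, and average the per-problem success rates. The sketch below illustrates that convention only; the function name `pass_at_1` and the toy grading data are hypothetical and are not taken from the ReasonFlux-F1 evaluation scripts.

```python
from typing import Sequence


def pass_at_1(correct: Sequence[Sequence[bool]]) -> float:
    """Mean per-problem success rate, as a percentage.

    correct[i][j] is True when sample j for problem i was graded correct.
    This is an assumed convention, not the project's confirmed metric code.
    """
    if not correct:
        raise ValueError("need at least one problem")
    per_problem = []
    for samples in correct:
        if not samples:
            raise ValueError("each problem needs at least one sample")
        # Fraction of this problem's samples that were correct
        per_problem.append(sum(samples) / len(samples))
    return 100.0 * sum(per_problem) / len(per_problem)


# Hypothetical grading results: 3 problems, 2 samples each.
example = [[True, True], [True, False], [False, False]]
print(f"pass@1 = {pass_at_1(example):.1f}")
```

With a single sample per problem this reduces to plain accuracy; averaging over several samples can produce scores that are not a whole number of solved problems.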
Quick start with vLLM

    from vllm import LLM, SamplingParams
    from transformers import AutoTokenizer

    model_id = 'Gen-Verse/ReasonFlux-F1-7B'

    # Load the model, sharded across 8 GPUs
    model = LLM(
        model_id,
        tensor_parallel_size=8,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Allow long reasoning traces
    sampling_params = SamplingParams(
        max_tokens=32768,
    )

    # 2022 AIME I Problems/Problem 15
    # Raw string so LaTeX backslashes (\begin, \frac, \right) are not read as escape sequences
    question = r"""Let \(x, y\), and \(z\) be positive real numbers satisfying the system of equations:
    \[
    \begin{array}{c}
    \sqrt{2 x-x y}+\sqrt{2 y-x y}=1 \\
    \sqrt{2 y-y z}+\sqrt{2 z-y z}=\sqrt{2} \\
    \sqrt{2 z-z x}+\sqrt{2 x-z x}=\sqrt{3} .
    \end{array}
    \]
    Then \(\left[(1-x)(1-y)(1-z)\right]^{2}\) can be written as \(\frac{m}{n}\), where \(m\) and \(n\) are relatively prime positive integers. Find \(m+n\)."""

    # Wrap the question in the user/assistant chat markers
    ds_prompt = "<|User|>\n" + question + "<|Assistant|>\n"

    output = model.generate(ds_prompt, sampling_params=sampling_params)
    print(output[0].outputs[0].text)
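When running more than one question, it can help to factor the prompt wrapping from the quick start into a helper and to pull the final answer out of the generated text. The sketch below is illustrative only: `build_prompt` and `extract_boxed` are hypothetical helper names, and the extraction step assumes the model ends its reasoning with a `\boxed{...}` answer, which the source does not confirm.

```python
from typing import Optional


def build_prompt(question: str) -> str:
    """Wrap a question in the same chat markers used in the quick start."""
    return "<|User|>\n" + question + "<|Assistant|>\n"


def extract_boxed(text: str) -> Optional[str]:
    r"""Return the contents of the last \boxed{...} in text, or None.

    Assumption: the final answer is written as \boxed{...}.
    Nested braces such as \boxed{\frac{1}{2}} are handled by counting depth.
    """
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed


# Hypothetical questions and a hypothetical model output, for illustration.
prompts = [build_prompt(q) for q in ["What is 1+1?", "What is 2+3?"]]
sample_output = r"Adding the two values gives \boxed{42}."
print(extract_boxed(sample_output))
```

vLLM's `LLM.generate` accepts a list of prompts, so `prompts` can be passed in a single call and each result's text run through `extract_boxed`.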
Citation
@article{yang2025reasonflux,
title={ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates},
author={Yang, Ling and Yu, Zhaochen and Cui, Bin and Wang, Mengdi},
journal={arXiv preprint arXiv:2502.06772},
year={2025}}