## Run benchmark
This benchmark is primarily intended for reasoning models such as DeepSeek-R1 and its distilled variants like DeepSeek-R1-Distill-Qwen-1.5B. Please install the ANTLR runtime, which `parse_latex` depends on for the symbolic equality check (sketched below):

```bash
pip install antlr4-python3-runtime
```
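For reference, here is a minimal sketch of such a check. The `latex_equal` helper is illustrative, not part of `bench_sglang.py`:

```python
# Illustrative helper (hypothetical, not from bench_sglang.py): compare a
# model's LaTeX answer against the reference answer symbolically with sympy.
from sympy import simplify
from sympy.parsing.latex import parse_latex  # needs antlr4-python3-runtime

def latex_equal(pred: str, ref: str) -> bool:
    """Return True if the two LaTeX expressions are symbolically equal."""
    try:
        return simplify(parse_latex(pred) - parse_latex(ref)) == 0
    except Exception:
        # Unparsable answers are treated as not equal.
        return False

print(latex_equal(r"\frac{1}{2}", r"0.5"))  # True
```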
### Benchmark sglang
- Launch the server (a quick client-side sanity check is sketched after the commands below):

```bash
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --port 30000
```
Note that depending on the GPU, this benchmark can take quite some time. To employ data parallelism, use the router:

```bash
python3 -m sglang_router.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --port 30000 --dp-size 4
```
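Once the server reports it is ready, you can sanity-check it through the OpenAI-compatible endpoint that sglang serves. This is a minimal sketch assuming the default host and the port from the launch command above:

```python
# Quick sanity check against the OpenAI-compatible API served by sglang.
# Assumes the server from the launch command above (localhost, port 30000).
import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[{"role": "user", "content": "What is 1 + 1?"}],
    temperature=0.6,  # suggested sampling parameters for R1-style models
    top_p=0.95,
    max_tokens=256,
)
print(response.choices[0].message.content)
```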
- Benchmarking

We use the suggested sampling parameters of `temperature=0.6`, `top_p=0.95`, and `max_new_tokens=32768`. The command-line argument `--num-tries` can be used to evaluate the model multiple times on the same question; we use the 64 tries suggested in the repo for AIME 2024 and, for LIMO, 8 tries due to the size of the dataset. (A sketch of how accuracy can be aggregated over tries follows the commands below.)

By default, the script evaluates on the LIMO dataset:
```bash
python3 bench_sglang.py --parallel 256 --num-tries 64 --port 30000
```
Evaluate on the AIME 2024 dataset:

```bash
python3 bench_sglang.py --parallel 256 --port 30000 --data-path Maxwell-Jia/AIME_2024 --question-key Problem --answer-key Answer --num-tries 64
```
Evaluate on the AIME 2025 I dataset (benchmark results are shown in the table below):

```bash
python3 bench_sglang.py --parallel 256 --port 30000 --data-path opencompass/AIME2025 --question-key question --answer-key answer --num-tries 64
```
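For intuition, here is a hypothetical sketch (not the benchmark's actual code) of how accuracy can be aggregated when each question is attempted `--num-tries` times: per-question accuracy is the fraction of correct tries, and the reported number is the mean over questions.

```python
# Hypothetical aggregation sketch: mean per-question accuracy over tries.
# results[q][t] is True iff try t on question q matched the reference answer.
from statistics import mean

def aggregate_accuracy(results: list[list[bool]]) -> float:
    return mean(mean(tries) for tries in results)

# Two questions, two tries each: 2/2 and 1/2 correct -> 0.75 overall.
print(aggregate_accuracy([[True, True], [True, False]]))  # 0.75
```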
### Results
| Dataset | Num tries | Accuracy | Reference accuracy |
|---|---|---|---|
| LIMO | 8 | 47.7% | ? |
| AIME 2024 | 64 | 33.2% | 28.9% |
| AIME 2025 I | 64 | 29.9% | 25.0% |