Initialize project; model provided by the ModelHub XC community
Model: nvidia/AceReason-Nemotron-14B Source: Original Platform
166
README_EVALUATION.md
Normal file
@@ -0,0 +1,166 @@
# AceReason Evaluation Toolkit

Our evaluation scripts and code are available at https://huggingface.co/nvidia/AceReason-Nemotron-14B/blob/main/evaluation.tar.gz

## Environment
- vllm==0.7.3
- torch==2.5.1
- transformers==4.48.2
- 8x NVIDIA H100 80GB HBM3 (CUDA Version: 12.8)
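
Before running anything, it can help to confirm that the installed packages match the pins listed above. The following is a minimal sketch, not part of the toolkit: the helper name `find_mismatches` is hypothetical, and it assumes that a local build suffix such as `+cu124` should still count as a match.

```python
from importlib import metadata

# Pins copied from the Environment list above.
PINNED = {"vllm": "0.7.3", "torch": "2.5.1", "transformers": "4.48.2"}


def find_mismatches(pinned, lookup=metadata.version):
    """Return {package: (expected, found)} for every package that differs from its pin."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed
        # Assumption: a local build suffix such as "2.5.1+cu124" still counts as a match.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means all three packages match; the GPU and CUDA line is not checked by this sketch.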

### Dataset Download
LiveCodeBench:
```
from datasets import load_dataset

# Load the test split of the LiveCodeBench code-generation problems (release v6).
ds = load_dataset(
    "livecodebench/code_generation_lite",
    version_tag="release_v6",
)["test"]

# Export all problems as a single JSON array of records.
ds.to_json("data/livecodebench_problems.json", orient="records", lines=False)
```

Math: see `data/*` (for example, `data/aime24.jsonl` and `data/aime25.jsonl`).
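
After the LiveCodeBench export, a quick sanity check of the file can catch a failed or partial download early. The sketch below is an illustration only: `summarize_export` is a hypothetical helper, and it assumes the export is one JSON array of records, which is what `orient="records", lines=False` requests. It makes no assumption about the field names and simply reports whatever it finds.

```python
import json
from pathlib import Path


def summarize_export(path):
    """Return (record count, sorted field names) for a JSON-array export."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError(f"{path}: expected a JSON array of records")
    # Collect every key that appears in at least one record.
    fields = sorted({key for record in records for key in record})
    return len(records), fields


if __name__ == "__main__":
    export = Path("data/livecodebench_problems.json")
    if export.exists():
        count, fields = summarize_export(export)
        print(f"{count} problems; fields: {', '.join(fields)}")
```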

## Evaluation Script

To generate model outputs for a single seed, use the following commands:

```
bash generate_livecodebench.sh ${model_path} ${seed} ${output_path} ${model_type}
bash generate_aime.sh ${model_path} ${seed} aime24 ${output_path} ${model_type}
bash generate_aime.sh ${model_path} ${seed} aime25 ${output_path} ${model_type}
```
Set `model_type` to `r1` for AceReason-Nemotron-1.0 models and to `qwen` for AceReason-Nemotron-1.1 models.
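
To run the single-seed commands for several seeds of your own choosing, a small driver can assemble the argument lists. This is a sketch under stated assumptions: `build_generation_commands` and `run_all` are hypothetical helpers, the seed values and paths in the usage section are placeholders, and it assumes the scripts accept the same `output_path` for every seed. The argument order mirrors the commands above.

```python
import subprocess


def build_generation_commands(model_path, seeds, output_path, model_type):
    """Return one argv list per (script, seed) pair, mirroring the commands above."""
    if model_type not in ("r1", "qwen"):
        raise ValueError("model_type must be 'r1' (1.0 models) or 'qwen' (1.1 models)")
    commands = []
    for seed in seeds:
        seed = str(seed)
        commands.append(
            ["bash", "generate_livecodebench.sh", model_path, seed, output_path, model_type]
        )
        for split in ("aime24", "aime25"):
            commands.append(
                ["bash", "generate_aime.sh", model_path, seed, split, output_path, model_type]
            )
    return commands


def run_all(commands, dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Placeholder values: replace the model path, seeds, and output directory with your own.
    run_all(build_generation_commands("path/to/model", [1, 2], "outputs/run", "r1"))
```

The default `dry_run=True` only prints the commands, so the list can be reviewed before any GPU time is spent.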

Alternatively, use our configured seeds to reproduce our results on AIME 24/25 (avg@64) and LiveCodeBench v5/v6 (avg@8):

```
bash run_livecodebench.sh ${model_path} ${output_path}
bash run_aime.sh ${model_path} ${output_path}
```
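
The avg@k figures can be read as the mean accuracy over k independently seeded runs. That reading is an interpretation consistent with the Corrects and Total columns in the reference results, not something taken from the evaluation scripts. A minimal sketch, with a hypothetical helper name and made-up per-seed counts:

```python
def avg_at_k(correct_per_seed, num_problems):
    """Mean accuracy in percent over k seeded runs on the same problem set."""
    if not correct_per_seed or num_problems <= 0:
        raise ValueError("need at least one run and at least one problem")
    per_seed_accuracy = [100.0 * correct / num_problems for correct in correct_per_seed]
    return sum(per_seed_accuracy) / len(per_seed_accuracy)


# Hypothetical per-seed counts for 8 runs on 34 problems (180 correct out of 272 samples).
example_counts = [22, 23, 22, 23, 22, 23, 22, 23]
print(avg_at_k(example_counts, 34))
```

Because every run covers the same problems, this mean equals the pooled ratio of correct samples to total samples, which is how the reference tables report it.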

To score the generated outputs on each benchmark and reproduce our results, run:

```
python evaluate_livecodebench.py -g ${output_path}
python evaluate_aime.py --modelfolder ${output_path} --test_data data/aime24.jsonl
python evaluate_aime.py --modelfolder ${output_path} --test_data data/aime25.jsonl
```

## Reference Results
We also provide our generations in cache.tar.gz for reference.

```
LiveCodeBench AceReason-Nemotron-1.0-7B (Avg@8)
=================================================================
Months       Corrects  Total  Accuracy
2023-05      180       272    66.17647058823529
2023-06      238       312    76.28205128205128
2023-07      337       432    78.00925925925925
2023-08      185       288    64.23611111111111
2023-09      275       352    78.125
2023-10      257       352    73.01136363636364
2023-11      217       280    77.5
2023-12      228       320    71.25
2024-01      193       288    67.01388888888889
2024-02      169       256    66.015625
2024-03      234       360    65.0
2024-04      226       296    76.35135135135135
2024-05      211       288    73.26388888888889
05/23-05/24  2950      4096   72.021484375
2024-06      277       368    75.27173913043478
2024-07      223       344    64.82558139534883
2024-08      275       528    52.083333333333336
2024-09      204       376    54.255319148936174
2024-10      209       424    49.29245283018868
2024-11      216       456    47.36842105263158
2024-12      223       392    56.88775510204081
2025-01      161       408    39.46078431372549
06/24-01/25  1788      3296   54.24757281553398
2025-02      179       408    43.872549019607845
2025-03      258       544    47.4264705882353
2025-04      38        96     39.583333333333336
v5           1142      2232   51.16487455197132
v6           621       1400   44.357142857142854

LiveCodeBench AceReason-Nemotron-1.0-14B (Avg@8)
=================================================================
Months       Corrects  Total  Accuracy
2023-05      211       272    77.57352941176471
2023-06      282       312    90.38461538461539
2023-07      393       432    90.97222222222223
2023-08      219       288    76.04166666666667
2023-09      315       352    89.48863636363636
2023-10      294       352    83.52272727272727
2023-11      229       280    81.78571428571429
2023-12      263       320    82.1875
2024-01      219       288    76.04166666666667
2024-02      201       256    78.515625
2024-03      296       360    82.22222222222223
2024-04      252       296    85.13513513513513
2024-05      233       288    80.90277777777777
05/23-05/24  3407      4096   83.1787109375
2024-06      311       368    84.51086956521739
2024-07      248       344    72.09302325581395
2024-08      299       528    56.628787878787875
2024-09      232       376    61.702127659574465
2024-10      266       424    62.735849056603776
2024-11      282       456    61.8421052631579
2024-12      253       392    64.54081632653062
2025-01      217       408    53.18627450980392
06/24-01/25  2108      3296   63.95631067961165
2025-02      211       408    51.71568627450981
2025-03      324       544    59.55882352941177
2025-04      41        96     42.708333333333336
v5           1350      2232   60.483870967741936
v6           775       1400   55.357142857142854

LiveCodeBench AceReason-Nemotron-1.1-7B (Avg@8)
=================================================================
Months       Corrects  Total  Accuracy
2023-05      205       272    75.36764705882354
2023-06      255       312    81.73076923076923
2023-07      356       432    82.4074074074074
2023-08      208       288    72.22222222222223
2023-09      287       352    81.5340909090909
2023-10      278       352    78.97727272727273
2023-11      234       280    83.57142857142857
2023-12      263       320    82.1875
2024-01      215       288    74.65277777777777
2024-02      182       256    71.09375
2024-03      270       360    75.0
2024-04      254       296    85.8108108108108
2024-05      221       288    76.73611111111111
05/23-05/24  3228      4096   78.80859375
2024-06      309       368    83.96739130434783
2024-07      235       344    68.31395348837209
2024-08      292       528    55.303030303030305
2024-09      211       376    56.11702127659574
2024-10      254       424    59.905660377358494
2024-11      269       456    58.99122807017544
2024-12      239       392    60.96938775510204
2025-01      194       408    47.549019607843135
06/24-01/25  2003      3296   60.77063106796116
2025-02      203       408    49.754901960784316
2025-03      306       544    56.25
2025-04      41        96     42.708333333333336
v5           1283      2232   57.482078853046595
v6           726       1400   51.857142857142854

AceReason-Nemotron-7B
====================================
AIME2024 (Avg@64)  68.64583333333334
AIME2025 (Avg@64)  53.59375000000002

AceReason-Nemotron-14B
====================================
AIME2024 (Avg@64)  78.43749999999997
AIME2025 (Avg@64)  67.65625

AceReason-Nemotron-1.1-7B
====================================
AIME2024 (Avg@64)  72.60416666666667
AIME2025 (Avg@64)  64.84375
```
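
When comparing a fresh run against these reference numbers, note that the period rows such as `05/23-05/24` pool the monthly counts rather than averaging the monthly percentages. The sketch below illustrates this with the AceReason-Nemotron-1.0-7B monthly rows for 2023-05 through 2024-05, copied from the table above; the helper name `aggregate` is hypothetical.

```python
def aggregate(rows):
    """Sum (corrects, total) pairs and return (corrects, total, accuracy in percent)."""
    corrects = sum(c for c, _ in rows)
    total = sum(t for _, t in rows)
    return corrects, total, 100.0 * corrects / total


# (corrects, total) for 2023-05 through 2024-05, AceReason-Nemotron-1.0-7B.
monthly = [
    (180, 272), (238, 312), (337, 432), (185, 288), (275, 352),
    (257, 352), (217, 280), (228, 320), (193, 288), (169, 256),
    (234, 360), (226, 296), (211, 288),
]

pooled = aggregate(monthly)
# Unweighted mean of the monthly percentages, shown only for contrast.
mean_of_months = sum(100.0 * c / t for c, t in monthly) / len(monthly)
print(pooled, mean_of_months)
```

The pooled counts match the `05/23-05/24` row of that table, while the unweighted mean of monthly percentages gives a different figure because months contain different numbers of problems.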