appendices: model evaluation (written by deepseek-ai)
deepseek-r1 evaluation
For all our models ("our" here refers to deepseek-ai), the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of 0.6, a top-p value of 0.95, and generate 64 responses per query to estimate pass@1.
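The sampling-based pass@1 estimate can be sketched as follows. This is a minimal illustration, assuming pass@1 for a query is the fraction of its sampled responses judged correct and that the benchmark score is the mean over queries, expressed as a percentage. The function names `pass_at_1` and `benchmark_pass_at_1` are hypothetical, and the correctness flags are assumed to come from an external grader that is not shown.

```python
from typing import Sequence


def pass_at_1(correct_flags: Sequence[bool]) -> float:
    """Estimate pass@1 for one query as the mean correctness of its samples.

    Assumption: each flag says whether one sampled response (e.g. one of the
    64 generated at temperature 0.6, top-p 0.95) was judged correct.
    """
    if not correct_flags:
        raise ValueError("need at least one sampled response")
    return sum(correct_flags) / len(correct_flags)


def benchmark_pass_at_1(per_query_flags: Sequence[Sequence[bool]]) -> float:
    """Average the per-query estimates and report the result as a percentage."""
    if not per_query_flags:
        raise ValueError("need at least one query")
    per_query = [pass_at_1(flags) for flags in per_query_flags]
    return 100.0 * sum(per_query) / len(per_query)


if __name__ == "__main__":
    # Two toy queries with 4 samples each (a real run would use 64 per query).
    toy = [[True, False, True, True], [False, False, True, False]]
    print(benchmark_pass_at_1(toy))
```

Averaging over many samples reduces the variance that a single sampled response would introduce, which is presumably why 64 responses are drawn per query.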
| Category | Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o 0513 | DeepSeek V3 | OpenAI o1-mini | OpenAI o1-1217 | DeepSeek R1 |
|---|---|---|---|---|---|---|---|
| | Architecture | - | - | MoE | - | - | MoE |
| | # Activated Params | - | - | 37B | - | - | 37B |
| | # Total Params | - | - | 671B | - | - | 671B |
| English | MMLU (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | 91.8 | 90.8 |
| | MMLU-Redux (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - | 92.9 |
| | MMLU-Pro (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - | 84.0 |
| | DROP (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | 92.2 |
| | IF-Eval (Prompt Strict) | 86.5 | 84.3 | 86.1 | 84.8 | - | 83.3 |
| | GPQA-Diamond (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | 75.7 | 71.5 |
| | SimpleQA (Correct) | 28.4 | 38.2 | 24.9 | 7.0 | 47.0 | 30.1 |
| | FRAMES (Acc.) | 72.5 | 80.5 | 73.3 | 76.9 | - | 82.5 |
| | AlpacaEval2.0 (LC-winrate) | 52.0 | 51.1 | 70.0 | 57.8 | - | 87.6 |
| | ArenaHard (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - | 92.3 |
| Code | LiveCodeBench (Pass@1-COT) | 33.8 | 34.2 | - | 53.8 | 63.4 | 65.9 |
| | Codeforces (Percentile) | 20.3 | 23.6 | 58.7 | 93.4 | 96.6 | 96.3 |
| | Codeforces (Rating) | 717 | 759 | 1134 | 1820 | 2061 | 2029 |
| | SWE Verified (Resolved) | 50.8 | 38.8 | 42.0 | 41.6 | 48.9 | 49.2 |
| | Aider-Polyglot (Acc.) | 45.3 | 16.0 | 49.6 | 32.9 | 61.7 | 53.3 |
| Math | AIME 2024 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 | 79.8 |
| | MATH-500 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | 97.3 |
| | CNMO 2024 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - | 78.8 |
| Chinese | CLUEWSC (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - | 92.8 |
| | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | 91.8 |
| | C-SimpleQA (Correct) | 55.4 | 58.7 | 68.0 | 40.3 | - | 63.7 |
distilled model evaluation
| Model | AIME 2024 pass@1 | AIME 2024 cons@64 | MATH-500 pass@1 | GPQA Diamond pass@1 | LiveCodeBench pass@1 | CodeForces rating |
|---|---|---|---|---|---|---|
| GPT-4o-0513 | 9.3 | 13.4 | 74.6 | 49.9 | 32.9 | 759 |
| Claude-3.5-Sonnet-1022 | 16.0 | 26.7 | 78.3 | 65.0 | 38.9 | 717 |
| o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 |
| QwQ-32B-Preview | 44.0 | 60.0 | 90.6 | 54.5 | 41.9 | 1316 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.9 | 52.7 | 83.9 | 33.8 | 16.9 | 954 |
| DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 92.8 | 49.1 | 37.6 | 1189 |
| DeepSeek-R1-Distill-Qwen-14B | 69.7 | 80.0 | 93.9 | 59.1 | 53.1 | 1481 |
| DeepSeek-R1-Distill-Qwen-32B | 72.6 | 83.3 | 94.3 | 62.1 | 57.2 | 1691 |
| DeepSeek-R1-Distill-Llama-8B | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205 |
| DeepSeek-R1-Distill-Llama-70B | 70.0 | 86.7 | 94.5 | 65.2 | 57.5 | 1633 |
* these two tables are directly quoted from deepseek-ai
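The cons@64 column in the distilled-model table is not defined in this appendix. A minimal sketch, assuming it denotes consensus (majority-vote) accuracy over the 64 sampled responses per query, could look like the following. The function names `cons_at_k` and `benchmark_cons_at_k` are hypothetical, and answers are assumed to be already extracted and normalized to comparable strings.

```python
from collections import Counter
from typing import Sequence, Tuple


def cons_at_k(answers: Sequence[str], reference: str) -> bool:
    """Return True if the most frequent sampled answer equals the reference.

    Assumption: `answers` holds the final answers extracted from k sampled
    responses to one query. Tie-breaking between equally frequent answers is
    left to Counter.most_common and is not specified by the source.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    majority_answer, _count = Counter(answers).most_common(1)[0]
    return majority_answer == reference


def benchmark_cons_at_k(queries: Sequence[Tuple[Sequence[str], str]]) -> float:
    """Percentage of queries whose majority-vote answer is correct."""
    if not queries:
        raise ValueError("need at least one query")
    hits = sum(cons_at_k(answers, ref) for answers, ref in queries)
    return 100.0 * hits / len(queries)


if __name__ == "__main__":
    # Toy data: 4 samples per query instead of 64.
    toy = [
        (["42", "41", "42", "7"], "42"),  # majority "42" is correct
        (["1", "1", "2", "1"], "2"),      # majority "1" is wrong
    ]
    print(benchmark_cons_at_k(toy))
```

Under this reading, cons@64 can exceed pass@1 whenever the correct answer is the most common one even though many individual samples are wrong, which is consistent with the gaps between the two AIME 2024 columns.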