You agree to not use the dataset to conduct experiments that cause harm to human subjects.
OpenSeek-Mid-v1
OpenSeek-Mid-v1 is a 10.61-billion-parameter language model grown from Qwen3-4B-Base through a two-stage model expansion pipeline and trained on only 2 trillion tokens of fully open-source data.
Despite having roughly 25% fewer parameters and using 18x fewer training tokens (2T versus about 36T), OpenSeek-Mid-v1 matches or surpasses Qwen3-14B-Base on several benchmarks, including MMLU-Pro, BBH, MATH, and GPQA-diamond.
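To make the idea of growing a model from a seed more concrete, here is a minimal sketch on a toy stack of residual MLP blocks. It is an illustration under stated assumptions, not the OpenSeek pipeline: the helper names (`widen`, `stack_partial`), the zero-initialized new output columns, and the choice of which blocks to repeat are all hypothetical, and only the MLP hidden width is widened here (a real width expansion would also touch the residual stream and attention).

```python
import copy
import random

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def relu(v):
    return [max(0.0, a) for a in v]

def mlp(block, x):
    """Toy residual block: x + W_out @ relu(W_in @ x)."""
    hidden = relu(matvec(block["w_in"], x))
    return [xi + oi for xi, oi in zip(x, matvec(block["w_out"], hidden))]

def widen(block, new_hidden, rng):
    """Grow the hidden width (hypothetical scheme).

    New rows of W_in are small random values and new columns of W_out are
    zero, so the widened block computes the same function right after growth.
    """
    extra = new_hidden - len(block["w_in"])
    d_model = len(block["w_in"][0])
    w_in = block["w_in"] + [
        [rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(extra)
    ]
    w_out = [row + [0.0] * extra for row in block["w_out"]]
    return {"w_in": w_in, "w_out": w_out}

def stack_partial(blocks, start, end):
    """Partial depth stacking: repeat blocks[start:end] once after the originals."""
    return blocks[:end] + copy.deepcopy(blocks[start:end]) + blocks[end:]

def forward(blocks, x):
    for block in blocks:
        x = mlp(block, x)
    return x

rng = random.Random(0)
d_model, hidden = 4, 6
seed_model = [
    {
        "w_in": [[rng.gauss(0.0, 0.5) for _ in range(d_model)] for _ in range(hidden)],
        "w_out": [[rng.gauss(0.0, 0.5) for _ in range(hidden)] for _ in range(d_model)],
    }
    for _ in range(3)
]
x = [0.3, -1.2, 0.7, 0.1]

widened = [widen(block, new_hidden=10, rng=rng) for block in seed_model]
grown = stack_partial(widened, start=1, end=3)  # 3 blocks become 5

# Widening alone leaves the output unchanged; stacking does not, so the
# grown model still needs continued training.
same_after_widening = all(
    abs(a - b) < 1e-9 for a, b in zip(forward(seed_model, x), forward(widened, x))
)
```

The point of the zero-initialized columns is that the grown model starts from the seed model's behavior instead of from noise. Depth stacking changes the function, which is one reason continued training is needed after growth.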
Highlights
Model Growth, Not From-Scratch Training: Grown from Qwen3-4B-Base via width expansion and partial depth stacking, inheriting the seed model's learned representations.
Extreme Data Efficiency: Matches Qwen3-14B-Base (~36T tokens) with only 2T training tokens, an 18x reduction in data requirement.
Muon Optimizer: Spectral whitening ensures expanded dimensions are effectively utilized, delivering significant gains over AdamW in the model growth setting.
Fully Open-Source Data: All training data comes from publicly available datasets (Nemotron-CC-v2, Stack-Edu, Dolmino, CCI, etc.).
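For readers unfamiliar with Muon, the sketch below shows the Newton-Schulz orthogonalization step used in publicly available Muon implementations, written in plain Python for small matrices. It approximately equalizes the singular values of an update matrix, which is one way to read "spectral whitening". Whether OpenSeek-Mid-v1 uses exactly this variant, step count, or these coefficients is an assumption. Momentum, learning-rate scaling, and the rest of the optimizer are omitted.

```python
import math

def transpose(a):
    return [list(row) for row in zip(*a)]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def newton_schulz(g, steps=5, eps=1e-7):
    """Push all singular values of g toward 1 (only approximately).

    Uses the quintic iteration X <- a*X + (b*A + c*A@A) @ X with A = X @ X^T,
    with coefficients taken from the public Muon reference code.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]  # spectral norm is now <= 1

    tall = len(x) > len(x[0])
    if tall:  # iterate on the wide orientation so X @ X^T stays small
        x = transpose(x)

    for _ in range(steps):
        xxt = matmul(x, transpose(x))
        xxt2 = matmul(xxt, xxt)
        poly = [
            [b * p + c * q for p, q in zip(r1, r2)] for r1, r2 in zip(xxt, xxt2)
        ]
        px = matmul(poly, x)
        x = [[a * u + w for u, w in zip(r1, r2)] for r1, r2 in zip(x, px)]

    return transpose(x) if tall else x

# A toy "gradient" whose two singular values differ by a factor of 100.
g = [[10.0, 0.0], [0.0, 0.1]]
update = newton_schulz(g)

# After the iteration the two directions have comparable magnitude, so the
# weak direction is no longer drowned out by the dominant one.
ratio_before = g[0][0] / g[1][1]
ratio_after = update[0][0] / update[1][1]
```

In the growth setting, the intuition is that newly added dimensions initially receive small gradient components. Equalizing the spectrum of the update gives those directions a comparable step size.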
Stage 1 (%) and Stage 2 (%) denote each dataset's sampling weight within the respective stage. "—" indicates the dataset is not used in that stage.
Web
| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-CC-v2-HQ-Syn | 798.41 | 23.24 | 19.36 |
| Nemotron-CC-v2-Diverse-QA (×5 shards) | 340.81 | 9.92 | 8.26 |
| Nemotron-CC-v2-HQ (×5 shards) | 303.82 | 8.84 | 7.36 |
| dolmino-mix-1124-wiki | 3.82 | 0.15 | 0.18 |
| dolmino-mix-1124-stackexchange | 1.30 | 0.05 | 0.06 |
Math
| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-SFT-MATH | 207.46 | 11.70 | 11.70 |
| Nemotron-CC-Math-v1-4plus-MIND | 74.34 | 4.19 | 4.19 |
| Nemotron-CC-Math-v1-4plus | 53.37 | 3.01 | 3.01 |
| Dolmino-math | 11.17 | 0.63 | 0.63 |
| OpenMathInstruct-2 | 5.30 | 0.30 | 0.30 |
| OpenMathReasoning-4k | 2.48 | 0.14 | 0.14 |
| NuminaMath-1.5 | 0.38 | 0.02 | 0.02 |
Code
| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-Pretraining-Code-v1-Syn | 171.53 | 9.05 | 10.86 |
| Nemotron-SFT-Code | 57.47 | 3.03 | 3.64 |
| stack-edu-Java | 31.70 | 1.06 | 1.27 |
| stack-edu-Markdown | 26.64 | 0.38 | 0.45 |
| stack-edu-Python | 18.27 | 1.54 | 1.85 |
| stack-edu-Cpp | 12.62 | 1.11 | 1.33 |
| stack-edu-JavaScript | 8.99 | 1.00 | 1.20 |
| stack-edu-SQL | 8.23 | 0.37 | 0.44 |
| github-issue | 8.46 | 0.25 | 0.30 |
| stack-edu-PHP | 7.43 | 0.25 | 0.30 |
| stack-edu-CSharp | 7.26 | 0.37 | 0.44 |
| stack-edu-C | 4.80 | 0.43 | 0.52 |
| stack-edu-Shell | 2.60 | 0.01 | 0.01 |
| stack-edu-TypeScript | 2.51 | 0.18 | 0.22 |
| OpenCodeInstruct | 1.59 | — | 0.10 |
| stack-edu-Swift | 1.53 | 0.06 | 0.07 |
| stack-edu-Rust | 1.45 | 0.05 | 0.06 |
| stack-edu-Go | 1.42 | 0.03 | 0.04 |
| kaggle-notebooks | 1.42 | 0.65 | 0.78 |
| stack-edu-Ruby | 1.36 | 0.01 | 0.01 |
| OpenCodeReasoning-2-cpp-4k | 0.76 | 0.04 | 0.05 |
| OpenCodeReasoning-2-python-4k | 0.58 | 0.03 | 0.04 |
| github-code-review | 0.32 | — | 0.02 |
STEM & Science
| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-Pretraining-Specialized-v1 (×4 shards) | 276.83 | 10.55 | 12.73 |
| Nemotron-Pretraining-SFT-v1-General | 86.93 | 3.31 | 4.00 |
| dolmino-mix-1124-pes2o | 60.19 | 0.50 | 0.50 |
| Nemotron-Pretraining-Specialized-v1.1 | 9.04 | — | 0.42 |
| OpenScienceReasoning-2-4k | 1.72 | 0.07 | 0.08 |
| MegaScience | 0.98 | 0.04 | 0.04 |
Multilingual
| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-CC-v2-Translated-Diverse-QA | 135.80 | 1.74 | 1.74 |
| CCI4_0-Zh-High | 98.76 | 1.26 | 1.26 |
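The sampling weights above determine how often each dataset is seen, independent of its raw size. The sketch below shows one way to turn a weight into a sampled-token count and an approximate number of epochs, and how a data loader might pick the source of the next sample. The per-stage token budget is not stated in this card, so `HYPOTHETICAL_STAGE_BUDGET_B` is a made-up placeholder, and treating the weights as token-level shares is also an assumption.

```python
import random

# (dataset, tokens in billions, Stage 1 %, Stage 2 %), copied from a few table rows.
# None marks a dataset that is not used in that stage (shown as "—" in the tables).
ROWS = [
    ("Nemotron-CC-v2-HQ-Syn", 798.41, 23.24, 19.36),
    ("Nemotron-SFT-MATH", 207.46, 11.70, 11.70),
    ("kaggle-notebooks", 1.42, 0.65, 0.78),
    ("OpenCodeInstruct", 1.59, None, 0.10),
]

# Placeholder only: the real split of the 2T tokens between stages is not given.
HYPOTHETICAL_STAGE_BUDGET_B = 1000.0

def stage_plan(rows, stage, budget_b):
    """Return {dataset: (sampled tokens in B, approximate epochs)} for one stage."""
    plan = {}
    for row in rows:
        name, tokens_b, weight = row[0], row[1], row[1 + stage]
        if weight is None:
            continue  # dataset not used in this stage
        sampled_b = budget_b * weight / 100.0
        plan[name] = (sampled_b, sampled_b / tokens_b)
    return plan

def sample_dataset(rows, stage, rng):
    """Pick the source dataset for the next sample, proportional to its stage weight."""
    active = [(row[0], row[1 + stage]) for row in rows if row[1 + stage] is not None]
    names, weights = zip(*active)
    return rng.choices(names, weights=weights, k=1)[0]

plan_stage1 = stage_plan(ROWS, stage=1, budget_b=HYPOTHETICAL_STAGE_BUDGET_B)
for name, (sampled_b, epochs) in plan_stage1.items():
    print(f"{name}: {sampled_b:.1f}B tokens sampled, about {epochs:.2f} epochs")

next_source = sample_dataset(ROWS, stage=2, rng=random.Random(0))
```

Under any fixed budget, a small dataset with a relatively high weight (for example kaggle-notebooks) is repeated many more times than a large web corpus, which is the usual way to upweight scarce high-value data.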
Checkpoint Merging
The final model is a weighted average of 5 complementary checkpoints, each selected for a unique strength:
| Checkpoint | Weight | Role | Key Metric |
|---|---|---|---|
| iter 169984 | 0.30 | Code anchor | MBPP 78.84 |
| iter 219136 | 0.25 | Reasoning lead | GPQA-d 44.39 |
| iter 174080 | 0.15 | Code peak | EvalPlus 68.88 |
| iter 190464 | 0.15 | Math bridge | GPQA-d 42.86 |
| iter 217088 | 0.15 | General boost | BBH 82.84 |
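The weighted average described above can be sketched as follows. This is a toy version that treats each checkpoint as a dictionary of flat float lists. The function name `merge_checkpoints` and the toy parameter values are hypothetical, and a real merge would apply the same per-parameter arithmetic to framework tensors loaded from the five checkpoints.

```python
# Merge weights per training iteration, copied from the table above.
MERGE_WEIGHTS = {
    169984: 0.30,
    219136: 0.25,
    174080: 0.15,
    190464: 0.15,
    217088: 0.15,
}

def merge_checkpoints(state_dicts, weights):
    """Return the element-wise weighted average of matching parameters."""
    if len(state_dicts) != len(weights):
        raise ValueError("need exactly one weight per checkpoint")
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("merge weights should sum to 1")
    merged = {}
    for name, reference in state_dicts[0].items():
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(len(reference))
        ]
    return merged

# Toy stand-ins for the five checkpoints (values are illustrative only).
toy_checkpoints = [
    {"layer.weight": [1.0, 0.0]},
    {"layer.weight": [2.0, 0.0]},
    {"layer.weight": [3.0, 4.0]},
    {"layer.weight": [4.0, 0.0]},
    {"layer.weight": [5.0, 0.0]},
]

merged = merge_checkpoints(toy_checkpoints, list(MERGE_WEIGHTS.values()))
```

Because the weights sum to 1, the merged parameters stay on the same scale as the individual checkpoints.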
Evaluation Results
All evaluations were conducted with lm-eval-harness under consistent settings.
| Benchmark | Qwen3-4B | Qwen3-8B | Qwen3.5-9B | Nemotron-12B | Gemma3-12B | Qwen3-14B | OpenSeek-Mid-v1 |
|---|---|---|---|---|---|---|---|
| Training tokens | 36T | 36T | 36T | 20T | 12T | 36T | 2T |
| MMLU (5-shot) | 72.72 | 76.57 | 78.64 | 78.07 | 73.28 | 80.57 | 79.31 |
| MMLU-Pro (5-shot CoT) | 49.31 | 52.35 | 58.48 | 57.57 | 41.16 | 56.00 | 66.57 |
| AGIEval-en (0-shot) | 45.92 | 49.09 | 45.15 | 49.20 | 44.89 | 52.83 | 52.18 |
| BBH (3-shot CoT) | 71.20 | 77.75 | 82.23 | 69.65 | 73.78 | 78.71 | 82.55 |
| HellaSwag (5-shot) | 75.36 | 79.47 | 81.04 | 83.13 | 83.45 | 82.05 | 81.81 |
| Winogrande (5-shot) | 71.90 | 77.51 | 76.80 | 79.24 | 80.35 | 79.40 | 79.24 |
| PIQA (5-shot) | 78.89 | 81.39 | 81.61 | 82.97 | 81.80 | 83.30 | 83.19 |
| OpenBookQA (5-shot) | 45.00 | 49.00 | 50.00 | 50.20 | 49.60 | 50.80 | 49.80 |
| ARC-C (0-shot) | 51.19 | 56.91 | 56.83 | 60.58 | 64.68 | 59.30 | 62.12 |
| GSM8K (4-shot CoT) | 84.31 | 86.73 | 85.60 | 81.43 | 72.02 | 90.07 | 89.16 |
| MATH (4-shot CoT) | 50.16 | 52.48 | 56.16 | 57.30 | 43.30 | 59.70 | 65.88 |
| GPQA-diamond (3-shot CoT) | 32.65 | 35.71 | 37.76 | 31.12 | 23.47 | 37.76 | 45.41 |
| MBPP (0-shot) | 73.81 | 75.66 | 77.51 | 73.81 | 73.28 | 84.92 | 76.19 |
| EvalPlus Avg (0-shot) | 63.96 | 67.95 | 59.54 | 61.20 | 53.48 | 73.41 | 66.45 |
| Avg General | 62.39 | 66.67 | 67.86 | 65.04 | 60.98 | 69.22 | 70.75 |
| Avg All | 61.88 | 65.61 | 66.24 | 65.39 | 61.32 | 69.20 | 69.99 |
Avg General: average of knowledge, reasoning, and commonsense benchmarks (MMLU, MMLU-Pro, AGIEval-en, BBH, HellaSwag, Winogrande, PIQA, OpenBookQA, ARC-C).
Avg All: average of all benchmarks above, including math, STEM, and code (+ GSM8K, MATH, GPQA-diamond, MBPP, EvalPlus Avg).
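As a quick check of the two definitions above, the following snippet recomputes both averages for the OpenSeek-Mid-v1 column as unweighted means. The scores are copied from the results table, and the variable names are illustrative.

```python
# OpenSeek-Mid-v1 scores copied from the results table.
GENERAL = {
    "MMLU": 79.31,
    "MMLU-Pro": 66.57,
    "AGIEval-en": 52.18,
    "BBH": 82.55,
    "HellaSwag": 81.81,
    "Winogrande": 79.24,
    "PIQA": 83.19,
    "OpenBookQA": 49.80,
    "ARC-C": 62.12,
}
MATH_STEM_CODE = {
    "GSM8K": 89.16,
    "MATH": 65.88,
    "GPQA-diamond": 45.41,
    "MBPP": 76.19,
    "EvalPlus Avg": 66.45,
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

avg_general = round(mean(GENERAL.values()), 2)
avg_all = round(mean(list(GENERAL.values()) + list(MATH_STEM_CODE.values())), 2)

print(avg_general)  # 70.75
print(avg_all)      # 69.99
```

Both values agree with the Avg General and Avg All entries reported for OpenSeek-Mid-v1.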
Citation
If you find this work useful, please cite:
@misc{openseek-mid-v1,
  title={OpenSeek-Mid-v1: Efficient Language Model Scaling via Seed Model Expansion},
  year={2026},
  note={Technical report coming soon}
}
Acknowledgements
This project was built using open-source data and tools, including Nemotron-CC-v2, Stack-Edu, Dolmino, CCI, OpenMathInstruct, OpenCodeReasoning, and FlagOS.