ModelHub XC db932fe0b1 Initialize project; model provided by the ModelHub XC community

Model: BAAI/OpenSeek-Mid-v1
Source: Original Platform

2026-05-19 20:10:08 +08:00

Field	Value
extra_gated_prompt	You agree to not use the dataset to conduct experiments that cause harm to human subjects.
extra_gated_fields	Company/Organization: text; Country: country
pipeline_tag	text-generation
library_name	transformers

OpenSeek-Mid-v1

OpenSeek-Mid-v1 is a 10.61-billion-parameter language model grown from Qwen3-4B-Base through a two-stage model expansion pipeline and trained on only 2 trillion tokens of fully open-source data.

Despite having roughly 25% fewer parameters and using about 18x fewer training tokens (2T vs. ~36T), OpenSeek-Mid-v1 matches or surpasses Qwen3-14B-Base on several benchmarks, including MMLU-Pro, BBH, MATH, and GPQA-diamond.
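Since the card metadata lists `transformers` as the library and `text-generation` as the pipeline, loading the model might look like the sketch below. It assumes the repository ID `BAAI/OpenSeek-Mid-v1` from the page header resolves on your hub, that gated access has already been granted, and that the checkpoint loads through the standard auto classes without custom code. None of this is confirmed by the card itself. The helper name `generate` is illustrative.

```python
MODEL_ID = "BAAI/OpenSeek-Mid-v1"  # repository ID as shown in this card's header


def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy-decode a continuation of `prompt` with the base model."""
    # Imported lazily so that importing this module does not require the
    # heavy dependency or trigger a download.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the weights on first use):
# print(generate("The three primary colors are"))
```

This is a base model, so plain text continuation is the natural way to prompt it. Loading in default precision needs a lot of memory for a 10.61B-parameter model; a reduced-precision load is probably what you want in practice.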

Highlights

Model Growth, Not From-Scratch Training: Grown from Qwen3-4B via width expansion + partial depth stacking, inheriting the seed model's learned representations.
Extreme Data Efficiency: Matches Qwen3-14B-Base (~36T tokens) with only 2T tokens of training, an 18x reduction in training data.
Muon Optimizer: Spectral whitening ensures expanded dimensions are effectively utilized, delivering significant gains over AdamW in the model growth setting.
Fully Open-Source Data: All training data comes from publicly available datasets (NemotronCC-v2, Stack-Edu, Dolmino, CCI, etc.).
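The "spectral whitening" mentioned for Muon presumably refers to orthogonalizing the momentum matrix so that every singular direction receives an update of similar magnitude. The sketch below illustrates that idea with the classic cubic Newton-Schulz iteration on a toy 2x2 matrix. This is a swapped-in textbook variant chosen for clarity: Muon implementations typically use a tuned polynomial with fewer steps, and the card does not say which variant was used in training. All function names are illustrative.

```python
import math

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for tiny demo matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def orthogonalize(update: Matrix, steps: int = 15) -> Matrix:
    """Push every singular value of a nonzero matrix toward 1."""
    # Dividing by the Frobenius norm puts all singular values in (0, 1],
    # which lies inside the convergence region of the cubic iteration.
    norm = math.sqrt(sum(v * v for row in update for v in row))
    x = [[v / norm for v in row] for row in update]
    for _ in range(steps):
        # X <- 1.5 * X - 0.5 * (X X^T X)
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxtx)]
    return x


momentum = [[2.0, 1.0], [0.0, 1.0]]  # toy stand-in for a momentum matrix
whitened = orthogonalize(momentum)
gram = matmul(whitened, transpose(whitened))  # should be close to the identity
```

After the iteration, `whitened` times its transpose is approximately the identity, meaning the large and small singular directions of the original matrix now contribute equally.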

Architecture

Specification	Value
Parameters	10.61B
Layers	56
Hidden Size (d_model)	2560
FFN Intermediate Size (d_FFN)	19456
Attention Heads	32
KV Heads	8
Sequence Length	8192
Vocabulary Size	same as Qwen3-4B

Growth Pipeline

Qwen3-4B (4.02B, 36L)
    │  Width expansion (d_FFN: 9728 → 19456, SNR=10dB)
    ▼
Width-Expanded (7.10B, 36L)
    │  Partial depth stacking (layers 14–34 × 2)
    ▼
OpenSeek-Mid-v1 (10.61B, 56L)
    │  Continual pretraining with Muon (2T tokens)
    ▼
Final Model
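The parameter counts in the pipeline can be approximately reproduced from the architecture table. The sketch below makes several assumptions that the card does not state: a head dimension of 128 and a vocabulary of 151,936 (values from Qwen3-4B's published config), a three-matrix gated FFN, tied embeddings in the seed model and untied embeddings after expansion (an inference that makes the totals line up), and "layers 14–34 × 2" read as 20 duplicated layers so that 36 + 20 = 56. Normalization weights are ignored.

```python
def decoder_params(n_layers, d_model, d_ffn, n_heads, n_kv_heads, head_dim, vocab, tied):
    """Count weight-matrix parameters of a Qwen3-style decoder (norms ignored)."""
    # Q and O projections use n_heads; K and V projections use n_kv_heads.
    attn = d_model * head_dim * (2 * n_heads + 2 * n_kv_heads)
    mlp = 3 * d_model * d_ffn  # gate, up, and down projections
    embed = vocab * d_model * (1 if tied else 2)  # input embedding (+ separate LM head)
    return n_layers * (attn + mlp) + embed


# d_model, head counts, FFN sizes, and layer counts come from the card.
# head_dim and vocab are assumptions taken from Qwen3-4B's config.
SHARED = dict(d_model=2560, n_heads=32, n_kv_heads=8, head_dim=128, vocab=151_936)

seed = decoder_params(n_layers=36, d_ffn=9728, tied=True, **SHARED)
wide = decoder_params(n_layers=36, d_ffn=19456, tied=False, **SHARED)
# Assumption: 20 layers are duplicated, giving 36 + 20 = 56 layers.
final = decoder_params(n_layers=36 + 20, d_ffn=19456, tied=False, **SHARED)

for name, count in [("seed", seed), ("width-expanded", wide), ("final", final)]:
    print(f"{name}: {count / 1e9:.2f}B")
```

Under these assumptions the three counts come out to 4.02B, 7.10B, and 10.61B, matching the figures in the diagram.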

Training

Training Configuration

Parameter	Value
Optimizer	Muon
Sequence Length	8192
Global Batch Size	2048 sequences
Peak Learning Rate	1e-4
LR Schedule	Cosine with linear warmup
Warmup Steps	1000
Weight Decay	0.1
Training Framework	FlagScale (FlagOS)
Total Training Tokens	~2.06T
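A cosine schedule with linear warmup, as listed above, can be sketched as follows. The peak learning rate and warmup length come from the table; the total step count and the final learning rate are not given in the card, so `total_steps` and `min_lr` are hypothetical parameters here.

```python
import math

PEAK_LR = 1e-4       # from the training configuration
WARMUP_STEPS = 1000  # from the training configuration


def learning_rate(step: int, total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to `min_lr` at `total_steps`.

    `total_steps` and `min_lr` are assumptions; the card does not list them.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(progress, 1.0)  # hold at min_lr past the end of the schedule
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For example, `learning_rate(0, 100_000)` is a thousandth of the peak, the peak is reached at the end of warmup, and the rate decays smoothly to `min_lr` at `total_steps`.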

Stage 1: Broad Knowledge Acquisition (1.36T tokens)

Stage 1 Data Mixture

Category	Proportion	Tokens (B)
Web	42%	~571B
Math	20%	~272B
Code	20%	~272B
STEM	15%	~204B
Multilingual	3%	~41B

Stage 2: Capability Specialization (0.70T tokens)

Stage 2 Data Mixture

Category	Proportion	Tokens (B)	Delta vs. Stage 1 (percentage points)
Web	35%	~245B	-7
Math	20%	~140B	0
Code	24%	~168B	+4
STEM	18%	~126B	+3
Multilingual	3%	~21B	0
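The per-category token counts in both mixture tables follow directly from the stage totals (1.36T and 0.70T) and the category proportions. A short check, using only numbers from the tables:

```python
STAGE_TOKENS_B = {"stage1": 1360, "stage2": 700}  # 1.36T and 0.70T, in billions

MIXTURE_PCT = {
    "stage1": {"Web": 42, "Math": 20, "Code": 20, "STEM": 15, "Multilingual": 3},
    "stage2": {"Web": 35, "Math": 20, "Code": 24, "STEM": 18, "Multilingual": 3},
}


def token_budget(stage: str) -> dict[str, float]:
    """Tokens (in billions) assigned to each category in a stage."""
    total = STAGE_TOKENS_B[stage]
    return {cat: total * pct / 100 for cat, pct in MIXTURE_PCT[stage].items()}


for stage, mixture in MIXTURE_PCT.items():
    assert sum(mixture.values()) == 100  # proportions cover the whole stage
    print(stage, {cat: round(b) for cat, b in token_budget(stage).items()})

# Shift in proportion from Stage 1 to Stage 2, in percentage points.
delta = {cat: MIXTURE_PCT["stage2"][cat] - MIXTURE_PCT["stage1"][cat]
         for cat in MIXTURE_PCT["stage1"]}
```

The shift from Stage 1 to Stage 2 moves 7 percentage points out of Web and into Code (+4) and STEM (+3), leaving Math and Multilingual unchanged.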

Detailed Dataset Composition

Stage 1 (%) and Stage 2 (%) denote each dataset's sampling weight within the respective stage. "—" indicates the dataset is not used in that stage.

Web

Dataset	Tokens (B)	Stage 1 (%)	Stage 2 (%)
Nemotron-CC-v2-HQ-Syn	798.41	23.24	19.36
Nemotron-CC-v2-Diverse-QA (×5 shards)	340.81	9.92	8.26
Nemotron-CC-v2-HQ (×5 shards)	303.82	8.84	7.36
dolmino-mix-1124-wiki	3.82	0.15	0.18
dolmino-mix-1124-stackexchange	1.30	0.05	0.06

Math

Dataset	Tokens (B)	Stage 1 (%)	Stage 2 (%)
Nemotron-SFT-MATH	207.46	11.70	11.70
Nemotron-CC-Math-v1-4plus-MIND	74.34	4.19	4.19
Nemotron-CC-Math-v1-4plus	53.37	3.01	3.01
Dolmino-math	11.17	0.63	0.63
OpenMathInstruct-2	5.30	0.30	0.30
OpenMathReasoning-4k	2.48	0.14	0.14
NuminaMath-1.5	0.38	0.02	0.02

Code

Dataset	Tokens (B)	Stage 1 (%)	Stage 2 (%)
Nemotron-Pretraining-Code-v1-Syn	171.53	9.05	10.86
Nemotron-SFT-Code	57.47	3.03	3.64
stack-edu-Java	31.70	1.06	1.27
stack-edu-Markdown	26.64	0.38	0.45
stack-edu-Python	18.27	1.54	1.85
stack-edu-Cpp	12.62	1.11	1.33
stack-edu-JavaScript	8.99	1.00	1.20
stack-edu-SQL	8.23	0.37	0.44
github-issue	8.46	0.25	0.30
stack-edu-PHP	7.43	0.25	0.30
stack-edu-CSharp	7.26	0.37	0.44
stack-edu-C	4.80	0.43	0.52
stack-edu-Shell	2.60	0.01	0.01
stack-edu-TypeScript	2.51	0.18	0.22
OpenCodeInstruct	1.59	—	0.10
stack-edu-Swift	1.53	0.06	0.07
stack-edu-Rust	1.45	0.05	0.06
stack-edu-Go	1.42	0.03	0.04
kaggle-notebooks	1.42	0.65	0.78
stack-edu-Ruby	1.36	0.01	0.01
OpenCodeReasoning-2-cpp-4k	0.76	0.04	0.05
OpenCodeReasoning-2-python-4k	0.58	0.03	0.04
github-code-review	0.32	—	0.02

STEM & Science

Dataset	Tokens (B)	Stage 1 (%)	Stage 2 (%)
Nemotron-Pretraining-Specialized-v1 (×4 shards)	276.83	10.55	12.73
Nemotron-Pretraining-SFT-v1-General	86.93	3.31	4.00
dolmino-mix-1124-pes2o	60.19	0.50	0.50
Nemotron-Pretraining-Specialized-v1.1	9.04	—	0.42
OpenScienceReasoning-2-4k	1.72	0.07	0.08
MegaScience	0.98	0.04	0.04

Multilingual

Dataset	Tokens (B)	Stage 1 (%)	Stage 2 (%)
Nemotron-CC-v2-Translated-Diverse-QA	135.80	1.74	1.74
CCI4_0-Zh-High	98.76	1.26	1.26
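The sampling weights in the tables above can be combined with the stage totals to estimate how many times each dataset is seen during training. This sketch assumes that each weight is a fraction of the whole stage's tokens (the per-category sums are consistent with that reading) and that "Tokens (B)" is the dataset's full size; both are interpretations rather than statements from the card.

```python
STAGE1_TOKENS_B = 1360.0  # Stage 1 total, in billions
STAGE2_TOKENS_B = 700.0   # Stage 2 total, in billions


def epochs_seen(dataset_tokens_b: float, stage1_pct: float, stage2_pct: float) -> float:
    """Approximate passes over a dataset across both stages.

    Assumes each weight is a percentage of the stage's total token budget.
    """
    drawn_b = STAGE1_TOKENS_B * stage1_pct / 100 + STAGE2_TOKENS_B * stage2_pct / 100
    return drawn_b / dataset_tokens_b


# Rows copied from the tables: (size in B tokens, Stage 1 %, Stage 2 %).
hq_syn = epochs_seen(798.41, 23.24, 19.36)    # Nemotron-CC-v2-HQ-Syn
sft_math = epochs_seen(207.46, 11.70, 11.70)  # Nemotron-SFT-MATH
```

Under these assumptions the largest web dataset is seen for roughly 0.57 of an epoch, while Nemotron-SFT-MATH is seen about 1.16 times, which suggests smaller high-value sets are upsampled relative to their size.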

Checkpoint Merging

The final model is a weighted average of 5 complementary checkpoints, each selected for a unique strength:

Checkpoint	Weight	Role	Key Metric
iter 169984	0.30	Code anchor	MBPP 78.84
iter 219136	0.25	Reasoning lead	GPQA-d 44.39
iter 174080	0.15	Code peak	EvalPlus 68.88
iter 190464	0.15	Math bridge	GPQA-d 42.86
iter 217088	0.15	General boost	BBH 82.84
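A weighted average of checkpoints, as described above, can be sketched as a key-by-key weighted sum over state dictionaries. The weights below are taken from the table; the function name and the toy one-parameter "checkpoints" are illustrative, and the card does not describe the actual merging code.

```python
# Merge weights from the table, keyed by training iteration.
MERGE_WEIGHTS = {
    169984: 0.30,
    219136: 0.25,
    174080: 0.15,
    190464: 0.15,
    217088: 0.15,
}


def merge_checkpoints(state_dicts, weights):
    """Return the weighted average of several checkpoints, key by key."""
    if len(state_dicts) != len(weights):
        raise ValueError("need exactly one weight per checkpoint")
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("weights should sum to 1")
    merged = {}
    for key in state_dicts[0]:
        # Works for any value type that supports * and + (floats, arrays, tensors).
        merged[key] = sum(sd[key] * w for sd, w in zip(state_dicts, weights))
    return merged


# Toy stand-ins: five "checkpoints" with a single scalar parameter each.
toy_checkpoints = [{"w": float(i)} for i in range(1, 6)]
merged = merge_checkpoints(toy_checkpoints, list(MERGE_WEIGHTS.values()))
```

With real checkpoints, each dictionary would map parameter names to tensors of identical shape, and the same function applies unchanged.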

Evaluation Results

All evaluations were conducted with lm-eval-harness using the same settings for every model.

Benchmark	Qwen3-4B	Qwen3-8B	Qwen3.5-9B	Nemotron-12B	Gemma3-12B	Qwen3-14B	OpenSeek-Mid-v1
Training tokens	36T	36T	36T	20T	12T	36T	2T
MMLU (5-shot)	72.72	76.57	78.64	78.07	73.28	80.57	79.31
MMLU-Pro (5-shot CoT)	49.31	52.35	58.48	57.57	41.16	56.00	66.57
AGIEval-en (0-shot)	45.92	49.09	45.15	49.20	44.89	52.83	52.18
BBH (3-shot CoT)	71.20	77.75	82.23	69.65	73.78	78.71	82.55
HellaSwag (5-shot)	75.36	79.47	81.04	83.13	83.45	82.05	81.81
Winogrande (5-shot)	71.90	77.51	76.80	79.24	80.35	79.40	79.24
PIQA (5-shot)	78.89	81.39	81.61	82.97	81.80	83.30	83.19
OpenBookQA (5-shot)	45.00	49.00	50.00	50.20	49.60	50.80	49.80
ARC-C (0-shot)	51.19	56.91	56.83	60.58	64.68	59.30	62.12
GSM8K (4-shot CoT)	84.31	86.73	85.60	81.43	72.02	90.07	89.16
MATH (4-shot CoT)	50.16	52.48	56.16	57.30	43.30	59.70	65.88
GPQA-diamond (3-shot CoT)	32.65	35.71	37.76	31.12	23.47	37.76	45.41
MBPP (0-shot)	73.81	75.66	77.51	73.81	73.28	84.92	76.19
EvalPlus Avg (0-shot)	63.96	67.95	59.54	61.20	53.48	73.41	66.45

Avg General	62.39	66.67	67.86	65.04	60.98	69.22	70.75
Avg All	61.88	65.61	66.24	65.39	61.32	69.20	69.99

Avg General: average of knowledge, reasoning, and commonsense benchmarks (MMLU, MMLU-Pro, AGIEval-en, BBH, HellaSwag, Winogrande, PIQA, OpenBookQA, ARC-C).
Avg All: average of all benchmarks above, including math, STEM, and code (+ GSM8K, MATH, GPQA-diamond, MBPP, EvalPlus Avg).
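The two summary rows can be recomputed from the individual scores. The check below uses the OpenSeek-Mid-v1 column and the groupings defined above:

```python
from statistics import fmean

# OpenSeek-Mid-v1 scores copied from the results table.
GENERAL = {
    "MMLU": 79.31, "MMLU-Pro": 66.57, "AGIEval-en": 52.18, "BBH": 82.55,
    "HellaSwag": 81.81, "Winogrande": 79.24, "PIQA": 83.19,
    "OpenBookQA": 49.80, "ARC-C": 62.12,
}
MATH_STEM_CODE = {
    "GSM8K": 89.16, "MATH": 65.88, "GPQA-diamond": 45.41,
    "MBPP": 76.19, "EvalPlus Avg": 66.45,
}

# Avg General: unweighted mean of the nine general benchmarks.
avg_general = fmean(GENERAL.values())
# Avg All: unweighted mean of all fourteen benchmarks.
avg_all = fmean([*GENERAL.values(), *MATH_STEM_CODE.values()])

print(f"Avg General: {avg_general:.2f}")
print(f"Avg All: {avg_all:.2f}")
```

Both values agree with the table (70.75 and 69.99), confirming that the averages are unweighted means over benchmarks.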

Citation

If you find this work useful, please cite:

@misc{openseek-mid-v1,
  title={OpenSeek-Mid-v1: Efficient Language Model Scaling via Seed Model Expansion},
  year={2026},
  note={Technical report coming soon}
}

Acknowledgements

This project was built using open-source data and tools, including NemotronCC-v2, Stack-Edu, Dolmino, CCI, OpenMathInstruct, OpenCodeReasoning, and FlagOS.
