ModelHub XC db932fe0b1 Initialize project; model provided by the ModelHub XC community
Model: BAAI/OpenSeek-Mid-v1
Source: Original Platform
2026-05-19 20:10:08 +08:00

---
extra_gated_prompt: >-
  You agree to not use the dataset to conduct experiments that cause harm to
  human subjects.
extra_gated_fields:
  Company/Organization: text
  Country: country
pipeline_tag: text-generation
library_name: transformers
license: apache-2.0
---
# OpenSeek-Mid-v1
**OpenSeek-Mid-v1** is a 10.61-billion-parameter language model grown from [Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base) through a two-stage model expansion pipeline and trained on only **2 trillion tokens** of fully open-source data.

Despite having **25% fewer parameters** and using **18x less training data**, OpenSeek-Mid-v1 matches or surpasses Qwen3-14B-Base across multiple benchmarks.

<img src="https://cdn-uploads.huggingface.co/production/uploads/642ee226a7e765fff0bf00ac/VcTNOdzlJK1tw5PjgeSSi.png" width="90%" alt="results_all">

---
## Highlights
- **Model Growth, Not From-Scratch Training**: Grown from Qwen3-4B-Base via width expansion and partial depth stacking, so it inherits the seed model's learned representations.
- **Extreme Data Efficiency**: Matches Qwen3-14B-Base (~36T tokens) after only 2T training tokens, an 18x reduction in training data.
- **Muon Optimizer**: Muon's spectral whitening ensures that the expanded dimensions are effectively utilized, delivering significant gains over AdamW in the model growth setting.
- **Fully Open-Source Data**: All training data comes from publicly available datasets (Nemotron-CC-v2, Stack-Edu, Dolmino, CCI, etc.).
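
To make the "spectral whitening" mentioned in the Muon bullet concrete, here is a minimal pure-Python sketch of the Newton-Schulz orthogonalization step from the publicly described Muon optimizer. It is an illustration only: the coefficients and the five-step default follow the public Muon reference implementation, the helper names (`matmul`, `transpose`, `newton_schulz`) are hypothetical, and nothing here is taken from the OpenSeek or FlagScale training code. The toy matrix has singular values 3 and 0.5; after the iteration both end up close to 1, so strong and weak update directions get comparable magnitude.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic coefficients from public Muon code
    fro = sum(x * x for row in G for x in row) ** 0.5
    X = [[x / (fro + eps) for x in row] for row in G]  # singular values now <= 1
    tall = len(X) > len(X[0])
    if tall:  # iterate on the wide orientation so X @ X^T is the smaller product
        X = transpose(X)
    n, m = len(X), len(X[0])
    for _ in range(steps):
        A = matmul(X, transpose(X))
        AA = matmul(A, A)
        B = [[b * A[i][j] + c * AA[i][j] for j in range(n)] for i in range(n)]
        BX = matmul(B, X)
        X = [[a * X[i][j] + BX[i][j] for j in range(m)] for i in range(n)]
    return transpose(X) if tall else X


# Toy "gradient" with singular values 3.0 and 0.5 (a 6x spread).
grad = [[3.0, 0.0, 0.0],
        [0.0, 0.5, 0.0]]
whitened = newton_schulz(grad)
print(whitened[0][0], whitened[1][1])  # both values are now of order 1
```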
---
## Architecture
| Specification | Value |
|---|---|
| Parameters | 10.61B |
| Layers | 56 |
| Hidden Size (d_model) | 2560 |
| FFN Intermediate Size (d_FFN) | 19456 |
| Attention Heads | 32 |
| KV Heads | 8 |
| Sequence Length | 8192 |
| Vocabulary Size | same as Qwen3-4B |
### Growth Pipeline
```
Qwen3-4B (4.02B, 36L)
│ Width expansion (d_FFN: 9728 → 19456, SNR=10dB)
Width-Expanded (7.10B, 36L)
│ Partial depth stacking (layers 14-34 × 2)
OpenSeek-Mid-v1 (10.61B, 56L)
│ Continual pretraining with Muon (2T tokens)
Final Model
```
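
The parameter counts in the growth pipeline can be checked with a small worked calculation. This is a sketch under stated assumptions, not the authors' accounting: `head_dim=128`, `vocab=151_936`, and tied input/output embeddings are taken from the public Qwen3-4B configuration, and the switch to untied embeddings after expansion is a guess made only because it reproduces the 7.10B and 10.61B figures. The function name `qwen3_style_params` is hypothetical.

```python
def qwen3_style_params(layers, d_model, d_ffn, n_heads=32, n_kv_heads=8,
                       head_dim=128, vocab=151_936, tied_embeddings=True):
    """Count parameters of a Qwen3-style decoder (no biases, RMSNorm, gated FFN).

    head_dim, vocab and tied_embeddings are assumptions, not values from the README.
    """
    attn = 2 * d_model * n_heads * head_dim       # q_proj and o_proj
    attn += 2 * d_model * n_kv_heads * head_dim   # k_proj and v_proj
    ffn = 3 * d_model * d_ffn                     # gate, up and down projections
    norms = 2 * d_model + 2 * head_dim            # two RMSNorms plus q/k norms
    embed = vocab * d_model
    total = layers * (attn + ffn + norms) + d_model  # + final RMSNorm
    total += embed if tied_embeddings else 2 * embed
    return total


seed = qwen3_style_params(layers=36, d_model=2560, d_ffn=9728)
wide = qwen3_style_params(layers=36, d_model=2560, d_ffn=19456, tied_embeddings=False)
final = qwen3_style_params(layers=56, d_model=2560, d_ffn=19456, tied_embeddings=False)

for name, n in [("Qwen3-4B", seed), ("Width-Expanded", wide), ("OpenSeek-Mid-v1", final)]:
    print(f"{name}: {n / 1e9:.2f}B")
```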
---
## Training
### Training Configuration
| Parameter | Value |
|---|---|
| Optimizer | Muon |
| Sequence Length | 8192 |
| Global Batch Size | 2048 sequences |
| Peak Learning Rate | 1e-4 |
| LR Schedule | Cosine with linear warmup |
| Warmup Steps | 1000 |
| Weight Decay | 0.1 |
| Training Framework | FlagScale (FlagOS) |
| Total Training Tokens | ~2.06T |
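
As a rough illustration of the schedule in the table, the sketch below computes the tokens consumed per optimizer step and a cosine schedule with linear warmup. The table does not state the total number of steps or the final learning rate, so `total_steps` and `min_lr` are placeholders, and the function name `learning_rate` is hypothetical; this is not the FlagScale scheduler.

```python
import math

SEQ_LEN = 8192
GLOBAL_BATCH = 2048
TOKENS_PER_STEP = SEQ_LEN * GLOBAL_BATCH  # 16,777,216 tokens per optimizer step


def learning_rate(step, total_steps, peak_lr=1e-4, warmup_steps=1000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (min_lr is assumed)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# total_steps=100_000 is an illustrative placeholder, not a value from the table.
for step in (0, 999, 50_500, 100_000):
    print(step, learning_rate(step, total_steps=100_000))
```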
### Stage 1: Broad Knowledge Acquisition (1.36T tokens)
#### Stage 1 Data Mixture
| Category | Proportion | Tokens |
|---|---|---|
| Web | 42% | ~571B |
| Math | 20% | ~272B |
| Code | 20% | ~272B |
| STEM | 15% | ~204B |
| Multilingual | 3% | ~41B |
---
### Stage 2: Capability Specialization (0.70T tokens)
#### Stage 2 Data Mixture
| Category | Proportion | Tokens | Delta vs. Stage 1 |
|---|---|---|---|
| Web | 35% | ~245B | -7% |
| Math | 20% | ~140B | — |
| Code | 24% | ~168B | +4% |
| STEM | 18% | ~126B | +3% |
| Multilingual | 3% | ~21B | — |
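
The token counts in both mixture tables follow directly from the stage budgets (1.36T and 0.70T tokens) and the category proportions. The short sketch below redoes that arithmetic; the names `STAGE_TOKENS_B`, `MIX_PCT`, and `stage_budget` are illustrative only, and the deltas are differences in percentage points.

```python
STAGE_TOKENS_B = {"Stage 1": 1360, "Stage 2": 700}  # stage budgets in billions of tokens

# category -> (Stage 1 %, Stage 2 %), copied from the two mixture tables
MIX_PCT = {
    "Web": (42, 35),
    "Math": (20, 20),
    "Code": (20, 24),
    "STEM": (15, 18),
    "Multilingual": (3, 3),
}


def stage_budget(category):
    """Return (Stage 1 tokens in B, Stage 2 tokens in B, delta in percentage points)."""
    s1, s2 = MIX_PCT[category]
    return (s1 * STAGE_TOKENS_B["Stage 1"] / 100,
            s2 * STAGE_TOKENS_B["Stage 2"] / 100,
            s2 - s1)


for category in MIX_PCT:
    t1, t2, delta = stage_budget(category)
    print(f"{category:<13} {t1:7.1f}B {t2:7.1f}B {delta:+d} pp")
```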
---
### Detailed Dataset Composition
Stage 1 (%) and Stage 2 (%) denote each dataset's sampling weight within the respective stage. "—" indicates the dataset is not used in that stage.

**Web**

| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-CC-v2-HQ-Syn | 798.41 | 23.24 | 19.36 |
| Nemotron-CC-v2-Diverse-QA (×5 shards) | 340.81 | 9.92 | 8.26 |
| Nemotron-CC-v2-HQ (×5 shards) | 303.82 | 8.84 | 7.36 |
| dolmino-mix-1124-wiki | 3.82 | 0.15 | 0.18 |
| dolmino-mix-1124-stackexchange | 1.30 | 0.05 | 0.06 |

**Math**

| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-SFT-MATH | 207.46 | 11.70 | 11.70 |
| Nemotron-CC-Math-v1-4plus-MIND | 74.34 | 4.19 | 4.19 |
| Nemotron-CC-Math-v1-4plus | 53.37 | 3.01 | 3.01 |
| Dolmino-math | 11.17 | 0.63 | 0.63 |
| OpenMathInstruct-2 | 5.30 | 0.30 | 0.30 |
| OpenMathReasoning-4k | 2.48 | 0.14 | 0.14 |
| NuminaMath-1.5 | 0.38 | 0.02 | 0.02 |

**Code**

| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-Pretraining-Code-v1-Syn | 171.53 | 9.05 | 10.86 |
| Nemotron-SFT-Code | 57.47 | 3.03 | 3.64 |
| stack-edu-Java | 31.70 | 1.06 | 1.27 |
| stack-edu-Markdown | 26.64 | 0.38 | 0.45 |
| stack-edu-Python | 18.27 | 1.54 | 1.85 |
| stack-edu-Cpp | 12.62 | 1.11 | 1.33 |
| stack-edu-JavaScript | 8.99 | 1.00 | 1.20 |
| stack-edu-SQL | 8.23 | 0.37 | 0.44 |
| github-issue | 8.46 | 0.25 | 0.30 |
| stack-edu-PHP | 7.43 | 0.25 | 0.30 |
| stack-edu-CSharp | 7.26 | 0.37 | 0.44 |
| stack-edu-C | 4.80 | 0.43 | 0.52 |
| stack-edu-Shell | 2.60 | 0.01 | 0.01 |
| stack-edu-TypeScript | 2.51 | 0.18 | 0.22 |
| OpenCodeInstruct | 1.59 | — | 0.10 |
| stack-edu-Swift | 1.53 | 0.06 | 0.07 |
| stack-edu-Rust | 1.45 | 0.05 | 0.06 |
| stack-edu-Go | 1.42 | 0.03 | 0.04 |
| kaggle-notebooks | 1.42 | 0.65 | 0.78 |
| stack-edu-Ruby | 1.36 | 0.01 | 0.01 |
| OpenCodeReasoning-2-cpp-4k | 0.76 | 0.04 | 0.05 |
| OpenCodeReasoning-2-python-4k | 0.58 | 0.03 | 0.04 |
| github-code-review | 0.32 | — | 0.02 |

**STEM & Science**

| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-Pretraining-Specialized-v1 (×4 shards) | 276.83 | 10.55 | 12.73 |
| Nemotron-Pretraining-SFT-v1-General | 86.93 | 3.31 | 4.00 |
| dolmino-mix-1124-pes2o | 60.19 | 0.50 | 0.50 |
| Nemotron-Pretraining-Specialized-v1.1 | 9.04 | — | 0.42 |
| OpenScienceReasoning-2-4k | 1.72 | 0.07 | 0.08 |
| MegaScience | 0.98 | 0.04 | 0.04 |

**Multilingual**

| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-CC-v2-Translated-Diverse-QA | 135.80 | 1.74 | 1.74 |
| CCI4_0-Zh-High | 98.76 | 1.26 | 1.26 |
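
One way to read the sampling weights is as an approximate number of passes over each dataset. The sketch below assumes that "Tokens (B)" is the dataset's total size and that a stage draws `weight × stage budget` tokens from it; both are interpretations of the tables rather than statements from the authors, and the function name `epochs` is hypothetical. Under those assumptions a small dataset such as kaggle-notebooks is repeated several times, while Nemotron-CC-v2-HQ-Syn is only partially seen.

```python
STAGE_TOKENS_B = (1360, 700)  # Stage 1 and Stage 2 budgets in billions of tokens


def epochs(dataset_tokens_b, stage1_pct, stage2_pct):
    """Approximate passes over a dataset in each stage (assumed interpretation)."""
    return tuple(pct / 100 * budget / dataset_tokens_b
                 for pct, budget in zip((stage1_pct, stage2_pct), STAGE_TOKENS_B))


# (Tokens (B), Stage 1 %, Stage 2 %) copied from the tables above
examples = {
    "Nemotron-CC-v2-HQ-Syn": (798.41, 23.24, 19.36),
    "stack-edu-Python": (18.27, 1.54, 1.85),
    "kaggle-notebooks": (1.42, 0.65, 0.78),
}

for name, row in examples.items():
    e1, e2 = epochs(*row)
    print(f"{name}: {e1:.2f} + {e2:.2f} = {e1 + e2:.2f} passes")
```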
---
### Checkpoint Merging
The final model is a weighted average of 5 complementary checkpoints, each selected for a unique strength:
| Checkpoint | Weight | Role | Key Metric |
|---|---|---|---|
| iter 169984 | 0.30 | Code anchor | MBPP **78.84** |
| iter 219136 | 0.25 | Reasoning lead | GPQA-d **44.39** |
| iter 174080 | 0.15 | Code peak | EvalPlus **68.88** |
| iter 190464 | 0.15 | Math bridge | GPQA-d **42.86** |
| iter 217088 | 0.15 | General boost | BBH **82.84** |
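
A weighted checkpoint average of this kind can be sketched in a few lines. The version below works on plain Python lists so that it runs without any framework; a real merge would apply the same per-parameter arithmetic to tensors in each checkpoint's state dict. The function name `merge_checkpoints` and the toy parameter `"w"` are hypothetical, and only the iteration numbers and weights come from the table.

```python
import math

# iteration -> merge weight, copied from the table above
MERGE_WEIGHTS = {169984: 0.30, 219136: 0.25, 174080: 0.15, 190464: 0.15, 217088: 0.15}


def merge_checkpoints(state_dicts, weights):
    """Weighted average of parameters; state_dicts maps iteration -> {name: [floats]}."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("merge weights must sum to 1")
    merged = {}
    for iteration, weight in weights.items():
        for name, values in state_dicts[iteration].items():
            acc = merged.setdefault(name, [0.0] * len(values))
            for i, value in enumerate(values):
                acc[i] += weight * value
    return merged


# Toy checkpoints: one parameter "w" whose value is 1.0, 2.0, ... per checkpoint.
toy = {it: {"w": [float(i + 1)]} for i, it in enumerate(MERGE_WEIGHTS)}
print(merge_checkpoints(toy, MERGE_WEIGHTS))
```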
---
## Evaluation Results
All evaluations were conducted via `lm-eval-harness` with consistent settings. In each row, **bold** marks the best score and <u>underline</u> the second best.

| Benchmark | Qwen3-4B | Qwen3-8B | Qwen3.5-9B | Nemotron-12B | Gemma3-12B | Qwen3-14B | **OpenSeek-Mid-v1** |
|---|---|---|---|---|---|---|---|
| *Training tokens* | *36T* | *36T* | *36T* | *20T* | *12T* | *36T* | ***2T*** |
| MMLU (5-shot) | 72.72 | 76.57 | 78.64 | 78.07 | 73.28 | **80.57** | <u>79.31</u> |
| MMLU-Pro (5-shot CoT) | 49.31 | 52.35 | <u>58.48</u> | 57.57 | 41.16 | 56.00 | **66.57** |
| AGIEval-en (0-shot) | 45.92 | 49.09 | 45.15 | 49.20 | 44.89 | **52.83** | <u>52.18</u> |
| BBH (3-shot CoT) | 71.20 | 77.75 | <u>82.23</u> | 69.65 | 73.78 | 78.71 | **82.55** |
| HellaSwag (5-shot) | 75.36 | 79.47 | 81.04 | <u>83.13</u> | **83.45** | 82.05 | 81.81 |
| Winogrande (5-shot) | 71.90 | 77.51 | 76.80 | 79.24 | **80.35** | <u>79.40</u> | 79.24 |
| PIQA (5-shot) | 78.89 | 81.39 | 81.61 | 82.97 | 81.80 | **83.30** | <u>83.19</u> |
| OpenBookQA (5-shot) | 45.00 | 49.00 | 50.00 | <u>50.20</u> | 49.60 | **50.80** | 49.80 |
| ARC-C (0-shot) | 51.19 | 56.91 | 56.83 | 60.58 | **64.68** | 59.30 | <u>62.12</u> |
| GSM8K (4-shot CoT) | 84.31 | 86.73 | 85.60 | 81.43 | 72.02 | **90.07** | <u>89.16</u> |
| MATH (4-shot CoT) | 50.16 | 52.48 | 56.16 | 57.30 | 43.30 | <u>59.70</u> | **65.88** |
| GPQA-diamond (3-shot CoT) | 32.65 | 35.71 | <u>37.76</u> | 31.12 | 23.47 | <u>37.76</u> | **45.41** |
| MBPP (0-shot) | 73.81 | 75.66 | <u>77.51</u> | 73.81 | 73.28 | **84.92** | 76.19 |
| EvalPlus Avg (0-shot) | 63.96 | <u>67.95</u> | 59.54 | 61.20 | 53.48 | **73.41** | 66.45 |
| | | | | | | | |
| **Avg General** | 62.39 | 66.67 | 67.86 | 65.04 | 60.98 | <u>69.22</u> | **70.75** |
| **Avg All** | 61.88 | 65.61 | 66.24 | 65.39 | 61.32 | <u>69.20</u> | **69.99** |
- **Avg General**: average of knowledge, reasoning, and commonsense benchmarks (MMLU, MMLU-Pro, AGIEval-en, BBH, HellaSwag, Winogrande, PIQA, OpenBookQA, ARC-C).
- **Avg All**: average of all benchmarks above, including math, STEM, and code (+ GSM8K, MATH, GPQA-diamond, MBPP, EvalPlus Avg).
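
The two summary rows can be recomputed from the per-benchmark scores using the definitions above. The sketch below does this for the OpenSeek-Mid-v1 and Qwen3-14B columns and reproduces the reported 70.75 / 69.99 and 69.22 / 69.20; the names `GENERAL`, `EXTRA`, `SCORES`, and `average` are illustrative only.

```python
GENERAL = ["MMLU", "MMLU-Pro", "AGIEval-en", "BBH", "HellaSwag",
           "Winogrande", "PIQA", "OpenBookQA", "ARC-C"]
EXTRA = ["GSM8K", "MATH", "GPQA-diamond", "MBPP", "EvalPlus Avg"]

# Scores copied from the evaluation table, in GENERAL + EXTRA order.
SCORES = {
    "OpenSeek-Mid-v1": dict(zip(GENERAL + EXTRA, [
        79.31, 66.57, 52.18, 82.55, 81.81, 79.24, 83.19, 49.80, 62.12,
        89.16, 65.88, 45.41, 76.19, 66.45])),
    "Qwen3-14B": dict(zip(GENERAL + EXTRA, [
        80.57, 56.00, 52.83, 78.71, 82.05, 79.40, 83.30, 50.80, 59.30,
        90.07, 59.70, 37.76, 84.92, 73.41])),
}


def average(model, benchmarks):
    """Unweighted mean over the given benchmarks, rounded to two decimals."""
    return round(sum(SCORES[model][b] for b in benchmarks) / len(benchmarks), 2)


for model in SCORES:
    print(model, average(model, GENERAL), average(model, GENERAL + EXTRA))
```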
---
## Citation
If you find this work useful, please cite:
```bibtex
@misc{openseek-mid-v1,
title={OpenSeek-Mid-v1: Efficient Language Model Scaling via Seed Model Expansion},
year={2026},
note={Technical report coming soon}
}
```
---
## Acknowledgements
This project was built using open-source data and tools, including Nemotron-CC-v2, Stack-Edu, Dolmino, CCI, OpenMathInstruct, OpenCodeReasoning, and FlagOS.