ModelHub XC fa8abef9c8 Initialize project; model provided by the ModelHub XC community

Model: codelion/SmolLM2-70M
Source: Original Platform

2026-05-17 11:19:26 +08:00

Files (each added on 2026-05-17 11:19:26 +08:00 with the commit message "Initialize project; model provided by the ModelHub XC community"):

.gitattributes
config.json
generation_config.json
merges.txt
model.safetensors
README.md
special_tokens_map.json
tokenizer_config.json
tokenizer.json
vocab.json

README.md

SmolLM2-70M

SmolLM2-70M is a 69.2M-parameter model pretrained on the Sutra-10B pedagogical dataset for 3 epochs (~30.6B tokens total). The results suggest that a model of this size can be trained to near-capacity performance using dense, curated educational data.

Model Details

Property	Value
Architecture	LlamaForCausalLM
Parameters	69.2M
Hidden Size	384
Layers	32
Attention Heads	6 (2 KV heads)
Context Length	8,192
Vocabulary	49,152
Precision	bfloat16
Base Model	SmolLM2-70M
Training Dataset	Sutra-10B (10.2B tokens)
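
As a rough cross-check of the 69.2M figure, the weights of a Llama-style decoder can be tallied from the values in the table. This is only a sketch under assumptions the card does not state: an MLP intermediate size of 1024, tied input and output embeddings, and no bias terms. The function name `llama_param_count` is illustrative.

```python
def llama_param_count(vocab, hidden, layers, heads, kv_heads,
                      intermediate, tie_embeddings=True):
    """Count weights in a Llama-style decoder, assuming no bias terms."""
    head_dim = hidden // heads
    kv_dim = kv_heads * head_dim                      # grouped-query K/V width
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # Q and O, plus K and V
    mlp = 3 * hidden * intermediate                   # gate, up and down projections
    norms = 2 * hidden                                # two RMSNorm weights per layer
    embed = vocab * hidden
    total = embed + layers * (attn + mlp + norms) + hidden  # + final norm
    if not tie_embeddings:
        total += vocab * hidden                       # separate output head
    return total


# Values from the Model Details table; intermediate=1024 is an ASSUMPTION.
total = llama_param_count(vocab=49_152, hidden=384, layers=32,
                          heads=6, kv_heads=2, intermediate=1024)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the tally comes to 69,230,976, in line with the 69.2M in the table. A different intermediate size or untied embeddings would change the result, so config.json remains the authoritative source.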

Training

The model was trained for 3 epochs on the Sutra-10B dataset using a single NVIDIA L40S GPU (46GB). This checkpoint is the best perplexity checkpoint from epoch 3.

Epoch	Tokens	Training Time	Learning Rate	Best Perplexity
1	10.2B	25.82h	3e-4 → 3e-5	39.50
2	10.2B	25.78h	1e-4 → 1e-5	37.81
3	10.2B	26.16h	3e-5 → 3e-6	37.72
Total	30.6B	77.76h	—	37.72
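
The learning-rate ranges in the table (for example 3e-4 → 3e-5 in epoch 1) are consistent with the cosine-with-warmup schedule listed in the training configuration. The sketch below shows one common form of that schedule. The warmup length and step count are hypothetical placeholders, and the function `cosine_lr_with_warmup` is illustrative rather than the training script's actual code.

```python
import math


def cosine_lr_with_warmup(step, total_steps, peak_lr, min_lr, warmup_steps):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Epoch 1 range from the table; step counts below are HYPOTHETICAL.
TOTAL_STEPS = 10_000
WARMUP_STEPS = 500
for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    lr = cosine_lr_with_warmup(step, TOTAL_STEPS, 3e-4, 3e-5, WARMUP_STEPS)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

With this form, the rate peaks at the end of warmup and lands on the minimum at the final step, which matches the "peak → peak/10" pattern reported for each epoch.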

Training configuration:

Optimizer: AdamW (fused), weight decay 0.1
Schedule: Cosine with warmup
Batch size: 4 per device, gradient accumulation 8 (effective ~262K tokens/step)
Sequence length: 8,192
Flash Attention 2, TF32 matmul, torch.compile
Throughput: ~110K tokens/sec
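
The effective batch size and throughput above follow from the other reported numbers. The short calculation below reproduces them, assuming a single device (as stated for this run) so that the per-device batch is also the global batch; the variable names are illustrative.

```python
MICRO_BATCH = 4        # sequences per device per forward pass
GRAD_ACCUM = 8         # gradient accumulation steps
SEQ_LEN = 8_192        # tokens per sequence

tokens_per_step = MICRO_BATCH * GRAD_ACCUM * SEQ_LEN
print(f"tokens per optimizer step: {tokens_per_step:,}")

EPOCH_TOKENS = 10.2e9                  # tokens per epoch, from the training table
EPOCH_HOURS = [25.82, 25.78, 26.16]    # training time per epoch, from the table

steps_per_epoch = EPOCH_TOKENS / tokens_per_step
print(f"optimizer steps per epoch: ~{steps_per_epoch:,.0f}")

throughputs = [EPOCH_TOKENS / (hours * 3600) for hours in EPOCH_HOURS]
for epoch, tps in enumerate(throughputs, start=1):
    print(f"epoch {epoch}: ~{tps / 1e3:.0f}K tokens/sec")
```

The product 4 × 8 × 8,192 is 262,144 tokens per step, and the per-epoch times imply roughly 108K to 110K tokens/sec, consistent with the figures quoted in the list.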

Benchmark Results

All benchmarks evaluated using lm-evaluation-harness v0.4.11. All tasks are 0-shot except GSM8K (5-shot).
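
For readers who want to rerun the evaluation, the sketch below assembles lm-evaluation-harness command lines for the zero-shot tasks and the 5-shot GSM8K run. The task identifiers are the harness's usual names, but the exact task list, the TruthfulQA variant (`truthfulqa_mc2` here) and the batch size are assumptions, since the card does not give the original command.

```python
import shlex

MODEL_ID = "codelion/SmolLM2-70M"

# ASSUMED task identifiers; the card lists benchmark names, not harness task names.
ZERO_SHOT_TASKS = [
    "arc_easy", "arc_challenge", "boolq", "hellaswag", "piqa", "sciq",
    "winogrande", "truthfulqa_mc2", "mmlu", "openbookqa",
]
FEW_SHOT_TASKS = {"gsm8k": 5}  # GSM8K is the only 5-shot task


def lm_eval_command(tasks, num_fewshot, batch_size=8):
    """Build one `lm_eval` invocation as a shell-safe string."""
    return shlex.join([
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={MODEL_ID}",
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
        "--batch_size", str(batch_size),  # ASSUMED; not stated in the card
    ])


commands = [lm_eval_command(ZERO_SHOT_TASKS, 0)]
commands += [lm_eval_command([task], shots) for task, shots in FEW_SHOT_TASKS.items()]
for command in commands:
    print(command)
```

Splitting the runs by few-shot count keeps each invocation to a single `--num_fewshot` value; scores may still differ slightly from the table if the harness version or task variants differ.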

This Model vs Training Progression

Columns are checkpoints: E3-best is this model (the best-perplexity checkpoint from epoch 3). By the same naming, E2-best is presumably the best-perplexity checkpoint from epoch 2, and the "-final" columns the checkpoint at the end of each epoch.

Benchmark	E3-best	E3-final	E2-best	E2-final	E1-final
ARC-Easy	33.00	33.16	32.83	33.12	33.46
ARC-Challenge	22.35	21.67	22.61	22.44	22.44
BoolQ	39.66	39.66	39.79	39.54	39.79
HellaSwag	26.14	26.03	26.08	25.91	26.03
PIQA	54.84	55.01	54.24	54.13	54.62
SciQ	45.20	46.30	44.10	45.50	43.60
WinoGrande	50.04	49.33	50.51	48.70	48.78
TruthfulQA	48.02	47.93	48.30	48.14	48.30
GSM8K	0.53	0.61	0.68	0.83	0.15
MMLU	22.96	22.87	23.00	22.98	22.99
OpenBookQA	27.60	27.60	—	—	—
Average (10, excluding OpenBookQA)	34.27	34.26	34.21	34.13	34.02

Comparison with 1B Token Baselines (SmolLM2-70M)

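The rows below show the same SmolLM2-70M model trained for 1 epoch on various 1B-token datasets from the Pre-training Dataset Samples collection; the first row is this model (Sutra-10B, 3 epochs) for reference. The Avg values are not computed on the same basis: for the 1B-token rows, Avg is the mean of the seven benchmarks shown, whereas the Sutra-10B row carries over the 10-benchmark average from the previous table (34.27). Over the seven benchmarks shown, the Sutra-10B scores average 32.13, so the claim that Sutra-10B at 3 epochs achieves the highest performance for this model size rests on the 10-benchmark figure.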

Dataset (1B tokens)	HellaSwag	PIQA	WinoGrande	ARC-C	MMLU	TruthfulQA	GSM8K	Avg
Sutra-10B (3 epochs)	26.14	54.84	50.04	22.35	22.96	48.02	0.53	34.27
Sutra-1B	25.43	53.86	49.41	23.04	22.91	49.09	1.14	32.13
FineWiki-1B	25.56	51.69	48.86	24.15	23.34	51.16	0.91	32.24
FinePDFs-1B	25.58	52.56	50.51	22.44	22.95	51.41	1.21	32.38
DCLM-Baseline-1B	25.85	55.17	50.20	21.08	22.97	49.21	0.68	32.16
FineWeb-Edu-1B	25.72	55.11	50.36	21.25	22.96	48.11	1.21	32.10
Essential-Web-1B	26.02	55.44	48.30	20.99	22.95	49.59	1.29	32.08
Synth-1B	26.63	50.98	48.78	21.93	23.24	47.10	1.29	31.42
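
Because the Avg column mixes two bases (the Sutra-10B row uses the 10-benchmark average, while the 1B-token rows match the mean of the seven columns shown), it can help to recompute every row the same way. The sketch below does that with the scores copied from the table; the names `ROWS` and `seven_task_mean` are illustrative.

```python
# Scores in table order: HellaSwag, PIQA, WinoGrande, ARC-C, MMLU, TruthfulQA, GSM8K
ROWS = {
    "Sutra-10B (3 epochs)": [26.14, 54.84, 50.04, 22.35, 22.96, 48.02, 0.53],
    "Sutra-1B":             [25.43, 53.86, 49.41, 23.04, 22.91, 49.09, 1.14],
    "FineWiki-1B":          [25.56, 51.69, 48.86, 24.15, 23.34, 51.16, 0.91],
    "FinePDFs-1B":          [25.58, 52.56, 50.51, 22.44, 22.95, 51.41, 1.21],
    "DCLM-Baseline-1B":     [25.85, 55.17, 50.20, 21.08, 22.97, 49.21, 0.68],
    "FineWeb-Edu-1B":       [25.72, 55.11, 50.36, 21.25, 22.96, 48.11, 1.21],
    "Essential-Web-1B":     [26.02, 55.44, 48.30, 20.99, 22.95, 49.59, 1.29],
    "Synth-1B":             [26.63, 50.98, 48.78, 21.93, 23.24, 47.10, 1.29],
}


def seven_task_mean(scores):
    """Unweighted mean over the seven benchmarks shown in the table."""
    return sum(scores) / len(scores)


# Print rows from highest to lowest seven-benchmark mean.
for name, scores in sorted(ROWS.items(), key=lambda kv: -seven_task_mean(kv[1])):
    print(f"{name:<22} {seven_task_mean(scores):.2f}")
```

On this common basis the rows fall within about one point of each other, which is worth keeping in mind when reading the Avg column.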

Key Findings

Capacity ceiling: The 70M parameter model reaches its capacity ceiling at approximately 10B tokens. Additional epochs (up to 30.6B total tokens) yield only marginal improvements in benchmark scores (+0.25 on the 10-benchmark average, from 34.02 at the end of epoch 1 to 34.27 for the best epoch 3 checkpoint), despite continued perplexity improvement (39.50 → 37.72).
Perplexity vs benchmarks: Perplexity continues to decrease across epochs, but downstream benchmark performance plateaus, suggesting the model's representational capacity is the bottleneck rather than data exposure.
Data quality matters: Even at 1B tokens, Sutra-1B (32.13 average) is on par with 1B-token samples of much larger web-crawled datasets (DCLM-Baseline 32.16, FineWeb-Edu 32.10, Essential-Web 32.08), supporting the value of curated pedagogical content.
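
To put the perplexity improvement in perspective, it can be converted to per-token cross-entropy loss. The sketch below assumes the reported perplexities are the exponential of the mean negative log-likelihood in nats, which is the usual convention but is not stated in the card.

```python
import math

# Best perplexity per epoch, from the training table.
BEST_PERPLEXITY = {1: 39.50, 2: 37.81, 3: 37.72}

# ASSUMPTION: perplexity = exp(mean NLL in nats), so loss = ln(perplexity).
loss = {epoch: math.log(ppl) for epoch, ppl in BEST_PERPLEXITY.items()}

for epoch in sorted(loss):
    print(f"epoch {epoch}: perplexity {BEST_PERPLEXITY[epoch]:.2f} -> loss {loss[epoch]:.3f} nats/token")

delta = loss[1] - loss[3]
relative = delta / loss[1]
print(f"epoch 1 -> 3: loss drops by {delta:.3f} nats/token ({relative:.1%})")
```

Seen this way, two additional epochs lower the loss by under 0.05 nats per token, a small change that fits the flat benchmark scores.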

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the checkpoint and its tokenizer from the Hub.
model = AutoModelForCausalLM.from_pretrained("codelion/SmolLM2-70M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("codelion/SmolLM2-70M")

# This is a base model, so prompt it with text to continue rather than an instruction.
input_text = "The theory of relativity states that"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate up to 100 new tokens and decode the full sequence back to text.
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Limitations

This is a 69M-parameter base model (not instruction-tuned); it generates completions, not conversational responses
Performance is at the capacity ceiling for this model size; larger models would likely benefit more from the Sutra-10B dataset
The model was trained primarily on English educational content

Related Resources

Dataset: codelion/sutra-10B, a 10B-token pedagogical pretraining dataset
Sutra Framework: Generates structured educational content optimized for LLM pretraining

Citation

@article{sharma2026sutra,
  title={Scaling Pedagogical Pretraining: From Optimal Mixing to 10 Billion Tokens},
  author={Sharma, Asankhaya},
  year={2026},
  url={https://huggingface.co/blog/codelion/scaling-pedagogical-pretraining-10-billion-tokens}
}

License

Apache 2.0
