---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- pretraining
- educational
- pedagogical
- sutra
- smollm2
- llama
pipeline_tag: text-generation
model-index:
- name: SmolLM2-70M
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: ai2_arc
      name: ARC-Easy
      config: ARC-Easy
    metrics:
    - type: acc_norm
      value: 33.00
      name: Normalized Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: ai2_arc
      name: ARC-Challenge
      config: ARC-Challenge
    metrics:
    - type: acc_norm
      value: 22.35
      name: Normalized Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: boolq
      name: BoolQ
    metrics:
    - type: acc
      value: 39.66
      name: Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: hellaswag
      name: HellaSwag
    metrics:
    - type: acc_norm
      value: 26.14
      name: Normalized Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: piqa
      name: PIQA
    metrics:
    - type: acc_norm
      value: 54.84
      name: Normalized Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: sciq
      name: SciQ
    metrics:
    - type: acc_norm
      value: 45.20
      name: Normalized Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: winogrande
      name: WinoGrande
    metrics:
    - type: acc
      value: 50.04
      name: Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: truthful_qa
      name: TruthfulQA MC2
    metrics:
    - type: acc
      value: 48.02
      name: Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: gsm8k
      name: GSM8K
    metrics:
    - type: exact_match
      value: 0.53
      name: Exact Match (5-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: cais/mmlu
      name: MMLU
    metrics:
    - type: acc
      value: 22.96
      name: Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: openbookqa
      name: OpenBookQA
    metrics:
    - type: acc_norm
      value: 27.60
      name: Normalized Accuracy (0-shot)
base_model: HuggingFaceTB/SmolLM2-70M
datasets:
- codelion/sutra-10B
---

SmolLM2-70M

SmolLM2-70M pretrained on the Sutra-10B pedagogical dataset for 3 epochs (~30.6B tokens total). It shows that a 69M-parameter model can be trained to near-capacity performance on dense, curated educational data.

Model Details

| Property | Value |
| --- | --- |
| Architecture | LlamaForCausalLM |
| Parameters | 69.2M |
| Hidden Size | 384 |
| Layers | 32 |
| Attention Heads | 6 (2 KV heads) |
| Context Length | 8,192 |
| Vocabulary | 49,152 |
| Precision | bfloat16 |
| Base Model | SmolLM2-70M |
| Training Dataset | Sutra-10B (10.2B tokens) |
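
As a rough sanity check on the table above, the parameter count can be rebuilt from the listed shape. This is only a sketch: the MLP intermediate size (1024) and tied input/output embeddings are not stated in this card. They are assumptions chosen because, together with the listed values, they land on roughly 69.2M parameters. The `LlamaShape` and `count_parameters` names are hypothetical helpers, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class LlamaShape:
    vocab_size: int = 49_152       # from the Model Details table
    hidden_size: int = 384         # from the Model Details table
    num_layers: int = 32           # from the Model Details table
    num_heads: int = 6             # from the Model Details table
    num_kv_heads: int = 2          # from the Model Details table
    intermediate_size: int = 1024  # ASSUMPTION: not stated in the card
    tie_embeddings: bool = True    # ASSUMPTION: not stated in the card


def count_parameters(s: LlamaShape) -> int:
    """Count weights of a bias-free Llama-style decoder with grouped-query attention."""
    head_dim = s.hidden_size // s.num_heads
    kv_dim = head_dim * s.num_kv_heads
    # q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim.
    attention = 2 * s.hidden_size * s.hidden_size + 2 * s.hidden_size * kv_dim
    # gate_proj, up_proj, and down_proj.
    mlp = 3 * s.hidden_size * s.intermediate_size
    # Two RMSNorm weight vectors per layer.
    norms = 2 * s.hidden_size
    embeddings = s.vocab_size * s.hidden_size
    lm_head = 0 if s.tie_embeddings else s.vocab_size * s.hidden_size
    final_norm = s.hidden_size
    return embeddings + lm_head + s.num_layers * (attention + mlp + norms) + final_norm


total = count_parameters(LlamaShape())
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the embedding matrix alone accounts for about 18.9M of the parameters, which is why the vocabulary size matters so much at this scale.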

Training

The model was trained for 3 epochs on the Sutra-10B dataset using a single NVIDIA L40S GPU (46GB). This checkpoint is the lowest-perplexity checkpoint from epoch 3.

| Epoch | Tokens | Training Time | Learning Rate | Best Perplexity |
| --- | --- | --- | --- | --- |
| 1 | 10.2B | 25.82h | 3e-4 → 3e-5 | 39.50 |
| 2 | 10.2B | 25.78h | 1e-4 → 1e-5 | 37.81 |
| 3 | 10.2B | 26.16h | 3e-5 → 3e-6 | 37.72 |
| Total | 30.6B | 77.76h | | 37.72 |
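
One way to read the learning-rate ranges in the table is as a per-epoch cosine decay from a peak to a floor of one tenth of the peak. The sketch below illustrates that reading for epoch 1. The warmup length (1,000 steps) and the steps-per-epoch figure (about 38,910, derived from 10.2B tokens at 262,144 tokens per step) are illustrative assumptions, and `lr_at` is a hypothetical helper rather than the training script's actual scheduler.

```python
import math


def lr_at(step: int, total_steps: int, warmup_steps: int, peak: float, floor: float) -> float:
    """Linear warmup to `peak`, then cosine decay to `floor` at `total_steps`."""
    if step < warmup_steps:
        # Linear ramp that reaches the peak on the last warmup step.
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


# Epoch 1 range from the table: 3e-4 -> 3e-5.
TOTAL_STEPS = 38_910   # ASSUMPTION: ~10.2B tokens / 262,144 tokens per step
WARMUP_STEPS = 1_000   # ASSUMPTION: warmup length is not stated in the card

for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step, TOTAL_STEPS, WARMUP_STEPS, 3e-4, 3e-5):.2e}")
```

Epochs 2 and 3 would follow the same shape with the peak and floor values from their rows.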

Training configuration:

  • Optimizer: AdamW (fused), weight decay 0.1
  • Schedule: Cosine with warmup
  • Batch size: 4 per device, gradient accumulation 8 (effective ~262K tokens/step)
  • Sequence length: 8,192
  • Flash Attention 2, TF32 matmul, torch.compile
  • Throughput: ~110K tokens/sec
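
The effective batch size and the per-epoch wall-clock time follow from the configuration above. A short worked calculation, assuming a single device as described in this section:

```python
# Values taken from the training configuration above.
per_device_batch = 4
grad_accum_steps = 8
sequence_length = 8_192
num_devices = 1  # single L40S, per the Training section

tokens_per_step = per_device_batch * grad_accum_steps * sequence_length * num_devices

epoch_tokens = 10.2e9     # Sutra-10B size from the Model Details table
throughput = 110_000      # approximate tokens/sec from the list above

steps_per_epoch = epoch_tokens / tokens_per_step
hours_per_epoch = epoch_tokens / throughput / 3600

print(f"tokens per optimizer step: {tokens_per_step:,}")
print(f"optimizer steps per epoch: {steps_per_epoch:,.0f}")
print(f"estimated hours per epoch: {hours_per_epoch:.2f}")
```

The estimate of just under 26 hours per epoch is consistent with the 25.78h to 26.16h reported in the epoch table.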

Benchmark Results

All benchmarks were evaluated with lm-evaluation-harness v0.4.11. All tasks are 0-shot except GSM8K (5-shot).
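
A sketch of how such an evaluation could be launched through the harness's Python entry point, `lm_eval.simple_evaluate`. The task identifiers and the `dtype=bfloat16` model argument are assumptions about the harness configuration, not settings confirmed by this card, and `build_runs` and `run_all` are hypothetical helpers.

```python
# ASSUMPTION: task identifiers as registered in lm-evaluation-harness.
ZERO_SHOT_TASKS = [
    "arc_easy", "arc_challenge", "boolq", "hellaswag", "piqa",
    "sciq", "winogrande", "truthfulqa_mc2", "mmlu", "openbookqa",
]
FEW_SHOT_TASKS = {"gsm8k": 5}  # GSM8K is the only 5-shot task


def build_runs(model_id: str) -> list[dict]:
    """Build one kwargs dict per few-shot setting for lm_eval.simple_evaluate."""
    common = {"model": "hf", "model_args": f"pretrained={model_id},dtype=bfloat16"}
    runs = [dict(common, tasks=list(ZERO_SHOT_TASKS), num_fewshot=0)]
    for task, shots in FEW_SHOT_TASKS.items():
        runs.append(dict(common, tasks=[task], num_fewshot=shots))
    return runs


def run_all(model_id: str) -> dict:
    """Run every configured evaluation and merge the per-task results."""
    import lm_eval  # requires `pip install lm-eval`; imported lazily on purpose

    merged = {}
    for kwargs in build_runs(model_id):
        merged.update(lm_eval.simple_evaluate(**kwargs)["results"])
    return merged


# Example (downloads the model and datasets, so it is left commented out):
# results = run_all("codelion/SmolLM2-70M")
```

Splitting the runs by few-shot count keeps the 0-shot and 5-shot settings explicit instead of relying on per-task defaults.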

This Model vs Training Progression

| Benchmark | E3-best | E3-final | E2-best | E2-final | E1-final |
| --- | --- | --- | --- | --- | --- |
| ARC-Easy | 33.00 | 33.16 | 32.83 | 33.12 | 33.46 |
| ARC-Challenge | 22.35 | 21.67 | 22.61 | 22.44 | 22.44 |
| BoolQ | 39.66 | 39.66 | 39.79 | 39.54 | 39.79 |
| HellaSwag | 26.14 | 26.03 | 26.08 | 25.91 | 26.03 |
| PIQA | 54.84 | 55.01 | 54.24 | 54.13 | 54.62 |
| SciQ | 45.20 | 46.30 | 44.10 | 45.50 | 43.60 |
| WinoGrande | 50.04 | 49.33 | 50.51 | 48.70 | 48.78 |
| TruthfulQA | 48.02 | 47.93 | 48.30 | 48.14 | 48.30 |
| GSM8K | 0.53 | 0.61 | 0.68 | 0.83 | 0.15 |
| MMLU | 22.96 | 22.87 | 23.00 | 22.98 | 22.99 |
| OpenBookQA | 27.60 | 27.60 | | | |
| Average (10) | 34.27 | 34.26 | 34.21 | 34.13 | 34.02 |
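
The "Average (10)" row can be reproduced from the E3-best column by averaging every benchmark except OpenBookQA, which is reported for only some checkpoints. A minimal sketch, where `average_excluding` is a hypothetical helper:

```python
# E3-best scores copied from the progression table.
E3_BEST = {
    "ARC-Easy": 33.00, "ARC-Challenge": 22.35, "BoolQ": 39.66,
    "HellaSwag": 26.14, "PIQA": 54.84, "SciQ": 45.20,
    "WinoGrande": 50.04, "TruthfulQA": 48.02, "GSM8K": 0.53,
    "MMLU": 22.96, "OpenBookQA": 27.60,
}


def average_excluding(scores: dict[str, float], excluded: tuple[str, ...] = ("OpenBookQA",)) -> float:
    """Mean over all benchmarks except the excluded ones."""
    kept = [value for name, value in scores.items() if name not in excluded]
    return sum(kept) / len(kept)


print(f"Average (10): {average_excluding(E3_BEST):.2f}")
```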

Comparison with 1B Token Baselines (SmolLM2-70M)

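These results come from training the same SmolLM2-70M model for 1 epoch on various 1B-token datasets from the Pre-training Dataset Samples collection. The Avg column for the 1B-token rows is the mean of the seven benchmarks shown, whereas the 34.27 reported for Sutra-10B (3 epochs) is its 10-benchmark average from the progression table. Over the same seven benchmarks, Sutra-10B at 3 epochs averages 32.13, level with Sutra-1B and within about 0.25 points of the best 1B-token baseline (FinePDFs-1B, 32.38).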

| Dataset (1B tokens) | HellaSwag | PIQA | WinoGrande | ARC-C | MMLU | TruthfulQA | GSM8K | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sutra-10B (3 epochs) | 26.14 | 54.84 | 50.04 | 22.35 | 22.96 | 48.02 | 0.53 | 34.27 |
| Sutra-1B | 25.43 | 53.86 | 49.41 | 23.04 | 22.91 | 49.09 | 1.14 | 32.13 |
| FineWiki-1B | 25.56 | 51.69 | 48.86 | 24.15 | 23.34 | 51.16 | 0.91 | 32.24 |
| FinePDFs-1B | 25.58 | 52.56 | 50.51 | 22.44 | 22.95 | 51.41 | 1.21 | 32.38 |
| DCLM-Baseline-1B | 25.85 | 55.17 | 50.20 | 21.08 | 22.97 | 49.21 | 0.68 | 32.16 |
| FineWeb-Edu-1B | 25.72 | 55.11 | 50.36 | 21.25 | 22.96 | 48.11 | 1.21 | 32.10 |
| Essential-Web-1B | 26.02 | 55.44 | 48.30 | 20.99 | 22.95 | 49.59 | 1.29 | 32.08 |
| Synth-1B | 26.63 | 50.98 | 48.78 | 21.93 | 23.24 | 47.10 | 1.29 | 31.42 |
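
When comparing rows, it helps to recompute every average over the same seven columns. The sketch below does that from the numbers printed in the table. The reported Avg for the 1B-token rows agrees with the seven-column mean to within rounding, while the Sutra-10B row's reported 34.27 matches its 10-benchmark average instead. `seven_benchmark_mean` is a hypothetical helper.

```python
# Columns: HellaSwag, PIQA, WinoGrande, ARC-C, MMLU, TruthfulQA, GSM8K, reported Avg.
ROWS = {
    "Sutra-10B (3 epochs)": (26.14, 54.84, 50.04, 22.35, 22.96, 48.02, 0.53, 34.27),
    "Sutra-1B":             (25.43, 53.86, 49.41, 23.04, 22.91, 49.09, 1.14, 32.13),
    "FineWiki-1B":          (25.56, 51.69, 48.86, 24.15, 23.34, 51.16, 0.91, 32.24),
    "FinePDFs-1B":          (25.58, 52.56, 50.51, 22.44, 22.95, 51.41, 1.21, 32.38),
    "DCLM-Baseline-1B":     (25.85, 55.17, 50.20, 21.08, 22.97, 49.21, 0.68, 32.16),
    "FineWeb-Edu-1B":       (25.72, 55.11, 50.36, 21.25, 22.96, 48.11, 1.21, 32.10),
    "Essential-Web-1B":     (26.02, 55.44, 48.30, 20.99, 22.95, 49.59, 1.29, 32.08),
    "Synth-1B":             (26.63, 50.98, 48.78, 21.93, 23.24, 47.10, 1.29, 31.42),
}


def seven_benchmark_mean(row: tuple[float, ...]) -> float:
    """Mean of the seven benchmark columns, ignoring the reported Avg."""
    scores = row[:7]
    return sum(scores) / len(scores)


for name, row in ROWS.items():
    print(f"{name:<22} recomputed={seven_benchmark_mean(row):.2f} reported={row[7]:.2f}")
```

On this like-for-like basis the datasets sit within about one point of each other, so differences between rows should be read with the small spread in mind.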

Key Findings

  1. Capacity ceiling: The 70M parameter model reaches its capacity ceiling at approximately 10B tokens. Additional epochs (up to 30.6B total tokens) yield only marginal improvements in benchmark scores (+0.25 average from epoch 1 to epoch 3), despite continued perplexity improvement (39.50 → 37.72).

  2. Perplexity vs benchmarks: Perplexity continues to decrease across epochs, but downstream benchmark performance plateaus, suggesting the model's representational capacity is the bottleneck rather than data exposure.

  3. Data quality matters: Even at 1B tokens, Sutra-1B outperforms or matches the 1B-token samples of larger web-crawled datasets on average (32.13 vs 32.16 for DCLM-Baseline, 32.10 for FineWeb-Edu, and 32.08 for Essential-Web), which points to the value of curated pedagogical content.
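
To put the perplexity and benchmark changes on a comparable footing, the sketch below converts the per-epoch perplexities into per-token loss. It assumes the standard definition, perplexity = exp(mean cross-entropy in nats). The card does not state how its perplexity was computed, so treat the loss values as illustrative.

```python
import math

# Best perplexity per epoch and 10-benchmark averages, from the tables above.
BEST_PERPLEXITY = {1: 39.50, 2: 37.81, 3: 37.72}
AVERAGE_SCORE = {"E1-final": 34.02, "E3-best": 34.27}

# ASSUMPTION: perplexity = exp(mean token-level cross-entropy in nats).
loss_nats = {epoch: math.log(ppl) for epoch, ppl in BEST_PERPLEXITY.items()}

relative_ppl_drop = (BEST_PERPLEXITY[1] - BEST_PERPLEXITY[3]) / BEST_PERPLEXITY[1]
loss_drop = loss_nats[1] - loss_nats[3]
score_gain = AVERAGE_SCORE["E3-best"] - AVERAGE_SCORE["E1-final"]

for epoch, loss in loss_nats.items():
    print(f"epoch {epoch}: perplexity {BEST_PERPLEXITY[epoch]:.2f} -> loss {loss:.4f} nats/token")
print(f"relative perplexity drop, epoch 1 to 3: {relative_ppl_drop:.1%}")
print(f"loss drop, epoch 1 to 3: {loss_drop:.4f} nats/token")
print(f"benchmark average gain, epoch 1 to 3: {score_gain:+.2f} points")
```

Most of the perplexity improvement arrives between epochs 1 and 2; the step from epoch 2 to 3 is much smaller.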

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# LlamaForCausalLM is supported natively by transformers, so trust_remote_code is not needed.
model_id = "codelion/SmolLM2-70M"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# This is a base model: give it a text prefix to continue rather than a chat-style instruction.
input_text = "The theory of relativity states that"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Limitations

  • This is a 69M-parameter base model (not instruction-tuned): it generates completions, not conversational responses
  • Performance is at the capacity ceiling for this model size; larger models would likely benefit more from the Sutra-10B dataset
  • The model was trained primarily on English educational content

Related Resources

  • Dataset: codelion/sutra-10B, a 10B-token pedagogical pretraining dataset
  • Sutra Framework: Generates structured educational content optimized for LLM pretraining

Citation

@article{sharma2026sutra,
  title={Scaling Pedagogical Pretraining: From Optimal Mixing to 10 Billion Tokens},
  author={Sharma, Asankhaya},
  year={2026},
  url={https://huggingface.co/blog/codelion/scaling-pedagogical-pretraining-10-billion-tokens}
}

License

Apache 2.0
