language: en
license: apache-2.0
tags:
  - text-generation
  - gpt2
  - dataset-mixing
  - pretraining
model-index:
  - name: gpt-2-70m
    results:
      - task:
          type: text-generation
        metrics:
          - name: MMLU (5-shot)
            type: accuracy
            value: 24.11
          - name: HellaSwag (0-shot)
            type: accuracy
            value: 27.03
          - name: ARC-Challenge (0-shot)
            type: accuracy
            value: 21.67
          - name: PIQA (0-shot)
            type: accuracy
            value: 57.29
          - name: WinoGrande (0-shot)
            type: accuracy
            value: 51.46
          - name: TruthfulQA MC2 (0-shot)
            type: accuracy
            value: 47.31
          - name: Average
            type: accuracy
            value: 38.15
datasets:
  - codelion/finepdfs-1B
  - codelion/dclm-baseline-1B
  - codelion/fineweb-edu-1B

GPT-2 70M - Optimal Dataset Mixing

A 70M parameter GPT-2 model trained on 1 billion tokens using an optimized 50-30-20 dataset mixing strategy.

Model Description

This model demonstrates the effect of careful dataset composition on efficient language model pretraining. Despite using one tenth of the training data of the GPT-2 baseline (1B vs. 10B tokens), it comes within 0.85 percentage points of that baseline's average benchmark score by using a carefully chosen mixture of high-quality data sources.

Architecture: GPT-2

  • Parameters: 70M (64.09M trainable)
  • Layers: 12
  • Hidden Size: 512
  • Attention Heads: 8
  • Context Length: 1024 tokens
  • Vocabulary Size: 50,257
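
As a sanity check on the figures above, the parameter count can be estimated from the listed dimensions. This is a minimal sketch, assuming the standard GPT-2 block layout (fused QKV projection, 4x MLP expansion, two LayerNorms per block, a final LayerNorm) and tied input/output embeddings; none of these details are stated in this card, and `gpt2_param_count` is a hypothetical helper.

```python
def gpt2_param_count(vocab_size, context_length, hidden_size, num_layers):
    """Estimate parameters of a GPT-2 style model with tied embeddings (assumed)."""
    d = hidden_size
    token_embeddings = vocab_size * d
    position_embeddings = context_length * d
    attention = (d * 3 * d + 3 * d) + (d * d + d)   # QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)     # up projection + down projection
    layer_norms = 2 * (2 * d)                       # two LayerNorms, weight + bias each
    per_layer = attention + mlp + layer_norms
    final_layer_norm = 2 * d
    return (token_embeddings + position_embeddings
            + num_layers * per_layer + final_layer_norm)

# Dimensions taken from the architecture list above.
total = gpt2_param_count(vocab_size=50_257, context_length=1024,
                         hidden_size=512, num_layers=12)
print(f"{total:,} parameters ({total / 1e6:.2f}M)")
```

Under these assumptions the estimate agrees with the 64.09M trainable parameters reported above.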

Training Data

The model was trained on 1 billion tokens with the following composition:

  • 50% - FinePDFs (500M tokens): High-quality PDF content
  • 30% - DCLM Baseline (300M tokens): Filtered web content
  • 20% - FineWeb-Edu (200M tokens): Educational web content

This 50-30-20 mixing ratio was identified through systematic experimentation as optimal for balanced performance across multiple domains.
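
The token budgets above follow directly from the mixing ratios. The sketch below recomputes them and shows one possible way to draw the source of each training document in proportion to the mix. The sampling helper is hypothetical and is not taken from the author's training code; the dataset names are the ones listed in the card metadata.

```python
import random

# Ratios and the 1B-token budget are taken from the text above.
TOTAL_TOKENS = 1_000_000_000
MIX_PERCENT = {
    "codelion/finepdfs-1B": 50,
    "codelion/dclm-baseline-1B": 30,
    "codelion/fineweb-edu-1B": 20,
}

def token_budgets(total_tokens, mix_percent):
    """Split a token budget across sources using integer percentages."""
    assert sum(mix_percent.values()) == 100
    return {name: total_tokens * pct // 100 for name, pct in mix_percent.items()}

def sample_source(rng, mix_percent):
    """Hypothetical helper: choose the source of the next document by weight."""
    names = list(mix_percent)
    weights = [mix_percent[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

budgets = token_budgets(TOTAL_TOKENS, MIX_PERCENT)
for name, tokens in budgets.items():
    print(f"{name}: {tokens:,} tokens")

# Draw a sample of sources to see the mix emerge empirically.
rng = random.Random(0)
draws = [sample_source(rng, MIX_PERCENT) for _ in range(10_000)]
```

Sampling per document only approximates the token-level ratio when documents differ in length, so a real pipeline would more likely track tokens consumed per source.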

Training Details

  • Total Tokens: 1,000,000,000
  • Batch Size: 24 (effective: 120 with gradient accumulation)
  • Learning Rate: 5e-4 → 5e-5 (cosine decay)
  • Warmup Steps: 162 (2% of total)
  • Precision: BFloat16
  • Optimizer: AdamW
  • Final Loss: 2.92
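
The step counts above can be reproduced from the token budget, and the learning-rate schedule can be sketched as a function of the step. This is an illustrative sketch, assuming every sequence fills the full 1024-token context and that warmup is linear; neither is stated in this card, and `learning_rate` is a hypothetical helper rather than the author's scheduler.

```python
import math

# Values taken from the lists above.
TOTAL_TOKENS = 1_000_000_000
CONTEXT_LENGTH = 1024
MICRO_BATCH = 24
EFFECTIVE_BATCH = 120
PEAK_LR, FINAL_LR = 5e-4, 5e-5

accumulation_steps = EFFECTIVE_BATCH // MICRO_BATCH   # 5
tokens_per_step = EFFECTIVE_BATCH * CONTEXT_LENGTH    # 122,880 (assumes full sequences)
total_steps = TOTAL_TOKENS // tokens_per_step         # 8,138
warmup_steps = total_steps * 2 // 100                 # 162, i.e. 2% of total

def learning_rate(step):
    """Linear warmup (assumed) to PEAK_LR, then cosine decay to FINAL_LR."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine

print(accumulation_steps, tokens_per_step, total_steps, warmup_steps)
```

The computed 162 warmup steps match the figure reported above, which suggests the full-context assumption is at least close to the actual setup.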

Benchmark Results

Performance Comparison

| Benchmark               | Our Model | Random | GPT-2  | vs Random | vs GPT-2 |
|-------------------------|-----------|--------|--------|-----------|----------|
| MMLU (5-shot)           | 24.11%    | 25.00% | 26.00% | -0.89%    | -1.89%   |
| HellaSwag (0-shot)      | 27.03%    | 25.00% | 30.00% | +2.03%    | -2.97%   |
| ARC-Challenge (0-shot)  | 21.67%    | 25.00% | 24.00% | -3.33%    | -2.33%   |
| PIQA (0-shot)           | 57.29%    | 50.00% | 63.00% | +7.29%    | -5.71%   |
| WinoGrande (0-shot)     | 51.46%    | 50.00% | 51.00% | +1.46%    | +0.46%   |
| TruthfulQA MC2 (0-shot) | 47.31%    | 25.00% | 40.00% | +22.31%   | +7.31%   |
| Average                 | 38.15%    | 33.33% | 39.00% | +4.81%    | -0.85%   |

Key Findings

  • Performance Gap: 0.85 percentage points behind the GPT-2 baseline on average (38.15% vs. 39.00%)
  • Efficiency: Achieves 84.9% of GPT-2's average improvement over random guessing
  • Data Efficiency: Competitive results with one tenth of the training data
  • TruthfulQA: 7.31 percentage points above the GPT-2 baseline, suggesting better truthfulness on this benchmark
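
The averages and the 84.9% figure can be recomputed from the per-benchmark scores in the comparison table. The sketch below uses only the numbers reported above; the variable names are illustrative.

```python
# (our model, random baseline, GPT-2 baseline) accuracy in percent, from the table above.
scores = {
    "MMLU": (24.11, 25.00, 26.00),
    "HellaSwag": (27.03, 25.00, 30.00),
    "ARC-Challenge": (21.67, 25.00, 24.00),
    "PIQA": (57.29, 50.00, 63.00),
    "WinoGrande": (51.46, 50.00, 51.00),
    "TruthfulQA MC2": (47.31, 25.00, 40.00),
}

def column_mean(index):
    """Unweighted mean of one column across all benchmarks."""
    return sum(row[index] for row in scores.values()) / len(scores)

ours, random_baseline, gpt2 = (column_mean(i) for i in range(3))

gap_to_gpt2 = ours - gpt2                                      # in percentage points
recovered = (ours - random_baseline) / (gpt2 - random_baseline)

print(f"average: {ours:.2f}% (random {random_baseline:.2f}%, GPT-2 {gpt2:.2f}%)")
print(f"gap to GPT-2: {gap_to_gpt2:+.2f} points")
print(f"share of GPT-2's gain over random: {recovered:.1%}")
```

Because the average is unweighted, the large TruthfulQA margin offsets the deficits on the other benchmarks.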

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("codelion/gpt-2-70m")
model = AutoModelForCausalLM.from_pretrained("codelion/gpt-2-70m")

# Generate text with temperature and nucleus (top-p) sampling
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=50,            # Total length in tokens, prompt included
    do_sample=True,           # Enable sampling
    temperature=0.8,          # Control randomness
    top_p=0.9,                # Nucleus sampling
    pad_token_id=tokenizer.eos_token_id  # Reuse EOS as the padding token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Key Insights

  1. Data Quality > Quantity: The 50-30-20 mixing strategy shows that careful dataset composition can approach the GPT-2 baseline's average score with one tenth of the training tokens
  2. Factual Accuracy: The model scores well on TruthfulQA, possibly because half of the training data is high-quality FinePDFs content
  3. Practical Commonsense: PIQA (57.29%) is clearly above the 50% random baseline, while WinoGrande (51.46%) is only marginally above it
  4. Knowledge Gaps: Below-random scores on MMLU and ARC-Challenge indicate that the model has acquired little academic or scientific knowledge at this scale

Limitations

  • Academic Knowledge: Limited performance on academic benchmarks (MMLU, ARC-Challenge)
  • Training Scale: 1B tokens is insufficient for comprehensive world knowledge
  • Parameter Count: 70M parameters may limit capacity for complex reasoning

Citation

If you use this model/dataset, please cite:

@article{sharma2025billion,
  title={The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author={Sharma, Asankhaya},
  year={2025},
  url={https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}

For more details, see the blog post at https://huggingface.co/blog/codelion/optimal-dataset-mixing/.

Model Card Authors

codelion

Model Card Contact

For questions or issues, please open an issue on the model repository.
