---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
language:
- en
tags:
- marin
- delphi
- scaling-laws
- pretrained
- research-only
datasets:
- nvidia/Nemotron-CC
- bigcode/starcoderdata
- EleutherAI/proof-pile-2
---

# delphi-3e19-1.9Bparams-2.5Btokens

A 1.9B-parameter base model from the **Delphi** scaling suite. Trained at 3 × 10¹⁹ FLOPs on 2.5B tokens with the Delphi recipe.
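
As a rough sanity check on these headline numbers, the widely used approximation C ≈ 6·N·D (training FLOPs ≈ 6 × parameters × tokens) can be applied to the exact parameter and token counts listed in the tables further down this card. This is only a sketch: the 6ND rule ignores attention FLOPs, and the FLOP accounting that the Delphi recipe actually uses is not stated here, so a small gap to the nominal 3 × 10¹⁹ budget is to be expected.

```python
# Rough compute check with the common C ~= 6 * N * D rule of thumb.
# N and D are the exact counts reported in this card; the 6ND rule itself
# is an assumption here, not necessarily the FLOP accounting Delphi uses.
N_PARAMS = 1_934_716_160   # total parameters (includes embeddings)
N_TOKENS = 2_467_495_936   # training tokens
NOMINAL_FLOPS = 3e19       # compute budget named in the model id


def approx_train_flops(n_params: int, n_tokens: int) -> float:
    """Forward + backward estimate: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


flops = approx_train_flops(N_PARAMS, N_TOKENS)
ratio = flops / NOMINAL_FLOPS
print(f"6ND estimate: {flops:.3e} FLOPs ({ratio:.1%} of the nominal budget)")
```

Under this approximation the estimate comes out at about 2.86 × 10¹⁹ FLOPs, roughly 95% of the nominal budget.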

## About Delphi

Delphi is the Marin team's first open scaling suite, inspired by
[Pythia](https://www.eleuther.ai/papers-blog/pythia-a-suite-for-analyzing-large-language-modelsacross-training-and-scaling).
It has three parts:

- a **scaling recipe** that maps compute budgets to model configurations,
- a **scaling suite** of models trained from that recipe at IsoFLOP budgets
  from 3 × 10¹⁸ to 1 × 10²³ FLOPs, and
- a **scaling law** that uses the smaller Delphi models to predict the larger
  ones.

A pre-registered forecast from that scaling law predicted the final loss of the
largest Delphi run (1 × 10²³ FLOPs, 25B parameters, 600B tokens) to within
**0.2%**, using **300× less compute** than the training run itself. The same
process forecasts downstream benchmarks (MMLU, HumanEval, and GSM8K) via a
two-step regression combining compute and observational scaling laws.

See ["Scaling Laws That Extrapolate 300× Past the Fit"](https://openathena.ai/blog/delphi)
for the recipe, fit, and downstream-eval projections. The full set of Delphi
checkpoints (IsoFLOP grid points, held-out optima at 1e21/1e22/1e23 with
multiple random seeds, and training intermediates) lives on
[`marin-community`](https://huggingface.co/marin-community) on the Hub.

This is a research artifact, not a production model.

## Model details

| Property | Value |
|---|---|
| Architecture | Qwen 3 (pre-norm decoder, RMSNorm, RoPE, QK-norm with learned scaling, SwiGLU MLPs) |
| Parameters | 1,934,716,160 |
| Hidden size | 2048 |
| Layers | 21 |
| Attention heads | 16 |
| KV heads | 16 (no GQA) |
| Head dim | 128 |
| FFN intermediate | 8192 (MLP ratio 4) |
| Vocab size | 128,256 (Llama 3 tokenizer) |
| Max sequence length | 4096 |
| Position encoding | RoPE (θ = 500000, Llama 3-style scaling) |
| Bias terms | None |
| Tied embeddings | No |
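
The parameter count can be reproduced from the other rows of the table. The sketch below assumes the usual Qwen 3 layer layout: four attention projections, three SwiGLU projections, two RMSNorm weight vectors per layer, QK-norm scales of size head-dim, a final RMSNorm, and separate input and output embeddings. That layout is an assumption about how the checkpoint is organized, not something read from its config; under it, the total matches the reported figure.

```python
# Parameter count from the architecture table, assuming a Qwen 3 style layout.
HIDDEN = 2048
LAYERS = 21
HEADS = 16
KV_HEADS = 16
HEAD_DIM = 128
FFN = 8192
VOCAB = 128_256


def count_params() -> int:
    # Untied input embedding and output head (no bias terms anywhere).
    embeddings = 2 * VOCAB * HIDDEN

    # Attention: Q and O span all heads; K and V span the KV heads.
    q_and_o = 2 * HIDDEN * HEADS * HEAD_DIM
    k_and_v = 2 * HIDDEN * KV_HEADS * HEAD_DIM
    qk_norm = 2 * HEAD_DIM  # learned scales for the Q-norm and K-norm

    # SwiGLU MLP: gate, up, and down projections.
    mlp = 3 * HIDDEN * FFN

    # Two RMSNorm weight vectors per layer (before attention, before the MLP).
    layer_norms = 2 * HIDDEN

    per_layer = q_and_o + k_and_v + qk_norm + mlp + layer_norms
    final_norm = HIDDEN
    return embeddings + LAYERS * per_layer + final_norm


total = count_params()
print(f"{total:,}")  # prints 1,934,716,160
```

About 525M of these parameters sit in the two embedding matrices, which is a large share at this scale because of the 128,256-entry vocabulary.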

## Training

| Setting | Value |
|---|---|
| Compute | 3 × 10¹⁹ FLOPs |
| Tokens | 2,467,495,936 |
| Steps | 37,650 |
| Sequence length | 4096 |
| Optimizer | AdamH (Adam with Hyperball) |
| Recipe | Delphi (Complete(d)P-style scaling with `(T₀/T)^0.3` token-horizon LR correction) |
| LR schedule | WSD: 10% linear warmup, 20% linear decay, 0 floor |
| Precision | f32 master params, bf16 compute |
| Parallelism | FSDP |
| Data mixture | Nemotron-CC + StarCoderData + ProofPile 2 |
| Tokenizer | Llama 3 (vocab 128,256) |

[AdamH](https://whenwen.github.io/wd_blog/public/hyperball-part-1.html), Adam
with Hyperball, constrains every projection weight to stay on the
Frobenius-norm sphere it was initialized on, so weight decay has nothing to
regularize away and falls out of the recipe. A Complete(d)P-style transfer rule
with a `(T₀/T)^0.3` correction sets the learning rate as the token horizon
grows. Reference constants: B₀ = 64, T₀ = 2.5B tokens, η₀ = 0.00630,
η₀,Adam = 0.000656, ε₀ = 1.85 × 10⁻⁸. Recipe code:
[`experiments/scaling_law_sweeps/completed_adamh.py`](https://github.com/marin-community/marin/blob/main/experiments/scaling_law_sweeps/completed_adamh.py).
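
The schedule and the token-horizon correction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the recipe's implementation: it applies only the `(T₀/T)^0.3` factor to η₀ (any batch-size or width terms of the full transfer rule are left out), rounds the warmup and decay fractions to whole steps, and uses function names that are hypothetical. The authoritative version is the linked `completed_adamh.py`.

```python
# Sketch of a WSD (warmup-stable-decay) schedule plus the token-horizon factor.
# Assumptions: only the (T0 / T) ** 0.3 factor of the transfer rule is shown,
# and warmup/decay lengths are rounded to whole steps. Names are hypothetical.
T0_TOKENS = 2.5e9        # reference token horizon from the card
ETA0 = 0.00630           # reference learning rate from the card
HORIZON_EXPONENT = 0.3


def horizon_scaled_lr(tokens: float, eta0: float = ETA0) -> float:
    """Scale the reference LR by (T0 / T) ** 0.3; longer runs get a lower LR."""
    return eta0 * (T0_TOKENS / tokens) ** HORIZON_EXPONENT


def wsd_multiplier(step: int, total_steps: int,
                   warmup_frac: float = 0.10, decay_frac: float = 0.20) -> float:
    """LR multiplier in [0, 1]: linear warmup, flat plateau, linear decay to 0."""
    warmup_steps = max(1, round(warmup_frac * total_steps))
    decay_steps = max(1, round(decay_frac * total_steps))
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return step / warmup_steps
    if step < decay_start:
        return 1.0
    return max(0.0, (total_steps - step) / decay_steps)


peak = horizon_scaled_lr(2_467_495_936)
for step in (0, 1_883, 3_765, 20_000, 30_120, 33_885, 37_650):
    print(step, f"{peak * wsd_multiplier(step, 37_650):.6f}")
```

Because this checkpoint's token count is very close to T₀, the correction factor is within half a percent of 1, so the peak learning rate in this sketch stays close to η₀.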

## Companion releases

- All Delphi model checkpoints: [`marin-community`](https://huggingface.co/marin-community) on the Hub.
- Plot data behind every figure in the blog post:
  [`marin-community/delphi-blog-data`](https://huggingface.co/datasets/marin-community/delphi-blog-data) (one config per figure, with `wandb_url` on every row).
- Pipelines that deterministically reproduce the training mixture from public
  Nemotron-CC, StarCoderData, and ProofPile 2 sources: see the Marin repo.

## Evaluation

This checkpoint is part of the Delphi eval suite
([`experiments/exp1337_eval_suite.py`](https://github.com/marin-community/marin/blob/main/experiments/exp1337_eval_suite.py)),
which scores every Delphi run alongside reference open-weights baselines
(Qwen 3, Llama 2/3, OLMo 2, Marin 8B). Following the blog's two-step
forecast, soft metrics (per-choice log-prob for multiple-choice tasks,
bits-per-byte for generative tasks) carry the signal the scaling law is fit on,
and a sigmoid fit on an external model pool maps soft metric to hard metric
(accuracy, pass@1, exact-match). Below ~1e21 FLOPs the hard metrics stay near
chance even when the underlying probabilities are improving smoothly; that is
expected and is exactly why the soft metrics exist.
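
The second step of that forecast, mapping a soft metric to a hard one, can be pictured as a logistic curve between chance level and a ceiling. The sketch below is illustrative only: the parameter names (`midpoint`, `slope`, `ceiling`) and every numeric value are made up for the demonstration and are not fitted values from the Delphi eval suite.

```python
import math


def soft_to_hard(soft: float, chance: float, ceiling: float,
                 midpoint: float, slope: float) -> float:
    """Map a soft metric (higher is better) to a hard metric via a sigmoid.

    The output rises from `chance` toward `ceiling` as `soft` passes `midpoint`.
    """
    return chance + (ceiling - chance) / (1.0 + math.exp(-slope * (soft - midpoint)))


# Hypothetical 4-way multiple-choice task, so chance accuracy is 0.25.
# Soft metric: log-prob of the correct choice (illustrative values only).
params = dict(chance=0.25, ceiling=0.90, midpoint=-0.8, slope=8.0)
for logprob in (-1.6, -1.5, -1.4, -0.8, -0.4):
    acc = soft_to_hard(logprob, **params)
    print(f"log-prob {logprob:+.1f} -> predicted accuracy {acc:.3f}")
```

With these made-up parameters, moving the log-prob from -1.6 to -1.4 changes the predicted accuracy by less than half a percentage point, which mirrors the near-chance plateau described above.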

## Limitations

- Trained on an English-heavy web mixture; no multilingual coverage.
- Pretrained only: no instruction tuning, RLHF, or safety alignment.
- The Delphi recipe targets compute-optimal training, not inference-cost-aware
  overtraining; for inference-heavy deployments, an overtrained smaller model
  may be preferable. The blog's "off-optimal training" section quantifies the
  penalty.
- This is one checkpoint in a much larger Delphi release; pick the one that
  matches your compute / parameter / token regime, or browse the full set at
  [`marin-community`](https://huggingface.co/marin-community).

## Citation

```bibtex
@misc{held2026delphi,
  title  = {Scaling Laws That Extrapolate 300× Past the Fit},
  author = {Held, Will and {Marin Community}},
  year   = {2026},
  url    = {https://openathena.ai/blog/delphi}
}
```