---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
language:
- en
tags:
- marin
- delphi
- scaling-laws
- pretrained
- research-only
datasets:
- nvidia/Nemotron-CC
- bigcode/starcoderdata
- EleutherAI/proof-pile-2
---

# delphi-9e18-358Mparams-4.7Btokens

A 358M-parameter base model from the **Delphi** scaling suite. Trained at 9 × 10¹⁸ FLOPs on 4.7B tokens with the Delphi recipe.

## About Delphi

Delphi is the Marin team's first open scaling suite, inspired by [Pythia](https://www.eleuther.ai/papers-blog/pythia-a-suite-for-analyzing-large-language-modelsacross-training-and-scaling). It has three parts:

- a **scaling recipe** that maps compute budgets to model configurations,
- a **scaling suite** of models trained from that recipe at IsoFLOP budgets from 3 × 10¹⁸ to 1 × 10²³ FLOPs, and
- a **scaling law** which uses the smaller Delphi models to predict the larger ones.

A pre-registered forecast from that scaling law predicted the final loss of the largest Delphi run (1 × 10²³ FLOPs, 25B parameters, 600B tokens) within **0.2%**, using **300× less compute** than the training run itself. The same process forecasts downstream benchmarks (MMLU, HumanEval, and GSM8K) via a two-step regression combining compute and observational scaling laws. See ["Scaling Laws That Extrapolate 300× Past the Fit"](https://openathena.ai/blog/delphi) for the recipe, fit, and downstream-eval projections.

The full set of Delphi checkpoints (IsoFLOP grid points, held-out optima at 1e21/1e22/1e23 with multiple random seeds, and training intermediates) lives on [`marin-community`](https://huggingface.co/marin-community) on the Hub.

This is a research artifact, not a production model.

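For research use, a checkpoint can presumably be loaded with the standard `transformers` auto classes, since the card metadata declares `library_name: transformers` and `pipeline_tag: text-generation`. The sketch below is an illustration, not an official snippet: `REPO_ID` is an assumption (the `marin-community` organization plus this card's title) and `complete` is a hypothetical helper name. Adjust the ID if the Hub path differs.

```python
# ASSUMPTION: the Hub path is the organization linked in this card plus the
# card title. The card itself does not state the repository ID.
REPO_ID = "marin-community/delphi-9e18-358Mparams-4.7Btokens"


def complete(prompt: str, max_new_tokens: int = 32, repo_id: str = REPO_ID) -> str:
    """Greedily continue `prompt` with the base model (hypothetical helper)."""
    # Imported lazily so the module can be imported without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    # generate() returns the prompt tokens followed by the continuation.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `complete("The capital of France is")` downloads the weights on first use. Because this is a pretrained-only model, it continues text rather than following instructions.
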
## Model details

| | |
|---|---|
| Architecture | Qwen 3 (pre-norm decoder, RMSNorm, RoPE, QK-norm with learned scaling, SwiGLU MLPs) |
| Parameters | 358,306,688 |
| Hidden size | 896 |
| Layers | 10 |
| Attention heads | 7 |
| KV heads | 7 (no GQA) |
| Head dim | 128 |
| FFN intermediate | 3584 (MLP ratio 4) |
| Vocab size | 128,256 (Llama 3 tokenizer) |
| Max sequence length | 4096 |
| Position encoding | RoPE (θ = 500000, Llama 3-style scaling) |
| Bias terms | None |
| Tied embeddings | No |

## Training

| | |
|---|---|
| Compute | 9 × 10¹⁸ FLOPs |
| Tokens | 4,728,946,688 |
| Steps | 36,078 |
| Sequence length | 4096 |
| Optimizer | AdamH (Adam with Hyperball) |
| Recipe | Delphi (Complete(d)P-style scaling with `(T₀/T)^0.3` token-horizon LR correction) |
| LR schedule | WSD: 10% linear warmup, 20% linear decay, 0 floor |
| Precision | f32 master params, bf16 compute |
| Parallelism | FSDP |
| Data mixture | Nemotron-CC + StarCoderData + ProofPile 2 |
| Tokenizer | Llama 3 (vocab 128,256) |

[AdamH](https://whenwen.github.io/wd_blog/public/hyperball-part-1.html), Adam with Hyperball, constrains every projection weight to stay on the Frobenius-norm sphere it was initialized on, so weight decay has nothing to regularize away and falls out of the recipe. A Complete(d)P-style transfer rule with a `(T₀/T)^0.3` correction sets the learning rate as the token horizon grows. Reference constants: B₀ = 64, T₀ = 2.5B tokens, η₀ = 0.00630, η₀,Adam = 0.000656, ε₀ = 1.85 × 10⁻⁸. Recipe code: [`experiments/scaling_law_sweeps/completed_adamh.py`](https://github.com/marin-community/marin/blob/main/experiments/scaling_law_sweeps/completed_adamh.py).

## Companion releases

- All Delphi model checkpoints: [`marin-community`](https://huggingface.co/marin-community) on the Hub.
- Plot data behind every figure in the blog post: [`marin-community/delphi-blog-data`](https://huggingface.co/datasets/marin-community/delphi-blog-data) (one config per figure, with `wandb_url` on every row).
- Pipelines that deterministically reproduce the training mixture from public Nemotron-CC, StarCoderData, and ProofPile 2 sources: see the Marin repo.

## Evaluation

This checkpoint is part of the Delphi eval suite ([`experiments/exp1337_eval_suite.py`](https://github.com/marin-community/marin/blob/main/experiments/exp1337_eval_suite.py)), which scores every Delphi run alongside reference open-weights baselines (Qwen 3, Llama 2/3, OLMo 2, Marin 8B). Following the blog's two-step forecast, soft metrics (per-choice log-prob for multiple-choice tasks, bits-per-byte for generative tasks) carry the signal the scaling law is fit on, and a sigmoid fit on an external model pool maps soft metric to hard metric (accuracy, pass@1, exact-match). Below ~1e21 FLOPs the hard metrics stay near chance even when the underlying probabilities are improving smoothly; that is expected and is exactly why the soft metrics exist.

## Limitations

- Trained on an English-heavy web mixture; no multilingual coverage.
- Pretrained only: no instruction tuning, RLHF, or safety alignment.
- The Delphi recipe targets compute-optimal training, not inference-cost-aware overtraining; for inference-heavy deployments, an overtrained smaller model may be preferable. The blog's "off-optimal training" section quantifies the penalty.
- This is one checkpoint in a much larger Delphi release; pick the one that matches your compute / parameter / token regime, or browse the full set at [`marin-community`](https://huggingface.co/marin-community).

## Citation

```bibtex
@misc{held2026delphi,
  title  = {Scaling Laws That Extrapolate 300× Past the Fit},
  author = {Held, Will and {Marin Community}},
  year   = {2026},
  url    = {https://openathena.ai/blog/delphi}
}
```
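
## Appendix: checking the parameter count

The dimensions in the Model details table can be cross-checked against the reported 358,306,688 parameters with a few lines of arithmetic. The sketch below assumes a standard Qwen 3-style layout: untied input and output embeddings, bias-free projections, two RMSNorm scale vectors per layer plus a final norm, and one learned QK-norm scale vector of size head-dim each for queries and keys. That layout is inferred from the table, not taken from the training code.

```python
# Values copied from the "Model details" table.
HIDDEN = 896
LAYERS = 10
Q_HEADS = 7
KV_HEADS = 7
HEAD_DIM = 128
FFN = 3584
VOCAB = 128_256


def count_parameters() -> int:
    """Parameter count under the layout assumptions stated in the lead-in."""
    # Untied embeddings: input embedding table plus a separate output head.
    embeddings = 2 * VOCAB * HIDDEN

    # Attention projections (no biases): Q and O use the query heads,
    # K and V use the KV heads.
    q_and_o = 2 * HIDDEN * Q_HEADS * HEAD_DIM
    k_and_v = 2 * HIDDEN * KV_HEADS * HEAD_DIM
    # ASSUMPTION: QK-norm is one learned scale vector of size HEAD_DIM
    # for queries and one for keys, shared across heads.
    qk_norm = 2 * HEAD_DIM

    # SwiGLU MLP: gate, up, and down projections.
    mlp = 3 * HIDDEN * FFN

    # Two RMSNorm scale vectors per layer (before attention and before the MLP).
    layer_norms = 2 * HIDDEN

    per_layer = q_and_o + k_and_v + qk_norm + mlp + layer_norms
    final_norm = HIDDEN
    return embeddings + LAYERS * per_layer + final_norm


print(f"{count_parameters():,}")  # 358,306,688
```

Under these assumptions the two embedding matrices account for 229,834,752 parameters, roughly 64% of the total, which is typical for a small model with a 128K vocabulary.
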
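
## Appendix: learning-rate schedule sketch

The Training table describes a WSD schedule (10% linear warmup, 20% linear decay, 0 floor) and a `(T₀/T)^0.3` token-horizon correction. The following is a minimal sketch of those two pieces, not the code in `completed_adamh.py`. The function names are hypothetical, and the rounding of phase boundaries, the warmup starting at zero, and the treatment of the correction as a plain multiplier are all assumptions. How the correction combines with the rest of the Complete(d)P-style transfer rule is not shown here.

```python
def wsd_multiplier(step: int, total_steps: int,
                   warmup_frac: float = 0.10, decay_frac: float = 0.20) -> float:
    """Fraction of the peak learning rate at a 0-indexed `step`."""
    # ASSUMPTION: phase lengths are rounded down to whole steps.
    warmup_steps = int(warmup_frac * total_steps)
    decay_steps = int(decay_frac * total_steps)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Warmup: linear ramp from 0 toward the peak.
        return step / warmup_steps
    if step < decay_start or decay_steps == 0:
        # Stable phase: hold the peak rate.
        return 1.0
    # Decay: linear ramp down, reaching the 0 floor at total_steps.
    return max(0.0, (total_steps - step) / decay_steps)


def token_horizon_factor(tokens: float, ref_tokens: float = 2.5e9,
                         exponent: float = 0.3) -> float:
    """The (T0 / T)^0.3 correction, with T0 = 2.5B tokens from the card."""
    return (ref_tokens / tokens) ** exponent


# Token count of this checkpoint, from the Training table.
factor = token_horizon_factor(4_728_946_688)
print(f"token-horizon factor: {factor:.3f}")
```

Because this run's token horizon is longer than T₀, the factor is below 1, so under this reading the correction lowers the learning rate relative to the reference setting.
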