ModelHub XC 67570713a5 Initialize project; model provided by the ModelHub XC community
Model: marin-community/delphi-1e22-9.7Bparams-160Btokens-seed42
Source: Original Platform
2026-06-01 07:31:18 +08:00

---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
language:
- en
tags:
- marin
- delphi
- scaling-laws
- pretrained
- research-only
datasets:
- nvidia/Nemotron-CC
- bigcode/starcoderdata
- EleutherAI/proof-pile-2
---
# delphi-1e22-9.7Bparams-160Btokens-seed42
A 9.7B-parameter base model from the **Delphi** scaling suite, trained with the Delphi recipe on 160B tokens at a compute budget of 1 × 10²² FLOPs.
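
As a rough sanity check on these headline numbers, the common C ≈ 6·N·D rule of thumb for dense-transformer training compute can be applied to the exact parameter and token counts listed in the tables further down. This is only a sketch: the 6·N·D approximation is an assumption here, and the Delphi recipe may count FLOPs differently.

```python
# Rough compute check using the common C ~ 6 * N * D approximation.
# Assumption: this rule of thumb is not necessarily how Delphi counts FLOPs.
N_PARAMS = 9_714_698_752      # parameters, from the model details table
N_TOKENS = 160_369_213_440    # training tokens, from the training table

def approx_train_flops(n_params: int, n_tokens: int) -> int:
    """Estimate training FLOPs as 6 * parameters * tokens."""
    return 6 * n_params * n_tokens

flops = approx_train_flops(N_PARAMS, N_TOKENS)
print(f"approx. training compute: {flops:.3e} FLOPs")
print(f"fraction of the nominal 1e22 budget: {flops / 1e22:.2f}")
```

Under this approximation the estimate lands slightly below the nominal 1 × 10²² budget, which is consistent with the name of the checkpoint.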
## About Delphi
Delphi is the Marin team's first open scaling suite, inspired by
[Pythia](https://www.eleuther.ai/papers-blog/pythia-a-suite-for-analyzing-large-language-modelsacross-training-and-scaling).
It has three parts:
- a **scaling recipe** that maps compute budgets to model configurations,
- a **scaling suite** of models trained from that recipe at IsoFLOP budgets
from 3 × 10¹⁸ to 1 × 10²³ FLOPs, and
- a **scaling law** that uses the smaller Delphi models to predict the larger
ones.
A pre-registered forecast from that scaling law predicted the final loss of the
largest Delphi run (1 × 10²³ FLOPs, 25B parameters, 600B tokens) to within
**0.2%**, using **300× less compute** than the training run itself. The same
process forecasts downstream benchmarks (MMLU, HumanEval, and GSM8K) via a
two-step regression that combines compute and observational scaling laws.
See ["Scaling Laws That Extrapolate 300× Past the Fit"](https://openathena.ai/blog/delphi)
for the recipe, fit, and downstream-eval projections. The full set of Delphi
checkpoints — IsoFLOP grid points, held-out optima at 1e21/1e22/1e23 with
multiple random seeds, and training intermediates — lives on
[`marin-community`](https://huggingface.co/marin-community) on the Hub.
This is a research artifact, not a production model.
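
For research use, the checkpoint can presumably be loaded through the standard `transformers` Auto classes, since the front matter lists `library_name: transformers` and `pipeline_tag: text-generation`. The snippet below is a minimal sketch under that assumption; the helper name `complete` and the generation settings are illustrative, not part of the release.

```python
# Minimal loading sketch. Assumptions: the checkpoint loads through the
# standard transformers Auto classes and `torch` is installed.
MODEL_ID = "marin-community/delphi-1e22-9.7Bparams-160Btokens-seed42"

def complete(prompt: str, max_new_tokens: int = 32) -> str:
    """Greedy continuation of `prompt`; downloads the weights on first use."""
    # Imported lazily so the heavy dependencies load only when needed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Because this is a base model, prompts work best as text to be continued, for example `complete("The three laws of thermodynamics are")`, rather than as instructions.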
## Model details
| | |
|---|---|
| Architecture | Qwen 3 (pre-norm decoder, RMSNorm, RoPE, QK-norm with learned scaling, SwiGLU MLPs) |
| Parameters | 9,714,698,752 |
| Hidden size | 3840 |
| Layers | 37 |
| Attention heads | 30 |
| KV heads | 30 (no GQA) |
| Head dim | 128 |
| FFN intermediate | 15360 (MLP ratio 4) |
| Vocab size | 128,256 (Llama 3 tokenizer) |
| Max sequence length | 4096 |
| Position encoding | RoPE (θ = 500000, Llama 3-style scaling) |
| Bias terms | None |
| Tied embeddings | No |
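
The parameter count in the table can be recomputed from the other rows. The sketch below assumes a standard Qwen 3-style layout (bias-free q/k/v/o projections, gate/up/down SwiGLU matrices, two RMSNorms per layer, one QK-norm weight vector each for queries and keys, and a final norm); that layout is an assumption, not something read from the checkpoint.

```python
# Recompute the parameter count from the table's dimensions.
# Assumption: standard Qwen 3-style layout with no bias terms.
HIDDEN, LAYERS, HEADS, KV_HEADS, HEAD_DIM = 3840, 37, 30, 30, 128
FFN, VOCAB = 15360, 128_256

def count_params() -> int:
    q_and_o = 2 * HIDDEN * HEADS * HEAD_DIM       # query and output projections
    k_and_v = 2 * HIDDEN * KV_HEADS * HEAD_DIM    # key and value projections
    mlp = 3 * HIDDEN * FFN                        # gate, up, and down matrices
    norms = 2 * HIDDEN + 2 * HEAD_DIM             # 2 RMSNorms + q/k norm weights
    per_layer = q_and_o + k_and_v + mlp + norms
    embeddings = 2 * VOCAB * HIDDEN               # untied input and output
    final_norm = HIDDEN
    return LAYERS * per_layer + embeddings + final_norm

total = count_params()
print(f"{total:,}")  # → 9,714,698,752
```

The result matches the table exactly, which suggests the listed dimensions and the untied-embedding setting are mutually consistent.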
## Training
| | |
|---|---|
| Compute | 1 × 10²² FLOPs |
| Tokens | 160,369,213,440 |
| Steps | 38,234 |
| Sequence length | 4096 |
| Optimizer | AdamH (Adam with Hyperball) |
| Recipe | Delphi (Complete(d)P-style scaling with `(T₀/T)^0.3` token-horizon LR correction) |
| LR schedule | WSD: 10% linear warmup, 20% linear decay, 0 floor |
| Precision | f32 master params, bf16 compute |
| Parallelism | FSDP |
| Data mixture | Nemotron-CC + StarCoderData + ProofPile 2 |
| Tokenizer | Llama 3 (vocab 128,256) |
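
The WSD (warmup-stable-decay) row above can be read as a multiplier on the peak learning rate. The function below is a minimal sketch of that shape; the name `wsd_multiplier` and the rounding of the phase boundaries are assumptions, and the Marin implementation may differ in such details.

```python
# Sketch of a warmup-stable-decay (WSD) multiplier matching the table row:
# 10% linear warmup, constant plateau, 20% linear decay to a floor of 0.
# Assumption: exact step boundaries and rounding in Marin may differ.
def wsd_multiplier(step: int, total_steps: int,
                   warmup_frac: float = 0.10, decay_frac: float = 0.20) -> float:
    """Fraction of the peak learning rate to apply at `step`."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    decay_steps = max(1, int(decay_frac * total_steps))
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return step / warmup_steps                        # linear ramp from 0
    if step < decay_start:
        return 1.0                                        # stable plateau
    return max(0.0, (total_steps - step) / decay_steps)   # linear ramp to 0

TOTAL_STEPS = 38_234  # from the training table
for s in (0, 1_911, 20_000, 34_411, TOTAL_STEPS):
    print(s, round(wsd_multiplier(s, TOTAL_STEPS), 3))
```

The sampled steps cover the start, mid-warmup, the plateau, mid-decay, and the final step, where the multiplier reaches the 0 floor.
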

[AdamH](https://whenwen.github.io/wd_blog/public/hyperball-part-1.html), Adam
with Hyperball, constrains every projection weight to stay on the
Frobenius-norm sphere it was initialized on, so weight decay has nothing to
regularize away and drops out of the recipe. A Complete(d)P-style transfer rule
with a `(T₀/T)^0.3` correction rescales the learning rate as the token horizon
T grows. Reference constants: B₀ = 64, T₀ = 2.5B tokens, η₀ = 0.00630,
η₀,Adam = 0.000656, ε₀ = 1.85 × 10⁻⁸. Recipe code:
[`experiments/scaling_law_sweeps/completed_adamh.py`](https://github.com/marin-community/marin/blob/main/experiments/scaling_law_sweeps/completed_adamh.py).
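
To get a feel for the size of the token-horizon correction, the factor `(T₀/T)^0.3` can be evaluated for this run's token count. The sketch below computes only that factor; how it combines with η₀, batch-size scaling, and the other reference constants is defined in the recipe code and is not reproduced here. The function name `horizon_correction` is illustrative.

```python
# Token-horizon correction factor (T0 / T) ** 0.3 from the recipe description.
# Assumption: only the correction factor is computed; its combination with the
# reference learning rates and batch size lives in the recipe code.
T0_TOKENS = 2.5e9               # reference token horizon T0
RUN_TOKENS = 160_369_213_440    # this run's token count

def horizon_correction(tokens: float, t0: float = T0_TOKENS,
                       exponent: float = 0.3) -> float:
    """Multiplicative learning-rate correction for a run of `tokens` tokens."""
    return (t0 / tokens) ** exponent

factor = horizon_correction(RUN_TOKENS)
print(f"correction at 160B tokens: {factor:.3f}")
```

For this checkpoint the factor is roughly 0.29, so the correction alone shrinks the learning rate to under a third of its value at the reference horizon.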
## Companion releases
- All Delphi model checkpoints: [`marin-community`](https://huggingface.co/marin-community) on the Hub.
- Plot data behind every figure in the blog post:
[`marin-community/delphi-blog-data`](https://huggingface.co/datasets/marin-community/delphi-blog-data) (one config per figure, with `wandb_url` on every row).
- Pipelines that deterministically reproduce the training mixture from public
Nemotron-CC, StarCoderData, and ProofPile 2 sources: see the Marin repo.
## Evaluation
This checkpoint is part of the Delphi eval suite
([`experiments/exp1337_eval_suite.py`](https://github.com/marin-community/marin/blob/main/experiments/exp1337_eval_suite.py)),
which scores every Delphi run alongside reference open-weights baselines
(Qwen 3, Llama 2/3, OLMo 2, Marin 8B). Following the blog's two-step
forecast, soft metrics (per-choice log-prob for multiple-choice tasks,
bits-per-byte for generative tasks) carry the signal the scaling law is fit on,
and a sigmoid fit on an external model pool maps each soft metric to its hard
metric (accuracy, pass@1, exact-match). Below ~1e21 FLOPs the hard metrics stay
near chance even while the underlying probabilities improve smoothly; that is
expected, and it is exactly why the soft metrics exist.
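
The two metric types described above can be illustrated with a small sketch: a bits-per-byte conversion for the soft side and a sigmoid link for the soft-to-hard mapping. The function names, the sigmoid parameters (`slope`, `midpoint`), and the example numbers are all hypothetical; the actual fit in the eval suite comes from the external model pool.

```python
import math

# Sketch of the soft and hard metric relationship. Assumption: all parameter
# values below are made up for illustration, not taken from the Delphi fit.
def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed negative log-likelihood (nats) into bits per byte."""
    return total_nll_nats / (math.log(2) * n_bytes)

def soft_to_hard(soft: float, slope: float, midpoint: float,
                 chance: float = 0.0, ceiling: float = 1.0) -> float:
    """Sigmoid link from a soft metric to an expected hard-metric score."""
    s = 1.0 / (1.0 + math.exp(-slope * (soft - midpoint)))
    return chance + (ceiling - chance) * s

# Lower bits-per-byte is better, so the slope is negative in this toy example.
bpb = bits_per_byte(total_nll_nats=1200.0, n_bytes=2000)
print(round(bpb, 3), round(soft_to_hard(bpb, slope=-8.0, midpoint=0.7), 3))
```

The flat tails of the sigmoid are what keep hard metrics pinned near chance at small compute while the soft metric keeps moving.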
## Limitations
- Trained on an English-heavy web mixture; no multilingual coverage.
- Pretrained-only — no instruction tuning, RLHF, or safety alignment.
- The Delphi recipe targets compute-optimal training, not inference-cost-aware
overtraining; for inference-heavy deployments, an overtrained smaller model
may be preferable. The blog's "off-optimal training" section quantifies the
penalty.
- This is one checkpoint in a much larger Delphi release; pick the one that
matches your compute / parameter / token regime, or browse the full set at
[`marin-community`](https://huggingface.co/marin-community).
## Citation
```bibtex
@misc{held2026delphi,
title = {Scaling Laws That Extrapolate 300× Past the Fit},
author = {Held, Will and {Marin Community}},
year = {2026},
url = {https://openathena.ai/blog/delphi}
}
```