marin-community/delphi-9e18-669Mparams-2.3Btokens

ModelHub XC 468e185fc7 Initialize project; model provided by the ModelHub XC community

Model: marin-community/delphi-9e18-669Mparams-2.3Btokens
Source: Original Platform

2026-05-28 06:32:16 +08:00

delphi-9e18-669Mparams-2.3Btokens

A 669M-parameter base model from the Delphi scaling suite. Trained at 9 × 10¹⁸ FLOPs on 2.3B tokens with the Delphi recipe.

About Delphi

Delphi is the Marin team's first open scaling suite, inspired by Pythia. It has three parts:

a scaling recipe that maps compute budgets to model configurations,
a scaling suite of models trained from that recipe at IsoFLOP budgets from 3 × 10¹⁸ to 1 × 10²³ FLOPs, and
a scaling law which uses the smaller Delphi models to predict the larger ones.

A pre-registered forecast from that scaling law predicted the final loss of the largest Delphi run (1 × 10²³ FLOPs, 25 B parameters, 600 B tokens) within 0.2%, using 300× less compute than the training run itself. The same process forecasts downstream benchmarks — MMLU, HumanEval, and GSM8K — via a two-step regression combining compute and observational scaling laws.
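As a rough sanity check on the budgets quoted above, the sketch below applies the common rule of thumb of about 6 FLOPs per parameter per token. That approximation is an assumption here, not necessarily the FLOP accounting Delphi uses, and the function name is hypothetical.

```python
def approx_train_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


# Largest Delphi run, using the rounded figures quoted in the text.
largest = approx_train_flops(25e9, 600e9)

# This checkpoint, using the rounded figures from the summary line.
this_ckpt = approx_train_flops(669e6, 2.3e9)

# "300x less compute" than a 1e23-FLOP run implies a forecasting budget
# on the order of 1e23 / 300 FLOPs.
fit_budget = 1e23 / 300

print(f"largest run : {largest:.2e} FLOPs")
print(f"this model  : {this_ckpt:.2e} FLOPs")
print(f"fit budget  : {fit_budget:.2e} FLOPs")
```

Under this approximation the largest run comes out near 1 × 10²³ FLOPs and this checkpoint near 9 × 10¹⁸, consistent with the stated budgets.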

See "Scaling Laws That Extrapolate 300× Past the Fit" for the recipe, fit, and downstream-eval projections. The full set of Delphi checkpoints — IsoFLOP grid points, held-out optima at 1e21/1e22/1e23 with multiple random seeds, and training intermediates — lives on marin-community on the Hub.

This is a research artifact, not a production model.

Model details


Architecture	Qwen 3 (pre-norm decoder, RMSNorm, RoPE, QK-norm with learned scaling, SwiGLU MLPs)
Parameters	669,160,448
Hidden size	1280
Layers	13
Attention heads	10
KV heads	10 (no GQA)
Head dim	128
FFN intermediate	5120 (MLP ratio 4)
Vocab size	128,256 (Llama 3 tokenizer)
Max sequence length	4096
Position encoding	RoPE (θ = 500000, Llama 3-style scaling)
Bias terms	None
Tied embeddings	No
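The parameter count can be reproduced from the table above. The sketch below assumes the usual Qwen 3 layer layout (bias-free Q/K/V/O projections, gate/up/down SwiGLU projections, two RMSNorm gains per block, QK-norm gains of size head_dim, a final RMSNorm, and an untied output head); the function name is hypothetical.

```python
def qwen3_style_param_count(
    vocab: int = 128_256,
    hidden: int = 1280,
    layers: int = 13,
    heads: int = 10,
    kv_heads: int = 10,
    head_dim: int = 128,
    ffn: int = 5120,
    tied_embeddings: bool = False,
) -> int:
    # Attention projections (no biases): Q and O span all heads, K and V span KV heads.
    attn = 2 * hidden * heads * head_dim + 2 * hidden * kv_heads * head_dim
    # SwiGLU MLP: gate, up, and down projections.
    mlp = 3 * hidden * ffn
    # Two RMSNorm gains per block, plus Q-norm and K-norm gains over the head dimension.
    norms = 2 * hidden + 2 * head_dim
    per_layer = attn + mlp + norms
    # Input embedding, plus a separate output head when embeddings are untied.
    embed = vocab * hidden * (1 if tied_embeddings else 2)
    # The trailing `hidden` term is the final RMSNorm before the output head.
    return layers * per_layer + embed + hidden


total = qwen3_style_param_count()
print(f"{total:,}")  # 669,160,448
```

Roughly half of the total (about 328M) sits in the untied embedding and output matrices, which is worth keeping in mind when comparing this model against others by non-embedding parameter count.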

Training


Compute	9 × 10¹⁸ FLOPs
Tokens	2,336,227,328
Steps	35,647
Sequence length	4096
Optimizer	AdamH (Adam with Hyperball)
Recipe	Delphi (Complete(d)P-style scaling with `(T₀/T)^0.3` token-horizon LR correction)
LR schedule	WSD: 10% linear warmup, 20% linear decay, 0 floor
Precision	f32 master params, bf16 compute
Parallelism	FSDP
Data mixture	Nemotron-CC + StarCoderData + ProofPile 2
Tokenizer	Llama 3 (vocab 128,256)
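The WSD (warmup-stable-decay) schedule in the table can be sketched as a multiplier on the peak learning rate. How Delphi rounds the phase boundaries is not stated here, so the integer truncation below is an assumption, and `wsd_multiplier` is a hypothetical name.

```python
def wsd_multiplier(
    step: int,
    total_steps: int,
    warmup_frac: float = 0.10,
    decay_frac: float = 0.20,
) -> float:
    """Fraction of the peak learning rate at `step` under warmup-stable-decay."""
    # Phase lengths; truncating to whole steps is an assumption.
    warmup_steps = max(1, int(warmup_frac * total_steps))
    decay_steps = max(1, int(decay_frac * total_steps))
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Linear warmup from 0 up to the peak.
        return step / warmup_steps
    if step < decay_start:
        # Stable plateau at the peak.
        return 1.0
    # Linear decay from the peak down to a floor of 0 at the last step.
    return max(0.0, (total_steps - step) / decay_steps)


TOTAL_STEPS = 35_647  # from the Training table
for s in (0, 1_000, 20_000, 32_000, TOTAL_STEPS):
    print(s, round(wsd_multiplier(s, TOTAL_STEPS), 3))
```

With these fractions the run spends 70% of its steps at the peak learning rate, then decays linearly to zero over the final 20%.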

AdamH (Adam with Hyperball) constrains every projection weight to stay on the Frobenius-norm sphere it was initialized on, so weight decay has nothing to regularize away and drops out of the recipe. A Complete(d)P-style transfer rule with a (T₀/T)^0.3 correction sets the learning rate as the token horizon T grows. Reference constants: B₀ = 64, T₀ = 2.5 B tokens, η₀ = 0.00630, η₀,Adam = 0.000656, ε₀ = 1.85 × 10⁻⁸. Recipe code: experiments/scaling_law_sweeps/completed_adamh.py.
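The two ideas in this paragraph can be illustrated in isolation. The sketch below is not the AdamH update or the full transfer rule from completed_adamh.py: it only shows (a) rescaling a weight matrix back to its initial Frobenius norm after an arbitrary update, and (b) the (T₀/T)^0.3 token-horizon factor on its own. All function names and the toy update are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def frobenius_norm(w: Matrix) -> float:
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in w for x in row))


def project_to_sphere(w: Matrix, radius: float) -> Matrix:
    """Rescale `w` so that its Frobenius norm equals `radius`."""
    norm = frobenius_norm(w)
    if norm == 0.0:
        return w  # nothing to rescale
    scale = radius / norm
    return [[x * scale for x in row] for row in w]


def token_horizon_factor(tokens: float, t0: float = 2.5e9, exponent: float = 0.3) -> float:
    """The (T0 / T)^0.3 correction factor, taken in isolation."""
    return (t0 / tokens) ** exponent


# Toy projection weight and the radius fixed at initialization.
w_init = [[0.3, -0.4], [0.0, 1.2]]
radius = frobenius_norm(w_init)

# Hypothetical optimizer step, followed by projection back onto the sphere.
update = [[0.05, 0.02], [-0.01, -0.03]]
w_stepped = [[x - u for x, u in zip(rw, ru)] for rw, ru in zip(w_init, update)]
w_next = project_to_sphere(w_stepped, radius)

print(frobenius_norm(w_stepped), frobenius_norm(w_next), radius)
print(token_horizon_factor(2_336_227_328))
```

Because the norm is restored after every step, only the direction of each weight matrix changes during training, which is the sense in which weight decay has nothing left to shrink. For this checkpoint the token horizon is slightly below T₀, so the horizon factor is slightly above 1.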

Companion releases

All Delphi model checkpoints: marin-community on the Hub.
Plot data behind every figure in the blog post: marin-community/delphi-blog-data (one config per figure, with wandb_url on every row).
Pipelines that deterministically reproduce the training mixture from public Nemotron-CC, StarCoderData, and ProofPile 2 sources: see the Marin repo.

Evaluation

This checkpoint is part of the Delphi eval suite (experiments/exp1337_eval_suite.py), which scores every Delphi run alongside reference open-weights baselines (Qwen 3, Llama 2/3, OLMo 2, Marin 8B). Following the blog's two-step forecast, soft metrics (per-choice log-prob for multiple-choice tasks, bits-per-byte for generative tasks) carry the signal the scaling law is fit on, and a sigmoid fit on an external model pool maps soft metric to hard metric (accuracy, pass@1, exact-match). Below ~1e21 FLOPs the hard metrics stay near chance even when the underlying probabilities are improving smoothly; that is expected and is exactly why the soft metrics exist.
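To make the soft-versus-hard distinction concrete, the sketch below computes bits-per-byte from a summed negative log-likelihood and passes a soft metric through a sigmoid link with a chance floor. The exact functional form and fitted parameters used in the blog are not given here; the floor, ceiling, midpoint, and slope values are hypothetical.

```python
import math


def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood in nats to bits per byte of text."""
    return total_nll_nats / (math.log(2) * num_bytes)


def soft_to_hard(soft: float, chance: float, ceiling: float,
                 midpoint: float, slope: float) -> float:
    """Sigmoid link from a soft metric to a hard metric between chance and ceiling."""
    s = 1.0 / (1.0 + math.exp(-slope * (soft - midpoint)))
    return chance + (ceiling - chance) * s


# Hypothetical 4-way multiple-choice task: chance accuracy is 0.25.
link = dict(chance=0.25, ceiling=1.0, midpoint=-1.0, slope=4.0)

# Two small models whose per-choice log-prob differs by 0.5 nats.
acc_weak = soft_to_hard(-3.0, **link)
acc_better = soft_to_hard(-2.5, **link)

print(f"{acc_weak:.4f} -> {acc_better:.4f}")
print(bits_per_byte(total_nll_nats=math.log(2) * 100, num_bytes=100))
```

In the flat tail of the sigmoid, a clear improvement in the soft metric moves the predicted accuracy by only a fraction of a point, which mirrors the near-chance hard metrics described above for small compute budgets.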

Limitations

Trained on an English-heavy web mixture; no multilingual coverage.
Pretrained-only — no instruction tuning, RLHF, or safety alignment.
The Delphi recipe targets compute-optimal training, not inference-cost-aware overtraining; for inference-heavy deployments, an overtrained smaller model may be preferable. The blog's "off-optimal training" section quantifies the penalty.
This is one checkpoint in a much larger Delphi release; pick the one that matches your compute / parameter / token regime, or browse the full set at marin-community.

Citation

@misc{held2026delphi,
  title  = {Scaling Laws That Extrapolate 300× Past the Fit},
  author = {Held, Will and {Marin Community}},
  year   = {2026},
  url    = {https://openathena.ai/blog/delphi}
}
