Initialize project; model provided by the ModelHub XC community

Model: marin-community/delphi-9e18-210Mparams-9.1Btokens
Source: Original Platform
2026-06-04 14:55:15 +08:00
commit 1de90711f8
7 changed files with 2358 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,134 @@
+---
+license: apache-2.0
+library_name: transformers
+pipeline_tag: text-generation
+language:
+- en
+tags:
+- marin
+- delphi
+- scaling-laws
+- pretrained
+- research-only
+datasets:
+- nvidia/Nemotron-CC
+- bigcode/starcoderdata
+- EleutherAI/proof-pile-2
+---
+
+# delphi-9e18-210Mparams-9.1Btokens
+
+A 210M-parameter base model from the **Delphi** scaling suite, trained with a compute budget of 9 × 10¹⁸ FLOPs on 9.1B tokens using the Delphi recipe.
+
+## About Delphi
+
+Delphi is the Marin team's first open scaling suite, inspired by
+[Pythia](https://www.eleuther.ai/papers-blog/pythia-a-suite-for-analyzing-large-language-modelsacross-training-and-scaling).
+It has three parts:
+
+- a **scaling recipe** that maps compute budgets to model configurations,
+- a **scaling suite** of models trained from that recipe at IsoFLOP budgets
+  from 3 × 10¹⁸ to 1 × 10²³ FLOPs, and
+- a **scaling law** which uses the smaller Delphi models to predict the larger
+  ones.
+
+A pre-registered forecast from that scaling law predicted the final loss of the
+largest Delphi run (1 × 10²³ FLOPs, 25 B parameters, 600 B tokens) within
+**0.2%**, using **300× less compute** than the training run itself. The same
+process forecasts downstream benchmarks — MMLU, HumanEval, and GSM8K — via a
+two-step regression combining compute and observational scaling laws.
+
+See ["Scaling Laws That Extrapolate 300× Past the Fit"](https://openathena.ai/blog/delphi)
+for the recipe, fit, and downstream-eval projections. The full set of Delphi
+checkpoints — IsoFLOP grid points, held-out optima at 1e21/1e22/1e23 with
+multiple random seeds, and training intermediates — lives on
+[`marin-community`](https://huggingface.co/marin-community) on the Hub.
+
+This is a research artifact, not a production model.
+
+## Model details
+
+| | |
+|---|---|
+| Architecture | Qwen 3 (pre-norm decoder, RMSNorm, RoPE, QK-norm with learned scaling, SwiGLU MLPs) |
+| Parameters | 210,054,272 |
+| Hidden size | 640 |
+| Layers | 7 |
+| Attention heads | 5 |
+| KV heads | 5 (no GQA) |
+| Head dim | 128 |
+| FFN intermediate | 2560 (MLP ratio 4) |
+| Vocab size | 128,256 (Llama 3 tokenizer) |
+| Max sequence length | 4096 |
+| Position encoding | RoPE (θ = 500000, Llama 3-style scaling) |
+| Bias terms | None |
+| Tied embeddings | No |
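+
+As a sanity check, the parameter count can be reproduced from the table. This is a minimal sketch that assumes the usual Qwen 3 weight layout (separate q/k/v/o projections, gate/up/down MLP matrices, two RMSNorms per layer, per-layer QK-norm scales of size head dim, and one final RMSNorm). The variable names are illustrative and are not taken from the Marin code.
+
+```python
+# Shapes from the "Model details" table.
+hidden, layers = 640, 7
+heads, kv_heads, head_dim = 5, 5, 128
+ffn, vocab = 2560, 128_256
+
+# Input embedding plus a separate output head (embeddings are not tied).
+embed = vocab * hidden
+lm_head = vocab * hidden
+
+# Attention: q and o use `heads`, k and v use `kv_heads`; no bias terms.
+attn = (
+    hidden * heads * head_dim          # q_proj
+    + 2 * hidden * kv_heads * head_dim # k_proj, v_proj
+    + heads * head_dim * hidden        # o_proj
+)
+
+# SwiGLU MLP: gate, up, and down projections.
+mlp = 3 * hidden * ffn
+
+# Assumed norm layout: two RMSNorms per layer plus q-norm and k-norm scales.
+norms = 2 * hidden + 2 * head_dim
+
+per_layer = attn + mlp + norms
+total = embed + lm_head + layers * per_layer + hidden  # + final RMSNorm
+
+print(f"{total:,}")  # 210,054,272
+```
+
+Under these assumptions the two embedding matrices account for about 164M of the 210M parameters, which is typical for a small model paired with a 128K-entry vocabulary.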
+
+## Training
+
+| | |
+|---|---|
+| Compute | 9 × 10¹⁸ FLOPs |
+| Tokens | 9,097,969,664 |
+| Steps | 34,705 |
+| Sequence length | 4096 |
+| Optimizer | AdamH (Adam with Hyperball) |
+| Recipe | Delphi (Complete(d)P-style scaling with `(T₀/T)^0.3` token-horizon LR correction) |
+| LR schedule | WSD: 10% linear warmup, 20% linear decay, 0 floor |
+| Precision | f32 master params, bf16 compute |
+| Parallelism | FSDP |
+| Data mixture | Nemotron-CC + StarCoderData + ProofPile 2 |
+| Tokenizer | Llama 3 (vocab 128,256) |
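+
+The LR schedule row can be read as the following sketch. It assumes the warmup and decay fractions are measured against the total step count, and it uses η₀ = 0.00630 from this card only as a placeholder peak, since the card does not state the peak learning rate actually used for this run. `wsd_lr` is a hypothetical helper, not a Marin function.
+
+```python
+def wsd_lr(step: int, total_steps: int, peak_lr: float,
+           warmup_frac: float = 0.10, decay_frac: float = 0.20) -> float:
+    """Warmup-stable-decay: linear warmup, flat plateau, linear decay to 0."""
+    warmup_steps = int(warmup_frac * total_steps)
+    decay_steps = max(int(decay_frac * total_steps), 1)
+    decay_start = total_steps - decay_steps
+
+    if step < warmup_steps:
+        # Linear ramp from 0 up to the peak.
+        return peak_lr * step / warmup_steps
+    if step < decay_start:
+        # Stable phase at the peak learning rate.
+        return peak_lr
+    # Linear decay that reaches the 0 floor at the final step.
+    remaining = max(total_steps - step, 0)
+    return peak_lr * remaining / decay_steps
+
+
+TOTAL_STEPS = 34_705   # from the Training table
+PLACEHOLDER_PEAK = 0.00630  # assumption: eta_0 used purely for illustration
+
+for step in (0, 1_735, 3_470, 20_000, 27_764, 31_234, 34_705):
+    print(step, wsd_lr(step, TOTAL_STEPS, PLACEHOLDER_PEAK))
+```
+
+With these fractions the run would spend roughly the first 3,470 steps warming up and the last 6,941 steps decaying.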
+
+[AdamH](https://whenwen.github.io/wd_blog/public/hyperball-part-1.html), Adam
+with Hyperball, constrains every projection weight to stay on the Frobenius-
+norm sphere it was initialized on, so weight decay has nothing to regularize
+away and falls out of the recipe. A Complete(d)P-style transfer rule with a
+`(T₀/T)^0.3` correction sets the learning rate as the token horizon T grows. Reference
+constants: B₀ = 64, T₀ = 2.5 B tokens, η₀ = 0.00630, η₀,Adam = 0.000656,
+ε₀ = 1.85 × 10⁻⁸. Recipe code:
+[`experiments/scaling_law_sweeps/completed_adamh.py`](https://github.com/marin-community/marin/blob/main/experiments/scaling_law_sweeps/completed_adamh.py).
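+
+To give a feel for the size of the token-horizon correction, the sketch below evaluates `(T₀/T)^0.3` for this checkpoint's token count. It assumes the factor multiplies a reference learning rate and covers only that one term; the batch-size, width, and depth scaling in the linked recipe code is not reproduced here, so the result is not the learning rate used for this run. `token_horizon_factor` is a hypothetical name.
+
+```python
+REF_TOKENS = 2.5e9          # T0 from the reference constants
+RUN_TOKENS = 9_097_969_664  # T for this checkpoint (Training table)
+REF_LR = 0.00630            # eta_0 from the reference constants
+
+
+def token_horizon_factor(tokens: float, ref_tokens: float = REF_TOKENS,
+                         exponent: float = 0.3) -> float:
+    """Return (T0 / T) ** exponent, the token-horizon LR correction."""
+    return (ref_tokens / tokens) ** exponent
+
+
+factor = token_horizon_factor(RUN_TOKENS)
+
+# Assumption: the correction is applied as a plain multiplier on eta_0.
+print(f"factor = {factor:.3f}")
+print(f"eta_0 * factor = {REF_LR * factor:.5f}")
+```
+
+For this run the factor comes out at roughly 0.68, so a horizon about 3.6× longer than T₀ lowers the learning rate by about a third under this assumption.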
+
+## Companion releases
+
+- All Delphi model checkpoints: [`marin-community`](https://huggingface.co/marin-community) on the Hub.
+- Plot data behind every figure in the blog post:
+  [`marin-community/delphi-blog-data`](https://huggingface.co/datasets/marin-community/delphi-blog-data) (one config per figure, with `wandb_url` on every row).
+- Pipelines that deterministically reproduce the training mixture from public
+  Nemotron-CC, StarCoderData, and ProofPile 2 sources: see the Marin repo.
+
+## Evaluation
+
+This checkpoint is part of the Delphi eval suite
+([`experiments/exp1337_eval_suite.py`](https://github.com/marin-community/marin/blob/main/experiments/exp1337_eval_suite.py)),
+which scores every Delphi run alongside reference open-weights baselines
+(Qwen 3, Llama 2/3, OLMo 2, Marin 8B). Following the blog's two-step
+forecast, soft metrics (per-choice log-prob for multiple-choice tasks,
+bits-per-byte for generative tasks) carry the signal the scaling law is fit on,
+and a sigmoid fitted on an external pool of models maps each soft metric to its hard metric
+(accuracy, pass@1, exact-match). Below ~1e21 FLOPs the hard metrics stay near
+chance even when the underlying probabilities are improving smoothly; that is
+expected and is exactly why the soft metrics exist.
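+
+The second step of that forecast can be illustrated with a small sketch. It assumes a sigmoid that rises from the chance rate to a ceiling as the soft metric improves; the functional form, the function name `hard_from_soft`, and all parameter values are hypothetical and are not the fit reported in the blog post.
+
+```python
+import math
+
+
+def hard_from_soft(soft: float, chance: float, midpoint: float,
+                   slope: float, ceiling: float = 1.0) -> float:
+    """Map a soft metric (higher is better) to an expected hard metric."""
+    gate = 1.0 / (1.0 + math.exp(-slope * (soft - midpoint)))
+    return chance + (ceiling - chance) * gate
+
+
+# Hypothetical 4-way multiple-choice task: chance accuracy is 0.25.
+CHANCE, MIDPOINT, SLOPE = 0.25, -0.5, 6.0
+
+# Per-choice log-prob improving steadily from -2.0 to 0.0.
+for soft in (-2.0, -1.5, -1.0, -0.5, 0.0):
+    acc = hard_from_soft(soft, CHANCE, MIDPOINT, SLOPE)
+    print(f"soft = {soft:+.1f}  accuracy = {acc:.3f}")
+```
+
+In this toy setting the first two soft-metric steps barely move accuracy off the chance rate, which mirrors why small-compute checkpoints such as this one are better compared on soft metrics.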
+
+## Limitations
+
+- Trained on an English-heavy web mixture; no multilingual coverage.
+- Pretrained-only — no instruction tuning, RLHF, or safety alignment.
+- The Delphi recipe targets compute-optimal training, not inference-cost-aware
+  overtraining; for inference-heavy deployments, an overtrained smaller model
+  may be preferable. The blog's "off-optimal training" section quantifies the
+  penalty.
+- This is one checkpoint in a much larger Delphi release; pick the one that
+  matches your compute / parameter / token regime, or browse the full set at
+  [`marin-community`](https://huggingface.co/marin-community).
+
+## Citation
+
+```bibtex
+@misc{held2026delphi,
+  title  = {Scaling Laws That Extrapolate 300× Past the Fit},
+  author = {Held, Will and {Marin Community}},
+  year   = {2026},
+  url    = {https://openathena.ai/blog/delphi}
+}
+```