Initialize project; model provided by the ModelHub XC community
Model: marin-community/delphi-2e19-1.9Bparams-1.5Btokens Source: Original Platform
36
.gitattributes
vendored
Normal file
@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
134
README.md
Normal file
@@ -0,0 +1,134 @@
---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
language:
- en
tags:
- marin
- delphi
- scaling-laws
- pretrained
- research-only
datasets:
- nvidia/Nemotron-CC
- bigcode/starcoderdata
- EleutherAI/proof-pile-2
---

# delphi-2e19-1.9Bparams-1.5Btokens

A 1.9B-parameter base model from the **Delphi** scaling suite. Trained at 2 × 10¹⁹ FLOPs on 1.5B tokens with the Delphi recipe.

## About Delphi

Delphi is the Marin team's first open scaling suite, inspired by
[Pythia](https://www.eleuther.ai/papers-blog/pythia-a-suite-for-analyzing-large-language-modelsacross-training-and-scaling).
It has three parts:

- a **scaling recipe** that maps compute budgets to model configurations,
- a **scaling suite** of models trained from that recipe at IsoFLOP budgets
  from 3 × 10¹⁸ to 1 × 10²³ FLOPs, and
- a **scaling law** which uses the smaller Delphi models to predict the larger
  ones.

A pre-registered forecast from that scaling law predicted the final loss of the
largest Delphi run (1 × 10²³ FLOPs, 25 B parameters, 600 B tokens) within
**0.2%**, using **300× less compute** than the training run itself. The same
process forecasts downstream benchmarks — MMLU, HumanEval, and GSM8K — via a
two-step regression combining compute and observational scaling laws.
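
As a rough cross-check of the compute figures quoted above, the widely used approximation C ≈ 6·N·D (training FLOPs ≈ 6 × parameters × tokens) can be applied to both the largest run and this checkpoint. This is a sketch under an assumption: the source does not say how Delphi counts FLOPs, and the helper name `approx_train_flops` is hypothetical.

```python
def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Estimate dense-transformer training compute with the common 6*N*D rule.

    Assumption: this rule of thumb stands in for Delphi's own FLOP accounting,
    which the model card does not spell out.
    """
    return 6.0 * n_params * n_tokens


# Largest Delphi run as quoted above: 25 B parameters, 600 B tokens.
largest = approx_train_flops(25e9, 600e9)

# This checkpoint, using the rounded figures from the model name:
# 1.9 B parameters, 1.5 B tokens.
this_run = approx_train_flops(1.9e9, 1.5e9)

print(f"largest run ~ {largest:.2e} FLOPs")
print(f"this run    ~ {this_run:.2e} FLOPs")
```

Under that approximation the two estimates come to about 9 × 10²² and 1.7 × 10¹⁹ FLOPs, close to the nominal 1 × 10²³ and 2 × 10¹⁹ budgets.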

See ["Scaling Laws That Extrapolate 300× Past the Fit"](https://openathena.ai/blog/delphi)
for the recipe, fit, and downstream-eval projections. The full set of Delphi
checkpoints — IsoFLOP grid points, held-out optima at 1e21/1e22/1e23 with
multiple random seeds, and training intermediates — lives on
[`marin-community`](https://huggingface.co/marin-community) on the Hub.

This is a research artifact, not a production model.

## Model details

| | |
|---|---|
| Architecture | Qwen 3 (pre-norm decoder, RMSNorm, RoPE, QK-norm with learned scaling, SwiGLU MLPs) |
| Parameters | 1,934,716,160 |
| Hidden size | 2048 |
| Layers | 21 |
| Attention heads | 16 |
| KV heads | 16 (no GQA) |
| Head dim | 128 |
| FFN intermediate | 8192 (MLP ratio 4) |
| Vocab size | 128,256 (Llama 3 tokenizer) |
| Max sequence length | 4096 |
| Position encoding | RoPE (θ = 500000, Llama 3-style scaling) |
| Bias terms | None |
| Tied embeddings | No |
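
The parameter count in the table can be reproduced from the other rows. The sketch below assumes the usual Hugging Face Qwen 3 layout (four bias-free attention projections, per-head RMSNorm scales of size `head_dim` on queries and keys, a three-matrix SwiGLU MLP, two RMSNorms per layer plus a final one); the function name `qwen3_param_count` is hypothetical.

```python
def qwen3_param_count(vocab, hidden, layers, heads, kv_heads, head_dim, ffn, tied=False):
    """Count parameters of a bias-free Qwen 3-style decoder (assumed layout)."""
    embed = vocab * hidden
    lm_head = 0 if tied else vocab * hidden     # separate output matrix when untied

    attn = hidden * heads * head_dim            # q_proj
    attn += 2 * hidden * kv_heads * head_dim    # k_proj and v_proj
    attn += heads * head_dim * hidden           # o_proj
    qk_norm = 2 * head_dim                      # learned scales for q_norm and k_norm
    mlp = 3 * hidden * ffn                      # gate, up and down projections
    norms = 2 * hidden                          # pre-attention and pre-MLP RMSNorm

    per_layer = attn + qk_norm + mlp + norms
    final_norm = hidden
    return embed + lm_head + layers * per_layer + final_norm


total = qwen3_param_count(
    vocab=128_256, hidden=2048, layers=21,
    heads=16, kv_heads=16, head_dim=128, ffn=8192,
)
print(f"{total:,}")
```

With the values from the table this evaluates to 1,934,716,160, matching the Parameters row.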

## Training

| | |
|---|---|
| Compute | 2 × 10¹⁹ FLOPs |
| Tokens | 1,480,491,008 |
| Steps | 45,180 |
| Sequence length | 4096 |
| Optimizer | AdamH (Adam with Hyperball) |
| Recipe | Delphi (Complete(d)P-style scaling with `(T₀/T)^0.3` token-horizon LR correction) |
| LR schedule | WSD: 10% linear warmup, 20% linear decay, 0 floor |
| Precision | f32 master params, bf16 compute |
| Parallelism | FSDP |
| Data mixture | Nemotron-CC + StarCoderData + ProofPile 2 |
| Tokenizer | Llama 3 (vocab 128,256) |
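
The WSD (warmup, stable, decay) schedule in the table can be sketched as a multiplier on the peak learning rate. This is a minimal illustration, not the Marin implementation: it assumes the 10% and 20% fractions are taken over the total step count, that warmup starts from zero, and the name `wsd_multiplier` is hypothetical.

```python
def wsd_multiplier(step: int, total_steps: int,
                   warmup_frac: float = 0.10, decay_frac: float = 0.20) -> float:
    """Return the fraction of the peak learning rate to use at `step`.

    Linear warmup from 0, a flat plateau at 1.0, then linear decay to a 0 floor.
    """
    warmup_steps = round(total_steps * warmup_frac)
    decay_steps = round(total_steps * decay_frac)
    decay_start = total_steps - decay_steps

    if warmup_steps > 0 and step < warmup_steps:
        return step / warmup_steps                      # linear warmup
    if step < decay_start or decay_steps == 0:
        return 1.0                                      # stable phase
    return max(0.0, (total_steps - step) / decay_steps)  # linear decay to 0


TOTAL_STEPS = 45_180  # "Steps" row of the table
for s in (0, 2_259, 4_518, 20_000, 36_144, 40_662, 45_180):
    print(s, round(wsd_multiplier(s, TOTAL_STEPS), 4))
```

Under these assumptions the run would warm up over 4,518 steps, hold the peak rate until step 36,144, and decay over the final 9,036 steps.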

[AdamH](https://whenwen.github.io/wd_blog/public/hyperball-part-1.html), Adam
with Hyperball, constrains every projection weight to stay on the
Frobenius-norm sphere it was initialized on, so weight decay has nothing to
regularize away and falls out of the recipe. A Complete(d)P-style transfer rule with a
`(T₀/T)^0.3` correction sets learning rate as token horizon grows. Reference
constants: B₀ = 64, T₀ = 2.5 B tokens, η₀ = 0.00630, η₀,Adam = 0.000656,
ε₀ = 1.85 × 10⁻⁸. Recipe code:
[`experiments/scaling_law_sweeps/completed_adamh.py`](https://github.com/marin-community/marin/blob/main/experiments/scaling_law_sweeps/completed_adamh.py).
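
Two ideas from the paragraph above can be illustrated in isolation. The sketch below is not the AdamH update or the full Complete(d)P rule: it only shows (a) rescaling a weight matrix back to a fixed Frobenius norm after some update, and (b) the `(T₀/T)^0.3` factor on its own, assuming it multiplies the learning rate and ignoring any batch-size, width, or depth terms. All function names are hypothetical; the linked recipe code is the authoritative version.

```python
import math

T0_TOKENS = 2.5e9  # reference token horizon T0 quoted above
EXPONENT = 0.3     # exponent of the token-horizon correction


def token_horizon_lr_factor(tokens: float, t0: float = T0_TOKENS,
                            exponent: float = EXPONENT) -> float:
    """(T0 / T) ** 0.3: shrinks below 1 as the token horizon T grows past T0."""
    return (t0 / tokens) ** exponent


def frobenius_norm(w):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in w for x in row))


def project_to_sphere(w, radius):
    """Rescale `w` so its Frobenius norm equals `radius` (illustration only)."""
    norm = frobenius_norm(w)
    if norm == 0.0:
        return w
    scale = radius / norm
    return [[x * scale for x in row] for row in w]


w_init = [[0.3, -0.4], [0.0, 1.2]]
radius = frobenius_norm(w_init)          # norm fixed at initialization
w_stepped = [[0.5, -0.4], [0.1, 1.0]]    # stand-in for weights after an update
w_next = project_to_sphere(w_stepped, radius)

print(frobenius_norm(w_next), radius)
print(token_horizon_lr_factor(1_480_491_008))  # this checkpoint's token count
```

Because this checkpoint's token horizon is shorter than T₀, the factor is slightly above 1; for horizons longer than 2.5 B tokens it falls below 1.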

## Companion releases

- All Delphi model checkpoints: [`marin-community`](https://huggingface.co/marin-community) on the Hub.
- Plot data behind every figure in the blog post:
  [`marin-community/delphi-blog-data`](https://huggingface.co/datasets/marin-community/delphi-blog-data) (one config per figure, with `wandb_url` on every row).
- Pipelines that deterministically reproduce the training mixture from public
  Nemotron-CC, StarCoderData, and ProofPile 2 sources: see the Marin repo.

## Evaluation

This checkpoint is part of the Delphi eval suite
([`experiments/exp1337_eval_suite.py`](https://github.com/marin-community/marin/blob/main/experiments/exp1337_eval_suite.py)),
which scores every Delphi run alongside reference open-weights baselines
(Qwen 3, Llama 2/3, OLMo 2, Marin 8B). Following the blog's two-step
forecast, soft metrics (per-choice log-prob for multiple-choice tasks,
bits-per-byte for generative tasks) carry the signal the scaling law is fit on,
and a sigmoid fit on an external model pool maps soft metric to hard metric
(accuracy, pass@1, exact-match). Below ~1e21 FLOPs the hard metrics stay near
chance even when the underlying probabilities are improving smoothly; that is
expected and is exactly why the soft metrics exist.
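
The two-step idea can be sketched as follows. The bits-per-byte conversion is the standard one (total negative log-likelihood in nats divided by ln 2 times the byte count). The sigmoid link, its parameter names, and the numbers used here are purely illustrative assumptions; the blog's actual fit may use a different functional form.

```python
import math


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    return total_nll_nats / (math.log(2) * n_bytes)


def soft_to_hard(soft_metric: float, midpoint: float, slope: float,
                 chance: float = 0.25, ceiling: float = 1.0) -> float:
    """Hypothetical sigmoid link from a soft metric to a hard metric.

    Output ranges from `chance` (e.g. 0.25 for 4-way multiple choice)
    up to `ceiling`.
    """
    z = slope * (soft_metric - midpoint)
    return chance + (ceiling - chance) / (1.0 + math.exp(-z))


# Illustrative values only: the soft metric improves steadily,
# yet the mapped accuracy barely leaves chance until it nears the midpoint.
for log_prob in (-3.0, -2.5, -2.0, -1.0, -0.5):
    acc = soft_to_hard(log_prob, midpoint=-0.5, slope=4.0)
    print(f"soft={log_prob:+.1f}  hard={acc:.4f}")
```

This mirrors the point made above: far from the sigmoid's midpoint, steady gains in the soft metric translate into almost no movement in accuracy.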

## Limitations

- Trained on an English-heavy web mixture; no multilingual coverage.
- Pretrained-only — no instruction tuning, RLHF, or safety alignment.
- The Delphi recipe targets compute-optimal training, not inference-cost-aware
  overtraining; for inference-heavy deployments, an overtrained smaller model
  may be preferable. The blog's "off-optimal training" section quantifies the
  penalty.
- This is one checkpoint in a much larger Delphi release; pick the one that
  matches your compute / parameter / token regime, or browse the full set at
  [`marin-community`](https://huggingface.co/marin-community).

## Citation

```bibtex
@misc{held2026delphi,
  title = {Scaling Laws That Extrapolate 300× Past the Fit},
  author = {Held, Will and {Marin Community}},
  year = {2026},
  url = {https://openathena.ai/blog/delphi}
}
```
118
config.json
Normal file
@@ -0,0 +1,118 @@
{
  "vocab_size": 128256,
  "max_position_embeddings": 4096,
  "hidden_size": 2048,
  "intermediate_size": 8192,
  "num_hidden_layers": 21,
  "num_attention_heads": 16,
  "use_sliding_window": false,
  "sliding_window": null,
  "max_window_layers": 28,
  "num_key_value_heads": 16,
  "head_dim": 128,
  "hidden_act": "silu",
  "initializer_range": 0.02,
  "rms_norm_eps": 1e-05,
  "use_cache": true,
  "rope_theta": 500000,
  "rope_scaling": {
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "attention_bias": false,
  "attention_dropout": 0.0,
  "layer_types": [
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention"
  ],
  "return_dict": true,
  "output_hidden_states": false,
  "torchscript": false,
  "dtype": null,
  "pruned_heads": {},
  "tie_word_embeddings": false,
  "chunk_size_feed_forward": 0,
  "is_encoder_decoder": false,
  "is_decoder": false,
  "cross_attention_hidden_size": null,
  "add_cross_attention": false,
  "tie_encoder_decoder": false,
  "architectures": [
    "Qwen3ForCausalLM"
  ],
  "finetuning_task": null,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "task_specific_params": null,
  "problem_type": null,
  "tokenizer_class": null,
  "prefix": null,
  "bos_token_id": 128000,
  "pad_token_id": null,
  "eos_token_id": 128001,
  "sep_token_id": null,
  "decoder_start_token_id": 128000,
  "max_length": 20,
  "min_length": 0,
  "do_sample": false,
  "early_stopping": false,
  "num_beams": 1,
  "temperature": 1.0,
  "top_k": 50,
  "top_p": 1.0,
  "typical_p": 1.0,
  "repetition_penalty": 1.0,
  "length_penalty": 1.0,
  "no_repeat_ngram_size": 0,
  "encoder_no_repeat_ngram_size": 0,
  "bad_words_ids": null,
  "num_return_sequences": 1,
  "output_scores": false,
  "return_dict_in_generate": false,
  "forced_bos_token_id": null,
  "forced_eos_token_id": null,
  "remove_invalid_values": false,
  "exponential_decay_length_penalty": null,
  "suppress_tokens": null,
  "begin_suppress_tokens": [
    128000,
    128001
  ],
  "num_beam_groups": 1,
  "diversity_penalty": 0.0,
  "_name_or_path": "",
  "transformers_version": "4.57.5",
  "no_bias": true,
  "tf_legacy_loss": false,
  "use_bfloat16": false,
  "model_type": "qwen3",
  "output_attentions": false
}
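
The `rope_scaling` block in this config selects Llama 3-style frequency scaling. The sketch below shows how such a rule is commonly applied to the RoPE inverse frequencies, using the values from the config above. It follows the publicly known Llama 3.1 scheme (leave short wavelengths alone, divide long ones by `factor`, interpolate in between) and should be checked against the `transformers` implementation before relying on it; the function name is hypothetical.

```python
import math


def llama3_scaled_inv_freq(head_dim=128, theta=500000.0, factor=8.0,
                           low_freq_factor=1.0, high_freq_factor=4.0,
                           original_max_pos=8192):
    """Return (base, scaled) RoPE inverse frequencies, one per dimension pair."""
    base = [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]
    low_freq_wavelen = original_max_pos / low_freq_factor
    high_freq_wavelen = original_max_pos / high_freq_factor

    scaled = []
    for inv_freq in base:
        wavelen = 2 * math.pi / inv_freq
        if wavelen < high_freq_wavelen:
            scaled.append(inv_freq)              # high frequency: unchanged
        elif wavelen > low_freq_wavelen:
            scaled.append(inv_freq / factor)     # low frequency: fully scaled
        else:
            # Middle band: blend smoothly between the scaled and unscaled values.
            smooth = (original_max_pos / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            scaled.append((1 - smooth) * inv_freq / factor + smooth * inv_freq)
    return base, scaled


base, scaled = llama3_scaled_inv_freq()
print(len(scaled), scaled[0], scaled[-1])
```

The defaults mirror `rope_theta`, `head_dim`, and the `rope_scaling` fields above, so the result has 64 entries, one per rotated dimension pair.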
3
model-00001-of-00002.safetensors
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6baeb316bb764ad910d03b160b6d5c684b1fd32d60fc9f703e7c623d2ac86e1e
size 4987231936
3
model-00002-of-00002.safetensors
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d066c71cf22dc57621ef5d8ffc80f13495def4fe1626985b882ee10730df7baa
size 2751659344
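
The two Git LFS pointer files above carry only a spec version, a SHA-256 object ID, and a byte size. As a small sanity check, the sketch below parses both pointers and compares the combined shard size with the parameter count from the README at 4 bytes per weight. Treating the shards as float32 is an inference from the sizes, not something the commit states, and `parse_lfs_pointer` is a hypothetical helper.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the `key value` lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


SHARD_1 = """version https://git-lfs.github.com/spec/v1
oid sha256:6baeb316bb764ad910d03b160b6d5c684b1fd32d60fc9f703e7c623d2ac86e1e
size 4987231936
"""

SHARD_2 = """version https://git-lfs.github.com/spec/v1
oid sha256:d066c71cf22dc57621ef5d8ffc80f13495def4fe1626985b882ee10730df7baa
size 2751659344
"""

PARAMS = 1_934_716_160  # "Parameters" row of the README

pointers = [parse_lfs_pointer(p) for p in (SHARD_1, SHARD_2)]
total_bytes = sum(p["size"] for p in pointers)
overhead = total_bytes - 4 * PARAMS  # bytes left over if weights are float32

print(f"total shard bytes: {total_bytes:,}")
print(f"bytes beyond 4 * params: {overhead:,}")
```

The shards total 7,738,891,280 bytes, which is 26,640 bytes more than 4 × 1,934,716,160. That is consistent with float32 weights plus a small amount of file metadata.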
1
model.safetensors.index.json
Normal file
File diff suppressed because one or more lines are too long
16
special_tokens_map.json
Normal file
@@ -0,0 +1,16 @@
{
  "bos_token": {
    "content": "<|begin_of_text|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|end_of_text|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
BIN
tokenizer.json
(Stored with Git LFS)
Normal file
Binary file not shown.
2062
tokenizer_config.json
Normal file
File diff suppressed because it is too large