Initialize project; model provided by the ModelHub XC community

Model: marin-community/delphi-2e19-1.9Bparams-1.5Btokens Source: Original Platform
2026-05-25 02:53:16 +08:00
commit f039554391
9 changed files with 2376 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,36 @@
 *.7z filter=lfs diff=lfs merge=lfs -text
 *.arrow filter=lfs diff=lfs merge=lfs -text
 *.bin filter=lfs diff=lfs merge=lfs -text
 *.bz2 filter=lfs diff=lfs merge=lfs -text
 *.ckpt filter=lfs diff=lfs merge=lfs -text
 *.ftz filter=lfs diff=lfs merge=lfs -text
 *.gz filter=lfs diff=lfs merge=lfs -text
 *.h5 filter=lfs diff=lfs merge=lfs -text
 *.joblib filter=lfs diff=lfs merge=lfs -text
 *.lfs.* filter=lfs diff=lfs merge=lfs -text
 *.mlmodel filter=lfs diff=lfs merge=lfs -text
 *.model filter=lfs diff=lfs merge=lfs -text
 *.msgpack filter=lfs diff=lfs merge=lfs -text
 *.npy filter=lfs diff=lfs merge=lfs -text
 *.npz filter=lfs diff=lfs merge=lfs -text
 *.onnx filter=lfs diff=lfs merge=lfs -text
 *.ot filter=lfs diff=lfs merge=lfs -text
 *.parquet filter=lfs diff=lfs merge=lfs -text
 *.pb filter=lfs diff=lfs merge=lfs -text
 *.pickle filter=lfs diff=lfs merge=lfs -text
 *.pkl filter=lfs diff=lfs merge=lfs -text
 *.pt filter=lfs diff=lfs merge=lfs -text
 *.pth filter=lfs diff=lfs merge=lfs -text
 *.rar filter=lfs diff=lfs merge=lfs -text
 *.safetensors filter=lfs diff=lfs merge=lfs -text
 saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.tar.* filter=lfs diff=lfs merge=lfs -text
 *.tar filter=lfs diff=lfs merge=lfs -text
 *.tflite filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
 *.wasm filter=lfs diff=lfs merge=lfs -text
 *.xz filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,134 @@
 ---
 license: apache-2.0
 library_name: transformers
 pipeline_tag: text-generation
 language:
 - en
 tags:
 - marin
 - delphi
 - scaling-laws
 - pretrained
 - research-only
 datasets:
 - nvidia/Nemotron-CC
 - bigcode/starcoderdata
 - EleutherAI/proof-pile-2
 ---
 # delphi-2e19-1.9Bparams-1.5Btokens
 A 1.9B-parameter base model from the **Delphi** scaling suite. Trained at 2 × 10¹⁹ FLOPs on 1.5B tokens with the Delphi recipe.
 ## About Delphi
 Delphi is the Marin team's first open scaling suite, inspired by
 [Pythia](https://www.eleuther.ai/papers-blog/pythia-a-suite-for-analyzing-large-language-modelsacross-training-and-scaling).
 It has three parts:
 - a **scaling recipe** that maps compute budgets to model configurations,
 - a **scaling suite** of models trained from that recipe at IsoFLOP budgets
  from 3 × 10¹⁸ to 1 × 10²³ FLOPs, and
 - a **scaling law** which uses the smaller Delphi models to predict the larger
  ones.
 A pre-registered forecast from that scaling law predicted the final loss of the
 largest Delphi run (1 × 10²³ FLOPs, 25 B parameters, 600 B tokens) within
 **0.2%**, using **300× less compute** than the training run itself. The same
 process forecasts downstream benchmarks — MMLU, HumanEval, and GSM8K — via a
 two-step regression combining compute and observational scaling laws.
 See ["Scaling Laws That Extrapolate 300× Past the Fit"](https://openathena.ai/blog/delphi)
 for the recipe, fit, and downstream-eval projections. The full set of Delphi
 checkpoints — IsoFLOP grid points, held-out optima at 1e21/1e22/1e23 with
 multiple random seeds, and training intermediates — lives on
 [`marin-community`](https://huggingface.co/marin-community) on the Hub.
 This is a research artifact, not a production model.
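The compute budgets quoted above can be sanity-checked with the common C ≈ 6·N·D rule of thumb (N parameters, D training tokens). This is only a rough sketch: the 6·N·D rule is an assumption introduced here, not necessarily the FLOP accounting the Delphi recipe uses, and `approx_train_flops` is a hypothetical helper name.

```python
# Rough training-compute estimate using the common C ≈ 6 * N * D rule of thumb.
# Assumption: this is NOT necessarily the FLOP accounting used by the Delphi recipe.

def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Return 6 * N * D, a standard estimate of dense-transformer training FLOPs."""
    return 6.0 * n_params * n_tokens

# Largest Delphi run as quoted above: 25 B parameters, 600 B tokens.
largest = approx_train_flops(25e9, 600e9)

# This checkpoint, using the parameter and token counts from its model card.
this_run = approx_train_flops(1_934_716_160, 1_480_491_008)

print(f"largest run : {largest:.2e} FLOPs")
print(f"this run    : {this_run:.2e} FLOPs")
```

Under this approximation both estimates land in the same order of magnitude as, and somewhat below, the nominal 1 × 10²³ and 2 × 10¹⁹ budgets; the gap may reflect terms the rule of thumb ignores, such as attention FLOPs.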
 ## Model details
 | | |
 |---|---|
 | Architecture | Qwen 3 (pre-norm decoder, RMSNorm, RoPE, QK-norm with learned scaling, SwiGLU MLPs) |
 | Parameters | 1,934,716,160 |
 | Hidden size | 2048 |
 | Layers | 21 |
 | Attention heads | 16 |
 | KV heads | 16 (no GQA) |
 | Head dim | 128 |
 | FFN intermediate | 8192 (MLP ratio 4) |
 | Vocab size | 128,256 (Llama 3 tokenizer) |
 | Max sequence length | 4096 |
 | Position encoding | RoPE (θ = 500000, Llama 3-style scaling) |
 | Bias terms | None |
 | Tied embeddings | No |
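The parameter count in the table can be reproduced from the other rows. The sketch below assumes the usual Qwen 3 layout: no bias terms, separate input and output embeddings, two RMSNorm weight vectors per layer plus a final one, and one learned scale vector of size head-dim for each of the query and key norms. The constant and function names are illustrative.

```python
# Recompute the parameter count from the architecture table.
# Assumptions: standard Qwen 3 layout, no biases, untied embeddings,
# QK-norm scales shared across heads (one vector of size HEAD_DIM each).

VOCAB = 128_256
HIDDEN = 2048
LAYERS = 21
HEADS = 16
KV_HEADS = 16
HEAD_DIM = 128
FFN = 8192


def count_params() -> int:
    embed = VOCAB * HIDDEN                  # input token embedding
    lm_head = VOCAB * HIDDEN                # separate output projection (untied)

    q_proj = HIDDEN * HEADS * HEAD_DIM
    k_proj = HIDDEN * KV_HEADS * HEAD_DIM
    v_proj = HIDDEN * KV_HEADS * HEAD_DIM
    o_proj = HEADS * HEAD_DIM * HIDDEN
    attn = q_proj + k_proj + v_proj + o_proj

    qk_norm = 2 * HEAD_DIM                  # learned scales for q-norm and k-norm
    mlp = 3 * HIDDEN * FFN                  # SwiGLU: gate, up, and down projections
    layer_norms = 2 * HIDDEN                # pre-attention and pre-MLP RMSNorm

    per_layer = attn + qk_norm + mlp + layer_norms
    final_norm = HIDDEN
    return embed + lm_head + LAYERS * per_layer + final_norm


total = count_params()
print(f"{total:,}")  # 1,934,716,160
```

The two embedding matrices account for roughly 525 M of the total, which is why the untied-embedding row matters at this scale.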
 ## Training
 | | |
 |---|---|
 | Compute | 2 × 10¹⁹ FLOPs |
 | Tokens | 1,480,491,008 |
 | Steps | 45,180 |
 | Sequence length | 4096 |
 | Optimizer | AdamH (Adam with Hyperball) |
 | Recipe | Delphi (Complete(d)P-style scaling with `(T₀/T)^0.3` token-horizon LR correction) |
 | LR schedule | WSD: 10% linear warmup, 20% linear decay, 0 floor |
 | Precision | f32 master params, bf16 compute |
 | Parallelism | FSDP |
 | Data mixture | Nemotron-CC + StarCoderData + ProofPile 2 |
 | Tokenizer | Llama 3 (vocab 128,256) |
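The WSD (warmup-stable-decay) row above can be sketched as a piecewise-linear function of the step index. This is a minimal illustration, assuming the warmup and decay fractions are taken over the total step count and rounded to whole steps; the function name `wsd_lr` and the peak learning rate passed to it are hypothetical, and the recipe's exact boundary handling may differ.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.10, decay_frac: float = 0.20) -> float:
    """Warmup-stable-decay schedule: linear warmup, flat plateau, linear decay to 0."""
    warmup_steps = round(warmup_frac * total_steps)
    decay_steps = round(decay_frac * total_steps)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    if step < decay_start or decay_steps == 0:
        # Stable phase at the peak learning rate.
        return peak_lr
    # Linear decay from the peak down to a floor of 0 at the final step.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps


# Example with the step count from the table and a placeholder peak of 1.0.
for s in (0, 4518, 20000, 40662, 45180):
    print(s, wsd_lr(s, total_steps=45_180, peak_lr=1.0))
```

With a peak of 1.0 the printed values read directly as a fraction of the peak learning rate, which makes the three phases easy to see.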
 [AdamH](https://whenwen.github.io/wd_blog/public/hyperball-part-1.html), Adam
 with Hyperball, constrains every projection weight to stay on the
 Frobenius-norm sphere it was initialized on, so weight decay has nothing to
 regularize away and falls out of the recipe. A Complete(d)P-style transfer rule
 with a `(T₀/T)^0.3` correction sets the learning rate as the token horizon
 grows. Reference
 constants: B₀ = 64, T₀ = 2.5 B tokens, η₀ = 0.00630, η₀,Adam = 0.000656,
 ε₀ = 1.85 × 10⁻⁸. Recipe code:
 [`experiments/scaling_law_sweeps/completed_adamh.py`](https://github.com/marin-community/marin/blob/main/experiments/scaling_law_sweeps/completed_adamh.py).
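Two ideas from the paragraph above can be illustrated in isolation: keeping a weight matrix on the Frobenius-norm sphere it started on, and the token-horizon factor `(T₀/T)^0.3`. The sketch below is not the AdamH implementation or the full transfer rule. It assumes the constraint can be pictured as a rescale back to the initial norm after an update, and it shows only the horizon factor, not how that factor combines with the batch-size and reference learning-rate constants. All function names are hypothetical; see the linked recipe code for the actual method.

```python
import math


def frobenius_norm(w):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in w for x in row))


def project_to_sphere(w, target_norm):
    """Rescale w so that its Frobenius norm equals target_norm."""
    norm = frobenius_norm(w)
    if norm == 0.0:
        return [row[:] for row in w]  # nothing to rescale
    scale = target_norm / norm
    return [[x * scale for x in row] for row in w]


def token_horizon_factor(tokens, ref_tokens=2.5e9, exponent=0.3):
    """The (T0 / T) ** 0.3 correction, with T0 = 2.5 B tokens as the reference."""
    return (ref_tokens / tokens) ** exponent


# A toy weight matrix and the radius of the sphere it is initialized on.
w_init = [[3.0, 0.0], [0.0, 4.0]]
radius = frobenius_norm(w_init)

# After a hypothetical optimizer update the norm has drifted; rescale it back.
w_updated = [[3.5, 0.2], [-0.1, 4.4]]
w_projected = project_to_sphere(w_updated, radius)
print(frobenius_norm(w_projected))

# Horizon factor for this checkpoint's 1,480,491,008 training tokens.
factor = token_horizon_factor(1_480_491_008)
print(factor)
```

Because this run's token horizon is shorter than T₀, the factor comes out above 1; for horizons longer than T₀ it falls below 1.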
 ## Companion releases
 - All Delphi model checkpoints: [`marin-community`](https://huggingface.co/marin-community) on the Hub.
 - Plot data behind every figure in the blog post:
  [`marin-community/delphi-blog-data`](https://huggingface.co/datasets/marin-community/delphi-blog-data) (one config per figure, with `wandb_url` on every row).
 - Pipelines that deterministically reproduce the training mixture from public
  Nemotron-CC, StarCoderData, and ProofPile 2 sources: see the Marin repo.
 ## Evaluation
 This checkpoint is part of the Delphi eval suite
 ([`experiments/exp1337_eval_suite.py`](https://github.com/marin-community/marin/blob/main/experiments/exp1337_eval_suite.py)),
 which scores every Delphi run alongside reference open-weights baselines
 (Qwen 3, Llama 2/3, OLMo 2, Marin 8B). Following the blog's two-step
 forecast, soft metrics (per-choice log-prob for multiple-choice tasks,
 bits-per-byte for generative tasks) carry the signal the scaling law is fit on,
 and a sigmoid fit on an external model pool maps soft metric to hard metric
 (accuracy, pass@1, exact-match). Below ~1e21 FLOPs the hard metrics stay near
 chance even when the underlying probabilities are improving smoothly; that is
 expected and is exactly why the soft metrics exist.
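The soft-to-hard mapping described above can be pictured with a small sketch. It assumes the conventional definition of bits-per-byte (total negative log-likelihood in nats, converted to bits and divided by the UTF-8 byte count) and a generic sigmoid link bounded by chance and a ceiling. The parameter names `midpoint`, `slope`, `chance`, and `ceiling` and the example values are hypothetical, and the functional form used in the actual eval suite may differ.

```python
import math


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed negative log-likelihood (nats) to bits per byte of text."""
    return total_nll_nats / (math.log(2) * n_bytes)


def soft_to_hard(soft: float, midpoint: float, slope: float,
                 chance: float, ceiling: float = 1.0) -> float:
    """Map a soft metric to a hard metric with a sigmoid between chance and ceiling."""
    s = 1.0 / (1.0 + math.exp(-slope * (soft - midpoint)))
    return chance + (ceiling - chance) * s


# A 4-way multiple-choice task has a chance accuracy of 0.25.
# Placeholder soft-metric values: higher means better in this toy example.
for soft in (-3.0, -2.0, -1.0, 0.0, 1.0):
    hard = soft_to_hard(soft, midpoint=0.0, slope=5.0, chance=0.25)
    print(f"soft={soft:+.1f}  hard={hard:.4f}")
```

In this toy example the soft metric moves steadily from -3 to -1 while the hard metric barely leaves 0.25, which mirrors the near-chance behaviour described for low-compute runs.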
 ## Limitations
 - Trained on an English-heavy web mixture; no multilingual coverage.
 - Pretrained-only — no instruction tuning, RLHF, or safety alignment.
 - The Delphi recipe targets compute-optimal training, not inference-cost-aware
  overtraining; for inference-heavy deployments, an overtrained smaller model
  may be preferable. The blog's "off-optimal training" section quantifies the
  penalty.
 - This is one checkpoint in a much larger Delphi release; pick the one that
  matches your compute / parameter / token regime, or browse the full set at
  [`marin-community`](https://huggingface.co/marin-community).
 ## Citation
 ```bibtex
@misc{held2026delphi,
  title  = {Scaling Laws That Extrapolate 300× Past the Fit},
  author = {Held, Will and {Marin Community}},
  year   = {2026},
  url    = {https://openathena.ai/blog/delphi}
 }
 ```
--- a/config.json
+++ b/config.json
@@ -0,0 +1,118 @@
 {
  "vocab_size": 128256,
  "max_position_embeddings": 4096,
  "hidden_size": 2048,
  "intermediate_size": 8192,
  "num_hidden_layers": 21,
  "num_attention_heads": 16,
  "use_sliding_window": false,
  "sliding_window": null,
  "max_window_layers": 28,
  "num_key_value_heads": 16,
  "head_dim": 128,
  "hidden_act": "silu",
  "initializer_range": 0.02,
  "rms_norm_eps": 1e-05,
  "use_cache": true,
  "rope_theta": 500000,
  "rope_scaling": {
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "attention_bias": false,
  "attention_dropout": 0.0,
  "layer_types": [
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention"
  ],
  "return_dict": true,
  "output_hidden_states": false,
  "torchscript": false,
  "dtype": null,
  "pruned_heads": {},
  "tie_word_embeddings": false,
  "chunk_size_feed_forward": 0,
  "is_encoder_decoder": false,
  "is_decoder": false,
  "cross_attention_hidden_size": null,
  "add_cross_attention": false,
  "tie_encoder_decoder": false,
  "architectures": [
    "Qwen3ForCausalLM"
  ],
  "finetuning_task": null,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "task_specific_params": null,
  "problem_type": null,
  "tokenizer_class": null,
  "prefix": null,
  "bos_token_id": 128000,
  "pad_token_id": null,
  "eos_token_id": 128001,
  "sep_token_id": null,
  "decoder_start_token_id": 128000,
  "max_length": 20,
  "min_length": 0,
  "do_sample": false,
  "early_stopping": false,
  "num_beams": 1,
  "temperature": 1.0,
  "top_k": 50,
  "top_p": 1.0,
  "typical_p": 1.0,
  "repetition_penalty": 1.0,
  "length_penalty": 1.0,
  "no_repeat_ngram_size": 0,
  "encoder_no_repeat_ngram_size": 0,
  "bad_words_ids": null,
  "num_return_sequences": 1,
  "output_scores": false,
  "return_dict_in_generate": false,
  "forced_bos_token_id": null,
  "forced_eos_token_id": null,
  "remove_invalid_values": false,
  "exponential_decay_length_penalty": null,
  "suppress_tokens": null,
  "begin_suppress_tokens": [
    128000,
    128001
  ],
  "num_beam_groups": 1,
  "diversity_penalty": 0.0,
  "_name_or_path": "",
  "transformers_version": "4.57.5",
  "no_bias": true,
  "tf_legacy_loss": false,
  "use_bfloat16": false,
  "model_type": "qwen3",
  "output_attentions": false
 }
--- a/model-00001-of-00002.safetensors
+++ b/model-00001-of-00002.safetensors
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:6baeb316bb764ad910d03b160b6d5c684b1fd32d60fc9f703e7c623d2ac86e1e
 size 4987231936
--- a/model-00002-of-00002.safetensors
+++ b/model-00002-of-00002.safetensors
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:d066c71cf22dc57621ef5d8ffc80f13495def4fe1626985b882ee10730df7baa
 size 2751659344
--- a/model.safetensors.index.json
+++ b/model.safetensors.index.json
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1,16 @@
 {
  "bos_token": {
    "content": "<|begin_of_text|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|end_of_text|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
 }
--- a/tokenizer.json
+++ b/tokenizer.json
--- a/tokenizer_config.json
+++ b/tokenizer_config.json