Initialize project; model provided by the ModelHub XC community

Model: Beebey/smallcoder-303m Source: Original Platform
2026-05-17 01:33:48 +08:00
commit 91411e40ca
14 changed files with 310395 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,35 @@
 *.7z filter=lfs diff=lfs merge=lfs -text
 *.arrow filter=lfs diff=lfs merge=lfs -text
 *.bin filter=lfs diff=lfs merge=lfs -text
 *.bz2 filter=lfs diff=lfs merge=lfs -text
 *.ckpt filter=lfs diff=lfs merge=lfs -text
 *.ftz filter=lfs diff=lfs merge=lfs -text
 *.gz filter=lfs diff=lfs merge=lfs -text
 *.h5 filter=lfs diff=lfs merge=lfs -text
 *.joblib filter=lfs diff=lfs merge=lfs -text
 *.lfs.* filter=lfs diff=lfs merge=lfs -text
 *.mlmodel filter=lfs diff=lfs merge=lfs -text
 *.model filter=lfs diff=lfs merge=lfs -text
 *.msgpack filter=lfs diff=lfs merge=lfs -text
 *.npy filter=lfs diff=lfs merge=lfs -text
 *.npz filter=lfs diff=lfs merge=lfs -text
 *.onnx filter=lfs diff=lfs merge=lfs -text
 *.ot filter=lfs diff=lfs merge=lfs -text
 *.parquet filter=lfs diff=lfs merge=lfs -text
 *.pb filter=lfs diff=lfs merge=lfs -text
 *.pickle filter=lfs diff=lfs merge=lfs -text
 *.pkl filter=lfs diff=lfs merge=lfs -text
 *.pt filter=lfs diff=lfs merge=lfs -text
 *.pth filter=lfs diff=lfs merge=lfs -text
 *.rar filter=lfs diff=lfs merge=lfs -text
 *.safetensors filter=lfs diff=lfs merge=lfs -text
 saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.tar.* filter=lfs diff=lfs merge=lfs -text
 *.tar filter=lfs diff=lfs merge=lfs -text
 *.tflite filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
 *.wasm filter=lfs diff=lfs merge=lfs -text
 *.xz filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,196 @@
 ---
 license: apache-2.0
 language:
 - en
 - code
 library_name: transformers
 pipeline_tag: text-generation
 tags:
 - smallcoder
 - code-llm
 - code-generation
 - sft
 - pretraining
 - tpu
 - 303m
 - trc
 datasets:
 - HuggingFaceFW/fineweb-edu
 - nvidia/Nemotron-Pretraining-SFT-v1
 - bigcode/starcoderdata
 - nvidia/Nemotron-Pretraining-Code-v1
 - HuggingFaceFW/finewiki
 - open-web-math/open-web-math
 - nvidia/Nemotron-CC-Math-v1
 - nvidia/OpenCodeInstruct
 - nvidia/OpenMathInstruct-2
 ---
 # 🧠 SmallCoder (303M)
 **SmallCoder** is a **303M parameter** LLaMA-style language model trained **from scratch** for **code generation** and **algorithmic reasoning**.
 This checkpoint represents a **6B-token Supervised Fine-Tuning (SFT)** run that fixed a critical **End-of-Sequence (EOS) token bug** from earlier versions.
 Despite its compact size, SmallCoder achieves **state-of-the-art (SOTA) coding performance for <500M models**, rivaling 1B–7B parameter LLMs.
 > Trained with support from **Google’s TPU Research Cloud (TRC)** program.
 ---
 ## 🚀 Key Results
 | Model | Size | HumanEval (pass@1) | MBPP (pass@1) |
 |:------|:----:|:------------------:|:--------------:|
 | **SmallCoder (Stage 4.1)** | **303M** | **27.4 %** | **31.0 %** |
 | TinyLlama-1.1B | 1.1B | ~26.4 % | ~27.6 % |
 | MPT-1B-Instruct | 1.0B | ~22.0 % | ~25.0 % |
 | Zephyr-1.3B-SFT | 1.3B | 31.0 % | 34.0 % |
 | Mistral-7B-Base | 7B | 30.5 % | 47.5 % |
 > ⚖️ **SmallCoder nearly matches Mistral 7B on HumanEval while being 23× smaller.**
 ---
 ## 🧬 Model Architecture
 A **LLaMA-type causal decoder** with standard Multi-Head Attention (MHA).
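 ```python
 from transformers import LlamaConfig
 # Architecture hyperparameters, matching the config.json shipped with the checkpoint
 config = LlamaConfig(
     vocab_size=49152,               # StarCoder tokenizer
     hidden_size=768,
     num_hidden_layers=24,
     num_attention_heads=8,          # head_dim = 768 / 8 = 96
     num_key_value_heads=8,          # equal to num_attention_heads, i.e. plain MHA
     intermediate_size=3072,
     max_position_embeddings=1024,
 )
 ```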
 | Parameter         | Value                          |
 | ----------------- | ------------------------------ |
 | Total parameters  | ≈ 303 M                        |
 | Context length    | 1 024 tokens                   |
 | Tokenizer         | `bigcode/starcoder`            |
 | Architecture type | LLaMA (MHA, non-GQA)           |
 | Precision         | bfloat16                       |
 | Optimizer         | AdamW XLA                      |
 | Hardware          | TPU v4-32 (TRC)                |
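The parameter total in the table can be cross-checked from the hyperparameters alone. The sketch below assumes the usual LLaMA layer layout (four attention projections, a gated three-matrix MLP, two RMSNorm weights per layer) together with the `attention_bias: false`, `mlp_bias: false` and `tie_word_embeddings: false` settings recorded in `config.json`. The helper name `llama_param_count` is made up for this illustration.

```python
def llama_param_count(vocab_size, hidden_size, num_layers, intermediate_size,
                      tied_embeddings=False):
    """Count the weights of a bias-free LLaMA-style decoder with plain MHA."""
    embed = vocab_size * hidden_size
    # An untied output head is a second vocab_size x hidden_size matrix.
    lm_head = 0 if tied_embeddings else vocab_size * hidden_size
    attn = 4 * hidden_size * hidden_size        # q, k, v, o projections
    mlp = 3 * hidden_size * intermediate_size   # gate, up, down projections
    norms = 2 * hidden_size                     # two RMSNorm weights per layer
    final_norm = hidden_size
    return embed + lm_head + num_layers * (attn + mlp + norms) + final_norm


total = llama_param_count(vocab_size=49152, hidden_size=768,
                          num_layers=24, intermediate_size=3072)
print(f"{total:,} parameters")            # 302,027,520
print(f"{total * 4:,} bytes in float32")  # 1,208,110,080
```

With `dtype: float32` in `config.json`, the 4-bytes-per-weight figure comes out 24,520 bytes below the 1,208,134,600-byte size on the `model.safetensors` LFS pointer. The remainder is presumably the safetensors header and metadata.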
 ---
 ## 📚 Training Curriculum (4 Stages, 29.8B tokens)
 | Stage                      | Tokens (B) | Dataset                                              | Objective                        |    Loss ↓    |
 | :------------------------- | :--------: | :--------------------------------------------------- | :------------------------------- | :----------: |
 | **1. Linguistic Base**     |     6.3    | FineWeb-Edu                                          | General English grounding        | 10.87 → 2.58 |
 | **2. Code Specialization** |     7.5    | 60 % Nemotron Synthetic Code / 40 % StarCoderData    | Code syntax & reasoning          |  5.00 → 1.25 |
 | **3. Math & Knowledge**    |    10.0    | Nemotron CC-Math / FineWiki / OpenWebMath            | Mathematical reasoning           |  2.77 → 1.55 |
 | **4.1 SFT (EOS Fixed)**    |     6.0    | Nemotron SFT / OpenCodeInstruct / OpenMathInstruct-2 | Instruction-tuned code alignment | 1.73 → ~0.70 |
 > 🧩 Total ≈ 29.8 B tokens of curated curriculum learning.
 ---
 ## 📊 Detailed Benchmarks (Stage 4.1 SFT)
 | Domain          | Benchmark            | Metric       |     Score     |
 | :-------------- | :------------------- | :----------- | :-----------: |
 | **Code**        | HumanEval (0-shot)   | pass@1       |   **27.4 %**  |
 | **Code**        | MBPP (3-shot)        | pass@1       |   **31.0 %**  |
 | **Math**        | GSM8k (0-shot)       | exact match  |   **4.55 %**  |
 | **Knowledge**   | Wikitext-2           | perplexity ↓ |   **167.6**   |
 | **Reasoning**   | ARC (Easy/Challenge) | acc norm     | 34.6 / 22.8 % |
 | **Commonsense** | HellaSwag            | acc norm     |     28.3 %    |
 > `humaneval`/`mbpp` were computed with manual evaluation (`max_new_tokens=512`, `temp=0.2`) due to SFT format truncation issues in `lm-eval`.
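The note above does not say how many samples were drawn per problem or how the scores were aggregated. For reference, a minimal sketch of the standard unbiased pass@k estimator introduced with HumanEval is given here; the function names and the `(n_samples, n_correct)` input format are assumptions for illustration, not the evaluation script used for this model.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Every size-k subset of the samples contains at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Average pass@k over a list of (n_samples, n_correct) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single sample per problem, pass@1 reduces to the fraction of problems whose one completion passes all tests.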
 ---
 ## ⚠️ Known Limitations
 1. **Code-Specialized Model**
   Tuned for Python and algorithmic reasoning. Poor performance on general text, math, and commonsense tasks.
 2. **Short Context**
   Trained on **1 024-token** sequences only. Performance degrades on longer inputs.
 3. **Tokenizer Bias**
   Uses `bigcode/starcoder` BPE vocabulary — optimized for code, not prose.
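Because the context window is 1 024 tokens, the prompt and the completion share one budget. A minimal sketch of keeping a tokenized prompt inside that budget follows; `fit_prompt` is a hypothetical helper, and dropping the oldest tokens first is an assumption chosen so that the trailing `Assistant:` cue survives.

```python
MAX_POSITIONS = 1024   # max_position_embeddings from the model config
MAX_NEW_TOKENS = 512   # completion length used in the usage example


def fit_prompt(token_ids, max_positions=MAX_POSITIONS,
               max_new_tokens=MAX_NEW_TOKENS):
    """Keep the most recent tokens so prompt + completion fits the window."""
    budget = max_positions - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Slicing from the end returns the whole list when it is already short enough.
    return token_ids[-budget:]
```

With the defaults above, at most 512 prompt tokens are kept, leaving the other 512 positions for generation.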
 ---
 ## 💻 Usage Example
 ```python
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
 model_id = "Beebey/smallcoder-303m"
 device = "cuda" if torch.cuda.is_available() else "cpu"
 # Load the tokenizer and the weights, cast to bfloat16 to halve memory use
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)
 # The SFT data uses a plain "User:" / "Assistant:" dialogue format
 prompt = """User: Write a Python function to compute Fibonacci numbers.
 Assistant:"""
 inputs = tokenizer(prompt, return_tensors="pt").to(device)
 # Generation stops at <|endoftext|> or after 512 new tokens, whichever comes first
 with torch.no_grad():
     outputs = model.generate(
         **inputs,
         max_new_tokens=512,
         eos_token_id=tokenizer.eos_token_id,
         pad_token_id=tokenizer.eos_token_id,
     )
 # Decode prompt plus completion, dropping special tokens
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 💡 *Trained using the “User:” / “Assistant:” dialogue format.*
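The `generation_config.json` shipped with this checkpoint lists two stop ids: 0 (`<|endoftext|>`) and 2 (`<fim_middle>` according to `tokenizer_config.json`). The usage example passes only the tokenizer's EOS id, which should take precedence over that list for the call. If raw token ids are post-processed by hand, the sketch below shows one way to cut a completion at the first stop id; `trim_at_stop` is a hypothetical helper, not part of `transformers`.

```python
STOP_IDS = frozenset({0, 2})  # eos_token_id list from generation_config.json


def trim_at_stop(generated_ids, stop_ids=STOP_IDS):
    """Return the ids that precede the first stop token, if one is present."""
    for i, tok in enumerate(generated_ids):
        if tok in stop_ids:
            return generated_ids[:i]
    return generated_ids
```

Apply it to the newly generated ids only, that is, after slicing off the prompt tokens, so that a stop id inside the prompt does not truncate the output.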
 ---
 ## 🧾 Citation
 If you use **SmallCoder (303M)** in your research, please cite:
 ```
@misc{smallcoder303m,
  title  = {SmallCoder: A 303M-parameter Code LLM trained from scratch},
  author = {Da Silva, Ilan},
  year   = {2025},
  url    = {https://huggingface.co/Beebey/smallcoder-303m},
  note   = {Trained with Google TPU Research Cloud (TRC) support}
 }
 ```
 ---
 ## 🙏 Acknowledgements
 This model was trained with support from the **Google TPU Research Cloud (TRC)** program.
 Special thanks to the open datasets that enabled this work:
 FineWeb, StarCoderData, Nemotron, and OpenWebMath.
 ---
 ## 🧩 Summary
 | Category            | Description                 |
 | ------------------- | --------------------------- |
 | **Type**            | Code LLM (LLaMA-style)      |
 | **Parameters**      | 303 M                       |
 | **Training tokens** | ~29.8 B                     |
 | **Specialty**       | Code generation & reasoning |
 | **Context window**  | 1 024 tokens                |
 | **Tokenizer**       | `bigcode/starcoder`         |
 | **License**         | Apache 2.0                  |
 | **Hardware**        | TPU v4 (TRC Program)        |
 ---
 > 🔬 **SmallCoder (303M)** demonstrates that a carefully designed <500M model can achieve near-SOTA coding performance, matching 1B-class models on HumanEval — proving that *efficient, compact, open models* still matter.
--- a/config.json
+++ b/config.json
@@ -0,0 +1,30 @@
 {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "dtype": "float32",
  "eos_token_id": 0,
  "head_dim": 96,
  "hidden_act": "silu",
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 1024,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 8,
  "num_hidden_layers": 24,
  "num_key_value_heads": 8,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "transformers_version": "4.57.1",
  "use_cache": false,
  "vocab_size": 49152
 }
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,11 @@
 {
  "_from_model_config": true,
  "bos_token_id": 0,
  "eos_token_id": [
    0,
    2
  ],
  "pad_token_id": 0,
  "transformers_version": "4.57.1",
  "use_cache": false
 }
--- a/merges.txt
+++ b/merges.txt
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:136e7d81c875d8e266daa7b1cab1733fdc6b6091b85e58ca14033a3d3e724ca6
 size 1208134600
--- a/optimizer.pt
+++ b/optimizer.pt
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:d3af3d153cfaa1b2c9f8ed46763bd233d7be870d2450600de19deab84ebc3275
 size 2416399051
--- a/scheduler.pt
+++ b/scheduler.pt
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:a6b22a6f1b6d08c8ce59ab34d6028388e307761e14a69ec7c3847c234d45d1ae
 size 1465
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1,45 @@
 {
  "additional_special_tokens": [
    "<|endoftext|>",
    "<fim_prefix>",
    "<fim_middle>",
    "<fim_suffix>",
    "<fim_pad>",
    "<filename>",
    "<gh_stars>",
    "<issue_start>",
    "<issue_comment>",
    "<issue_closed>",
    "<jupyter_start>",
    "<jupyter_text>",
    "<jupyter_code>",
    "<jupyter_output>",
    "<empty_output>",
    "<commit_before>",
    "<commit_msg>",
    "<commit_after>",
    "<reponame>"
  ],
  "bos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<|endoftext|>",
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
 }
--- a/tokenizer.json
+++ b/tokenizer.json
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,187 @@
 {
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<fim_prefix>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<fim_middle>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<fim_suffix>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "<fim_pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "<filename>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "6": {
      "content": "<gh_stars>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "7": {
      "content": "<issue_start>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "8": {
      "content": "<issue_comment>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "9": {
      "content": "<issue_closed>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "10": {
      "content": "<jupyter_start>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "11": {
      "content": "<jupyter_text>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "12": {
      "content": "<jupyter_code>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "13": {
      "content": "<jupyter_output>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "14": {
      "content": "<empty_output>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "15": {
      "content": "<commit_before>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "16": {
      "content": "<commit_msg>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "17": {
      "content": "<commit_after>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "18": {
      "content": "<reponame>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<|endoftext|>",
    "<fim_prefix>",
    "<fim_middle>",
    "<fim_suffix>",
    "<fim_pad>",
    "<filename>",
    "<gh_stars>",
    "<issue_start>",
    "<issue_comment>",
    "<issue_closed>",
    "<jupyter_start>",
    "<jupyter_text>",
    "<jupyter_code>",
    "<jupyter_output>",
    "<empty_output>",
    "<commit_before>",
    "<commit_msg>",
    "<commit_after>",
    "<reponame>"
  ],
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "extra_special_tokens": {},
  "model_max_length": 1024,
  "pad_token": "<|endoftext|>",
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>",
  "vocab_size": 49152
 }
--- a/trainer_state.json
+++ b/trainer_state.json
--- a/training_args.bin
+++ b/training_args.bin
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:be299abc6ce1a6876e35cbfe57a1b5e4957b577ba68a0f27cd247ee0090c7814
 size 5841
--- a/vocab.json
+++ b/vocab.json