Initialize project; model provided by the ModelHub XC community

Model: Beebey/smallcoder-303m Source: Original Platform
2026-05-17 01:33:48 +08:00
commit 91411e40ca
14 changed files with 310395 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,35 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,196 @@
+---
+license: apache-2.0
+language:
+- en
+- code
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+- smallcoder
+- code-llm
+- code-generation
+- sft
+- pretraining
+- tpu
+- 303m
+- trc
+datasets:
+- HuggingFaceFW/fineweb-edu
+- nvidia/Nemotron-Pretraining-SFT-v1
+- bigcode/starcoderdata
+- nvidia/Nemotron-Pretraining-Code-v1
+- HuggingFaceFW/finewiki
+- open-web-math/open-web-math
+- nvidia/Nemotron-CC-Math-v1
+- nvidia/OpenCodeInstruct
+- nvidia/OpenMathInstruct-2
+---
+
+# 🧠 SmallCoder (303M)
+
+**SmallCoder** is a **303M parameter** LLaMA-style language model trained **from scratch** for **code generation** and **algorithmic reasoning**.
+
+This checkpoint represents a **6B-token Supervised Fine-Tuning (SFT)** run that fixed a critical **End-of-Sequence (EOS) token bug** from earlier versions.
+
+Despite its compact size, SmallCoder achieves **state-of-the-art (SOTA) coding performance for <500M models**, rivaling 1B–7B parameter LLMs.
+
+> Trained with support from **Google’s TPU Research Cloud (TRC)** program.
+
+---
+
+## 🚀 Key Results
+
+| Model | Size | HumanEval (pass@1) | MBPP (pass@1) |
+|:------|:----:|:------------------:|:--------------:|
+| **SmallCoder (Stage 4.1)** | **303M** | **27.4 %** | **31.0 %** |
+| TinyLlama-1.1B | 1.1B | ~26.4 % | ~27.6 % |
+| MPT-1B-Instruct | 1.0B | ~22.0 % | ~25.0 % |
+| Zephyr-1.3B-SFT | 1.3B | 31.0 % | 34.0 % |
+| Mistral-7B-Base | 7B | 30.5 % | 47.5 % |
+
+> ⚖️ **SmallCoder scores 27.4 % on HumanEval versus 30.5 % for Mistral-7B-Base while being roughly 23× smaller.**
+
+---
+
+## 🧬 Model Architecture
+
+A **LLaMA-type causal decoder** with standard Multi-Head Attention (MHA).
+
+```python
+LlamaConfig(
+  vocab_size=49152,               # StarCoder tokenizer
+  hidden_size=768,
+  num_hidden_layers=24,
+  num_attention_heads=8,
+  num_key_value_heads=8,
+  intermediate_size=3072,
+  max_position_embeddings=1024,
+)
+```
+
+| Parameter         | Value                          |
+| ----------------- | ------------------------------ |
+| Total parameters  | ≈ 303 M                        |
+| Context length    | 1 024 tokens                   |
+| Tokenizer         | `bigcode/starcoder`            |
+| Architecture type | LLaMA (MHA, non-GQA)           |
+| Precision         | bfloat16 (checkpoint: float32) |
+| Optimizer         | AdamW XLA                      |
+| Hardware          | TPU v4-32 (TRC)                |
+
+---
+
+## 📚 Training Curriculum (4 Stages, 29.8B tokens)
+
+| Stage                      | Tokens (B) | Dataset                                              | Objective                        |    Loss ↓    |
+| :------------------------- | :--------: | :--------------------------------------------------- | :------------------------------- | :----------: |
+| **1. Linguistic Base**     |     6.3    | FineWeb-Edu                                          | General English grounding        | 10.87 → 2.58 |
+| **2. Code Specialization** |     7.5    | 60 % Nemotron Synthetic Code / 40 % StarCoderData    | Code syntax & reasoning          |  5.00 → 1.25 |
+| **3. Math & Knowledge**    |    10.0    | Nemotron CC-Math / FineWiki / OpenWebMath            | Mathematical reasoning           |  2.77 → 1.55 |
+| **4.1 SFT (EOS Fixed)**    |     6.0    | Nemotron SFT / OpenCodeInstruct / OpenMathInstruct-2 | Instruction-tuned code alignment | 1.73 → ~0.70 |
+
+> 🧩 Total ≈ 29.8 B tokens of curated curriculum learning.
+
+---
+
+## 📊 Detailed Benchmarks (Stage 4.1 SFT)
+
+| Domain          | Benchmark            | Metric       |     Score     |
+| :-------------- | :------------------- | :----------- | :-----------: |
+| **Code**        | HumanEval (0-shot)   | pass@1       |   **27.4 %**  |
+| **Code**        | MBPP (3-shot)        | pass@1       |   **31.0 %**  |
+| **Math**        | GSM8k (0-shot)       | exact match  |   **4.55 %**  |
+| **Knowledge**   | Wikitext-2           | perplexity ↓ |   **167.6**   |
+| **Reasoning**   | ARC (Easy/Challenge) | acc norm     | 34.6 / 22.8 % |
+| **Commonsense** | HellaSwag            | acc norm     |     28.3 %    |
+
+> `humaneval`/`mbpp` were scored with a manual evaluation script (`max_new_tokens=512`, `temperature=0.2`) because `lm-eval` truncated outputs in the SFT format.
+
+---
+
+## ⚠️ Known Limitations
+
+1. **Code-Specialized Model**
+   The model is tuned for Python and algorithmic reasoning. It performs poorly on general text, math, and commonsense tasks.
+
+2. **Short Context**
+   Trained on **1 024-token** sequences only. Performance degrades on longer inputs.
+
+3. **Tokenizer Bias**
+   Uses `bigcode/starcoder` BPE vocabulary — optimized for code, not prose.
+
+---
+
+## 💻 Usage Example
+
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+model_id = "Beebey/smallcoder-303m"
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)
+
+prompt = """User: Write a Python function to compute Fibonacci numbers.
+Assistant:"""
+inputs = tokenizer(prompt, return_tensors="pt").to(device)
+
+with torch.no_grad():
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens=512,
+        eos_token_id=tokenizer.eos_token_id,
+        pad_token_id=tokenizer.eos_token_id,
+    )
+
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+
+💡 *Trained using the “User:” / “Assistant:” dialogue format.*
+
+---
+
+## 🧾 Citation
+
+If you use **SmallCoder (303M)** in your research, please cite:
+
+```bibtex
+@misc{smallcoder303m,
+  title  = {SmallCoder: A 303M-parameter Code LLM trained from scratch},
+  author = {Da Silva, Ilan},
+  year   = {2025},
+  url    = {https://huggingface.co/Beebey/smallcoder-303m},
+  note   = {Trained with Google TPU Research Cloud (TRC) support}
+}
+```
+
+---
+
+## 🙏 Acknowledgements
+
+This model was trained with support from the **Google TPU Research Cloud (TRC)** program.
+Special thanks to the open datasets that enabled this work:
+FineWeb, StarCoderData, Nemotron, and OpenWebMath.
+
+---
+
+## 🧩 Summary
+
+| Category            | Description                 |
+| ------------------- | --------------------------- |
+| **Type**            | Code LLM (LLaMA-style)      |
+| **Parameters**      | 303 M                       |
+| **Training tokens** | ~29.8 B                     |
+| **Specialty**       | Code generation & reasoning |
+| **Context window**  | 1 024 tokens                |
+| **Tokenizer**       | `bigcode/starcoder`         |
+| **License**         | Apache 2.0                  |
+| **Hardware**        | TPU v4 (TRC Program)        |
+
+---
+
+> 🔬 **SmallCoder (303M)** demonstrates that a carefully designed <500M model can achieve near-SOTA coding performance, matching 1B-class models on HumanEval — proving that *efficient, compact, open models* still matter.
+
+```
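The two headline figures in the README above (the 29.8 B-token curriculum total and the "23× smaller" comparison) can be re-derived from the numbers in its tables. The sketch below is only an arithmetic check: the per-stage token counts are copied from the curriculum table, and the model sizes are the nominal 7 B and 303 M figures, which is an assumption about how the ratio was computed.

```python
from fractions import Fraction

# Per-stage token counts in billions, copied from the README curriculum table.
stage_tokens_b = {
    "1. Linguistic Base": Fraction("6.3"),
    "2. Code Specialization": Fraction("7.5"),
    "3. Math & Knowledge": Fraction("10.0"),
    "4.1 SFT (EOS Fixed)": Fraction("6.0"),
}

# Exact rational sum avoids floating-point rounding noise.
total_b = sum(stage_tokens_b.values())
print(float(total_b))  # 29.8

# Nominal parameter counts (assumed): Mistral-7B-Base vs SmallCoder.
size_ratio = 7_000_000_000 / 303_000_000
print(round(size_ratio, 1))  # 23.1
```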
--- a/config.json
+++ b/config.json
@@ -0,0 +1,30 @@
+{
+  "architectures": [
+    "LlamaForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": 0,
+  "dtype": "float32",
+  "eos_token_id": 0,
+  "head_dim": 96,
+  "hidden_act": "silu",
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "max_position_embeddings": 1024,
+  "mlp_bias": false,
+  "model_type": "llama",
+  "num_attention_heads": 8,
+  "num_hidden_layers": 24,
+  "num_key_value_heads": 8,
+  "pad_token_id": 0,
+  "pretraining_tp": 1,
+  "rms_norm_eps": 1e-05,
+  "rope_scaling": null,
+  "rope_theta": 10000.0,
+  "tie_word_embeddings": false,
+  "transformers_version": "4.57.1",
+  "use_cache": false,
+  "vocab_size": 49152
+}
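As a sanity check on the "303M" figure, the parameter count can be derived from the fields of this config. The sketch below assumes the standard Hugging Face LLaMA layout (bias-free q/k/v/o projections, a gated MLP with three matrices, two RMSNorm weights per layer plus a final one); `count_llama_params` is a hypothetical helper written for this note, not part of `transformers`.

```python
def count_llama_params(vocab_size, hidden_size, num_layers, num_heads,
                       num_kv_heads, head_dim, intermediate_size,
                       tie_word_embeddings):
    """Count weights of a bias-free LLaMA-style decoder (assumed layout)."""
    embed = vocab_size * hidden_size
    # An untied output head stores a second vocab x hidden matrix.
    lm_head = 0 if tie_word_embeddings else vocab_size * hidden_size
    q_and_o = 2 * hidden_size * num_heads * head_dim
    k_and_v = 2 * hidden_size * num_kv_heads * head_dim
    mlp = 3 * hidden_size * intermediate_size  # gate, up, down projections
    norms = 2 * hidden_size                    # two RMSNorm weights per layer
    per_layer = q_and_o + k_and_v + mlp + norms
    return embed + lm_head + num_layers * per_layer + hidden_size  # + final norm


# Values copied from config.json in this commit.
total = count_llama_params(
    vocab_size=49152, hidden_size=768, num_layers=24, num_heads=8,
    num_kv_heads=8, head_dim=96, intermediate_size=3072,
    tie_word_embeddings=False,
)
print(f"{total:,}")  # 302,027,520
```

Under these assumptions the total comes out slightly above 302 M, with about a quarter of it in the untied embedding and output matrices.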
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,11 @@
+{
+  "_from_model_config": true,
+  "bos_token_id": 0,
+  "eos_token_id": [
+    0,
+    2
+  ],
+  "pad_token_id": 0,
+  "transformers_version": "4.57.1",
+  "use_cache": false
+}
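With a list-valued `eos_token_id`, generation stops at whichever listed id is produced first. According to `tokenizer_config.json` in this commit, id 0 is `<|endoftext|>` and id 2 is `<fim_middle>`. The sketch below illustrates that any-of rule on plain Python lists; `truncate_at_eos` is a hypothetical helper for illustration, the sample ids are made up, and nothing here calls `transformers`.

```python
EOS_IDS = {0, 2}  # copied from generation_config.json


def truncate_at_eos(token_ids, eos_ids=EOS_IDS):
    """Return the tokens up to and including the first id found in eos_ids."""
    for i, tok in enumerate(token_ids):
        if tok in eos_ids:
            return token_ids[: i + 1]
    return token_ids  # no stop token seen: keep everything


sample = [512, 873, 2, 901, 0]  # made-up ids; 2 appears before 0
print(truncate_at_eos(sample))  # [512, 873, 2]
```

The README usage snippet passes `eos_token_id=tokenizer.eos_token_id` explicitly, which should narrow the stop set to id 0 for that call.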
--- a/merges.txt
+++ b/merges.txt
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:136e7d81c875d8e266daa7b1cab1733fdc6b6091b85e58ca14033a3d3e724ca6
+size 1208134600
--- a/optimizer.pt
+++ b/optimizer.pt
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d3af3d153cfaa1b2c9f8ed46763bd233d7be870d2450600de19deab84ebc3275
+size 2416399051
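The LFS pointer sizes for `model.safetensors` and `optimizer.pt` line up with float32 storage. The arithmetic below assumes 302,027,520 parameters (derived from the `config.json` fields under the standard LLaMA layout, which is an assumption) at 4 bytes each, and assumes the AdamW state holds two float32 moment tensors per parameter; the leftover bytes are presumably headers and serialization metadata.

```python
N_PARAMS = 302_027_520       # assumed: derived from config.json (LLaMA layout)
BYTES_FP32 = 4

model_file = 1_208_134_600   # size from the model.safetensors LFS pointer
optim_file = 2_416_399_051   # size from the optimizer.pt LFS pointer

weights = N_PARAMS * BYTES_FP32            # float32 weights
adam_moments = 2 * N_PARAMS * BYTES_FP32   # two moment tensors (assumed AdamW)

# Each line prints the expected payload and the unexplained remainder in bytes.
print(weights, model_file - weights)            # 1208110080 24520
print(adam_moments, optim_file - adam_moments)  # 2416220160 178891
```

A bfloat16 checkpoint would be roughly half the size of `model.safetensors`, so the observed size agrees with `"dtype": "float32"` in `config.json`.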
--- a/scheduler.pt
+++ b/scheduler.pt
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a6b22a6f1b6d08c8ce59ab34d6028388e307761e14a69ec7c3847c234d45d1ae
+size 1465
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1,45 @@
+{
+  "additional_special_tokens": [
+    "<|endoftext|>",
+    "<fim_prefix>",
+    "<fim_middle>",
+    "<fim_suffix>",
+    "<fim_pad>",
+    "<filename>",
+    "<gh_stars>",
+    "<issue_start>",
+    "<issue_comment>",
+    "<issue_closed>",
+    "<jupyter_start>",
+    "<jupyter_text>",
+    "<jupyter_code>",
+    "<jupyter_output>",
+    "<empty_output>",
+    "<commit_before>",
+    "<commit_msg>",
+    "<commit_after>",
+    "<reponame>"
+  ],
+  "bos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<|endoftext|>",
+  "unk_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
--- a/tokenizer.json
+++ b/tokenizer.json
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,187 @@
+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<fim_prefix>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "<fim_middle>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<fim_suffix>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "<fim_pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "5": {
+      "content": "<filename>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "6": {
+      "content": "<gh_stars>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "7": {
+      "content": "<issue_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "8": {
+      "content": "<issue_comment>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "9": {
+      "content": "<issue_closed>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "10": {
+      "content": "<jupyter_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "11": {
+      "content": "<jupyter_text>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "12": {
+      "content": "<jupyter_code>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "13": {
+      "content": "<jupyter_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "14": {
+      "content": "<empty_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "15": {
+      "content": "<commit_before>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "16": {
+      "content": "<commit_msg>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "17": {
+      "content": "<commit_after>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "18": {
+      "content": "<reponame>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|endoftext|>",
+    "<fim_prefix>",
+    "<fim_middle>",
+    "<fim_suffix>",
+    "<fim_pad>",
+    "<filename>",
+    "<gh_stars>",
+    "<issue_start>",
+    "<issue_comment>",
+    "<issue_closed>",
+    "<jupyter_start>",
+    "<jupyter_text>",
+    "<jupyter_code>",
+    "<jupyter_output>",
+    "<empty_output>",
+    "<commit_before>",
+    "<commit_msg>",
+    "<commit_after>",
+    "<reponame>"
+  ],
+  "bos_token": "<|endoftext|>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|endoftext|>",
+  "extra_special_tokens": {},
+  "model_max_length": 1024,
+  "pad_token": "<|endoftext|>",
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": "<|endoftext|>",
+  "vocab_size": 49152
+}
--- a/trainer_state.json
+++ b/trainer_state.json
--- a/training_args.bin
+++ b/training_args.bin
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:be299abc6ce1a6876e35cbfe57a1b5e4957b577ba68a0f27cd247ee0090c7814
+size 5841
--- a/vocab.json
+++ b/vocab.json