Initialize project; model provided by the ModelHub XC community

Model: Beebey/smallcoder-303m Source: Original Platform
2026-05-17 01:33:48 +08:00
commit 91411e40ca
14 changed files with 310395 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,196 @@
+---
+license: apache-2.0
+language:
+- en
+- code
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+- smallcoder
+- code-llm
+- code-generation
+- sft
+- pretraining
+- tpu
+- 303m
+- trc
+datasets:
+- HuggingFaceFW/fineweb-edu
+- nvidia/Nemotron-Pretraining-SFT-v1
+- bigcode/starcoderdata
+- nvidia/Nemotron-Pretraining-Code-v1
+- HuggingFaceFW/finewiki
+- open-web-math/open-web-math
+- nvidia/Nemotron-CC-Math-v1
+- nvidia/OpenCodeInstruct
+- nvidia/OpenMathInstruct-2
+---
+
+# 🧠 SmallCoder (303M)
+
+**SmallCoder** is a **303M parameter** LLaMA-style language model trained **from scratch** for **code generation** and **algorithmic reasoning**.
+
+This checkpoint represents a **6B-token Supervised Fine-Tuning (SFT)** run that fixed a critical **End-of-Sequence (EOS) token bug** from earlier versions.
+
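+Despite its compact size, SmallCoder reaches **27.4 % pass@1 on HumanEval** and **31.0 % pass@1 on MBPP**, which the author describes as **state-of-the-art (SOTA) coding performance for <500M models**. These scores are on par with or ahead of some 1B-class models and within about 3 points of Mistral-7B-Base on HumanEval.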
+
+> Trained with support from **Google’s TPU Research Cloud (TRC)** program.
+
+---
+
+## 🚀 Key Results
+
+| Model | Size | HumanEval (pass@1) | MBPP (pass@1) |
+|:------|:----:|:------------------:|:--------------:|
+| **SmallCoder (Stage 4.1)** | **303M** | **27.4 %** | **31.0 %** |
+| TinyLlama-1.1B | 1.1B | ~26.4 % | ~27.6 % |
+| MPT-1B-Instruct | 1.0B | ~22.0 % | ~25.0 % |
+| Zephyr-1.3B-SFT | 1.3B | 31.0 % | 34.0 % |
+| Mistral-7B-Base | 7B | 30.5 % | 47.5 % |
+
+> ⚖️ **SmallCoder comes within about 3 points of Mistral-7B-Base on HumanEval (27.4 % vs 30.5 %) while being roughly 23× smaller.** The gap on MBPP is larger (31.0 % vs 47.5 %).
+
+---
+
+## 🧬 Model Architecture
+
+A **LLaMA-type causal decoder** with standard Multi-Head Attention (MHA).
+
+```python
+LlamaConfig(
+  vocab_size=49152,               # StarCoder tokenizer
+  hidden_size=768,
+  num_hidden_layers=24,
+  num_attention_heads=8,
+  num_key_value_heads=8,
+  intermediate_size=3072,
+  max_position_embeddings=1024,
+)
+```
+
+| Parameter         | Value                          |
+| ----------------- | ------------------------------ |
+| Total parameters  | ≈ 303 M                        |
+| Context length    | 1 024 tokens                   |
+| Tokenizer         | `bigcode/starcoder`            |
+| Architecture type | LLaMA (MHA, non-GQA)           |
+| Precision         | bfloat16                       |
+| Optimizer         | AdamW XLA                      |
+| Hardware          | TPU v4-32 (TRC)                |
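+
+As a rough sanity check on the ≈ 303 M figure, the parameter count can be estimated from the configuration above. This is a minimal sketch rather than the author's code: it assumes the usual LLaMA layout (no bias terms, a gated MLP with three weight matrices, two RMSNorm weights per layer) and an untied output head, none of which the card states explicitly. `estimate_llama_params` is a hypothetical helper name.
+
+```python
+def estimate_llama_params(vocab_size, hidden_size, num_layers,
+                          intermediate_size, tie_embeddings=False):
+    """Estimate the parameter count of a LLaMA-style decoder with full MHA."""
+    embed = vocab_size * hidden_size            # token embedding table
+    attn = 4 * hidden_size * hidden_size        # q, k, v and o projections
+    mlp = 3 * hidden_size * intermediate_size   # gate, up and down projections
+    norms = 2 * hidden_size                     # two RMSNorm weights per layer
+    per_layer = attn + mlp + norms
+    final_norm = hidden_size
+    # Assumption: the output head is a separate matrix unless embeddings are tied.
+    lm_head = 0 if tie_embeddings else vocab_size * hidden_size
+    return embed + num_layers * per_layer + final_norm + lm_head
+
+
+total = estimate_llama_params(49152, 768, 24, 3072)
+print(f"{total / 1e6:.1f} M parameters")  # prints "302.0 M parameters"
+```
+
+Under these assumptions the estimate is about 302 M, close to the stated ≈ 303 M. With tied embeddings it would drop to about 264 M, which suggests the input and output embeddings are separate matrices.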
+
+---
+
+## 📚 Training Curriculum (4 Stages, 29.8B tokens)
+
+| Stage                      | Tokens (B) | Dataset                                              | Objective                        |    Loss ↓    |
+| :------------------------- | :--------: | :--------------------------------------------------- | :------------------------------- | :----------: |
+| **1. Linguistic Base**     |     6.3    | FineWeb-Edu                                          | General English grounding        | 10.87 → 2.58 |
+| **2. Code Specialization** |     7.5    | 60 % Nemotron Synthetic Code / 40 % StarCoderData    | Code syntax & reasoning          |  5.00 → 1.25 |
+| **3. Math & Knowledge**    |    10.0    | Nemotron CC-Math / FineWiki / OpenWebMath            | Mathematical reasoning           |  2.77 → 1.55 |
+| **4.1 SFT (EOS Fixed)**    |     6.0    | Nemotron SFT / OpenCodeInstruct / OpenMathInstruct-2 | Instruction-tuned code alignment | 1.73 → ~0.70 |
+
+> 🧩 Total ≈ 29.8 B tokens of curated curriculum learning.
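+
+The stage totals can be checked with a few lines of arithmetic. The sketch below only re-adds the numbers from the table. The tokens-per-parameter ratio is derived here, is not a figure the card reports, and uses the rounded 303 M parameter count.
+
+```python
+# Stage token counts in billions, copied from the curriculum table.
+stage_tokens_b = {
+    "1. Linguistic Base": 6.3,
+    "2. Code Specialization": 7.5,
+    "3. Math & Knowledge": 10.0,
+    "4.1 SFT (EOS Fixed)": 6.0,
+}
+
+total_b = sum(stage_tokens_b.values())
+
+# Share of the total budget spent in each stage.
+shares = {name: tokens / total_b for name, tokens in stage_tokens_b.items()}
+
+# Derived ratio, assuming the rounded 303 M parameter count.
+tokens_per_param = total_b * 1e9 / 303e6
+
+print(f"total: {total_b:.1f} B tokens")
+for name, share in shares.items():
+    print(f"{name}: {share:.1%}")
+print(f"tokens per parameter: {tokens_per_param:.0f}")
+```
+
+The four stages add up to 29.8 B tokens, or roughly 98 tokens per parameter, with the SFT stage accounting for about a fifth of the total.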
+
+---
+
+## 📊 Detailed Benchmarks (Stage 4.1 SFT)
+
+| Domain          | Benchmark            | Metric       |     Score     |
+| :-------------- | :------------------- | :----------- | :-----------: |
+| **Code**        | HumanEval (0-shot)   | pass@1       |   **27.4 %**  |
+| **Code**        | MBPP (3-shot)        | pass@1       |   **31.0 %**  |
+| **Math**        | GSM8k (0-shot)       | exact match  |   **4.55 %**  |
+| **Knowledge**   | Wikitext-2           | perplexity ↓ |   **167.6**   |
+| **Reasoning**   | ARC (Easy/Challenge) | acc norm     | 34.6 / 22.8 % |
+| **Commonsense** | HellaSwag            | acc norm     |     28.3 %    |
+
+> The `humaneval` and `mbpp` scores were computed with a manual evaluation script (`max_new_tokens=512`, temperature 0.2) instead of `lm-eval`, because of truncation issues with the SFT format in `lm-eval`.
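+
+The card does not say how many samples were drawn per problem in the manual evaluation. As a hedged illustration, the sketch below implements the commonly used unbiased pass@k estimator and averages it over problems. The 45-of-164 split is a made-up example that happens to round to 27.4 % on HumanEval's 164 problems; it is not a reported result, and the function names are hypothetical.
+
+```python
+from math import comb
+
+
+def pass_at_k(n, c, k):
+    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
+    if n - c < k:
+        return 1.0  # every size-k subset contains at least one correct sample
+    return 1.0 - comb(n - c, k) / comb(n, k)
+
+
+def benchmark_pass_at_k(results, k=1):
+    """Average pass@k over problems; results is a list of (n, c) pairs."""
+    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
+
+
+# Hypothetical outcome: one sample per problem, 45 of 164 problems solved.
+results = [(1, 1)] * 45 + [(1, 0)] * 119
+print(f"pass@1 = {benchmark_pass_at_k(results):.1%}")  # prints "pass@1 = 27.4%"
+```
+
+With a single sample per problem, pass@1 reduces to the fraction of problems solved. Drawing several samples at temperature 0.2 and averaging would give a lower-variance estimate of the same quantity.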
+
+---
+
+## ⚠️ Known Limitations
+
+1. **Code-Specialized Model**
+   Tuned for Python and algorithmic reasoning. Performance on general text, math, and commonsense tasks is poor (for example, 4.55 % on GSM8k and 28.3 % on HellaSwag).
+
+2. **Short Context**
+   Trained on **1 024-token** sequences only. Performance degrades on longer inputs.
+
+3. **Tokenizer Bias**
+   Uses the `bigcode/starcoder` BPE vocabulary, which is optimized for code rather than prose.
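+
+Because the context window covers the prompt and the generated tokens together, it can help to budget the prompt length before calling `generate`. The sketch below is one possible way to do that, not something the card prescribes: the helper names are hypothetical, and keeping the most recent tokens (left truncation) is an assumption that would also cut off the leading `User:` marker of a very long prompt.
+
+```python
+CONTEXT_LENGTH = 1024  # max_position_embeddings from the model config
+
+
+def max_prompt_tokens(max_new_tokens, context_length=CONTEXT_LENGTH):
+    """Largest prompt length that still leaves room for max_new_tokens."""
+    if not 0 < max_new_tokens < context_length:
+        raise ValueError("max_new_tokens must be between 1 and context_length - 1")
+    return context_length - max_new_tokens
+
+
+def truncate_prompt_ids(input_ids, max_new_tokens):
+    """Keep only the most recent prompt tokens that fit the budget."""
+    budget = max_prompt_tokens(max_new_tokens)
+    return input_ids[-budget:]
+
+
+print(max_prompt_tokens(512))  # prints 512
+```
+
+With `max_new_tokens=512`, as in the usage example, the prompt can use at most 512 tokens before the total exceeds the 1 024-token training length.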
+
+---
+
+## 💻 Usage Example
+
+```python
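+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+model_id = "Beebey/smallcoder-303m"
+# Use a GPU when one is available; otherwise fall back to the CPU.
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+# Load the tokenizer and the model weights in bfloat16,
+# the precision listed in the architecture table.
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)
+
+# The prompt follows the "User:" / "Assistant:" dialogue format.
+prompt = """User: Write a Python function to compute Fibonacci numbers.
+Assistant:"""
+inputs = tokenizer(prompt, return_tensors="pt").to(device)
+
+# Generate without tracking gradients. Generation stops at the EOS token
+# or after 512 new tokens, whichever comes first.
+with torch.no_grad():
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens=512,
+        eos_token_id=tokenizer.eos_token_id,
+        pad_token_id=tokenizer.eos_token_id,  # reuse EOS as the padding token
+    )
+
+# The decoded text contains the prompt followed by the model's reply.
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))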
+```
+
+💡 *The model was trained on the “User:” / “Assistant:” dialogue format, so prompts should follow it.*
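+
+A small pair of helpers can keep prompts in this format and separate the reply from the echoed prompt. This is a sketch under assumptions: the card only shows the single-turn prompt from the usage example, so the exact whitespace convention and the choice to cut the output at a following `User:` turn are guesses, and `build_prompt` and `extract_reply` are hypothetical names.
+
+```python
+def build_prompt(instruction):
+    """Wrap an instruction in the User:/Assistant: format from the usage example."""
+    return f"User: {instruction.strip()}\nAssistant:"
+
+
+def extract_reply(generated_text, prompt):
+    """Return the assistant's reply without the echoed prompt."""
+    # Decoded output normally starts with the prompt; fall back to the
+    # full text if the round trip through the tokenizer changed it.
+    if generated_text.startswith(prompt):
+        reply = generated_text[len(prompt):]
+    else:
+        reply = generated_text
+    # Assumption: drop anything after a generated follow-up "User:" turn.
+    reply = reply.split("\nUser:", 1)[0]
+    return reply.strip()
+
+
+prompt = build_prompt("Write a Python function to compute Fibonacci numbers.")
+print(prompt)
+```
+
+`extract_reply` would be applied to the string returned by `tokenizer.decode(...)` in the usage example, passing the same `prompt` that was tokenized.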
+
+---
+
+## 🧾 Citation
+
+If you use **SmallCoder (303M)** in your research, please cite:
+
+```bibtex
+@misc{smallcoder303m,
+  title  = {SmallCoder: A 303M-parameter Code LLM trained from scratch},
+  author = {Da Silva, Ilan},
+  year   = {2025},
+  url    = {https://huggingface.co/Beebey/smallcoder-303m},
+  note   = {Trained with Google TPU Research Cloud (TRC) support}
+}
+```
+
+---
+
+## 🙏 Acknowledgements
+
+This model was trained with support from the **Google TPU Research Cloud (TRC)** program.
+Special thanks to the maintainers of the open datasets that enabled this work:
+FineWeb-Edu, FineWiki, StarCoderData, the Nemotron datasets, and OpenWebMath.
+
+---
+
+## 🧩 Summary
+
+| Category            | Description                 |
+| ------------------- | --------------------------- |
+| **Type**            | Code LLM (LLaMA-style)      |
+| **Parameters**      | 303 M                       |
+| **Training tokens** | ~29.8 B                     |
+| **Specialty**       | Code generation & reasoning |
+| **Context window**  | 1 024 tokens                |
+| **Tokenizer**       | `bigcode/starcoder`         |
+| **License**         | Apache 2.0                  |
+| **Hardware**        | TPU v4 (TRC Program)        |
+
+---
+
+> 🔬 **SmallCoder (303M)** shows that a carefully trained <500M model can match some 1B-class models on HumanEval (27.4 % pass@1 vs ~26.4 % for TinyLlama-1.1B), which suggests that *efficient, compact, open models* still matter.
+