ModelHub XC 91411e40ca Initialize project; model provided by the ModelHub XC community
Model: Beebey/smallcoder-303m
Source: Original Platform
2026-05-17 01:33:48 +08:00

---
license: apache-2.0
language:
- en
- code
library_name: transformers
pipeline_tag: text-generation
tags:
- smallcoder
- code-llm
- code-generation
- sft
- pretraining
- tpu
- 303m
- trc
datasets:
- HuggingFaceFW/fineweb-edu
- nvidia/Nemotron-Pretraining-SFT-v1
- bigcode/starcoderdata
- nvidia/Nemotron-Pretraining-Code-v1
- HuggingFaceFW/finewiki
- open-web-math/open-web-math
- nvidia/Nemotron-CC-Math-v1
- nvidia/OpenCodeInstruct
- nvidia/OpenMathInstruct-2
---
# 🧠 SmallCoder (303M)
**SmallCoder** is a **303M parameter** LLaMA-style language model trained **from scratch** for **code generation** and **algorithmic reasoning**.
This checkpoint represents a **6B-token Supervised Fine-Tuning (SFT)** run that fixed a critical **End-of-Sequence (EOS) token bug** from earlier versions.
Despite its compact size, SmallCoder achieves **state-of-the-art (SOTA) coding performance for <500M models**, rivaling 1B to 7B parameter LLMs.
> Trained with support from **Google's TPU Research Cloud (TRC)** program.
---
## 🚀 Key Results
| Model | Size | HumanEval (pass@1) | MBPP (pass@1) |
|:------|:----:|:------------------:|:--------------:|
| **SmallCoder (Stage 4.1)** | **303M** | **27.4 %** | **31.0 %** |
| TinyLlama-1.1B | 1.1B | ~26.4 % | ~27.6 % |
| MPT-1B-Instruct | 1.0B | ~22.0 % | ~25.0 % |
| Zephyr-1.3B-SFT | 1.3B | 31.0 % | 34.0 % |
| Mistral-7B-Base | 7B | 30.5 % | 47.5 % |
> ⚖️ **SmallCoder nearly matches Mistral 7B on HumanEval while being 23× smaller.**
---
## 🧬 Model Architecture
A **LLaMA-type causal decoder** with standard Multi-Head Attention (MHA).
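```python
from transformers import LlamaConfig

# Architecture hyperparameters for SmallCoder (303M)
config = LlamaConfig(
    vocab_size=49152,              # StarCoder tokenizer
    hidden_size=768,
    num_hidden_layers=24,
    num_attention_heads=8,
    num_key_value_heads=8,         # equal to num_attention_heads, so plain MHA (no GQA)
    intermediate_size=3072,
    max_position_embeddings=1024,  # context length in tokens
)
```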
| Parameter | Value |
| ----------------- | ------------------------------ |
| Total parameters | 303 M |
| Context length | 1,024 tokens |
| Tokenizer | `bigcode/starcoder` |
| Architecture type | LLaMA (MHA, non-GQA) |
| Precision | bfloat16 |
| Optimizer | AdamW XLA |
| Hardware | TPU v4-32 (TRC) |
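
The stated size can be sanity-checked from the configuration alone. The sketch below counts weights for a generic LLaMA-style decoder. It assumes no bias terms, one RMSNorm scale vector per norm, and an output head that is not tied to the input embedding; none of these are stated in this README, and `llama_param_count` is a hypothetical helper, not part of any library. Under those assumptions the count comes out at roughly 302M, close to the 303M quoted above; with tied embeddings it would be about 264M.

```python
def llama_param_count(vocab_size, hidden_size, num_layers, intermediate_size,
                      num_heads, num_kv_heads, tie_embeddings=False):
    """Count weights in a LLaMA-style decoder (assumes no biases)."""
    head_dim = hidden_size // num_heads
    kv_dim = head_dim * num_kv_heads

    embed = vocab_size * hidden_size
    # q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim
    attn = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # gate_proj, up_proj and down_proj
    mlp = 3 * hidden_size * intermediate_size
    # two RMSNorm scale vectors per layer
    norms = 2 * hidden_size

    per_layer = attn + mlp + norms
    final_norm = hidden_size
    lm_head = 0 if tie_embeddings else embed  # assumption: untied by default
    return embed + num_layers * per_layer + final_norm + lm_head


# Values taken from the LlamaConfig shown above
total = llama_param_count(49152, 768, 24, 3072, 8, 8)
tied = llama_param_count(49152, 768, 24, 3072, 8, 8, tie_embeddings=True)
print(f"untied: {total / 1e6:.1f}M parameters, tied: {tied / 1e6:.1f}M parameters")
```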
---
## 📚 Training Curriculum (4 Stages, 29.8B tokens)
| Stage | Tokens (B) | Dataset | Objective | Loss (start → end) |
| :------------------------- | :--------: | :--------------------------------------------------- | :------------------------------- | :----------: |
| **1. Linguistic Base** | 6.3 | FineWeb-Edu | General English grounding | 10.87 → 2.58 |
| **2. Code Specialization** | 7.5 | 60 % Nemotron Synthetic Code / 40 % StarCoderData | Code syntax & reasoning | 5.00 → 1.25 |
| **3. Math & Knowledge** | 10.0 | Nemotron CC-Math / FineWiki / OpenWebMath | Mathematical reasoning | 2.77 → 1.55 |
| **4.1 SFT (EOS Fixed)** | 6.0 | Nemotron SFT / OpenCodeInstruct / OpenMathInstruct-2 | Instruction-tuned code alignment | 1.73 → ~0.70 |
> 🧩 Total 29.8 B tokens of curated curriculum learning.
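
The token budget in the table can be checked with a few lines of arithmetic. The sketch below sums the four stages and prints each stage's share of the total. It also splits Stage 2 into per-dataset budgets, assuming the 60 % / 40 % ratio is measured in tokens, which the table does not state explicitly.

```python
# Token budgets per stage, in billions, copied from the curriculum table
stage_tokens = {
    "1. Linguistic Base": 6.3,
    "2. Code Specialization": 7.5,
    "3. Math & Knowledge": 10.0,
    "4.1 SFT (EOS Fixed)": 6.0,
}

total = sum(stage_tokens.values())
for name, tokens in stage_tokens.items():
    print(f"{name:<24} {tokens:5.1f}B  {tokens / total:6.1%}")
print(f"{'Total':<24} {total:5.1f}B")

# Stage 2 split, assuming the 60 % / 40 % mix is a token ratio
stage2 = stage_tokens["2. Code Specialization"]
nemotron_code = 0.60 * stage2
starcoderdata = 0.40 * stage2
print(f"Stage 2: {nemotron_code:.1f}B Nemotron code, {starcoderdata:.1f}B StarCoderData")
```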
---
## 📊 Detailed Benchmarks (Stage 4.1 SFT)
| Domain | Benchmark | Metric | Score |
| :-------------- | :------------------- | :----------- | :-----------: |
| **Code** | HumanEval (0-shot) | pass@1 | **27.4 %** |
| **Code** | MBPP (3-shot) | pass@1 | **31.0 %** |
| **Math** | GSM8k (0-shot) | exact match | **4.55 %** |
| **Knowledge** | Wikitext-2 | perplexity | **167.6** |
| **Reasoning** | ARC (Easy/Challenge) | acc_norm | 34.6 / 22.8 % |
| **Commonsense** | HellaSwag | acc_norm | 28.3 % |
> The `humaneval` and `mbpp` scores come from a manual evaluation (`max_new_tokens=512`, `temperature=0.2`) because the SFT output format caused truncation issues in `lm-eval`.
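
The manual evaluation script is not included in this README, so the following is only a sketch of how a pass@k score is commonly computed, not the author's harness. It uses the standard unbiased estimator, `1 - C(n-c, k) / C(n, k)`, for `n` samples per problem with `c` passing, averaged over problems. The names `pass_at_k` and `benchmark_score` are hypothetical, and the counts in the example are illustrative: HumanEval has 164 problems, and 45 solved with one sample each happens to give 27.4 %.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(correct_counts, n: int, k: int = 1) -> float:
    """Average the per-problem estimates over the whole benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Illustration only (not the author's per-problem data):
# 45 of 164 problems solved, one sample per problem
counts = [1] * 45 + [0] * 119
score = benchmark_score(counts, n=1, k=1)
print(f"pass@1 = {score:.1%}")
```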
---
## ⚠️ Known Limitations
1. **Code-Specialized Model**
   Tuned for Python and algorithmic reasoning; it performs poorly on general text, math, and commonsense tasks.
2. **Short Context**
   Trained on **1,024-token** sequences only, so performance degrades on longer inputs.
3. **Tokenizer Bias**
   Uses the `bigcode/starcoder` BPE vocabulary, which is optimized for code rather than prose.
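
One practical consequence of the short context is that the prompt and the completion share the same 1,024-token window. The sketch below trims a list of prompt token ids so that the prompt plus `max_new_tokens` fits. `fit_to_context` is a hypothetical helper, and keeping the tail of the prompt (the most recent tokens) is an assumption about what matters most; other truncation strategies may suit a given task better.

```python
def fit_to_context(prompt_ids, max_new_tokens, context_length=1024):
    """Trim prompt token ids so prompt + generated tokens fit in the context window.

    Keeps the last tokens of the prompt (an assumption: the end of the prompt,
    e.g. the final instruction, is usually the most relevant part).
    """
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Slicing from the end returns the whole list when it is already short enough
    return list(prompt_ids[-budget:])


# With max_new_tokens=512, at most 512 prompt tokens remain
long_prompt = list(range(2000))   # stand-in for real token ids
trimmed = fit_to_context(long_prompt, max_new_tokens=512)
print(len(trimmed), trimmed[0], trimmed[-1])
```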
---
## 💻 Usage Example
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Beebey/smallcoder-303m"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and the model weights in bfloat16
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

# The SFT stage used a "User:" / "Assistant:" dialogue format
prompt = """User: Write a Python function to compute Fibonacci numbers.
Assistant:"""
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate without tracking gradients; stop at the EOS token
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
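
The prompt in the usage example follows a plain `User:` / `Assistant:` layout. A small pair of helpers can build that prompt and strip it from the decoded output. This is a sketch under assumptions: `build_prompt` and `extract_reply` are hypothetical names, only single-turn prompts are handled, and the exact whitespace used during SFT is not documented here beyond the example above.

```python
def build_prompt(instruction: str) -> str:
    """Format a single-turn request the same way as the usage example."""
    return f"User: {instruction.strip()}\nAssistant:"


def extract_reply(decoded: str) -> str:
    """Return the text after the first 'Assistant:' marker.

    Also cuts the reply at a following 'User:' line, in case the model
    keeps going and starts a new turn (an assumption about failure modes).
    """
    reply = decoded.split("Assistant:", 1)[-1]
    reply = reply.split("\nUser:", 1)[0]
    return reply.strip()


prompt = build_prompt("Write a Python function to compute Fibonacci numbers.")
print(prompt)

# Stand-in for tokenizer.decode(...) output, made up for illustration
decoded = prompt + " def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)\nUser: thanks"
print(extract_reply(decoded))
```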
💡 *The model was trained using the “User:” / “Assistant:” dialogue format.*

---
## 🧾 Citation
If you use **SmallCoder (303M)** in your research, please cite:
```bibtex
@misc{smallcoder303m,
  title  = {SmallCoder: A 303M-parameter Code LLM trained from scratch},
  author = {Da Silva, Ilan},
  year   = {2025},
  url    = {https://huggingface.co/Beebey/smallcoder-303m},
  note   = {Trained with Google TPU Research Cloud (TRC) support}
}
```
---
## 🙏 Acknowledgements
This model was trained with support from the **Google TPU Research Cloud (TRC)** program.

Special thanks to the open datasets that enabled this work: FineWeb, StarCoderData, Nemotron, and OpenWebMath.

---
## 🧩 Summary
| Category | Description |
| ------------------- | --------------------------- |
| **Type** | Code LLM (LLaMA-style) |
| **Parameters** | 303 M |
| **Training tokens** | ~29.8 B |
| **Specialty** | Code generation & reasoning |
| **Context window** | 1,024 tokens |
| **Tokenizer** | `bigcode/starcoder` |
| **License** | Apache 2.0 |
| **Hardware** | TPU v4 (TRC Program) |
---
> 🔬 **SmallCoder (303M)** demonstrates that a carefully designed <500M model can achieve near-SOTA coding performance, matching 1B-class models on HumanEval, which suggests that *efficient, compact, open models* still matter.