---
license: apache-2.0
language:
- en
- code
library_name: transformers
pipeline_tag: text-generation
tags:
- smallcoder
- code-llm
- code-generation
- sft
- pretraining
- tpu
- 303m
- trc
datasets:
- HuggingFaceFW/fineweb-edu
- nvidia/Nemotron-Pretraining-SFT-v1
- bigcode/starcoderdata
- nvidia/Nemotron-Pretraining-Code-v1
- HuggingFaceFW/finewiki
- open-web-math/open-web-math
- nvidia/Nemotron-CC-Math-v1
- nvidia/OpenCodeInstruct
- nvidia/OpenMathInstruct-2
---

# 🧠 SmallCoder (303M)

**SmallCoder** is a **303M parameter** LLaMA-style language model trained **from scratch** for **code generation** and **algorithmic reasoning**.

This checkpoint represents a **6B-token Supervised Fine-Tuning (SFT)** run that fixed a critical **End-of-Sequence (EOS) token bug** from earlier versions.

Despite its compact size, SmallCoder reaches **27.4 % pass@1 on HumanEval** and **31.0 % pass@1 on MBPP**, claimed **state-of-the-art (SOTA) coding performance for <500M models** and competitive with the 1B–7B parameter LLMs listed below.

> Trained with support from **Google’s TPU Research Cloud (TRC)** program.

---

## 🚀 Key Results

| Model | Size | HumanEval (pass@1) | MBPP (pass@1) |
|:------|:----:|:------------------:|:--------------:|
| **SmallCoder (Stage 4.1)** | **303M** | **27.4 %** | **31.0 %** |
| TinyLlama-1.1B | 1.1B | ~26.4 % | ~27.6 % |
| MPT-1B-Instruct | 1.0B | ~22.0 % | ~25.0 % |
| Zephyr-1.3B-SFT | 1.3B | 31.0 % | 34.0 % |
| Mistral-7B-Base | 7B | 30.5 % | 47.5 % |

> ⚖️ **SmallCoder scores within about 3 points of Mistral-7B-Base on HumanEval (27.4 % vs 30.5 %) while being roughly 23× smaller.**

---

## 🧬 Model Architecture

A **LLaMA-type causal decoder** with standard Multi-Head Attention (MHA).

```python
LlamaConfig(
  vocab_size=49152,               # StarCoder tokenizer
  hidden_size=768,
  num_hidden_layers=24,
  num_attention_heads=8,
  num_key_value_heads=8,
  intermediate_size=3072,
  max_position_embeddings=1024,
)
```

| Parameter         | Value                          |
| ----------------- | ------------------------------ |
| Total parameters  | ≈ 303 M                        |
| Context length    | 1 024 tokens                   |
| Tokenizer         | `bigcode/starcoder`            |
| Architecture type | LLaMA (MHA, non-GQA)           |
| Precision         | bfloat16                       |
| Optimizer         | AdamW XLA                      |
| Hardware          | TPU v4-32 (TRC)                |
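
As a rough cross-check of the parameter count, the sketch below tallies weights from the configuration values above. It assumes the `LlamaConfig` defaults that the config does not spell out: untied input/output embeddings, no bias terms, a gated MLP with three projections, and one RMSNorm weight vector per norm. `count_llama_params` is a hypothetical helper written for this illustration, not part of `transformers`.

```python
def count_llama_params(vocab_size, hidden_size, num_layers, intermediate_size,
                       tie_embeddings=False):
    """Estimate the weight count of a LLaMA-style decoder with full MHA.

    Assumes no bias terms and one RMSNorm weight vector per norm layer.
    """
    embed = vocab_size * hidden_size            # token embedding matrix
    attn = 4 * hidden_size * hidden_size        # q, k, v, o projections (MHA)
    mlp = 3 * hidden_size * intermediate_size   # gate, up, down projections
    norms = 2 * hidden_size                     # two RMSNorms per layer
    per_layer = attn + mlp + norms
    # Assumption: the output head is a separate matrix unless embeddings are tied.
    lm_head = 0 if tie_embeddings else vocab_size * hidden_size
    final_norm = hidden_size
    return embed + num_layers * per_layer + final_norm + lm_head


total = count_llama_params(49152, 768, 24, 3072)
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate comes out at roughly 302M, close to the ≈ 303 M reported in the table. Passing `tie_embeddings=True` shows how much of the total sits in the output head.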

---

## 📚 Training Curriculum (4 Stages, 29.8B tokens)

| Stage                      | Tokens (B) | Dataset                                              | Objective                        |    Loss ↓    |
| :------------------------- | :--------: | :--------------------------------------------------- | :------------------------------- | :----------: |
| **1. Linguistic Base**     |     6.3    | FineWeb-Edu                                          | General English grounding        | 10.87 → 2.58 |
| **2. Code Specialization** |     7.5    | 60 % Nemotron Synthetic Code / 40 % StarCoderData    | Code syntax & reasoning          |  5.00 → 1.25 |
| **3. Math & Knowledge**    |    10.0    | Nemotron CC-Math / FineWiki / OpenWebMath            | Mathematical reasoning           |  2.77 → 1.55 |
| **4.1 SFT (EOS Fixed)**    |     6.0    | Nemotron SFT / OpenCodeInstruct / OpenMathInstruct-2 | Instruction-tuned code alignment | 1.73 → ~0.70 |

> 🧩 Total ≈ 29.8 B tokens of curated curriculum learning.
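
The token budget in the table can be checked with a few lines of arithmetic. The sketch below copies the per-stage sizes from the table (in millions of tokens) and derives the total and each stage's share. It assumes the 60 % / 40 % mix in Stage 2 is measured in tokens, which the table does not state explicitly.

```python
# Stage sizes in millions of tokens, copied from the curriculum table.
STAGE_TOKENS_M = {
    "1. Linguistic Base": 6_300,
    "2. Code Specialization": 7_500,
    "3. Math & Knowledge": 10_000,
    "4.1 SFT (EOS Fixed)": 6_000,
}


def stage_shares(stage_tokens):
    """Return the total token count and each stage's fraction of it."""
    total = sum(stage_tokens.values())
    return total, {name: tokens / total for name, tokens in stage_tokens.items()}


total_m, shares = stage_shares(STAGE_TOKENS_M)

# Assumption: the Stage 2 mix (60 % Nemotron code, 40 % StarCoderData) is by tokens.
stage2_m = STAGE_TOKENS_M["2. Code Specialization"]
nemotron_code_m = stage2_m * 60 // 100
starcoder_m = stage2_m - nemotron_code_m

print(f"total: {total_m / 1000:.1f}B tokens")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"stage 2 split: {nemotron_code_m / 1000:.1f}B + {starcoder_m / 1000:.1f}B")
```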

---

## 📊 Detailed Benchmarks (Stage 4.1 SFT)

| Domain          | Benchmark            | Metric       |     Score     |
| :-------------- | :------------------- | :----------- | :-----------: |
| **Code**        | HumanEval (0-shot)   | pass@1       |   **27.4 %**  |
| **Code**        | MBPP (3-shot)        | pass@1       |   **31.0 %**  |
| **Math**        | GSM8k (0-shot)       | exact match  |   **4.55 %**  |
| **Knowledge**   | Wikitext-2           | perplexity ↓ |   **167.6**   |
| **Reasoning**   | ARC (Easy/Challenge) | acc norm     | 34.6 / 22.8 % |
| **Commonsense** | HellaSwag            | acc norm     |     28.3 %    |

> HumanEval and MBPP scores were computed with a manual evaluation (`max_new_tokens=512`, `temperature=0.2`) rather than `lm-eval`, because the SFT prompt format caused truncation issues there.
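
For readers reproducing the code metrics, here is a minimal sketch of the commonly used unbiased pass@k estimator. The README does not say how many samples per problem the manual evaluation drew, so treating it as equivalent to this estimator is an assumption; with a single sample per problem, pass@1 reduces to the fraction of problems solved. The helper names and the example data are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem outcomes, for illustration only (not SmallCoder data).
example = [(10, 3), (10, 0), (10, 10), (10, 1)]
print(f"pass@1 = {mean_pass_at_k(example, 1):.3f}")
```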

---

## ⚠️ Known Limitations

1. **Code-Specialized Model**
   Tuned for Python and algorithmic reasoning. It performs poorly on general text, math, and commonsense tasks (see the GSM8k, Wikitext-2, and HellaSwag scores above).

2. **Short Context**
   Trained on **1 024-token** sequences only. Performance degrades on longer inputs.

3. **Tokenizer Bias**
   Uses the `bigcode/starcoder` BPE vocabulary, which is optimized for code rather than prose.

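
One practical consequence of the short context is that the prompt and the generated tokens share the same 1 024-token window. The sketch below shows one possible way to trim a tokenized prompt so that it leaves room for generation. `fit_to_context` is a hypothetical helper, and keeping only the most recent tokens is an assumed policy that may drop the start of a long instruction.

```python
MAX_CONTEXT = 1024  # max_position_embeddings from the model config


def fit_to_context(prompt_ids, max_new_tokens=512, max_context=MAX_CONTEXT):
    """Keep only the most recent prompt tokens so prompt + generation fits."""
    budget = max_context - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    return prompt_ids[-budget:]


# Stand-in token ids; in practice these would come from the tokenizer.
ids = list(range(2000))
trimmed = fit_to_context(ids, max_new_tokens=512)
print(len(trimmed))
```

With `max_new_tokens=512`, as in the usage example, the prompt budget is the remaining 512 tokens.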
---

## 💻 Usage Example

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Beebey/smallcoder-303m"
device = "cuda" if torch.cuda.is_available() else "cpu"

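# Load the tokenizer and weights; bfloat16 matches the training precision.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

# The prompt follows the "User:" / "Assistant:" dialogue format used in training.
prompt = """User: Write a Python function to compute Fibonacci numbers.
Assistant:"""
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Prompt plus generated tokens must fit in the 1 024-token context window.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,  # pad with the EOS token
    )

# The decoded text contains the prompt followed by the model's reply.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))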
```

💡 *Trained using the “User:” / “Assistant:” dialogue format.*
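
Since the decoded output contains the prompt as well as the reply, a pair of small string helpers can make the dialogue format easier to work with. This is a minimal sketch: `build_prompt` and `extract_reply` are hypothetical names, and cutting the text at a follow-on `User:` turn is a defensive assumption rather than documented model behavior.

```python
def build_prompt(instruction):
    """Wrap a request in the User:/Assistant: dialogue format."""
    return f"User: {instruction.strip()}\nAssistant:"


def extract_reply(decoded, prompt):
    """Return the assistant's text, dropping the prompt and any follow-on turn."""
    reply = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    # Assumption: anything after a new "User:" turn is not part of the answer.
    return reply.split("\nUser:", 1)[0].strip()


prompt = build_prompt("Write a Python function to compute Fibonacci numbers.")
# Stand-in for tokenizer.decode(...) output, for illustration only.
decoded = prompt + " def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)\nUser: thanks"
print(extract_reply(decoded, prompt))
```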

---

## 🧾 Citation

If you use **SmallCoder (303M)** in your research, please cite:

```bibtex
@misc{smallcoder303m,
  title  = {SmallCoder: A 303M-parameter Code LLM trained from scratch},
  author = {Da Silva, Ilan},
  year   = {2025},
  url    = {https://huggingface.co/Beebey/smallcoder-303m},
  note   = {Trained with Google TPU Research Cloud (TRC) support}
}
```

---

## 🙏 Acknowledgements

This model was trained with support from the **Google TPU Research Cloud (TRC)** program.
Special thanks to the open datasets that enabled this work:
FineWeb, StarCoderData, Nemotron, and OpenWebMath.

---

## 🧩 Summary

| Category            | Description                 |
| ------------------- | --------------------------- |
| **Type**            | Code LLM (LLaMA-style)      |
| **Parameters**      | 303 M                       |
| **Training tokens** | ~29.8 B                     |
| **Specialty**       | Code generation & reasoning |
| **Context window**  | 1 024 tokens                |
| **Tokenizer**       | `bigcode/starcoder`         |
| **License**         | Apache 2.0                  |
| **Hardware**        | TPU v4 (TRC Program)        |

---

> 🔬 **SmallCoder (303M)** shows that a carefully designed <500M model can match 1B-class models on HumanEval, and that *efficient, compact, open models* still matter.
