---
language: en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- text-generation
- llama
- from-scratch
- jax
model-index:
- name: KodaLite-1.3B
  results:
  - task:
      type: text-generation
    dataset:
      name: HellaSwag (zero-shot)
      type: hellaswag
    metrics:
    - type: accuracy
      value: 0.2565
  - task:
      type: text-generation
    dataset:
      name: ARC-Easy (zero-shot)
      type: ai2_arc
    metrics:
    - type: accuracy
      value: 0.3279
  - task:
      type: text-generation
    dataset:
      name: ARC-Challenge (zero-shot)
      type: ai2_arc
    metrics:
    - type: accuracy
      value: 0.2150
  - task:
      type: text-generation
    dataset:
      name: WinoGrande (zero-shot)
      type: winogrande
    metrics:
    - type: accuracy
      value: 0.4957
  - task:
      type: text-generation
    dataset:
      name: PIQA (zero-shot)
      type: piqa
    metrics:
    - type: accuracy
      value: 0.5892
  - task:
      type: text-generation
    dataset:
      name: BoolQ (zero-shot)
      type: boolq
    metrics:
    - type: accuracy
      value: 0.4434
  - task:
      type: text-generation
    dataset:
      name: OpenBookQA (zero-shot)
      type: openbookqa
    metrics:
    - type: accuracy
      value: 0.2500
  - task:
      type: text-generation
    dataset:
      name: LAMBADA (OpenAI, zero-shot)
      type: lambada_openai
    metrics:
    - type: accuracy
      value: 0.1822
    - type: perplexity
      value: 93.78
---

# KodaLite-1.3B (Koda-v0.1)

A **1.27B** parameter LLaMA-style decoder-only language model, trained **entirely from scratch** on 2x NVIDIA L40S GPUs using JAX + Flax NNX, then converted to HuggingFace Transformers format.

> **TL;DR** — KodaLite reaches ~37% average accuracy on standard LLM benchmarks. It is **severely undertrained** (only 1.64B tokens vs 40B–3T for comparable models), which places it just below GPT-2-124M despite having 10× more parameters. A nice illustration of the **Chinchilla scaling law**: tokens matter more than parameters at this budget.
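The "severely undertrained" claim above is just token-budget arithmetic. Here is a back-of-the-envelope sketch using the figures from this card (the 20-tokens-per-parameter rule of thumb is discussed in the Chinchilla section further down):

```python
# Back-of-the-envelope Chinchilla check, using the figures from this card.
params = 1.27e9                    # KodaLite parameter count
tokens_seen = 1.64e9               # pre-training tokens actually consumed
chinchilla_target = 20 * params    # ~20 training tokens per parameter

print(f"target ~ {chinchilla_target / 1e9:.1f}B tokens")            # ~25.4B
print(f"seen   ~ {tokens_seen / chinchilla_target:.1%} of target")  # ~6.5%
```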
## Benchmark results (zero-shot, 8 standard tasks)

KodaLite was evaluated against 8 comparable small models (0.12B–1.56B parameters) on the same zero-shot benchmarks (HellaSwag, ARC-E/C, WinoGrande, PIQA, BoolQ, OpenBookQA, LAMBADA-OpenAI). The average-accuracy column covers the seven multiple-choice tasks; LAMBADA accuracy and perplexity are reported separately in the per-task breakdown.

| Rank | Model | Params | Train tokens | Avg accuracy |
|---|---|---|---|---|
| 1 | TinyLlama-1.1B | 1.10B | 3000B | **50.3%** |
| 2 | Pythia-1.4B | 1.41B | 300B | **50.2%** |
| 3 | GPT-2-XL | 1.56B | 40B | **49.4%** |
| 4 | OPT-1.3B | 1.32B | 180B | **49.1%** |
| 5 | Pythia-1B | 1.01B | 300B | **47.6%** |
| 6 | GPT-2-large | 0.77B | 40B | **46.2%** |
| 7 | GPT-2-medium | 0.35B | 40B | **44.2%** |
| 8 | GPT-2-124M | 0.12B | 40B | **39.7%** |
| **9** | **KodaLite-1.3B** | **1.27B** | **1.64B** | **36.8%** |

### Per-task breakdown

| Task | KodaLite-1.3B | GPT-2-124M | GPT-2-XL | Pythia-1.4B | TinyLlama-1.1B | Random |
|---|---|---|---|---|---|---|
| HellaSwag | 25.65 | 29.22 | 47.94 | 49.21 | 56.2 | 25.0 |
| ARC-Easy | 32.79 | 38.30 | 50.80 | 51.73 | 43.9 | 25.0 |
| ARC-Challenge | 21.50 | 22.70 | 28.16 | 29.01 | 30.0 | 25.0 |
| WinoGrande | 49.57 | 49.49 | 51.93 | 52.88 | 52.2 | 50.0 |
| PIQA | 58.92 | 62.24 | 70.89 | 71.22 | 72.1 | 50.0 |
| BoolQ | 44.34 | 49.76 | 61.59 | 63.70 | 60.6 | 50.0 |
| OpenBookQA | 25.00 | 26.40 | 34.20 | 33.40 | 37.2 | 25.0 |
| LAMBADA (acc / ppl) | 18.22 / 93.8 | 30.84 / 17.5 | 50.79 / 6.4 | 61.03 / 3.8 | — | — |

## Why KodaLite scores below GPT-2-124M (despite being 10× bigger)

The **Chinchilla scaling law** (DeepMind, 2022) states that a model with N parameters needs approximately **20×N training tokens** to be well-trained:

| Model | Params | Chinchilla target (~20× params) | Actual tokens | Ratio |
|---|---|---|---|---|
| **KodaLite-1.3B** | 1.27B | ~25B | **1.64B** | **6.5%** 🔴 |
| GPT-2-XL | 1.5B | ~30B | 40B | 133% |
| Pythia-1.4B | 1.4B | ~28B | 300B | 1070% |
| TinyLlama-1.1B | 1.1B | ~22B | 3000B | 13600% |

KodaLite has seen **only 6.5%** of the tokens it would need to be competitive. A bigger but undertrained model scores lower than a smaller but well-trained one. The LAMBADA perplexity (94 vs 17 for GPT-2-124M) is the clearest signal: the base language modeling has not converged. The gap is smallest on **PIQA** (physical commonsense) — that kind of knowledge appears to be picked up faster than factual knowledge or precise language modeling.

## Chat Format

The model uses three plain-text markers (no special tokens): `<|user|>`, `<|assistant|>`, `<|end|>`.

```
<|user|>
Your question
<|assistant|>
Model response
<|end|>
```

**Important**: `<|end|>` is NOT a single token (it tokenizes to 5 BPE tokens). Always pass it via the `stop_strings` parameter when generating, otherwise the model will run past its natural end of turn.

## Usage (Transformers)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("YoAbriel/KodaLite-1.3B")
model = AutoModelForCausalLM.from_pretrained(
    "YoAbriel/KodaLite-1.3B",
    dtype=torch.bfloat16,
    device_map="auto",
)

msg = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tok.apply_chat_template(msg, tokenize=False, add_generation_prompt=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=150,
    do_sample=True,
    temperature=0.7,
    top_k=40,
    stop_strings=["<|end|>"],
    tokenizer=tok,
)
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=False))
```
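To double-check the `<|end|>` behaviour described in the Chat Format section, here is a small sketch showing why a stop string (rather than an `eos_token_id`) is needed; the exact pieces depend on the GPT-2 BPE vocabulary shipped with the model:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("YoAbriel/KodaLite-1.3B")

# "<|end|>" is plain text, not a registered special token, so the GPT-2 BPE
# splits it into several pieces (5, per the note above); it can therefore
# never be emitted as a single EOS token and must be matched as a string.
ids = tok("<|end|>", add_special_tokens=False)["input_ids"]
print(len(ids), tok.convert_ids_to_tokens(ids))
```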
## Usage (MLX — Apple Silicon)

See [YoAbriel/KodaLite-1.3B-mlx](https://huggingface.co/YoAbriel/KodaLite-1.3B-mlx).

```python
from mlx_lm import load, stream_generate

model, tok = load("YoAbriel/KodaLite-1.3B-mlx-8bit")

def chat(q):
    prompt = tok.apply_chat_template([{"role": "user", "content": q}], tokenize=False)
    text = ""
    for resp in stream_generate(model, tok, prompt=prompt, max_tokens=150):
        text += resp.text
        if "<|end|>" in text:
            return text.split("<|end|>")[0]
    return text

print(chat("What is the capital of France?"))
```

## Usage (llama.cpp / Ollama / LM Studio)

See [YoAbriel/KodaLite-1.3B-GGUF](https://huggingface.co/YoAbriel/KodaLite-1.3B-GGUF).

```bash
ollama run hf.co/YoAbriel/KodaLite-1.3B-GGUF:Q4_K_M
```

**LM Studio note**: the model was trained with `<|end|>` as a multi-token end marker. Since GGUF only supports a single-token EOS, you need to **manually add `<|end|>` as a Stop String** in LM Studio's Advanced Settings.

## Architecture (LLaMA-compatible)

| Component | Value |
|---|---|
| Parameters | 1.27B |
| Layers | 24 |
| Hidden size | 2048 |
| Attention | GQA (32 query / 8 KV heads) |
| Head dim | 64 |
| FFN | SwiGLU, intermediate size 5504 |
| Normalization | RMSNorm (pre-norm) |
| Position | RoPE (theta=10000) |
| Context | 1024 tokens |
| Vocab | 50,257 (GPT-2 BPE) |

(A sketch of this table expressed as a `transformers.LlamaConfig` is included at the end of this card.)

## Training

### Pre-training

- **Dataset**: SlimPajama-6B (streaming)
- **Tokens seen**: 1.64B
- **Hardware**: 2x NVIDIA L40S (96GB VRAM total)
- **Precision**: bfloat16
- **Framework**: JAX + Flax NNX (trained from scratch, no base model)

### SFT

- **Datasets**: Databricks Dolly-15K + OpenAssistant OASST1
- **Method**: LoRA (rank=16, alpha=32), then merged into the base weights
- **End-of-turn marker**: `<|end|>` (5 BPE tokens, NOT a special token)

## Limitations

- **Severely undertrained** (6.5% of Chinchilla-optimal) — factual accuracy is low
- May produce repetitive or inaccurate responses
- English only
- 1024-token context window
- Educational / research project — not production-ready

## Lessons learned (for a potential v0.2)

1. **Train longer**: aim for 20B+ tokens (Chinchilla-optimal for 1.3B would be ~25B).
2. **Use `<|endoftext|>` (a single token) as the end-of-turn marker** for native GGUF/LM Studio stop support.
3. The SwiGLU + RMSNorm + GQA + RoPE architecture is sound — no issues there, as confirmed by results following the expected scaling curve; the bottleneck is training tokens, not model design.

## License

Apache 2.0
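## Appendix: architecture table as a `LlamaConfig` (sketch)

For readers who want to see how the architecture table maps onto a HuggingFace config, here is a minimal sketch. Only the values listed in the table are set; every other field keeps the `LlamaConfig` default and may differ from the `config.json` actually shipped with the model.

```python
from transformers import LlamaConfig

# Sketch only: values taken from the Architecture table above; fields not
# listed in that table keep the LlamaConfig defaults and may not match the
# released config.json.
config = LlamaConfig(
    vocab_size=50_257,             # GPT-2 BPE vocabulary
    hidden_size=2048,
    intermediate_size=5504,        # SwiGLU FFN
    num_hidden_layers=24,
    num_attention_heads=32,        # head dim = 2048 / 32 = 64
    num_key_value_heads=8,         # GQA: 32 query heads share 8 KV heads
    max_position_embeddings=1024,  # context length
    rope_theta=10000.0,
)
print(config)
```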