---
language:
- en
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- pytorch
- causal-lm
- llama
- from-scratch
- pretraining
- gqa
- swiglu
- rope
- rmsnorm
model-index:
- name: Mythos-172M
  results: []
widget:
- text: "The history of artificial intelligence begins with"
  example_title: "History"
- text: "A transformer is a neural network that"
  example_title: "Architecture"
inference:
  parameters:
    temperature: 0.8
    top_p: 0.9
    max_new_tokens: 128
---
# Mythos-172M

**A decoder-only language model built from scratch, with LLaMA-compatible weights.**

[![GitHub](https://img.shields.io/badge/GitHub-borisgraudt/mythos-24292e?logo=github)](https://github.com/borisgraudt/mythos)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/borisgraudt/mythos/blob/main/LICENSE)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.5+-ee4c2c.svg?logo=pytorch)](https://pytorch.org)
[![transformers](https://img.shields.io/badge/🤗%20transformers-compatible-yellow)](https://github.com/huggingface/transformers)


---

> ⚠️ **Research preview.** This is a debug checkpoint, trained for 5,000 steps on ~21M tokens with a 3,252-token vocabulary. It is intended to verify the architecture, not for downstream use. A production 500M checkpoint will supersede it.

## Model Summary

Mythos is a LLaMA-style autoregressive transformer implemented **from first principles** in pure PyTorch: no `transformers` inheritance, no prebuilt `nn.TransformerDecoderLayer`, no shortcuts. Every component (attention, rotary embeddings, SwiGLU, RMSNorm, the training loop, the BPE tokenizer, the data pipeline, the KV-cache inference engine) is hand-written in the reference repository.

This release packages the weights in the **`LlamaForCausalLM`** format so that the model is natively usable via the standard `transformers`, `vLLM`, `TGI`, and `llama.cpp` toolchains, with no custom code or `trust_remote_code` required.

| Field | Value |
|---|---|
| **Developed by** | Boris Graudt |
| **Model type** | Decoder-only causal transformer |
| **Language** | English |
| **License** | MIT |
| **Compatible with** | 🤗 `transformers`, vLLM, TGI, llama.cpp, Ollama |
| **Reference implementation** | [github.com/borisgraudt/mythos](https://github.com/borisgraudt/mythos) |

## Architecture

| Component | Choice | Value |
|---|---|---:|
| Parameters | - | **172M** |
| Hidden layers | Pre-norm decoder blocks | 24 |
| Hidden size | `d_model` | 768 |
| Intermediate size | SwiGLU hidden | 2048 |
| Attention heads | Multi-head | 12 |
| Key / value heads | **Grouped-Query Attention** | 4 |
| Head dim | `d_model / n_heads` | 64 |
| Positional encoding | **Rotary (RoPE)** | θ = 10,000 |
| Normalization | **RMSNorm** (pre-norm) | ε = 1e-05 |
| Activation | **SwiGLU** | - |
| Tied embeddings | Embedding ↔ LM head | ✅ |
| Vocabulary | ByteLevel BPE | 3,252 |
| Context length | Max sequence | 2,048 |

## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bgraudt/mythos"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the `accelerate` package to be installed.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "The history of artificial intelligence begins with"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sampling settings match the widget defaults in the model card metadata.
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.8, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Serving with vLLM

```bash
pip install vllm
python -m vllm.entrypoints.openai.api_server --model bgraudt/mythos
```

### Serving with llama.cpp

```bash
# Download the checkpoint into ./mythos, then convert it to GGUF (one-time)
huggingface-cli download bgraudt/mythos --local-dir mythos
python llama.cpp/convert_hf_to_gguf.py mythos --outfile mythos-f16.gguf --outtype f16

# Run the converted model (adjust the path to wherever llama-cli was built)
./llama-cli -m mythos-f16.gguf -p "Hello"
```

## Training

### Data

- **Corpus:** Wikipedia (English, 20231101 snapshot): 5,000 articles, ~21M tokens
- **Tokenizer:** ByteLevel BPE trained from scratch, vocab size **3,252**
- **Training context:** 512 tokens

### Hyperparameters

| Hyperparameter | Value |
|---|---:|
| Steps | 5,000 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) |
| LR schedule | Cosine decay, 2,000-step warmup |
| Peak learning rate | 3 × 10⁻⁴ |
| Precision | bfloat16 mixed |
| Hardware | Apple M2 (MPS) |

## Limitations and Intended Use

- **Base model only:** no instruction tuning, no RLHF, no safety alignment.
- English-only; non-English performance is poor.
- May reproduce biases and factual errors from the training distribution.
- The tiny vocabulary (3,252 tokens) severely caps fluency; this checkpoint is intended as an architecture demo.
- Not suitable for medical, legal, financial, or other high-stakes applications.

## Citation

```bibtex
@software{graudt2026mythos,
  author  = {Graudt, Boris},
  title   = {Mythos: A Decoder-Only Language Model Built From Scratch},
  year    = {2026},
  url     = {https://github.com/borisgraudt/mythos},
  license = {MIT}
}
```

## Acknowledgements

Architecture inspired by **LLaMA** (Touvron et al., 2023) and **Mistral 7B** (Jiang et al., 2023). Data pipeline follows the **FineWeb** methodology (Penedo et al., 2024).
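
## Appendix: Learning-Rate Schedule Sketch

The Hyperparameters table lists cosine decay with a 2,000-step warmup, 5,000 total steps, and a peak learning rate of 3 × 10⁻⁴. The snippet below is a minimal sketch of what such a schedule could look like. The linear warmup shape, the floor of zero (`MIN_LR`), and the helper name `lr_at` are assumptions made for illustration; the card does not state them, and the reference repository may implement the schedule differently.

```python
import math

PEAK_LR = 3e-4        # from the Hyperparameters table
WARMUP_STEPS = 2_000  # from the Hyperparameters table
TOTAL_STEPS = 5_000   # from the Hyperparameters table
MIN_LR = 0.0          # assumption: the card does not state a floor


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (hypothetical helper)."""
    if step < WARMUP_STEPS:
        # Assumed linear warmup from 0 up to the peak.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from the peak down to MIN_LR over the remaining steps.
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


# Print the schedule at a few representative steps.
for step in (0, 1_000, 2_000, 3_500, 5_000):
    print(f"step {step:>5}: lr = {lr_at(step):.2e}")
```

Under these assumptions the warmup takes up 40% of the run, so the rate sits at or near its peak only briefly before the decay begins.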