---
language: en
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
  - pytorch
  - causal-lm
  - llama
  - from-scratch
  - pretraining
  - gqa
  - swiglu
  - rope
  - rmsnorm
model-index:
  - name: Mythos-172M
    results: []
widget:
  - text: "The history of artificial intelligence begins with"
    example_title: "History"
  - text: "A transformer is a neural network that"
    example_title: "Architecture"
inference:
  parameters:
    temperature: 0.8
    top_p: 0.9
    max_new_tokens: 128
---

Mythos-172M

A decoder-only language model built from scratch — LLaMA-compatible weights.



⚠️ Research preview. Debug checkpoint trained on ~21M tokens with a 3,252-token vocabulary for 5,000 steps. Intended to verify the architecture, not for downstream use. A production 500M checkpoint will supersede it.

Model Summary

Mythos is a LLaMA-style autoregressive transformer implemented from first principles in pure PyTorch: no transformers inheritance, no prebuilt torch.nn.Transformer modules, no shortcuts. Every component (attention, rotary embeddings, SwiGLU, RMSNorm, the training loop, the BPE tokenizer, the data pipeline, the KV-cache inference engine) is hand-written in the reference repository.
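
To give a flavor of what "hand-written" means here, below is a minimal sketch of two of those components, RMSNorm and SwiGLU, in plain PyTorch. Class and attribute names are illustrative, not the repository's actual code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Scale-only normalization: divide by the RMS of the features, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    """Gated MLP: down-project silu(gate(x)) * up(x)."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))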

This release packages the weights in the LlamaForCausalLM format so that the model is natively usable via the standard transformers, vLLM, TGI, and llama.cpp toolchains — no custom code or trust_remote_code required.
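
Because the export uses the stock LLaMA architecture, transformers resolves it to its built-in classes. A quick sanity check (the expected output is an assumption based on the packaging described above):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("bgraudt/mythos")
print(config.model_type)  # expected: "llama", i.e. the built-in implementation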

  • Developed by: Boris Graudt
  • Model type: Decoder-only causal transformer
  • Language: English
  • License: MIT
  • Compatible with: 🤗 transformers, vLLM, TGI, llama.cpp, Ollama
  • Reference implementation: github.com/borisgraudt/mythos

Architecture

| Component | Choice | Value |
|---|---|---|
| Parameters | | 172M |
| Hidden layers | Pre-norm decoder blocks | 24 |
| Hidden size | d_model | 768 |
| Intermediate size | SwiGLU hidden | 2048 |
| Attention heads | Multi-head | 12 |
| Key / value heads | Grouped-Query Attention | 4 |
| Head dim | d_model / n_heads | 64 |
| Positional encoding | Rotary (RoPE) | θ = 10,000 |
| Normalization | RMSNorm (pre-norm) | ε = 1e-05 |
| Activation | SwiGLU | |
| Tied embeddings | Embedding ↔ LM head | |
| Vocabulary | ByteLevel BPE | 3,252 |
| Context length | Max sequence | 2,048 |
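
To make the table concrete, the sketch below applies rotary position embeddings (θ = 10,000) to a tensor shaped (batch, heads, seq, head_dim). This is a toy illustration of the technique, not the repository's implementation:

import torch

def rope(x: torch.Tensor, theta: float = 10_000.0) -> torch.Tensor:
    """Rotate channel pairs by position-dependent angles (head_dim must be even)."""
    b, h, t, d = x.shape
    freqs = theta ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)    # (d/2,)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None]  # (t, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 12, 16, 64)  # 12 query heads, head_dim 64 per the table
print(rope(q).shape)            # torch.Size([1, 12, 16, 64])

On the grouped-query side: with 12 query heads sharing 4 key/value heads, each KV head serves 3 query heads at attention time (e.g. by repeating the KV tensors), cutting the KV-cache to a third of full multi-head attention.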

Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bgraudt/mythos"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Loads the stock LlamaForCausalLM implementation; no trust_remote_code needed
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("The history of artificial intelligence begins with", return_tensors="pt").to(model.device)
# Sampling settings match the widget defaults above
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.8, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Serving with vLLM

pip install vllm
python -m vllm.entrypoints.openai.api_server --model bgraudt/mythos
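
The server exposes an OpenAI-compatible completions endpoint; a minimal client call looks like this (port 8000 is vLLM's default, and the api_key value is a placeholder):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.completions.create(
    model="bgraudt/mythos",
    prompt="The history of artificial intelligence begins with",
    max_tokens=128,
    temperature=0.8,
)
print(response.choices[0].text)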

Serving with llama.cpp

# Convert to GGUF (one-time)
python llama.cpp/convert_hf_to_gguf.py mythos
./llama-cli -m ggml-model-f16.gguf -p "Hello"

Training

Data

  • Corpus: Wikipedia (English 20231101 snapshot); 5,000 articles, ~21M tokens
  • Tokenizer: ByteLevel BPE trained from scratch, vocab size 3,252 (see the sketch below)
  • Training context: 512 tokens
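
Training an equivalent ByteLevel BPE tokenizer with the 🤗 tokenizers library looks roughly like this; the file path and special tokens are assumptions, not the repository's exact setup:

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["wikipedia_subset.txt"],  # assumed: text extracted from the 5,000 articles
    vocab_size=3252,
    special_tokens=["<s>", "</s>", "<unk>"],
)
tokenizer.save_model("mythos-tokenizer")  # writes vocab.json and merges.txt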

Hyperparameters

| Hyperparameter | Value |
|---|---|
| Steps | 5,000 |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, weight decay = 0.1) |
| LR schedule | Cosine decay, 2,000-step warmup |
| Peak learning rate | 3 × 10⁻⁴ |
| Precision | bfloat16 mixed |
| Hardware | Apple M2 (MPS) |
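
In PyTorch terms, the table corresponds to roughly the following setup; decaying to zero is an assumption, since the card does not state a minimum learning rate:

import math
import torch

def lr_lambda(step: int, warmup: int = 2_000, total: int = 5_000) -> float:
    """Linear warmup to the peak LR, then cosine decay over the remaining steps."""
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(768, 768)  # stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)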

Limitations and Intended Use

  • Base model only — no instruction tuning, no RLHF, no safety alignment.
  • English-only; non-English performance is poor.
  • May reproduce biases and factual errors from the training distribution.
  • Tiny vocabulary (3,252 tokens) severely caps fluency; this checkpoint is an architecture demo.
  • Not suitable for medical, legal, financial, or other high-stakes applications.

Citation

@software{graudt2026mythos,
  author  = {Graudt, Boris},
  title   = {Mythos: A Decoder-Only Language Model Built From Scratch},
  year    = {2026},
  url     = {https://github.com/borisgraudt/mythos},
  license = {MIT}
}

Acknowledgements

Architecture inspired by LLaMA (Touvron et al., 2023) and Mistral 7B (Jiang et al., 2023). Data pipeline follows the FineWeb methodology (Penedo et al., 2024).
