---
language:
- en
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- pytorch
- causal-lm
- llama
- from-scratch
- pretraining
- gqa
- swiglu
- rope
- rmsnorm
model-index:
- name: Mythos-172M
  results: []
widget:
- text: The history of artificial intelligence begins with
  example_title: History
- text: A transformer is a neural network that
  example_title: Architecture
inference:
  parameters:
    temperature: 0.8
    top_p: 0.9
    max_new_tokens: 128
---

# Mythos-172M

A decoder-only language model built from scratch — LLaMA-compatible weights.

> ⚠️ **Research preview.** Debug checkpoint — trained on ~21 M tokens with a 3,252-token vocabulary for 5,000 steps. Intended to verify the architecture, not for downstream use. A production 500 M checkpoint will supersede it.

## Model Summary

Mythos is a LLaMA-style autoregressive transformer implemented from first principles in pure PyTorch — no `transformers` inheritance, no built-in `nn.Transformer` layers, no shortcuts. Every component (attention, rotary embeddings, SwiGLU, RMSNorm, the training loop, the BPE tokenizer, the data pipeline, the KV-cache inference engine) is hand-written in the reference repository.
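
For a flavour of what "hand-written" means here, below is a minimal sketch of two such components (RMSNorm and a SwiGLU feed-forward block) in plain PyTorch. Names and shapes are illustrative and are not copied from the reference repository.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by the RMS of the activations, no mean-centering."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)


class SwiGLU(nn.Module):
    """Gated feed-forward block: SiLU(x W_gate) * (x W_up), projected back to d_model."""

    def __init__(self, dim: int = 768, hidden: int = 2048):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(nn.functional.silu(self.gate(x)) * self.up(x))
```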

This release packages the weights in the LlamaForCausalLM format so that the model is natively usable via the standard transformers, vLLM, TGI, and llama.cpp toolchains — no custom code or trust_remote_code required.
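
Because the checkpoint is stored in the stock format, you can sanity-check the export without loading the weights. The expected values in the comments follow from this card's claims, not from running the model:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bgraudt/mythos")
print(config.architectures)  # expected per this card: ['LlamaForCausalLM']
print(config.model_type)     # expected per this card: 'llama'
```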

- **Developed by:** Boris Graudt
- **Model type:** Decoder-only causal transformer
- **Language:** English
- **License:** MIT
- **Compatible with:** 🤗 transformers, vLLM, TGI, llama.cpp, Ollama
- **Reference implementation:** [github.com/borisgraudt/mythos](https://github.com/borisgraudt/mythos)

## Architecture

| Component | Choice | Value |
|---|---|---|
| Parameters | | 172 M |
| Hidden layers | Pre-norm decoder blocks | 24 |
| Hidden size | d_model | 768 |
| Intermediate size | SwiGLU hidden | 2048 |
| Attention heads | Multi-head | 12 |
| Key / value heads | Grouped-Query Attention | 4 |
| Head dim | d_model / n_heads | 64 |
| Positional encoding | Rotary (RoPE) | θ = 10,000 |
| Normalization | RMSNorm (pre-norm) | ε = 1e-5 |
| Activation | SwiGLU | |
| Tied embeddings | Embedding ↔ LM head | |
| Vocabulary | ByteLevel BPE | 3,252 |
| Context length | Max sequence | 2,048 |
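
These choices map directly onto the stock `LlamaConfig` fields. The sketch below takes its values from the table above and leaves everything else at transformers defaults; it is a reading of the table, not an excerpt from the released config:

```python
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=3252,              # ByteLevel BPE vocabulary
    hidden_size=768,              # d_model
    intermediate_size=2048,       # SwiGLU hidden size
    num_hidden_layers=24,         # pre-norm decoder blocks
    num_attention_heads=12,       # query heads (head dim = 768 / 12 = 64)
    num_key_value_heads=4,        # grouped-query attention
    max_position_embeddings=2048, # context length
    rope_theta=10000.0,
    rms_norm_eps=1e-5,
    tie_word_embeddings=True,     # embedding matrix shared with the LM head
)
```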

## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bgraudt/mythos"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("The history of artificial intelligence begins with", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.8, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Serving with vLLM

```bash
pip install vllm
python -m vllm.entrypoints.openai.api_server --model bgraudt/mythos
```
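
Once the server is up you can hit its OpenAI-compatible completions endpoint. A minimal sketch, assuming the default host and port (`localhost:8000`) and the sampling parameters from the widget above:

```python
import requests

# vLLM's OpenAI-compatible server accepts plain prompts on /v1/completions.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "bgraudt/mythos",
        "prompt": "The history of artificial intelligence begins with",
        "max_tokens": 128,
        "temperature": 0.8,
        "top_p": 0.9,
    },
    timeout=60,
)
print(response.json()["choices"][0]["text"])
```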

### Serving with llama.cpp

```bash
# Convert to GGUF (one-time)
python llama.cpp/convert_hf_to_gguf.py mythos
./llama-cli -m ggml-model-f16.gguf -p "Hello"
```

## Training

### Data

- Corpus: Wikipedia (English 20231101 snapshot) — 5,000 articles, ~21 M tokens
- Tokenizer: ByteLevel BPE trained from scratch, vocab size 3,252 (see the sketch below)
- Training context: 512 tokens
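
A tokenizer with these properties can be reproduced with the `tokenizers` library. This is a sketch of that setup only; the special tokens and file paths are placeholders, not the repository's actual training script:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# ByteLevel BPE trained from scratch to a 3,252-token vocabulary.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=3252,
    special_tokens=["<s>", "</s>", "<unk>"],  # placeholder special tokens
)
tokenizer.train(["wikipedia_en.txt"], trainer)  # placeholder corpus file
tokenizer.save("tokenizer.json")
```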

### Hyperparameters

| Hyperparameter | Value |
|---|---|
| Steps | 5,000 |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, weight decay 0.1) |
| LR schedule | Cosine decay, 2,000-step warmup |
| Peak learning rate | 3 × 10⁻⁴ |
| Precision | bfloat16 mixed |
| Hardware | Apple M2 (MPS) |
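
The schedule in the table can be expressed with stock PyTorch pieces. A minimal sketch, assuming the 5,000-step horizon and a cosine that decays to zero (the actual floor is not stated on this card):

```python
import math

import torch


def lr_lambda(step: int, warmup: int = 2_000, total: int = 5_000) -> float:
    """Linear warmup to the peak LR, then cosine decay over the remaining steps."""
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


model = torch.nn.Linear(768, 768)  # stand-in for the actual model
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```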

## Limitations and Intended Use

- Base model only — no instruction tuning, no RLHF, no safety alignment.
- English-only; non-English performance is poor.
- May reproduce biases and factual errors from the training distribution.
- Tiny vocabulary (3,252 tokens) severely caps fluency — intended as an architecture demo.
- Not suitable for medical, legal, financial, or other high-stakes applications.

## Citation

```bibtex
@software{graudt2026mythos,
  author  = {Graudt, Boris},
  title   = {Mythos: A Decoder-Only Language Model Built From Scratch},
  year    = {2026},
  url     = {https://github.com/borisgraudt/mythos},
  license = {MIT}
}
```

## Acknowledgements

Architecture inspired by LLaMA (Touvron et al., 2023) and Mistral 7B (Jiang et al., 2023). Data pipeline follows the FineWeb methodology (Penedo et al., 2024).