Files

ModelHub XC b6d3ea87b1 初始化项目，由ModelHub XC社区提供模型

Model: bgraudt/mythos
Source: Original Platform

2026-04-24 22:49:10 +08:00

5.0 KiB

Raw Blame History

language, license, library_name, pipeline_tag, tags, model-index, widget, inference

language

license

library_name

pipeline_tag

Mythos-172M

A decoder-only language model built from scratch — LLaMA-compatible weights.

⚠️ Research preview. Debug checkpoint — trained on ~21 M tokens with vocab 3 252 for 5 000 steps. Intended to verify the architecture, not for downstream use. A production 500 M checkpoint will supersede it.

Model Summary

Mythos is a LLaMA-style autoregressive transformer implemented from first principles in pure PyTorch — no transformers inheritance, no nn.TransformerBlock, no shortcuts. Every component (attention, rotary embeddings, SwiGLU, RMSNorm, the training loop, the BPE tokenizer, the data pipeline, the KV-cache inference engine) is hand-written in the reference repository.

This release packages the weights in the LlamaForCausalLM format so that the model is natively usable via the standard transformers, vLLM, TGI, and llama.cpp toolchains — no custom code or trust_remote_code required.


Developed by	Boris Graudt
Model type	Decoder-only causal transformer
Language	English
License	MIT
Compatible with	🤗 `transformers`, vLLM, TGI, llama.cpp, Ollama
Reference implementation	github.com/borisgraudt/mythos

Architecture

Component	Choice	Value
Parameters	—	172 M
Hidden layers	Pre-norm decoder blocks	24
Hidden size	`d_model`	768
Intermediate size	SwiGLU hidden	2048
Attention heads	Multi-head	12
Key / value heads	Grouped-Query Attention	4
Head dim	`d_model / n_heads`	64
Positional encoding	Rotary (RoPE)	θ = 10,000
Normalization	RMSNorm (pre-norm)	ε = 1e-05
Activation	SwiGLU	—
Tied embeddings	Embedding ↔ LM head	✅
Vocabulary	ByteLevel BPE	3,252
Context length	Max sequence	2,048

Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bgraudt/mythos"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("The history of artificial intelligence begins with", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.8, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Serving with vLLM

pip install vllm
python -m vllm.entrypoints.openai.api_server --model bgraudt/mythos

Serving with llama.cpp

# Convert to GGUF (one-time)
python llama.cpp/convert_hf_to_gguf.py mythos
./llama-cli -m ggml-model-f16.gguf -p "Hello"

Training

Data

Corpus: Wikipedia (English 20231101 snapshot) — 5 000 articles, ~21 M tokens
Tokenizer: ByteLevel BPE trained from scratch, vocab size 3,252
Training context: 512 tokens

Hyperparameters


Steps	5,000
Optimizer	AdamW (β₁=0.9, β₂=0.95, wd=0.1)
LR schedule	Cosine decay, 2 000-step warmup
Peak learning rate	3 × 10⁻⁴
Precision	bfloat16 mixed
Hardware	Apple M2 (MPS)

Limitations and Intended Use

Base model only — no instruction tuning, no RLHF, no safety alignment.
English-only; non-English performance is poor.
May reproduce biases and factual errors from the training distribution.
Tiny vocabulary (3 252 tokens) severely caps fluency — intended as an architecture demo.
Not suitable for medical, legal, financial, or other high-stakes applications.

Citation

@software{graudt2026mythos,
  author  = {Graudt, Boris},
  title   = {Mythos: A Decoder-Only Language Model Built From Scratch},
  year    = {2026},
  url     = {https://github.com/borisgraudt/mythos},
  license = {MIT}
}

Acknowledgements

Architecture inspired by LLaMA (Touvron et al., 2023) and Mistral 7B (Jiang et al., 2023). Data pipeline follows the FineWeb methodology (Penedo et al., 2024).

5.0 KiB Raw Blame History Unescape Escape