---
language:
- en
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- pytorch
- causal-lm
- llama
- from-scratch
- pretraining
- gqa
- swiglu
- rope
- rmsnorm
model-index:
- name: Mythos-172M
  results: []
widget:
- text: "The history of artificial intelligence begins with"
  example_title: "History"
- text: "A transformer is a neural network that"
  example_title: "Architecture"
inference:
  parameters:
    temperature: 0.8
    top_p: 0.9
    max_new_tokens: 128
---
<div align="center">

# Mythos-172M

**A decoder-only language model built from scratch — LLaMA-compatible weights.**

[![GitHub](https://img.shields.io/badge/GitHub-borisgraudt%2Fmythos-24292e?logo=github)](https://github.com/borisgraudt/mythos)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/borisgraudt/mythos/blob/main/LICENSE)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.5+-ee4c2c.svg?logo=pytorch)](https://pytorch.org)
[![transformers](https://img.shields.io/badge/🤗%20transformers-compatible-yellow)](https://github.com/huggingface/transformers)

</div>
---
> ⚠️ **Research preview.** This is a debug checkpoint: trained on ~21M tokens with a 3,252-token vocabulary for 5,000 steps. It is intended to verify the architecture, not for downstream use. A production 500M checkpoint will supersede it.
## Model Summary
Mythos is a LLaMA-style autoregressive transformer implemented **from first principles**
in pure PyTorch — no `transformers` inheritance, no `nn.TransformerDecoderLayer`, no shortcuts.
Every component (attention, rotary embeddings, SwiGLU, RMSNorm, the training loop, the
BPE tokenizer, the data pipeline, the KV-cache inference engine) is hand-written in the
reference repository.
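As a taste of what "hand-written" means in practice, here is a minimal RMSNorm in plain PyTorch. This is an illustrative sketch of the component, not the repository's exact code:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm (Zhang & Sennrich, 2019), as used in LLaMA."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS over the last dimension, then apply a learned gain.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```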
This release packages the weights in the **`LlamaForCausalLM`** format so that the model
is natively usable via the standard `transformers`, `vLLM`, `TGI`, and `llama.cpp`
toolchains — no custom code or `trust_remote_code` required.
| | |
|---|---|
| **Developed by** | Boris Graudt |
| **Model type** | Decoder-only causal transformer |
| **Language** | English |
| **License** | MIT |
| **Compatible with** | 🤗 `transformers`, vLLM, TGI, llama.cpp, Ollama |
| **Reference implementation** | [github.com/borisgraudt/mythos](https://github.com/borisgraudt/mythos) |
## Architecture
| Component | Choice | Value |
|---|---|---:|
| Parameters | — | **172M** |
| Hidden layers | Pre-norm decoder blocks | 24 |
| Hidden size | `d_model` | 768 |
| Intermediate size | SwiGLU hidden | 2048 |
| Attention heads | Multi-head | 12 |
| Key / value heads | **Grouped-Query Attention** | 4 |
| Head dim | `d_model / n_heads` | 64 |
| Positional encoding | **Rotary (RoPE)** | θ = 10,000 |
| Normalization | **RMSNorm** (pre-norm) | ε = 1e-05 |
| Activation | **SwiGLU** | — |
| Tied embeddings | Embedding ↔ LM head | ✅ |
| Vocabulary | ByteLevel BPE | 3,252 |
| Context length | Max sequence | 2,048 |
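Because the weights ship in the `LlamaForCausalLM` format, the table maps directly onto a `transformers` `LlamaConfig`. A sketch of the equivalent configuration (values taken from the table above; unlisted fields keep the library defaults):

```python
from transformers import LlamaConfig

# Configuration matching the architecture table above.
config = LlamaConfig(
    vocab_size=3252,
    hidden_size=768,
    intermediate_size=2048,
    num_hidden_layers=24,
    num_attention_heads=12,
    num_key_value_heads=4,        # grouped-query attention
    max_position_embeddings=2048,
    rope_theta=10000.0,
    rms_norm_eps=1e-5,
    hidden_act="silu",            # SwiGLU MLP
    tie_word_embeddings=True,
)
```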
## Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bgraudt/mythos"

# Load the tokenizer and weights; device_map="auto" picks GPU/MPS when available.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Sample a continuation with the card's default decoding settings.
inputs = tokenizer("The history of artificial intelligence begins with", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.8, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Serving with vLLM
```bash
pip install vllm
python -m vllm.entrypoints.openai.api_server --model bgraudt/mythos
```
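The server exposes an OpenAI-compatible API (port 8000 by default), so any OpenAI client can query it. A minimal sketch using the official `openai` package:

```python
from openai import OpenAI

# Point the client at the local vLLM server; vLLM ignores the API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="bgraudt/mythos",
    prompt="The history of artificial intelligence begins with",
    max_tokens=128,
    temperature=0.8,
    top_p=0.9,
)
print(completion.choices[0].text)
```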
### Serving with llama.cpp
```bash
# Convert the HF checkpoint to GGUF (one-time); --outfile names the result explicitly
python llama.cpp/convert_hf_to_gguf.py mythos --outfile mythos-f16.gguf
./llama-cli -m mythos-f16.gguf -p "Hello"
```
## Training
### Data
- **Corpus:** Wikipedia (English 20231101 snapshot) — 5,000 articles, ~21M tokens
- **Tokenizer:** ByteLevel BPE trained from scratch, vocab size **3,252** (sketched below)
- **Training context:** 512 tokens
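A tokenizer of this shape can be reproduced with the 🤗 `tokenizers` library. The sketch below uses assumed settings: the corpus path, `min_frequency`, and the special-token list are placeholders, not the values used for this checkpoint.

```python
import os

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer from scratch at the vocab size used here.
# "corpus.txt" and the special tokens are placeholders for illustration.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=3252,
    min_frequency=2,
    special_tokens=["<s>", "</s>", "<unk>", "<pad>"],
)

os.makedirs("tokenizer", exist_ok=True)
tokenizer.save_model("tokenizer")  # writes vocab.json and merges.txt
```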
### Hyperparameters
| | |
|---|---:|
| Steps | 5,000 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) |
| LR schedule | Cosine decay, 2,000-step warmup |
| Peak learning rate | 3 × 10⁻⁴ |
| Precision | bfloat16 mixed |
| Hardware | Apple M2 (MPS) |
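In code, the optimizer and schedule rows above translate to roughly the following PyTorch setup. This is a sketch: `model` stands in for the Mythos module, and decaying all the way to zero (rather than to a floor) is an assumption.

```python
import math

import torch

max_steps, warmup_steps, peak_lr = 5_000, 2_000, 3e-4

# AdamW with the betas and weight decay from the table above.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step: int) -> float:
    # Linear warmup to the peak LR, then cosine decay over the remaining steps.
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```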
## Limitations and Intended Use
- **Base model only** — no instruction tuning, no RLHF, no safety alignment.
- English-only; non-English performance is poor.
- May reproduce biases and factual errors from the training distribution.
- Tiny vocabulary (3,252 tokens) severely caps fluency — intended as an architecture demo.
- Not suitable for medical, legal, financial, or other high-stakes applications.
## Citation
```bibtex
@software{graudt2026mythos,
  author  = {Graudt, Boris},
  title   = {Mythos: A Decoder-Only Language Model Built From Scratch},
  year    = {2026},
  url     = {https://github.com/borisgraudt/mythos},
  license = {MIT}
}
```
## Acknowledgements
Architecture inspired by **LLaMA** (Touvron et al., 2023) and **Mistral 7B**
(Jiang et al., 2023). Data pipeline follows the **FineWeb** methodology
(Penedo et al., 2024).