license: mit
datasets: lennart-finke/SimpleStories
tags: small-language-model, story-generation, text-generation, efficient-nlp, distilled-models
SimpleStories Model Family
The SimpleStories models are a family of tiny language models created for interpretability research and trained on the SimpleStories dataset. This is the second iteration of the model family.
Paper: https://arxiv.org/abs/2504.09184
Training code: https://github.com/simple-stories/simple_stories_train
Training checkpoints: https://wandb.ai/finke/simplestories-v2
Usage
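A minimal sketch of loading one of the models and sampling a story with the Hugging Face `transformers` library. The repository id pattern `SimpleStories/SimpleStories-<size>`, the helper names `repo_id` and `generate_story`, and the sampling settings are assumptions that this card does not state; check the model page for the exact id before relying on it.

```python
# Sizes taken from the Model Variants table in this card.
MODEL_SIZES = ("35M", "30M", "11M", "5M", "1.25M")


def repo_id(size: str) -> str:
    """Build a Hub repository id. The 'SimpleStories/' namespace is an assumption."""
    if size not in MODEL_SIZES:
        raise ValueError(f"unknown model size: {size!r}")
    return f"SimpleStories/SimpleStories-{size}"


def generate_story(prompt: str, size: str = "35M", max_new_tokens: int = 100) -> str:
    """Download the model and tokenizer, then sample a continuation of `prompt`."""
    # Imported lazily so repo_id() can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id(size))
    model = AutoModelForCausalLM.from_pretrained(repo_id(size))
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    output_ids = model.generate(
        inputs.input_ids,
        max_new_tokens=max_new_tokens,  # keep prompt + output within n_ctx = 512
        do_sample=True,
        temperature=0.7,  # illustrative value, not a recommended setting
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_story("the curious cat looked at the")` downloads the weights on first use and returns the prompt followed by the sampled continuation.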
Model Variants
| Model Name | n_params | n_layers | d_model | n_heads | n_ctx | d_vocab |
|---|---|---|---|---|---|---|
| SimpleStories-35M | 35 million | 12 | 512 | 8 | 512 | 4019 |
| SimpleStories-30M | 30 million | 10 | 512 | 8 | 512 | 4019 |
| SimpleStories-11M | 11 million | 6 | 384 | 6 | 512 | 4019 |
| SimpleStories-5M | 5 million | 6 | 256 | 4 | 512 | 4019 |
| SimpleStories-1.25M | 1.25 million | 4 | 128 | 4 | 512 | 4019 |
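The variants can also be written down as data, which makes a couple of derived quantities easy to check. The sketch below is illustrative only: the `ModelConfig` class and its properties are hypothetical names, the per-head dimension assumes `d_model` is split evenly across heads, and the embedding count assumes a single `d_vocab` by `d_model` token-embedding matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hyperparameters copied from the Model Variants table."""
    name: str
    n_layers: int
    d_model: int
    n_heads: int
    n_ctx: int
    d_vocab: int

    @property
    def d_head(self) -> int:
        # Assumption: d_model is divided evenly across attention heads.
        return self.d_model // self.n_heads

    @property
    def embedding_params(self) -> int:
        # Assumption: one d_vocab x d_model token-embedding matrix.
        return self.d_vocab * self.d_model


VARIANTS = [
    ModelConfig("SimpleStories-35M", 12, 512, 8, 512, 4019),
    ModelConfig("SimpleStories-30M", 10, 512, 8, 512, 4019),
    ModelConfig("SimpleStories-11M", 6, 384, 6, 512, 4019),
    ModelConfig("SimpleStories-5M", 6, 256, 4, 512, 4019),
    ModelConfig("SimpleStories-1.25M", 4, 128, 4, 512, 4019),
]

for cfg in VARIANTS:
    print(f"{cfg.name}: d_head={cfg.d_head}, embedding_params={cfg.embedding_params:,}")
```

Under these assumptions every variant uses 64-dimensional heads except SimpleStories-1.25M, which uses 32.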
Dataset
The SimpleStories dataset is a collection of short stories generated by state-of-the-art language models. It features:
Stories annotated with high-level concepts such as theme, topic, and style
Higher semantic and syntactic diversity through seeded story generation
Generated by 2024 models
Several NLP metrics pre-computed to aid filtering
ASCII-only guarantee for the English dataset
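A small sketch of streaming a few stories with the Hugging Face `datasets` library and spot-checking the ASCII-only property. The dataset id comes from this card; the `train` split, the `story` column name, and the helper names are assumptions and may need adjusting to the actual schema.

```python
def is_ascii_only(text: str) -> bool:
    """Return True if every character in `text` is ASCII."""
    return text.isascii()


def iter_stories(limit: int = 5, text_column: str = "story"):
    """Yield the first `limit` stories without downloading the full dataset."""
    # Imported lazily so is_ascii_only() can be used without datasets installed.
    from datasets import load_dataset

    # Assumption: a 'train' split exists and stories live in `text_column`.
    ds = load_dataset("lennart-finke/SimpleStories", split="train", streaming=True)
    for i, row in enumerate(ds):
        if i >= limit:
            break
        yield row[text_column]
```

For example, `all(is_ascii_only(s) for s in iter_stories(100))` spot-checks the first hundred stories of the English dataset.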
Key improvements over the previous version
Improved evaluation scores from training for more epochs
Pruned and optimized tokenizer, reducing the vocabulary size from 4096 to 4019
Training checkpoints are saved periodically to Weights & Biases (wandb) for further research