ModelHub XC d5adc6a928 Initialize project; model provided by the ModelHub XC community

Model: SimpleStories/SimpleStories-V2-5M
Source: Original Platform

2026-04-11 17:43:00 +08:00

SimpleStories Model Family

The SimpleStories models are a tiny model family created for interpretability research, trained on the SimpleStories dataset. This is the second iteration of the model family.

Paper: https://arxiv.org/abs/2504.09184
Training code: https://github.com/simple-stories/simple_stories_train
Training checkpoints: https://wandb.ai/finke/simplestories-v2

Usage

import torch
from transformers import AutoTokenizer, LlamaForCausalLM

# Pick a variant from the Model Variants table, e.g. "5M"
MODEL_SIZE = "5M"
model_path = f"SimpleStories/SimpleStories-V2-{MODEL_SIZE}"

# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(model_path)
model.to(device)
model.eval()

prompt = "The curious cat looked at the"

# Tokenize the prompt without adding special tokens
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
input_ids = inputs.input_ids.to(device)

# Generation stops when this token id is produced
eos_token_id = 1

with torch.no_grad():
    output_ids = model.generate(
        input_ids=input_ids,
        max_new_tokens=400,
        temperature=0.7,
        do_sample=True,
        eos_token_id=eos_token_id,
    )

output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(f"\nGenerated text:\n{output_text}")
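
Every variant lists a context length (n_ctx) of 512 tokens, so it is probably safest to keep the prompt length plus max_new_tokens within that limit. The following is a minimal sketch of such a check. The helper name clamp_new_tokens is hypothetical and not part of transformers, and treating 512 as a hard limit is an assumption rather than something the model card states.

```python
N_CTX = 512  # context length listed for every SimpleStories variant


def clamp_new_tokens(prompt_len: int, requested: int, n_ctx: int = N_CTX) -> int:
    """Return the largest generation budget that keeps prompt + output within n_ctx.

    Hypothetical helper for illustration; not part of any library.
    """
    if prompt_len < 0 or requested < 0:
        raise ValueError("lengths must be non-negative")
    remaining = max(n_ctx - prompt_len, 0)
    return min(requested, remaining)


# A short prompt leaves room for all 400 requested tokens.
print(clamp_new_tokens(prompt_len=6, requested=400))
# A 200-token prompt leaves only 312 positions in the context window.
print(clamp_new_tokens(prompt_len=200, requested=400))
```

With the usage snippet above, this could be applied as max_new_tokens=clamp_new_tokens(input_ids.shape[1], 400).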

Model Variants

Model Name	n_params	n_layers	d_model	n_heads	n_ctx	d_vocab
SimpleStories-35M	35 million	12	512	8	512	4019
SimpleStories-30M	30 million	10	512	8	512	4019
SimpleStories-11M	11 million	6	384	6	512	4019
SimpleStories-5M	5 million	6	256	4	512	4019
SimpleStories-1.25M	1.25 million	4	128	4	512	4019
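
For interpretability work it can help to derive a few quantities from the table. The sketch below encodes the rows as data and computes the per-head dimension and the size of one token-embedding matrix. It assumes that d_model is split evenly across attention heads and that the embedding matrix has d_vocab x d_model entries; the Variant class and its property names are illustrative, not taken from the training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One row of the Model Variants table (illustrative container)."""
    name: str
    n_layers: int
    d_model: int
    n_heads: int
    n_ctx: int
    d_vocab: int

    @property
    def d_head(self) -> int:
        # Assumption: d_model is divided evenly across the attention heads.
        return self.d_model // self.n_heads

    @property
    def embedding_params(self) -> int:
        # Entries in a single d_vocab x d_model token-embedding matrix.
        return self.d_vocab * self.d_model


VARIANTS = [
    Variant("SimpleStories-35M", 12, 512, 8, 512, 4019),
    Variant("SimpleStories-30M", 10, 512, 8, 512, 4019),
    Variant("SimpleStories-11M", 6, 384, 6, 512, 4019),
    Variant("SimpleStories-5M", 6, 256, 4, 512, 4019),
    Variant("SimpleStories-1.25M", 4, 128, 4, 512, 4019),
]

for v in VARIANTS:
    print(f"{v.name}: d_head={v.d_head}, embedding_params={v.embedding_params:,}")
```

Under these assumptions, the three largest variants share a head dimension of 64, and only the 1.25M variant uses a smaller one.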

Dataset

The SimpleStories dataset is a collection of short stories generated by state-of-the-art language models. It features:

Story annotation with high-level concepts: theme, topic, style, etc.
Higher semantic and syntactic diversity through seeded story generation
Generated by 2024 models
Several NLP metrics pre-computed to aid filtering
ASCII-only guarantee for the English dataset
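
The ASCII-only property listed above is easy to verify on any text before mixing it with the dataset. The following is a minimal sketch, not the dataset authors' own validation code. The function names and the two sample strings are made up for illustration.

```python
def is_ascii_only(text: str) -> bool:
    """Return True if every character is in the 7-bit ASCII range."""
    return text.isascii()


def non_ascii_chars(text: str) -> list[str]:
    """Return the distinct non-ASCII characters in order of first appearance."""
    seen: dict[str, None] = {}
    for ch in text:
        if ord(ch) > 127:
            seen.setdefault(ch, None)
    return list(seen)


# Hypothetical sample texts; the second contains curly quotes and an accented letter.
stories = [
    "The curious cat looked at the moon.",
    "She said \u201chello\u201d to the caf\u00e9 owner.",
]

for story in stories:
    print(is_ascii_only(story), non_ascii_chars(story))
```

Listing the offending characters, rather than only returning a boolean, makes it easier to decide whether to normalize a text or drop it.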

Key improvements from previous version

Improved evaluation scores due to an increased number of training epochs
Pruning and optimization of the tokenizer, reducing the vocabulary size from 4096 to 4019
Model training checkpoints are stored periodically in Weights & Biases (wandb) for further research