---
datasets:
- common-pile/comma_v0.1_training_dataset
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

# TinyComma 1.8B

TinyComma 1.8B is a 1.8B-parameter, decoder-only base LM trained entirely on permissively licensed data from the [Common Pile](https://huggingface.co/collections/common-pile/common-pile-v01). Unlike the official Comma model series, TinyComma 1.8B uses the 128K-vocabulary [Llama3](https://huggingface.co/collections/meta-llama/llama-31) tokenizer to ensure compatibility with two-model decoding setups. We trained TinyComma 1.8B to support our research on inference-time copyright mitigation.

- **Paper:** [Anchored Decoding: Provably Reducing Copyright Risk for Any Language Model](https://arxiv.org/abs/2602.07120)
- **Repository:** [jacqueline-he/anchored-decoding](https://github.com/jacqueline-he/anchored-decoding)
- **Project Page:** [Interactive Demo](https://tinyurl.com/anchored-decoding-demo)

## Benchmarking TinyComma 1.8B

We benchmarked TinyComma 1.8B and several other permissively trained base models on common natural language understanding tasks from the [OLMES](https://github.com/allenai/olmes) evaluation suite.


Benchmarking results using OLMES. TinyComma 1.8B outperforms other models in its size range.

## Training details

We trained TinyComma 1.8B using the [lingua](https://github.com/facebookresearch/lingua/) training framework. Pre-training consists of two stages: (1) a 156B-token generation training stage over the entire Common Pile, following the original domain weights specified by [Kandpal et al., 2025](https://arxiv.org/pdf/2506.05209#page=49.20), and (2) a 13.5B-token cooldown stage on a weighted mixture of three high-quality domains (70% Wikimedia, 15% DOAB, and 15% Data Provenance Initiative data). We trained on a single node of eight 140 GiB H200 GPUs. Model configuration and pre-training hyperparameter details are below:


TinyComma 1.8B model configuration.

| Params | Head Dim. | Hidden Size | Attn. Heads | Hidden Layers | KV Heads |
| --- | --- | --- | --- | --- | --- |
| 1,758,562,304 | 64 | 2048 | 32 | 24 | 32 |
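
As a sanity check on the configuration above, the parameter count can be reproduced with a short calculation. This is a minimal sketch, not the authors' code: it assumes a Llama-style block (bias-free linear layers, two RMSNorm weight vectors per layer plus a final norm, a three-matrix SwiGLU MLP) with untied input and output embeddings. The vocabulary size of 128,256 (the Llama 3 tokenizer size) and the feed-forward width of 5,632 are assumptions that do not appear in this card; the function name is hypothetical.

```python
def llama_style_param_count(
    vocab_size: int,
    hidden_size: int,
    n_layers: int,
    n_heads: int,
    n_kv_heads: int,
    head_dim: int,
    ffn_hidden_size: int,
    tie_embeddings: bool = False,
) -> int:
    """Count parameters of a Llama-style decoder (assumed architecture)."""
    # Token embedding, plus a separate output projection when untied.
    embed = vocab_size * hidden_size
    lm_head = 0 if tie_embeddings else vocab_size * hidden_size

    # Attention projections (no biases).
    q_proj = hidden_size * n_heads * head_dim
    kv_proj = 2 * hidden_size * n_kv_heads * head_dim
    o_proj = n_heads * head_dim * hidden_size

    # SwiGLU MLP: gate, up, and down projections.
    mlp = 3 * hidden_size * ffn_hidden_size

    # Two RMSNorm weight vectors per layer.
    norms = 2 * hidden_size

    per_layer = q_proj + kv_proj + o_proj + mlp + norms
    final_norm = hidden_size
    return embed + lm_head + n_layers * per_layer + final_norm


total = llama_style_param_count(
    vocab_size=128_256,     # assumption: Llama 3 tokenizer vocabulary size
    hidden_size=2048,       # from the table
    n_layers=24,            # from the table
    n_heads=32,             # from the table
    n_kv_heads=32,          # from the table (equal to n_heads, i.e. no grouped-query sharing)
    head_dim=64,            # from the table
    ffn_hidden_size=5632,   # assumption: not stated in this card
)
print(f"{total:,}")
```

Under these assumptions the result equals the 1,758,562,304 parameters listed in the table. Note that the two embedding matrices alone account for roughly 525M of that total, a consequence of the large vocabulary.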


TinyComma 1.8B pretraining configuration.

| Hyperparameters | Values |
| --- | --- |
| Optimizer | AdamW (β1=0.9, β2=0.95) |
| Learning rate | 3e-3 for Stage 1, 1e-3 for Stage 2 |
| Weight decay | 0.033 for Stage 1 |
| Batch size | 4M tokens |
| Warmup | 1000 steps for Stage 1, none for Stage 2 |
| Schedule | Cosine schedule for Stage 1, linear schedule for Stage 2 |
| Sequence length | Pack to 2048 tokens |
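
To make the two-stage schedule concrete, the following sketch derives approximate step counts from the token budgets and evaluates the learning rate at a given step. It is illustrative only and is not taken from the lingua configuration: it reads "4M tokens" as 4,000,000 (the card does not say whether this means 4e6 or 2^22), assumes linear warmup, and assumes both schedules decay to zero. The function names and the `min_ratio` and `final_ratio` parameters are hypothetical.

```python
import math

TOKENS_PER_STEP = 4_000_000  # assumption: "4M tokens" read as 4e6


def num_steps(total_tokens: int, tokens_per_step: int = TOKENS_PER_STEP) -> int:
    """Number of optimizer steps needed to consume a token budget."""
    return math.ceil(total_tokens / tokens_per_step)


def stage1_lr(step: int, total_steps: int, peak_lr: float = 3e-3,
              warmup_steps: int = 1000, min_ratio: float = 0.0) -> float:
    """Linear warmup (assumed) followed by cosine decay to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)


def stage2_lr(step: int, total_steps: int, peak_lr: float = 1e-3,
              final_ratio: float = 0.0) -> float:
    """Linear decay with no warmup, ending at final_ratio * peak_lr."""
    progress = min(1.0, step / max(1, total_steps))
    return peak_lr * (1.0 - (1.0 - final_ratio) * progress)


stage1_steps = num_steps(156_000_000_000)  # Stage 1: 156B tokens
stage2_steps = num_steps(13_500_000_000)   # Stage 2: 13.5B tokens

print(stage1_steps, stage2_steps)
print(stage1_lr(stage1_steps // 2, stage1_steps))
print(stage2_lr(stage2_steps // 2, stage2_steps))
```

With these assumptions, Stage 1 runs for 39,000 steps and Stage 2 for 3,375 steps. If the batch is instead 2^22 tokens, the step counts shrink by about 5%.
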

## Citation

```bibtex
@article{he2026anchored,
  title={{Anchored Decoding: Provably Reducing Copyright Risk for Any Language Model}},
  author={Jacqueline He and Jonathan Hayase and Wen-tau Yih and Sewoong Oh and Luke Zettlemoyer and Pang Wei Koh},
  journal={arXiv preprint},
  year={2026}
}
```