jacquelinehe/tinycomma-1.8b-llama3-tokenizer

datasets	common-pile/comma_v0.1_training_dataset
license	apache-2.0
library_name	transformers
pipeline_tag	text-generation

TinyComma 1.8B

TinyComma 1.8B is a 1.8B-parameter, decoder-only base LM trained entirely on permissively licensed data from the Common Pile. Unlike the official Comma model series, TinyComma 1.8B uses the 128K-vocabulary Llama3 tokenizer, which keeps it compatible with two-model decoding setups that require both models to share a vocabulary. We trained TinyComma 1.8B to support our research on inference-time copyright mitigation.

Paper: Anchored Decoding: Provably Reducing Copyright Risk for Any Language Model
Repository: jacqueline-he/anchored-decoding
Project Page: Interactive Demo
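
The metadata lists `transformers` as the library and `text-generation` as the pipeline tag, so loading the model should look roughly like the sketch below. This is a minimal example, not an official snippet: it assumes the checkpoint loads through the standard `AutoModelForCausalLM` and `AutoTokenizer` classes, and the `generate` helper and its defaults are hypothetical names introduced here for illustration.

```python
MODEL_ID = "jacquelinehe/tinycomma-1.8b-llama3-tokenizer"


def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy-decode a continuation of `prompt` (hypothetical helper)."""
    # Imported lazily so that defining this helper does not require the
    # heavy dependencies to be installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the repository works with the generic Auto* classes.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding, for reproducibility
        )
    # output_ids[0] holds the prompt tokens followed by the continuation.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the weights on first use):
# print(generate("The Common Pile is"))
```

Since this is a base model with no instruction tuning, plain-text prompts to be continued are likely to work better than chat-style instructions.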

Benchmarking TinyComma 1.8B

We benchmarked TinyComma 1.8B against other permissively trained base models on common natural language understanding tasks from the OLMES evaluation suite.

Benchmarking results using OLMES. TinyComma 1.8B outperforms the other models in its size range.

Training details

We trained TinyComma 1.8B using the lingua training framework. Pre-training consists of two stages: (1) a 156B-token generation training stage over the entire Common Pile, following the original domain weights specified by Kandpal et al. (2025), and (2) a 13.5B-token cooldown stage on a weighted mixture of three high-quality domains (70% Wikimedia, 15% DOAB, and 15% Data Provenance Initiative data). Training ran on a single node of eight 140 GiB H200 GPUs. The model configuration and pre-training hyperparameters are listed below:

TinyComma 1.8B model configuration.
Params	Head Dim.	Hidden Size	Attn. Heads	Hidden Layers	KV Heads
1,758,562,304	64	2048	32	24	32
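
As a sanity check, the parameter count in the table can be reproduced from the listed dimensions. The sketch below assumes a standard Llama-style layout (bias-free linear layers, a gated MLP with three projections, two RMSNorm weights per layer plus a final norm, and untied input and output embeddings). Two values are not stated in this card and are assumptions: the exact vocabulary size of 128,256 (the Llama 3 tokenizer's size) and an MLP intermediate size of 5632, chosen because it makes the total match.

```python
def llama_param_count(
    vocab_size: int,
    hidden_size: int,
    num_layers: int,
    num_heads: int,
    num_kv_heads: int,
    head_dim: int,
    intermediate_size: int,
    tie_embeddings: bool = False,
) -> int:
    """Count the weights of a bias-free Llama-style decoder."""
    embed = vocab_size * hidden_size
    lm_head = 0 if tie_embeddings else vocab_size * hidden_size

    # Attention: query/output use all heads, key/value use the KV heads.
    q_proj = hidden_size * num_heads * head_dim
    k_proj = hidden_size * num_kv_heads * head_dim
    v_proj = hidden_size * num_kv_heads * head_dim
    o_proj = num_heads * head_dim * hidden_size

    # Gated MLP: gate, up, and down projections.
    mlp = 3 * hidden_size * intermediate_size

    # Two RMSNorm weight vectors per layer.
    norms = 2 * hidden_size

    per_layer = q_proj + k_proj + v_proj + o_proj + mlp + norms
    final_norm = hidden_size
    return embed + lm_head + num_layers * per_layer + final_norm


total = llama_param_count(
    vocab_size=128_256,      # assumption: Llama 3 tokenizer vocabulary
    hidden_size=2048,
    num_layers=24,
    num_heads=32,
    num_kv_heads=32,         # equal to num_heads, i.e. no grouped-query sharing
    head_dim=64,
    intermediate_size=5632,  # assumption: not stated in this card
)
print(f"{total:,}")  # 1,758,562,304
```

Under these assumptions the two embedding matrices account for about 525M of the total, which is a consequence of pairing a 128K vocabulary with a 2048-wide model.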

TinyComma 1.8B pre-training configuration.
Hyperparameters	Values
Optimizer	AdamW (β₁=0.9, β₂=0.95)
Learning rate	3e-3 for Stage 1, 1e-3 for Stage 2
Weight decay	0.033 for Stage 1
Batch size	4M tokens
Warmup	1000 steps for Stage 1, none for Stage 2
Schedule	Cosine schedule for Stage 1, linear schedule for Stage 2
Sequence length	Pack to 2048 tokens
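
To make the two-stage schedule concrete, here is a rough sketch of the step counts and learning-rate curves implied by the table. Several details are assumptions rather than facts from this card: "4M tokens" is read as 2048 sequences of 2048 tokens (4,194,304 tokens), warmup rises linearly from zero, and both schedules decay to a final learning rate of zero. The function names are hypothetical.

```python
import math

# Assumption: "4M tokens" means 2048 packed sequences x 2048 tokens.
BATCH_TOKENS = 2048 * 2048


def num_steps(total_tokens: int, batch_tokens: int = BATCH_TOKENS) -> int:
    """Number of full optimizer steps that fit in a token budget."""
    return total_tokens // batch_tokens


def stage1_lr(step: int, total_steps: int, peak_lr: float = 3e-3,
              warmup_steps: int = 1000, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


def stage2_lr(step: int, total_steps: int, peak_lr: float = 1e-3,
              min_lr: float = 0.0) -> float:
    """Linear decay from peak_lr to min_lr with no warmup."""
    progress = min(step / max(1, total_steps), 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


STAGE1_STEPS = num_steps(156_000_000_000)  # 156B-token main stage
STAGE2_STEPS = num_steps(13_500_000_000)   # 13.5B-token cooldown stage

# Token split of the cooldown mixture (70% / 15% / 15%).
COOLDOWN_TOKENS = {
    "Wikimedia": 0.70 * 13.5e9,
    "DOAB": 0.15 * 13.5e9,
    "Data Provenance Initiative": 0.15 * 13.5e9,
}

if __name__ == "__main__":
    print("stage 1 steps:", STAGE1_STEPS)
    print("stage 2 steps:", STAGE2_STEPS)
    for s in (0, 500, 1000, STAGE1_STEPS // 2, STAGE1_STEPS):
        print(f"stage 1 step {s}: lr = {stage1_lr(s, STAGE1_STEPS):.6f}")
    for s in (0, STAGE2_STEPS // 2, STAGE2_STEPS):
        print(f"stage 2 step {s}: lr = {stage2_lr(s, STAGE2_STEPS):.6f}")
```

With this reading of the batch size, Stage 1 runs for roughly 37K steps and Stage 2 for roughly 3.2K steps. Reading "4M" as exactly 4,000,000 tokens would give about 39K and 3.4K steps instead.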

Citation

@article{he2026anchored,
  title={{Anchored Decoding: Provably Reducing Copyright Risk for Any Language Model}},
  author={Jacqueline He and Jonathan Hayase and Wen-tau Yih and Sewoong Oh and Luke Zettlemoyer and Pang Wei Koh},
  journal={arXiv preprint},
  year={2026}
}
