Note: the base checkpoint comes from the "final model" commit fad4f1a5cd0563ac41349b8fec2e6e51156568a0, which was subsequently reverted. It is not the 3T checkpoint currently on the main branch of TinyLlama-1.1B.
The quantized model fits alongside a 4.25bpw 70B model at 32k sequence length on a single A6000 and provides a noticeable speed-up when used as the draft model for speculative decoding.
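
As a rough sanity check on the memory claim, the sketch below estimates weight and KV-cache sizes. It is a back-of-the-envelope calculation, not a measurement: the layer counts, KV-head counts, and head dimensions are assumed to match Llama-2-70B and TinyLlama-1.1B, the FP16 cache is an assumption, and `DRAFT_BPW` is a hypothetical placeholder for this model's bitrate. Activations and loader overhead are ignored.

```python
GIB = 2**30


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int) -> int:
    """Size of the key and value caches for one sequence, in bytes."""
    # Factor 2 covers keys plus values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


SEQ_LEN = 32 * 1024   # 32k context
DRAFT_BPW = 4.0       # hypothetical bitrate for the draft model

if __name__ == "__main__":
    # Assumed shapes: Llama-2-70B (80 layers, 8 KV heads, head dim 128).
    target = weight_bytes(70e9, 4.25) + kv_cache_bytes(80, 8, 128, SEQ_LEN, 2)
    # Assumed shapes: TinyLlama-1.1B (22 layers, 4 KV heads, head dim 64).
    draft = weight_bytes(1.1e9, DRAFT_BPW) + kv_cache_bytes(22, 4, 64, SEQ_LEN, 2)
    print(f"target: {target / GIB:.2f} GiB")
    print(f"draft:  {draft / GIB:.2f} GiB")
    print(f"total:  {(target + draft) / GIB:.2f} GiB")
```

Under these assumptions the draft model adds only a small fraction of the target model's footprint, which is why pairing the two on one card is plausible. A quantized KV cache would lower the totals further.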
Wikitext (wikitext-2-raw-v1_train) Perplexity (64 rows) as evaluated via exllamav2: