Kernels: Q2_0 is not yet in mainline llama.cpp. Use our fork at PrismML-Eng/llama.cpp (prism branch, default) which adds Q2_0 support for CPU (NEON/generic) and Metal. Upstream PR coming soon.
Model Overview

| Item | Specification |
| --- | --- |
| Base model | Qwen3-1.7B |
| Parameters | 1.72B |
| Architecture | GQA, SwiGLU MLP, RoPE, RMSNorm |
| Context length | 32,768 tokens |
| Vocab size | 151,936 |
| Weight format | GGUF Q2_0 g128: {-1, 0, +1} with FP16 group-wise scaling |
| Packed Q2_0 size | 436 MiB (0.46 GB) |
| Ternary coverage | Embeddings, attention projections, MLP projections, LM head |
| License | Apache 2.0 |

Quantization Format: GGUF Q2_0 (g128)
Each weight takes a value from {-1, 0, +1}, with one shared FP16 scale per group of 128 weights:
w_i = scale_g * t_i, t_i in {-1, 0, +1}
Q2_0 encodes each weight as a 2-bit code q in {0, 1, 2, 3}, dequantized via w = (q - 1) * scale. One 128-element block is 34 bytes (2 bytes FP16 scale + 32 bytes of packed 2-bit codes) for an effective 2.125 bits/weight. The fourth code point (q = 3, reconstructing to +2 * scale) is reserved for future extensions; for ternary weights it is unused.
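The block layout described above can be illustrated with a minimal Python sketch. It follows the stated rules (q = t + 1 on encode, w = (q - 1) * scale on decode, 2-byte FP16 scale followed by 32 bytes of codes). The order of the four codes inside each byte and the little-endian scale are assumptions made for illustration; the actual bit layout used by the fork's kernels may differ. The names `pack_block` and `unpack_block` are hypothetical.

```python
import struct

GROUP_SIZE = 128  # weights per block (g128)

def pack_block(ternary, scale):
    """Pack 128 ternary values in {-1, 0, +1} and one FP16 scale into 34 bytes."""
    if len(ternary) != GROUP_SIZE:
        raise ValueError("expected exactly 128 weights")
    if any(t not in (-1, 0, 1) for t in ternary):
        raise ValueError("weights must be ternary")
    codes = [t + 1 for t in ternary]  # q in {0, 1, 2}; q = 3 stays unused
    packed = bytearray()
    for i in range(0, GROUP_SIZE, 4):
        q0, q1, q2, q3 = codes[i:i + 4]
        # ASSUMPTION: four codes per byte, lowest-index code in the low bits.
        packed.append(q0 | (q1 << 2) | (q2 << 4) | (q3 << 6))
    # ASSUMPTION: scale stored first, as little-endian IEEE half precision.
    return struct.pack("<e", scale) + bytes(packed)

def unpack_block(block):
    """Return (scale, weights) with weights dequantized as (q - 1) * scale."""
    (scale,) = struct.unpack_from("<e", block, 0)
    weights = []
    for byte in block[2:]:
        for shift in (0, 2, 4, 6):
            q = (byte >> shift) & 0b11
            weights.append((q - 1) * scale)
    return scale, weights
```

Because every ternary value maps to its own code point, a pack followed by an unpack returns the original weights exactly whenever the scale is representable in FP16.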
Memory

| Format | Size | Reduction | Ratio |
| --- | --- | --- | --- |
| FP16 | 3.44 GB | -- | 1.0x |
| GGUF Q2_0 g128 | 436 MiB (0.46 GB) | 86.6% | 7.5x |

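The memory figures can be approximated from the parameter count and the 34-byte block size. The sketch below is a back-of-the-envelope estimate only: it assumes every one of the 1.72B parameters is stored in Q2_0 and ignores GGUF metadata and any tensors kept in other formats, so it agrees with the reported numbers only to within rounding.

```python
PARAMS = 1.72e9      # parameter count from the overview table
BLOCK_BYTES = 34     # 2-byte FP16 scale + 32 bytes of packed 2-bit codes
GROUP_SIZE = 128     # weights per block

bits_per_weight = BLOCK_BYTES * 8 / GROUP_SIZE   # 2.125 bits/weight
q2_bytes = PARAMS * bits_per_weight / 8          # estimated Q2_0 payload
fp16_bytes = PARAMS * 2                          # 16 bits per weight

q2_mib = q2_bytes / 2**20
q2_gb = q2_bytes / 1e9
fp16_gb = fp16_bytes / 1e9
reduction = 1 - q2_bytes / fp16_bytes            # fraction of memory saved
ratio = fp16_bytes / q2_bytes                    # compression ratio

print(f"Q2_0: {q2_mib:.0f} MiB ({q2_gb:.2f} GB), FP16: {fp16_gb:.2f} GB")
print(f"reduction: {reduction:.1%}, ratio: {ratio:.1f}x")
```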
Files in this repo

| File | Format | Size | Recommended |
| --- | --- | --- | --- |
| Ternary-Bonsai-1.7B-F16.gguf | FP16 | 3.44 GB | baseline / re-quantization source |
| Ternary-Bonsai-1.7B-Q2_0.gguf | Q2_0 (g128) | 442 MB | recommended (lossless for ternary) |

Quickstart
Build from the Prism fork
git clone https://github.com/PrismML-Eng/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON # or -DGGML_CUDA=ON, -DGGML_VULKAN=ON
cmake --build build -j
Flags: -ngl 99 -fa 1 for Metal; -ngl 0 -fa 1 -t 10 for CPU.
Fidelity (Q2_0 vs FP16 baseline)
Measured on wikitext-2 (20 chunks, context 512) via llama-perplexity --kl-divergence:

| Metric | Value |
| --- | --- |
| Mean KL | 0.000000 |
| Top-1 agreement | 100.000 % |
| RMS Δp | 0.015 % |
| PPL ratio (Q/base) | 1.0048 |

Q2_0 is effectively lossless for ternary weights: each ternary value lands exactly on one of three of the four 2-bit code points, so packing introduces no rounding of the weights themselves.
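For readers unfamiliar with the first two metrics, here is a minimal sketch of how mean KL divergence and top-1 agreement can be computed from paired logits. It uses the textbook definitions in plain Python; it is not the implementation inside llama-perplexity, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def compare(base_logits, quant_logits):
    """Return (KL(base || quant), top-1 match) for one token position."""
    p = softmax(base_logits)
    q = softmax(quant_logits)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    same_top1 = p.index(max(p)) == q.index(max(q))
    return kl, same_top1

def summarize(pairs):
    """Mean KL and top-1 agreement (percent) over (base, quant) logit pairs."""
    results = [compare(b, q) for b, q in pairs]
    mean_kl = sum(kl for kl, _ in results) / len(results)
    top1 = 100.0 * sum(1 for _, same in results if same) / len(results)
    return mean_kl, top1
```

When the quantized model reproduces the baseline logits exactly, `summarize` returns a mean KL of 0 and 100 % top-1 agreement.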
Benchmarks
Evaluated with EvalScope v1.4.2 + vLLM 0.15.1 on NVIDIA H100. Full benchmark suite:

| Model | Size | Avg | MMLU-R | MuSR | IFEval | GSM8K | HE+ | BFCLv3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ternary Bonsai 1.7B | 0.44 GB | 58.47 | 52.9 | 50.8 | 70.1 | 74.2 | 51.8 | 51.0 |
| 1-bit Bonsai 1.7B (prior) | 0.24 GB | 49.60 | 43.2 | 45.1 | 63.0 | 66.3 | 45.1 | 34.9 |
| Qwen3 1.7B | 3.44 GB | 66.57 | 66.8 | 50.1 | 70.3 | 83.1 | 57.3 | 71.8 |
| Qwen3 0.6B | 1.19 GB | 48.02 | 47.5 | 41.5 | 62.8 | 64.1 | 30.5 | 41.7 |
| LFM2 1.2B | 2.34 GB | 46.73 | 52.9 | 25.4 | 77.5 | 62.2 | 36.0 | 26.4 |
| Gemma3 1B | 2.00 GB | 45.53 | 43.2 | 37.0 | 61.9 | 64.4 | 40.2 | 26.5 |
| Llama 3.2 1B | 2.47 GB | 39.88 | 47.2 | 29.2 | 47.7 | 49.0 | 35.4 | 30.8 |

Intelligence Density
density = -ln(1 - score/100) / size_GB

| Model | Size | Intelligence Density (1/GB) |
| --- | --- | --- |
| Ternary Bonsai 1.7B | 0.44 GB | 2.001 |
| 1-bit Bonsai 1.7B (prior) | 0.24 GB | 2.832 |
| Qwen3 0.6B | 1.19 GB | 0.549 |
| Qwen3 1.7B | 3.44 GB | 0.318 |
| Gemma3 1B | 2.00 GB | 0.304 |
| LFM2 1.2B | 2.34 GB | 0.269 |
| Llama 3.2 1B | 2.47 GB | 0.206 |

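The density formula can be evaluated directly. The sketch below assumes that `score` is the Avg column of the benchmark table and that `size_GB` is the Size column; with those inputs it reproduces the listed densities to within about 1 %, and the remaining gap presumably comes from the sizes being rounded to two decimals. The function name is hypothetical.

```python
import math

def intelligence_density(avg_score, size_gb):
    """density = -ln(1 - score/100) / size_GB, in units of 1/GB."""
    if not 0 <= avg_score < 100:
        raise ValueError("score must be in [0, 100)")
    if size_gb <= 0:
        raise ValueError("size must be positive")
    return -math.log(1 - avg_score / 100) / size_gb

# (Avg, size in GB) pairs taken from the benchmark table.
models = {
    "Ternary Bonsai 1.7B": (58.47, 0.44),
    "Qwen3 1.7B": (66.57, 3.44),
    "Qwen3 0.6B": (48.02, 1.19),
}
for name, (score, size) in models.items():
    print(f"{name}: {intelligence_density(score, size):.3f} 1/GB")
```

The logarithm makes the measure grow faster as the score approaches 100, so each additional point counts for more at the high end than at the low end.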
Citation
@techreport{ternarybonsai,
  title  = {Ternary Bonsai: 1.58-bit Language Models at 8B, 4B, and 1.7B Scale},
  author = {Prism ML},
  year   = {2026},
  month  = {April},
  url    = {https://prismml.com}
}