Kernels: Q2_0 is not yet in mainline llama.cpp. Use our fork at PrismML-Eng/llama.cpp (prism branch, the default), which adds Q2_0 support for CPU (NEON/generic) and Metal. An upstream PR is coming soon.
GGUF Q2_0 g128: {-1, 0, +1} with FP16 group-wise scaling
Packed Q2_0 size: 1,020 MiB (1.07 GB)
Ternary coverage: Embeddings, attention projections, MLP projections, LM head
License: Apache 2.0
Quantization Format: GGUF Q2_0 (g128)
Each weight takes a value from {-1, 0, +1}, with one shared FP16 scale per group of 128 weights:
w_i = scale_g * t_i, t_i in {-1, 0, +1}
Q2_0 encodes each weight as a 2-bit code q in {0, 1, 2, 3}, dequantized via w = (q - 1) * scale. One 128-element block is 34 bytes (2 bytes FP16 scale + 32 bytes of packed 2-bit codes) for an effective 2.125 bits/weight. The fourth code point (q = 3, reconstructing to +2 * scale) is reserved for future extensions; for ternary weights it is unused.
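The block layout described above can be sketched in a few lines of Python. This is an illustrative sketch, not the fork's kernel code: the helper names `pack_block` and `unpack_block` are hypothetical, and the byte order (little-endian FP16 scale first, then four codes per byte with the lowest bits first) is an assumption, since the text only fixes the sizes and the dequantization rule.

```python
import struct

GROUP_SIZE = 128  # weights per block (g128)


def pack_block(scale, ternary):
    """Pack one group of 128 ternary values into a 34-byte block.

    Assumed layout (not confirmed by the source): little-endian FP16 scale,
    followed by 32 bytes holding four 2-bit codes each, lowest bits first.
    """
    if len(ternary) != GROUP_SIZE:
        raise ValueError("expected exactly 128 weights per block")
    if any(t not in (-1, 0, 1) for t in ternary):
        raise ValueError("weights must be in {-1, 0, +1}")
    codes = [t + 1 for t in ternary]  # q = t + 1, so q is in {0, 1, 2}; q = 3 is unused
    packed = bytearray()
    for i in range(0, GROUP_SIZE, 4):
        q0, q1, q2, q3 = codes[i:i + 4]
        packed.append(q0 | (q1 << 2) | (q2 << 4) | (q3 << 6))
    return struct.pack("<e", scale) + bytes(packed)


def unpack_block(block):
    """Dequantize a 34-byte block to 128 floats via w = (q - 1) * scale."""
    (scale,) = struct.unpack("<e", block[:2])
    weights = []
    for byte in block[2:]:
        for shift in (0, 2, 4, 6):
            q = (byte >> shift) & 0b11
            weights.append((q - 1) * scale)
    return weights


ternary = [(-1, 0, 1)[i % 3] for i in range(GROUP_SIZE)]
block = pack_block(0.5, ternary)
print(len(block))                   # 34
print(len(block) * 8 / GROUP_SIZE)  # 2.125
```

The two printed values match the 34-byte block size and the 2.125 bits/weight figure quoted above.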
Memory
| Format | Size | Reduction | Ratio |
|---|---|---|---|
| FP16 | 8.04 GB | -- | 1.0x |
| GGUF Q2_0 g128 | 1,020 MiB (1.07 GB) | 86.3% | 7.3x |
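As a rough cross-check of the packed size, the 34-byte block format can be applied to the weight count implied by the FP16 file. The sketch below rests on two simplifying assumptions that the source does not state: that the FP16 file is essentially all weights at 2 bytes each, and that every weight is stored in Q2_0 blocks (GGUF metadata and any non-ternary tensors are ignored).

```python
FP16_SIZE_BYTES = 8.04e9   # FP16 file size from the table (decimal GB)
BYTES_PER_FP16_WEIGHT = 2
BLOCK_BYTES = 34           # 2-byte FP16 scale + 32 bytes of 2-bit codes
GROUP_SIZE = 128           # weights per block

# Assumption: the FP16 file is essentially all weights.
n_weights = FP16_SIZE_BYTES / BYTES_PER_FP16_WEIGHT

# Assumption: every weight lives in a Q2_0 block.
q2_0_bytes = n_weights / GROUP_SIZE * BLOCK_BYTES

print(f"{n_weights / 1e9:.2f} B weights")
print(f"{q2_0_bytes / 1e9:.2f} GB")
print(f"{q2_0_bytes / 2**20:.0f} MiB")
```

Under these assumptions the estimate comes out at roughly 1.07 GB, in line with the packed size reported in the table.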
Files in this repo
| File | Format | Size | Recommended |
|---|---|---|---|
| Ternary-Bonsai-4B-F16.gguf | FP16 | 8.04 GB | baseline / re-quantization source |
| Ternary-Bonsai-4B-Q2_0.gguf | Q2_0 (g128) | 1,020 MiB | recommended (lossless for ternary) |
Quickstart
Build from the Prism fork
```bash
git clone https://github.com/PrismML-Eng/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON   # or -DGGML_CUDA=ON, -DGGML_VULKAN=ON
cmake --build build -j
```
Flags: -ngl 99 -fa 1 for Metal; -ngl 0 -fa 1 -t 10 for CPU.
Fidelity (Q2_0 vs FP16 baseline)
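Q2_0 is effectively lossless for ternary weights: the three ternary values land exactly on three of the four 2-bit code points, so the quantize/dequantize round trip is bit-exact whenever the group scale is exactly representable in FP16. Any remaining error comes only from rounding the scale to FP16.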
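The round-trip claim can be illustrated with a small pure-Python sketch. The helpers `to_fp16`, `quantize`, and `dequantize` are hypothetical names for this example, and taking the scale as the largest absolute value in the group is an assumption; the fork's actual quantizer may choose the scale differently.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize(weights):
    """Return (fp16_scale, codes) for a group of ternary-valued weights.

    Assumption for this sketch: the scale is the largest absolute weight.
    """
    scale = to_fp16(max(abs(w) for w in weights))
    codes = [0 if w < 0 else 2 if w > 0 else 1 for w in weights]
    return scale, codes


def dequantize(scale, codes):
    """Reconstruct weights via w = (q - 1) * scale."""
    return [(q - 1) * scale for q in codes]


pattern = (-1, 0, 1, 1, 0, -1)

# Case 1: the scale is already an FP16 value, so the round trip is exact.
exact_scale = to_fp16(0.0123)
exact_weights = [exact_scale * t for t in pattern]
assert dequantize(*quantize(exact_weights)) == exact_weights

# Case 2: the scale is not representable in FP16, so only scale rounding remains.
raw_scale = 0.0123
raw_weights = [raw_scale * t for t in pattern]
restored = dequantize(*quantize(raw_weights))
max_err = max(abs(a - b) for a, b in zip(raw_weights, restored))
print(max_err > 0, max_err / raw_scale)
```

In the second case the codes are still recovered exactly; the relative error is bounded by FP16 precision (about 2^-11 for normal values).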
Benchmarks
Evaluated with EvalScope v1.4.2 + vLLM 0.15.1 on NVIDIA H100. Full benchmark suite:
| Model | Size | Avg | MMLU-R | MuSR | IFEval | GSM8K | HE+ | BFCLv3 |
|---|---|---|---|---|---|---|---|---|
| Ternary Bonsai 4B | 1.02 GB | 70.7 | 69.7 | 45.1 | 72.1 | 90.5 | 78.7 | 67.8 |
| 1-bit Bonsai 4B (prior) | 0.57 GB | 62.7 | 58.7 | 41.4 | 69.6 | 87.3 | 71.3 | 48.0 |
| Qwen 3 4B | 8.04 GB | 77.1 | 79.8 | 57.4 | 80.0 | 92.1 | 74.4 | 78.9 |
| Ministral3 3B | 6.86 GB | 73.2 | 77.5 | 56.5 | 73.1 | 91.4 | 69.5 | 71.3 |
| Gemma 3 4B | 7.76 GB | 67.9 | 66.0 | 46.3 | 73.0 | 89.8 | 67.1 | 65.1 |
| Llama 3.2 3B | 6.43 GB | 64.4 | 65.5 | 48.9 | 78.3 | 80.1 | 52.4 | 60.9 |
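The Avg column is not defined explicitly, but it appears consistent with an unweighted mean of the six benchmark scores. A minimal check, using only the numbers reported in the table and treating the unweighted mean as an assumption:

```python
from statistics import mean

# Per-benchmark scores in table order: MMLU-R, MuSR, IFEval, GSM8K, HE+, BFCLv3.
scores = {
    "Ternary Bonsai 4B": [69.7, 45.1, 72.1, 90.5, 78.7, 67.8],
    "1-bit Bonsai 4B (prior)": [58.7, 41.4, 69.6, 87.3, 71.3, 48.0],
    "Qwen 3 4B": [79.8, 57.4, 80.0, 92.1, 74.4, 78.9],
    "Ministral3 3B": [77.5, 56.5, 73.1, 91.4, 69.5, 71.3],
    "Gemma 3 4B": [66.0, 46.3, 73.0, 89.8, 67.1, 65.1],
    "Llama 3.2 3B": [65.5, 48.9, 78.3, 80.1, 52.4, 60.9],
}

# Avg column as reported in the table.
reported_avg = {
    "Ternary Bonsai 4B": 70.7,
    "1-bit Bonsai 4B (prior)": 62.7,
    "Qwen 3 4B": 77.1,
    "Ministral3 3B": 73.2,
    "Gemma 3 4B": 67.9,
    "Llama 3.2 3B": 64.4,
}

for name, values in scores.items():
    print(f"{name}: mean={mean(values):.2f} reported={reported_avg[name]}")
```

Each computed mean agrees with the reported Avg to within one-decimal rounding.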
Intelligence Density
density = -ln(1 - score/100) / size_GB
| Model | Size | Intelligence Density (1/GB) |
|---|---|---|
| Ternary Bonsai 4B | 1.02 GB | 1.202 |
| 1-bit Bonsai 4B (prior) | 0.57 GB | 1.744 |
| Ministral3 3B | 6.86 GB | 0.192 |
| Qwen 3 4B | 8.04 GB | 0.183 |
| Llama 3.2 3B | 6.43 GB | 0.161 |
| Gemma 3 4B | 7.76 GB | 0.146 |
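The density formula is straightforward to evaluate. The sketch below assumes that `score` is the Avg benchmark score and `size_GB` is the size listed in the benchmark table; the function name `intelligence_density` is hypothetical.

```python
import math


def intelligence_density(score, size_gb):
    """density = -ln(1 - score/100) / size_GB, in units of 1/GB."""
    if not 0 <= score < 100:
        raise ValueError("score must be in [0, 100)")
    if size_gb <= 0:
        raise ValueError("size_gb must be positive")
    return -math.log(1.0 - score / 100.0) / size_gb


# (Avg score, size in GB) pairs taken from the benchmark table.
models = {
    "Ternary Bonsai 4B": (70.7, 1.02),
    "1-bit Bonsai 4B (prior)": (62.7, 0.57),
    "Ministral3 3B": (73.2, 6.86),
    "Qwen 3 4B": (77.1, 8.04),
    "Llama 3.2 3B": (64.4, 6.43),
    "Gemma 3 4B": (67.9, 7.76),
}

for name, (score, size_gb) in models.items():
    print(f"{name}: {intelligence_density(score, size_gb):.3f}")
```

With these inputs the four FP16 baselines reproduce the table to three decimals. The two Bonsai rows land within about 0.02 of the reported values, possibly because the sizes and scores shown in the table are rounded.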
Citation
@techreport{ternarybonsai,
  title={Ternary Bonsai: 1.58-bit Language Models at 8B, 4B, and 1.7B Scale},
  author={Prism ML},
  year={2026},
  month={April},
  url={https://prismml.com}
}