Initialize project; model provided by the ModelHub XC community

Model: bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF
Source: Original Platform
ModelHub XC
2026-05-06 21:15:53 +08:00
commit 3e988cb286
27 changed files with 312 additions and 0 deletions

60
.gitattributes vendored Normal file
@@ -0,0 +1,60 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q6_K_L.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q6_K.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q5_K_L.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q5_K_M.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q5_K_S.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q4_K_L.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q4_K_M.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q4_K_S.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q4_1.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q4_0.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-IQ4_NL.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-IQ4_XS.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q3_K_XL.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q3_K_L.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q3_K_M.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-IQ3_M.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q3_K_S.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-IQ3_XS.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-IQ3_XXS.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q2_K_L.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q2_K.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-IQ2_M.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-bf16.gguf filter=lfs diff=lfs merge=lfs -text
Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404.imatrix filter=lfs diff=lfs merge=lfs -text

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ab842ab43ee3acdf191f9d4054aa47bf30b13f5855f1fdda1039d8406b9494e9
size 1229032256

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:180f874709ddf7baa282cf60e0a8e881c977fa7b0a469e8617e51f17ae2d2ad6
size 1599669056

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8f1165ad0b2b8dc22888d760251085800201dbd64a867c9fbf012b3c34caf9a0
size 1476789056

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b7baf84cefe4f823bd44e00d3c5b9b1576917841b05f33b581df16125c958ac2
size 1348766528

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d0148f1699819990e2ee76f465056ed0268590d41389eceb22cc317c9b3950c3
size 1917190976

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5c817b5124d2b79ea2d37f1c19c3ec2a4e20170101788b3a21308a8615482658
size 1829110592

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4b5bf0d8c23a37437eabf1f19d7be5df3c224657aca5abb3d887d29a8fa637c1
size 1363936064

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b3cb763ed3054abc9f0d5aa9dfe738480bc587a085483c391a76e9d8b53cfbe7
size 1459358528

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:68d6a33e1cd3de1a1e5a34c70414854c94f41302c6041b3f212e7c334457af4a
size 1815348032

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0de618cd43522aadb4e870951800f01d5c28d4f2c6a811e53440c36a3fadf2c0
size 1687159616

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5cdb6bb22273b568b98ecd917ade95baa112e3fe8956301a50f84510d53f8a3b
size 1542849344

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8375e7f9398ce3e55f2fb8d9fc84274ac94b9f9f2ec2b8f7d1e68be2a10d52c2
size 1910770496

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9d8958257d7d4c4349c9745207416833716c68efdd2b36d79477955be88a7dcd
size 1921909568

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bd915c288951bdbcc9feaac2f6fea582fb3c094fa9907e89b8b6602bb538f2a8
size 2093351744

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fef7e772b63d52a75815375fdf863fd6ab5387e705c9a7b50d30c790a2b2027c
size 2114800448

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b9f01bead9e163db9351af036d8d63ef479d7d48a1bb44934ead732a180f371c
size 2019377984

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1fd54a1acd9d62ac57f29e8c71eb4671d2163544f5d6e26d655e218707fbb2cf
size 1928201024

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c5dab25a50226b3a162fc3d510464673b3b31a0f97c3efb54a51776146ca3e28
size 2417576768

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:49bf548cbe88acf8fcf2fa91200fbd19a3dfe66c75a7d9b51d86f6b06b94ed3a
size 2322154304

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d8e3b1122ec98f41d63b52ab0d8b3f0991baedb7bb92ee8e9c79dec03f22f6d7
size 2269512512

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cc77440b0db36c5b420a6b5c8ae0a655f4751ac60e045937c20014c49349bed2
size 2643854144

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d8296373bcb44d4246408061bafd970a1420bbea35d8f310d253253647cb4f96
size 2739276608

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:df2a8ceef09ea94a851a3c52e51e650dec50d32245334e1806276efbbef437ef
size 3421899584

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fa587c5f1be5b76e97832c95ff3d692ec361e4fa8c5f83aa902c65e02ecb6add
size 6433688064

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:04e7a0c8c6cea663c011519ac69b39450d75da949d92c8e3e2e691d675017f79
size 2988390
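
Each three-line file above is a Git LFS pointer: it records the spec version, the SHA-256 digest of the real file, and its size in bytes. As a rough sketch (not part of this commit), a downloaded file could be checked against its pointer like this. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical; the sample pointer text is copied from the last pointer above.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)  # e.g. "sha256", "<hex digest>"
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the size and digest in `pointer`."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False  # cheap check first, avoids hashing a truncated file
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Hash in chunks so multi-gigabyte GGUF files are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer text copied from the last pointer file in this commit.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:04e7a0c8c6cea663c011519ac69b39450d75da949d92c8e3e2e691d675017f79
size 2988390
"""

pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

Checking the size before hashing is a deliberate shortcut: an interrupted download usually fails on size alone.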

177
README.md Normal file
@@ -0,0 +1,177 @@
---
quantized_by: bartowski
pipeline_tag: text-generation
metrics:
- accuracy
language:
- en
base_model: Menlo/ReZero-v0.1-llama-3.2-3b-it-grpo-250404
base_model_relation: quantized
---
## Llamacpp imatrix Quantizations of ReZero-v0.1-llama-3.2-3b-it-grpo-250404 by Menlo
Using <a href="https://github.com/ggerganov/llama.cpp/">llama.cpp</a> release <a href="https://github.com/ggerganov/llama.cpp/releases/tag/b5132">b5132</a> for quantization.
Original model: https://huggingface.co/Menlo/ReZero-v0.1-llama-3.2-3b-it-grpo-250404
All quants made using imatrix option with dataset from [here](https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8)
Run them in [LM Studio](https://lmstudio.ai/)
Run them directly with [llama.cpp](https://github.com/ggerganov/llama.cpp), or any other llama.cpp based project
## Prompt format
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 16 Apr 2025
{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>
{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```
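For callers that assemble the prompt by hand rather than relying on a chat template, the format above can be filled in with a small helper. This is a minimal sketch: `build_prompt` and its `sep` parameter are hypothetical names, and the default `sep="\n"` simply mirrors the line breaks as they appear in this README. The upstream Llama 3.x template is commonly written with a blank line after each header, so `sep="\n\n"` may be the better match; treat that as an assumption to verify against the model's own chat template.

```python
def build_prompt(system_prompt: str, prompt: str, sep: str = "\n") -> str:
    """Fill the prompt format shown above with a system and a user message.

    `sep` is the separator placed after each header block (assumption: one
    newline as rendered here; pass "\n\n" for a blank line instead).
    """
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>" + sep
        + "Cutting Knowledge Date: December 2023\n"
        + "Today Date: 16 Apr 2025" + sep
        + system_prompt + "<|eot_id|><|start_header_id|>user<|end_header_id|>" + sep
        + prompt + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>" + sep
    )


print(build_prompt("You are a helpful assistant.", "What is GGUF?"))
```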
## Download a file (not the whole branch) from below:
| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-bf16.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-bf16.gguf) | bf16 | 6.43GB | false | Full BF16 weights. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q8_0.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q8_0.gguf) | Q8_0 | 3.42GB | false | Extremely high quality, generally unneeded but max available quant. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q6_K_L.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q6_K_L.gguf) | Q6_K_L | 2.74GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, *recommended*. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q6_K.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q6_K.gguf) | Q6_K | 2.64GB | false | Very high quality, near perfect, *recommended*. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q5_K_L.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q5_K_L.gguf) | Q5_K_L | 2.42GB | false | Uses Q8_0 for embed and output weights. High quality, *recommended*. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q5_K_M.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q5_K_M.gguf) | Q5_K_M | 2.32GB | false | High quality, *recommended*. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q5_K_S.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q5_K_S.gguf) | Q5_K_S | 2.27GB | false | High quality, *recommended*. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q4_K_L.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q4_K_L.gguf) | Q4_K_L | 2.11GB | false | Uses Q8_0 for embed and output weights. Good quality, *recommended*. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q4_1.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q4_1.gguf) | Q4_1 | 2.09GB | false | Legacy format, similar performance to Q4_K_S but with improved tokens/watt on Apple silicon. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q4_K_M.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q4_K_M.gguf) | Q4_K_M | 2.02GB | false | Good quality, default size for most use cases, *recommended*. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q4_K_S.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q4_K_S.gguf) | Q4_K_S | 1.93GB | false | Slightly lower quality with more space savings, *recommended*. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q4_0.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q4_0.gguf) | Q4_0 | 1.92GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-IQ4_NL.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-IQ4_NL.gguf) | IQ4_NL | 1.92GB | false | Similar to IQ4_XS, but slightly larger. Offers online repacking for ARM CPU inference. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q3_K_XL.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q3_K_XL.gguf) | Q3_K_XL | 1.91GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-IQ4_XS.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-IQ4_XS.gguf) | IQ4_XS | 1.83GB | false | Decent quality, smaller than Q4_K_S with similar performance, *recommended*. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q3_K_L.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q3_K_L.gguf) | Q3_K_L | 1.82GB | false | Lower quality but usable, good for low RAM availability. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q3_K_M.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q3_K_M.gguf) | Q3_K_M | 1.69GB | false | Low quality. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-IQ3_M.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-IQ3_M.gguf) | IQ3_M | 1.60GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q3_K_S.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q3_K_S.gguf) | Q3_K_S | 1.54GB | false | Low quality, not recommended. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-IQ3_XS.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-IQ3_XS.gguf) | IQ3_XS | 1.48GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q2_K_L.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q2_K_L.gguf) | Q2_K_L | 1.46GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q2_K.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q2_K.gguf) | Q2_K | 1.36GB | false | Very low quality but surprisingly usable. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-IQ3_XXS.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-IQ3_XXS.gguf) | IQ3_XXS | 1.35GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. |
| [ReZero-v0.1-llama-3.2-3b-it-grpo-250404-IQ2_M.gguf](https://huggingface.co/bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF/blob/main/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-IQ2_M.gguf) | IQ2_M | 1.23GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
## Embed/output weights
Some of these quants (Q3_K_XL, Q4_K_L, etc.) use the standard quantization method, but with the embedding and output weights quantized to Q8_0 instead of their usual default.
## Downloading using huggingface-cli
<details>
<summary>Click to view download instructions</summary>
First, make sure you have huggingface-cli installed:
```
pip install -U "huggingface_hub[cli]"
```
Then, you can target the specific file you want:
```
huggingface-cli download bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF --include "Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q4_K_M.gguf" --local-dir ./
```
If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run:
```
huggingface-cli download bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF --include "Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q8_0/*" --local-dir ./
```
You can either specify a new local-dir (Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-Q8_0) or download them all in place (./).
</details>
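The same single-file download can also be done from Python. The sketch below is an illustration, not the author's method: `quant_filename` and `download_quant` are hypothetical helpers, while `hf_hub_download` is the function provided by the `huggingface_hub` package installed above. The import is deferred so the filename helper works even without the package.

```python
REPO_ID = "bartowski/Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404-GGUF"
FILE_PREFIX = "Menlo_ReZero-v0.1-llama-3.2-3b-it-grpo-250404"


def quant_filename(quant: str) -> str:
    """Build the GGUF filename for a quant type from the table, e.g. 'Q4_K_M'."""
    return f"{FILE_PREFIX}-{quant}.gguf"


def download_quant(quant: str, local_dir: str = ".") -> str:
    """Download one quant file into `local_dir` and return its local path."""
    # Imported lazily so this module loads without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=quant_filename(quant),
        local_dir=local_dir,
    )


print(quant_filename("Q4_K_M"))
```

Calling `download_quant("Q4_K_M")` would fetch the same file as the `huggingface-cli` command above; it needs network access and is not run here.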
## ARM/AVX information
Previously, you would download Q4_0_4_4/4_8/8_8, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass.
Now, however, there is something called "online repacking" for weights; details are in [this PR](https://github.com/ggerganov/llama.cpp/pull/9921). If you use Q4_0 and your hardware would benefit from repacking weights, llama.cpp will do it automatically on the fly.
As of llama.cpp build [b4282](https://github.com/ggerganov/llama.cpp/releases/tag/b4282) you will not be able to run the Q4_0_X_X files and will instead need to use Q4_0.
Additionally, if you want to get slightly better quality, you can use IQ4_NL thanks to [this PR](https://github.com/ggerganov/llama.cpp/pull/10541), which will also repack the weights for ARM, though only the 4_4 for now. The loading time may be slower, but it will result in an overall speed increase.
<details>
<summary>Click to view Q4_0_X_X information (deprecated)</summary>
I'm keeping this section to show the potential theoretical uplift in performance from using the Q4_0 with online repacking.
<details>
<summary>Click to view benchmarks on an AVX2 system (EPYC7702)</summary>
| model | size | params | backend | threads | test | t/s | % (vs Q4_0) |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |
Q4_0_8_8 offers a nice bump to prompt processing and a small bump to text generation.
</details>
</details>
## Which file should I choose?
<details>
<summary>Click here for details</summary>
A great write-up with charts comparing the performance of the various quants is provided by Artefact2 [here](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9).
The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'.
If you don't want to think too much, grab one of the K-quants. These are in format 'QX_K_X', like Q5_K_M.
If you want to get more into the weeds, you can check out this extremely useful feature chart:
[llama.cpp feature matrix](https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix)
But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQX_X, like IQ3_M. These are newer and offer better performance for their size.
These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.
</details>
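The sizing rule above (pick a file 1-2GB smaller than the memory you have) can be sketched as a small lookup. This is an illustration under stated assumptions: `pick_quant` is a hypothetical helper, the sizes are copied from the download table, and "largest file that fits" is used as a rough stand-in for "best quality", which ignores the I-quant versus K-quant considerations discussed above.

```python
from typing import Optional, Tuple

# File sizes in GB, copied from the download table in this README.
QUANT_SIZES_GB = {
    "bf16": 6.43, "Q8_0": 3.42, "Q6_K_L": 2.74, "Q6_K": 2.64,
    "Q5_K_L": 2.42, "Q5_K_M": 2.32, "Q5_K_S": 2.27, "Q4_K_L": 2.11,
    "Q4_1": 2.09, "Q4_K_M": 2.02, "Q4_K_S": 1.93, "Q4_0": 1.92,
    "IQ4_NL": 1.92, "Q3_K_XL": 1.91, "IQ4_XS": 1.83, "Q3_K_L": 1.82,
    "Q3_K_M": 1.69, "IQ3_M": 1.60, "Q3_K_S": 1.54, "IQ3_XS": 1.48,
    "Q2_K_L": 1.46, "Q2_K": 1.36, "IQ3_XXS": 1.35, "IQ2_M": 1.23,
}


def pick_quant(memory_gb: float, headroom_gb: float = 2.0) -> Optional[Tuple[str, float]]:
    """Return (quant, size_gb) for the largest file that leaves `headroom_gb` free.

    `memory_gb` is VRAM for the fastest setup, or RAM plus VRAM for maximum
    quality. Returns None when nothing fits within the budget.
    """
    budget = memory_gb - headroom_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= budget]
    if not fitting:
        return None
    size, name = max(fitting)  # largest file within the budget
    return name, size


print(pick_quant(4.0))  # 4 GB of VRAM with 2 GB of headroom
```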
## Credits
Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset.
Thank you ZeroWw for the inspiration to experiment with embed/output.
Thank you to LM Studio for sponsoring my work.
Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski