Initialize project; model provided by the ModelHub XC community

Model: bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF
Source: Original Platform
ModelHub XC
2026-06-20 07:26:14 +08:00
commit e7ebb87fa7
30 changed files with 314 additions and 0 deletions

.gitattributes vendored Normal file
@@ -0,0 +1,62 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-f16.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-Q6_K_L.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-Q6_K.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-Q5_K_L.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-Q5_K_M.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-Q5_K_S.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-Q4_K_L.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-Q4_K_M.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-Q4_K_S.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-Q4_0_8_8.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-Q4_0_4_8.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-Q4_0_4_4.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-Q4_0.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-IQ4_NL.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-IQ4_XS.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-Q3_K_XL.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-Q3_K_L.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-Q3_K_M.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-IQ3_M.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-Q3_K_S.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-IQ3_XS.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-Q2_K_L.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-Q2_K.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-IQ2_M.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct-f32.gguf filter=lfs diff=lfs merge=lfs -text
FuseChat-Llama-3.2-3B-Instruct.imatrix filter=lfs diff=lfs merge=lfs -text
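
To see which files these rules route through Git LFS, the patterns can be checked against a filename. Note that there is no `*.gguf` wildcard: each GGUF file is listed by name. The sketch below is only an approximation. It applies Python's `fnmatch` to the basename, which handles the `*.ext` and literal-name patterns above but does not implement full gitattributes semantics (for example the `saved_model/**/*` path pattern). The function name and the pattern subset are illustrative, not part of the repository.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# A subset of the patterns from the .gitattributes above (illustrative, not the full list).
LFS_PATTERNS = [
    "*.bin",
    "*.safetensors",
    "*.tar.*",
    "*tfevents*",
    "FuseChat-Llama-3.2-3B-Instruct-Q4_K_M.gguf",
    "FuseChat-Llama-3.2-3B-Instruct.imatrix",
]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate check: does the file's basename match any LFS pattern?"""
    name = PurePosixPath(path).name
    return any(fnmatch(name, pattern) for pattern in patterns)
```

With this subset, `README.md` and `configuration.json` match no pattern and would be stored as ordinary Git blobs.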

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:96116d3c591482335805984c1c007bd4339efc90a9e284454cd94a27f363b183
size 1229028320
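
Each three-line entry in this commit is a Git LFS pointer rather than the model data itself: it records the spec version, the SHA-256 of the real file, and its size in bytes. A minimal sketch of parsing a pointer and checking a downloaded file against it follows. The function names are hypothetical, and the parser assumes exactly the `key value` layout shown above.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: Path, pointer: dict) -> bool:
    """Return True if the file's size and SHA-256 both match the pointer."""
    if path.stat().st_size != pointer["size"]:
        return False
    sha = hashlib.sha256()
    with path.open("rb") as fh:
        # Hash in 1 MiB chunks so multi-gigabyte files are not read into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            sha.update(chunk)
    return sha.hexdigest() == pointer["digest"]


POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:96116d3c591482335805984c1c007bd4339efc90a9e284454cd94a27f363b183
size 1229028320
"""
pointer = parse_lfs_pointer(POINTER_TEXT)
size_gb = pointer["size"] / 1e9  # decimal gigabytes
```

The size works out to roughly 1.23 decimal GB, which is consistent with the smallest entry in the README table (IQ2_M).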

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c3c74e6ab189386a4782250b32bca146f8dade499e75d30ab3ec058ffd2d7246
size 1599665120

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c2def315220cc0c24c4c9b7b804dbb15271253db3a9b46756054331f3f3643e8
size 1476785120

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3dba070b4aca253c0ccb9210ad7d78e4176d7856351f5ef3d2b9d89c93c569d1
size 1917187040

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b932b06e8c53828bce0c61fd29ee8324c8771e4cfec26452362f65fd68ce3b96
size 1829106656

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:db92fb14472db19cda4d34899c1e1c19aeedd80f6d39211787b570b575201885
size 1363932128

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8818f581e62d68d93e598197aaeca7c3f7fec5001824834489a2984513e077c9
size 1459354592

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c4c3ec4a744c96525d4b138630efbda3669e17d3f29a56d6dc8560655332a97a
size 1815344096

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cd3d5f2399c4a185c89cf4181519c8d91fec8a314d9301cc00b4b95a8f6036fd
size 1687155680

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:73050a5667e3796679c8c503a7f7dd1eaf60c530b8a19d7254f3ba546707552e
size 1542845408

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:85f3f6da69070dfeb2f155aac143dd3f013af2e8d884254414db37359451ee03
size 1910766560

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:df719178d3d09a00ff5c3cb11e48a3bbe17d03239dd056e57d9508a6e33259ca
size 1921905632

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f3af9ee1699b7f99609bfbc4e62288c8ab2adeabbbf9d264e85bcad9eb3a6d4e
size 1917187040

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fa2b4a68eab81b41d9a88ad5cb5cb55cb8557c236ee24b2adc45b63d32f8fc30
size 1917187040

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b099c6e5f7216913a0b2b7ef1f7f6bf1643e41df3015fe60a554706e2a161ef4
size 1917187040

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f665a50e3fa6a5b0f78785daa4800e492c0ae8a9a407f6030d99be154ad48e05
size 2114796512

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a4f0e9a905b74886b79b72622c06a3219d6812818a564a53c39fc49032d7f842
size 2019374048

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:df9bac3036ae38119a90116061f87280d76b4a136eef2fe2ef1f76d0556960d3
size 1928197088

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:70e22622ea554c3fa72dc367032628281c8ff4397c9debcd37e315449f9d3d04
size 2417572832

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1f19b921ff1b77be6eee5de2ce86672c3281ffc1185b8da15ea49c86e3e76711
size 2322150368

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:33cdcf063e59dc8ba9ace37080305aaaef319e3d2327122ff402effebdf76c8e
size 2269508576

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f124ee5e60feb6fe62dacf72ab459fe8a843582f50d132807a8135dcd4dd1e2d
size 2643850208

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:83851272f19f1c332701f95e13bc05f19e8924b70046bffed5db7f83647da75c
size 2739272672

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a52a1301520d5165f02e8aed26fb2cd144f19b85c8a5dfb3fbaf7b6c4c607fa7
size 3421895648

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:229a15d936ba74647192a29595a7d9e50b36404ee697e6166e1169da86b70e8b
size 6433684448

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:18c5ee4ad75f6d3964e71d37f26833017c693ec221ab98704f3585fab566d160
size 12858833600

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2a5a26087199020ba684565c73111dbbbce9f20e5fc02d53711e0292a46e9bd4
size 2988390

README.md Normal file
@@ -0,0 +1,170 @@
---
quantized_by: bartowski
pipeline_tag: text-generation
base_model: FuseAI/FuseChat-Llama-3.2-3B-Instruct
---
## Llamacpp imatrix Quantizations of FuseChat-Llama-3.2-3B-Instruct
Using <a href="https://github.com/ggerganov/llama.cpp/">llama.cpp</a> release <a href="https://github.com/ggerganov/llama.cpp/releases/tag/b4273">b4273</a> for quantization.
Original model: https://huggingface.co/FuseAI/FuseChat-Llama-3.2-3B-Instruct
All quants were made using the imatrix option with the dataset from [here](https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8).
Run them in [LM Studio](https://lmstudio.ai/).
## Prompt format
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>
{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```
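Runtimes typically apply this template for you, but if you need to assemble the string by hand, a minimal sketch looks like the following. The function name is hypothetical. The `header_sep` default of a blank line after each `<|end_header_id|>` is an assumption based on the Llama 3 convention; pass `"\n"` to reproduce the template exactly as rendered above.

```python
def build_prompt(system_prompt: str, prompt: str, header_sep: str = "\n\n") -> str:
    """Fill the prompt template with a system message and a user message.

    header_sep is the text placed after each <|end_header_id|>; the default
    of a blank line is an assumption, not something this README states.
    """
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>{header_sep}"
        f"{system_prompt}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>{header_sep}"
        f"{prompt}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>{header_sep}"
    )


text = build_prompt("You are a helpful assistant.", "What is GGUF?")
```

The string ends with the assistant header so that the model's continuation is the assistant's reply.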
## Download a file (not the whole branch) from below:
| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| [FuseChat-Llama-3.2-3B-Instruct-f32.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-f32.gguf) | f32 | 12.86GB | false | Full F32 weights. |
| [FuseChat-Llama-3.2-3B-Instruct-f16.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-f16.gguf) | f16 | 6.43GB | false | Full F16 weights. |
| [FuseChat-Llama-3.2-3B-Instruct-Q8_0.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-Q8_0.gguf) | Q8_0 | 3.42GB | false | Extremely high quality, generally unneeded but max available quant. |
| [FuseChat-Llama-3.2-3B-Instruct-Q6_K_L.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-Q6_K_L.gguf) | Q6_K_L | 2.74GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, *recommended*. |
| [FuseChat-Llama-3.2-3B-Instruct-Q6_K.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-Q6_K.gguf) | Q6_K | 2.64GB | false | Very high quality, near perfect, *recommended*. |
| [FuseChat-Llama-3.2-3B-Instruct-Q5_K_L.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-Q5_K_L.gguf) | Q5_K_L | 2.42GB | false | Uses Q8_0 for embed and output weights. High quality, *recommended*. |
| [FuseChat-Llama-3.2-3B-Instruct-Q5_K_M.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-Q5_K_M.gguf) | Q5_K_M | 2.32GB | false | High quality, *recommended*. |
| [FuseChat-Llama-3.2-3B-Instruct-Q5_K_S.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-Q5_K_S.gguf) | Q5_K_S | 2.27GB | false | High quality, *recommended*. |
| [FuseChat-Llama-3.2-3B-Instruct-Q4_K_L.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-Q4_K_L.gguf) | Q4_K_L | 2.11GB | false | Uses Q8_0 for embed and output weights. Good quality, *recommended*. |
| [FuseChat-Llama-3.2-3B-Instruct-Q4_K_M.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-Q4_K_M.gguf) | Q4_K_M | 2.02GB | false | Good quality, default size for most use cases, *recommended*. |
| [FuseChat-Llama-3.2-3B-Instruct-Q4_K_S.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-Q4_K_S.gguf) | Q4_K_S | 1.93GB | false | Slightly lower quality with more space savings, *recommended*. |
| [FuseChat-Llama-3.2-3B-Instruct-Q4_0_8_8.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-Q4_0_8_8.gguf) | Q4_0_8_8 | 1.92GB | false | Optimized for ARM and AVX inference. Requires 'sve' support for ARM (see details below). *Don't use on Mac*. |
| [FuseChat-Llama-3.2-3B-Instruct-Q4_0_4_8.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-Q4_0_4_8.gguf) | Q4_0_4_8 | 1.92GB | false | Optimized for ARM inference. Requires 'i8mm' support (see details below). *Don't use on Mac*. |
| [FuseChat-Llama-3.2-3B-Instruct-Q4_0_4_4.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-Q4_0_4_4.gguf) | Q4_0_4_4 | 1.92GB | false | Optimized for ARM inference. Should work well on all ARM chips, not for use with GPUs. *Don't use on Mac*. |
| [FuseChat-Llama-3.2-3B-Instruct-Q4_0.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-Q4_0.gguf) | Q4_0 | 1.92GB | false | Legacy format, offers online repacking for ARM CPU inference. |
| [FuseChat-Llama-3.2-3B-Instruct-IQ4_NL.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-IQ4_NL.gguf) | IQ4_NL | 1.92GB | false | Similar to IQ4_XS, but slightly larger. Offers online repacking for ARM CPU inference. |
| [FuseChat-Llama-3.2-3B-Instruct-Q3_K_XL.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-Q3_K_XL.gguf) | Q3_K_XL | 1.91GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| [FuseChat-Llama-3.2-3B-Instruct-IQ4_XS.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-IQ4_XS.gguf) | IQ4_XS | 1.83GB | false | Decent quality, smaller than Q4_K_S with similar performance, *recommended*. |
| [FuseChat-Llama-3.2-3B-Instruct-Q3_K_L.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-Q3_K_L.gguf) | Q3_K_L | 1.82GB | false | Lower quality but usable, good for low RAM availability. |
| [FuseChat-Llama-3.2-3B-Instruct-Q3_K_M.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-Q3_K_M.gguf) | Q3_K_M | 1.69GB | false | Low quality. |
| [FuseChat-Llama-3.2-3B-Instruct-IQ3_M.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-IQ3_M.gguf) | IQ3_M | 1.60GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| [FuseChat-Llama-3.2-3B-Instruct-Q3_K_S.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-Q3_K_S.gguf) | Q3_K_S | 1.54GB | false | Low quality, not recommended. |
| [FuseChat-Llama-3.2-3B-Instruct-IQ3_XS.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-IQ3_XS.gguf) | IQ3_XS | 1.48GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| [FuseChat-Llama-3.2-3B-Instruct-Q2_K_L.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-Q2_K_L.gguf) | Q2_K_L | 1.46GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
| [FuseChat-Llama-3.2-3B-Instruct-Q2_K.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-Q2_K.gguf) | Q2_K | 1.36GB | false | Very low quality but surprisingly usable. |
| [FuseChat-Llama-3.2-3B-Instruct-IQ2_M.gguf](https://huggingface.co/bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF/blob/main/FuseChat-Llama-3.2-3B-Instruct-IQ2_M.gguf) | IQ2_M | 1.23GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
## Embed/output weights
Some of these quants (Q3_K_XL, Q4_K_L etc) are the standard quantization method with the embeddings and output weights quantized to Q8_0 instead of what they would normally default to.
## Downloading using huggingface-cli
<details>
<summary>Click to view download instructions</summary>
First, make sure you have huggingface-cli installed:
```
pip install -U "huggingface_hub[cli]"
```
Then, you can target the specific file you want:
```
huggingface-cli download bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF --include "FuseChat-Llama-3.2-3B-Instruct-Q4_K_M.gguf" --local-dir ./
```
If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run:
```
huggingface-cli download bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF --include "FuseChat-Llama-3.2-3B-Instruct-Q8_0/*" --local-dir ./
```
You can either specify a new local-dir (FuseChat-Llama-3.2-3B-Instruct-Q8_0) or download them all in place (./).
</details>
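The same single-file download can also be done from Python with the `huggingface_hub` package that the CLI ships in. This is a sketch rather than an official recipe: `quant_filename` and `download_quant` are hypothetical helper names, and the filename pattern is inferred from the table above.

```python
REPO_ID = "bartowski/FuseChat-Llama-3.2-3B-Instruct-GGUF"


def quant_filename(quant: str) -> str:
    """Build the GGUF filename for a quant type, following the table's naming."""
    return f"FuseChat-Llama-3.2-3B-Instruct-{quant}.gguf"


def download_quant(quant: str, local_dir: str = "./") -> str:
    """Download one quant file and return its local path (needs network access)."""
    # Imported lazily so quant_filename works without the package installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=quant_filename(quant),
        local_dir=local_dir,
    )
```

For example, `download_quant("Q4_K_M")` would fetch the same file as the CLI command in the instructions above.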
## Q4_0_X_X information
New: Thanks to efforts made to have online repacking of weights in [this PR](https://github.com/ggerganov/llama.cpp/pull/9921), you can now just use Q4_0 if your llama.cpp has been compiled for your ARM device.
Similarly, if you want slightly better performance, you can use IQ4_NL thanks to [this PR](https://github.com/ggerganov/llama.cpp/pull/10541), which also repacks the weights for ARM, though only the 4_4 variant for now. Loading may take longer, but the result is an overall speed increase.
<details>
<summary>Click to view Q4_0_X_X information</summary>
These are *NOT* for Metal (Apple) or GPU (nvidia/AMD/intel) offloading, only ARM chips (and certain AVX2/AVX512 CPUs).
If you're using an ARM chip, the Q4_0_X_X quants will have a substantial speedup. Check out Q4_0_4_4 speed comparisons [on the original pull request](https://github.com/ggerganov/llama.cpp/pull/5780#pullrequestreview-21657544660)
To check which one would work best for your ARM chip, you can check [AArch64 SoC features](https://gpages.juszkiewicz.com.pl/arm-socs-table/arm-socs.html) (thanks EloyOn!).
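On a Linux ARM device you can also read the CPU feature flags directly and map them to the requirements listed in the table. The sketch below is an assumption-laden illustration: the function names are hypothetical, the `Features` line of `/proc/cpuinfo` is assumed to be present, and preferring Q4_0_8_8 over Q4_0_4_8 when both flags exist is a guess (the README only states each variant's requirement, not a ranking).

```python
def read_cpu_features(cpuinfo_path: str = "/proc/cpuinfo") -> set[str]:
    """Return the flags from the first 'Features' line of a Linux cpuinfo file."""
    with open(cpuinfo_path, encoding="utf-8") as fh:
        for line in fh:
            if line.lower().startswith("features"):
                return set(line.split(":", 1)[1].split())
    return set()


def pick_arm_quant(cpu_features: set[str]) -> str:
    """Map ARM feature flags to a Q4_0_X_X variant per the table's requirements."""
    if "sve" in cpu_features:
        return "Q4_0_8_8"  # table: requires 'sve'
    if "i8mm" in cpu_features:
        return "Q4_0_4_8"  # table: requires 'i8mm'
    return "Q4_0_4_4"      # table: should work on all ARM chips


choice = pick_arm_quant({"fp", "asimd", "i8mm"})
```

On a real device, `pick_arm_quant(read_cpu_features())` combines the two steps.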
If you're using a CPU that supports AVX2 or AVX512 (typically server CPUs and AMD's latest Zen5 CPUs) and are not offloading to a GPU, the Q4_0_8_8 may offer a nice speedup as well:
<details>
<summary>Click to view benchmarks on an AVX2 system (EPYC7702)</summary>
| model | size | params | backend | threads | test | t/s | % (vs Q4_0) |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |
Q4_0_8_8 offers a nice bump to prompt processing and a small bump to text generation.
</details>
</details>
## Which file should I choose?
<details>
<summary>Click here for details</summary>
A great write up with charts showing various performances is provided by Artefact2 [here](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9)
The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
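As a rough illustration of this sizing rule, the sketch below picks the largest file that still leaves the suggested headroom. The sizes are a subset copied from the table above. The function name is hypothetical, and treating "largest file that fits" as "best quality" is a simplifying assumption.

```python
# File sizes in GB, copied from a subset of the download table.
QUANT_SIZES_GB = {
    "Q8_0": 3.42,
    "Q6_K_L": 2.74,
    "Q6_K": 2.64,
    "Q5_K_M": 2.32,
    "Q4_K_M": 2.02,
    "IQ4_XS": 1.83,
    "Q3_K_L": 1.82,
    "IQ3_M": 1.60,
    "Q2_K": 1.36,
    "IQ2_M": 1.23,
}


def largest_fitting_quant(memory_gb, headroom_gb=2.0, sizes=QUANT_SIZES_GB):
    """Return the largest quant whose file fits in memory_gb minus headroom_gb."""
    budget = memory_gb - headroom_gb
    fitting = {name: size for name, size in sizes.items() if size <= budget}
    if not fitting:
        return None  # nothing fits; consider a smaller model
    return max(fitting, key=fitting.get)


suggestion = largest_fitting_quant(4.0)  # e.g. a GPU with 4GB of VRAM
```

Pass only VRAM as `memory_gb` for the speed-first case, or RAM plus VRAM for the quality-first case.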
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'.
If you don't want to think too much, grab one of the K-quants. These are in format 'QX_K_X', like Q5_K_M.
If you want to get more into the weeds, you can check out this extremely useful feature chart:
[llama.cpp feature matrix](https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix)
But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQX_X, like IQ3_M. These are newer and offer better performance for their size.
These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.
The I-quants are *not* compatible with Vulkan, which is also AMD, so if you have an AMD card, double check whether you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.
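The rule of thumb above can be condensed into a small decision helper. This is only a sketch of the guidance in this section: the function name and the backend labels (`"cublas"`, `"rocblas"`, `"vulkan"`, `"cpu"`, `"metal"`) are hypothetical strings chosen for illustration, not identifiers from llama.cpp.

```python
def suggest_quant_family(backend: str, below_q4: bool) -> str:
    """Suggest 'I-quant' or 'K-quant' following the README's rule of thumb."""
    backend = backend.lower()
    if backend == "vulkan":
        # The README states I-quants are not compatible with this backend.
        return "K-quant"
    if below_q4 and backend in {"cublas", "rocblas"}:
        # Below Q4 on these GPU backends, I-quants offer better quality per size.
        return "I-quant"
    # CPU and Metal can run I-quants, but more slowly; K-quants are the simple default.
    return "K-quant"


family = suggest_quant_family("cuBLAS", below_q4=True)
```

On CPU or Metal the helper defaults to K-quants for speed; choosing an I-quant there remains a valid trade of speed for quality.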
</details>
## Credits
Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset.
Thank you ZeroWw for the inspiration to experiment with embed/output.
Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski

configuration.json Normal file
@@ -0,0 +1 @@
{"framework": "pytorch", "task": "text-generation", "allow_remote": true}