Initialize project; model provided by the ModelHub XC community

Model: bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF
Source: Original Platform
ModelHub XC
2026-05-11 21:44:22 +08:00
commit 566aa93cce
29 changed files with 313 additions and 0 deletions

62
.gitattributes vendored Normal file

@@ -0,0 +1,62 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-Q6_K_L.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-Q6_K.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-Q5_K_L.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-Q5_K_M.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-Q5_K_S.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-Q4_K_L.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-Q4_K_M.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-Q4_K_S.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-Q4_1.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-Q4_0.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-IQ4_NL.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-IQ4_XS.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-Q3_K_XL.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-Q3_K_L.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-Q3_K_M.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-IQ3_M.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-Q3_K_S.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-IQ3_XS.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-IQ3_XXS.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-Q2_K_L.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-Q2_K.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-IQ2_M.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-bf16.gguf filter=lfs diff=lfs merge=lfs -text
Qwen_Qwen3-VL-4B-Instruct-imatrix.gguf filter=lfs diff=lfs merge=lfs -text
mmproj-Qwen_Qwen3-VL-4B-Instruct-f16.gguf filter=lfs diff=lfs merge=lfs -text
mmproj-Qwen_Qwen3-VL-4B-Instruct-bf16.gguf filter=lfs diff=lfs merge=lfs -text

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:34f34a8f3915eba1c73135b6410e5800f6b34151b0d07dd630ba482515b0463e
size 1512984992

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ed9bac8de221d35061e97289ce5e02d72caa3e4547e91c8a3343cc542d1d1a5e
size 1962897312

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:47cb1b2a95488562a163945688e502506f0033e64c18fa570f6fc193c8a83428
size 1814376352

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7f0a2e5c819ce82e026cd51c6b61ae6e90b67edd42c9c6c5ab166fae03b5123f
size 1670189472

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c78d6fd7f94fd5fd20d810e86f4d31ff61b9d33dc48853a08c68ab4136a35e4a
size 2381344672

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:dd6a6672d12121c4a5a8962dde2d8e253f21770974da7cf694cbc9b9f80e0996
size 2270752672

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3136b45fef1d64c359fd4934eea5db1bd9113919e872f2f1e556a6d6291fe357
size 1669500832

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e7b54acdf7c0765112f459ecd88641eb916954be3f00352a20fbeaf338416899
size 1763701152

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c63c54c65145523d35bfb9de677a858f43210bf4078137fcbe60c4c6843542dd
size 2239786912

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ed78f134bf3bd0af41c28e8c5d0d92b986653939efdeba3b7ef1435b0d05d702
size 2075619232

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:941f4349ad3e8287b80d0c8c2efce50645498389a5f04aa6229f6a4737a8a63c
size 1886998432

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f2d13ad473a587c607fcba797b447628abd89a331857a770e70b7317b9c3ff20
size 2333987232

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e3cd49fa176a75aa91c01dc1922aec7cbb69d83e6e7b0c954aca3758ad32d094
size 2375774112

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a81c365cb27e7456a036029bf766d1786ad956d769d35dc9b6ac7f3e0647c601
size 2596630432

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c8c5bb6f9820af7f66484473b5346ec6535459a4a27b6beaee3332e24e2a02ac
size 2591482272

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ceb08f2fd0bae71733c552d42bd79e08a6335328e295a486b50640a8d573ab51
size 2497281952

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:43dfb82b7d14a8894de6fe94cc6a785471270b3d3fe9fcb1b7e3e639ac00838a
size 2383310752

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3b0a2fa4a112e822c447ecfdee51bab84107776f2a159147449c21e7880c257b
size 2983715232

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:026ad07482ca4a1121349fbb7b77abf9b81c0cbc156e3cf52ee7bb8d0229cde4
size 2889514912

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b2b2dc55bfa346486e5b75ddedd5c2573d241b2e29175d4470692f1e98813944
size 2823712672

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:27e3089402499416c730b5e4bc1a6763d0cbd37b4ad33748778c6e1bec0811dd
size 3306262432

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bc49bd3a2e74d3e645f50b924fd63d04b57bf405c8bd82190622b1ce6777bb5a
size 3400462752

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e825bf42274cbcdea1d817a3ba1a3066fe9b779a3e9bfad46a396ed9a72e4611
size 4280406432

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b2ca346f351b2086ba88d2a075610aed5fc33dd1d976fd9cd975cd042b9584ad
size 8051286144

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d51dd4d8c7183c30c334a8493f848fe9f8da39d6c1b282651fdcb0624cec7877
size 3872640

170
README.md Normal file

@@ -0,0 +1,170 @@
---
quantized_by: bartowski
pipeline_tag: image-text-to-text
base_model_relation: quantized
base_model: Qwen/Qwen3-VL-4B-Instruct
---
## Llamacpp imatrix Quantizations of Qwen3-VL-4B-Instruct by Qwen
Using <a href="https://github.com/ggml-org/llama.cpp/">llama.cpp</a> release <a href="https://github.com/ggml-org/llama.cpp/releases/tag/b6888">b6888</a> for quantization.
Original model: https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct
All quants made using imatrix option with dataset from [here](https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8) combined with a subset of combined_all_small.parquet from Ed Addario [here](https://huggingface.co/datasets/eaddario/imatrix-calibration/blob/main/combined_all_small.parquet)
Run them in [LM Studio](https://lmstudio.ai/)
Run them directly with [llama.cpp](https://github.com/ggml-org/llama.cpp), or any other llama.cpp based project
## Prompt format
```
<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```
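As a minimal sketch, the template above can be filled in with a small helper. The function name `build_prompt`, the default system prompt, and the trailing newline after `assistant` are assumptions made for illustration; runtimes built on llama.cpp typically apply the chat template embedded in the GGUF themselves, so manual formatting like this is mostly useful for raw completion endpoints.

```python
def build_prompt(user_prompt: str,
                 system_prompt: str = "You are a helpful assistant.") -> str:
    """Fill the ChatML-style template from the README.

    The default system prompt is a placeholder (an assumption),
    not something the model card specifies.
    """
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"  # the model continues from here
    )


if __name__ == "__main__":
    print(build_prompt("Describe the attached image in one sentence."))
```

This only covers the text part of the prompt; how image input is passed depends on the runtime.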
## Download a file (not the whole branch) from below:
| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| [Qwen3-VL-4B-Instruct-bf16.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-bf16.gguf) | bf16 | 8.05GB | false | Full BF16 weights. |
| [Qwen3-VL-4B-Instruct-Q8_0.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-Q8_0.gguf) | Q8_0 | 4.28GB | false | Extremely high quality, generally unneeded but max available quant. |
| [Qwen3-VL-4B-Instruct-Q6_K_L.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-Q6_K_L.gguf) | Q6_K_L | 3.40GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, *recommended*. |
| [Qwen3-VL-4B-Instruct-Q6_K.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-Q6_K.gguf) | Q6_K | 3.31GB | false | Very high quality, near perfect, *recommended*. |
| [Qwen3-VL-4B-Instruct-Q5_K_L.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-Q5_K_L.gguf) | Q5_K_L | 2.98GB | false | Uses Q8_0 for embed and output weights. High quality, *recommended*. |
| [Qwen3-VL-4B-Instruct-Q5_K_M.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-Q5_K_M.gguf) | Q5_K_M | 2.89GB | false | High quality, *recommended*. |
| [Qwen3-VL-4B-Instruct-Q5_K_S.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-Q5_K_S.gguf) | Q5_K_S | 2.82GB | false | High quality, *recommended*. |
| [Qwen3-VL-4B-Instruct-Q4_1.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-Q4_1.gguf) | Q4_1 | 2.60GB | false | Legacy format, similar performance to Q4_K_S but with improved tokens/watt on Apple silicon. |
| [Qwen3-VL-4B-Instruct-Q4_K_L.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-Q4_K_L.gguf) | Q4_K_L | 2.59GB | false | Uses Q8_0 for embed and output weights. Good quality, *recommended*. |
| [Qwen3-VL-4B-Instruct-Q4_K_M.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-Q4_K_M.gguf) | Q4_K_M | 2.50GB | false | Good quality, default size for most use cases, *recommended*. |
| [Qwen3-VL-4B-Instruct-Q4_K_S.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-Q4_K_S.gguf) | Q4_K_S | 2.38GB | false | Slightly lower quality with more space savings, *recommended*. |
| [Qwen3-VL-4B-Instruct-Q4_0.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-Q4_0.gguf) | Q4_0 | 2.38GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. |
| [Qwen3-VL-4B-Instruct-IQ4_NL.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-IQ4_NL.gguf) | IQ4_NL | 2.38GB | false | Similar to IQ4_XS, but slightly larger. Offers online repacking for ARM CPU inference. |
| [Qwen3-VL-4B-Instruct-Q3_K_XL.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-Q3_K_XL.gguf) | Q3_K_XL | 2.33GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| [Qwen3-VL-4B-Instruct-IQ4_XS.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-IQ4_XS.gguf) | IQ4_XS | 2.27GB | false | Decent quality, smaller than Q4_K_S with similar performance, *recommended*. |
| [Qwen3-VL-4B-Instruct-Q3_K_L.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-Q3_K_L.gguf) | Q3_K_L | 2.24GB | false | Lower quality but usable, good for low RAM availability. |
| [Qwen3-VL-4B-Instruct-Q3_K_M.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-Q3_K_M.gguf) | Q3_K_M | 2.08GB | false | Low quality. |
| [Qwen3-VL-4B-Instruct-IQ3_M.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-IQ3_M.gguf) | IQ3_M | 1.96GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| [Qwen3-VL-4B-Instruct-Q3_K_S.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-Q3_K_S.gguf) | Q3_K_S | 1.89GB | false | Low quality, not recommended. |
| [Qwen3-VL-4B-Instruct-IQ3_XS.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-IQ3_XS.gguf) | IQ3_XS | 1.81GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| [Qwen3-VL-4B-Instruct-Q2_K_L.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-Q2_K_L.gguf) | Q2_K_L | 1.76GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
| [Qwen3-VL-4B-Instruct-IQ3_XXS.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-IQ3_XXS.gguf) | IQ3_XXS | 1.67GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. |
| [Qwen3-VL-4B-Instruct-Q2_K.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-Q2_K.gguf) | Q2_K | 1.67GB | false | Very low quality but surprisingly usable. |
| [Qwen3-VL-4B-Instruct-IQ2_M.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF/blob/main/Qwen_Qwen3-VL-4B-Instruct-IQ2_M.gguf) | IQ2_M | 1.51GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
## Embed/output weights
Some of these quants (Q3_K_XL, Q4_K_L, etc.) use the standard quantization method, but with the embedding and output weights quantized to Q8_0 instead of what they would normally default to.
## Downloading using huggingface-cli
<details>
<summary>Click to view download instructions</summary>
First, make sure you have huggingface-cli installed:
```
pip install -U "huggingface_hub[cli]"
```
Then, you can target the specific file you want:
```
huggingface-cli download bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF --include "Qwen_Qwen3-VL-4B-Instruct-Q4_K_M.gguf" --local-dir ./
```
If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run:
```
huggingface-cli download bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF --include "Qwen_Qwen3-VL-4B-Instruct-Q8_0/*" --local-dir ./
```
You can either specify a new local-dir (Qwen_Qwen3-VL-4B-Instruct-Q8_0) or download them all in place (./).
</details>
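The same single-file download can also be done from Python. The sketch below assumes the `huggingface_hub` package installed by the `pip` command above and uses its `hf_hub_download` function; the helper names `quant_filename` and `download_quant` are hypothetical, and the file-name pattern is taken from the entries listed in this repository.

```python
REPO_ID = "bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF"


def quant_filename(quant: str) -> str:
    """Build the GGUF file name for a quant type such as 'Q4_K_M'."""
    return f"Qwen_Qwen3-VL-4B-Instruct-{quant}.gguf"


def download_quant(quant: str = "Q4_K_M", local_dir: str = ".") -> str:
    """Download one quant file and return its local path.

    huggingface_hub is imported lazily so the helpers above remain
    usable even when the package is not installed.
    """
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=quant_filename(quant),
        local_dir=local_dir,
    )


if __name__ == "__main__":
    print(download_quant("Q4_K_M"))
```

The repository also tracks `mmproj-Qwen_Qwen3-VL-4B-Instruct-f16.gguf` and a bf16 variant; for image input, a llama.cpp-based runtime will likely need one of these alongside the main model file, and it can be fetched the same way by passing its file name to `hf_hub_download`.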
## ARM/AVX information
Previously, you would download Q4_0_4_4/4_8/8_8, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass.
Now, however, there is something called "online repacking" for weights; details are in [this PR](https://github.com/ggml-org/llama.cpp/pull/9921). If you use Q4_0 and your hardware would benefit from repacking weights, llama.cpp will do it automatically on the fly.
As of llama.cpp build [b4282](https://github.com/ggml-org/llama.cpp/releases/tag/b4282) you will not be able to run the Q4_0_X_X files and will instead need to use Q4_0.
Additionally, if you want to get slightly better quality, you can use IQ4_NL thanks to [this PR](https://github.com/ggml-org/llama.cpp/pull/10541), which will also repack the weights for ARM, though only the 4_4 for now. The loading time may be slower, but it will result in an overall speed increase.
<details>
<summary>Click to view Q4_0_X_X information (deprecated)</summary>
I'm keeping this section to show the potential theoretical uplift in performance from using the Q4_0 with online repacking.
<details>
<summary>Click to view benchmarks on an AVX2 system (EPYC7702)</summary>
| model | size | params | backend | threads | test | t/s | % (vs Q4_0) |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |
Q4_0_8_8 offers a nice bump to prompt processing and a small bump to text generation.
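For reference, the "% (vs Q4_0)" column appears to be each row's tokens per second divided by the Q4_0 result for the same test. The sketch below recomputes it for the Q4_0_8_8 rows using the mean t/s values from the table; the function name and rounding to whole percent are assumptions, so a result can differ by a point from the listed figure (pp1024, for instance, has a large ± spread).

```python
# Mean tokens/second copied from the benchmark table above.
Q4_0_BASELINE = {
    "pp512": 204.03, "pp1024": 282.92, "pp2048": 259.49,
    "tg128": 39.12, "tg256": 39.31, "tg512": 40.52,
}
Q4_0_8_8 = {
    "pp512": 271.71, "pp1024": 279.86, "pp2048": 320.77,
    "tg128": 43.51, "tg256": 43.35, "tg512": 42.60,
}


def relative_percent(tokens_per_s: float, baseline_tokens_per_s: float) -> int:
    """Throughput as a whole-number percentage of the baseline."""
    return round(100 * tokens_per_s / baseline_tokens_per_s)


if __name__ == "__main__":
    for test, tps in Q4_0_8_8.items():
        print(f"{test}: {relative_percent(tps, Q4_0_BASELINE[test])}% of Q4_0")
```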
</details>
</details>
## Which file should I choose?
<details>
<summary>Click here for details</summary>
A great write-up with charts comparing the performance of the various quant types is provided by Artefact2 [here](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9).
The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
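The sizing rule above can be sketched as a small lookup. The sizes are copied from a subset of the download table; the function name `pick_quant` and the default 2 GB headroom (the upper end of the 1-2GB guidance) are assumptions for illustration, not a definitive selection method.

```python
from typing import Optional

# File sizes in GB, copied from a subset of the download table.
QUANT_SIZES_GB = {
    "Q8_0": 4.28,
    "Q6_K_L": 3.40,
    "Q6_K": 3.31,
    "Q5_K_M": 2.89,
    "Q4_K_M": 2.50,
    "IQ4_XS": 2.27,
    "Q3_K_M": 2.08,
    "IQ3_M": 1.96,
    "IQ2_M": 1.51,
}


def pick_quant(memory_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the largest listed quant that leaves `headroom_gb` free.

    `memory_gb` is VRAM alone for the fastest setup, or RAM plus VRAM
    when maximum quality matters more than speed.
    """
    budget = memory_gb - headroom_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None  # nothing fits; consider a smaller model
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    for memory in (4.0, 6.0, 8.0):
        print(f"{memory} GB -> {pick_quant(memory)}")
```

The headroom leaves space for the context cache and other runtime buffers, which is why the file size alone should not fill the available memory.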
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'.
If you don't want to think too much, grab one of the K-quants. These are in format 'QX_K_X', like Q5_K_M.
If you want to get more into the weeds, you can check out this extremely useful feature chart:
[llama.cpp feature matrix](https://github.com/ggml-org/llama.cpp/wiki/Feature-matrix)
But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQX_X, like IQ3_M. These are newer and offer better performance for their size.
These I-quants can also be used on CPU, but they will be slower than their K-quant equivalents, so you'll have to decide on the tradeoff between speed and quality.
</details>
## Credits
Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset.
Thank you ZeroWw for the inspiration to experiment with embed/output.
Thank you to LM Studio for sponsoring my work.
Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f8178bbf582305017843c53396cc5e0ed0b867855f25e0b15ae48311cc379f08
size 839325984

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5d5e698305f6e092c682d3e8e8deb1db88915e69a7dd66c0279e7ce810bfee48
size 836180256
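
Each three-line hunk above is a Git LFS pointer (spec version, `oid` with a SHA-256 digest, and size in bytes) rather than the model data itself. As a minimal sketch, a downloaded file can be checked against its pointer as follows; the function names are hypothetical, and the sample text is the last pointer in this commit.

```python
import hashlib
from pathlib import Path

# The last pointer shown in this commit, used here as sample input.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:5d5e698305f6e092c682d3e8e8deb1db88915e69a7dd66c0279e7ce810bfee48
size 836180256
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split a pointer file into its version, hash algorithm, digest and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict) -> bool:
    """Check a local file's size and digest against a parsed pointer."""
    file_path = Path(path)
    if file_path.stat().st_size != pointer["size"]:
        return False  # cheap check first, before hashing gigabytes
    hasher = hashlib.new(pointer["algo"])
    with file_path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


if __name__ == "__main__":
    print(parse_lfs_pointer(POINTER_TEXT))
```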