Initialize project; model provided by the ModelHub XC community

Model: bartowski/OpenGVLab_InternVL3_5-4B-GGUF
Source: Original Platform
ModelHub XC
2026-06-04 00:20:19 +08:00
commit b83d3d350f
29 changed files with 307 additions and 0 deletions

New file: .gitattributes

@@ -0,0 +1,62 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-Q6_K_L.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-Q6_K.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-Q5_K_L.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-Q5_K_M.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-Q5_K_S.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-Q4_K_L.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-Q4_K_M.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-Q4_K_S.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-Q4_1.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-Q4_0.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-IQ4_NL.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-IQ4_XS.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-Q3_K_XL.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-Q3_K_L.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-Q3_K_M.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-IQ3_M.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-Q3_K_S.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-IQ3_XS.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-IQ3_XXS.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-Q2_K_L.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-Q2_K.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-IQ2_M.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-bf16.gguf filter=lfs diff=lfs merge=lfs -text
OpenGVLab_InternVL3_5-4B-imatrix.gguf filter=lfs diff=lfs merge=lfs -text
mmproj-OpenGVLab_InternVL3_5-4B-f16.gguf filter=lfs diff=lfs merge=lfs -text
mmproj-OpenGVLab_InternVL3_5-4B-bf16.gguf filter=lfs diff=lfs merge=lfs -text
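
Git applies the `filter=lfs` attribute to any path matching one of the patterns above. That is why the `.gguf` files in this commit appear only as small LFS pointers, while `README.md` (which matches no pattern) is stored in full. As a rough sketch, written for illustration and not as Git's own matcher, slash-free patterns can be checked against a file's basename with Python's `fnmatch`. `lfs_tracked` and the abbreviated `LFS_PATTERNS` list are hypothetical names.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A few patterns copied from the .gitattributes above (abbreviated).
LFS_PATTERNS = [
    "*.safetensors",
    "*.tar.*",
    "*tfevents*",
    "saved_model/**/*",
    "OpenGVLab_InternVL3_5-4B-Q4_K_M.gguf",
    "mmproj-OpenGVLab_InternVL3_5-4B-f16.gguf",
]


def lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate gitattributes matching, for illustration only.

    Patterns without a slash are matched against the basename at any depth,
    as gitattributes does. Patterns with a slash are matched against the whole
    relative path; fnmatch's '*' also crosses '/', so this only roughly
    imitates Git's handling of '**'.
    """
    p = PurePosixPath(path)
    for pattern in patterns:
        target = str(p) if "/" in pattern else p.name
        # Case-sensitive, like Git's default attribute matching.
        if fnmatchcase(target, pattern):
            return True
    return False
```

Patterns that contain a slash, such as `saved_model/**/*`, are only approximated here, because `fnmatch` does not give `**/` Git's "zero or more directories" meaning.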

New file: OpenGVLab_InternVL3_5-4B-IQ2_M.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bf127546725e5514c70f3f75671dedc5cfc4389f08bacb37197d49cace632404
size 1680110272

New file: OpenGVLab_InternVL3_5-4B-IQ3_M.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3fb94a495c4aaf43b0310fcbd3e8fe85a488e07f1542d5ca8b913f21080ad6b4
size 2130022592

New file: OpenGVLab_InternVL3_5-4B-IQ3_XS.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:aa39f972adfef041143506b1e5b465b52d8a00027cd7f2bd9a07d09cd8d892ce
size 1981501632

New file: OpenGVLab_InternVL3_5-4B-IQ3_XXS.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ce461603aa42bdf8f53338d715c326d6bc0fee19f6d530cca1ae6561ba1dba6c
size 1837314752

New file: OpenGVLab_InternVL3_5-4B-IQ4_NL.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:739f16dac7e7f05f620432bf8d495797843302d145fe94c2e9361a1a825065fe
size 2600128192

New file: OpenGVLab_InternVL3_5-4B-IQ4_XS.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6a4b93b338ec2ef737a968a65f4b9ff6f20604d24f6feb8d139fc180330e0f03
size 2477381312

New file: OpenGVLab_InternVL3_5-4B-Q2_K.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:aa5cb96774e811bdc67a13f42d00e1ab5487993a964dcfa58766d299f0319fbc
size 1797122752

New file: OpenGVLab_InternVL3_5-4B-Q2_K_L.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:42be0fef12262a4f33bb554c9f725f1cd60ff9681671b1f57174dfff0476e49b
size 2176962752

New file: OpenGVLab_InternVL3_5-4B-Q3_K_L.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1153547dc5d7e68f9b5bf574dda342fedab3d5b964f4c90fb3be4bc0c3246967
size 2406912192

New file: OpenGVLab_InternVL3_5-4B-Q3_K_M.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:05c42d941c43082bf8a4d42e0ae9d347f7a499ca42996237314f24f0904d2896
size 2242744512

New file: OpenGVLab_InternVL3_5-4B-Q3_K_S.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2a541f59081cd34e339b97a6961e14e98e62b698796b456ca933c33934e82b5d
size 2054123712

New file: OpenGVLab_InternVL3_5-4B-Q3_K_XL.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1c57fac71f6a2ce1540449e14de38293daec6d218ba84fa274cce06ad1934454
size 2747248832

New file: OpenGVLab_InternVL3_5-4B-Q4_0.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:eb10e53ae463d0d0e54b666aefefce35293d1b7696dcecf12c546c0d0c021337
size 2594557632

New file: OpenGVLab_InternVL3_5-4B-Q4_1.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7330efa1818892af481c6a9cb0b279a9d90ff8f24fcb9522072a8c6e76032aa3
size 2839723712

New file: OpenGVLab_InternVL3_5-4B-Q4_K_L.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:afa6b06307244f651ebdb2eb527cc85580161d7d0f65e1726e502c6324809d35
size 3004743872

New file: OpenGVLab_InternVL3_5-4B-Q4_K_M.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7c1612b6896ad14caa501238e72afa17a600651d0984225e3ff78b39de86099c
size 2716065472

New file: OpenGVLab_InternVL3_5-4B-Q4_K_S.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c6b2918021b959a51b6a76a42b2c262903f477082d57eec8c8effaf61ca57a7b
size 2602094272

New file: OpenGVLab_InternVL3_5-4B-Q5_K_L.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:edc1dc20d9a674c228139238e60840896b5494d05372d70ee1086e3a2bd3962e
size 3396976832

New file: OpenGVLab_InternVL3_5-4B-Q5_K_M.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cc820484268ac0060471787e882f159af15f30952dc7adfadc04f08996bc8af4
size 3156917952

New file: OpenGVLab_InternVL3_5-4B-Q5_K_S.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:65e21353eb29293d9c4d752db84ba3eaefe6e0e0df3ef2bad7d976079b21b68f
size 3091115712

New file: OpenGVLab_InternVL3_5-4B-Q6_K.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a9d3fac3267f214152762bc74911808b34e1c60c24b4c63d3b76ac1e1dbebbb4
size 3625323712

New file: OpenGVLab_InternVL3_5-4B-Q6_K_L.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:157ad29bf7c7a9fa5ea778e207e77e21a12e1644231822741aae3c01604b77f2
size 3813724352

New file: OpenGVLab_InternVL3_5-4B-Q8_0.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ece87031e20486b1a4b86a0ba0f06b8b3b6eed676c8c6842e31041524489992d
size 4693668032

New file: OpenGVLab_InternVL3_5-4B-bf16.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:161ff4491525e55d64ef9191b7226566f74e03c848aed0eb0a39d39aea9a63f0
size 8829194144

New file: OpenGVLab_InternVL3_5-4B-imatrix.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0a967cf43c649250cfa2c5dfc0653abd7bc39ae3862562f32adf4245b82834e3
size 3872640
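
Each pointer above stands in for a large file and records that file's SHA-256 `oid` and its `size` in bytes, following the Git LFS v1 pointer format named on the `version` line. A minimal sketch, assuming you have the pointer text and a downloaded file, for checking that the download matches; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names.

```python
import hashlib
import os


def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS v1 pointer into {'version', 'oid', 'size'}."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    return {"version": fields["version"], "oid": digest, "size": int(fields["size"])}


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the pointer's size and SHA-256."""
    # Cheap check first: a size mismatch means a truncated or different file.
    if os.path.getsize(path) != pointer["size"]:
        return False
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in 1 MiB chunks so multi-GB GGUF files never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == pointer["oid"]
```

For example, pasting the imatrix pointer above into `parse_lfs_pointer` gives a size of 3872640 bytes, which `matches_pointer` then compares against your local copy before hashing it.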

New file: README.md

@@ -0,0 +1,164 @@
---
quantized_by: bartowski
pipeline_tag: image-text-to-text
base_model: OpenGVLab/InternVL3_5-4B
base_model_relation: quantized
---
## Llamacpp imatrix Quantizations of InternVL3_5-4B by OpenGVLab
Using <a href="https://github.com/ggml-org/llama.cpp/">llama.cpp</a> release <a href="https://github.com/ggml-org/llama.cpp/releases/tag/b6258">b6258</a> for quantization.

Original model: https://huggingface.co/OpenGVLab/InternVL3_5-4B

All quants were made using the imatrix option with the dataset from [here](https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8), combined with a subset of combined_all_small.parquet from Ed Addario [here](https://huggingface.co/datasets/eaddario/imatrix-calibration/blob/main/combined_all_small.parquet).

Run them in [LM Studio](https://lmstudio.ai/).

Run them directly with [llama.cpp](https://github.com/ggml-org/llama.cpp), or any other llama.cpp-based project.
## Prompt format
No prompt format found; check the original model page.
## Download a file (not the whole branch) from below:
| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| [InternVL3_5-4B-bf16.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-bf16.gguf) | bf16 | 8.83GB | false | Full BF16 weights. |
| [InternVL3_5-4B-Q8_0.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-Q8_0.gguf) | Q8_0 | 4.69GB | false | Extremely high quality, generally unneeded but max available quant. |
| [InternVL3_5-4B-Q6_K_L.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-Q6_K_L.gguf) | Q6_K_L | 3.81GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, *recommended*. |
| [InternVL3_5-4B-Q6_K.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-Q6_K.gguf) | Q6_K | 3.63GB | false | Very high quality, near perfect, *recommended*. |
| [InternVL3_5-4B-Q5_K_L.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-Q5_K_L.gguf) | Q5_K_L | 3.40GB | false | Uses Q8_0 for embed and output weights. High quality, *recommended*. |
| [InternVL3_5-4B-Q5_K_M.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-Q5_K_M.gguf) | Q5_K_M | 3.16GB | false | High quality, *recommended*. |
| [InternVL3_5-4B-Q5_K_S.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-Q5_K_S.gguf) | Q5_K_S | 3.09GB | false | High quality, *recommended*. |
| [InternVL3_5-4B-Q4_K_L.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-Q4_K_L.gguf) | Q4_K_L | 3.00GB | false | Uses Q8_0 for embed and output weights. Good quality, *recommended*. |
| [InternVL3_5-4B-Q4_1.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-Q4_1.gguf) | Q4_1 | 2.84GB | false | Legacy format, similar performance to Q4_K_S but with improved tokens/watt on Apple silicon. |
| [InternVL3_5-4B-Q3_K_XL.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-Q3_K_XL.gguf) | Q3_K_XL | 2.75GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| [InternVL3_5-4B-Q4_K_M.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-Q4_K_M.gguf) | Q4_K_M | 2.72GB | false | Good quality, default size for most use cases, *recommended*. |
| [InternVL3_5-4B-Q4_K_S.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-Q4_K_S.gguf) | Q4_K_S | 2.60GB | false | Slightly lower quality with more space savings, *recommended*. |
| [InternVL3_5-4B-IQ4_NL.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-IQ4_NL.gguf) | IQ4_NL | 2.60GB | false | Similar to IQ4_XS, but slightly larger. Offers online repacking for ARM CPU inference. |
| [InternVL3_5-4B-Q4_0.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-Q4_0.gguf) | Q4_0 | 2.59GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. |
| [InternVL3_5-4B-IQ4_XS.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-IQ4_XS.gguf) | IQ4_XS | 2.48GB | false | Decent quality, smaller than Q4_K_S with similar performance, *recommended*. |
| [InternVL3_5-4B-Q3_K_L.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-Q3_K_L.gguf) | Q3_K_L | 2.41GB | false | Lower quality but usable, good for low RAM availability. |
| [InternVL3_5-4B-Q3_K_M.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-Q3_K_M.gguf) | Q3_K_M | 2.24GB | false | Low quality. |
| [InternVL3_5-4B-Q2_K_L.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-Q2_K_L.gguf) | Q2_K_L | 2.18GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
| [InternVL3_5-4B-IQ3_M.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-IQ3_M.gguf) | IQ3_M | 2.13GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| [InternVL3_5-4B-Q3_K_S.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-Q3_K_S.gguf) | Q3_K_S | 2.05GB | false | Low quality, not recommended. |
| [InternVL3_5-4B-IQ3_XS.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-IQ3_XS.gguf) | IQ3_XS | 1.98GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| [InternVL3_5-4B-IQ3_XXS.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-IQ3_XXS.gguf) | IQ3_XXS | 1.84GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. |
| [InternVL3_5-4B-Q2_K.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-Q2_K.gguf) | Q2_K | 1.80GB | false | Very low quality but surprisingly usable. |
| [InternVL3_5-4B-IQ2_M.gguf](https://huggingface.co/bartowski/OpenGVLab_InternVL3_5-4B-GGUF/blob/main/OpenGVLab_InternVL3_5-4B-IQ2_M.gguf) | IQ2_M | 1.68GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
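
The File Size column matches decimal gigabytes (10^9 bytes): for example, the bf16 LFS pointer in this commit records 8,829,194,144 bytes, which rounds to the 8.83GB shown above, whereas tools that report binary gibibytes will show a smaller number. A small sketch of both conversions; `to_gb` and `to_gib` are hypothetical helper names.

```python
def to_gb(size_bytes: int) -> str:
    """Format a byte count as decimal gigabytes, like the File Size column."""
    return f"{size_bytes / 1e9:.2f}GB"


def to_gib(size_bytes: int) -> str:
    """Format the same count in binary gibibytes (2**30 bytes)."""
    return f"{size_bytes / 2**30:.2f}GiB"


# Byte counts taken from the LFS pointers in this commit.
for name, size in [("bf16", 8829194144), ("Q8_0", 4693668032), ("IQ2_M", 1680110272)]:
    print(name, to_gb(size), to_gib(size))
```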
## Embed/output weights
Some of these quants (Q3_K_XL, Q4_K_L, etc.) use the standard quantization method, but with the embedding and output weights quantized to Q8_0 instead of the types they would normally default to.
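
To check which type the embedding and output tensors actually use in a given file, one option is the `gguf` Python package from llama.cpp (`gguf-py`), whose `GGUFReader` exposes each tensor's name and type. The sketch below assumes llama.cpp's tensor names `token_embd.weight` and `output.weight`; a model with tied embeddings may have no separate `output.weight`, in which case the helper reports `None`. `embed_output_types` and `read_tensor_types` are hypothetical helper names.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# llama.cpp GGUF tensor names for the token embedding and the output projection.
EMBED = "token_embd.weight"
OUTPUT = "output.weight"


def embed_output_types(tensors: Iterable[Tuple[str, str]]) -> Dict[str, Optional[str]]:
    """Map the embed/output tensor names to their quantization type names.

    `tensors` is an iterable of (tensor_name, type_name) pairs. A tensor that is
    absent (e.g. no separate output.weight with tied embeddings) maps to None.
    """
    found: Dict[str, Optional[str]] = {EMBED: None, OUTPUT: None}
    for name, type_name in tensors:
        if name in found:
            found[name] = type_name
    return found


def read_tensor_types(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (name, type_name) pairs from a GGUF file using gguf-py's GGUFReader."""
    from gguf import GGUFReader  # pip install gguf; imported lazily

    reader = GGUFReader(path)
    for tensor in reader.tensors:
        # tensor_type is a GGMLQuantizationType enum, e.g. Q8_0 or Q4_K.
        yield tensor.name, tensor.tensor_type.name
```

Usage, once a file is downloaded: `embed_output_types(read_tensor_types("OpenGVLab_InternVL3_5-4B-Q4_K_L.gguf"))`. Per the table, a `_L`/`_XL` quant would be expected to report `Q8_0` for any embed/output tensor present.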
## Downloading using huggingface-cli
<details>
<summary>Click to view download instructions</summary>

First, make sure you have huggingface-cli installed:

```
pip install -U "huggingface_hub[cli]"
```

Then, you can target the specific file you want:

```
huggingface-cli download bartowski/OpenGVLab_InternVL3_5-4B-GGUF --include "OpenGVLab_InternVL3_5-4B-Q4_K_M.gguf" --local-dir ./
```

If the model is bigger than 50GB, it will have been split into multiple files. None of the files in this repo are that large, so the next command only illustrates the pattern. To download all parts of a split model to a local folder, run:

```
huggingface-cli download bartowski/OpenGVLab_InternVL3_5-4B-GGUF --include "OpenGVLab_InternVL3_5-4B-Q8_0/*" --local-dir ./
```

You can either specify a new local-dir (OpenGVLab_InternVL3_5-4B-Q8_0) or download them all in place (./).

</details>
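
The same download can be scripted from Python with `huggingface_hub.hf_hub_download`. Because this is an image-text-to-text model, you will likely also want one of the two `mmproj-*.gguf` vision projector files added in this commit alongside the language-model quant; llama.cpp's multimodal tools (for example `llama-server`) take that file through `--mmproj`. A sketch under that assumption; `files_for` and `download` are hypothetical helper names.

```python
from typing import List, Optional

REPO_ID = "bartowski/OpenGVLab_InternVL3_5-4B-GGUF"
PREFIX = "OpenGVLab_InternVL3_5-4B"


def files_for(quant: str, mmproj: Optional[str] = "f16") -> List[str]:
    """Return the model file for `quant`, plus the chosen mmproj projector.

    `mmproj` is "f16" or "bf16" (the two projector files in this repo),
    or None to fetch the text model only.
    """
    files = [f"{PREFIX}-{quant}.gguf"]
    if mmproj is not None:
        if mmproj not in ("f16", "bf16"):
            raise ValueError("mmproj must be 'f16', 'bf16' or None")
        files.append(f"mmproj-{PREFIX}-{mmproj}.gguf")
    return files


def download(quant: str, local_dir: str = ".", mmproj: Optional[str] = "f16") -> List[str]:
    """Download each file with hf_hub_download and return the local paths."""
    from huggingface_hub import hf_hub_download  # imported lazily; needs network

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in files_for(quant, mmproj)
    ]
```

Usage: `download("Q4_K_M", local_dir="./")` fetches the Q4_K_M quant and the f16 projector into the current directory.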
## ARM/AVX information

Previously, you would download Q4_0_4_4/4_8/8_8, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass.

Now, however, there is something called "online repacking" for weights; details are in [this PR](https://github.com/ggml-org/llama.cpp/pull/9921). If you use Q4_0 and your hardware would benefit from repacking weights, llama.cpp will do it automatically on the fly.

As of llama.cpp build [b4282](https://github.com/ggml-org/llama.cpp/releases/tag/b4282) you will not be able to run the Q4_0_X_X files and will instead need to use Q4_0.

Additionally, if you want slightly better quality, you can use IQ4_NL thanks to [this PR](https://github.com/ggml-org/llama.cpp/pull/10541), which will also repack the weights for ARM, though only the 4_4 layout for now. The loading time may be slower, but it will result in an overall speed increase.

<details>
<summary>Click to view Q4_0_X_X information (deprecated)</summary>

I'm keeping this section to show the potential theoretical uplift in performance from using Q4_0 with online repacking.

<details>
<summary>Click to view benchmarks on an AVX2 system (EPYC7702)</summary>

| model | size | params | backend | threads | test | t/s | % (vs Q4_0) |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |

Q4_0_8_8 offers a nice bump to prompt processing and a small bump to text generation.
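
The `% (vs Q4_0)` column compares each row's t/s with the Q4_0 row for the same test. A small sketch that recomputes that ratio from the mean t/s values copied from the table above (`relative_speed` is a hypothetical helper). Most recomputed values land within about one percentage point of the published column, but a few do not (for example Q4_K_M tg256, 24.46 vs 39.31 t/s, works out to roughly 62% rather than 83%), so treat the column as approximate.

```python
# Mean t/s values copied from the AVX2 (EPYC7702) table above.
Q4_0 = {"pp512": 204.03, "pp1024": 282.92, "pp2048": 259.49,
        "tg128": 39.12, "tg256": 39.31, "tg512": 40.52}
Q4_K_M = {"pp512": 301.02, "pp1024": 287.23, "pp2048": 262.77,
          "tg128": 18.80, "tg256": 24.46, "tg512": 36.32}
Q4_0_8_8 = {"pp512": 271.71, "pp1024": 279.86, "pp2048": 320.77,
            "tg128": 43.51, "tg256": 43.35, "tg512": 42.60}


def relative_speed(candidate: dict, baseline: dict) -> dict:
    """Throughput of `candidate` as a rounded percentage of `baseline`, per test."""
    return {test: round(100 * candidate[test] / baseline[test]) for test in baseline}


for test, pct in relative_speed(Q4_0_8_8, Q4_0).items():
    print(f"Q4_0_8_8 {test}: {pct}%")
```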
</details>
</details>
## Which file should I choose?
<details>
<summary>Click here for details</summary>

A great write-up with charts showing various performances is provided by Artefact2 [here](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9).

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.

If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.

If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
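
The sizing rule above can be written as a small helper that picks the largest quant whose file still leaves the requested headroom. A minimal sketch, assuming the file sizes from the download table and a default headroom of 1.5GB (the midpoint of the 1-2GB suggestion); `pick_quant` is a hypothetical helper, and it does not account for context length or for the roughly 0.65GB mmproj projector file you may also load for image input.

```python
from typing import Optional

# File sizes in GB, copied from the download table above.
QUANT_SIZES_GB = {
    "bf16": 8.83, "Q8_0": 4.69, "Q6_K_L": 3.81, "Q6_K": 3.63,
    "Q5_K_L": 3.40, "Q5_K_M": 3.16, "Q5_K_S": 3.09, "Q4_K_L": 3.00,
    "Q4_1": 2.84, "Q3_K_XL": 2.75, "Q4_K_M": 2.72, "Q4_K_S": 2.60,
    "IQ4_NL": 2.60, "Q4_0": 2.59, "IQ4_XS": 2.48, "Q3_K_L": 2.41,
    "Q3_K_M": 2.24, "Q2_K_L": 2.18, "IQ3_M": 2.13, "Q3_K_S": 2.05,
    "IQ3_XS": 1.98, "IQ3_XXS": 1.84, "Q2_K": 1.80, "IQ2_M": 1.68,
}


def pick_quant(vram_gb: float, ram_gb: float = 0.0,
               headroom_gb: float = 1.5) -> Optional[str]:
    """Return the largest quant whose file fits in memory minus headroom.

    Pass only vram_gb for the fastest, fully on-GPU option; add ram_gb to
    trade speed for a larger, higher-quality quant. Returns None if nothing fits.
    """
    budget = vram_gb + ram_gb - headroom_gb
    fitting = {q: s for q, s in QUANT_SIZES_GB.items() if s <= budget}
    if not fitting:
        return None
    # Largest file that fits; on a size tie, the earlier table entry wins.
    return max(fitting, key=fitting.get)
```

For instance, `pick_quant(6)` looks for the largest file at or under 4.5GB, and adding `ram_gb=8` raises the budget enough for the bf16 weights.
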

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'.

If you don't want to think too much, grab one of the K-quants. These are in the format 'QX_K_X', like Q5_K_M.

If you want to get more into the weeds, you can check out this extremely useful feature chart:

[llama.cpp feature matrix](https://github.com/ggml-org/llama.cpp/wiki/Feature-matrix)

But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQX_X, like IQ3_M. These are newer and offer better quality for their size.

These I-quants can also be used on CPU, but they will be slower than their K-quant equivalents, so speed vs. quality is a tradeoff you'll have to decide.
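
The naming conventions above ('QX_K_X' for K-quants, 'IQX_X' for I-quants) are regular enough to classify quant names mechanically, and the rule of thumb can be encoded alongside them. A small sketch, assuming the older non-K formats in this repo (Q4_0, Q4_1, Q8_0; the table calls the first two "legacy") form a third group; `quant_family`, `suggested_family` and the regexes are illustrative, not part of llama.cpp.

```python
import re

# Illustrative patterns based on the naming described above.
K_QUANT = re.compile(r"^Q\d_K(_[A-Z]+)?$")   # e.g. Q5_K_M, Q6_K, Q3_K_XL
I_QUANT = re.compile(r"^IQ\d_[A-Z]+$")       # e.g. IQ3_M, IQ4_XS, IQ3_XXS
LEGACY = re.compile(r"^Q\d_[01]$")           # e.g. Q4_0, Q4_1, Q8_0


def quant_family(quant: str) -> str:
    """Classify a quant type name as 'K-quant', 'I-quant', 'legacy' or 'other'."""
    if K_QUANT.match(quant):
        return "K-quant"
    if I_QUANT.match(quant):
        return "I-quant"
    if LEGACY.match(quant):
        return "legacy"
    return "other"  # e.g. bf16 (unquantized weights)


def suggested_family(below_q4: bool, gpu_backend: str) -> str:
    """Rule of thumb from above: I-quants below Q4 on cuBLAS or rocBLAS."""
    if below_q4 and gpu_backend.lower() in {"cublas", "rocblas"}:
        return "I-quant"
    return "K-quant"
```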
</details>
## Credits
Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset.

Thank you ZeroWw for the inspiration to experiment with embed/output.

Thank you to LM Studio for sponsoring my work.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski

New file: mmproj-OpenGVLab_InternVL3_5-4B-bf16.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c697d9fb3090cde25cd3c24c1cfad6970ef823e6f3fd072174200034f667d143
size 646227360

New file: mmproj-OpenGVLab_InternVL3_5-4B-f16.gguf

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0f9704972fcb9cb0a4f2c0f4eb7fe4f58e53ccd4b06ec17cf7a80271aa963eb7
size 645023136