qwen2.5-1.5b-gguf-experiments/README.md

---
base_model: Qwen/Qwen2.5-1.5B-Instruct
language:
- en
license: apache-2.0
tags:
- llama.cpp
- gguf
- quantized
- qwen2.5
- text-generation
pipeline_tag: text-generation
---

# Qwen2.5-1.5B-Instruct — GGUF Quantization Experiments

This repo contains **Qwen2.5-1.5B-Instruct** quantized into multiple GGUF formats using [llama.cpp](https://github.com/ggerganov/llama.cpp). It was created as part of a hands-on quantization experiment documenting the full process from raw HuggingFace weights → multiple GGUF formats → quality evaluation.

---

## What's in This Repo

```
gguf/
├── qwen2.5-1.5b-f16.gguf          ~2.9 GB   source of truth — full precision
├── qwen2.5-1.5b-Q8_0.gguf         ~1.6 GB   near-lossless
├── qwen2.5-1.5b-Q5_K_M.gguf       ~1.0 GB   great quality/size tradeoff
├── qwen2.5-1.5b-Q4_K_M.gguf       ~935 MB   recommended — sweet spot ★
├── qwen2.5-1.5b-Q4_K_S.gguf       ~865 MB   leaner 4-bit variant
├── qwen2.5-1.5b-Q2_K.gguf         ~530 MB   aggressive K-quant baseline
├── qwen2.5-1.5b-Q2_K_S.gguf       ~530 MB   aggressive K-quant — needs imatrix
├── qwen2.5-1.5b-IQ3_M.gguf        ~680 MB   importance-weighted 3-bit
├── qwen2.5-1.5b-IQ2_XS.gguf       ~480 MB   importance-weighted 2-bit — needs imatrix
├── qwen2.5-1.5b-IQ2_XXS.gguf      ~420 MB   most aggressive — needs imatrix
├── qwen2.5-1.5b-IQ2_S.gguf        ~450 MB   importance-weighted 2.5-bit — needs imatrix
├── qwen2.5-1.5b-IQ1_M.gguf        ~300 MB   extreme 1.75-bit — needs imatrix
└── qwen2.5-1.5b-IQ1_S.gguf        ~280 MB   extreme 1.56-bit — needs imatrix
```

> **Note on f16:** The F16 file is included as the reference baseline for perplexity comparisons. It is not intended for general inference use — at 2.9 GB it offers no practical advantage over Q8_0 for local deployment.

---

## Which File Should I Use?

| Use Case | Recommended Format |
|---|---|
| Best quality, VRAM not a concern | `Q8_0` |
| Daily driver — best quality/size tradeoff | `Q4_K_M` ← start here |
| Tight on memory, want decent quality | `Q4_K_S` or `Q2_K` |
| Edge deployment / very limited RAM | `IQ2_XS` or `IQ2_XXS` |
| Research / extreme compression testing | `IQ1_M` or `IQ1_S` |
| Partial GPU offload (CPU + GPU split) | `Q4_K_M` or `IQ3_M` |

If you're not sure, **start with Q4_K_M**. It's the most tested format in the community and gives you ~68% size reduction with minimal quality loss.

> ⚠️ **IQ1 and IQ2 formats** (`IQ1_S`, `IQ1_M`, `IQ2_S`, `IQ2_XS`, `IQ2_XXS`, `Q2_K_S`) were all generated **with an importance matrix**. Without one, these formats produce significantly degraded output. See the Imatrix Calibration section below for details.

---

## Format Guide

### K-Quant Family (Q\*\_K\_\*)
Standard llama.cpp quantization using superblocks of 256 weights. The suffix means:
- `_S` (Small) — more aggressive, smaller file
- `_M` (Medium) — mixed-precision, smarter assignment of bits to sensitive layers

Despite the "4" in Q4_K_M, it is **not** uniform 4-bit. Critical tensors like the embedding table and output projection are bumped to 6-bit internally. The "4" is the average bits-per-weight.

### IQ Family (IQ\*\_\*)
Importance-weighted quantization. These formats use an **importance matrix** — calibration data was run through the base model to identify which weights matter most, and precision was distributed accordingly. This is why IQ formats punch above their weight class at the same file size compared to K-quants.

The IQ2 files in this repo were generated with a WikiText-2 calibration dataset (see below). Without an importance matrix, these formats produce near-incoherent output — the imatrix is what makes them viable.

---

## Quantization Details

**Base model:** `Qwen/Qwen2.5-1.5B-Instruct`  
**Quantization tool:** llama.cpp build `7074` (commit `22e1ce2f8`)  
**Source precision:** F16 GGUF (converted from original SafeTensors)  
**Platform:** Apple Silicon (arm64)  

### Imatrix Calibration

The IQ2 formats were quantized using an importance matrix generated from WikiText-2:

```python
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

with open("calibration.txt", "w") as f:
    for row in dataset:
        text = row["text"].strip()
        if len(text) > 100:
            f.write(text + "\n")
```

```bash
./build/bin/llama-imatrix \
    -m qwen2.5-1.5b-f16.gguf \
    -f calibration.txt \
    -o imatrix.dat \
    --ctx-size 512 \
    -ngl -1 \
    --chunks 100
```

---

## How to Run

### llama.cpp CLI

```bash
./build/bin/llama-cli \
    -m qwen2.5-1.5b-Q4_K_M.gguf \
    -n 512 \
    -ngl 99 \
    --prompt "Explain the difference between supervised and unsupervised learning."
```

### llama.cpp Server (OpenAI-compatible)

```bash
./build/bin/llama-server \
    -m qwen2.5-1.5b-Q4_K_M.gguf \
    -ngl 99 \
    --port 8080
```

Then hit `http://localhost:8080/v1/chat/completions` like any OpenAI endpoint.

### Python (llama-cpp-python)

```python
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-1.5b-Q4_K_M.gguf",
    n_gpu_layers=-1,  # full GPU offload
    n_ctx=4096,
)

output = llm(
    "Explain quantization in simple terms:",
    max_tokens=256,
    temperature=0.7,
)
print(output["choices"][0]["text"])
```

### Ollama

```bash
ollama run hf.co/your-username/qwen2.5-1.5b-gguf-experiments:Q4_K_M
```

---

## Model Architecture (from metadata)

| Parameter | Value |
|---|---|
| Architecture | Qwen2 |
| Parameters | 1.5B |
| Layers | 28 |
| Hidden dimension | 1536 |
| FFN intermediate | 8960 |
| Attention heads (Q) | 12 |
| Attention heads (KV) | 2 |
| Attention type | Grouped Query Attention (GQA) |
| Context length | 32768 |
| Vocabulary size | 151,936 |
| Tokenizer | GPT-2 BPE (Qwen2 variant) |
---

## License

The quantized weights in this repo are derived from `Qwen/Qwen2.5-1.5B-Instruct` and inherit its [Apache 2.0 license](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/blob/main/LICENSE).

---

## Citation

If you use these files in your work, please also cite the original Qwen2.5 model:

```bibtex
@misc{qwen2.5,
    title  = {Qwen2.5: A Party of Foundation Models},
    author = {Qwen Team},
    year   = {2024},
    url    = {https://qwenlm.github.io/blog/qwen2.5/}
}
```
初始化项目，由ModelHub XC社区提供模型 Model: rkumar70900/qwen2.5-1.5b-gguf-experiments Source: Original Platform 2026-04-12 16:28:55 +08:00			`---`
			`base_model: Qwen/Qwen2.5-1.5B-Instruct`
			`language:`
			`- en`
			`license: apache-2.0`
			`tags:`
			`- llama.cpp`
			`- gguf`
			`- quantized`
			`- qwen2.5`
			`- text-generation`
			`pipeline_tag: text-generation`
			`---`

			`# Qwen2.5-1.5B-Instruct — GGUF Quantization Experiments`

			`This repo contains Qwen2.5-1.5B-Instruct quantized into multiple GGUF formats using [llama.cpp](https://github.com/ggerganov/llama.cpp). It was created as part of a hands-on quantization experiment documenting the full process from raw HuggingFace weights → multiple GGUF formats → quality evaluation.`

			`---`

			`## What's in This Repo`

			```
			`gguf/`
			`├── qwen2.5-1.5b-f16.gguf ~2.9 GB source of truth — full precision`
			`├── qwen2.5-1.5b-Q8_0.gguf ~1.6 GB near-lossless`
			`├── qwen2.5-1.5b-Q5_K_M.gguf ~1.0 GB great quality/size tradeoff`
			`├── qwen2.5-1.5b-Q4_K_M.gguf ~935 MB recommended — sweet spot ★`
			`├── qwen2.5-1.5b-Q4_K_S.gguf ~865 MB leaner 4-bit variant`
			`├── qwen2.5-1.5b-Q2_K.gguf ~530 MB aggressive K-quant baseline`
			`├── qwen2.5-1.5b-Q2_K_S.gguf ~530 MB aggressive K-quant — needs imatrix`
			`├── qwen2.5-1.5b-IQ3_M.gguf ~680 MB importance-weighted 3-bit`
			`├── qwen2.5-1.5b-IQ2_XS.gguf ~480 MB importance-weighted 2-bit — needs imatrix`
			`├── qwen2.5-1.5b-IQ2_XXS.gguf ~420 MB most aggressive — needs imatrix`
			`├── qwen2.5-1.5b-IQ2_S.gguf ~450 MB importance-weighted 2.5-bit — needs imatrix`
			`├── qwen2.5-1.5b-IQ1_M.gguf ~300 MB extreme 1.75-bit — needs imatrix`
			`└── qwen2.5-1.5b-IQ1_S.gguf ~280 MB extreme 1.56-bit — needs imatrix`
			```

			`> Note on f16: The F16 file is included as the reference baseline for perplexity comparisons. It is not intended for general inference use — at 2.9 GB it offers no practical advantage over Q8_0 for local deployment.`

			`---`

			`## Which File Should I Use?`

			`\| Use Case \| Recommended Format \|`
			`\|---\|---\|`
			\| Best quality, VRAM not a concern \| `Q8_0` \|
			\| Daily driver — best quality/size tradeoff \| `Q4_K_M` ← start here \|
			\| Tight on memory, want decent quality \| `Q4_K_S` or `Q2_K` \|
			\| Edge deployment / very limited RAM \| `IQ2_XS` or `IQ2_XXS` \|
			\| Research / extreme compression testing \| `IQ1_M` or `IQ1_S` \|
			\| Partial GPU offload (CPU + GPU split) \| `Q4_K_M` or `IQ3_M` \|

			`If you're not sure, start with Q4_K_M. It's the most tested format in the community and gives you ~68% size reduction with minimal quality loss.`

			> ⚠️ IQ1 and IQ2 formats (`IQ1_S`, `IQ1_M`, `IQ2_S`, `IQ2_XS`, `IQ2_XXS`, `Q2_K_S`) were all generated with an importance matrix. Without one, these formats produce significantly degraded output. See the Imatrix Calibration section below for details.

			`---`

			`## Format Guide`

			`### K-Quant Family (Q\\_K\_\)`
			`Standard llama.cpp quantization using superblocks of 256 weights. The suffix means:`
			- `_S` (Small) — more aggressive, smaller file
			- `_M` (Medium) — mixed-precision, smarter assignment of bits to sensitive layers

			`Despite the "4" in Q4_K_M, it is not uniform 4-bit. Critical tensors like the embedding table and output projection are bumped to 6-bit internally. The "4" is the average bits-per-weight.`

			`### IQ Family (IQ\\_\)`
			`Importance-weighted quantization. These formats use an importance matrix — calibration data was run through the base model to identify which weights matter most, and precision was distributed accordingly. This is why IQ formats punch above their weight class at the same file size compared to K-quants.`

			`The IQ2 files in this repo were generated with a WikiText-2 calibration dataset (see below). Without an importance matrix, these formats produce near-incoherent output — the imatrix is what makes them viable.`

			`---`

			`## Quantization Details`

			Base model: `Qwen/Qwen2.5-1.5B-Instruct`
			Quantization tool: llama.cpp build `7074` (commit `22e1ce2f8`)
			`Source precision: F16 GGUF (converted from original SafeTensors)`
			`Platform: Apple Silicon (arm64)`

			`### Imatrix Calibration`

			`The IQ2 formats were quantized using an importance matrix generated from WikiText-2:`

			```python
			`from datasets import load_dataset`

			`dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")`

			`with open("calibration.txt", "w") as f:`
			`for row in dataset:`
			`text = row["text"].strip()`
			`if len(text) > 100:`
			`f.write(text + "\n")`
			```

			```bash
			`./build/bin/llama-imatrix \`
			`-m qwen2.5-1.5b-f16.gguf \`
			`-f calibration.txt \`
			`-o imatrix.dat \`
			`--ctx-size 512 \`
			`-ngl -1 \`
			`--chunks 100`
			```

			`---`

			`## How to Run`

			`### llama.cpp CLI`

			```bash
			`./build/bin/llama-cli \`
			`-m qwen2.5-1.5b-Q4_K_M.gguf \`
			`-n 512 \`
			`-ngl 99 \`
			`--prompt "Explain the difference between supervised and unsupervised learning."`
			```

			`### llama.cpp Server (OpenAI-compatible)`

			```bash
			`./build/bin/llama-server \`
			`-m qwen2.5-1.5b-Q4_K_M.gguf \`
			`-ngl 99 \`
			`--port 8080`
			```

			Then hit `http://localhost:8080/v1/chat/completions` like any OpenAI endpoint.

			`### Python (llama-cpp-python)`

			```python
			`from llama_cpp import Llama`

			`llm = Llama(`
			`model_path="qwen2.5-1.5b-Q4_K_M.gguf",`
			`n_gpu_layers=-1, # full GPU offload`
			`n_ctx=4096,`
			`)`

			`output = llm(`
			`"Explain quantization in simple terms:",`
			`max_tokens=256,`
			`temperature=0.7,`
			`)`
			`print(output["choices"][0]["text"])`
			```

			`### Ollama`

			```bash
			`ollama run hf.co/your-username/qwen2.5-1.5b-gguf-experiments:Q4_K_M`
			```

			`---`

			`## Model Architecture (from metadata)`

			`\| Parameter \| Value \|`
			`\|---\|---\|`
			`\| Architecture \| Qwen2 \|`
			`\| Parameters \| 1.5B \|`
			`\| Layers \| 28 \|`
			`\| Hidden dimension \| 1536 \|`
			`\| FFN intermediate \| 8960 \|`
			`\| Attention heads (Q) \| 12 \|`
			`\| Attention heads (KV) \| 2 \|`
			`\| Attention type \| Grouped Query Attention (GQA) \|`
			`\| Context length \| 32768 \|`
			`\| Vocabulary size \| 151,936 \|`
			`\| Tokenizer \| GPT-2 BPE (Qwen2 variant) \|`
			`---`

			`## License`

			The quantized weights in this repo are derived from `Qwen/Qwen2.5-1.5B-Instruct` and inherit its [Apache 2.0 license](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/blob/main/LICENSE).

			`---`

			`## Citation`

			`If you use these files in your work, please also cite the original Qwen2.5 model:`

			```bibtex
			`@misc{qwen2.5,`
			`title = {Qwen2.5: A Party of Foundation Models},`
			`author = {Qwen Team},`
			`year = {2024},`
			`url = {https://qwenlm.github.io/blog/qwen2.5/}`
			`}`
			```