# Quantization Guide
> Note: This feature is currently experimental. Future versions may introduce behavioral changes to its configuration, coverage, and performance.

Like upstream vLLM, we support quantization methods such as compressed-tensors, AWQ, and GPTQ, enabling precision configurations including W8A8, W4A16, and W8A16. These reduce memory consumption and can accelerate inference while largely preserving model accuracy.
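The notation WxAy means weights are stored in x-bit and activations computed in y-bit precision; for example, W8A16 keeps weights in int8 and dequantizes them to 16-bit for the matmul. A minimal sketch of the underlying idea, symmetric per-tensor int8 weight quantization (illustrative only, not the kernel implementation):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Per-element reconstruction error is bounded by half a quantization step.
assert np.abs(w - w_hat).max() <= scale / 2 + 1e-6
```

The int8 tensor takes a quarter of the memory of the fp32 original, at the cost of a reconstruction error bounded by half a quantization step.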
## Support Matrix
<table border="1" style="border-collapse: collapse; width: auto; margin: 0; text-align: center;">
  <thead>
    <tr>
      <td colspan="2" style="padding: 10px; font-weight: bold; border: 1px solid #000;">Compressed-Tensors (w8a8-Int8)</td>
      <td colspan="4" style="padding: 10px; font-weight: bold; border: 1px solid #000;">Weight-only (w4a16/w8a16)</td>
    </tr>
    <tr>
      <td style="padding: 10px; border: 1px solid #000;">Dynamic</td>
      <td style="padding: 10px; border: 1px solid #000;">Static</td>
      <td style="padding: 10px; border: 1px solid #000;">AWQ (w4a16)</td>
      <td colspan="2" style="padding: 10px; border: 1px solid #000;">GPTQ (w4a16/w8a16)</td>
      <td style="padding: 10px; border: 1px solid #000;">Compressed-Tensors (w4a16)</td>
    </tr>
    <tr>
      <td style="padding: 10px; border: 1px solid #000;">Dense/MoE</td>
      <td style="padding: 10px; border: 1px solid #000;">Dense/MoE</td>
      <td style="padding: 10px; border: 1px solid #000;">Dense/MoE</td>
      <td style="padding: 10px; border: 1px solid #000;">Dense</td>
      <td style="padding: 10px; border: 1px solid #000;">MoE</td>
      <td style="padding: 10px; border: 1px solid #000;">Dense/MoE</td>
    </tr>
  </thead>
  <tbody>
    <tr style="height: 40px;">
      <td style="padding: 10px; border: 1px solid #000;">✅</td>
      <td style="padding: 10px; border: 1px solid #000;">✅</td>
      <td style="padding: 10px; border: 1px solid #000;">✅</td>
      <td style="padding: 10px; border: 1px solid #000;">✅</td>
      <td style="padding: 10px; border: 1px solid #000;">WIP</td>
      <td style="padding: 10px; border: 1px solid #000;">✅</td>
    </tr>
  </tbody>
</table>
+ Compressed-Tensors w8a8-Int8 quantization (dynamic and static) is supported for all LLMs and VLMs.
+ Compressed-Tensors w4a16 quantization is supported for all LLMs and VLMs.
+ AWQ (w4a16) quantization is supported for all LLMs and VLMs.
+ GPTQ (w4a16/w8a16) quantization is supported for all dense models; MoE support is work in progress.
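The dynamic/static distinction above concerns activation scales: dynamic quantization computes the activation scale from each incoming batch at runtime, while static quantization fixes it ahead of time from a calibration set. A minimal sketch of the difference (illustrative, not the actual kernels):

```python
import numpy as np

def int8_quant(x: np.ndarray, scale: float) -> np.ndarray:
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dynamic_quant(x: np.ndarray):
    """Dynamic: scale derived from this very input at runtime."""
    scale = np.abs(x).max() / 127.0
    return int8_quant(x, scale), scale

def calibrate(calib_batches) -> float:
    """Static: scale fixed once from representative calibration data."""
    return max(np.abs(b).max() for b in calib_batches) / 127.0

rng = np.random.default_rng(1)
calib = [rng.standard_normal((8, 16)) for _ in range(4)]
static_scale = calibrate(calib)

x = rng.standard_normal((8, 16))
q_dyn, dyn_scale = dynamic_quant(x)
q_sta = int8_quant(x, static_scale)
# Dynamic adapts to each input; static avoids the per-batch max-reduction
# but can clip inputs that exceed the calibrated range.
```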
## Usage
### Compressed-Tensors
To run a `compressed-tensors` model with vLLM-Kunlun, you can use `Qwen/Qwen3-30B-A3B-Int8` with the following command:
```bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-30B-A3B-Int8 \
--quantization compressed-tensors
```
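Once the server is up, it exposes vLLM's OpenAI-compatible API (by default on port 8000). A minimal sketch of a completion request against that assumed default address, using only the standard library:

```python
import json
from urllib import request

# Request body for the OpenAI-compatible /v1/completions endpoint.
payload = {
    "model": "Qwen/Qwen3-30B-A3B-Int8",
    "prompt": "Explain int8 quantization in one sentence.",
    "max_tokens": 64,
}
req = request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# resp = request.urlopen(req)          # uncomment with the server running
# print(json.load(resp)["choices"][0]["text"])
```

The same request shape works for the AWQ and GPTQ servers below; only the `model` field changes.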
### AWQ
To run an `AWQ` model with vLLM-Kunlun, you can use `Qwen/Qwen3-32B-AWQ` with the following command:
```bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-32B-AWQ \
--quantization awq
```
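AWQ checkpoints store weights in 4-bit with one scale per group of input channels (commonly 128). A sketch of that group-wise layout; note the activation-aware scale search that gives AWQ its name is deliberately omitted here, so this is only the storage format, not the algorithm:

```python
import numpy as np

def quant_w4_groupwise(w: np.ndarray, group_size: int = 128):
    """Weight-only 4-bit quantization with one scale per group of
    `group_size` input channels. Symmetric int4 is kept in [-7, 7]
    here for simplicity."""
    out = w.shape[0]
    w = w.reshape(out, -1, group_size)                  # (out, n_groups, gs)
    scale = np.abs(w).max(axis=-1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

rng = np.random.default_rng(2)
w = rng.standard_normal((16, 256)).astype(np.float32)
q, scale = quant_w4_groupwise(w)
w_hat = (q * scale).reshape(w.shape)   # dequantized reconstruction
```

Smaller groups track local weight magnitudes more closely (lower error) at the cost of storing more scales.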
### GPTQ
To run a `GPTQ` model with vLLM-Kunlun, you can use `Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4` with the following command:
```bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
--quantization gptq
```
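4-bit checkpoints such as this one keep weights bit-packed on disk, commonly eight 4-bit values per 32-bit word. A minimal pack/unpack sketch of that idea (an illustrative nibble layout, not the exact format any particular kernel expects):

```python
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack unsigned 4-bit values (0..15) eight per uint32, lowest nibble first."""
    q = q.reshape(-1, 8).astype(np.uint32)
    packed = np.zeros(q.shape[0], dtype=np.uint32)
    for i in range(8):
        packed |= q[:, i] << (4 * i)
    return packed

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Recover the eight 4-bit values stored in each uint32."""
    cols = [(packed >> (4 * i)) & 0xF for i in range(8)]
    return np.stack(cols, axis=1).reshape(-1).astype(np.uint8)

vals = np.arange(16, dtype=np.uint8)   # sixteen 4-bit values
packed = pack_int4(vals)               # two uint32 words
assert np.array_equal(unpack_int4(packed), vals)
```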