# Quantization Guide
> Note: This feature is currently experimental. Future versions may introduce behavioral changes around configuration, model coverage, and performance.
Like vLLM, we now support quantization methods such as compressed-tensors, AWQ, and GPTQ, enabling various precision configurations including W8A8, W4A16, and W8A16. These can help reduce memory consumption and accelerate inference while preserving model accuracy.
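To see why lower-precision weights reduce memory consumption, a back-of-the-envelope estimate helps (this is an illustrative sketch only: it ignores activations, the KV cache, and quantization metadata such as scales and zero points, so real footprints will be somewhat higher):

```python
# Rough weight-memory estimate for a model at different per-weight bit widths.
def weight_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB for a given per-weight bit width."""
    return num_params * bits_per_weight / 8 / 2**30

params = 30e9  # e.g. a ~30B-parameter model such as Qwen3-30B-A3B

fp16 = weight_gib(params, 16)  # unquantized baseline
w8 = weight_gib(params, 8)     # W8A8 / W8A16 weights
w4 = weight_gib(params, 4)     # W4A16 weights

print(f"FP16: {fp16:.1f} GiB, INT8: {w8:.1f} GiB, INT4: {w4:.1f} GiB")
```

Halving the bit width halves the weight footprint, which is where the roughly 2x (W8) and 4x (W4) savings over FP16 come from.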
2026-01-07 15:39:51 +08:00
## Support Matrix
<table border="1" style="border-collapse: collapse; width: auto; margin: 0 0 0 0; text-align: center;">
<thead>
<tr>
<td colspan="2" style="padding: 10px; font-weight: bold; border: 1px solid #000;">Compressed-Tensors (W8A8)</td>
<td colspan="4" style="padding: 10px; font-weight: bold; border: 1px solid #000;">Weight-only (W4A16/W8A16)</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #000;">Dynamic</td>
<td style="padding: 10px; border: 1px solid #000;">Static</td>
<td colspan="2" style="padding: 10px; border: 1px solid #000;">AWQ (W4A16)</td>
<td colspan="2" style="padding: 10px; border: 1px solid #000;">GPTQ (W4A16/W8A16)</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #000;">Dense/MoE</td>
<td style="padding: 10px; border: 1px solid #000;">Dense/MoE</td>
<td style="padding: 10px; border: 1px solid #000;">Dense</td>
<td style="padding: 10px; border: 1px solid #000;">MoE</td>
<td style="padding: 10px; border: 1px solid #000;">Dense</td>
<td style="padding: 10px; border: 1px solid #000;">MoE</td>
</tr>
</thead>
<tbody>
<tr style="height: 40px;">
<td style="padding: 10px; border: 1px solid #000;">✅</td>
<td style="padding: 10px; border: 1px solid #000;">✅</td>
<td style="padding: 10px; border: 1px solid #000;">✅</td>
<td style="padding: 10px; border: 1px solid #000;">WIP</td>
<td style="padding: 10px; border: 1px solid #000;">✅</td>
<td style="padding: 10px; border: 1px solid #000;">WIP</td>
</tr>
</tbody>
</table>
+ W8A8 dynamic and static quantization are now supported for all LLMs and VLMs.
+ AWQ/GPTQ quantization is supported for all dense models.
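Which method applies is declared by the checkpoint itself: quantized models typically carry a `quantization_config` entry in their `config.json`. The exact fields vary by method and exporter, so the sample below only mimics the common shape (the `quant_method` key is the convention used by Hugging Face-style configs); treat it as illustrative rather than a spec:

```python
import json

# Illustrative config.json fragment for a compressed-tensors checkpoint.
# Field names follow the common Hugging Face convention; real checkpoints
# carry additional method-specific fields (group size, bit width, etc.).
sample_config = json.loads("""
{
  "model_type": "qwen3_moe",
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "format": "int-quantized"
  }
}
""")

quant = sample_config.get("quantization_config", {})
method = quant.get("quant_method", "none")
print(f"Detected quantization method: {method}")
```

Inspecting this field before serving tells you which `--quantization` value the commands in the next section should use.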
## Usage
### Compressed-tensors
To run a `compressed-tensors` model with vLLM-Kunlun, you can use `Qwen/Qwen3-30B-A3B-Int8` with the following command:
```bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-30B-A3B-Int8 \
--quantization compressed-tensors
```
### AWQ
To run an `AWQ` model with vLLM-Kunlun, you can use `Qwen/Qwen3-32B-AWQ` with the following command:
```bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-32B-AWQ \
--quantization awq
```
### GPTQ
To run a `GPTQ` model with vLLM-Kunlun, you can use `Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4` with the following command:
```bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
--quantization gptq
```