# Quantization Guide
> Note: This feature is currently experimental. Future versions may introduce behavioral changes to configuration, coverage, or performance.
Like vLLM, we now support quantization methods such as compressed-tensors, AWQ, and GPTQ, enabling various precision configurations including W8A8, W4A16, and W8A16. These can help reduce memory consumption and accelerate inference while preserving model accuracy.
## Support Matrix
| Quantization Method | Mode | Model Types | Support |
|---|---|---|---|
| Compressed-Tensors (w8a8-Int8) | Dynamic | Dense/MoE | ✅ |
| Compressed-Tensors (w8a8-Int8) | Static | Dense/MoE | ✅ |
| AWQ (w4a16) | Weight only | Dense/MoE | ✅ |
| GPTQ (w4a16/w8a16) | Weight only | Dense | ✅ |
| GPTQ (w4a16/w8a16) | Weight only | MoE | WIP |
| Compressed-Tensors (w4a16) | Weight only | Dense/MoE | ✅ |
+ Compressed-Tensors w8a8-Int8 dynamic and static quantization are supported for all LLMs and VLMs.
+ Compressed-Tensors w4a16 quantization is supported for all LLMs and VLMs.
+ AWQ (w4a16) quantization is supported for all LLMs and VLMs.
+ GPTQ (w4a16/w8a16) quantization is supported for all dense models; MoE support is work in progress.
## Usage
### Compressed-Tensors
To run a `compressed-tensors` model with vLLM-Kunlun, you can use `Qwen/Qwen3-30B-A3B-Int8` with the following command:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-30B-A3B-Int8 \
    --quantization compressed-tensors
```
### AWQ
To run an `AWQ` model with vLLM-Kunlun, you can use `Qwen/Qwen3-32B-AWQ` with the following command:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-32B-AWQ \
    --quantization awq
```
### GPTQ
To run a `GPTQ` model with vLLM-Kunlun, you can use `Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4` with the following command:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --quantization gptq
```
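Whichever method is served, the commands above start an OpenAI-compatible API server, which listens on port 8000 by default. A minimal client sketch using only the standard library (the prompt and `max_tokens` values are arbitrary, and the host/port assume the defaults):

```python
import json
from urllib.request import Request, urlopen

# Build an OpenAI-style chat completion request for the served model.
payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
body = json.dumps(payload).encode("utf-8")
request = Request(
    "http://localhost:8000/v1/chat/completions",  # default host/port
    data=body,
    headers={"Content-Type": "application/json"},
)
# With the server running, uncomment to send the request:
# with urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```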