Quantization Guide
Note: This feature is currently experimental. Configuration, model coverage, and performance may change in future versions.
Like vLLM, we now support quantization methods such as compressed-tensors, AWQ, and GPTQ, enabling precision configurations including W8A8, W4A16, and W8A16. These reduce memory consumption and can accelerate inference while largely preserving model accuracy.
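To give a feel for the memory savings, the following is a back-of-the-envelope sketch, not a measurement. It assumes that only weight storage counts (quantization scales, zero points, KV cache, and activations are ignored) and uses a hypothetical 32B-parameter dense model; the `weight_gib` helper is illustrative and not part of vLLM-Kunlun.

```python
# Rough weight-memory estimate for the precision configurations named above.
# Assumption: only weight storage is counted; scales, zero points, KV cache,
# and activations are ignored, so real footprints will be somewhat larger.

# Bits used to store each weight. The "A" part (activation precision) does not
# change weight storage, so W8A8 and W8A16 have the same weight footprint.
WEIGHT_BITS = {"FP16": 16, "W8A8": 8, "W8A16": 8, "W4A16": 4}


def weight_gib(num_params: float, scheme: str) -> float:
    """Return approximate weight storage in GiB for the given scheme."""
    bits = WEIGHT_BITS[scheme]
    return num_params * bits / 8 / 2**30


if __name__ == "__main__":
    params = 32e9  # hypothetical 32B-parameter dense model
    for scheme in WEIGHT_BITS:
        print(f"{scheme:>6}: {weight_gib(params, scheme):6.1f} GiB")
```

Under these assumptions, 8-bit weights take half the space of FP16 weights and 4-bit weights take a quarter.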
Support Matrix
| Family | Method | Precision | Dense | MoE |
| --- | --- | --- | --- | --- |
| Compressed-tensors | Dynamic | W8A8 | ✅ | ✅ |
| Compressed-tensors | Static | W8A8 | ✅ | ✅ |
| Weight only | AWQ | W4A16 | ✅ | WIP |
| Weight only | GPTQ | W4A16/W8A16 | ✅ | WIP |
- W8A8 dynamic and static quantization are now supported for all LLMs and VLMs.
- AWQ/GPTQ quantization is supported for all dense models; MoE support is work in progress (WIP).
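The difference between dynamic and static W8A8 lies in how the activation scale is chosen: static quantization fixes it ahead of time from calibration data, while dynamic quantization recomputes it from each input at runtime. The sketch below illustrates that general idea in plain Python. It is not the vLLM-Kunlun kernel implementation; it assumes symmetric quantization with a single per-tensor scale for simplicity, and the calibration range and activation values are made up.

```python
# Illustrative symmetric int8 quantization of an activation vector.
# Static:  the scale is fixed ahead of time from calibration data.
# Dynamic: the scale is recomputed from each input at runtime.


def quantize_int8(values, scale):
    """Map floats to int8 codes, clipping to the representable range."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]


def dynamic_scale(values):
    """Pick the scale so the largest magnitude in this input maps to 127."""
    peak = max(abs(v) for v in values)
    return peak / 127 if peak > 0 else 1.0


# Hypothetical calibration result: activations were observed within [-4, 4].
calibrated_scale = 4.0 / 127

# Made-up input containing an outlier (8.0) beyond the calibrated range.
activations = [0.5, -1.0, 8.0]

static_codes = quantize_int8(activations, calibrated_scale)
dynamic_codes = quantize_int8(activations, dynamic_scale(activations))

print("static :", dequantize(static_codes, calibrated_scale))
print("dynamic:", dequantize(dynamic_codes, dynamic_scale(activations)))
```

In this toy input the static scale clips the outlier to roughly the calibrated maximum, whereas the dynamic scale covers it at the cost of coarser resolution for the small values and an extra reduction over the input at runtime.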
Usage
Compressed-tensors
To run a compressed-tensors model with vLLM-Kunlun, you can use Qwen/Qwen3-30B-A3B-Int8 with the following command:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-30B-A3B-Int8 \
--quantization compressed-tensors
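Once the server is up, it can be queried like any OpenAI-compatible endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on `localhost:8000` (vLLM's default port, which may differ in your deployment) and exposes the `/v1/completions` route; `build_completion_request` and `complete` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: the server runs locally on vLLM's default port.
BASE_URL = "http://localhost:8000"


def build_completion_request(
    model: str, prompt: str, max_tokens: int = 32
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model: str, prompt: str) -> str:
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(build_completion_request(model, prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server from the command above running, `complete("Qwen/Qwen3-30B-A3B-Int8", "Hello")` would return the generated text. The `model` field must match the name the server was started with.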
AWQ
To run an AWQ model with vLLM-Kunlun, you can use Qwen/Qwen3-32B-AWQ with the following command:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-32B-AWQ \
--quantization awq
GPTQ
To run a GPTQ model with vLLM-Kunlun, you can use Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 with the following command:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
--quantization gptq
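The three commands above differ only in the model name and the `--quantization` value, so a launcher script can build them from one helper. The sketch below uses only the flags shown in this guide; the `server_command` function is a hypothetical convenience, not part of vLLM-Kunlun.

```python
import shlex
import sys

# Quantization methods taken from the three commands in this guide.
QUANT_METHODS = ("compressed-tensors", "awq", "gptq")


def server_command(model: str, quantization: str) -> list[str]:
    """Return the argv list that starts the OpenAI-compatible server."""
    if quantization not in QUANT_METHODS:
        raise ValueError(
            f"unsupported quantization {quantization!r}; expected one of {QUANT_METHODS}"
        )
    return [
        sys.executable,  # the current Python interpreter
        "-m",
        "vllm.entrypoints.openai.api_server",
        "--model",
        model,
        "--quantization",
        quantization,
    ]


if __name__ == "__main__":
    cmd = server_command("Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4", "gptq")
    # Print a copy-pasteable shell command instead of launching the server.
    print(shlex.join(cmd))
```

Passing the returned list to `subprocess.run` would launch the server. Rejecting unknown method names early gives a clearer error than a failed server start.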