# Quantization Guide

> Note: This feature is currently experimental. Future versions may introduce behavioral changes around configuration, coverage, and performance.

Like vLLM, we support quantization methods such as compressed-tensors, AWQ, and GPTQ, enabling various precision configurations including W8A8, W4A16, and W8A16. These can reduce memory consumption and accelerate inference while preserving model accuracy.

## Support Matrix
| Compressed-Tensors (w8a8-Int8) Dynamic | Compressed-Tensors (w8a8-Int8) Static | AWQ (w4a16) | GPTQ (w4a16/w8a16) | Compressed-Tensors (w4a16) |
|---|---|---|---|---|
| Dense/MoE | Dense/MoE | Dense/MoE | Dense (MoE: WIP) | Dense/MoE |
+ Compressed-Tensors w8a8-Int8 dynamic and static quantization is supported for all LLMs and VLMs.
+ Compressed-Tensors w4a16 quantization is supported for all LLMs and VLMs.
+ AWQ (w4a16) quantization is supported for all LLMs and VLMs.
+ GPTQ (w4a16/w8a16) quantization is supported for all dense models.

## Usages

### Compressed-Tensors

To run a `compressed-tensors` model with vLLM-Kunlun, you can use `Qwen/Qwen3-30B-A3B-Int8` with the following command:

```Bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-30B-A3B-Int8 \
    --quantization compressed-tensors
```

### AWQ

To run an `AWQ` model with vLLM-Kunlun, you can use `Qwen/Qwen3-32B-AWQ` with the following command:

```Bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-32B-AWQ \
    --quantization awq
```

### GPTQ

To run a `GPTQ` model with vLLM-Kunlun, you can use `Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4` with the following command:

```Bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --quantization gptq
```
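The `--quantization` value in the commands above matches the `quant_method` field that quantization tools write into the checkpoint's `config.json` under `quantization_config`. As a minimal sketch (the `quantization_flag` helper is hypothetical, not part of vLLM; vLLM can usually also detect the method automatically from the checkpoint), you could map a loaded config to the matching flag like this:

```python
# Hypothetical helper: derive the --quantization flag for vLLM-Kunlun from a
# Hugging Face model config dict. "compressed-tensors", "awq", and "gptq" are
# the quant_method values written by llm-compressor, AutoAWQ, and AutoGPTQ.
def quantization_flag(config: dict):
    qc = config.get("quantization_config")
    if qc is None:
        # Unquantized checkpoint: no --quantization flag needed.
        return None
    method = qc.get("quant_method")
    supported = {"compressed-tensors", "awq", "gptq"}
    if method not in supported:
        raise ValueError(f"unsupported quant_method: {method!r}")
    return method
```

For example, `Qwen/Qwen3-32B-AWQ` ships a config whose `quantization_config.quant_method` is `"awq"`, which maps to `--quantization awq`.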