
Li Wei 71bd70ad6c [Feature] support compressed-tensors w4a16 quantization (#154 )

- native int4 kimi model inference is supported

Signed-off-by: Li Wei <liwei.109@outlook.com>

2026-01-27 19:56:22 +08:00


Quantization Guide

Note: This feature is currently experimental. Future versions may change its configuration, model coverage, and performance characteristics.

Like vLLM, vLLM-Kunlun now supports the compressed-tensors, AWQ, and GPTQ quantization methods, which cover the W8A8, W4A16, and W8A16 precision configurations. Quantization can reduce memory consumption and accelerate inference while largely preserving model accuracy.
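To make the memory claim concrete, here is a rough sketch of how the weight bit width in a label such as W4A16 (4-bit weights, 16-bit activations) translates into weight storage. The helper names `parse_scheme` and `weight_gib` and the 32B parameter count are illustrative assumptions, not part of vLLM-Kunlun. The estimate ignores quantization scales, zero points, and any layers left unquantized, so real checkpoints will be somewhat larger.

```python
import re


def parse_scheme(scheme: str) -> tuple[int, int]:
    """Split a 'W<bits>A<bits>' label into (weight_bits, activation_bits)."""
    match = re.fullmatch(r"[wW](\d+)[aA](\d+)", scheme)
    if match is None:
        raise ValueError(f"unrecognized scheme: {scheme!r}")
    return int(match.group(1)), int(match.group(2))


def weight_gib(num_params: float, scheme: str) -> float:
    """Approximate weight storage in GiB for the given scheme.

    Counts only the quantized weight values themselves.
    """
    weight_bits, _ = parse_scheme(scheme)
    return num_params * weight_bits / 8 / 2**30


if __name__ == "__main__":
    params = 32e9  # hypothetical 32B-parameter model
    for scheme in ("W16A16", "W8A8", "W8A16", "W4A16"):
        print(f"{scheme}: ~{weight_gib(params, scheme):.1f} GiB of weights")
```

Weight storage scales linearly with the weight bit width, so going from 16-bit to 4-bit weights cuts it to roughly a quarter. Activation width affects compute and runtime buffers rather than checkpoint size.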

Support Matrix

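| Method | Precision | Model type | Status |
| --- | --- | --- | --- |
| Compressed-Tensors, dynamic | w8a8-Int8 | Dense/MoE | ✅ |
| Compressed-Tensors, static | w8a8-Int8 | Dense/MoE | ✅ |
| AWQ (weight only) | w4a16 | Dense/MoE | ✅ |
| GPTQ (weight only) | w4a16/w8a16 | Dense | ✅ |
| GPTQ (weight only) | w4a16/w8a16 | MoE | WIP |
| Compressed-Tensors (weight only) | w4a16 | Dense/MoE | ✅ |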

Compressed-Tensors w8a8-Int8 dynamic and static quantization are supported for all LLMs and VLMs.
Compressed-Tensors w4a16 quantization is supported for all LLMs and VLMs.
AWQ (w4a16) quantization is supported for all LLMs and VLMs.
GPTQ (w4a16/w8a16) quantization is supported for all dense models; MoE support is a work in progress.
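For scripts that pick a checkpoint automatically, the support matrix can be encoded as a small lookup table. This is only a sketch: the key strings and the `is_supported` helper are hypothetical labels chosen for illustration, and the values are transcribed from the matrix above rather than read from vLLM-Kunlun itself.

```python
# Support status per method, transcribed from the support matrix.
# Key names are illustrative labels, not identifiers used by vLLM-Kunlun.
SUPPORT = {
    "compressed-tensors w8a8-int8 dynamic": {"dense": True, "moe": True},
    "compressed-tensors w8a8-int8 static": {"dense": True, "moe": True},
    "compressed-tensors w4a16": {"dense": True, "moe": True},
    "awq w4a16": {"dense": True, "moe": True},
    "gptq w4a16/w8a16": {"dense": True, "moe": False},  # MoE is WIP
}


def is_supported(method: str, model_type: str) -> bool:
    """Return True if the method is marked as supported for the model type."""
    try:
        return SUPPORT[method.lower()][model_type.lower()]
    except KeyError:
        raise ValueError(f"unknown combination: {method!r}, {model_type!r}") from None


if __name__ == "__main__":
    print(is_supported("gptq w4a16/w8a16", "dense"))
    print(is_supported("gptq w4a16/w8a16", "moe"))
```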

Usage

Compressed-Tensors

To run a compressed-tensors model with vLLM-Kunlun, you can use Qwen/Qwen3-30B-A3B-Int8 with the following command:

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-30B-A3B-Int8 \
    --quantization compressed-tensors

AWQ

To run an AWQ model with vLLM-Kunlun, you can use Qwen/Qwen3-32B-AWQ with the following command:

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-32B-AWQ \
    --quantization awq

GPTQ

To run a GPTQ model with vLLM-Kunlun, you can use Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 with the following command:

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --quantization gptq
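Once any of the servers above is running, it can be queried through the OpenAI-compatible completions endpoint. The sketch below uses only the Python standard library and assumes the server listens on `http://localhost:8000`, which is vLLM's usual default when no `--host` or `--port` is passed, and that the served model name equals the `--model` value. The helper names `build_completion_request` and `complete` are illustrative.

```python
import json
import urllib.request


def build_completion_request(model, prompt, base_url="http://localhost:8000", max_tokens=32):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model, prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(model, prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Example (requires a running server started with one of the commands above):
# print(complete("Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4", "Quantization reduces"))
```

The request is the same for all three quantization methods; only the model name changes, because quantization is selected when the server starts, not per request.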
