
# Quantization Guide
> Note: This feature is currently experimental. Its configuration, model coverage, and performance may change in future versions.

Like vLLM, vLLM-Kunlun supports quantization methods such as compressed-tensors, AWQ, and GPTQ, covering precision configurations including W8A8, W4A16, and W8A16. Quantization can reduce memory consumption and speed up inference while largely preserving model accuracy.
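
As a rough illustration of the memory saving, weight storage scales with the number of bits per weight. The sketch below assumes a hypothetical 32-billion-parameter model and counts weights only. It ignores quantization scales and zero points, activations, and the KV cache, so real usage will be higher. The `weight_gib` helper is illustrative and is not part of vLLM-Kunlun.

```Bash
# Approximate weight memory in GiB.
#   $1: parameter count in billions (assumed value, for illustration only)
#   $2: bits per weight (16 for FP16/BF16, 8 for W8A8/W8A16, 4 for W4A16)
weight_gib() {
    awk -v p="$1" -v b="$2" \
        'BEGIN { printf "%.1f\n", p * 1e9 * b / 8 / 1073741824 }'
}

weight_gib 32 16   # unquantized 16-bit weights -> 59.6
weight_gib 32 8    # 8-bit weights              -> 29.8
weight_gib 32 4    # 4-bit weights              -> 14.9
```

Under these assumptions, 4-bit weights take roughly a quarter of the space of 16-bit weights.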


## Support Matrix
<table border="1" style="border-collapse: collapse; width: auto; margin: 0 0 0 0; text-align: center;">
  <thead>
    <tr>
      <td colspan="2" style="padding: 10px; font-weight: bold; border: 1px solid #000;">Compressed-Tensors (w8a8-Int8)</td>
      <td colspan="4" style="padding: 10px; font-weight: bold; border: 1px solid #000;">Weight only (w4a16/w8a16)</td>
    </tr>
    <tr>
      <td style="padding: 10px; border: 1px solid #000;">Dynamic</td>
      <td style="padding: 10px; border: 1px solid #000;">Static</td>
      <td colspan="1" style="padding: 10px; border: 1px solid #000;">AWQ (w4a16)</td>
      <td colspan="2" style="padding: 10px; border: 1px solid #000;">GPTQ (w4a16/w8a16)</td>
      <td colspan="1" style="padding: 10px; border: 1px solid #000;">Compressed-Tensors (w4a16)</td>
    </tr>
    <tr>
      <td style="padding: 10px; border: 1px solid #000;">Dense/MoE</td>
      <td style="padding: 10px; border: 1px solid #000;">Dense/MoE</td>
      <td style="padding: 10px; border: 1px solid #000;">Dense/MoE</td>
      <td style="padding: 10px; border: 1px solid #000;">Dense</td>
      <td style="padding: 10px; border: 1px solid #000;">MoE</td>
      <td style="padding: 10px; border: 1px solid #000;">Dense/MoE</td>
    </tr>
  </thead>
  <tbody>
    <tr style="height: 40px;">
      <td style="padding: 10px; border: 1px solid #000;">✅</td>
      <td style="padding: 10px; border: 1px solid #000;">✅</td>
      <td style="padding: 10px; border: 1px solid #000;">✅</td>
      <td style="padding: 10px; border: 1px solid #000;">✅</td>
      <td style="padding: 10px; border: 1px solid #000;">WIP</td>
      <td style="padding: 10px; border: 1px solid #000;">✅</td>
    </tr>
  </tbody>
</table>

+ Compressed-Tensors w8a8-Int8 quantization (dynamic and static) is supported for all LLMs and VLMs.
+ Compressed-Tensors w4a16 quantization is supported for all LLMs and VLMs.
+ AWQ (w4a16) quantization is supported for all LLMs and VLMs.
+ GPTQ (w4a16/w8a16) quantization is supported for all dense models; MoE support is a work in progress (WIP).
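
The support matrix can also be written as a small lookup, which may be handy in launch scripts that should fail early on an unsupported combination. This is a minimal sketch that mirrors the table as documented here. The function name `quant_support` and the `dense`/`moe` labels are assumptions made for this example, not part of vLLM-Kunlun.

```Bash
# Print the documented support status for a quantization method and model type.
#   $1: value passed to --quantization (compressed-tensors, awq, gptq)
#   $2: model architecture, "dense" or "moe" (labels chosen for this sketch)
quant_support() {
    case "$1:$2" in
        compressed-tensors:dense|compressed-tensors:moe) echo "supported" ;;
        awq:dense|awq:moe)                               echo "supported" ;;
        gptq:dense)                                      echo "supported" ;;
        gptq:moe)                                        echo "wip" ;;
        *)                                               echo "unknown" ;;
    esac
}

quant_support awq moe    # prints "supported"
quant_support gptq moe   # prints "wip"
```

Because the feature is experimental, the lookup has to be kept in sync with the table by hand.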

## Usage

### Compressed-Tensors
To run a `compressed-tensors` model with vLLM-Kunlun, you can use `Qwen/Qwen3-30B-A3B-Int8` with the following command:

```Bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-30B-A3B-Int8 \
    --quantization compressed-tensors
```


### AWQ

To run an `AWQ` model with vLLM-Kunlun, you can use `Qwen/Qwen3-32B-AWQ` with the following command:

```Bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-32B-AWQ \
    --quantization awq
```


### GPTQ

To run a `GPTQ` model with vLLM-Kunlun, you can use `Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4` with the following command:

```Bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --quantization gptq
```
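
Once one of the servers above is running, it can be queried through the OpenAI-compatible HTTP API. The sketch below assumes the server listens on vLLM's default address `http://localhost:8000` and uses the `/v1/completions` endpoint. The `VLLM_URL` variable and both helper functions are names introduced for this example only.

```Bash
# Base URL of the running server (assumption: vLLM's default port 8000).
VLLM_URL="${VLLM_URL:-http://localhost:8000}"

# Build a minimal completion request body.
#   $1: model name, $2: prompt, $3: max_tokens
# Note: the prompt is not JSON-escaped, so avoid double quotes and backslashes.
build_payload() {
    printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}\n' "$1" "$2" "$3"
}

# Send the request to the completions endpoint and print the JSON response.
send_completion() {
    curl -s "$VLLM_URL/v1/completions" \
        -H "Content-Type: application/json" \
        -d "$(build_payload "$1" "$2" "$3")"
}
```

For example, `send_completion Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 "Hello, my name is" 32` sends a request to the GPTQ server started above. The model name in the request should match the `--model` value used at launch.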