[doc] update quantization guide doc (#88)

Li Wei
2026-01-07 15:39:51 +08:00
committed by GitHub
parent eb40e8a07a
commit c403d921ff
2 changed files with 52 additions and 21 deletions

@@ -4,28 +4,58 @@
Like vLLM, we now support quantization methods such as compressed-tensors, AWQ, and GPTQ, enabling various precision configurations including W8A8, W4A16, and W8A16. These can help reduce memory consumption and accelerate inference while preserving model accuracy.
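As a rough illustration of the memory savings, the weight footprint scales with the weight bit-width: in the WxAy notation, x is the weight precision in bits and y is the activation precision. The sketch below is back-of-the-envelope arithmetic, not a measurement. It ignores scales, zero points, activations, and the KV cache, and the 32-billion parameter count is an assumed example figure.

```python
def weight_memory_gib(num_params: float, weight_bits: int) -> float:
    """Approximate weight storage in GiB, ignoring quantization metadata."""
    return num_params * weight_bits / 8 / 2**30


# Assumed example: a model with 32 billion parameters.
NUM_PARAMS = 32e9

for label, bits in [("16-bit (unquantized)", 16), ("W8A8 / W8A16", 8), ("W4A16", 4)]:
    # Halving the weight bit-width halves the estimated weight footprint.
    print(f"{label:>22}: {weight_memory_gib(NUM_PARAMS, bits):6.1f} GiB")
```

Real savings are somewhat smaller than this estimate, because quantized checkpoints also store per-channel or per-group scales.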
## Support Matrix
<table border="1" style="border-collapse: collapse; width: auto; margin: 0 0 0 0; text-align: center;">
<thead>
<tr>
<td colspan="2" style="padding: 10px; font-weight: bold; border: 1px solid #000;">Compressed-Tensor (w8a8)</td>
<td colspan="4" style="padding: 10px; font-weight: bold; border: 1px solid #000;">Weight only (w4a16/w8a16)</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #000;">Dynamic</td>
<td style="padding: 10px; border: 1px solid #000;">Static</td>
<td colspan="2" style="padding: 10px; border: 1px solid #000;">AWQ (w4a16)</td>
<td colspan="2" style="padding: 10px; border: 1px solid #000;">GPTQ (w4a16/w8a16)</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #000;">Dense/MoE</td>
<td style="padding: 10px; border: 1px solid #000;">Dense/MoE</td>
<td style="padding: 10px; border: 1px solid #000;">Dense</td>
<td style="padding: 10px; border: 1px solid #000;">MoE</td>
<td style="padding: 10px; border: 1px solid #000;">Dense</td>
<td style="padding: 10px; border: 1px solid #000;">MoE</td>
</tr>
</thead>
<tbody>
<tr style="height: 40px;">
<td style="padding: 10px; border: 1px solid #000;"></td>
<td style="padding: 10px; border: 1px solid #000;"></td>
<td style="padding: 10px; border: 1px solid #000;"></td>
<td style="padding: 10px; border: 1px solid #000;">WIP</td>
<td style="padding: 10px; border: 1px solid #000;"></td>
<td style="padding: 10px; border: 1px solid #000;">WIP</td>
</tr>
</tbody>
</table>
+ W8A8 dynamic and static quantization are now supported for all LLMs and VLMs.
+ AWQ/GPTQ quantization is supported for all dense models.
## Usages
### Compressed-tensor
-To run a `compressed-tensors` model with vLLM-kunlun, you should first add the below configuration to the model's `config.json`:
-```Bash
-"quantization_config": {
-    "quant_method": "compressed-tensors"
-}
-```
-Then you run `Qwen/Qwen3-30B-A3B` with dynamic W8A8 quantization with the following command:
+To run a `compressed-tensors` model with vLLM-Kunlun, you can use `Qwen/Qwen3-30B-A3B-Int8` with the following command:
```Bash
python -m vllm.entrypoints.openai.api_server \
-    --model Qwen/Qwen3-30B-A3B \
+    --model Qwen/Qwen3-30B-A3B-Int8 \
    --quantization compressed-tensors
```
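Before launching, it can be useful to confirm which quantization method a local checkpoint declares. Hugging Face-style checkpoints usually record this under `quantization_config.quant_method` in `config.json`. The helper below is a minimal sketch under that assumption; `read_quant_method` is a hypothetical name, not part of vLLM or vLLM-Kunlun.

```python
import json
from pathlib import Path
from typing import Optional


def read_quant_method(model_dir: str) -> Optional[str]:
    """Return the quant_method declared by a local checkpoint, or None."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    # Unquantized checkpoints normally have no quantization_config block.
    quant_config = config.get("quantization_config") or {}
    return quant_config.get("quant_method")
```

For a checkpoint directory whose `config.json` contains a `compressed-tensors` block, the helper returns `"compressed-tensors"`, which matches the value passed to `--quantization`.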
### AWQ
To run an `AWQ` model with vLLM-Kunlun, you can use `Qwen/Qwen3-32B-AWQ` with the following command:
```Bash
python -m vllm.entrypoints.openai.api_server \
@@ -33,9 +63,10 @@ python -m vllm.entrypoints.openai.api_server \
    --quantization awq
```
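Once a server started with the command above is running, it can be queried over the OpenAI-compatible HTTP API. The sketch below assumes that the upstream vLLM defaults (port 8000, the `/v1/completions` endpoint) apply unchanged to vLLM-Kunlun and that the served model name equals the `--model` value. `build_completion_request` and `complete` are hypothetical helper names.

```python
import json
import urllib.request


def build_completion_request(model, prompt, base_url="http://localhost:8000", max_tokens=32):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model, prompt, **kwargs):
    """Send the request and return the first generated text."""
    request = build_completion_request(model, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the AWQ server running locally, `complete("Qwen/Qwen3-32B-AWQ", "Hello")` would return the generated continuation. The same helper works for the other quantization methods by changing the model name.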
### GPTQ
To run a `GPTQ` model with vLLM-Kunlun, you can use `Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4` with the following command:
```Bash
python -m vllm.entrypoints.openai.api_server \

@@ -2,14 +2,14 @@
## Generative Models
| Model | Support | W8A8 | LoRA | Tensor Parallel | Expert Parallel | Data Parallel | Piecewise Kunlun Graph |
| :------------ | :------ | :--- | :--- | :-------------- | :-------------- | :------------ | :--------------------- |
| Qwen3 | ✅ | | ✅ | ✅ | | ✅ | ✅ |
| Qwen3-Moe | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Qwen3-Next | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Deepseek v3.2 | ✅ | ✅ | | ✅ | | ✅ | ✅ |
## Multimodal Language Models
| Model | Support | W8A8 | LoRA | Tensor Parallel | Expert Parallel | Data Parallel | Piecewise Kunlun Graph |
| :------- | :------ | :--- | :--- | :-------------- | :-------------- | :------------ | :--------------------- |
| Qwen3-VL | ✅ | | | ✅ | | ✅ | ✅ |