[doc] update quantization guide doc (#88)
@@ -4,28 +4,58 @@
Like vLLM, we now support quantization methods such as compressed-tensors, AWQ, and GPTQ, enabling various precision configurations including W8A8, W4A16, and W8A16. These can help reduce memory consumption and accelerate inference while preserving model accuracy.

## Support Matrix

<table border="1" style="border-collapse: collapse; width: auto; margin: 0 0 0 0; text-align: center;">
  <thead>
    <tr>
      <td colspan="2" style="padding: 10px; font-weight: bold; border: 1px solid #000;">Compressed-Tensor (w8a8)</td>
      <td colspan="4" style="padding: 10px; font-weight: bold; border: 1px solid #000;">Weight only (w4a16/w8a16)</td>
    </tr>
    <tr>
      <td style="padding: 10px; border: 1px solid #000;">Dynamic</td>
      <td style="padding: 10px; border: 1px solid #000;">Static</td>
      <td colspan="2" style="padding: 10px; border: 1px solid #000;">AWQ (w4a16)</td>
      <td colspan="2" style="padding: 10px; border: 1px solid #000;">GPTQ (w4a16/w8a16)</td>
    </tr>
    <tr>
      <td style="padding: 10px; border: 1px solid #000;">Dense/MoE</td>
      <td style="padding: 10px; border: 1px solid #000;">Dense/MoE</td>
      <td style="padding: 10px; border: 1px solid #000;">Dense</td>
      <td style="padding: 10px; border: 1px solid #000;">MoE</td>
      <td style="padding: 10px; border: 1px solid #000;">Dense</td>
      <td style="padding: 10px; border: 1px solid #000;">MoE</td>
    </tr>
  </thead>
  <tbody>
    <tr style="height: 40px;">
      <td style="padding: 10px; border: 1px solid #000;">✅</td>
      <td style="padding: 10px; border: 1px solid #000;">✅</td>
      <td style="padding: 10px; border: 1px solid #000;">✅</td>
      <td style="padding: 10px; border: 1px solid #000;">WIP</td>
      <td style="padding: 10px; border: 1px solid #000;">✅</td>
      <td style="padding: 10px; border: 1px solid #000;">WIP</td>
    </tr>
  </tbody>
</table>

+ W8A8 dynamic and static quantization are now supported for all LLMs and VLMs.
+ AWQ/GPTQ quantization is supported for all dense models.

## Usages
|
## Usages
|
||||||
|
|
||||||
### Compressed-tensor
|
### Compressed-tensor
|
||||||
To run a `compressed-tensors` model with vLLM-kunlun, you should first add the below configuration to the model's `config.json`:
|
To run a `compressed-tensors` model with vLLM-Kunlun, you can use `Qwen/Qwen3-30B-A3B-Int8` with the following command:
|
||||||
|
|
||||||
```Bash
|
|
||||||
"quantization_config": {
|
|
||||||
"quant_method": "compressed-tensors"
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Then you run `Qwen/Qwen3-30B-A3B` with dynamic W8A8 quantization with the following command:
|
|
||||||
|
|
||||||
```Bash
|
```Bash
|
||||||
python -m vllm.entrypoints.openai.api_server \
|
python -m vllm.entrypoints.openai.api_server \
|
||||||
--model Qwen/Qwen3-30B-A3B \
|
--model Qwen/Qwen3-30B-A3B-Int8 \
|
||||||
--quantization compressed-tensors
|
--quantization compressed-tensors
|
||||||
```
|
```
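
Once the server is up, a quick request against its OpenAI-compatible API can confirm that the quantized model loads and generates. The sketch below is a minimal smoke test and is not part of the original guide. It assumes the server is reachable at `http://localhost:8000` (vLLM's default port), that `curl` is installed, and that the model is served under the same name passed to `--model`. `build_payload`, `query_server`, and `BASE_URL` are hypothetical helper names introduced here for illustration.

```bash
# Sketch under stated assumptions: server on localhost:8000, curl available.

# Build the JSON body for the /v1/completions endpoint.
# Naive formatting: the prompt must not contain double quotes or backslashes.
build_payload() {
  local model="$1" prompt="$2" max_tokens="${3:-32}"
  printf '{"model": "%s", "prompt": "%s", "max_tokens": %s}' \
    "$model" "$prompt" "$max_tokens"
}

# Send the request and print the raw JSON response.
# BASE_URL is a hypothetical override for a non-default host or port.
query_server() {
  local base_url="${BASE_URL:-http://localhost:8000}"
  curl -sS "${base_url}/v1/completions" \
    -H "Content-Type: application/json" \
    -d "$(build_payload "$@")"
}

# Example (requires the server started with the command above):
# query_server "Qwen/Qwen3-30B-A3B-Int8" "Hello, my name is" 16
```

The same helper should work for the AWQ and GPTQ servers started in the following sections, provided the model name in the request matches the one given to `--model`.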
|
||||||
|
|
||||||
|
|
||||||
### AWQ
|
### AWQ
|
||||||
|
|
||||||
To run an `AWQ` model with vLLM-kunlun, you can use `Qwen/Qwen3-32B-AWQ` with the following command:
|
To run an `AWQ` model with vLLM-Kunlun, you can use `Qwen/Qwen3-32B-AWQ` with the following command:
|
||||||
|
|
||||||
```Bash
|
```Bash
|
||||||
python -m vllm.entrypoints.openai.api_server \
|
python -m vllm.entrypoints.openai.api_server \
|
||||||
@@ -33,9 +63,10 @@ python -m vllm.entrypoints.openai.api_server \
|
|||||||
--quantization awq
|
--quantization awq
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
### GPTQ
|
### GPTQ
|
||||||
|
|
||||||
To run a `GPTQ` model with vLLM-kunlun, you can use `Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4` with the following command:
|
To run a `GPTQ` model with vLLM-Kunlun, you can use `Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4` with the following command:
|
||||||
|
|
||||||
```Bash
|
```Bash
|
||||||
python -m vllm.entrypoints.openai.api_server \
|
python -m vllm.entrypoints.openai.api_server \
|
||||||
|
|||||||
@@ -2,14 +2,14 @@
|
|||||||
|
|
||||||
## Generative Models
|
## Generative Models
|
||||||
|
|
||||||
| Model | Support | W8A8 | LoRA | Tensor Parallel | Expert Parallel | Data Parallel | Piecewise Kunlun Graph |
|
| Model | Support | W8A8 | LoRA | Tensor Parallel | Expert Parallel | Data Parallel | Piecewise Kunlun Graph |
|
||||||
| :------------ | :------------ | :--- | :--- | :-------------- | :-------------- | :------------ | :--------------------- |
|
| :------------ | :------ | :--- | :--- | :-------------- | :-------------- | :------------ | :--------------------- |
|
||||||
| Qwen3 | ✅ | | ✅ | ✅ | | ✅ | ✅ |
|
| Qwen3 | ✅ | ✅ | ✅ | ✅ | | ✅ | ✅ |
|
||||||
| Qwen3-Moe | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
|
| Qwen3-Moe | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||||
| Qwen3-Next | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
|
| Qwen3-Next | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||||
|
| Deepseek v3.2 | ✅ | ✅ | | ✅ | | ✅ | ✅ |
|
||||||
|
|
||||||
## Multimodal Language Models
|
## Multimodal Language Models
|
||||||
| Model | Support | W8A8 | LoRA | Tensor Parallel | Expert Parallel | Data Parallel | Piecewise Kunlun Graph |
|
| Model | Support | W8A8 | LoRA | Tensor Parallel | Expert Parallel | Data Parallel | Piecewise Kunlun Graph |
|
||||||
| :----------- | :------------ | :--- | :--- | :-------------- | :-------------- | :------------ | :--------------------- |
|
| :------- | :------ | :--- | :--- | :-------------- | :-------------- | :------------ | :--------------------- |
|
||||||
| Qwen3-VL | ✅ | | | ✅ | | ✅ | ✅ |
|
| Qwen3-VL | ✅ | ✅ | | ✅ | | ✅ | ✅ |
|