diff --git a/docs/source/user_guide/feature_guide/quantization.md b/docs/source/user_guide/feature_guide/quantization.md
index 9851cd3..be5b793 100644
--- a/docs/source/user_guide/feature_guide/quantization.md
+++ b/docs/source/user_guide/feature_guide/quantization.md
@@ -4,28 +4,58 @@
 Like vLLM, we now support quantization methods such as compressed-tensors, AWQ, and GPTQ, enabling various precision configurations including W8A8, W4A16, and W8A16. These can help reduce memory consumption and accelerate inference while preserving model accuracy.
 
+## Support Matrix
+
+<table>
+  <tr>
+    <th colspan="2">Compressed-Tensor (w8a8)</th>
+    <th colspan="4">Weight only (w4a16/w8a16)</th>
+  </tr>
+  <tr>
+    <th>Dynamic</th>
+    <th>Static</th>
+    <th colspan="2">AWQ (w4a16)</th>
+    <th colspan="2">GPTQ (w4a16/w8a16)</th>
+  </tr>
+  <tr>
+    <td>Dense/MoE</td>
+    <td>Dense/MoE</td>
+    <td>Dense</td>
+    <td>MoE</td>
+    <td>Dense</td>
+    <td>MoE</td>
+  </tr>
+  <tr>
+    <td>✅</td>
+    <td>✅</td>
+    <td>✅</td>
+    <td>WIP</td>
+    <td>✅</td>
+    <td>WIP</td>
+  </tr>
+</table>
+
++ W8A8 dynamic and static quantization are now supported for all LLMs and VLMs.
++ AWQ/GPTQ quantization is supported for all dense models.
+
 ## Usages
 
 ### Compressed-tensor
 
-To run a `compressed-tensors` model with vLLM-kunlun, you should first add the below configuration to the model's `config.json`:
-
-```Bash
-"quantization_config": {
-    "quant_method": "compressed-tensors"
-  }
-```
-
-Then you run `Qwen/Qwen3-30B-A3B` with dynamic W8A8 quantization with the following command:
+To run a `compressed-tensors` model with vLLM-Kunlun, you can use `Qwen/Qwen3-30B-A3B-Int8` with the following command:
 
 ```Bash
 python -m vllm.entrypoints.openai.api_server \
-    --model Qwen/Qwen3-30B-A3B \
+    --model Qwen/Qwen3-30B-A3B-Int8 \
     --quantization compressed-tensors
 ```
 
+
 ### AWQ
 
-To run an `AWQ` model with vLLM-kunlun, you can use `Qwen/Qwen3-32B-AWQ` with the following command:
+To run an `AWQ` model with vLLM-Kunlun, you can use `Qwen/Qwen3-32B-AWQ` with the following command:
 
 ```Bash
 python -m vllm.entrypoints.openai.api_server \
@@ -33,9 +63,10 @@ python -m vllm.entrypoints.openai.api_server \
     --quantization awq
 ```
 
+
 ### GPTQ
 
-To run a `GPTQ` model with vLLM-kunlun, you can use `Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4` with the following command:
+To run a `GPTQ` model with vLLM-Kunlun, you can use `Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4` with the following command:
 
 ```Bash
 python -m vllm.entrypoints.openai.api_server \
diff --git a/docs/source/user_guide/support_matrix/supported_models.md b/docs/source/user_guide/support_matrix/supported_models.md
index fb86800..a3315bd 100644
--- a/docs/source/user_guide/support_matrix/supported_models.md
+++ b/docs/source/user_guide/support_matrix/supported_models.md
@@ -2,14 +2,14 @@
 
 ## Generative Models
 
-| Model | Support | W8A8 | LoRA | Tensor Parallel | Expert Parallel | Data Parallel | Piecewise Kunlun Graph |
-| :------------ | :------------ | :--- | :--- | :-------------- | :-------------- | :------------ | :--------------------- |
-| Qwen3 | ✅ | | ✅ | ✅ | | ✅ | ✅ |
-| Qwen3-Moe | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| Qwen3-Next | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-
+| Model | Support | W8A8 | LoRA | Tensor Parallel | Expert Parallel | Data Parallel | Piecewise Kunlun Graph |
+| :------------ | :------ | :--- | :--- | :-------------- | :-------------- | :------------ | :--------------------- |
+| Qwen3 | ✅ | ✅ | ✅ | ✅ | | ✅ | ✅ |
+| Qwen3-Moe | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| Qwen3-Next | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| Deepseek v3.2 | ✅ | ✅ | | ✅ | | ✅ | ✅ |
 
 ## Multimodal Language Models
 
-| Model | Support | W8A8 | LoRA | Tensor Parallel | Expert Parallel | Data Parallel | Piecewise Kunlun Graph |
-| :----------- | :------------ | :--- | :--- | :-------------- | :-------------- | :------------ | :--------------------- |
-| Qwen3-VL | ✅ | | | ✅ | | ✅ | ✅ |
+| Model | Support | W8A8 | LoRA | Tensor Parallel | Expert Parallel | Data Parallel | Piecewise Kunlun Graph |
+| :------- | :------ | :--- | :--- | :-------------- | :-------------- | :------------ | :--------------------- |
+| Qwen3-VL | ✅ | ✅ | | ✅ | | ✅ | ✅ |