[Doc] modify the quantization user guide and add a quantization adaptation developer guide (#5554)
### What this PR does / why we need it?
This PR makes the following modifications:
1. Delete `user_guide/feature_guide/quantization-llm-compressor.md` and merge it into `user_guide/feature_guide/quantization.md`.
2. Update the content of `user_guide/feature_guide/quantization.md`.
3. Add a guide, `developer_guide/feature_guide/quantization.md`, on adapting quantization algorithms and quantized models.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main: 7157596103
---------
Signed-off-by: IncSec <1790766300@qq.com>
Signed-off-by: InSec <1790766300@qq.com>
docs/source/assets/quantization/get_quant_method.png (new binary file, 11 KiB)
docs/source/assets/quantization/quant_algorithm_overview.png (new binary file, 28 KiB)
docs/source/assets/quantization/quant_method_base_class.png (new binary file, 3.5 KiB)
docs/source/assets/quantization/quant_method_call_flow.png (new binary file, 15 KiB)
docs/source/assets/quantization/quant_methods_overview.png (new binary file, 18 KiB)
docs/source/developer_guide/feature_guide/index.md
@@ -14,4 +14,5 @@ ACL_Graph
KV_Cache_Pool_Guide
add_custom_aclnn_op
context_parallel
quantization
:::

docs/source/developer_guide/feature_guide/quantization.md (new file, 111 lines)
@@ -0,0 +1,111 @@

# Quantization Adaptation Guide

This document provides guidance for adapting quantization algorithms and models related to **ModelSlim**.

## Quantization Feature Introduction

### Quantization Inference Process

The current process for registering and obtaining quantization methods in vLLM Ascend is as follows:



vLLM Ascend registers a custom `ascend` quantization method. Passing the `--quantization ascend` parameter (or `quantization="ascend"` for offline inference) enables the quantization feature. When the `quant_config` is constructed, the registered `AscendQuantConfig` is initialized and its `get_quant_method` is called to obtain the quantization method for each weight part, which is stored in the `quant_method` attribute.
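
Conceptually, `get_quant_method` is a per-layer dispatch step. The sketch below only illustrates that dispatch, assuming a simple registry keyed by quantization type and layer kind; apart from `get_quant_method` itself, the class and variable names are illustrative and not the actual vLLM Ascend implementation:

```python
from typing import Any, Dict, Optional

# Illustrative registry; vLLM Ascend keeps the real mapping in
# vllm_ascend/quantization/utils.py as ASCEND_QUANTIZATION_METHOD_MAP.
METHOD_REGISTRY: Dict[str, Dict[str, Any]] = {}


class QuantConfigSketch:
    """Resolves one quantization method per weight part."""

    def __init__(self, quant_description: Dict[str, str]):
        # e.g. {"model.layers.0.mlp.experts.0.gate_proj.weight": "W8A8_DYNAMIC"}
        self.quant_description = quant_description

    def get_quant_method(self, layer_kind: str, prefix: str) -> Optional[Any]:
        quant_type = self.quant_description.get(f"{prefix}.weight", "FLOAT")
        if quant_type == "FLOAT":
            return None  # fall back to the unquantized method
        method_cls = METHOD_REGISTRY[quant_type][layer_kind]
        return method_cls()
```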

Currently supported quantization methods include `AscendLinearMethod`, `AscendFusedMoEMethod`, `AscendEmbeddingMethod`, and their corresponding non-quantized methods:



The quantization method base class defined by vLLM and the overall call flow of the quantization methods are as follows:



The `embedding` method is generally not implemented for quantization, so the focus is on the other three methods.

The `create_weights` method is used for weight initialization; the `process_weights_after_loading` method is used for weight post-processing, such as transposition, format conversion, and data type conversion; the `apply` method performs activation quantization and the quantized matrix multiplication during the forward pass.

We need to implement the `create_weights`, `process_weights_after_loading`, and `apply` methods for the different **layers** (**attention**, **mlp**, **moe**), as sketched below.
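
The following skeleton shows the shape such an implementation usually takes for a linear layer. It is a minimal sketch, not the code of any existing Ascend method: the class name, parameter shapes, and the dequantize-then-matmul fallback in `apply` are assumptions for illustration only.

```python
from typing import Optional

import torch


class MyQuantLinearMethodSketch:
    """Skeleton of the three hooks a quantization method implements."""

    def create_weights(self, layer: torch.nn.Module, input_size: int,
                       output_size: int, params_dtype: torch.dtype) -> None:
        # Register the quantized weight and its per-channel scale.
        layer.weight = torch.nn.Parameter(
            torch.empty(output_size, input_size, dtype=torch.int8),
            requires_grad=False)
        layer.weight_scale = torch.nn.Parameter(
            torch.empty(output_size, 1, dtype=params_dtype),
            requires_grad=False)

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        # Post-process loaded weights, e.g. transpose or convert them to the
        # layout/format expected by the NPU kernels.
        layer.weight.data = layer.weight.data.transpose(0, 1).contiguous()

    def apply(self, layer: torch.nn.Module, x: torch.Tensor,
              bias: Optional[torch.Tensor] = None) -> torch.Tensor:
        # Quantize the activation and run the quantized matmul. A plain
        # dequantize-then-matmul is shown here purely for illustration.
        w = layer.weight.t().to(x.dtype) * layer.weight_scale.to(x.dtype)
        out = torch.nn.functional.linear(x, w)
        return out + bias if bias is not None else out
```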

**Supplement**: When loading the model, the quantized model's description file **quant_model_description.json** needs to be read. This file describes the quantization configuration and parameters for each part of the model weights, for example:

```json
{
    "model.layers.0.linear_attn.dt_bias": "FLOAT",
    "model.layers.0.linear_attn.A_log": "FLOAT",
    "model.layers.0.linear_attn.conv1d.weight": "FLOAT",
    "model.layers.0.linear_attn.in_proj_qkvz.weight": "W8A8_DYNAMIC",
    "model.layers.0.linear_attn.in_proj_qkvz.weight_scale": "W8A8_DYNAMIC",
    "model.layers.0.linear_attn.in_proj_qkvz.weight_offset": "W8A8_DYNAMIC",
    "model.layers.0.linear_attn.in_proj_ba.weight": "FLOAT",
    "model.layers.0.linear_attn.norm.weight": "FLOAT",
    "model.layers.0.linear_attn.out_proj.weight": "FLOAT",
    "model.layers.0.mlp.gate.weight": "FLOAT",
    "model.layers.0.mlp.experts.0.gate_proj.weight": "W8A8_DYNAMIC",
    "model.layers.0.mlp.experts.0.gate_proj.weight_scale": "W8A8_DYNAMIC",
    "model.layers.0.mlp.experts.0.gate_proj.weight_offset": "W8A8_DYNAMIC"
}
```

Based on the above content, we present a brief description of the adaptation process for quantization algorithms and quantized models.

### Quantization Algorithm Adaptation

- **Step 1: Algorithm Design**. Define the algorithm ID (e.g., `W4A8_DYNAMIC`), determine the supported layers (linear, moe, attention), and design the quantization scheme (static/dynamic; per-tensor/per-channel/per-group).
- **Step 2: Registration**. Add the algorithm ID to `ASCEND_QUANTIZATION_METHOD_MAP` in `vllm_ascend/quantization/utils.py` and associate it with the corresponding method class.

```python
ASCEND_QUANTIZATION_METHOD_MAP: Dict[str, Dict[str, Type[Any]]] = {
    "W4A8_DYNAMIC": {
        "linear": AscendW4A8DynamicLinearMethod,
        "moe": AscendW4A8DynamicFusedMoEMethod,
    },
}
```

- **Step 3: Implementation**. Create an algorithm implementation file, such as `vllm_ascend/quantization/w4a8_dynamic.py`, and implement the method class and its logic.
- **Step 4: Testing**. Use your algorithm to generate quantization configurations and verify correctness and performance on the target models and hardware (a minimal smoke test is sketched after this list).
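
For the testing step, a quick sanity check is to load weights quantized with the new algorithm and generate a few tokens offline. The model path below is a placeholder:

```python
from vllm import LLM, SamplingParams

# Placeholder path to weights produced with the newly adapted algorithm.
llm = LLM(model="/path/to/model-quantized-with-new-algorithm",
          quantization="ascend",
          max_model_len=2048,
          trust_remote_code=True)

outputs = llm.generate(["The capital of France is"],
                       SamplingParams(temperature=0.0, max_tokens=16))
print(outputs[0].outputs[0].text)
```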

### Quantized Model Adaptation

Adapting a new quantized model requires ensuring the following three points:

- The original model has been successfully adapted in `vLLM Ascend`.
- **Fused Module Mapping**: Add the model's `model_type` to `packed_modules_model_mapping` in `vllm_ascend/quantization/quant_config.py` (e.g., `qkv_proj`, `gate_up_proj`, `experts`) to ensure sharding consistency and correct loading.

```python
packed_modules_model_mapping = {
    "qwen3_moe": {
        "qkv_proj": [
            "q_proj",
            "k_proj",
            "v_proj",
        ],
        "gate_up_proj": [
            "gate_proj",
            "up_proj",
        ],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
    },
}
```

- All quantization algorithms used by the quantized model have been integrated into the `quantization` module (see the check sketched below).
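
For the last point, one way to check coverage is to compare the quantization types listed in the model's `quant_model_description.json` with the registered algorithm map (a small sketch; the model path is a placeholder):

```python
import json

from vllm_ascend.quantization.utils import ASCEND_QUANTIZATION_METHOD_MAP

with open("/path/to/quantized_model/quant_model_description.json") as f:
    description = json.load(f)

# "FLOAT" parts are served by the non-quantized methods.
used_types = {v for v in description.values() if v != "FLOAT"}
missing = used_types - set(ASCEND_QUANTIZATION_METHOD_MAP)
if missing:
    raise ValueError(f"Quantization types not yet integrated: {missing}")
print(f"All quantization types are registered: {sorted(used_types)}")
```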

## Currently Supported Quantization Algorithms

vLLM Ascend supports multiple quantization algorithms. The following table provides an overview of each quantization algorithm based on the implementation in the `vllm_ascend.quantization` module:

| Algorithm | Weight | Activation | Weight Granularity | Activation Granularity | Type | Description |
| --- | --- | --- | --- | --- | --- | --- |
| `W4A16` | INT4 | FP16/BF16 | Per-Group | Per-Tensor | Static | 4-bit weight quantization with 16-bit activation precision, specifically designed for MoE model expert layers, supporting int32 format weight packing |
| `W8A16` | INT8 | FP16/BF16 | Per-Channel | Per-Tensor | Static | 8-bit weight quantization with 16-bit activation precision, balancing accuracy and performance, suitable for linear layers |
| `W8A8` | INT8 | INT8 | Per-Channel | Per-Tensor | Static | Static activation quantization, suitable for scenarios requiring high precision |
| `W8A8_DYNAMIC` | INT8 | INT8 | Per-Channel | Per-Token | Dynamic | Dynamic activation quantization with per-token scaling factor calculation |
| `W4A8_DYNAMIC` | INT4 | INT8 | Per-Group | Per-Token | Dynamic | Supports both direct per-channel quantization to 4-bit and two-step quantization (per-channel to 8-bit, then per-group to 4-bit) |
| `W4A4_FLATQUANT_DYNAMIC` | INT4 | INT4 | Per-Channel | Per-Token | Dynamic | Uses FlatQuant to smooth the activation distribution before 4-bit dynamic quantization, with additional matrix multiplications for precision preservation |
| `W8A8_MIX` | INT8 | INT8 | Per-Channel | Per-Tensor/Token | Mixed | In the PD colocation scenario, both the P node and the D node use dynamic quantization; in the PD disaggregation scenario, the P node uses dynamic quantization and the D node uses static quantization |

**Static vs Dynamic:** Static quantization uses pre-computed scaling factors and offers better performance, while dynamic quantization computes scaling factors on the fly for each token/activation tensor, giving higher precision.

**Granularity:** Refers to the scope of scaling factor computation (e.g., per-tensor, per-channel, per-group).
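
As a small illustration of these two notions, the snippet below computes a symmetric int8 scale once per tensor (static-style, per-tensor) and once per token (dynamic-style, per-token) for the same activation; it is illustrative only and not how the NPU kernels are implemented:

```python
import torch

x = torch.randn(4, 8)  # [num_tokens, hidden_size] activation

# Per-tensor: one pre-computable scale for the whole tensor.
per_tensor_scale = x.abs().max() / 127.0

# Per-token: one scale per row, computed on the fly at runtime.
per_token_scale = x.abs().amax(dim=-1, keepdim=True) / 127.0

x_q_static = torch.clamp((x / per_tensor_scale).round(), -128, 127).to(torch.int8)
x_q_dynamic = torch.clamp((x / per_token_scale).round(), -128, 127).to(torch.int8)
print(per_tensor_scale.shape, per_token_scale.shape)  # scalar vs. [4, 1]
```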
docs/source/user_guide/feature_guide/index.md
@@ -7,7 +7,6 @@ This section provides a detailed usage guide of vLLM Ascend features.
:maxdepth: 1
graph_mode
quantization
quantization-llm-compressor
sleep_mode
structured_output
lora

docs/source/user_guide/feature_guide/quantization-llm-compressor.md (deleted file)
@@ -1,65 +0,0 @@

# llm-compressor Quantization Guide

Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of the weights and activation values in the model, thereby saving the memory and improving the inference speed.

## Supported llm-compressor Quantization Types

Support CompressedTensorsW8A8 static weight

weight: per-channel, int8, symmetric; activation: per-tensor, int8, symmetric.

Support CompressedTensorsW8A8Dynamic weight

weight: per-channel, int8, symmetric; activation: per-token, int8, symmetric, dynamic.

## Install llm-compressor

To quantize a model, you should install [llm-compressor](https://github.com/vllm-project/llm-compressor/blob/main/README.md). It is a unified library for creating compressed models for faster inference with vLLM.

Install llm-compressor

```bash
pip install llmcompressor
```

### Generate the W8A8 weights

```bash
cd examples/quantization/llm-compressor

python3 w8a8_int8_dynamic.py
```

for more details, see the [Official Sample](https://github.com/vllm-project/llm-compressor/tree/main/examples).

## Run the model

Now, you can run the quantized model with vLLM Ascend. Examples for online and offline inference are provided as follows:

### Offline inference

```python
import torch

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="{quantized_model_save_path}",
          max_model_len=2048,
          trust_remote_code=True)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

### Online inference

Start the quantized model using vLLM Ascend; no modifications to the startup command are required.

docs/source/user_guide/feature_guide/quantization.md
@@ -1,71 +1,96 @@

# Quantization Guide

Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of the weights and activation values in the model, thereby saving the memory and improving the inference speed.
Model quantization is a technique that reduces model size and computational overhead by lowering the numerical precision of weights and activations, thereby saving memory and improving inference speed.

Since version 0.9.0rc2, the quantization feature is experimentally supported by vLLM Ascend. Users can enable the quantization feature by specifying `--quantization ascend`. Currently, only Qwen, DeepSeek series models are well tested. We will support more quantization algorithms and models in the future.
`vLLM Ascend` supports multiple quantization methods. This guide provides instructions for using different quantization tools and running quantized models on vLLM Ascend.

## Install ModelSlim
> **Note**
>
> You can choose to convert the model yourself or use the quantized model we uploaded.
> See <https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8>.
> Before you quantize a model, ensure that the RAM size is enough.

To quantize a model, you should install [ModelSlim](https://gitcode.com/Ascend/msit/tree/master) which is the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.
## Quantization Tools

Install ModelSlim:
vLLM Ascend supports models quantized by two main tools: `ModelSlim` and `LLM-Compressor`.

### 1. ModelSlim (Recommended)

[ModelSlim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/README.md) is an Ascend-friendly compression tool focused on acceleration, using compression techniques, and built for Ascend hardware. It includes a series of inference optimization technologies such as quantization and compression, aiming to accelerate large language dense models, MoE models, multimodal understanding models, multimodal generation models, etc.

#### Installation

To use ModelSlim for model quantization, install it from its [Git repository](https://gitcode.com/Ascend/msit):

```bash
# The branch(br_release_MindStudio_8.1.RC2_TR5_20260624) has been verified
git clone -b br_release_MindStudio_8.1.RC2_TR5_20260624 https://gitcode.com/Ascend/msit/tree/master
# Install br_release_MindStudio_8.3.0_20261231 version
git clone https://gitcode.com/Ascend/msit.git -b br_release_MindStudio_8.3.0_20261231

cd msit/msmodelslim

bash install.sh
pip install accelerate
```

## Quantize model
#### Model Quantization

:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded.
See https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8.
This conversion process requires a larger CPU memory; ensure that the RAM size is greater than 2 TB.
:::
The following example shows how to generate W8A8 quantized weights for the [Qwen3-MoE model](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/Qwen3-MOE/README.md).

### Adapts and changes
1. Ascend does not support the `flash_attn` library. To run the model, you need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and comment out certain parts of the code in `modeling_deepseek.py` located in the weights folder.
2. The current version of transformers does not support loading weights in FP8 quantization format. You need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and delete the quantization-related fields from `config.json` in the weights folder.

### Generate the W8A8 weights
**Quantization Script:**

```bash
cd example/DeepSeek
cd example/Qwen3-MOE

# Support multi-card quantization
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:False
export MODEL_PATH="/root/.cache/Kimi-K2-Instruct"
export SAVE_PATH="/root/.cache/Kimi-K2-Instruct-W8A8"

python3 quant_deepseek_w8a8.py --model_path $MODEL_PATH --save_path $SAVE_PATH --batch_size 4
# Set model and save paths
export MODEL_PATH="/path/to/your/model"
export SAVE_PATH="/path/to/your/quantized_model"

# Run quantization script
python3 quant_qwen_moe_w8a8.py --model_path $MODEL_PATH \
        --save_path $SAVE_PATH \
        --anti_dataset ../common/qwen3-moe_anti_prompt_50.json \
        --calib_dataset ../common/qwen3-moe_calib_prompt_50.json \
        --trust_remote_code True
```

Here is the full converted model files except safetensors:
After quantization completes, the output directory will contain the quantized model files.

For more examples, refer to the [official examples](https://gitcode.com/Ascend/msit/tree/master/msmodelslim/example).

### 2. LLM-Compressor

[LLM-Compressor](https://github.com/vllm-project/llm-compressor) is a unified compressed model library for faster vLLM inference.

#### Installation

```bash
.
|-- config.json
|-- configuration.json
|-- configuration_deepseek.py
|-- generation_config.json
|-- modeling_deepseek.py
|-- quant_model_description.json
|-- quant_model_weight_w8a8_dynamic.safetensors.index.json
|-- tiktoken.model
|-- tokenization_kimi.py
`-- tokenizer_config.json
pip install llmcompressor
```

## Run the model
#### Model Quantization

Now, you can run the quantized model with vLLM Ascend. Examples for online and offline inference are provided as follows:
`LLM-Compressor` provides various quantization scheme examples. To generate W8A8 dynamic quantized weights:

### Offline inference
```bash
# Navigate to LLM-Compressor examples directory
cd examples/quantization/llm-compressor

# Run quantization script
python3 w8a8_int8_dynamic.py
```

For more content, refer to the [official examples](https://github.com/vllm-project/llm-compressor/tree/main/examples).

Currently supported quantization types by LLM-Compressor: `W8A8` and `W8A8_DYNAMIC`.

## Running Quantized Models

Once you have a quantized model generated by **ModelSlim**, you can run it with vLLM Ascend by specifying the `--quantization ascend` parameter to enable the quantization feature; models quantized by **LLM-Compressor** do not need this parameter.

### Offline Inference

```python
import torch

@@ -76,12 +101,20 @@ prompts = [
    "Hello, my name is",
    "The future of AI is",
]
# Set sampling parameters
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="{quantized_model_save_path}",
          max_model_len=2048,
llm = LLM(model="/path/to/your/quantized_model",
          max_model_len=4096,
          trust_remote_code=True,
          # Enable quantization by specifying `quantization="ascend"`
          # Set appropriate TP and DP values
          tensor_parallel_size=2,
          data_parallel_size=1,
          # Set an unused port
          port=8000,
          # Set serving model name
          served_model_name="quantized_model",
          # Specify `quantization="ascend"` to enable quantization for models quantized by ModelSlim
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)

@@ -91,16 +124,25 @@ for output in outputs:
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

### Online inference
### Online Inference

Enable quantization by specifying `--quantization ascend`; for more details, see the [DeepSeek-V3-W8A8 Tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html).
```bash
# Corresponding to offline inference
python -m vllm.entrypoints.api_server \
    --model /path/to/your/quantized_model \
    --max-model-len 4096 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --data-parallel-size 1 \
    --served-model-name quantized_model \
    --trust-remote-code \
    --quantization ascend
```

## FAQs
The above commands are for reference only. For more details, consult the [official guide](../../tutorials/index.md).

### 1. How to solve the KeyError "xxx.layers.0.self_attn.q_proj.weight"?
## References

First, make sure you specify `ascend` as the quantization method. Second, check whether your model was converted with the `br_release_MindStudio_8.1.RC2_TR5_20260624` ModelSlim version. Finally, if it still does not work, submit an issue; some new models may still need to be adapted.

### 2. How to solve the error "Could not locate the configuration_deepseek.py"?

Please convert DeepSeek series models using the `br_release_MindStudio_8.1.RC2_TR5_20260624` ModelSlim version, where the missing configuration_deepseek.py error has been fixed.
- [ModelSlim Documentation](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/README.md)
- [LLM-Compressor GitHub](https://github.com/vllm-project/llm-compressor)
- [vLLM Quantization Guide](https://docs.vllm.ai/en/latest/quantization/)