[Doc] Modify the quantization user guide and add a quantization adaptation developer guide (#5554)

### What this PR does / why we need it?
This PR makes the following modifications:
1. Delete `user_guide/feature_guide/quantization-llm-compressor.md` and merge its content into `user_guide/feature_guide/quantization.md`.
2. Update the content of `user_guide/feature_guide/quantization.md`.
3. Add a guide, `developer_guide/feature_guide/quantization.md`, on adapting quantization algorithms and quantized models.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
7157596103

---------

Signed-off-by: IncSec <1790766300@qq.com>
Signed-off-by: InSec <1790766300@qq.com>
This commit is contained in:
InSec
2026-01-05 09:12:11 +08:00
committed by GitHub
parent 96775a27a8
commit 7cf65d0581
10 changed files with 204 additions and 116 deletions

5 binary image files added (previews not shown).

View File

@@ -14,4 +14,5 @@ ACL_Graph
KV_Cache_Pool_Guide
add_custom_aclnn_op
context_parallel
quantization
:::

View File

@@ -0,0 +1,111 @@
# Quantization Adaptation Guide
This document provides guidance for adapting quantization algorithms and models related to **ModelSlim**.
## Quantization Feature Introduction
### Quantization Inference Process
The current process for registering and obtaining quantization methods in vLLM Ascend is as follows:
![get_quant_method](../../assets/quantization/get_quant_method.png)
vLLM Ascend registers a custom `ascend` quantization method. Specifying the `--quantization ascend` parameter (or `quantization="ascend"` for offline inference) enables the quantization feature. When `quant_config` is constructed, the registered `AscendQuantConfig` is initialized and its `get_quant_method` is called to obtain the quantization method corresponding to each weight part, which is stored in the `quant_method` attribute.
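The dispatch can be pictured as a lookup from the per-weight quantization type to a method object. The sketch below is a simplified, self-contained stand-in for that flow, not the actual `AscendQuantConfig` code; the `Sketch*` class names are placeholders.
```python
# Simplified sketch of the get_quant_method dispatch. The real implementation lives in
# vllm_ascend/quantization/quant_config.py and dispatches on vLLM layer classes;
# the placeholder classes below are illustrative only.
from typing import Optional


class SketchLinearMethod:          # stands in for e.g. AscendW8A8DynamicLinearMethod
    pass


class SketchUnquantizedMethod:     # stands in for the non-quantized fallback
    pass


def get_quant_method(quant_description: dict, prefix: str) -> Optional[object]:
    """Pick a per-layer method from the weight description carried by quant_config."""
    quant_type = quant_description.get(f"{prefix}.weight", "FLOAT")
    if quant_type == "FLOAT":
        return SketchUnquantizedMethod()
    return SketchLinearMethod()


# The chosen object is what ends up stored in the layer's `quant_method` attribute.
method = get_quant_method(
    {"model.layers.0.self_attn.qkv_proj.weight": "W8A8_DYNAMIC"},
    "model.layers.0.self_attn.qkv_proj")
print(type(method).__name__)  # SketchLinearMethod
```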
Currently supported quantization methods include `AscendLinearMethod`, `AscendFusedMoEMethod`, `AscendEmbeddingMethod`, and their corresponding non-quantized methods:
![quant_methods_overview](../../assets/quantization/quant_methods_overview.png)
The quantization method base class defined by vLLM and the overall call flow of quantization methods are as follows:
![quant_method_call_flow](../../assets/quantization/quant_method_call_flow.png)
Quantization is generally not implemented for the `embedding` method, so the focus is on the other three methods.
The `create_weights` method is used for weight initialization; the `process_weights_after_loading` method is used for weight post-processing, such as transposition, format conversion, data type conversion, etc.; the `apply` method is used to perform activation quantization and quantized matrix multiplication calculations during the forward process.
We need to implement the `create_weights`, `process_weights_after_loading`, and `apply` methods for different **layers** (**attention**, **mlp**, **moe**).
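As a rough illustration of these three hooks, here is a minimal, self-contained sketch of a per-channel INT8 weight, per-token dynamic activation linear method. It uses plain `torch` instead of the vLLM Ascend base classes and NPU kernels, so treat it as a shape-and-flow reference rather than the real implementation.
```python
import torch


class SketchW8A8DynamicLinearMethod:
    """Illustrative only: shapes and call order of the three hooks for a linear layer."""

    def create_weights(self, layer: torch.nn.Module, input_size: int,
                       output_size: int, params_dtype: torch.dtype) -> None:
        # Register the quantized weight and its per-channel scale so the loader
        # can fill them from the quantized checkpoint.
        layer.register_parameter(
            "weight",
            torch.nn.Parameter(torch.empty(output_size, input_size, dtype=torch.int8),
                               requires_grad=False))
        layer.register_parameter(
            "weight_scale",
            torch.nn.Parameter(torch.empty(output_size, 1, dtype=params_dtype),
                               requires_grad=False))

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        # Post-processing hook: e.g. convert to the dtype/layout the matmul kernel expects.
        layer.weight_scale.data = layer.weight_scale.data.to(torch.float32)

    def apply(self, layer: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
        # Per-token dynamic activation quantization, then a (dequantized) matmul.
        x_scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
        x_q = torch.clamp((x / x_scale).round(), -128, 127)
        w = layer.weight.to(torch.float32) * layer.weight_scale  # dequantize weight
        y = (x_q * x_scale).to(torch.float32) @ w.t()
        return y.to(x.dtype)
```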
**Supplement**: When loading the model, the quantized model's description file **quant_model_description.json** needs to be read. This file describes the quantization configuration and parameters for each part of the model weights, for example:
```json
{
"model.layers.0.linear_attn.dt_bias": "FLOAT",
"model.layers.0.linear_attn.A_log": "FLOAT",
"model.layers.0.linear_attn.conv1d.weight": "FLOAT",
"model.layers.0.linear_attn.in_proj_qkvz.weight": "W8A8_DYNAMIC",
"model.layers.0.linear_attn.in_proj_qkvz.weight_scale": "W8A8_DYNAMIC",
"model.layers.0.linear_attn.in_proj_qkvz.weight_offset": "W8A8_DYNAMIC",
"model.layers.0.linear_attn.in_proj_ba.weight": "FLOAT",
"model.layers.0.linear_attn.norm.weight": "FLOAT",
"model.layers.0.linear_attn.out_proj.weight": "FLOAT",
"model.layers.0.mlp.gate.weight": "FLOAT",
"model.layers.0.mlp.experts.0.gate_proj.weight": "W8A8_DYNAMIC",
"model.layers.0.mlp.experts.0.gate_proj.weight_scale": "W8A8_DYNAMIC",
"model.layers.0.mlp.experts.0.gate_proj.weight_offset": "W8A8_DYNAMIC",
}
```
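For example, a quick way to inspect which quantization types a checkpoint uses (the path below is a placeholder):
```python
import json
from collections import Counter
from pathlib import Path

# Placeholder path to a ModelSlim-quantized checkpoint directory.
desc = json.loads(Path("/path/to/quantized_model/quant_model_description.json").read_text())

# How many weights use each quantization type, e.g. FLOAT vs. W8A8_DYNAMIC.
print(Counter(desc.values()))

# Weights kept in floating point (typically norms, biases, gating weights, etc.).
print([name for name, qtype in desc.items() if qtype == "FLOAT"][:5])
```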
Based on the above content, we present a brief description of the adaptation process for quantization algorithms and quantized models.
### Quantization Algorithm Adaptation
- **Step 1: Algorithm Design**. Define the algorithm ID (e.g., `W4A8_DYNAMIC`), determine the supported layers (linear, moe, attention), and design the quantization scheme (static/dynamic, per-tensor/per-channel/per-group).
- **Step 2: Registration**. Add the algorithm ID to `ASCEND_QUANTIZATION_METHOD_MAP` in `vllm_ascend/quantization/utils.py` and associate it with the corresponding method class.
```python
ASCEND_QUANTIZATION_METHOD_MAP: Dict[str, Dict[str, Type[Any]]] = {
"W4A8_DYNAMIC": {
"linear": AscendW4A8DynamicLinearMethod,
"moe": AscendW4A8DynamicFusedMoEMethod,
},
}
```
- **Step 3: Implementation**. Create an algorithm implementation file, such as `vllm_ascend/quantization/w4a8_dynamic.py`, and implement the method class and logic.
- **Step 4: Testing**. Use your algorithm to generate quantization configurations and verify correctness and performance on target models and hardware (a minimal smoke test is sketched below).
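A minimal smoke test for Step 4 might look like the following; the model path, parallel sizes, and expected completion are placeholders.
```python
from vllm import LLM, SamplingParams

# Load the quantized checkpoint with the Ascend quantization methods registered above.
llm = LLM(model="/path/to/your/w4a8_dynamic_model",
          quantization="ascend",
          tensor_parallel_size=2,
          max_model_len=2048,
          trust_remote_code=True)

out = llm.generate(["The capital of France is"],
                   SamplingParams(max_tokens=8, temperature=0.0))
print(out[0].outputs[0].text)  # expect a sensible completion such as " Paris"
```
A full adaptation would follow this with accuracy benchmarks and performance profiling on the target hardware.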
### Quantized Model Adaptation
Adapting a new quantized model requires ensuring the following three points:
- The original model has been successfully adapted in `vLLM Ascend`.
- **Fused Module Mapping**: Add the model's `model_type` to `packed_modules_model_mapping` in `vllm_ascend/quantization/quant_config.py`, listing its fused modules (e.g., `qkv_proj`, `gate_up_proj`, `experts`), to ensure sharding consistency and correct loading (see the name-expansion sketch after this list).
```python
packed_modules_model_mapping = {
"qwen3_moe": {
"qkv_proj": [
"q_proj",
"k_proj",
"v_proj",
],
"gate_up_proj": [
"gate_proj",
"up_proj",
],
"experts":
["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
},
}
```
- All quantization algorithms used by the quantized model have been integrated into the `quantization` module.
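The name-expansion sketch below (an illustrative helper, not the actual loader code) shows what the mapping expresses: each fused parameter name expands into the per-shard weight names that appear in the quantized checkpoint and its description file.
```python
# Illustrative only: how a fused-module mapping expands one fused parameter
# into the per-shard weight names found in the checkpoint.
packed_modules_model_mapping = {
    "qwen3_moe": {
        "qkv_proj": ["q_proj", "k_proj", "v_proj"],
        "gate_up_proj": ["gate_proj", "up_proj"],
    },
}


def shard_names(model_type: str, fused_param: str, prefix: str) -> list:
    shards = packed_modules_model_mapping[model_type].get(fused_param, [fused_param])
    return [f"{prefix}.{s}" for s in shards]


print(shard_names("qwen3_moe", "qkv_proj", "model.layers.0.self_attn"))
# ['model.layers.0.self_attn.q_proj', 'model.layers.0.self_attn.k_proj',
#  'model.layers.0.self_attn.v_proj']
```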
## Currently Supported Quantization Algorithms
vLLM Ascend supports multiple quantization algorithms. The following table provides an overview of each quantization algorithm based on the implementation in the `vllm_ascend.quantization` module:
| Algorithm | Weight | Activation | Weight Granularity | Activation Granularity | Type | Description |
| ------------------------ | ------ | ---------- | ------------------ | ---------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `W4A16` | INT4 | FP16/BF16 | Per-Group | Per-Tensor | Static | 4-bit weight quantization with 16-bit activation precision, specifically designed for MoE model expert layers, supporting int32 format weight packing |
| `W8A16` | INT8 | FP16/BF16 | Per-Channel | Per-Tensor | Static | 8-bit weight quantization with 16-bit activation precision, balancing accuracy and performance, suitable for linear layers |
| `W8A8` | INT8 | INT8 | Per-Channel | Per-Tensor | Static | Static activation quantization, suitable for scenarios requiring high precision |
| `W8A8_DYNAMIC` | INT8 | INT8 | Per-Channel | Per-Token | Dynamic | Dynamic activation quantization with per-token scaling factor calculation |
| `W4A8_DYNAMIC` | INT4 | INT8 | Per-Group | Per-Token | Dynamic | Supports both direct per-channel quantization to 4-bit and two-step quantization (per-channel to 8-bit then per-group to 4-bit) |
| `W4A4_FLATQUANT_DYNAMIC` | INT4 | INT4 | Per-Channel | Per-Token | Dynamic | Uses FlatQuant for activation distribution smoothing before 4-bit dynamic quantization, with additional matrix multiplications for precision preservation |
| `W8A8_MIX` | INT8 | INT8 | Per-Channel | Per-Tensor/Token | Mixed | In the PD colocation scenario, both the P and D nodes use dynamic quantization; in the PD disaggregation scenario, the P node uses dynamic quantization and the D node uses static quantization |
**Static vs Dynamic:** Static quantization uses pre-computed scaling factors and offers better performance, while dynamic quantization computes scaling factors on the fly for each token or activation tensor and offers higher precision.
**Granularity:** Refers to the scope of scaling factor computation (e.g., per-tensor, per-channel, per-group).
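As a small, toy illustration of these granularity terms in plain `torch`:
```python
import torch

w = torch.randn(4, 8)   # weight: [out_features, in_features]
x = torch.randn(3, 8)   # activations: [tokens, in_features]

# Per-channel (static) weight scales: one scale per output channel, computed offline.
w_scale = w.abs().amax(dim=1, keepdim=True) / 127.0
w_q = torch.clamp((w / w_scale).round(), -128, 127).to(torch.int8)

# Per-token (dynamic) activation scales: one scale per token, computed at runtime.
x_scale = x.abs().amax(dim=1, keepdim=True) / 127.0
x_q = torch.clamp((x / x_scale).round(), -128, 127).to(torch.int8)

print(w_scale.shape, x_scale.shape)  # torch.Size([4, 1]) torch.Size([3, 1])
```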

View File

@@ -7,7 +7,6 @@ This section provides a detailed usage guide of vLLM Ascend features.
:maxdepth: 1
graph_mode
quantization
sleep_mode
structured_output
lora

View File

@@ -1,65 +0,0 @@
# llm-compressor Quantization Guide
Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of the weights and activation values in the model, thereby saving the memory and improving the inference speed.
## Supported llm-compressor Quantization Types
Supports CompressedTensorsW8A8 static quantization (weight: per-channel, int8, symmetric; activation: per-tensor, int8, symmetric).
Supports CompressedTensorsW8A8Dynamic quantization (weight: per-channel, int8, symmetric; activation: per-token, int8, symmetric, dynamic).
## Install llm-compressor
To quantize a model, you should install [llm-compressor](https://github.com/vllm-project/llm-compressor/blob/main/README.md). It is a unified library for creating compressed models for faster inference with vLLM.
Install llm-compressor
```bash
pip install llmcompressor
```
### Generate the W8A8 weights
```bash
cd examples/quantization/llm-compressor
python3 w8a8_int8_dynamic.py
```
For more details, see the [Official Sample](https://github.com/vllm-project/llm-compressor/tree/main/examples).
## Run the model
Now, you can run the quantized model with vLLM Ascend. Examples for online and offline inference are provided as follows:
### Offline inference
```python
import torch
from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="{quantized_model_save_path}",
          max_model_len=2048,
          trust_remote_code=True)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
### Online inference
Start the quantized model using vLLM Ascend; no modifications to the startup command are required.

View File

@@ -1,71 +1,96 @@
# Quantization Guide
Model quantization is a technique that reduces model size and computational overhead by lowering the numerical precision of weights and activations, thereby saving memory and improving inference speed.
Since version 0.9.0rc2, the quantization feature has been experimentally supported by vLLM Ascend. Users can enable it by specifying `--quantization ascend`. Currently, only the Qwen and DeepSeek series models are well tested; more quantization algorithms and models will be supported in the future.
`vLLM Ascend` supports multiple quantization methods. This guide provides instructions for using different quantization tools and running quantized models on vLLM Ascend.
## Quantization Tools
vLLM Ascend supports models quantized by two main tools: `ModelSlim` and `LLM-Compressor`.
### 1. ModelSlim (Recommended)
[ModelSlim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/README.md) is Ascend's compression and acceleration tool: it is built on the Ascend platform with compression as its core technology, and it bundles inference optimization techniques such as quantization and compression to accelerate large dense language models, MoE models, multimodal understanding models, multimodal generation models, and more.
#### Installation
To use ModelSlim for model quantization, install it from its [Git repository](https://gitcode.com/Ascend/msit):
```bash
# Install the br_release_MindStudio_8.3.0_20261231 branch
git clone https://gitcode.com/Ascend/msit.git -b br_release_MindStudio_8.3.0_20261231
cd msit/msmodelslim
bash install.sh
pip install accelerate
```
#### Model Quantization
:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded.
See https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8.
This conversion process requires a large amount of CPU memory; ensure that the available RAM is greater than 2 TB.
:::
The following example shows how to generate W8A8 quantized weights for the [Qwen3-MoE model](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/Qwen3-MOE/README.md).
**Quantization Script:**
```bash
cd example/Qwen3-MOE
# Support multi-card quantization
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:False
export MODEL_PATH="/root/.cache/Kimi-K2-Instruct"
export SAVE_PATH="/root/.cache/Kimi-K2-Instruct-W8A8"
python3 quant_deepseek_w8a8.py --model_path $MODEL_PATH --save_path $SAVE_PATH --batch_size 4
# Set model and save paths
export MODEL_PATH="/path/to/your/model"
export SAVE_PATH="/path/to/your/quantized_model"
# Run quantization script
python3 quant_qwen_moe_w8a8.py --model_path $MODEL_PATH \
--save_path $SAVE_PATH \
--anti_dataset ../common/qwen3-moe_anti_prompt_50.json \
--calib_dataset ../common/qwen3-moe_calib_prompt_50.json \
--trust_remote_code True
```
After quantization completes, the output directory will contain the quantized model files.
For more examples, refer to the [official examples](https://gitcode.com/Ascend/msit/tree/master/msmodelslim/example).
### 2. LLM-Compressor
[LLM-Compressor](https://github.com/vllm-project/llm-compressor) is a unified library for creating compressed models for faster inference with vLLM.
#### Installation
```bash
pip install llmcompressor
```
#### Model Quantization
`LLM-Compressor` provides various quantization scheme examples. To generate W8A8 dynamic quantized weights:
```bash
# Navigate to LLM-Compressor examples directory
cd examples/quantization/llm-compressor
# Run quantization script
python3 w8a8_int8_dynamic.py
```
For more examples, refer to the [official examples](https://github.com/vllm-project/llm-compressor/tree/main/examples).
Currently supported quantization types by LLM-Compressor: `W8A8` and `W8A8_DYNAMIC`.
## Running Quantized Models
Once you have a quantized model generated by **ModelSlim**, run it with vLLM Ascend by specifying the `--quantization ascend` parameter to enable the quantization feature; models quantized by **LLM-Compressor** do not need this parameter.
### Offline Inference
```python
import torch
@@ -76,12 +101,20 @@ prompts = [
"Hello, my name is",
"The future of AI is",
]
# Set sampling parameters
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="{quantized_model_save_path}",
max_model_len=2048,
llm = LLM(model="/path/to/your/quantized_model",
max_model_len=4096,
trust_remote_code=True,
# Enable quantization by specifying `quantization="ascend"`
# Set appropriate TP and DP values
tensor_parallel_size=2,
data_parallel_size=1,
# Set an unused port
port=8000,
# Set serving model name
served_model_name="quantized_model",
# Specify `quantization="ascend"` to enable quantization for models quantized by ModelSlim
quantization="ascend")
outputs = llm.generate(prompts, sampling_params)
@@ -91,16 +124,25 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
### Online Inference
Enable quantization by specifying `--quantization ascend`. For more details, see the [DeepSeek-V3-W8A8 Tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html).
```bash
# Corresponding to offline inference
python -m vllm.entrypoints.api_server \
--model /path/to/your/quantized_model \
--max-model-len 4096 \
--port 8000 \
--tensor-parallel-size 2 \
--data-parallel-size 1 \
--served-model-name quantized_model \
--trust-remote-code \
--quantization ascend
```
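Once the server is up, a request can be sent, for example as below. This assumes the demo `api_server` started above; the OpenAI-compatible server started with `vllm serve` exposes different endpoints.
```bash
curl http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Hello, my name is", "max_tokens": 32, "temperature": 0.6}'
```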
The above commands are for reference only. For more details, consult the [official guide](../../tutorials/index.md).
## References
- [ModelSlim Documentation](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/README.md)
- [LLM-Compressor GitHub](https://github.com/vllm-project/llm-compressor)
- [vLLM Quantization Guide](https://docs.vllm.ai/en/latest/quantization/)