[Doc] modify the quantization user guide and add a quantization adaptation developer guide (#5554)
### What this PR does / why we need it?
This PR makes the following modifications:
1. Delete `user_guide/feature_guide/quantization-llm-compressor.md` and merge it into `user_guide/feature_guide/quantization.md`.
2. Update the content of `user_guide/feature_guide/quantization.md`.
3. Add a guide, `developer_guide/feature_guide/quantization.md`, on adapting quantization algorithms and quantized models.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main: 7157596103
---------
Signed-off-by: IncSec <1790766300@qq.com>
Signed-off-by: InSec <1790766300@qq.com>
docs/source/assets/quantization/get_quant_method.png (new binary file, 11 KiB)
docs/source/assets/quantization/quant_algorithm_overview.png (new binary file, 28 KiB)
docs/source/assets/quantization/quant_method_base_class.png (new binary file, 3.5 KiB)
docs/source/assets/quantization/quant_method_call_flow.png (new binary file, 15 KiB)
docs/source/assets/quantization/quant_methods_overview.png (new binary file, 18 KiB)
docs/source/developer_guide/feature_guide/index.md
@@ -14,4 +14,5 @@ ACL_Graph
KV_Cache_Pool_Guide
add_custom_aclnn_op
context_parallel
quantization
:::

docs/source/developer_guide/feature_guide/quantization.md (new file, 111 lines)
@@ -0,0 +1,111 @@

# Quantization Adaptation Guide

This document provides guidance for adapting quantization algorithms and models related to **ModelSlim**.

## Quantization Feature Introduction

### Quantization Inference Process

The current process for registering and obtaining quantization methods in vLLM Ascend is as follows:



vLLM Ascend registers a custom `ascend` quantization method. Passing the `--quantization ascend` parameter (or `quantization="ascend"` for offline inference) enables the quantization feature. When the `quant_config` is constructed, the registered `AscendQuantConfig` is initialized and its `get_quant_method` is called to obtain the quantization method for each weight part, which is stored in the `quant_method` attribute.
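
Conceptually, `get_quant_method` is a per-layer dispatch step. The sketch below only illustrates that dispatch, assuming a simple registry keyed by quantization type and layer kind; apart from `get_quant_method` itself, the class and variable names are illustrative and not the actual vLLM Ascend implementation:

```python
from typing import Any, Dict, Optional

# Illustrative registry; vLLM Ascend keeps the real mapping in
# vllm_ascend/quantization/utils.py as ASCEND_QUANTIZATION_METHOD_MAP.
METHOD_REGISTRY: Dict[str, Dict[str, Any]] = {}


class QuantConfigSketch:
    """Resolves one quantization method per weight part."""

    def __init__(self, quant_description: Dict[str, str]):
        # e.g. {"model.layers.0.mlp.experts.0.gate_proj.weight": "W8A8_DYNAMIC"}
        self.quant_description = quant_description

    def get_quant_method(self, layer_kind: str, prefix: str) -> Optional[Any]:
        quant_type = self.quant_description.get(f"{prefix}.weight", "FLOAT")
        if quant_type == "FLOAT":
            return None  # fall back to the unquantized method
        method_cls = METHOD_REGISTRY[quant_type][layer_kind]
        return method_cls()
```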

Currently supported quantization methods include `AscendLinearMethod`, `AscendFusedMoEMethod`, `AscendEmbeddingMethod`, and their corresponding non-quantized methods:



The quantization method base class defined by vLLM and the overall call flow of the quantization methods are as follows:



The `embedding` method is generally not implemented for quantization, so the focus is on the other three methods.

The `create_weights` method is used for weight initialization; the `process_weights_after_loading` method is used for weight post-processing, such as transposition, format conversion, and data type conversion; the `apply` method performs activation quantization and the quantized matrix multiplication during the forward pass.

We need to implement the `create_weights`, `process_weights_after_loading`, and `apply` methods for the different **layers** (**attention**, **mlp**, **moe**), as sketched below.
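
The following skeleton shows the shape such an implementation usually takes for a linear layer. It is a minimal sketch, not the code of any existing Ascend method: the class name, parameter shapes, and the dequantize-then-matmul fallback in `apply` are assumptions for illustration only.

```python
from typing import Optional

import torch


class MyQuantLinearMethodSketch:
    """Skeleton of the three hooks a quantization method implements."""

    def create_weights(self, layer: torch.nn.Module, input_size: int,
                       output_size: int, params_dtype: torch.dtype) -> None:
        # Register the quantized weight and its per-channel scale.
        layer.weight = torch.nn.Parameter(
            torch.empty(output_size, input_size, dtype=torch.int8),
            requires_grad=False)
        layer.weight_scale = torch.nn.Parameter(
            torch.empty(output_size, 1, dtype=params_dtype),
            requires_grad=False)

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        # Post-process loaded weights, e.g. transpose or convert them to the
        # layout/format expected by the NPU kernels.
        layer.weight.data = layer.weight.data.transpose(0, 1).contiguous()

    def apply(self, layer: torch.nn.Module, x: torch.Tensor,
              bias: Optional[torch.Tensor] = None) -> torch.Tensor:
        # Quantize the activation and run the quantized matmul. A plain
        # dequantize-then-matmul is shown here purely for illustration.
        w = layer.weight.t().to(x.dtype) * layer.weight_scale.to(x.dtype)
        out = torch.nn.functional.linear(x, w)
        return out + bias if bias is not None else out
```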

**Supplement**: When loading the model, the quantized model's description file **quant_model_description.json** needs to be read. This file describes the quantization configuration and parameters for each part of the model weights, for example:

```json
{
    "model.layers.0.linear_attn.dt_bias": "FLOAT",
    "model.layers.0.linear_attn.A_log": "FLOAT",
    "model.layers.0.linear_attn.conv1d.weight": "FLOAT",
    "model.layers.0.linear_attn.in_proj_qkvz.weight": "W8A8_DYNAMIC",
    "model.layers.0.linear_attn.in_proj_qkvz.weight_scale": "W8A8_DYNAMIC",
    "model.layers.0.linear_attn.in_proj_qkvz.weight_offset": "W8A8_DYNAMIC",
    "model.layers.0.linear_attn.in_proj_ba.weight": "FLOAT",
    "model.layers.0.linear_attn.norm.weight": "FLOAT",
    "model.layers.0.linear_attn.out_proj.weight": "FLOAT",
    "model.layers.0.mlp.gate.weight": "FLOAT",
    "model.layers.0.mlp.experts.0.gate_proj.weight": "W8A8_DYNAMIC",
    "model.layers.0.mlp.experts.0.gate_proj.weight_scale": "W8A8_DYNAMIC",
    "model.layers.0.mlp.experts.0.gate_proj.weight_offset": "W8A8_DYNAMIC"
}
```

Based on the above content, we present a brief description of the adaptation process for quantization algorithms and quantized models.

### Quantization Algorithm Adaptation

- **Step 1: Algorithm Design**. Define the algorithm ID (e.g., `W4A8_DYNAMIC`), determine the supported layers (linear, moe, attention), and design the quantization scheme (static/dynamic; per-tensor/per-channel/per-group).
- **Step 2: Registration**. Add the algorithm ID to `ASCEND_QUANTIZATION_METHOD_MAP` in `vllm_ascend/quantization/utils.py` and associate it with the corresponding method class.

```python
ASCEND_QUANTIZATION_METHOD_MAP: Dict[str, Dict[str, Type[Any]]] = {
    "W4A8_DYNAMIC": {
        "linear": AscendW4A8DynamicLinearMethod,
        "moe": AscendW4A8DynamicFusedMoEMethod,
    },
}
```

- **Step 3: Implementation**. Create an algorithm implementation file, such as `vllm_ascend/quantization/w4a8_dynamic.py`, and implement the method class and its logic.
- **Step 4: Testing**. Use your algorithm to generate quantization configurations and verify correctness and performance on the target models and hardware (a minimal smoke test is sketched after this list).
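
For the testing step, a quick sanity check is to load weights quantized with the new algorithm and generate a few tokens offline. The model path below is a placeholder:

```python
from vllm import LLM, SamplingParams

# Placeholder path to weights produced with the newly adapted algorithm.
llm = LLM(model="/path/to/model-quantized-with-new-algorithm",
          quantization="ascend",
          max_model_len=2048,
          trust_remote_code=True)

outputs = llm.generate(["The capital of France is"],
                       SamplingParams(temperature=0.0, max_tokens=16))
print(outputs[0].outputs[0].text)
```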

### Quantized Model Adaptation

Adapting a new quantized model requires ensuring the following three points:

- The original model has been successfully adapted in `vLLM Ascend`.
- **Fused Module Mapping**: Add the model's `model_type` to `packed_modules_model_mapping` in `vllm_ascend/quantization/quant_config.py` (e.g., `qkv_proj`, `gate_up_proj`, `experts`) to ensure sharding consistency and correct loading.

```python
packed_modules_model_mapping = {
    "qwen3_moe": {
        "qkv_proj": [
            "q_proj",
            "k_proj",
            "v_proj",
        ],
        "gate_up_proj": [
            "gate_proj",
            "up_proj",
        ],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
    },
}
```

- All quantization algorithms used by the quantized model have been integrated into the `quantization` module (see the check sketched below).
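
For the last point, one way to check coverage is to compare the quantization types listed in the model's `quant_model_description.json` with the registered algorithm map (a small sketch; the model path is a placeholder):

```python
import json

from vllm_ascend.quantization.utils import ASCEND_QUANTIZATION_METHOD_MAP

with open("/path/to/quantized_model/quant_model_description.json") as f:
    description = json.load(f)

# "FLOAT" parts are served by the non-quantized methods.
used_types = {v for v in description.values() if v != "FLOAT"}
missing = used_types - set(ASCEND_QUANTIZATION_METHOD_MAP)
if missing:
    raise ValueError(f"Quantization types not yet integrated: {missing}")
print(f"All quantization types are registered: {sorted(used_types)}")
```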

## Currently Supported Quantization Algorithms

vLLM Ascend supports multiple quantization algorithms. The following table provides an overview of each quantization algorithm based on the implementation in the `vllm_ascend.quantization` module:

| Algorithm | Weight | Activation | Weight Granularity | Activation Granularity | Type | Description |
| --- | --- | --- | --- | --- | --- | --- |
| `W4A16` | INT4 | FP16/BF16 | Per-Group | Per-Tensor | Static | 4-bit weight quantization with 16-bit activation precision, specifically designed for MoE model expert layers, supporting int32 format weight packing |
| `W8A16` | INT8 | FP16/BF16 | Per-Channel | Per-Tensor | Static | 8-bit weight quantization with 16-bit activation precision, balancing accuracy and performance, suitable for linear layers |
| `W8A8` | INT8 | INT8 | Per-Channel | Per-Tensor | Static | Static activation quantization, suitable for scenarios requiring high precision |
| `W8A8_DYNAMIC` | INT8 | INT8 | Per-Channel | Per-Token | Dynamic | Dynamic activation quantization with per-token scaling factor calculation |
| `W4A8_DYNAMIC` | INT4 | INT8 | Per-Group | Per-Token | Dynamic | Supports both direct per-channel quantization to 4-bit and two-step quantization (per-channel to 8-bit, then per-group to 4-bit) |
| `W4A4_FLATQUANT_DYNAMIC` | INT4 | INT4 | Per-Channel | Per-Token | Dynamic | Uses FlatQuant to smooth the activation distribution before 4-bit dynamic quantization, with additional matrix multiplications for precision preservation |
| `W8A8_MIX` | INT8 | INT8 | Per-Channel | Per-Tensor/Token | Mixed | In the PD colocation scenario, both the P node and the D node use dynamic quantization; in the PD disaggregation scenario, the P node uses dynamic quantization and the D node uses static quantization |

**Static vs Dynamic:** Static quantization uses pre-computed scaling factors and offers better performance, while dynamic quantization computes scaling factors on the fly for each token/activation tensor, giving higher precision.

**Granularity:** Refers to the scope of scaling factor computation (e.g., per-tensor, per-channel, per-group).
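
As a small illustration of these two notions, the snippet below computes a symmetric int8 scale once per tensor (static-style, per-tensor) and once per token (dynamic-style, per-token) for the same activation; it is illustrative only and not how the NPU kernels are implemented:

```python
import torch

x = torch.randn(4, 8)  # [num_tokens, hidden_size] activation

# Per-tensor: one pre-computable scale for the whole tensor.
per_tensor_scale = x.abs().max() / 127.0

# Per-token: one scale per row, computed on the fly at runtime.
per_token_scale = x.abs().amax(dim=-1, keepdim=True) / 127.0

x_q_static = torch.clamp((x / per_tensor_scale).round(), -128, 127).to(torch.int8)
x_q_dynamic = torch.clamp((x / per_token_scale).round(), -128, 127).to(torch.int8)
print(per_tensor_scale.shape, per_token_scale.shape)  # scalar vs. [4, 1]
```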
docs/source/user_guide/feature_guide/index.md
@@ -7,7 +7,6 @@ This section provides a detailed usage guide of vLLM Ascend features.
:maxdepth: 1
graph_mode
quantization
quantization-llm-compressor
sleep_mode
structured_output
lora

docs/source/user_guide/feature_guide/quantization-llm-compressor.md (deleted file)
@@ -1,65 +0,0 @@

# llm-compressor Quantization Guide

Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of the weights and activation values in the model, thereby saving the memory and improving the inference speed.

## Supported llm-compressor Quantization Types

Support CompressedTensorsW8A8 static weight

weight: per-channel, int8, symmetric; activation: per-tensor, int8, symmetric.

Support CompressedTensorsW8A8Dynamic weight

weight: per-channel, int8, symmetric; activation: per-token, int8, symmetric, dynamic.

## Install llm-compressor

To quantize a model, you should install [llm-compressor](https://github.com/vllm-project/llm-compressor/blob/main/README.md). It is a unified library for creating compressed models for faster inference with vLLM.

Install llm-compressor

```bash
pip install llmcompressor
```

### Generate the W8A8 weights

```bash
cd examples/quantization/llm-compressor

python3 w8a8_int8_dynamic.py
```

for more details, see the [Official Sample](https://github.com/vllm-project/llm-compressor/tree/main/examples).

## Run the model

Now, you can run the quantized model with vLLM Ascend. Examples for online and offline inference are provided as follows:

### Offline inference

```python
import torch

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="{quantized_model_save_path}",
          max_model_len=2048,
          trust_remote_code=True)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

### Online inference

Start the quantized model using vLLM Ascend; no modifications to the startup command are required.

docs/source/user_guide/feature_guide/quantization.md
@@ -1,71 +1,96 @@

# Quantization Guide

Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of the weights and activation values in the model, thereby saving the memory and improving the inference speed.
Model quantization is a technique that reduces model size and computational overhead by lowering the numerical precision of weights and activations, thereby saving memory and improving inference speed.

Since version 0.9.0rc2, the quantization feature is experimentally supported by vLLM Ascend. Users can enable the quantization feature by specifying `--quantization ascend`. Currently, only Qwen, DeepSeek series models are well tested. We will support more quantization algorithms and models in the future.
`vLLM Ascend` supports multiple quantization methods. This guide provides instructions for using different quantization tools and running quantized models on vLLM Ascend.

## Install ModelSlim
> **Note**
>
> You can choose to convert the model yourself or use the quantized model we uploaded.
> See <https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8>.
> Before you quantize a model, ensure that the RAM size is enough.

To quantize a model, you should install [ModelSlim](https://gitcode.com/Ascend/msit/tree/master) which is the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.
## Quantization Tools

Install ModelSlim:
vLLM Ascend supports models quantized by two main tools: `ModelSlim` and `LLM-Compressor`.

### 1. ModelSlim (Recommended)

[ModelSlim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/README.md) is an Ascend-friendly compression tool focused on acceleration, using compression techniques, and built for Ascend hardware. It includes a series of inference optimization technologies such as quantization and compression, aiming to accelerate large language dense models, MoE models, multimodal understanding models, multimodal generation models, etc.

#### Installation

To use ModelSlim for model quantization, install it from its [Git repository](https://gitcode.com/Ascend/msit):

```bash
# The branch(br_release_MindStudio_8.1.RC2_TR5_20260624) has been verified
git clone -b br_release_MindStudio_8.1.RC2_TR5_20260624 https://gitcode.com/Ascend/msit/tree/master
# Install br_release_MindStudio_8.3.0_20261231 version
git clone https://gitcode.com/Ascend/msit.git -b br_release_MindStudio_8.3.0_20261231

cd msit/msmodelslim

bash install.sh
pip install accelerate
```

## Quantize model
#### Model Quantization

:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded.
See https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8.
This conversion process requires a larger CPU memory; ensure that the RAM size is greater than 2 TB.
:::
The following example shows how to generate W8A8 quantized weights for the [Qwen3-MoE model](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/Qwen3-MOE/README.md).

### Adapts and changes
1. Ascend does not support the `flash_attn` library. To run the model, you need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and comment out certain parts of the code in `modeling_deepseek.py` located in the weights folder.
2. The current version of transformers does not support loading weights in FP8 quantization format. You need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and delete the quantization-related fields from `config.json` in the weights folder.

### Generate the W8A8 weights
**Quantization Script:**

```bash
cd example/DeepSeek
cd example/Qwen3-MOE

# Support multi-card quantization
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:False
export MODEL_PATH="/root/.cache/Kimi-K2-Instruct"
export SAVE_PATH="/root/.cache/Kimi-K2-Instruct-W8A8"

python3 quant_deepseek_w8a8.py --model_path $MODEL_PATH --save_path $SAVE_PATH --batch_size 4
# Set model and save paths
export MODEL_PATH="/path/to/your/model"
export SAVE_PATH="/path/to/your/quantized_model"

# Run quantization script
python3 quant_qwen_moe_w8a8.py --model_path $MODEL_PATH \
        --save_path $SAVE_PATH \
        --anti_dataset ../common/qwen3-moe_anti_prompt_50.json \
        --calib_dataset ../common/qwen3-moe_calib_prompt_50.json \
        --trust_remote_code True
```

Here is the full converted model files except safetensors:
After quantization completes, the output directory will contain the quantized model files.

For more examples, refer to the [official examples](https://gitcode.com/Ascend/msit/tree/master/msmodelslim/example).

### 2. LLM-Compressor

[LLM-Compressor](https://github.com/vllm-project/llm-compressor) is a unified compressed model library for faster vLLM inference.

#### Installation

```bash
.
|-- config.json
|-- configuration.json
|-- configuration_deepseek.py
|-- generation_config.json
|-- modeling_deepseek.py
|-- quant_model_description.json
|-- quant_model_weight_w8a8_dynamic.safetensors.index.json
|-- tiktoken.model
|-- tokenization_kimi.py
`-- tokenizer_config.json
pip install llmcompressor
```

## Run the model
#### Model Quantization

Now, you can run the quantized model with vLLM Ascend. Examples for online and offline inference are provided as follows:
`LLM-Compressor` provides various quantization scheme examples. To generate W8A8 dynamic quantized weights:

### Offline inference
```bash
# Navigate to LLM-Compressor examples directory
cd examples/quantization/llm-compressor

# Run quantization script
python3 w8a8_int8_dynamic.py
```

For more content, refer to the [official examples](https://github.com/vllm-project/llm-compressor/tree/main/examples).

Currently supported quantization types by LLM-Compressor: `W8A8` and `W8A8_DYNAMIC`.

## Running Quantized Models

Once you have a quantized model generated by **ModelSlim**, you can run it with vLLM Ascend by specifying the `--quantization ascend` parameter to enable the quantization feature; models quantized by **LLM-Compressor** do not need this parameter.

### Offline Inference

```python
import torch

@@ -76,12 +101,20 @@ prompts = [
    "Hello, my name is",
    "The future of AI is",
]
# Set sampling parameters
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="{quantized_model_save_path}",
          max_model_len=2048,
llm = LLM(model="/path/to/your/quantized_model",
          max_model_len=4096,
          trust_remote_code=True,
          # Enable quantization by specifying `quantization="ascend"`
          # Set appropriate TP and DP values
          tensor_parallel_size=2,
          data_parallel_size=1,
          # Set an unused port
          port=8000,
          # Set serving model name
          served_model_name="quantized_model",
          # Specify `quantization="ascend"` to enable quantization for models quantized by ModelSlim
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)

@@ -91,16 +124,25 @@ for output in outputs:
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

### Online inference
### Online Inference

Enable quantization by specifying `--quantization ascend`; for more details, see the [DeepSeek-V3-W8A8 Tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html).
```bash
# Corresponding to offline inference
python -m vllm.entrypoints.api_server \
    --model /path/to/your/quantized_model \
    --max-model-len 4096 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --data-parallel-size 1 \
    --served-model-name quantized_model \
    --trust-remote-code \
    --quantization ascend
```

## FAQs
The above commands are for reference only. For more details, consult the [official guide](../../tutorials/index.md).

### 1. How to solve the KeyError "xxx.layers.0.self_attn.q_proj.weight"?
## References

First, make sure you specify `ascend` as the quantization method. Second, check whether your model was converted with the `br_release_MindStudio_8.1.RC2_TR5_20260624` ModelSlim version. Finally, if it still does not work, submit an issue; some new models may still need to be adapted.

### 2. How to solve the error "Could not locate the configuration_deepseek.py"?

Please convert DeepSeek series models using the `br_release_MindStudio_8.1.RC2_TR5_20260624` ModelSlim version, where the missing configuration_deepseek.py error has been fixed.
- [ModelSlim Documentation](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/README.md)
- [LLM-Compressor GitHub](https://github.com/vllm-project/llm-compressor)
- [vLLM Quantization Guide](https://docs.vllm.ai/en/latest/quantization/)