diff --git a/docs/source/assets/quantization/get_quant_method.png b/docs/source/assets/quantization/get_quant_method.png
new file mode 100644
index 00000000..45a9a8d1
Binary files /dev/null and b/docs/source/assets/quantization/get_quant_method.png differ
diff --git a/docs/source/assets/quantization/quant_algorithm_overview.png b/docs/source/assets/quantization/quant_algorithm_overview.png
new file mode 100644
index 00000000..aec9be0c
Binary files /dev/null and b/docs/source/assets/quantization/quant_algorithm_overview.png differ
diff --git a/docs/source/assets/quantization/quant_method_base_class.png b/docs/source/assets/quantization/quant_method_base_class.png
new file mode 100644
index 00000000..d9489efd
Binary files /dev/null and b/docs/source/assets/quantization/quant_method_base_class.png differ
diff --git a/docs/source/assets/quantization/quant_method_call_flow.png b/docs/source/assets/quantization/quant_method_call_flow.png
new file mode 100644
index 00000000..3df57606
Binary files /dev/null and b/docs/source/assets/quantization/quant_method_call_flow.png differ
diff --git a/docs/source/assets/quantization/quant_methods_overview.png b/docs/source/assets/quantization/quant_methods_overview.png
new file mode 100644
index 00000000..de152a73
Binary files /dev/null and b/docs/source/assets/quantization/quant_methods_overview.png differ
diff --git a/docs/source/developer_guide/feature_guide/index.md b/docs/source/developer_guide/feature_guide/index.md
index 20c9e179..6ceb74cd 100644
--- a/docs/source/developer_guide/feature_guide/index.md
+++ b/docs/source/developer_guide/feature_guide/index.md
@@ -14,4 +14,5 @@ ACL_Graph
 KV_Cache_Pool_Guide
 add_custom_aclnn_op
 context_parallel
+quantization
 :::
diff --git a/docs/source/developer_guide/feature_guide/quantization.md b/docs/source/developer_guide/feature_guide/quantization.md
new file mode 100644
index 00000000..e84db9c2
--- /dev/null
+++ b/docs/source/developer_guide/feature_guide/quantization.md
@@ -0,0 +1,111 @@
+# Quantization Adaptation Guide
+
+This document provides guidance for adapting quantization algorithms and quantized models produced by **ModelSlim**.
+
+## Quantization Feature Introduction
+
+### Quantization Inference Process
+
+The current process for registering and obtaining quantization methods in vLLM Ascend is as follows:
+
+![get_quant_method](../../assets/quantization/get_quant_method.png)
+
+vLLM Ascend registers a custom Ascend quantization method. Quantization is enabled by passing the `--quantization ascend` parameter (or `quantization="ascend"` for offline inference). When the `quant_config` is constructed, the registered `AscendQuantConfig` is initialized and `get_quant_method` is called to obtain the quantization method corresponding to each weight part, which is stored in the `quant_method` attribute.
+
+Currently supported quantization methods include `AscendLinearMethod`, `AscendFusedMoEMethod`, `AscendEmbeddingMethod`, and their corresponding non-quantized methods:
+
+![quant_methods_overview](../../assets/quantization/quant_methods_overview.png)
+
+The quantization method base class defined by vLLM and the overall call flow of quantization methods are as follows:
+
+![quant_method_call_flow](../../assets/quantization/quant_method_call_flow.png)
+
+The `embedding` method is generally not implemented for quantization, so this guide focuses only on the other three methods.
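+
+To make the registration and dispatch flow above concrete, the following simplified sketch shows how a `get_quant_method`-style hook can pick a quantized or non-quantized implementation for each layer, based on the per-weight entries of `quant_model_description.json` (an example of this file is shown later in this guide). This is an illustration only, not the actual `AscendQuantConfig` code; the function name and its return values are assumptions made for this example.
+
+```python
+from typing import Dict, Optional
+
+# Per-weight quantization descriptions, as found in quant_model_description.json.
+EXAMPLE_DESCRIPTION: Dict[str, str] = {
+    "model.layers.0.linear_attn.in_proj_qkvz.weight": "W8A8_DYNAMIC",
+    "model.layers.0.mlp.gate.weight": "FLOAT",
+}
+
+
+def select_quant_method(prefix: str, layer_kind: str,
+                        description: Dict[str, str]) -> Optional[str]:
+    """Return the name of the method class that would handle this layer."""
+    # Weights marked FLOAT keep the default, non-quantized implementation.
+    if description.get(f"{prefix}.weight", "FLOAT") == "FLOAT":
+        return None
+    # Otherwise dispatch on the kind of layer, mirroring get_quant_method.
+    return {
+        "linear": "AscendLinearMethod",
+        "moe": "AscendFusedMoEMethod",
+        "embedding": "AscendEmbeddingMethod",
+    }.get(layer_kind)
+
+
+print(select_quant_method("model.layers.0.linear_attn.in_proj_qkvz", "linear",
+                          EXAMPLE_DESCRIPTION))  # AscendLinearMethod
+print(select_quant_method("model.layers.0.mlp.gate", "linear",
+                          EXAMPLE_DESCRIPTION))  # None -> non-quantized path
+```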
+
+The `create_weights` method is used for weight initialization; the `process_weights_after_loading` method is used for weight post-processing, such as transposition, format conversion, and data type conversion; the `apply` method performs activation quantization and the quantized matrix multiplication during the forward pass.
+
+We need to implement the `create_weights`, `process_weights_after_loading`, and `apply` methods for different **layers** (**attention**, **mlp**, **moe**); a minimal sketch of such a method class is shown below, after the adaptation steps.
+
+**Supplement**: When loading the model, the quantized model's description file **quant_model_description.json** needs to be read. This file describes the quantization configuration and parameters for each part of the model weights, for example:
+
+```json
+{
+    "model.layers.0.linear_attn.dt_bias": "FLOAT",
+    "model.layers.0.linear_attn.A_log": "FLOAT",
+    "model.layers.0.linear_attn.conv1d.weight": "FLOAT",
+    "model.layers.0.linear_attn.in_proj_qkvz.weight": "W8A8_DYNAMIC",
+    "model.layers.0.linear_attn.in_proj_qkvz.weight_scale": "W8A8_DYNAMIC",
+    "model.layers.0.linear_attn.in_proj_qkvz.weight_offset": "W8A8_DYNAMIC",
+    "model.layers.0.linear_attn.in_proj_ba.weight": "FLOAT",
+    "model.layers.0.linear_attn.norm.weight": "FLOAT",
+    "model.layers.0.linear_attn.out_proj.weight": "FLOAT",
+    "model.layers.0.mlp.gate.weight": "FLOAT",
+    "model.layers.0.mlp.experts.0.gate_proj.weight": "W8A8_DYNAMIC",
+    "model.layers.0.mlp.experts.0.gate_proj.weight_scale": "W8A8_DYNAMIC",
+    "model.layers.0.mlp.experts.0.gate_proj.weight_offset": "W8A8_DYNAMIC"
+}
+```
+
+Based on the above, we present a brief description of the adaptation process for quantization algorithms and quantized models.
+
+### Quantization Algorithm Adaptation
+
+- **Step 1: Algorithm Design**. Define the algorithm ID (e.g., `W4A8_DYNAMIC`), determine the supported layers (linear, moe, attention), and design the quantization scheme (static/dynamic, per-tensor/per-channel/per-group).
+- **Step 2: Registration**. Add the algorithm ID to `ASCEND_QUANTIZATION_METHOD_MAP` in `vllm_ascend/quantization/utils.py` and associate it with the corresponding method class.
+
+```python
+ASCEND_QUANTIZATION_METHOD_MAP: Dict[str, Dict[str, Type[Any]]] = {
+    "W4A8_DYNAMIC": {
+        "linear": AscendW4A8DynamicLinearMethod,
+        "moe": AscendW4A8DynamicFusedMoEMethod,
+    },
+}
+```
+
+- **Step 3: Implementation**. Create an algorithm implementation file, such as `vllm_ascend/quantization/w4a8_dynamic.py`, and implement the method classes and their logic.
+- **Step 4: Testing**. Use your algorithm to generate quantization configurations and verify correctness and performance on the target models and hardware.
+
+### Quantized Model Adaptation
+
+Adapting a new quantized model requires ensuring the following three points:
+
+- The original model has been successfully adapted in `vLLM Ascend`.
+- **Fused Module Mapping**: Add the model's `model_type` to `packed_modules_model_mapping` in `vllm_ascend/quantization/quant_config.py`, listing its fused modules (e.g., `qkv_proj`, `gate_up_proj`, `experts`), to ensure sharding consistency and correct weight loading.
+
+```python
+packed_modules_model_mapping = {
+    "qwen3_moe": {
+        "qkv_proj": [
+            "q_proj",
+            "k_proj",
+            "v_proj",
+        ],
+        "gate_up_proj": [
+            "gate_proj",
+            "up_proj",
+        ],
+        "experts":
+        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
+    },
+}
+```
+
+- All quantization algorithms used by the quantized model have been integrated into the `quantization` module.
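+
+To make the three hooks described above concrete, the following self-contained sketch shows the general shape of a linear quantization method in the W8A8 dynamic style. It is a toy for illustration only: the class name, the simplified argument lists, and the float-emulated matmul are assumptions made for this example and do not mirror the real `AscendW8A8DynamicLinearMethod` implementation, which dispatches to NPU kernels.
+
+```python
+# Toy sketch only -- not the vllm-ascend implementation. Class name, argument
+# lists, and attribute names are assumptions chosen for illustration.
+import torch
+
+
+class ToyW8A8DynamicLinearMethod:
+    """Minimal shape of a linear quantization method."""
+
+    def create_weights(self, layer: torch.nn.Module,
+                       input_size: int, output_size: int) -> None:
+        # Create weights in the quantized dtype so the checkpoint's int8
+        # tensors and per-channel scales can be loaded directly.
+        layer.weight = torch.nn.Parameter(
+            torch.empty(output_size, input_size, dtype=torch.int8),
+            requires_grad=False)
+        layer.weight_scale = torch.nn.Parameter(
+            torch.empty(output_size, 1, dtype=torch.float32),
+            requires_grad=False)
+
+    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
+        # Post-processing hook: here just a transpose into the layout the
+        # matmul expects (real methods may also convert formats or dtypes).
+        layer.weight.data = layer.weight.data.t().contiguous()
+
+    def apply(self, layer: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
+        # Dynamic per-token activation quantization, then a matmul emulated
+        # in float32 here; a real method would call a fused NPU kernel.
+        x_scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
+        x_q = torch.clamp(torch.round(x / x_scale), -128, 127)
+        y = x_q @ layer.weight.to(torch.float32)
+        return y * x_scale * layer.weight_scale.t()
+
+
+# Rough usage: create weights, (pretend to) load them, post-process, forward.
+layer = torch.nn.Module()
+method = ToyW8A8DynamicLinearMethod()
+method.create_weights(layer, input_size=16, output_size=8)
+layer.weight.data = torch.randint(-128, 127, layer.weight.shape, dtype=torch.int8)
+layer.weight_scale.data = torch.full_like(layer.weight_scale, 0.01)
+method.process_weights_after_loading(layer)
+print(method.apply(layer, torch.randn(2, 16)).shape)  # torch.Size([2, 8])
+```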
+ +## Currently Supported Quantization Algorithms + +vLLM Ascend supports multiple quantization algorithms. The following table provides an overview of each quantization algorithm based on the implementation in the `vllm_ascend.quantization` module: + +| Algorithm | Weight | Activation | Weight Granularity | Activation Granularity | Type | Description | +| ------------------------ | ------ | ---------- | ------------------ | ---------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `W4A16` | INT4 | FP16/BF16 | Per-Group | Per-Tensor | Static | 4-bit weight quantization with 16-bit activation precision, specifically designed for MoE model expert layers, supporting int32 format weight packing | +| `W8A16` | INT8 | FP16/BF16 | Per-Channel | Per-Tensor | Static | 8-bit weight quantization with 16-bit activation precision, balancing accuracy and performance, suitable for linear layers | +| `W8A8` | INT8 | INT8 | Per-Channel | Per-Tensor | Static | Static activation quantization, suitable for scenarios requiring high precision | +| `W8A8_DYNAMIC` | INT8 | INT8 | Per-Channel | Per-Token | Dynamic | Dynamic activation quantization with per-token scaling factor calculation | +| `W4A8_DYNAMIC` | INT4 | INT8 | Per-Group | Per-Token | Dynamic | Supports both direct per-channel quantization to 4-bit and two-step quantization (per-channel to 8-bit then per-group to 4-bit) | +| `W4A4_FLATQUANT_DYNAMIC` | INT4 | INT4 | Per-Channel | Per-Token | Dynamic | Uses FlatQuant for activation distribution smoothing before 4-bit dynamic quantization, with additional matrix multiplications for precision preservation | +| `W8A8_MIX` | INT8 | INT8 | Per-Channel | Per-Tensor/Token | Mixed | PD Colocation Scenario uses dynamic quantization for both P node and D node; PD Disaggregation Scenario uses dynamic quantization for P node and static for D node | + +**Static vs Dynamic:** Static quantization uses pre-computed scaling factors with better performance, while dynamic quantization computes scaling factors on-the-fly for each token/activation tensor with higher precision. + +**Granularity:** Refers to the scope of scaling factor computation (e.g., per-tensor, per-channel, per-group). diff --git a/docs/source/user_guide/feature_guide/index.md b/docs/source/user_guide/feature_guide/index.md index a209828d..0a763eae 100644 --- a/docs/source/user_guide/feature_guide/index.md +++ b/docs/source/user_guide/feature_guide/index.md @@ -7,7 +7,6 @@ This section provides a detailed usage guide of vLLM Ascend features. :maxdepth: 1 graph_mode quantization -quantization-llm-compressor sleep_mode structured_output lora diff --git a/docs/source/user_guide/feature_guide/quantization-llm-compressor.md b/docs/source/user_guide/feature_guide/quantization-llm-compressor.md deleted file mode 100644 index a97b4de2..00000000 --- a/docs/source/user_guide/feature_guide/quantization-llm-compressor.md +++ /dev/null @@ -1,65 +0,0 @@ -# llm-compressor Quantization Guide - -Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of the weights and activation values in the model, thereby saving the memory and improving the inference speed. - -## Supported llm-compressor Quantization Types - -Support CompressedTensorsW8A8 static weight - -weight: per-channel, int8, symmetric; activation: per-tensor, int8, symmetric. 
-
-Support CompressedTensorsW8A8Dynamic weight
-
-weight: per-channel, int8, symmetric; activation: per-token, int8, symmetric, dynamic.
-
-## Install llm-compressor
-
-To quantize a model, you should install [llm-compressor](https://github.com/vllm-project/llm-compressor/blob/main/README.md). It is a unified library for creating compressed models for faster inference with vLLM.
-
-Install llm-compressor
-
-```bash
-pip install llmcompressor
-```
-
-### Generate the W8A8 weights
-
-```bash
-cd examples/quantization/llm-compressor
-
-python3 w8a8_int8_dynamic.py
-```
-
-for more details, see the [Official Sample](https://github.com/vllm-project/llm-compressor/tree/main/examples).
-
-## Run the model
-
-Now, you can run the quantized model with vLLM Ascend. Examples for online and offline inference are provided as follows:
-
-### Offline inference
-
-```python
-import torch
-
-from vllm import LLM, SamplingParams
-
-prompts = [
-    "Hello, my name is",
-    "The future of AI is",
-]
-sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
-
-llm = LLM(model="{quantized_model_save_path}",
-          max_model_len=2048,
-          trust_remote_code=True)
-
-outputs = llm.generate(prompts, sampling_params)
-for output in outputs:
-    prompt = output.prompt
-    generated_text = output.outputs[0].text
-    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-```
-
-### Online inference
-
-Start the quantized model using vLLM Ascend; no modifications to the startup command are required.
diff --git a/docs/source/user_guide/feature_guide/quantization.md b/docs/source/user_guide/feature_guide/quantization.md
index 8212bb96..3bdfacc7 100644
--- a/docs/source/user_guide/feature_guide/quantization.md
+++ b/docs/source/user_guide/feature_guide/quantization.md
@@ -1,71 +1,96 @@
 # Quantization Guide
 
-Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of the weights and activation values in the model, thereby saving the memory and improving the inference speed.
+Model quantization is a technique that reduces model size and computational overhead by lowering the numerical precision of weights and activations, thereby saving memory and improving inference speed.
 
-Since version 0.9.0rc2, the quantization feature is experimentally supported by vLLM Ascend. Users can enable the quantization feature by specifying `--quantization ascend`. Currently, only Qwen, DeepSeek series models are well tested. We will support more quantization algorithms and models in the future.
+`vLLM Ascend` supports multiple quantization methods. This guide provides instructions for using different quantization tools and running quantized models on vLLM Ascend.
 
-## Install ModelSlim
+> **Note**
+>
+> You can choose to convert the model yourself or use the quantized model we uploaded.
+> See https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8 for an example.
+> Before you quantize a model, ensure that the host has enough RAM to hold the original model weights.
 
-To quantize a model, you should install [ModelSlim](https://gitcode.com/Ascend/msit/tree/master) which is the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.
+## Quantization Tools
 
-Install ModelSlim:
+vLLM Ascend supports models quantized by two main tools: `ModelSlim` and `LLM-Compressor`.
+
+### 1. ModelSlim (Recommended)
+
+[ModelSlim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/README.md) is a model compression and acceleration toolkit built for Ascend hardware. It includes a series of inference optimization technologies, such as quantization and compression, aimed at accelerating dense large language models, MoE models, multimodal understanding models, and multimodal generation models.
+
+#### Installation
+
+To use ModelSlim for model quantization, install it from its [Git repository](https://gitcode.com/Ascend/msit):
 
 ```bash
-# The branch(br_release_MindStudio_8.1.RC2_TR5_20260624) has been verified
-git clone -b br_release_MindStudio_8.1.RC2_TR5_20260624 https://gitcode.com/Ascend/msit/tree/master
+# Install the br_release_MindStudio_8.3.0_20261231 version
+git clone https://gitcode.com/Ascend/msit.git -b br_release_MindStudio_8.3.0_20261231
 cd msit/msmodelslim
 bash install.sh
-pip install accelerate
 ```
 
-## Quantize model
+#### Model Quantization
 
-:::{note}
-You can choose to convert the model yourself or use the quantized model we uploaded.
-See https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8.
-This conversion process requires a larger CPU memory, ensure that the RAM size is greater than 2 TB.
-:::
+The following example shows how to generate W8A8 quantized weights for the [Qwen3-MoE model](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/Qwen3-MOE/README.md).
 
-### Adapts and changes
-1. Ascend does not support the `flash_attn` library. To run the model, you need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and comment out certain parts of the code in `modeling_deepseek.py` located in the weights folder.
-2. The current version of transformers does not support loading weights in FP8 quantization format. you need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and delete the quantization related fields from `config.json` in the weights folder.
-
-### Generate the W8A8 weights
+**Quantization Script:**
 
 ```bash
-cd example/DeepSeek
+cd example/Qwen3-MOE
 
+# Support multi-card quantization
 export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:False
 
-export MODEL_PATH="/root/.cache/Kimi-K2-Instruct"
-export SAVE_PATH="/root/.cache/Kimi-K2-Instruct-W8A8"
+# Set model and save paths
+export MODEL_PATH="/path/to/your/model"
+export SAVE_PATH="/path/to/your/quantized_model"
 
-python3 quant_deepseek_w8a8.py --model_path $MODEL_PATH --save_path $SAVE_PATH --batch_size 4
+# Run quantization script
+python3 quant_qwen_moe_w8a8.py --model_path $MODEL_PATH \
+--save_path $SAVE_PATH \
+--anti_dataset ../common/qwen3-moe_anti_prompt_50.json \
+--calib_dataset ../common/qwen3-moe_calib_prompt_50.json \
+--trust_remote_code True
 ```
 
-Here is the full converted model files except safetensors:
+After quantization completes, the output directory will contain the quantized model files.
+
+For more examples, refer to the [official examples](https://gitcode.com/Ascend/msit/tree/master/msmodelslim/example).
+
+### 2. LLM-Compressor
+
+[LLM-Compressor](https://github.com/vllm-project/llm-compressor) is a unified library for creating compressed models for faster inference with vLLM.
+
+#### Installation
 
 ```bash
-.
-|-- config.json
-|-- configuration.json
-|-- configuration_deepseek.py
-|-- generation_config.json
-|-- modeling_deepseek.py
-|-- quant_model_description.json
-|-- quant_model_weight_w8a8_dynamic.safetensors.index.json
-|-- tiktoken.model
-|-- tokenization_kimi.py
-`-- tokenizer_config.json
+pip install llmcompressor
 ```
 
-## Run the model
+#### Model Quantization
 
-Now, you can run the quantized model with vLLM Ascend. Examples for online and offline inference are provided as follows:
+`LLM-Compressor` provides various quantization scheme examples. To generate W8A8 dynamic quantized weights:
 
-### Offline inference
+```bash
+# Navigate to the LLM-Compressor examples directory
+cd examples/quantization/llm-compressor
+
+# Run the quantization script
+python3 w8a8_int8_dynamic.py
+```
+
+For more quantization schemes, refer to the [official examples](https://github.com/vllm-project/llm-compressor/tree/main/examples).
+
+Currently supported quantization types from LLM-Compressor: `W8A8` and `W8A8_DYNAMIC`.
+
+## Running Quantized Models
+
+For a model quantized by **ModelSlim**, enable quantization in vLLM Ascend by specifying the `--quantization ascend` parameter; for a model quantized by **LLM-Compressor**, this parameter is not needed.
+
+### Offline Inference
 
 ```python
 import torch
@@ -76,12 +101,20 @@ prompts = [
     "Hello, my name is",
     "The future of AI is",
 ]
+# Set sampling parameters
 sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
 
-llm = LLM(model="{quantized_model_save_path}",
-          max_model_len=2048,
+llm = LLM(model="/path/to/your/quantized_model",
+          max_model_len=4096,
           trust_remote_code=True,
-          # Enable quantization by specifying `quantization="ascend"`
+          # Set appropriate TP and DP values
+          tensor_parallel_size=2,
+          data_parallel_size=1,
+          # Set the serving model name
+          served_model_name="quantized_model",
+          # Specify `quantization="ascend"` to enable quantization for models quantized by ModelSlim
           quantization="ascend")
 
 outputs = llm.generate(prompts, sampling_params)
@@ -91,16 +124,25 @@ for output in outputs:
     prompt = output.prompt
     generated_text = output.outputs[0].text
     print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
 ```
 
-### Online inference
+### Online Inference
 
-Enable quantization by specifying `--quantization ascend`, for more details, see the [DeepSeek-V3-W8A8 Tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html).
+```bash
+# Online serving equivalent of the offline example above
+python -m vllm.entrypoints.api_server \
+    --model /path/to/your/quantized_model \
+    --max-model-len 4096 \
+    --port 8000 \
+    --tensor-parallel-size 2 \
+    --data-parallel-size 1 \
+    --served-model-name quantized_model \
+    --trust-remote-code \
+    --quantization ascend
+```
 
-## FAQs
+The above commands are for reference only. For more details, consult the [official guide](../../tutorials/index.md).
 
-### 1. How to solve the KeyError "xxx.layers.0.self_attn.q_proj.weight"?
+## References
 
-First, make sure you specify `ascend` as the quantization method. Second, check if your model is converted by the `br_release_MindStudio_8.1.RC2_TR5_20260624` ModelSlim version. Finally, if it still does not work, submit an issue. Maybe some new models need to be adapted.
-
-### 2. How to solve the error "Could not locate the configuration_deepseek.py"?
-
-Please convert DeepSeek series models using `br_release_MindStudio_8.1.RC2_TR5_20260624` ModelSlim, where the missing configuration_deepseek.py error has been fixed.
+- [ModelSlim Documentation](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/README.md) +- [LLM-Compressor GitHub](https://github.com/vllm-project/llm-compressor) +- [vLLM Quantization Guide](https://docs.vllm.ai/en/latest/quantization/)