# Quantization Guide

Model quantization reduces a model's size and computational requirements by lowering the data precision of its weights and activation values, which saves memory and improves inference speed.
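
As a rough illustration of the memory saving, the sketch below estimates the memory needed for the weights alone at two precisions. The 7-billion-parameter count is a hypothetical figure chosen for the example, and the estimate ignores activations, the KV cache, and quantization metadata such as scales.

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Hypothetical 7-billion-parameter model, used only for illustration.
NUM_PARAMS = 7_000_000_000

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)  # 16-bit weights
int8_gib = weight_memory_gib(NUM_PARAMS, 8)   # 8-bit weights, as in W8A8

print(f"BF16 weights: {bf16_gib:.1f} GiB")
print(f"INT8 weights: {int8_gib:.1f} GiB")
```

Halving the bits per weight halves the weight footprint; the actual end-to-end saving depends on the model and the runtime.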

Since version 0.9.0rc2, vLLM Ascend has offered experimental support for quantization. Enable it by specifying `--quantization ascend`. Currently, only the Qwen and DeepSeek series models are well tested. More quantization algorithms and models will be supported in the future.

## Install modelslim

To quantize a model, install [ModelSlim](https://gitee.com/ascend/msit/blob/master/msmodelslim/README.md), the Ascend compression and acceleration tool. It is an affinity-based compression tool built on the Ascend platform that uses compression as its core technology for acceleration.

Install modelslim:

```bash
# The branch(br_release_MindStudio_8.1.RC2_TR5_20260624) has been verified
git clone -b br_release_MindStudio_8.1.RC2_TR5_20260624 https://gitee.com/ascend/msit

cd msit/msmodelslim

bash install.sh
pip install accelerate
```

## Quantize model

:::{note}
You can either convert the model yourself or use the quantized model we uploaded;
see https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8

The conversion process requires a large amount of CPU memory. Make sure the host has more than 2 TB of RAM.
:::

### Adaptations and changes

1. Ascend does not support the `flash_attn` library. To run the model, follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and comment out the relevant parts of the code in `modeling_deepseek.py`, located in the weights folder.
2. The current version of transformers does not support loading weights in FP8 quantization format. Follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and delete the quantization-related fields from `config.json` in the weights folder.
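
The second change can be scripted. The following is a minimal sketch, assuming the FP8 settings live under a top-level `quantization_config` key (the usual Hugging Face convention, but an assumption here; check the linked guide for the exact fields to delete). The helper names `strip_quant_fields` and `rewrite_config` are hypothetical.

```python
import json
import shutil
from pathlib import Path

# Assumption: the quantization settings sit under this top-level key.
# Adjust the tuple if the linked guide names other fields.
QUANT_KEYS = ("quantization_config",)


def strip_quant_fields(config: dict, keys=QUANT_KEYS) -> dict:
    """Return a copy of `config` without the given top-level keys."""
    return {k: v for k, v in config.items() if k not in keys}


def rewrite_config(weights_dir: str) -> None:
    """Back up config.json, then rewrite it without the quantization fields."""
    path = Path(weights_dir) / "config.json"
    # Keep the original next to it as config.json.bak.
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    config = json.loads(path.read_text(encoding="utf-8"))
    cleaned = strip_quant_fields(config)
    path.write_text(json.dumps(cleaned, indent=2) + "\n", encoding="utf-8")
```

Calling `rewrite_config("/root/.cache/Kimi-K2-Instruct")` would edit the file in place and leave a `config.json.bak` copy for rollback.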

### Generate the w8a8 weights

```bash
cd example/DeepSeek

export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:False
export MODEL_PATH="/root/.cache/Kimi-K2-Instruct"
export SAVE_PATH="/root/.cache/Kimi-K2-Instruct-W8A8"

python3 quant_deepseek_w8a8.py --model_path $MODEL_PATH --save_path $SAVE_PATH --batch_size 4
```

The converted model directory contains the following files (the safetensors weight shards are omitted):

```bash
.
|-- config.json
|-- configuration.json
|-- configuration_deepseek.py
|-- generation_config.json
|-- modeling_deepseek.py
|-- quant_model_description.json
|-- quant_model_weight_w8a8_dynamic.safetensors.index.json
|-- tiktoken.model
|-- tokenization_kimi.py
`-- tokenizer_config.json
```
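
Before serving the model, it can help to confirm that the conversion produced a complete directory. The sketch below checks for the file names in the listing above plus at least one `*.safetensors` shard. The helper `missing_files` is hypothetical, and the list is specific to the Kimi-K2 example; other models will have different tokenizer files.

```python
from pathlib import Path

# File names copied from the listing above (Kimi-K2 example).
EXPECTED_FILES = [
    "config.json",
    "configuration.json",
    "configuration_deepseek.py",
    "generation_config.json",
    "modeling_deepseek.py",
    "quant_model_description.json",
    "quant_model_weight_w8a8_dynamic.safetensors.index.json",
    "tiktoken.model",
    "tokenization_kimi.py",
    "tokenizer_config.json",
]


def missing_files(save_path: str) -> list[str]:
    """Return the expected entries that are absent from `save_path`."""
    root = Path(save_path)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    # The weight shards are not listed individually, so only check that some exist.
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing
```

An empty list from `missing_files("/root/.cache/Kimi-K2-Instruct-W8A8")` means every expected file was found.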

## Run the model

You can now run the quantized model with vLLM Ascend. The examples below cover offline and online inference.

### Offline inference

```python
import torch

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="{quantized_model_save_path}",
          max_model_len=2048,
          trust_remote_code=True,
          # Enable quantization by specifying `quantization="ascend"`
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

### Online inference

Enable quantization by specifying `--quantization ascend`. For more details, see the DeepSeek-V3-W8A8 [tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html).
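
Once a server is running with `--quantization ascend`, it can be queried through vLLM's OpenAI-compatible HTTP API. The following is a minimal client sketch using only the standard library. It assumes the server listens on `http://localhost:8000` and that the model name in the request matches the one the server was started with; the helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 32) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.6,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the first generated completion."""
    request = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

For example, `complete("http://localhost:8000", "/root/.cache/Kimi-K2-Instruct-W8A8", "The future of AI is")` would return the generated continuation, provided the server is reachable at that address.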

## FAQs

### 1. How to solve the KeyError: 'xxx.layers.0.self_attn.q_proj.weight' problem?

First, make sure you specify the `ascend` quantization method. Second, check that your model was converted with the `br_release_MindStudio_8.1.RC2_TR5_20260624` branch of modelslim. If the error persists, please submit an issue; the model may be new and need to be adapted.

### 2. How to solve the error "Could not locate the configuration_deepseek.py"?

Convert DeepSeek series models with the `br_release_MindStudio_8.1.RC2_TR5_20260624` branch of modelslim; this version fixes the missing `configuration_deepseek.py` error.

### 3. When converting DeepSeek series models with modelslim, what should you pay attention to?

When using weights generated by modelslim with the `--dynamic` parameter while torchair graph mode is enabled, modify the configuration file in the CANN package to prevent incorrect inference results.

The steps are as follows:

1. Locate `fusion_config.json` in the CANN package directory you are using, for example with `find /usr/local/Ascend/ -name fusion_config.json`.

2. Add `"AddRmsNormDynamicQuantFusionPass":"off",` to the `fusion_config.json` you found, at the following location:

```bash
{
    "Switch":{
        "GraphFusion":{
            "AddRmsNormDynamicQuantFusionPass":"off",
```
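
The same edit can be applied programmatically. The sketch below assumes `fusion_config.json` is plain JSON with the `Switch` / `GraphFusion` nesting shown above; the helper names `disable_fusion_pass` and `patch_file` are hypothetical, and the script keeps a backup copy before writing.

```python
import json
import shutil
from pathlib import Path

PASS_NAME = "AddRmsNormDynamicQuantFusionPass"


def disable_fusion_pass(config: dict, pass_name: str = PASS_NAME) -> dict:
    """Set Switch.GraphFusion.<pass_name> to "off", creating levels if missing."""
    graph_fusion = config.setdefault("Switch", {}).setdefault("GraphFusion", {})
    graph_fusion[pass_name] = "off"
    return config


def patch_file(path: str) -> None:
    """Back up the file as <name>.bak, then write the patched configuration."""
    file = Path(path)
    shutil.copy2(file, file.with_name(file.name + ".bak"))
    config = json.loads(file.read_text(encoding="utf-8"))
    patched = disable_fusion_pass(config)
    file.write_text(json.dumps(patched, indent=4) + "\n", encoding="utf-8")
```

Run `patch_file` on each path reported by the `find` command, with permissions that allow writing to the CANN installation directory.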