# Quantization Guide

Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of its weights and activations, thereby saving memory and improving inference speed.
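
To make the idea concrete, here is a toy sketch of symmetric per-tensor int8 weight quantization (the "W8" in W8A8). It only illustrates the general principle and is not the algorithm ModelSlim applies:

```python
import numpy as np

# A toy fp32 weight tensor.
weights = np.array([0.42, -1.30, 0.07, 2.15], dtype=np.float32)

# Symmetric quantization: one scale maps the fp32 range onto [-127, 127].
scale = np.abs(weights).max() / 127.0
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# At inference time the int8 values are rescaled back (or the matmul runs in
# int8 and is rescaled), trading a small precision loss for 4x less memory.
dequantized = q_weights.astype(np.float32) * scale
print(q_weights)    # int8 storage
print(dequantized)  # approximate reconstruction of the original weights
```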

Since v0.9.0rc2, quantization has been experimentally supported in vLLM Ascend. Users can enable it by specifying `--quantization ascend`. Currently, only the Qwen and DeepSeek series models are well tested. We will support more quantization algorithms and models in the future.

## Install modelslim

To quantize a model, users should install ModelSlim, the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, with compression as its core technology, built upon the Ascend platform.

Install modelslim:

```bash
# The branch(br_release_MindStudio_8.1.RC2_TR5_20260624) has been verified
git clone -b br_release_MindStudio_8.1.RC2_TR5_20260624 https://gitee.com/ascend/msit

cd msit/msmodelslim

bash install.sh
pip install accelerate
```
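
As a quick sanity check that the installation succeeded, try importing the package. The importable module name `msmodelslim` is an assumption based on the repository layout above:

```python
# If this import fails, re-run install.sh and check its output for errors.
import msmodelslim
print("msmodelslim imported successfully")
```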

## Quantize model

:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded; see https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8. The conversion process requires a large amount of CPU memory, so please ensure that the RAM size is greater than 2 TB.
:::

### Adaptations and changes

  1. Ascend does not support the `flash_attn` library. To run the model, follow the guide and comment out the relevant parts of the code in `modeling_deepseek.py` located in the weights folder.
  2. The current version of transformers does not support loading weights in the FP8 quantization format. Follow the guide and delete the quantization-related fields from `config.json` in the weights folder, as shown in the sketch after this list.
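
Here is a minimal sketch of the second step. It assumes the FP8 checkpoint stores its quantization settings under the common `quantization_config` key in `config.json`; verify the actual key name in your checkpoint before deleting anything:

```python
import json

# Placeholder path to the downloaded weights folder.
config_path = "/root/.cache/Kimi-K2-Instruct/config.json"

with open(config_path) as f:
    config = json.load(f)

# Drop the FP8 quantization settings so transformers does not try to load
# the weights as FP8. The key name `quantization_config` is an assumption.
removed = config.pop("quantization_config", None)
print("Removed quantization_config:", removed is not None)

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```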

### Generate the w8a8 weights

```bash
cd example/DeepSeek

export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:False
export MODEL_PATH="/root/.cache/Kimi-K2-Instruct"
export SAVE_PATH="/root/.cache/Kimi-K2-Instruct-W8A8"

python3 quant_deepseek_w8a8.py --model_path $MODEL_PATH --save_path $SAVE_PATH --batch_size 4
```

Here is the full list of converted model files, excluding the safetensors weight shards:

```
.
|-- config.json
|-- configuration.json
|-- configuration_deepseek.py
|-- generation_config.json
|-- modeling_deepseek.py
|-- quant_model_description.json
|-- quant_model_weight_w8a8_dynamic.safetensors.index.json
|-- tiktoken.model
|-- tokenization_kimi.py
`-- tokenizer_config.json
```
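
To confirm which layers were actually quantized, you can peek at `quant_model_description.json`. The schema is assumed here to be a flat mapping from tensor names to quantization types; adjust the snippet if your modelslim version writes a different layout:

```python
import json

# Placeholder path: the SAVE_PATH used during conversion.
desc_path = "/root/.cache/Kimi-K2-Instruct-W8A8/quant_model_description.json"

with open(desc_path) as f:
    desc = json.load(f)

# Print a few entries to see which tensors were converted to W8A8.
for name, qtype in list(desc.items())[:5]:
    print(name, "->", qtype)
```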

## Run the model

Now you can run the quantized model with vLLM Ascend. Here are examples for offline and online inference.

### Offline inference

```python
import torch

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="{quantized_model_save_path}",
          max_model_len=2048,
          trust_remote_code=True,
          # Enable quantization by specifying `quantization="ascend"`
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

### Online inference

Enable quantization by specifying `--quantization ascend`. For more details, see the DeepSeek-V3-W8A8 tutorial.
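
As an illustrative sketch, here is how the quantized model could be queried through vLLM's OpenAI-compatible API once the server is up. The model path and server address below are placeholders for your own setup, and the server is assumed to have been started with something like `vllm serve {quantized_model_save_path} --quantization ascend --trust-remote-code`:

```python
# Minimal client sketch against a locally running vLLM server.
# `http://localhost:8000/v1` is vLLM's default server address; adjust as needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="{quantized_model_save_path}",  # placeholder: your quantized model path
    prompt="The future of AI is",
    max_tokens=64,
    temperature=0.6,
)
print(completion.choices[0].text)
```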

## FAQs

### 1. How to solve the `KeyError: 'xxx.layers.0.self_attn.q_proj.weight'` problem?

First, make sure you specify the ascend quantization method. Second, check whether your model was converted with the `br_release_MindStudio_8.1.RC2_TR5_20260624` modelslim version. Finally, if it still doesn't work, please submit an issue; some new models may need to be adapted.

### 2. How to solve the error "Could not locate the configuration_deepseek.py"?

Please convert DeepSeek series models using the `br_release_MindStudio_8.1.RC2_TR5_20260624` modelslim version; it fixes the missing `configuration_deepseek.py` error.

### 3. What should you pay attention to when converting DeepSeek series models with modelslim?

When the MLA portion of the weights uses W8A8_DYNAMIC quantization and torchair graph mode is enabled, please modify the configuration file in the CANN package to prevent incorrect inference results.

The operation steps are as follows:

  1. Search the CANN package directory in use, for example: `find /usr/local/Ascend/ -name fusion_config.json`

  2. Add `"AddRmsNormDynamicQuantFusionPass": "off"` and `"MultiAddRmsNormDynamicQuantFusionPass": "off"` to the `fusion_config.json` you found; the location is as follows:

```json
{
    "Switch": {
        "GraphFusion": {
            "AddRmsNormDynamicQuantFusionPass": "off",
            "MultiAddRmsNormDynamicQuantFusionPass": "off"
        }
    }
}
```