Files
xc-llm-ascend/docs/source/tutorials/single_npu_qwen3_w4a4.md
22dimensions e6625bb582 [Doc] add qwen3 w4a4 tutorial (#4076)
### What this PR does / why we need it?
v0.11.0rc1 will introduce w4a4 quantization feature, so add this
tutorial.

### Does this PR introduce _any_ user-facing change?

No


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-11-10 20:30:07 +08:00

4.3 KiB

Single-NPU (Qwen3 32B W4A4)

Introduction

W4A4 Flat Quantization is for better model compression and inference efficiency on Ascend devices. And W4A4 is supported since v0.11.0rc1. For modelslim, W4A4 is supported since tag_MindStudio_8.2.RC1.B120_002.

The following steps will show how to quantize Qwen3 32B to W4A4.

Environment Preparation

Run Docker Container

   :substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash

Install modelslim and Convert Model

:::{note} You can choose to convert the model yourself or use the quantized model we uploaded, see https://www.modelscope.cn/models/vllm-ascend/Qwen3-32B-W4A4 :::

git clone -b tag_MindStudio_8.2.RC1.B120_002 https://gitcode.com/Ascend/msit
cd msit/msmodelslim

# Install by run this script
bash install.sh
pip install accelerate
# transformers 4.51.0 is required for Qwen3 series model
# see https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/Qwen/README.md#%E7%8E%AF%E5%A2%83%E9%85%8D%E7%BD%AE
pip install transformers==4.51.0

cd example/Qwen
# Original weight path, Replace with your local model path
MODEL_PATH=/home/models/Qwen3-32B
# Path to save converted weight, Replace with your local path
SAVE_PATH=/home/models/Qwen3-32B-w4a4

python3 w4a4.py --model_path $MODEL_PATH \
                --save_directory $SAVE_PATH \
                --calib_file ../common/qwen_qwen3_cot_w4a4.json \
                --trust_remote_code True \
                --batch_size 1

Verify the Quantized Model

The converted model files look like:

.
|-- config.json
|-- configuration.json
|-- generation_config.json
|-- quant_model_description.json
|-- quant_model_weight_w4a4_flatquant_dynamic-00001-of-00011.safetensors
|-- quant_model_weight_w4a4_flatquant_dynamic-00002-of-00011.safetensors
|-- quant_model_weight_w4a4_flatquant_dynamic-00003-of-00011.safetensors
|-- quant_model_weight_w4a4_flatquant_dynamic-00004-of-00011.safetensors
|-- quant_model_weight_w4a4_flatquant_dynamic-00005-of-00011.safetensors
|-- quant_model_weight_w4a4_flatquant_dynamic-00006-of-00011.safetensors
|-- quant_model_weight_w4a4_flatquant_dynamic-00007-of-00011.safetensors
|-- quant_model_weight_w4a4_flatquant_dynamic-00008-of-00011.safetensors
|-- quant_model_weight_w4a4_flatquant_dynamic-00009-of-00011.safetensors
|-- quant_model_weight_w4a4_flatquant_dynamic-00010-of-00011.safetensors
|-- quant_model_weight_w4a4_flatquant_dynamic-00011-of-00011.safetensors
|-- quant_model_weight_w4a4_flatquant_dynamic.safetensors.index.json
|-- tokenizer.json
|-- tokenizer_config.json
`-- vocab.json

Deployment

Online Serving on Single NPU

vllm serve /home/models/Qwen3-32B-w4a4 --served-model-name "qwen3-32b-w4a4" --max-model-len 4096 --quantization ascend

Once your server is started, you can query the model with input prompts.

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen3-32b-w4a4",
        "prompt": "what is large language model?",
        "max_tokens": "128",
        "top_p": "0.95",
        "top_k": "40",
        "temperature": "0.0"
    }'

Offline Inference on Single NPU

:::{note} To enable quantization for ascend, quantization method must be "ascend". :::


from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="/home/models/Qwen3-32B-w4a4",
          max_model_len=4096,
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")