# Quantization Guide
Model quantization reduces model size and computational overhead by lowering the numerical precision of weights and activations, which saves memory and improves inference speed.
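For a rough sense of the savings, here is a back-of-the-envelope estimate (plain arithmetic, not a measurement) comparing FP16 and INT8 (W8A8) weight storage for a hypothetical 7B-parameter model:

```python
# Back-of-the-envelope weight-memory estimate for a hypothetical 7B-parameter model.
num_params = 7e9

fp16_gib = num_params * 2 / 1024**3   # 2 bytes per FP16/BF16 weight
int8_gib = num_params * 1 / 1024**3   # 1 byte per INT8 (W8A8) weight

print(f"FP16 weights: ~{fp16_gib:.1f} GiB")  # ~13.0 GiB
print(f"INT8 weights: ~{int8_gib:.1f} GiB")  # ~6.5 GiB
```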
`vLLM Ascend` supports multiple quantization methods. This guide provides instructions for using different quantization tools and running quantized models on vLLM Ascend.
> **Note**
>
> You can either quantize a model yourself or use a pre-quantized model we have uploaded; a download sketch follows this note.
> See <https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8>.
> Before you quantize a model, ensure sufficient RAM is available.
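If you go with the pre-quantized weights, a minimal download sketch using ModelScope's Python SDK (assuming the `modelscope` package is installed; files land in the default ModelScope cache unless you pass a custom directory):

```python
from modelscope import snapshot_download

# Download the pre-quantized weights; returns the local directory they were saved to.
model_dir = snapshot_download("vllm-ascend/Kimi-K2-Instruct-W8A8")
print(model_dir)
```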
## Quantization Tools
vLLM Ascend supports models quantized by two main tools: `ModelSlim` and `LLM-Compressor`.

### 1. ModelSlim (Recommended)
[ModelSlim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/README.md) is a model compression tool built for Ascend hardware. It provides a series of inference optimization techniques, such as quantization and compression, aimed at accelerating dense large language models, MoE models, multimodal understanding models, multimodal generation models, and more.

#### Installation
To use ModelSlim for model quantization, install it from its [Git repository](https://gitcode.com/Ascend/msit):
```bash
# Install br_release_MindStudio_8.3.0_20261231 version
git clone https://gitcode.com/Ascend/msit.git -b br_release_MindStudio_8.3.0_20261231
cd msit/msmodelslim
bash install.sh
```
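To sanity-check the installation, a minimal import test (this assumes the tool installs under the Python package name `msmodelslim`):

```python
# Minimal import check; if this runs without error, the tool is installed.
import msmodelslim
print("msmodelslim is available")
```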
#### Model Quantization
The following example shows how to generate W8A8 quantized weights for the [Qwen3-MoE model](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/Qwen3-MOE/README.md).
**Quantization Script:**
```bash
cd example/Qwen3-MOE
# Multi-card quantization is supported; make the target NPUs visible
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:False
# Set model and save paths
export MODEL_PATH="/path/to/your/model"
export SAVE_PATH="/path/to/your/quantized_model"
# Run quantization script
python3 quant_qwen_moe_w8a8.py --model_path $MODEL_PATH \
--save_path $SAVE_PATH \
--anti_dataset ../common/qwen3-moe_anti_prompt_50.json \
--calib_dataset ../common/qwen3-moe_calib_prompt_50.json \
--trust_remote_code True
```
After quantization completes, the output directory will contain the quantized model files.
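A quick way to confirm the weights were written is to list the save directory (a trivial sketch; the exact file names depend on the model and quantization settings):

```python
import os

# List whatever the quantization script wrote to the save directory.
save_path = "/path/to/your/quantized_model"
for name in sorted(os.listdir(save_path)):
    print(name)
```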
For more examples, refer to the [official examples](https://gitcode.com/Ascend/msit/tree/master/msmodelslim/example).

### 2. LLM-Compressor
[LLM-Compressor](https://github.com/vllm-project/llm-compressor) is a unified library for producing compressed models for faster vLLM inference.

#### Installation
```bash
pip install llmcompressor
```
#### Model Quantization
`LLM-Compressor` provides various quantization scheme examples.
##### Dense Quantization
An example of generating W8A8 dynamic quantized weights for a dense model:
```bash
# Navigate to LLM-Compressor examples directory
cd examples/quantization/llm-compressor
# Run quantization script
python3 w8a8_int8_dynamic.py
```
##### MoE Quantization
An example of generating W8A8 dynamic quantized weights for a MoE model:
```bash
# Navigate to LLM-Compressor examples directory
cd examples/quantization/llm-compressor
# Run quantization script
python3 w8a8_int8_dynamic_moe.py
```
For more examples, refer to the [official examples](https://github.com/vllm-project/llm-compressor/tree/main/examples).

The quantization types currently supported with LLM-Compressor are `W8A8` and `W8A8_DYNAMIC`.
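For reference, the core of such a script is typically a quantization recipe passed to `llmcompressor`'s `oneshot` API. The sketch below is illustrative rather than the exact content of the scripts above; the model name, calibration dataset, and sample counts are placeholder assumptions:

```python
# Illustrative W8A8 INT8 recipe using LLM-Compressor's oneshot API.
# Model name, dataset, and output directory are placeholders, not the values used by the scripts above.
from llmcompressor import oneshot  # on older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

recipe = [
    # Smooth activation outliers into the weights before quantization.
    SmoothQuantModifier(smoothing_strength=0.8),
    # Quantize Linear layers to INT8 weights and activations, keeping lm_head in higher precision.
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    dataset="open_platypus",                     # small calibration dataset
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W8A8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```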
## Running Quantized Models
Once you have a quantized model generated by **ModelSlim**, specify the `--quantization ascend` parameter when running it on vLLM Ascend to enable quantization support. For models quantized by **LLM-Compressor**, this parameter is not needed.

### Offline Inference
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]

# Set sampling parameters
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="/path/to/your/quantized_model",
          max_model_len=4096,
          trust_remote_code=True,
          # Set appropriate TP and DP values
          tensor_parallel_size=2,
          data_parallel_size=1,
          # Specify quantization="ascend" for models quantized by ModelSlim;
          # omit it for models quantized by LLM-Compressor
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
### Online Inference
```bash
# Online serving equivalent of the offline example above.
# Add --quantization ascend only for models quantized by ModelSlim.
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/your/quantized_model \
    --max-model-len 4096 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --data-parallel-size 1 \
    --served-model-name quantized_model \
    --trust-remote-code \
    --quantization ascend
```
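Once the server is up, any OpenAI-compatible client can query it. A minimal sketch, assuming the server above is reachable on `localhost:8000` and no API key is configured:

```python
from openai import OpenAI

# Point an OpenAI-compatible client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="quantized_model",   # must match --served-model-name
    prompt="The future of AI is",
    max_tokens=64,
    temperature=0.6,
)
print(completion.choices[0].text)
```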
## References
- [ModelSlim Documentation](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/README.md)
- [LLM-Compressor GitHub](https://github.com/vllm-project/llm-compressor)
- [vLLM Quantization Guide](https://docs.vllm.ai/en/latest/quantization/)