# Quantization Guide

Model quantization reduces model size and computational overhead by lowering the numerical precision of weights and activations, which saves memory and improves inference speed. `vLLM Ascend` supports multiple quantization methods. This guide provides instructions for using different quantization tools and running quantized models on vLLM Ascend.

> **Note**
>
> You can choose to convert the model yourself or use a quantized model that we have already uploaded.
> Before you quantize a model, ensure that enough host memory (RAM) is available.

## Quantization Tools

vLLM Ascend supports models quantized by two main tools: `ModelSlim` and `LLM-Compressor`.

### 1. ModelSlim (Recommended)

[ModelSlim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/README.md) is an Ascend-friendly model compression tool focused on acceleration and built for Ascend hardware. It provides a series of inference optimization technologies, such as quantization and compression, aiming to accelerate dense large language models, MoE models, multimodal understanding models, multimodal generation models, and more.

#### Installation

To use ModelSlim for model quantization, install it from its [Git repository](https://gitcode.com/Ascend/msit):

```bash
# Install the br_release_MindStudio_8.3.0_20261231 version
git clone https://gitcode.com/Ascend/msit.git -b br_release_MindStudio_8.3.0_20261231
cd msit/msmodelslim
bash install.sh
```

#### Model Quantization

The following example shows how to generate W8A8 quantized weights for the [Qwen3-MoE model](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/Qwen3-MOE/README.md).

**Quantization Script:**

```bash
cd example/Qwen3-MOE

# Multi-card quantization is supported
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:False

# Set model and save paths
export MODEL_PATH="/path/to/your/model"
export SAVE_PATH="/path/to/your/quantized_model"

# Run the quantization script
python3 quant_qwen_moe_w8a8.py --model_path $MODEL_PATH \
    --save_path $SAVE_PATH \
    --anti_dataset ../common/qwen3-moe_anti_prompt_50.json \
    --calib_dataset ../common/qwen3-moe_calib_prompt_50.json \
    --trust_remote_code True
```

After quantization completes, the output directory contains the quantized model files. For more examples, refer to the [official examples](https://gitcode.com/Ascend/msit/tree/master/msmodelslim/example).

### 2. LLM-Compressor

[LLM-Compressor](https://github.com/vllm-project/llm-compressor) is a unified library for producing compressed models for faster inference with vLLM.

#### Installation

```bash
pip install llmcompressor
```

#### Model Quantization

`LLM-Compressor` provides examples for various quantization schemes. To generate W8A8 dynamic quantized weights:

```bash
# Navigate to the LLM-Compressor examples directory
cd examples/quantization/llm-compressor

# Run the quantization script
python3 w8a8_int8_dynamic.py
```

For more examples, refer to the [official examples](https://github.com/vllm-project/llm-compressor/tree/main/examples).

Quantization types produced by LLM-Compressor that vLLM Ascend currently supports: `W8A8` and `W8A8_DYNAMIC`.
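If you prefer to write the quantization step yourself rather than run the shipped example script, the sketch below shows the typical shape of such a script. It is a minimal, illustrative example and not the exact `w8a8_int8_dynamic.py` above: it assumes the `oneshot` entry point and `QuantizationModifier` exposed by recent `llmcompressor` releases, and the model and output paths are placeholders.

```python
# Minimal sketch, assuming a recent llmcompressor release that exposes
# `oneshot` and `QuantizationModifier`; paths are placeholders.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_PATH = "/path/to/your/model"            # original FP16/BF16 checkpoint
SAVE_PATH = "/path/to/your/quantized_model"   # output directory

# Quantize all Linear layers with the W8A8 preset (INT8 weights and, in recent
# compressed-tensors versions, dynamic per-token INT8 activations), keeping
# lm_head in full precision.
recipe = QuantizationModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"])

# One-shot post-training quantization; the result is saved in a
# compressed-tensors format that vLLM can load directly.
oneshot(model=MODEL_PATH, recipe=recipe, output_dir=SAVE_PATH)
```

The saved directory can then be passed to vLLM Ascend just like the output of the shipped example script.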
## Running Quantized Models

For a model quantized by **ModelSlim**, enable vLLM Ascend's quantization support by passing the `--quantization ascend` parameter (or `quantization="ascend"` in the Python API). Models quantized by **LLM-Compressor** do not need this parameter.

### Offline Inference

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]

# Set sampling parameters
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="/path/to/your/quantized_model",
          max_model_len=4096,
          trust_remote_code=True,
          # Set appropriate TP and DP values
          tensor_parallel_size=2,
          data_parallel_size=1,
          # Specify `quantization="ascend"` to enable quantization for models quantized by ModelSlim
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

### Online Inference

```bash
# Start an OpenAI-compatible server with settings matching the offline example above
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/your/quantized_model \
    --max-model-len 4096 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --data-parallel-size 1 \
    --served-model-name quantized_model \
    --trust-remote-code \
    --quantization ascend
```

The above commands are for reference only. For more details, consult the [official guide](../../tutorials/index.md).

## References

- [ModelSlim Documentation](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/README.md)
- [LLM-Compressor GitHub](https://github.com/vllm-project/llm-compressor)
- [vLLM Quantization Guide](https://docs.vllm.ai/en/latest/quantization/)