### What this PR does / why we need it?
This PR makes the following changes:
1. Delete `user_guide/feature_guide/quantization-llm-compressor.md` and merge its content into `user_guide/feature_guide/quantization.md`.
2. Update the content of `user_guide/feature_guide/quantization.md`.
3. Add `developer_guide/feature_guide/quantization.md`, a guide on adapting quantization algorithms and quantized models.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main: 7157596103
---------
Signed-off-by: IncSec <1790766300@qq.com>
# Quantization Guide

Model quantization is a technique that reduces model size and computational overhead by lowering the numerical precision of weights and activations, thereby saving memory and improving inference speed.

`vLLM Ascend` supports multiple quantization methods. This guide provides instructions for using different quantization tools and running quantized models on vLLM Ascend.
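To make the memory-versus-precision trade-off concrete, here is a minimal sketch (our own illustration, not part of any quantization tool) of symmetric INT8 weight quantization, the core idea behind the W8A8 schemes used throughout this guide:

```python
# Illustrative only: symmetric per-tensor INT8 quantization of float weights.

def quantize_int8(weights):
    """Map float weights to int8 codes with a single symmetric scale."""
    scale = max(abs(w) for w in weights) / 127.0  # largest magnitude maps to 127
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return [v * scale for v in q]

weights = [0.02, -1.27, 0.635, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each weight now occupies 1 byte instead of 2 (FP16) or 4 (FP32),
# at the cost of a rounding error of at most scale/2 per element.
print(q, scale)
```

Real tools such as ModelSlim use finer-grained scales (per-channel or per-group) and calibration data to pick them, but the storage saving comes from exactly this narrowing of numeric precision.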
> **Note**
>
> You can either quantize the model yourself or use a quantized model we have uploaded, e.g. <https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8>.
> Before you quantize a model, ensure that you have enough RAM.
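As a rough sizing aid (our own back-of-the-envelope arithmetic, not an official requirement), weight memory scales linearly with the number of parameters and the bits per weight:

```python
# Illustrative estimate: weight memory only; activations, KV cache, and
# quantization-time working buffers add to this.
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    return num_params * bits_per_weight / 8 / 1024**3

params = 32e9  # hypothetical 32B-parameter model
print(f"FP16 weights:       {weight_memory_gib(params, 16):.1f} GiB")
print(f"INT8 (W8A8) weights: {weight_memory_gib(params, 8):.1f} GiB")
```

During quantization itself, expect to hold at least the full-precision weights plus the quantized copy in memory at once.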
## Quantization Tools

vLLM Ascend supports models quantized by two main tools: `ModelSlim` and `LLM-Compressor`.
### 1. ModelSlim (Recommended)

[ModelSlim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/README.md) is an Ascend-friendly model compression tool built for Ascend hardware. It provides a suite of inference optimization techniques, such as quantization and compression, that accelerate dense large language models, MoE models, multimodal understanding models, multimodal generation models, and more.

#### Installation

To use ModelSlim for model quantization, install it from its [Git repository](https://gitcode.com/Ascend/msit):
```bash
# Install the br_release_MindStudio_8.3.0_20261231 version
git clone https://gitcode.com/Ascend/msit.git -b br_release_MindStudio_8.3.0_20261231
cd msit/msmodelslim
bash install.sh
```

#### Model Quantization

The following example shows how to generate W8A8 quantized weights for the [Qwen3-MoE model](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/Qwen3-MOE/README.md).
**Quantization Script:**

```bash
cd example/Qwen3-MOE

# Multi-card quantization is supported
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:False

# Set the model and save paths
export MODEL_PATH="/path/to/your/model"
export SAVE_PATH="/path/to/your/quantized_model"

# Run the quantization script
python3 quant_qwen_moe_w8a8.py --model_path $MODEL_PATH \
    --save_path $SAVE_PATH \
    --anti_dataset ../common/qwen3-moe_anti_prompt_50.json \
    --calib_dataset ../common/qwen3-moe_calib_prompt_50.json \
    --trust_remote_code True
```

After quantization completes, the output directory contains the quantized model files.

For more examples, refer to the [official examples](https://gitcode.com/Ascend/msit/tree/master/msmodelslim/example).
### 2. LLM-Compressor

[LLM-Compressor](https://github.com/vllm-project/llm-compressor) is a unified model compression library for faster vLLM inference.

#### Installation

```bash
pip install llmcompressor
```

#### Model Quantization
`LLM-Compressor` provides examples for various quantization schemes. To generate W8A8 dynamic quantized weights:

```bash
# Navigate to the LLM-Compressor examples directory
cd examples/quantization/llm-compressor

# Run the quantization script
python3 w8a8_int8_dynamic.py
```

For more content, refer to the [official examples](https://github.com/vllm-project/llm-compressor/tree/main/examples).

Currently, the quantization types supported via LLM-Compressor are `W8A8` and `W8A8_DYNAMIC`.
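The difference between the two types can be sketched in a few lines (illustrative Python of our own, not the LLM-Compressor API): static `W8A8` fixes activation scales ahead of time from calibration data, while `W8A8_DYNAMIC` recomputes them from each batch at runtime:

```python
# Illustrative only: how activation scales are chosen in static vs dynamic INT8.

def int8_scale(values):
    """Symmetric scale so the largest magnitude maps to 127."""
    return max(abs(v) for v in values) / 127.0

# Static W8A8: scale fixed once, from calibration batches.
calibration_batches = [[0.5, -2.0, 1.0], [0.25, 1.5, -0.75]]
static_scale = max(int8_scale(b) for b in calibration_batches)

# W8A8_DYNAMIC: scale recomputed per incoming batch at inference time.
def quantize_dynamic(batch):
    scale = int8_scale(batch)  # per-batch scale
    return [round(v / scale) for v in batch], scale

runtime_batch = [0.1, -0.2, 0.05]  # much smaller range than calibration saw
q_dyn, dyn_scale = quantize_dynamic(runtime_batch)

# The dynamic scale adapts to the small batch, preserving more precision;
# the static scale would waste most of the int8 range on values never seen.
print(static_scale, dyn_scale)
```

Dynamic quantization trades a small runtime cost (computing scales on the fly) for robustness to activation ranges that the calibration set did not cover.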
## Running Quantized Models

To run a model quantized by **ModelSlim** with vLLM Ascend, specify the `--quantization ascend` parameter to enable quantization. For models quantized by **LLM-Compressor**, this parameter is not needed.
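The rule above can be captured in a small helper (a hypothetical convenience function of our own, not part of vLLM):

```python
# Hypothetical helper: build LLM keyword arguments depending on which tool
# produced the quantized weights.

def build_llm_kwargs(model_path: str, quant_tool: str, **extra) -> dict:
    """quant_tool is 'modelslim' or 'llm-compressor'."""
    kwargs = {"model": model_path, **extra}
    if quant_tool == "modelslim":
        # ModelSlim weights require the Ascend quantization method.
        kwargs["quantization"] = "ascend"
    elif quant_tool != "llm-compressor":
        raise ValueError(f"unknown quantization tool: {quant_tool}")
    # LLM-Compressor weights carry their own config; no extra flag needed.
    return kwargs

print(build_llm_kwargs("/path/to/model", "modelslim", max_model_len=4096))
print(build_llm_kwargs("/path/to/model", "llm-compressor"))
```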
### Offline Inference

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]

# Set sampling parameters
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="/path/to/your/quantized_model",
          max_model_len=4096,
          trust_remote_code=True,
          # Set appropriate TP and DP values
          tensor_parallel_size=2,
          data_parallel_size=1,
          # Set the serving model name
          served_model_name="quantized_model",
          # Specify `quantization="ascend"` to enable quantization for models quantized by ModelSlim
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
### Online Inference

```bash
# Equivalent to the offline inference example above
python -m vllm.entrypoints.api_server \
    --model /path/to/your/quantized_model \
    --max-model-len 4096 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --data-parallel-size 1 \
    --served-model-name quantized_model \
    --trust-remote-code \
    --quantization ascend
```

The commands above are for reference only. For more details, consult the [official guide](../../tutorials/index.md).
## References

- [ModelSlim Documentation](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/README.md)
- [LLM-Compressor GitHub](https://github.com/vllm-project/llm-compressor)
- [vLLM Quantization Guide](https://docs.vllm.ai/en/latest/quantization/)