# Quantization Guide

Model quantization reduces a model's size and computational requirements by lowering the numerical precision of its weights and activations, which saves memory and speeds up inference.

Quantization has been experimentally supported in vLLM Ascend since version 0.9.0rc2. To enable it, specify `--quantization ascend`. Currently, only the Qwen and DeepSeek series models are well tested. More quantization algorithms and models will be supported in the future.
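
To make "lowering the precision" concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is an illustration only and is not the algorithm ModelSlim applies; the function names are hypothetical.

```python
# Illustrative only: symmetric per-tensor int8 quantization of a few weights.
# This is NOT ModelSlim's algorithm; it only shows the general idea.


def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.02, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

print("int8 codes:", codes)
print("restored  :", restored)
```

Each weight is stored in 8 bits plus one shared scale, and the round-trip error per value is at most half of that scale.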
## Install modelslim

To quantize a model, first install [ModelSlim](https://gitee.com/ascend/msit/blob/master/msmodelslim/README.md), the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.

Install modelslim:
```bash
# The branch(br_release_MindStudio_8.1.RC2_TR5_20260624) has been verified
git clone -b br_release_MindStudio_8.1.RC2_TR5_20260624 https://gitee.com/ascend/msit

cd msit/msmodelslim

bash install.sh
pip install accelerate
```
## Quantize model

:::{note}
You can convert the model yourself or use the quantized model we uploaded,
see https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8

The conversion process requires a large amount of CPU memory. Make sure the RAM size is greater than 2 TB.
:::
### Adaptations and changes

1. Ascend does not support the `flash_attn` library. To run the model, follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and comment out the relevant parts of the code in `modeling_deepseek.py`, located in the weights folder.
2. The current version of transformers does not support loading weights in FP8 quantization format. Follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and delete the quantization-related fields from `config.json` in the weights folder.
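
The `config.json` change in step 2 can be scripted. The sketch below assumes the FP8 settings live under a top-level `quantization_config` key, as is common in Hugging Face style configs; that key name and the helper names are assumptions, so check the linked guide for the exact fields to delete.

```python
import json
from pathlib import Path

# Assumption: the FP8 settings sit under a top-level "quantization_config" key.
# Adjust this tuple to match the fields named in the linked guide.
QUANT_KEYS = ("quantization_config",)


def strip_quant_fields(config: dict) -> dict:
    """Return a copy of the config without the quantization-related keys."""
    return {k: v for k, v in config.items() if k not in QUANT_KEYS}


def rewrite_config(weights_dir: str) -> None:
    """Back up config.json, then rewrite it without the quantization keys."""
    path = Path(weights_dir) / "config.json"
    original = path.read_text(encoding="utf-8")
    # Keep a copy of the untouched file next to it.
    path.with_name("config.json.bak").write_text(original, encoding="utf-8")
    cleaned = strip_quant_fields(json.loads(original))
    path.write_text(json.dumps(cleaned, indent=2) + "\n", encoding="utf-8")
```

For example, `rewrite_config("/root/.cache/Kimi-K2-Instruct")` would edit the config in the weights folder used later in this guide and leave a `config.json.bak` backup beside it.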
### Generate the W8A8 weights

```bash
cd example/DeepSeek

export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:False
export MODEL_PATH="/root/.cache/Kimi-K2-Instruct"
export SAVE_PATH="/root/.cache/Kimi-K2-Instruct-W8A8"

python3 quant_deepseek_w8a8.py --model_path "$MODEL_PATH" --save_path "$SAVE_PATH" --batch_size 4
```
The converted model directory contains the following files (the safetensors weight files are omitted):

```bash
.
|-- config.json
|-- configuration.json
|-- configuration_deepseek.py
|-- generation_config.json
|-- modeling_deepseek.py
|-- quant_model_description.json
|-- quant_model_weight_w8a8_dynamic.safetensors.index.json
|-- tiktoken.model
|-- tokenization_kimi.py
`-- tokenizer_config.json
```
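
Before serving the model, it can help to confirm that the conversion produced every file in the listing above. The following is a small sketch; the helper name `missing_files` is hypothetical, and the file names are copied from the listing.

```python
from pathlib import Path

# File names copied from the listing above. The safetensors shards are not
# checked because their names depend on the conversion run.
EXPECTED_FILES = [
    "config.json",
    "configuration.json",
    "configuration_deepseek.py",
    "generation_config.json",
    "modeling_deepseek.py",
    "quant_model_description.json",
    "quant_model_weight_w8a8_dynamic.safetensors.index.json",
    "tiktoken.model",
    "tokenization_kimi.py",
    "tokenizer_config.json",
]


def missing_files(save_path: str) -> list[str]:
    """Return the expected files that are absent from save_path."""
    root = Path(save_path)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_files("/root/.cache/Kimi-K2-Instruct-W8A8")` returns an empty list when all files are present.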
## Run the model

You can now run the quantized model with vLLM Ascend. The following examples cover offline and online inference.
### Offline inference

```python
import torch

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="{quantized_model_save_path}",
          max_model_len=2048,
          trust_remote_code=True,
          # Enable quantization by specifying `quantization="ascend"`
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
### Online inference

Enable quantization by specifying `--quantization ascend`. For more details, see the DeepSeek-V3-W8A8 [tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html).
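
Once a server started with `--quantization ascend` is running, it can be queried through vLLM's OpenAI-compatible completions route. The sketch below uses only the standard library. The host, port, and model name are assumptions (adjust them to your deployment), and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumptions: the server listens on localhost:8000 and serves the model
# under the name below. Replace both with the values of your deployment.
BASE_URL = "http://localhost:8000"
MODEL_NAME = "{quantized_model_save_path}"


def build_completion_request(prompt: str, max_tokens: int = 32) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {"model": MODEL_NAME, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the request and return the first generated text."""
    with urllib.request.urlopen(build_completion_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server up, `print(complete("The future of AI is"))` prints the generated continuation.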
## FAQs

### 1. How to solve the KeyError: 'xxx.layers.0.self_attn.q_proj.weight' problem?

First, make sure you specify the `ascend` quantization method. Second, check whether your model was converted with the `br_release_MindStudio_8.1.RC2_TR5_20260624` modelslim version. If it still doesn't work, please submit an issue; some new models may need to be adapted.
### 2. How to solve the error "Could not locate the configuration_deepseek.py"?

Please convert DeepSeek series models with the `br_release_MindStudio_8.1.RC2_TR5_20260624` modelslim version, which fixes the missing `configuration_deepseek.py` error.
### 3. When converting DeepSeek series models with modelslim, what should you pay attention to?

When using weights generated by modelslim with the `--dynamic` parameter, if torchair graph mode is enabled, modify the configuration file in the CANN package to prevent incorrect inference results.

The steps are as follows:

1. Search the CANN package directory in use, for example:
   `find /usr/local/Ascend/ -name fusion_config.json`

2. Add `"AddRmsNormDynamicQuantFusionPass":"off",` to the `fusion_config.json` you found, at the following location:

```bash
{
    "Switch":{
        "GraphFusion":{
            "AddRmsNormDynamicQuantFusionPass":"off",
```
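
The same edit can be applied programmatically. This is a minimal sketch, assuming `fusion_config.json` is plain JSON with the `Switch` / `GraphFusion` nesting shown above; the helper names are hypothetical. Back up the file before changing it.

```python
import json
from pathlib import Path

PASS_NAME = "AddRmsNormDynamicQuantFusionPass"


def disable_fusion_pass(config: dict) -> dict:
    """Set Switch.GraphFusion.<PASS_NAME> to "off", creating levels if absent."""
    graph_fusion = config.setdefault("Switch", {}).setdefault("GraphFusion", {})
    graph_fusion[PASS_NAME] = "off"
    return config


def patch_file(path: str) -> None:
    """Apply the change to one fusion_config.json file in place."""
    file = Path(path)
    config = json.loads(file.read_text(encoding="utf-8"))
    patched = disable_fusion_pass(config)
    file.write_text(json.dumps(patched, indent=4) + "\n", encoding="utf-8")
```

Run `patch_file(...)` on each path reported by the `find` command in step 1. The new key is appended to the `GraphFusion` object instead of being placed first; this should not matter if the file is read with a standard JSON parser, which does not depend on key order.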