diff --git a/docs/source/tutorials/multi_npu_quantization.md b/docs/source/tutorials/multi_npu_quantization.md
index 76f4bc4..9570804 100644
--- a/docs/source/tutorials/multi_npu_quantization.md
+++ b/docs/source/tutorials/multi_npu_quantization.md
@@ -36,7 +36,7 @@ see https://www.modelscope.cn/models/vllm-ascend/QwQ-32B-W8A8
 
 ```bash
 # (Optional)This tag is recommended and has been verified
-git clone https://gitee.com/ascend/msit -b modelslim-VLLM-8.1.RC1.b020
+git clone https://gitee.com/ascend/msit -b modelslim-VLLM-8.1.RC1.b020_001
 
 cd msit/msmodelslim
 # Install by run this script
@@ -68,6 +68,10 @@ The converted model files looks like:
 ```
 
 Run the following script to start the vLLM server with quantize model:
+
+:::{note}
+The value "ascend" for the "--quantization" argument will only be supported after [a specific PR](https://github.com/vllm-project/vllm-ascend/pull/877) is merged and released; for now, you can cherry-pick the commit from that PR.
+:::
 ```bash
 vllm serve /home/models/QwQ-32B-w8a8 --tensor-parallel-size 4 --served-model-name "qwq-32b-w8a8" --max-model-len 4096 --quantization ascend
 ```
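
For readers who want to act on the added note before the fix ships in a release, here is a minimal sketch of cherry-picking the change from PR #877 into a local vllm-ascend source checkout. The checkout directory, the reinstall step, and the assumption that the PR consists of a single commit are illustrative only and not taken from the diff:

```bash
# Hedged sketch: apply the change from vllm-project/vllm-ascend PR #877 locally.
# Assumes an existing source checkout named "vllm-ascend" and a single-commit PR.
cd vllm-ascend
# GitHub exposes each PR head as a fetchable ref (pull/<id>/head)
git fetch https://github.com/vllm-project/vllm-ascend.git pull/877/head
# FETCH_HEAD now points at the PR's tip commit; apply it on top of your branch
git cherry-pick FETCH_HEAD
# Reinstall from source so the cherry-picked change takes effect
# (exact build steps may differ; see the vllm-ascend installation docs)
pip install -e .
```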
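Once the server from the last hunk is up, a quick smoke test is to query vLLM's OpenAI-compatible completions endpoint. The host and port below assume vLLM's defaults (localhost:8000), which the command in the diff does not override; the model name matches the `--served-model-name` value:

```bash
# Smoke test against the quantized model served above.
# Assumes the default vLLM host/port; adjust if --host/--port were set.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwq-32b-w8a8",
        "prompt": "What is deep learning?",
        "max_tokens": 64
      }'
```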