# Installation
This document describes how to install vllm-ascend manually.
## Requirements
- OS: Linux
- Python: >= 3.9, < 3.12
- Hardware: Ascend NPU, typically the Atlas 800 A2 series.
- Software:
| Software | Supported version | Note |
|---------------|----------------------------------|-------------------------------------------|
| Ascend HDK | Refer to [here](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/releasenote/releasenote_0000.html) | Required for CANN |
| CANN | >= 8.3.RC1 | Required for vllm-ascend and torch-npu |
| torch-npu | == 2.7.1 | Required for vllm-ascend; no need to install manually, it is installed automatically in the steps below |
| torch | == 2.7.1 | Required for torch-npu and vllm |
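The Python constraint above can be checked up front. A minimal sketch (the `py_supported` helper is ours, not part of vllm-ascend):

```shell
# Check whether a Python version string falls in the documented range (>= 3.9, < 3.12).
py_supported() {
  python3 -c "import sys; v = tuple(map(int, '$1'.split('.')[:2])); print('yes' if (3, 9) <= v < (3, 12) else 'no')"
}
py_supported "$(python3 -V | awk '{print $2}')"
```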
There are two installation methods:
- **Using pip**: first prepare the environment manually or via the CANN image, then install `vllm-ascend` using pip.
- **Using docker**: use the `vllm-ascend` pre-built docker image directly.
## Configure a new environment
Before installation, make sure the firmware/driver and CANN are installed correctly; refer to the [Ascend Environment Setup Guide](https://ascend.github.io/docs/sources/ascend/quick_install.html) for more details.
### Configure hardware environment
To verify that the Ascend NPU firmware and driver were correctly installed, run:
```bash
npu-smi info
```
Refer to the [Ascend Environment Setup Guide](https://ascend.github.io/docs/sources/ascend/quick_install.html) for more details.
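As an extra sanity check, you can confirm the driver metadata file is present at the default install path (the same path is bind-mounted into containers later in this document; the `check_driver_info` helper is ours):

```shell
# Report whether the Ascend driver version file exists at the given path.
check_driver_info() {
  if [ -f "$1" ]; then echo "driver info found"; else echo "driver info missing"; fi
}
check_driver_info /usr/local/Ascend/driver/version.info
```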
### Configure software environment
:::::{tab-set}
:sync-group: install
::::{tab-item} Before using pip
:selected:
:sync: pip
The easiest way to prepare your software environment is to use the CANN image directly:
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the CANN image
export IMAGE=quay.io/ascend/cann:|cann_image_tag|
docker run --rm \
--name vllm-ascend-env \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
:::{dropdown} Click here to see "Install CANN manually"
:animate: fade-in-slide-down
You can also install CANN manually:
```bash
# Create a virtual environment.
python -m venv vllm-ascend-env
source vllm-ascend-env/bin/activate
# Install required Python packages.
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple attrs 'numpy<2.0.0' decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py wheel typing_extensions
# Download and install the CANN package.
wget --header="Referer: https://www.hiascend.com/" https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.3.RC2/Ascend-cann-toolkit_8.3.RC2_linux-"$(uname -i)".run
chmod +x ./Ascend-cann-toolkit_8.3.RC2_linux-"$(uname -i)".run
./Ascend-cann-toolkit_8.3.RC2_linux-"$(uname -i)".run --full
source /usr/local/Ascend/ascend-toolkit/set_env.sh
wget --header="Referer: https://www.hiascend.com/" https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.3.RC2/Ascend-cann-kernels-910b_8.3.RC2_linux-"$(uname -i)".run
chmod +x ./Ascend-cann-kernels-910b_8.3.RC2_linux-"$(uname -i)".run
./Ascend-cann-kernels-910b_8.3.RC2_linux-"$(uname -i)".run --install
wget --header="Referer: https://www.hiascend.com/" https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.3.RC2/Ascend-cann-nnal_8.3.RC2_linux-"$(uname -i)".run
chmod +x ./Ascend-cann-nnal_8.3.RC2_linux-"$(uname -i)".run
./Ascend-cann-nnal_8.3.RC2_linux-"$(uname -i)".run --install
source /usr/local/Ascend/nnal/atb/set_env.sh
```
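After sourcing the `set_env.sh` scripts, the toolkit environment should be visible in your shell. A small check sketch (we assume the standard CANN setup scripts, which export variables such as `ASCEND_TOOLKIT_HOME`; the `check_env_var` helper is ours):

```shell
# Print whether a CANN environment variable is set after sourcing set_env.sh.
check_env_var() {
  if [ -n "$1" ]; then echo "set"; else echo "unset"; fi
}
check_env_var "$ASCEND_TOOLKIT_HOME"
```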
:::
::::
::::{tab-item} Before using docker
:sync: docker
No extra steps are needed if you use the `vllm-ascend` prebuilt Docker image.
::::
:::::
Once this is done, you can start to set up `vllm` and `vllm-ascend`.
## Setup vllm and vllm-ascend
:::::{tab-set}
:sync-group: install
::::{tab-item} Using pip
:selected:
:sync: pip
First install system dependencies and configure the pip mirror:
```bash
# Using apt-get with mirror
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
apt-get update -y && apt-get install -y gcc g++ cmake libnuma-dev wget git curl jq
# Or using yum
# yum update -y && yum install -y gcc g++ cmake numactl-devel wget git curl jq
# Config pip mirror
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
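The `pip config set` command above writes the mirror into pip's configuration file; the equivalent file contents (the location is typically `~/.config/pip/pip.conf` or `/etc/pip.conf`) would be:

```ini
[global]
index-url = https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```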
**[Optional]** Then configure the `pip` extra index if you are working on an x86 machine or using a torch-npu dev version:
```bash
# For torch-npu dev version or x86 machine
pip config set global.extra-index-url "https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi"
```
Then you can install `vllm` and `vllm-ascend` from the **pre-built wheel**:
```{code-block} bash
:substitutions:
# Install vllm-project/vllm. The newest supported version is |vllm_version|.
# Because version |vllm_version| has not been published to PyPI, you need to install it from source.
git clone --depth 1 --branch |vllm_version| https://github.com/vllm-project/vllm
cd vllm
VLLM_TARGET_DEVICE=empty pip install -v -e .
cd ..
# Install vllm-project/vllm-ascend from pypi.
pip install vllm-ascend==|pip_vllm_ascend_version|
```
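After the install finishes, you can confirm both packages are importable. A minimal sketch (the `check_import` helper is ours; `vllm_ascend` is the module name the plugin registers under, as shown in the log output later in this document):

```shell
# Report whether a Python package is importable in the current environment.
check_import() {
  python3 -c "import importlib.util, sys; print('installed' if importlib.util.find_spec(sys.argv[1]) else 'missing')" "$1"
}
check_import vllm
check_import vllm_ascend
```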
:::{dropdown} Click here to see "Build from source code"
or build from **source code**:
```{code-block} bash
:substitutions:
# Install vLLM.
git clone --depth 1 --branch |vllm_version| https://github.com/vllm-project/vllm
cd vllm
VLLM_TARGET_DEVICE=empty pip install -v -e .
cd ..
# Install vLLM Ascend.
git clone --depth 1 --branch |vllm_ascend_version| https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -v -e .
cd ..
```
vllm-ascend builds custom operators by default. If you don't want to build them, set the `COMPILE_CUSTOM_KERNELS=0` environment variable to disable the build.
:::
```{note}
If you are building from v0.7.3-dev and intend to use the sleep mode feature, set `COMPILE_CUSTOM_KERNELS=1` manually.
Building the custom operators requires gcc/g++ newer than 8 and C++17 or higher. If you use `pip install -e .` and encounter a torch-npu version conflict, install with `pip install --no-build-isolation -e .` to build against the system environment.
If you encounter other problems during compilation, an unexpected compiler is probably being used; export `CXX_COMPILER` and `C_COMPILER` in the environment to specify your g++ and gcc locations before compiling.
```
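The compiler requirement in the note above can be checked before building. A small sketch that compares a `gcc -dumpversion`-style string against the minimum (the `gcc_ok` helper is ours):

```shell
# Check whether a gcc/g++ major version is above the required minimum (8).
gcc_ok() {
  major=${1%%.*}
  if [ "$major" -gt 8 ]; then echo "ok"; else echo "too old"; fi
}
gcc_ok "$(gcc -dumpversion 2>/dev/null || echo 0)"
```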
::::
::::{tab-item} Using docker
:sync: docker
You can just pull the **prebuilt image** and run it with bash.
:::{dropdown} Click here to see "Build from Dockerfile"
or build the image from **source code**:
```bash
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
docker build -t vllm-ascend-dev-image:latest -f ./Dockerfile .
```
:::
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend-env \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
The default workdir is `/workspace`. The vLLM and vLLM Ascend code is placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html) (`pip install -e`), so developers can pick up code changes immediately without reinstalling.
::::
:::::
## Extra information
### Verify installation
Create and run a simple inference test. For example, `example.py` can look like:
```python
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
Then run:
```bash
# Try `export VLLM_USE_MODELSCOPE=true` and `pip install modelscope`
# to speed up download if huggingface is not reachable.
python example.py
```
The output will look like this:
```bash
INFO 02-18 08:49:58 __init__.py:28] Available plugins for group vllm.platform_plugins:
INFO 02-18 08:49:58 __init__.py:30] name=ascend, value=vllm_ascend:register
INFO 02-18 08:49:58 __init__.py:32] all available plugins for group vllm.platform_plugins will be loaded.
INFO 02-18 08:49:58 __init__.py:34] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 02-18 08:49:58 __init__.py:42] plugin ascend loaded.
INFO 02-18 08:49:58 __init__.py:174] Platform plugin ascend is activated
INFO 02-18 08:50:12 config.py:526] This model supports multiple tasks: {'embed', 'classify', 'generate', 'score', 'reward'}. Defaulting to 'generate'.
INFO 02-18 08:50:12 llm_engine.py:232] Initializing a V0 LLM engine (v0.7.1) with config: model='./Qwen2.5-0.5B-Instruct', speculative_config=None, tokenizer='./Qwen2.5-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=./Qwen2.5-0.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.86it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.85it/s]
INFO 02-18 08:50:24 executor_base.py:108] # NPU blocks: 35064, # CPU blocks: 2730
INFO 02-18 08:50:24 executor_base.py:113] Maximum concurrency for 32768 tokens per request: 136.97x
INFO 02-18 08:50:25 llm_engine.py:429] init engine (profile, create kv cache, warmup model) took 3.87 seconds
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  8.46it/s, est. speed input: 46.55 toks/s, output: 135.41 toks/s]
Prompt: 'Hello, my name is', Generated text: " Shinji, a teenage boy from New York City. I'm a computer science"
Prompt: 'The president of the United States is', Generated text: ' a very important person. When he or she is elected, many people think that'
Prompt: 'The capital of France is', Generated text: ' Paris. The oldest part of the city is Saint-Germain-des-Pr'
Prompt: 'The future of AI is', Generated text: ' not bright\n\nThere is no doubt that the evolution of AI will have a huge'
```