v0.10.1rc1

2025-09-09 09:40:35 +08:00
parent d6f6ef41fe
commit 9149384e03
432 changed files with 84698 additions and 1 deletion

View File

@@ -0,0 +1,18 @@
# Tutorials
:::{toctree}
:caption: Deployment
:maxdepth: 1
single_npu
single_npu_multimodal
single_npu_audio
single_npu_qwen3_embedding
single_npu_qwen3_quantization
multi_npu
multi_npu_moge
multi_npu_qwen3_moe
multi_npu_quantization
single_node_300i
multi_node
multi_node_kimi
:::

View File

@@ -0,0 +1,207 @@
# Multi-Node-DP (DeepSeek)
## Getting Started
vLLM-Ascend now supports Data Parallel (DP) deployment, enabling model weights to be replicated across multiple NPUs or instances, each processing independent batches of requests. This is particularly useful for scaling throughput across devices while maintaining high resource utilization.
Each DP rank is deployed as a separate “core engine” process which communicates with front-end process(es) via ZMQ sockets. Data Parallel can be combined with Tensor Parallel, in which case each DP engine owns a number of per-NPU worker processes equal to the TP size.
For Mixture-of-Experts (MoE) models — especially advanced architectures like DeepSeek that utilize Multi-head Latent Attention (MLA) — a hybrid parallelism approach is recommended:
- Use **Data Parallelism (DP)** for attention layers, which are replicated across devices and handle separate batches.
- Use **Expert or Tensor Parallelism (EP/TP)** for expert layers, which are sharded across devices to distribute the computation.
This division allows the attention layers to be replicated across Data Parallel (DP) ranks and to process different batches independently, while the expert layers are partitioned (sharded) across devices using Expert or Tensor Parallelism, maximizing hardware utilization and efficiency.
In these cases, the data parallel ranks are not completely independent: forward passes must be aligned, and the expert layers across all ranks are required to synchronize during every forward pass, even if there are fewer requests to be processed than DP ranks.
For MoE models, when any requests are in progress in any rank, we must ensure that empty “dummy” forward passes are performed in all ranks which don't currently have any requests scheduled. This is handled via a separate DP `Coordinator` process which communicates with all of the ranks, and a collective operation is performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, the expert layers form an EP or TP group of size (DP x TP).
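The following is a minimal, self-contained sketch of this idle-detection idea; it is **not** vLLM's actual implementation, only an illustration of why idle ranks still need to join every collective step.
```python
# Sketch only: illustrates the "dummy forward pass" idea described above, not vLLM's code.
import os
import torch
import torch.distributed as dist

def dp_step(has_requests: bool) -> None:
    # 1 if this DP rank has real work this step, 0 otherwise.
    flag = torch.tensor([1 if has_requests else 0])
    # Every rank must participate; the max tells all ranks whether any rank has work.
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    if flag.item() == 0:
        print("all ranks idle -> engines could pause here")
    elif has_requests:
        print("run a real forward pass on this rank's batch")
    else:
        print("run an empty dummy forward pass so expert layers stay aligned")

if __name__ == "__main__":
    # Single-process demo with the gloo backend; in a real deployment each DP rank
    # is a separate core-engine process.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)
    dp_step(has_requests=True)
    dp_step(has_requests=False)
    dist.destroy_process_group()
```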
## Verify Multi-Node Communication Environment
### Physical Layer Requirements:
- The physical machines must be located on the same WLAN, with network connectivity.
- All NPUs are connected with optical modules, and the connection status must be normal.
### Verification Process:
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
```
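If you prefer to script these checks, the snippet below is a small sketch that wraps the same `hccn_tool` commands (assuming the tool is available on the node) and fails fast when any of the 8 ports is not `UP` or reports an unhealthy network.
```python
# Sketch: wraps the hccn_tool checks above and stops at the first unhealthy NPU port.
import subprocess

def hccn(npu_id: int, flag: str) -> str:
    result = subprocess.run(["hccn_tool", "-i", str(npu_id), flag, "-g"],
                            capture_output=True, text=True, check=True)
    return result.stdout

for i in range(8):
    link = hccn(i, "-link")
    health = hccn(i, "-net_health")
    if "UP" not in link:
        raise SystemExit(f"NPU {i}: link is not UP:\n{link}")
    if "success" not in health.lower():
        raise SystemExit(f"NPU {i}: net_health check failed:\n{health}")
print("All 8 NPU ports are UP and the network health checks passed")
```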
### NPU Interconnect Verification:
#### 1. Get NPU IP Addresses
```bash
for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
```
#### 2. Cross-Node PING Test
```bash
# Execute on the target node (replace with actual IP)
hccn_tool -i 0 -ping -g address 10.20.0.20
```
## Run with docker
Assume you have two Atlas 800 A2 (64G*8) nodes and want to deploy the `deepseek-v3.1-w8a8` quantized model across multiple nodes.
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: if you are running a bridge network with docker, please expose the ports needed for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/sfs_turbo/.cache:/root/.cache \
-it $IMAGE bash
```
Run the following scripts on the two nodes respectively.
:::{note}
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::
**node0**
```shell
#!/bin/sh
# local_ip is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip
nic_name="xxxx"
local_ip="xxxx"
export VLLM_USE_MODELSCOPE=True
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export HCCL_BUFFSIZE=1024
# The w8a8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-W8A8
# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 4 \
--seed 1024 \
--served-model-name deepseek_v3.1 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 32768 \
--quantization ascend \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
**node1**
```shell
#!/bin/sh
nic_name="xxx"
local_ip="xxx"
export VLLM_USE_MODELSCOPE=True
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--headless \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-start-rank 2 \
--data-parallel-address { node0 ip } \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 4 \
--seed 1024 \
--quantization ascend \
--served-model-name deepseek_v3.1 \
--max-num-seqs 16 \
--max-model-len 32768 \
--max-num-batched-tokens 4096 \
--enable-expert-parallel \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
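Before starting `vllm serve` on each node, a quick pre-flight check like the hypothetical helper below (run in the same shell after the exports above) can confirm the multi-node environment variables are actually set.
```python
# Hypothetical pre-flight check, not part of vllm-ascend: verifies the exported variables.
import os

required = ["HCCL_IF_IP", "GLOO_SOCKET_IFNAME", "TP_SOCKET_IFNAME",
            "HCCL_SOCKET_IFNAME", "HCCL_BUFFSIZE"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {missing}")
print({name: os.environ[name] for name in required})
```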
The deployment view looks like this:
![Multi-node DP deployment view (DeepSeek)](../assets/multi_node_dp_deepseek.png)
Once your server is started, you can query the model with input prompts:
```shell
curl http://{ node0 ip }:8004/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek_v3.1",
"prompt": "The future of AI is",
"max_tokens": 50,
"temperature": 0
}'
```
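You can also send the same request with the OpenAI-compatible Python client; the sketch below assumes `openai` is installed and `NODE0_IP` is a placeholder for the actual node0 address.
```python
# Sketch: same completion request via the OpenAI-compatible client.
# NODE0_IP is a placeholder; vLLM does not check the api_key, so any value works.
from openai import OpenAI

client = OpenAI(base_url="http://NODE0_IP:8004/v1", api_key="EMPTY")
completion = client.completions.create(
    model="deepseek_v3.1",
    prompt="The future of AI is",
    max_tokens=50,
    temperature=0,
)
print(completion.choices[0].text)
```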
## Run benchmarks
For details, please refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model vllm-ascend/DeepSeek-V3.1-W8A8 --served-model-name deepseek_v3.1 \
--dataset-name random --random-input-len 128 --random-output-len 128 \
--num-prompts 200 --trust-remote-code --base-url "http://{ node0 ip }:8004" --request-rate 1
```

View File

@@ -0,0 +1,153 @@
# Multi-Node-DP (Kimi-K2)
## Verify Multi-Node Communication Environment
Refer to the verification process in [multi_node.md](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html#verification-process).
## Run with docker
Assume you have two Atlas 800 A3 (64G*16) nodes (or four A2 nodes) and want to deploy the `Kimi-K2-Instruct-W8A8` quantized model across multiple nodes.
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: if you are running a bridge network with docker, please expose the ports needed for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/sfs_turbo/.cache:/home/cache \
-it $IMAGE bash
```
Run the following scripts on the two nodes respectively.
:::{note}
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::
**node0**
```shell
#!/bin/sh
# local_ip is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip
nic_name="xxxx"
local_ip="xxxx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
# The w8a8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8
# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 4 \
--api-server-count 2 \
--data-parallel-size-local 2 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 13389 \
--seed 1024 \
--served-model-name kimi \
--quantization ascend \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 32768 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
**node1**
```shell
#!/bin/sh
nic_name="xxxx"
local_ip="xxxx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--headless \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-start-rank 2 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--seed 1024 \
--tensor-parallel-size 8 \
--served-model-name kimi \
--max-num-seqs 16 \
--max-model-len 32768 \
--quantization ascend \
--max-num-batched-tokens 4096 \
--enable-expert-parallel \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
The deployment view looks like this:
![Multi-node DP deployment view (Kimi-K2)](../assets/multi_node_dp_kimi.png)
Once your server is started, you can query the model with input prompts:
```shell
curl http://{ node0 ip }:8004/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "kimi",
"prompt": "The future of AI is",
"max_tokens": 50,
"temperature": 0
}'
```

View File

@@ -0,0 +1,107 @@
# Multi-NPU (QwQ 32B)
## Run vllm-ascend on Multi-NPU
Run docker container:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Setup environment variables:
```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
### Online Inference on Multi-NPU
Run the following script to start the vLLM server on Multi-NPU:
```bash
vllm serve Qwen/QwQ-32B --max-model-len 4096 --port 8000 -tp 4
```
Once your server is started, you can query the model with input prompts
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/QwQ-32B",
"prompt": "QwQ-32B是什么",
"max_tokens": "128",
"top_p": "0.95",
"top_k": "40",
"temperature": "0.6"
}'
```
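If you want token-by-token output instead of a single response, the sketch below shows a streaming variant of the same request using the OpenAI-compatible Python client (assuming `openai` is installed).
```python
# Sketch: streaming completion against the server started above on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
stream = client.completions.create(
    model="Qwen/QwQ-32B",
    prompt="QwQ-32B是什么",
    max_tokens=128,
    temperature=0.6,
    top_p=0.95,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
print()
```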
### Offline Inference on Multi-NPU
Run the following script to execute offline inference on multi-NPU:
```python
import gc
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="Qwen/QwQ-32B",
tensor_parallel_size=4,
distributed_executor_backend="mp",
max_model_len=4096)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
If you run this script successfully, you can see the info shown below:
```bash
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
```

View File

@@ -0,0 +1,242 @@
# Multi-NPU (Pangu Pro MoE)
## Run vllm-ascend on Multi-NPU
Run container:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Setup environment variables:
```bash
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
Download the model:
```bash
git lfs install
git clone https://gitcode.com/ascend-tribe/pangu-pro-moe-model.git
```
### Online Inference on Multi-NPU
Run the following script to start the vLLM server on Multi-NPU:
```bash
vllm serve /path/to/pangu-pro-moe-model \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--trust-remote-code \
--enforce-eager
```
Once your server is started, you can query the model with input prompts:
:::::{tab-set}
::::{tab-item} v1/completions
```{code-block} bash
:substitutions:
export question="你是谁?"
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "[unused9]系统:[unused10][unused9]用户:'${question}'[unused10][unused9]助手:",
"max_tokens": 64,
"top_p": 0.95,
"top_k": 50,
"temperature": 0.6
}'
```
::::
::::{tab-item} v1/chat/completions
```{code-block} bash
:substitutions:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": ""},
{"role": "user", "content": "你是谁?"}
],
"max_tokens": "64",
"top_p": "0.95",
"top_k": "50",
"temperature": "0.6",
"add_special_tokens" : true
}'
```
::::
:::::
If you run this successfully, you can see the info shown below:
```json
{"id":"cmpl-2cd4223228ab4be9a91f65b882e65b32","object":"text_completion","created":1751255067,"model":"/root/.cache/pangu-pro-moe-model","choices":[{"index":0,"text":" [unused16] 好的用户问我是谁我需要根据之前的设定来回答。用户提到我是华为开发的“盘古Reasoner”属于盘古大模型系列作为智能助手帮助解答问题和提供 信息支持。现在用户再次询问,可能是在确认我的身份或者测试我的回答是否一致。\n\n首先我要确保","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":15,"total_tokens":79,"completion_tokens":64,"prompt_tokens_details":null},"kv_transfer_params":null}
```
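If you build requests from Python instead of shell, a small helper like the hypothetical one below reproduces the prompt format used in the `v1/completions` example above (the offline examples below use the official chat template via the tokenizer instead).
```python
# Hypothetical helper: mirrors the prompt format shown in the curl example above.
def build_pangu_prompt(question: str, system: str = "") -> str:
    return f"[unused9]系统:{system}[unused10][unused9]用户:{question}[unused10][unused9]助手:"

print(build_pangu_prompt("你是谁?"))
```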
### Offline Inference on Multi-NPU
Run the following script to execute offline inference on multi-NPU:
:::::{tab-set}
::::{tab-item} Graph Mode
```{code-block} python
:substitutions:
import gc
from transformers import AutoTokenizer
import torch
import os
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
if __name__ == "__main__":
tokenizer = AutoTokenizer.from_pretrained("/path/to/pangu-pro-moe-model", trust_remote_code=True)
tests = [
"Hello, my name is",
"The future of AI is",
]
prompts = []
for text in tests:
messages = [
{"role": "system", "content": ""}, # Optionally customize system content
{"role": "user", "content": text}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompts.append(prompt)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="/path/to/pangu-pro-moe-model",
tensor_parallel_size=4,
enable_expert_parallel=True,
distributed_executor_backend="mp",
max_model_len=1024,
trust_remote_code=True,
additional_config={
'torchair_graph_config': {
'enabled': True,
},
'ascend_scheduler_config':{
'enabled': True,
'enable_chunked_prefill' : False,
'chunked_prefill_enabled': False
},
})
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
::::
::::{tab-item} Eager Mode
```{code-block} python
:substitutions:
import gc
from transformers import AutoTokenizer
import torch
import os
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
if __name__ == "__main__":
tokenizer = AutoTokenizer.from_pretrained("/path/to/pangu-pro-moe-model", trust_remote_code=True)
tests = [
"Hello, my name is",
"The future of AI is",
]
prompts = []
for text in tests:
messages = [
{"role": "system", "content": ""}, # Optionally customize system content
{"role": "user", "content": text}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompts.append(prompt)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="/path/to/pangu-pro-moe-model",
tensor_parallel_size=4,
distributed_executor_backend="mp",
max_model_len=1024,
trust_remote_code=True,
enforce_eager=True)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
::::
:::::
If you run this script successfully, you can see the info shown below:
```bash
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
```

View File

@@ -0,0 +1,137 @@
# Multi-NPU (QwQ 32B W8A8)
## Run docker container
:::{note}
The w8a8 quantization feature is supported by v0.8.4rc2 or higher.
:::
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
## Install modelslim and convert model
:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded; see https://www.modelscope.cn/models/vllm-ascend/QwQ-32B-W8A8
:::
```bash
# (Optional) This tag is recommended and has been verified
git clone https://gitee.com/ascend/msit -b modelslim-VLLM-8.1.RC1.b020_001
cd msit/msmodelslim
# Install by run this script
bash install.sh
pip install accelerate
cd example/Qwen
# Original weight path. Replace with your local model path.
MODEL_PATH=/home/models/QwQ-32B
# Path to save the converted weights. Replace with your local path.
SAVE_PATH=/home/models/QwQ-32B-w8a8
# The NPU device is not required for this conversion; you can also set --device_type cpu to convert on CPU.
python3 quant_qwen.py --model_path $MODEL_PATH --save_directory $SAVE_PATH --calib_file ../common/boolq.jsonl --w_bit 8 --a_bit 8 --device_type npu --anti_method m1 --trust_remote_code True
```
## Verify the quantized model
The converted model directory looks like this:
```bash
.
|-- config.json
|-- configuration.json
|-- generation_config.json
|-- quant_model_description.json
|-- quant_model_weight_w8a8.safetensors
|-- README.md
|-- tokenizer.json
`-- tokenizer_config.json
```
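A quick sanity check such as the sketch below (adjust `SAVE_PATH` to the directory you used) can confirm the expected files are present before serving.
```python
# Sketch: verify the converted weight directory matches the layout shown above.
import json
import os

SAVE_PATH = "/home/models/QwQ-32B-w8a8"
expected = ["config.json", "quant_model_description.json",
            "quant_model_weight_w8a8.safetensors", "tokenizer.json", "tokenizer_config.json"]
missing = [name for name in expected if not os.path.exists(os.path.join(SAVE_PATH, name))]
if missing:
    raise SystemExit(f"Missing files in {SAVE_PATH}: {missing}")
with open(os.path.join(SAVE_PATH, "quant_model_description.json")) as f:
    print(f"quant_model_description.json contains {len(json.load(f))} entries")
```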
Run the following script to start the vLLM server with quantized model:
:::{note}
The value "ascend" for the "--quantization" argument will be supported after [a specific PR](https://github.com/vllm-project/vllm-ascend/pull/877) is merged and released; you can cherry-pick this commit for now.
:::
```bash
vllm serve /home/models/QwQ-32B-w8a8 --tensor-parallel-size 4 --served-model-name "qwq-32b-w8a8" --max-model-len 4096 --quantization ascend
```
Once your server is started, you can query the model with input prompts
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwq-32b-w8a8",
"prompt": "what is large language model?",
"max_tokens": "128",
"top_p": "0.95",
"top_k": "40",
"temperature": "0.0"
}'
```
Run the following script to execute offline inference on multi-NPU with quantized model:
:::{note}
To enable quantization on Ascend, the quantization method must be "ascend".
:::
```python
import gc
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="/home/models/QwQ-32B-w8a8",
tensor_parallel_size=4,
distributed_executor_backend="mp",
max_model_len=4096,
quantization="ascend")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```

View File

@@ -0,0 +1,109 @@
# Multi-NPU (Qwen3-30B-A3B)
## Run vllm-ascend on Multi-NPU with Qwen3 MoE
Run docker container:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Setup environment variables:
```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
### Online Inference on Multi-NPU
Run the following script to start the vLLM server on Multi-NPU:
For an Atlas A2 with 64GB of NPU card memory, tensor-parallel-size should be at least 2, and for 32GB of memory, tensor-parallel-size should be at least 4.
```bash
vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 4 --enable_expert_parallel
```
Once your server is started, you can query the model with input prompts
```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-30B-A3B",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"max_tokens": 4096
}'
```
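The same chat request can be issued with the OpenAI-compatible Python client; this sketch assumes `openai` is installed and passes `top_k` through `extra_body`, since the OpenAI SDK does not expose it as a named argument.
```python
# Sketch: chat completion via the OpenAI-compatible client against localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=[{"role": "user",
               "content": "Give me a short introduction to large language models."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=4096,
    extra_body={"top_k": 20},
)
print(response.choices[0].message.content)
```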
### Offline Inference on Multi-NPU
Run the following script to execute offline inference on multi-NPU:
```python
import gc
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="Qwen/Qwen3-30B-A3B",
tensor_parallel_size=4,
distributed_executor_backend="mp",
max_model_len=4096,
enable_expert_parallel=True)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
If you run this script successfully, you can see the info shown below:
```bash
Prompt: 'Hello, my name is', Generated text: " Lucy. I'm from the UK and I'm 11 years old."
Prompt: 'The future of AI is', Generated text: ' a topic that has captured the imagination of scientists, philosophers, and the general public'
```

View File

@@ -0,0 +1,406 @@
# Single Node (Atlas 300I series)
```{note}
1. Atlas 300I series support is currently experimental. In future versions, there may be behavioral changes around model coverage and performance.
2. Currently, the 310I series only supports eager mode and the data type is float16.
```
## Run vLLM on Atlas 300I series
Run docker container:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-310p
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Setup environment variables:
```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
### Online Inference on NPU
Run the following script to start the vLLM server on NPU (Qwen3-0.6B: 1 card, Qwen2.5-7B-Instruct: 2 cards, Pangu-Pro-MoE-72B: 8 cards):
:::::{tab-set}
:sync-group: inference
::::{tab-item} Qwen3-0.6B
:selected:
:sync: qwen0.6
Run the following command to start the vLLM server:
```{code-block} bash
:substitutions:
vllm serve Qwen/Qwen3-0.6B \
--tensor-parallel-size 1 \
--enforce-eager \
--dtype float16 \
--compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
```
Once your server is started, you can query the model with input prompts
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "The future of AI is",
"max_tokens": 64,
"top_p": 0.95,
"top_k": 50,
"temperature": 0.6
}'
```
::::
::::{tab-item} Qwen2.5-7B-Instruct
:sync: qwen7b
Run the following command to start the vLLM server:
```{code-block} bash
:substitutions:
vllm serve Qwen/Qwen2.5-7B-Instruct \
--tensor-parallel-size 2 \
--enforce-eager \
--dtype float16 \
--compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
```
Once your server is started, you can query the model with input prompts
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "The future of AI is",
"max_tokens": 64,
"top_p": 0.95,
"top_k": 50,
"temperature": 0.6
}'
```
::::
::::{tab-item} Qwen2.5-VL-3B-Instruct
:sync: qwen-vl-2.5-3b
Run the following command to start the vLLM server:
```{code-block} bash
:substitutions:
vllm serve Qwen/Qwen2.5-VL-3B-Instruct \
--tensor-parallel-size 1 \
--enforce-eager \
--dtype float16 \
--compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
```
Once your server is started, you can query the model with input prompts
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "The future of AI is",
"max_tokens": 64,
"top_p": 0.95,
"top_k": 50,
"temperature": 0.6
}'
```
::::
::::{tab-item} Pangu-Pro-MoE-72B
:sync: pangu
Download the model:
```bash
git lfs install
git clone https://gitcode.com/ascend-tribe/pangu-pro-moe-model.git
```
Run the following command to start the vLLM server:
```{code-block} bash
:substitutions:
vllm serve /home/pangu-pro-moe-mode/ \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--dtype "float16" \
--trust-remote-code \
--enforce-eager
```
Once your server is started, you can query the model with input prompts
```bash
export question="你是谁?"
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "[unused9]系统:[unused10][unused9]用户:'${question}'[unused10][unused9]助手:",
"max_tokens": 64,
"top_p": 0.95,
"top_k": 50,
"temperature": 0.6
}'
```
::::
:::::
If the server starts and the query succeeds, you will see the generated results.
### Offline Inference
Run the following script (`example.py`) to execute offline inference on NPU:
:::::{tab-set}
:sync-group: inference
::::{tab-item} Qwen3-0.6B
:selected:
:sync: qwen0.6
```{code-block} python
:substitutions:
import gc
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
prompts = [
"Hello, my name is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(
model="Qwen/Qwen3-0.6B",
tensor_parallel_size=1,
enforce_eager=True, # For 300I series, only eager mode is supported.
dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 300I series
compilation_config={"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}, # High performance for 300I series
)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
::::
::::{tab-item} Qwen2.5-7B-Instruct
:sync: qwen7b
```{code-block} python
:substitutions:
import gc
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
prompts = [
"Hello, my name is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(
model="Qwen/Qwen2.5-7B-Instruct",
tensor_parallel_size=2,
enforce_eager=True, # For 300I series, only eager mode is supported.
dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 300I series
compilation_config={"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}, # High performance for 300I series
)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
::::
::::{tab-item} Qwen2.5-VL-3B-Instruct
:sync: qwen-vl-2.5-3b
```{code-block} python
:substitutions:
import gc
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
prompts = [
"Hello, my name is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, top_p=0.95, top_k=50, temperature=0.6)
# Create an LLM.
llm = LLM(
model="Qwen/Qwen2.5-VL-3B-Instruct",
tensor_parallel_size=1,
enforce_eager=True, # For 300I series, only eager mode is supported.
dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 300I series
compilation_config={"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}, # High performance for 300I series
)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
::::
::::{tab-item} Pangu-Pro-MoE-72B
:sync: pangu
Download the model:
```bash
git lfs install
git clone https://gitcode.com/ascend-tribe/pangu-pro-moe-model.git
```
```{code-block} python
:substitutions:
import gc
from transformers import AutoTokenizer
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
if __name__ == "__main__":
tokenizer = AutoTokenizer.from_pretrained("/home/pangu-pro-moe-mode/", trust_remote_code=True)
tests = [
"Hello, my name is",
"The future of AI is",
]
prompts = []
for text in tests:
messages = [
{"role": "system", "content": ""}, # Optionally customize system content
{"role": "user", "content": text}
]
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) # It is recommended to use the official chat template
prompts.append(prompt)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="/home/pangu-pro-moe-mode/",
tensor_parallel_size=8,
distributed_executor_backend="mp",
enable_expert_parallel=True,
dtype="float16",
max_model_len=1024,
trust_remote_code=True,
enforce_eager=True)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
::::
:::::
Run script:
```bash
python example.py
```
If you run this script successfully, you can see the info shown below:
```bash
Prompt: 'Hello, my name is', Generated text: " Lina. I'm a 22-year-old student from China. I'm interested in studying in the US. I'm looking for a job in the US. I want to know if there are any opportunities in the US for me to work. I'm also interested in the culture and lifestyle in the US. I want to know if there are any opportunities for me to work in the US. I'm also interested in the culture and lifestyle in the US. I'm interested in the culture"
Prompt: 'The future of AI is', Generated text: " not just about the technology itself, but about how we use it to solve real-world problems. As AI continues to evolve, it's important to consider the ethical implications of its use. AI has the potential to bring about significant changes in society, but it also has the power to create new challenges. Therefore, it's crucial to develop a comprehensive approach to AI that takes into account both the benefits and the risks associated with its use. This includes addressing issues such as bias, privacy, and accountability."
```

View File

@@ -0,0 +1,202 @@
# Single NPU (Qwen3 8B)
## Run vllm-ascend on Single NPU
### Offline Inference on Single NPU
Run docker container:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Setup environment variables:
```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
:::{note}
`max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html).
:::
Run the following script to execute offline inference on a single NPU:
:::::{tab-set}
::::{tab-item} Graph Mode
```{code-block} python
:substitutions:
import os
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
model="Qwen/Qwen3-8B",
max_model_len=26240
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
::::
::::{tab-item} Eager Mode
```{code-block} python
:substitutions:
import os
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
model="Qwen/Qwen3-8B",
max_model_len=26240,
enforce_eager=True
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
::::
:::::
If you run this script successfully, you can see the info shown below:
```bash
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
```
### Online Serving on Single NPU
Run docker container to start the vLLM server on a single NPU:
:::::{tab-set}
::::{tab-item} Graph Mode
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen3-8B --max_model_len 26240
```
::::
::::{tab-item} Eager Mode
```{code-block} bash
:substitutions:
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen3-8B --max_model_len 26240 --enforce-eager
```
::::
:::::
:::{note}
Add the `--max_model_len` option to avoid a ValueError caused by the model's max seq len (32768) being larger than the maximum number of tokens that can be stored in the KV cache (26240). This will differ across NPU series based on the HBM size. Please set a suitable value for your NPU series.
:::
If your service starts successfully, you can see the info shown below:
```bash
INFO: Started server process [6873]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
Once your server is started, you can query the model with input prompts:
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-8B",
"prompt": "The future of AI is",
"max_tokens": 7,
"temperature": 0
}'
```
If you query the server successfully, you can see the info shown below (client):
```bash
{"id":"cmpl-b25a59a2f985459781ce7098aeddfda7","object":"text_completion","created":1739523925,"model":"Qwen/Qwen3-8B","choices":[{"index":0,"text":" here. Its not just a","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7,"prompt_tokens_details":null}}
```
Logs of the vllm server:
```bash
INFO: 172.17.0.1:49518 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-13 08:34:35 logger.py:39] Received request cmpl-574f00e342904692a73fb6c1c986c521-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [23729, 12879, 374, 264], lora_request: None, prompt_adapter_request: None.
```

View File

@@ -0,0 +1,122 @@
# Single NPU (Qwen2-Audio 7B)
## Run vllm-ascend on Single NPU
### Offline Inference on Single NPU
Run docker container:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Setup environment variables:
```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
:::{note}
`max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html).
:::
Install packages required for audio processing:
```bash
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install librosa soundfile
```
Run the following script to execute offline inference on a single NPU:
```python
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
from vllm.utils import FlexibleArgumentParser
# If network issues prevent AudioAsset from fetching remote audio files, retry or check your network.
audio_assets = [AudioAsset("mary_had_lamb"), AudioAsset("winning_call")]
question_per_audio_count = {
1: "What is recited in the audio?",
2: "What sport and what nursery rhyme are referenced?"
}
def prepare_inputs(audio_count: int):
audio_in_prompt = "".join([
f"Audio {idx+1}: <|audio_bos|><|AUDIO|><|audio_eos|>\n"
for idx in range(audio_count)
])
question = question_per_audio_count[audio_count]
prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
"<|im_start|>user\n"
f"{audio_in_prompt}{question}<|im_end|>\n"
"<|im_start|>assistant\n")
mm_data = {
"audio":
[asset.audio_and_sample_rate for asset in audio_assets[:audio_count]]
}
# Merge text prompt and audio data into inputs
inputs = {"prompt": prompt, "multi_modal_data": mm_data}
return inputs
def main(audio_count: int):
# NOTE: The default `max_num_seqs` and `max_model_len` may result in OOM on
# lower-end GPUs.
# Unless specified, these settings have been tested to work on a single L4.
# `limit_mm_per_prompt`: the max num items for each modality per prompt.
llm = LLM(model="Qwen/Qwen2-Audio-7B-Instruct",
max_model_len=4096,
max_num_seqs=5,
limit_mm_per_prompt={"audio": audio_count})
inputs = prepare_inputs(audio_count)
sampling_params = SamplingParams(temperature=0.2,
max_tokens=64,
stop_token_ids=None)
outputs = llm.generate(inputs, sampling_params=sampling_params)
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
if __name__ == "__main__":
audio_count = 2
main(audio_count)
```
If you run this script successfully, you can see the info shown below:
```bash
The sport referenced is baseball, and the nursery rhyme is 'Mary Had a Little Lamb'.
```
### Online Serving on Single NPU
Currently, vLLM's OpenAI-compatible server doesn't support audio inputs; find more details [<u>here</u>](https://github.com/vllm-project/vllm/issues/19977).

View File

@@ -0,0 +1,192 @@
# Single NPU (Qwen2.5-VL 7B)
## Run vllm-ascend on Single NPU
### Offline Inference on Single NPU
Run docker container:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Setup environment variables:
```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
:::{note}
`max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html).
:::
Install `qwen_vl_utils`, then run the following script to execute offline inference on a single NPU:
```bash
pip install qwen_vl_utils --extra-index-url https://download.pytorch.org/whl/cpu/
```
```python
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info
MODEL_PATH = "Qwen/Qwen2.5-VL-7B-Instruct"
llm = LLM(
model=MODEL_PATH,
max_model_len=16384,
limit_mm_per_prompt={"image": 10},
)
sampling_params = SamplingParams(
max_tokens=512
)
image_messages = [
{"role": "system", "content": "You are a helpful assistant."},
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
"min_pixels": 224 * 224,
"max_pixels": 1280 * 28 * 28,
},
{"type": "text", "text": "Please provide a detailed description of this image"},
],
},
]
messages = image_messages
processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)
mm_data = {}
if image_inputs is not None:
mm_data["image"] = image_inputs
llm_inputs = {
"prompt": prompt,
"multi_modal_data": mm_data,
}
outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
```
If you run this script successfully, you can see the info shown below:
```bash
The image displays a logo consisting of two main elements: a stylized geometric design and a pair of text elements.
1. **Geometric Design**: On the left side of the image, there is a blue geometric design that appears to be made up of interconnected shapes. These shapes resemble a network or a complex polygonal structure, possibly hinting at a technological or interconnected theme. The design is monochromatic and uses only blue as its color, which could be indicative of a specific brand or company.
2. **Text Elements**: To the right of the geometric design, there are two lines of text. The first line reads "TONGYI" in a sans-serif font, with the "YI" part possibly being capitalized. The second line reads "Qwen" in a similar sans-serif font, but in a smaller size.
The overall design is modern and minimalist, with a clear contrast between the geometric and textual elements. The use of blue for the geometric design could suggest themes of technology, connectivity, or innovation, which are common associations with the color blue in branding. The simplicity of the design makes it easily recognizable and memorable.
```
### Online Serving on Single NPU
Run docker container to start the vLLM server on a single NPU:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--dtype bfloat16 \
--max_model_len 16384 \
--max-num-batched-tokens 16384
```
:::{note}
Add the `--max_model_len` option to avoid a ValueError caused by the Qwen2.5-VL-7B-Instruct model's max seq len (128000) being larger than the maximum number of tokens that can be stored in the KV cache. This will differ across NPU series based on the HBM size. Please set a suitable value for your NPU series.
:::
If your service starts successfully, you can see the info shown below:
```bash
INFO: Started server process [2736]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
Once your server is started, you can query the model with input prompts:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
{"type": "text", "text": "What is the text in the illustrate?"}
]}
]
}'
```
If you query the server successfully, you can see the info shown below (client):
```bash
{"id":"chatcmpl-f04fb20e79bb40b39b8ed7fdf5bd613a","object":"chat.completion","created":1741749149,"model":"Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"The text in the illustration reads \"TONGYI Qwen.\"","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":74,"total_tokens":89,"completion_tokens":15,"prompt_tokens_details":null},"prompt_logprobs":null}
```
Logs of the vllm server:
```bash
INFO 03-12 11:16:50 logger.py:39] Received request chatcmpl-92148a41eca64b6d82d3d7cfa5723aeb: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>\nWhat is the text in the illustrate?<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16353, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 03-12 11:16:50 engine.py:280] Added request chatcmpl-92148a41eca64b6d82d3d7cfa5723aeb.
INFO: 127.0.0.1:54004 - "POST /v1/chat/completions HTTP/1.1" 200 OK
```
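The same image-plus-text request can also be sent from Python with the OpenAI-compatible client (a sketch, assuming `openai` is installed and the server above is running on localhost:8000).
```python
# Sketch: multimodal chat completion with an image URL via the OpenAI-compatible client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
            {"type": "text", "text": "What is the text in the illustrate?"},
        ]},
    ],
)
print(response.choices[0].message.content)
```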

View File

@@ -0,0 +1,99 @@
# Single NPU (Qwen3-Embedding-8B)
The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embedding and reranking models in various sizes (0.6B, 4B, and 8B). This guide describes how to run the model with vLLM Ascend. Note that the model is supported only by vLLM Ascend v0.9.2rc1 and higher.
## Run docker container
Taking the Qwen3-Embedding-8B model as an example, first run the docker container with the following command:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Setup environment variables:
```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
### Online Inference
```bash
vllm serve Qwen/Qwen3-Embedding-8B --task embed
```
Once your server is started, you can query the model with input prompts:
```bash
curl http://localhost:8000/v1/embeddings -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-Embedding-8B",
"messages": [
{"role": "user", "content": "Hello"}
]
}'
```
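If you prefer Python over curl, the embeddings endpoint can also be queried with the standard OpenAI-style request (a minimal sketch, assuming the `openai` package is installed on the client side; the API key is a placeholder):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The embeddings API takes a list of input strings and returns one vector per string.
result = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-8B",
    input=["What is the capital of China?", "The capital of China is Beijing."],
)
vectors = [item.embedding for item in result.data]
print(len(vectors), len(vectors[0]))  # number of embeddings and their dimension
```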
### Offline Inference
```python
import torch
from vllm import LLM
def get_detailed_instruct(task_description: str, query: str) -> str:
return f'Instruct: {task_description}\nQuery:{query}'
if __name__ == "__main__":
# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
get_detailed_instruct(task, 'What is the capital of China?'),
get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
"The capital of China is Beijing.",
"Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents
model = LLM(model="Qwen/Qwen3-Embedding-8B",
task="embed",
distributed_executor_backend="mp")
outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
```
If the script runs successfully, you will see output similar to the following:
```bash
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 282.22it/s]
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](VllmWorker rank=0 pid=4074750) ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 31.95it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
[[0.7477798461914062, 0.07548339664936066], [0.0886271521449089, 0.6311039924621582]]
```
The 2x2 matrix printed at the end contains query-document similarity scores: each row corresponds to one of the two queries and each column to one of the two documents, so the larger diagonal values show that each query matches its relevant document.

View File

@@ -0,0 +1,133 @@
# Single-NPU (Qwen3 8B W4A8)
## Run docker container
:::{note}
The W4A8 quantization feature is supported by vLLM Ascend v0.9.1rc2 or higher.
:::
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
## Install modelslim and convert model
:::{note}
You can either convert the model yourself or use the quantized model we uploaded;
see https://www.modelscope.cn/models/vllm-ascend/Qwen3-8B-W4A8
:::
```bash
# The branch (br_release_MindStudio_8.1.RC2_TR5_20260624) has been verified
git clone -b br_release_MindStudio_8.1.RC2_TR5_20260624 https://gitee.com/ascend/msit
cd msit/msmodelslim
# Install by running this script
bash install.sh
pip install accelerate
cd example/Qwen
# Original weight path; replace with your local model path
MODEL_PATH=/home/models/Qwen3-8B
# Path to save the converted weights; replace with your local path
SAVE_PATH=/home/models/Qwen3-8B-w4a8
python quant_qwen.py \
--model_path $MODEL_PATH \
--save_directory $SAVE_PATH \
--device_type npu \
--model_type qwen3 \
--calib_file None \
--anti_method m6 \
--anti_calib_file ./calib_data/mix_dataset.json \
--w_bit 4 \
--a_bit 8 \
--is_lowbit True \
--open_outlier False \
--group_size 256 \
--is_dynamic True \
--trust_remote_code True \
--w_method HQQ
```
## Verify the quantized model
The converted model files look like this:
```bash
.
|-- config.json
|-- configuration.json
|-- generation_config.json
|-- merges.txt
|-- quant_model_description.json
|-- quant_model_weight_w4a8_dynamic-00001-of-00003.safetensors
|-- quant_model_weight_w4a8_dynamic-00002-of-00003.safetensors
|-- quant_model_weight_w4a8_dynamic-00003-of-00003.safetensors
|-- quant_model_weight_w4a8_dynamic.safetensors.index.json
|-- README.md
|-- tokenizer.json
`-- tokenizer_config.json
```
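To quickly sanity-check the conversion, you can inspect `quant_model_description.json`, which records how the weights were quantized (a minimal sketch; the exact layout of the file depends on the msmodelslim version, so treat this only as a preview):
```python
import json

# Path to the converted model; replace with your local path.
desc_path = "/home/models/Qwen3-8B-w4a8/quant_model_description.json"

with open(desc_path) as f:
    desc = json.load(f)

# Print a short preview so you can check that W4A8 entries were recorded.
print(json.dumps(desc, indent=2)[:1000])
```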
Run the following script to start the vLLM server with quantized model:
```bash
vllm serve /home/models/Qwen3-8B-w4a8 --served-model-name "qwen3-8b-w4a8" --max-model-len 4096 --quantization ascend
```
Once your server is started, you can query the model with input prompts:
```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen3-8b-w4a8",
        "prompt": "what is large language model?",
        "max_tokens": 128,
        "top_p": 0.95,
        "top_k": 40,
        "temperature": 0.0
    }'
```
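The same completion request can be issued from Python (a minimal sketch using the OpenAI client; this assumes the `openai` package is installed and the server above is running, and the API key is a placeholder):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="qwen3-8b-w4a8",
    prompt="what is large language model?",
    max_tokens=128,
    temperature=0.0,
)
print(completion.choices[0].text)
```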
Run the following script to execute offline inference on a single NPU with the quantized model:
:::{note}
To enable quantization on Ascend, the quantization method must be set to "ascend".
:::
```python
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="/home/models/Qwen3-8B-w4a8",
max_model_len=4096,
quantization="ascend")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```