[Doc] Update tutorial index (#4920)

Update tutorial index and remove useless doc

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
wangxiyuan committed 2025-12-11 20:53:13 +08:00 (via GitHub)
parent e56dba9b0d
commit e538fa6f9c
29 changed files with 41 additions and 1034 deletions

View File

@@ -19,7 +19,7 @@ vLLM Ascend currently supports two types of connectors for handling KV cache man
- **MooncakeLayerwiseConnector**: P nodes push KV cache to D nodes in a layered manner.
For step-by-step deployment and configuration, refer to the following guide:
[https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node_pd_disaggregation_mooncake.html](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node_pd_disaggregation_mooncake.html)
[https://vllm-ascend.readthedocs.io/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://vllm-ascend.readthedocs.io/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html)
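As a rough, non-authoritative illustration of how the two roles map onto vLLM's KV-transfer options (the JSON field names here are assumptions; the linked guide is the source of truth):
```bash
# Hedged sketch only: field names follow upstream vLLM and may differ between releases.
# Prefill (P) instance, acting as the KV cache producer:
vllm serve <model> \
  --kv-transfer-config '{"kv_connector": "MooncakeConnector", "kv_role": "kv_producer"}'
# Decode (D) instance, acting as the KV cache consumer:
vllm serve <model> \
  --kv-transfer-config '{"kv_connector": "MooncakeConnector", "kv_role": "kv_consumer"}'
```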
---

View File

@@ -104,7 +104,7 @@ vllm-ascend is a hardware plugin for vLLM. Basically, the version of vllm-ascend
### 8. Does vllm-ascend support Prefill Disaggregation feature?
Yes, vllm-ascend supports Prefill Disaggregation feature with Mooncake backend. Take [official tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node_pd_disaggregation_mooncake.html) for example.
Yes, vllm-ascend supports Prefill Disaggregation feature with Mooncake backend. Take [official tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html) for example.
### 9. Does vllm-ascend support quantization method?

View File

@@ -1,4 +1,4 @@
# Single Node (Atlas 300I Series)
# Atlas 300I
```{note}
1. Support for the Atlas 300I series is currently experimental. In future versions, there may be behavioral changes related to model coverage and performance improvements.

View File

@@ -212,7 +212,7 @@ vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
### Prefill-Decode Disaggregation
We recommend using Mooncake for deployment: [Mooncake](./multi_node_pd_disaggregation_mooncake.md).
We recommend using Mooncake for deployment: [Mooncake](./pd_disaggregation_mooncake_multi_node.md).
This solution has been tested and demonstrates excellent performance.

View File

@@ -1,4 +1,4 @@
# DeepSeek-V3.1
# DeepSeek-V3/3.1
## Introduction
@@ -251,7 +251,7 @@ vllm serve /weights/DeepSeek-V3.1_w8a8mix_mtp \
### Prefill-Decode Disaggregation
We recommend using Mooncake for deployment: [Mooncake](./multi_node_pd_disaggregation_mooncake.md).
We recommend using Mooncake for deployment: [Mooncake](./pd_disaggregation_mooncake_multi_node.md).
Take Atlas 800 A3 (64G × 16) for example: we recommend deploying 2P1D (4 nodes) rather than 1P1D (2 nodes), because there is not enough NPU memory to serve high concurrency in the 1P1D case.
- `DeepSeek-V3.1_w8a8mix_mtp 2P1D Layerwise` requires 4 Atlas 800 A3 (64G × 16).

View File

@@ -1,4 +1,4 @@
# DeepSeek-V3.2-Exp
# DeepSeek-V3.2
## Introduction

View File

@@ -1,4 +1,4 @@
# Multi-NPU (Kimi-K2-Thinking)
# Kimi-K2-Thinking
## Run with Docker

View File

@@ -1,4 +1,4 @@
# Qwen-VL-Dense
# Qwen-VL-Dense(Qwen2.5VL-3B/7B, Qwen3-VL-2B/4B/8B/32B)
## Introduction

View File

@@ -1,4 +1,4 @@
# Qwen2.5-7B-Instruct
# Qwen2.5-7B
## Introduction

View File

@@ -1,4 +1,4 @@
# Single NPU (Qwen2-Audio-7B)
# Qwen2-Audio-7B
## Run vllm-ascend on Single NPU

View File

@@ -251,11 +251,11 @@ INFO: Application startup complete.
### Multi-node Deployment with Ray
- refer to [Multi-Node-Ray (Qwen/Qwen3-235B-A22B)](./multi_node_ray.md).
- refer to [Ray Distributed (Qwen/Qwen3-235B-A22B)](./ray.md).
### Prefill-Decode Disaggregation
- refer to [Prefill-Decode Disaggregation Mooncake Verification (Qwen)](./multi_node_pd_disaggregation_mooncake.md)
- refer to [Prefill-Decode Disaggregation Mooncake Verification (Qwen)](./pd_disaggregation_mooncake_multi_node.md)
## Functional Verification

View File

@@ -1,4 +1,4 @@
# Multi-NPU (Qwen3-30B-A3B)
# Qwen3-30B-A3B
## Run vllm-ascend on Multi-NPU with Qwen3 MoE

View File

@@ -1,4 +1,4 @@
# Single-NPU (Qwen3 32B W4A4)
# Qwen3-32B-W4A4
## Introduction

View File

@@ -1,4 +1,4 @@
# Single-NPU (Qwen3-8B-W4A8)
# Qwen3-8B-W4A8
## Run Docker Container
:::{note}

View File

@@ -1,4 +1,4 @@
# Qwen3-Dense
# Qwen3-Dense(Qwen3-0.6B/8B/32B)
## Introduction

View File

@@ -1,4 +1,4 @@
# Multi-NPU (Qwen3-Next)
# Qwen3-Next
```{note}
Qwen3-Next uses [Triton Ascend](https://gitee.com/ascend/triton-ascend), which is currently experimental. In future versions, there may be behavioral changes related to stability, accuracy, and performance improvements.

View File

@@ -1,4 +1,4 @@
# Single NPU (Qwen3-Embedding-8B)
# Qwen3-Embedding-8B
The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, designed specifically for text embedding and ranking tasks. Building upon the dense foundation models of the Qwen3 series, it provides a comprehensive range of text embedding and reranking models in various sizes (0.6B, 4B, and 8B). This guide describes how to run the model with vLLM Ascend. Note that only vLLM Ascend 0.9.2rc1 and higher supports the model.
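As a minimal serving sketch (assuming the embedding task is selected with upstream vLLM's `--task embed` flag; the full tutorial contains the exact command and container setup):
```bash
# Hedged sketch: flag names follow upstream vLLM and may differ per release.
vllm serve Qwen/Qwen3-Embedding-8B --task embed --max-model-len 8192
```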

View File

@@ -3,30 +3,24 @@
:::{toctree}
:caption: Deployment
:maxdepth: 1
single_npu
Qwen-VL-Dense.md
single_npu_qwen2_audio
single_npu_qwen3_embedding
single_npu_qwen3_quantization
single_npu_qwen3_w4a4
single_node_pd_disaggregation_mooncake
Qwen2.5
multi_npu_qwen3_next
multi_npu
multi_npu_kimi-k2-thinking
Qwen2.5-Omni.md
Qwen2_audio
Qwen2.5-7B
Qwen3-Dense
multi_npu_qwen3_moe
multi_npu_quantization
single_node_300i
DeepSeek-R1.md
DeepSeek-V3.1.md
DeepSeek-V3.2-Exp.md
Qwen-VL-Dense.md
Qwen3-30B-A3B.md
Qwen3-235B-A22B.md
Qwen3-Coder-30B-A3B
multi_node
multi_node_kimi
multi_node_qwen3vl
multi_node_pd_disaggregation_mooncake
multi_node_ray
Qwen2.5-Omni.md
Qwen3_embedding
Qwen3-8B-W4A8
Qwen3-32B-W4A4
Qwen3-Next
DeepSeek-V3.1.md
DeepSeek-V3.2.md
DeepSeek-R1.md
Kimi-K2-Thinking
pd_disaggregation_mooncake_single_node
pd_disaggregation_mooncake_multi_node
ray
310p
:::

View File

@@ -1,210 +0,0 @@
# Multi-Node-DP (DeepSeek)
## Getting Started
vLLM-Ascend now supports Data Parallel (DP) deployment, enabling model weights to be replicated across multiple NPUs or instances, each processing independent batches of requests. This is particularly useful for scaling throughput across devices while maintaining high resource utilization.
Each DP rank is deployed as a separate "core engine" process which communicates with front-end process(es) via ZMQ sockets. Data Parallel can be combined with Tensor Parallel, in which case each DP engine owns a number of per-NPU worker processes equal to the TP size.
For Mixture-of-Experts (MoE) models — especially advanced architectures like DeepSeek that utilize Multi-head Latent Attention (MLA) — a hybrid parallelism approach is recommended:
- Use **Data Parallel (DP)** for attention layers, which are replicated across devices and handle separate batches.
- Use **Expert or Tensor Parallel (EP/TP)** for expert layers, which are sharded across devices to distribute the computation.
This division allows attention layers to be replicated across DP ranks so that they can process different batches independently. Meanwhile, expert layers are partitioned (sharded) across devices using EP/TP, maximizing hardware utilization and efficiency.
In these cases, the data parallel ranks are not completely independent. Forward passes must be aligned and expert layers across all ranks are required to synchronize during every forward pass, even if there are fewer requests to be processed than DP ranks.
For MoE models, when any requests are in progress in any rank, we must ensure that empty "dummy" forward passes are performed in all ranks that don't currently have any requests scheduled. This is handled via a separate DP `Coordinator` process, which communicates with all of the ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).
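As a rough sizing sketch of how these flags relate in the scripts below (the numbers only illustrate the arithmetic; the node scripts are the reference):
```bash
# Hedged sketch of the DP/TP arithmetic used by the deployment below.
# Total NPUs = data-parallel-size x tensor-parallel-size = 4 x 4 = 16,
# split over 2 nodes with 8 NPUs each => data-parallel-size-local = 2 per node.
#   Node 0 hosts DP ranks 0-1 plus the front-end and DP coordinator.
#   Node 1 hosts DP ranks 2-3 and is started with --headless --data-parallel-start-rank 2.
# With expert parallel enabled, the expert layers form one group of size DP x TP = 16.
```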
## Verify Multi-Node Communication Environment
### Physical Layer Requirements:
- The physical machines must be located on the same LAN, with network connectivity.
- All NPUs are connected with optical modules, and the connection status must be normal.
### Verification Process:
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
```
### NPU Interconnect Verification:
#### 1. Get NPU IP Addresses
```bash
for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
```
#### 2. Cross-Node PING Test
```bash
# Execute on the target node (replace with actual IP)
hccn_tool -i 0 -ping -g address 10.20.0.20
```
## Run with Docker
Assume you have two Atlas 800 A2 (64G*8) nodes, and want to deploy the `deepseek-v3.1-w8a8` quantized model across multiple nodes.
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: If you are using Docker bridge networking, expose the ports needed for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/sfs_turbo/.cache:/root/.cache \
-it $IMAGE bash
```
Run the following scripts on two nodes respectively.
:::{note}
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::
**Node 0**
```shell
#!/bin/sh
# these values are obtained via ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxxx"
local_ip="xxxx"
export VLLM_USE_MODELSCOPE=True
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=1024
# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-W8A8
# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 4 \
--seed 1024 \
--served-model-name deepseek_v3.1 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--quantization ascend \
--max-num-batched-tokens 8192 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9
```
**Node 1**
```shell
#!/bin/sh
# these values are obtained via ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="xxx"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
export VLLM_USE_MODELSCOPE=True
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=1024
vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--headless \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-start-rank 2 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 4 \
--seed 1024 \
--quantization ascend \
--served-model-name deepseek_v3.1 \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--enable-expert-parallel \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92
```
The deployment view looks like:
![alt text](../assets/multi_node_dp_deepseek.png)
Once your server is started, you can query the model with input prompts:
```shell
curl http://{ node0 ip:8004 }/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek_v3.1",
"prompt": "The future of AI is",
"max_tokens": 50,
"temperature": 0
}'
```
## Run Benchmarks
For details, refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model vllm-ascend/DeepSeek-V3.1-W8A8 --served-model-name deepseek_v3.1 \
--dataset-name random --random-input-len 128 --random-output-len 128 \
--num-prompts 200 --trust-remote-code --base-url "http://{ node0 ip }:8004" --request-rate 1
```

View File

@@ -1,155 +0,0 @@
# Multi-Node-DP (Kimi-K2)
## Verify Multi-Node Communication Environment
Refer to [multi_node.md](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html#verification-process).
## Run with Docker
Assume you have two Atlas 800 A3 (64G*16) or four A2 nodes, and want to deploy the `Kimi-K2-Instruct-W8A8` quantized model across multiple nodes.
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: If you are using Docker bridge networking, expose the ports needed for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/sfs_turbo/.cache:/home/cache \
-it $IMAGE bash
```
Run the following scripts on two nodes respectively.
:::{note}
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::
**Node 0**
```shell
#!/bin/sh
# these values are obtained via ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxxx"
local_ip="xxxx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=1024
# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8
# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 4 \
--api-server-count 2 \
--data-parallel-size-local 2 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 13389 \
--seed 1024 \
--served-model-name kimi \
--quantization ascend \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9
```
**Node 1**
```shell
#!/bin/sh
# these values are obtained via ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxxx"
local_ip="xxxx"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=1024
vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--headless \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-start-rank 2 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--seed 1024 \
--tensor-parallel-size 8 \
--served-model-name kimi \
--max-num-seqs 16 \
--max-model-len 8192 \
--quantization ascend \
--max-num-batched-tokens 8192 \
--enable-expert-parallel \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92
```
The deployment view looks like:
![alt text](../assets/multi_node_dp_kimi.png)
Once your server is started, you can query the model with input prompts:
```shell
curl http://{ node0 ip:8004 }/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "kimi",
"prompt": "The future of AI is",
"max_tokens": 50,
"temperature": 0
}'
```

View File

@@ -1,163 +0,0 @@
# Multi-Node-DP (Qwen3-VL-235B-A22B)
:::{note}
Qwen3-VL requires a newer version of `transformers` (> 4.56.2). Please install it from source.
:::
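For example, a hedged way to get a new enough `transformers` (pin whatever version your release actually requires):
```bash
# Upgrade from PyPI once a release newer than 4.56.2 is available ...
pip install --upgrade "transformers>4.56.2"
# ... or install from source, as the note above suggests:
pip install git+https://github.com/huggingface/transformers.git
```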
## Verify Multi-Node Communication Environment
Refer to [multi_node.md](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html#verification-process).
## Run with Docker
Assume you have Atlas 800 A3 (64G*16) nodes (or 2 * A2), and want to deploy the `Qwen3-VL-235B-A22B-Instruct` model across multiple nodes.
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Run the following scripts on two nodes respectively.
:::{note}
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::
Node 0
```shell
#!/bin/sh
# these values are obtained via ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxxx"
local_ip="xxxx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=1024
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--data-parallel-size 2 \
--api-server-count 2 \
--data-parallel-size-local 1 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 13389 \
--seed 1024 \
--served-model-name qwen3vl \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 32768 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.8
```
Node 1
```shell
#!/bin/sh
# these values are obtained via ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxxx"
local_ip="xxxx"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=1024
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--seed 1024 \
--tensor-parallel-size 8 \
--served-model-name qwen3vl \
--max-num-seqs 16 \
--max-model-len 32768 \
--max-num-batched-tokens 4096 \
--enable-expert-parallel \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.8
```
If the service starts successfully, the following information will be displayed on node 0:
```shell
INFO: Started server process [44610]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Started server process [44611]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
Once your server is started, you can query the model with input prompts:
```shell
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3vl",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
{"type": "text", "text": "What is the text in the illustrate?"}
]}
]
}'
```

View File

@@ -1,108 +0,0 @@
# Multi-NPU (QwQ-32B)
## Run vllm-ascend on Multi-NPU
Run docker container:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Set up environment variables:
```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out-of-memory errors
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
### Online Inference on Multi-NPU
Run the following script to start the vLLM server on multi-NPU:
```bash
vllm serve Qwen/QwQ-32B --max-model-len 4096 --port 8000 -tp 4
```
Once your server is started, you can query the model with input prompts.
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/QwQ-32B",
"prompt": "QwQ-32B是什么",
"max_tokens": "128",
"top_p": "0.95",
"top_k": "40",
"temperature": "0.6"
}'
```
### Offline Inference on Multi-NPU
Run the following script to execute offline inference on multi-NPU:
```python
import gc
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="Qwen/QwQ-32B",
tensor_parallel_size=4,
distributed_executor_backend="mp",
max_model_len=4096)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
If you run this script successfully, you can see the info shown below:
```bash
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
```

View File

@@ -1,138 +0,0 @@
# Multi-NPU (QwQ-32B-W8A8)
## Run Docker Container
:::{note}
The w8a8 quantization feature is supported in v0.8.4rc2 and later.
:::
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
## Install modelslim and Convert Model
:::{note}
You can either convert the model yourself or use the quantized model we uploaded;
see https://www.modelscope.cn/models/vllm-ascend/QwQ-32B-W8A8
:::
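If you prefer the pre-quantized weights, a minimal sketch is shown below (assuming they can be pulled by ModelScope ID the same way the other tutorials do with `VLLM_USE_MODELSCOPE`); otherwise continue with the conversion steps.
```bash
# Hedged sketch: serve the pre-quantized weights directly from ModelScope,
# mirroring the pattern used by the other tutorials in this repository.
export VLLM_USE_MODELSCOPE=True
vllm serve vllm-ascend/QwQ-32B-W8A8 \
  --tensor-parallel-size 4 \
  --served-model-name "qwq-32b-w8a8" \
  --max-model-len 4096 \
  --quantization ascend
```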
```bash
# (Optional) This tag is recommended and has been verified
git clone https://gitcode.com/Ascend/msit -b modelslim-VLLM-8.1.RC1.b020_001
cd msit/msmodelslim
# Install by running this script
bash install.sh
pip install accelerate
cd example/Qwen
# Original weight path; replace with your local model path
MODEL_PATH=/home/models/QwQ-32B
# Path to save the converted weight; replace with your local path
SAVE_PATH=/home/models/QwQ-32B-w8a8
# An NPU device is not required for this conversion; you can also set --device_type cpu to run it on CPU
python3 quant_qwen.py --model_path $MODEL_PATH --save_directory $SAVE_PATH --calib_file ../common/boolq.jsonl --w_bit 8 --a_bit 8 --device_type npu --anti_method m1 --trust_remote_code True
```
## Verify the Quantized Model
The converted model files look like:
```bash
.
|-- config.json
|-- configuration.json
|-- generation_config.json
|-- quant_model_description.json
|-- quant_model_weight_w8a8.safetensors
|-- README.md
|-- tokenizer.json
`-- tokenizer_config.json
```
Run the following script to start the vLLM server with the quantized model:
:::{note}
The value "ascend" for "--quantization" argument will be supported after [a specific PR](https://github.com/vllm-project/vllm-ascend/pull/877) is merged and released. You can cherry-pick this commit for now.
:::
```bash
vllm serve /home/models/QwQ-32B-w8a8 --tensor-parallel-size 4 --served-model-name "qwq-32b-w8a8" --max-model-len 4096 --quantization ascend
```
Once your server is started, you can query the model with input prompts
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwq-32b-w8a8",
"prompt": "what is large language model?",
"max_tokens": "128",
"top_p": "0.95",
"top_k": "40",
"temperature": "0.0"
}'
```
Run the following script to execute offline inference on multi-NPU with the quantized model:
:::{note}
To enable quantization on Ascend, the quantization method must be "ascend".
:::
```python
import gc
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="/home/models/QwQ-32B-w8a8",
tensor_parallel_size=4,
distributed_executor_backend="mp",
max_model_len=4096,
quantization="ascend")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```

View File

@@ -1,4 +1,4 @@
# Prefill-Decode Disaggregation Mooncake Verification (Deepseek)
# Prefill-Decode Disaggregation (Deepseek)
## Getting Started

View File

@@ -1,4 +1,4 @@
# Prefill-Decode Disaggregation Mooncake Verification (Qwen2.5-VL)
# Prefill-Decode Disaggregation (Qwen2.5-VL)
## Getting Started

View File

@@ -1,4 +1,4 @@
# Multi-Node-Ray (Qwen/Qwen3-235B-A22B)
# Ray Distributed (Qwen3-235B-A22B)
Multi-node inference is suitable for scenarios where the model cannot be deployed on a single machine. In such cases, the model can be distributed using tensor parallelism or pipeline parallelism. The specific parallelism strategies will be covered in the following sections. To successfully deploy multi-node inference, the following three steps need to be completed:
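Before walking through those steps, here is a hedged sketch of what the final launch can look like when the model is sharded with tensor and pipeline parallelism over a Ray cluster (flag names follow upstream vLLM; the tutorial's own commands take precedence):
```bash
# Hedged sketch: 2 nodes x 8 NPUs, sharded as TP=8 within a node and PP=2 across nodes.
# Assumes a Ray cluster is already running (ray start --head on node 0,
# ray start --address=<node0_ip>:6379 on node 1).
vllm serve Qwen/Qwen3-235B-A22B \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend ray
```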

View File

@@ -1,213 +0,0 @@
# Single NPU (Qwen3-8B)
## Run vllm-ascend on Single NPU
### Offline Inference on Single NPU
Run docker container:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Set up environment variables:
```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out-of-memory errors
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
:::{note}
`max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html).
:::
Run the following script to execute offline inference on a single NPU:
:::::{tab-set}
:sync-group: inference
::::{tab-item} Graph Mode
:sync: graph mode
```{code-block} python
:substitutions:
import os
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
model="Qwen/Qwen3-8B",
max_model_len=26240
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
::::
::::{tab-item} Eager Mode
:sync: eager mode
```{code-block} python
:substitutions:
import os
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
model="Qwen/Qwen3-8B",
max_model_len=26240,
enforce_eager=True
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
::::
:::::
If you run this script successfully, you can see the info shown below:
```bash
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
```
### Online Serving on Single NPU
Run docker container to start the vLLM server on a single NPU:
:::::{tab-set}
:sync-group: inference
::::{tab-item} Graph Mode
:sync: graph mode
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen3-8B --max_model_len 26240
```
::::
::::{tab-item} Eager Mode
:sync: eager mode
```{code-block} bash
:substitutions:
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen3-8B --max_model_len 26240 --enforce-eager
```
::::
:::::
:::{note}
Add the `--max_model_len` option to avoid a ValueError raised when the model's max seq len (32768) is larger than the maximum number of tokens that can fit in the KV cache (26240). The available KV cache differs across NPU series depending on HBM size, so adjust the value to one suitable for your NPU.
:::
If your service starts successfully, you will see the info shown below:
```bash
INFO: Started server process [6873]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
Once your server is started, you can query the model with input prompts:
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-8B",
"prompt": "The future of AI is",
"max_tokens": 7,
"temperature": 0
}'
```
If you query the server successfully, you can see the info shown below (client):
```bash
{"id":"cmpl-b25a59a2f985459781ce7098aeddfda7","object":"text_completion","created":1739523925,"model":"Qwen/Qwen3-8B","choices":[{"index":0,"text":" here. Its not just a","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7,"prompt_tokens_details":null}}
```
Logs of the vllm server:
```bash
INFO: 172.17.0.1:49518 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-13 08:34:35 logger.py:39] Received request cmpl-574f00e342904692a73fb6c1c986c521-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [23729, 12879, 374, 264], lora_request: None, prompt_adapter_request: None.
```

View File

@@ -56,7 +56,7 @@ v0.11.0 will be the next official release version of vLLM Ascend. We'll release
### Core
- Performance of Qwen3 and Deepseek V3 series models are improved.
- Mooncake layerwise connector is supported now [#2602](https://github.com/vllm-project/vllm-ascend/pull/2602). Find tutorial [here](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/multi_node_pd_disaggregation_mooncake.html).
- Mooncake layerwise connector is supported now [#2602](https://github.com/vllm-project/vllm-ascend/pull/2602). Find tutorial [here](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html).
- MTP > 1 is supported now. [#2708](https://github.com/vllm-project/vllm-ascend/pull/2708)
- [Experimental] Graph mode `FULL_DECODE_ONLY` is supported now! And `FULL` will be landing in the next few weeks. [#2128](https://github.com/vllm-project/vllm-ascend/pull/2128)
- Pooling models, such as bge-m3, are supported now. [#3171](https://github.com/vllm-project/vllm-ascend/pull/3171)

View File

@@ -9,7 +9,7 @@ Get the latest info here: https://github.com/vllm-project/vllm-ascend/issues/160
| Model | Support | Note | BF16 | Supported Hardware | W8A8 | Chunked Prefill | Automatic Prefix Cache | LoRA | Speculative Decoding | Async Scheduling | Tensor Parallel | Pipeline Parallel | Expert Parallel | Data Parallel | Prefill-decode Disaggregation | Piecewise AclGraph | Fullgraph AclGraph | max-model-len | MLP Weight Prefetch | Doc |
|-------------------------------|-----------|----------------------------------------------------------------------|------|--------------------|------|-----------------|------------------------|------|----------------------|------------------|-----------------|-------------------|-----------------|---------------|-------------------------------|--------------------|--------------------|---------------|---------------------|-----|
| DeepSeek V3/3.1 | ✅ | |||||||||||||||||||
| DeepSeek V3.2 EXP | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ | ✅ | ✅ | | ✅ | ✅ | ✅ | ✅ | ❌ | | | 163840 | | [DeepSeek-V3.2-Exp tutorial](../../tutorials/DeepSeek-V3.2-Exp.md) |
| DeepSeek V3.2 EXP | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ | ✅ | ✅ | | ✅ | ✅ | ✅ | ✅ | ❌ | | | 163840 | | [DeepSeek-V3.2-Exp tutorial](../../tutorials/DeepSeek-V3.2.md) |
| DeepSeek R1 | ✅ | |||||||||||||||||||
| DeepSeek Distill (Qwen/Llama) | ✅ | |||||||||||||||||||
| Qwen3 | ✅ | |||||||||||||||||||