init v0.11.0rc0
@@ -8,6 +8,7 @@ single_npu_multimodal
single_npu_audio
single_npu_qwen3_embedding
single_npu_qwen3_quantization
multi_npu_qwen3_next
multi_npu
multi_npu_moge
multi_npu_qwen3_moe
@@ -15,4 +16,7 @@ multi_npu_quantization
single_node_300i
multi_node
multi_node_kimi
multi_node_qwen3vl
multi_node_pd_disaggregation
multi_node_ray
:::
docs/source/tutorials/multi_node_pd_disaggregation.md (244 lines, new file)
@@ -0,0 +1,244 @@
# Prefill-Decode Disaggregation Verification (Qwen)

## Getting Started

vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide walks through the steps to verify these features with constrained resources.

We take the Qwen3-30B-A3B model as an example, using vllm-ascend v0.10.1rc1 (with vLLM v0.10.1.1) on 3 Atlas 800T A2 servers to deploy a "1P2D" architecture. Assume the IP of the prefiller server is 192.0.0.1, and the decoder servers are 192.0.0.2 (decoder 1) and 192.0.0.3 (decoder 2). On each server, 2 NPUs are used to deploy one service instance.
## Verify Multi-Node Communication Environment

### Physical Layer Requirements

- The physical machines must be located on the same LAN, with network connectivity.
- All NPUs must be interconnected. Intra-node connectivity is via HCCS, and inter-node connectivity is via RDMA.

### Verification Process
1. Single Node Verification:

   Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:

   ```bash
   # Check the remote switch ports
   for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
   # Get the link status of the Ethernet ports (UP or DOWN)
   for i in {0..7}; do hccn_tool -i $i -link -g ; done
   # Check the network health status
   for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
   # View the network detected IP configuration
   for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
   # View gateway configuration
   for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
   # View NPU network configuration
   cat /etc/hccn.conf
   ```
2. Get NPU IP Addresses:

   ```bash
   for i in {0..7}; do hccn_tool -i $i -ip -g; done
   ```
3. Cross-Node PING Test:

   ```bash
   # Execute on the target node (replace 'x.x.x.x' with the actual NPU IP address)
   for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x; done
   ```
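The `/etc/hccn.conf` file read in step 1 maps each device index to its network settings as `key_N=value` lines. Below is a minimal Python sketch of parsing it into a per-device map; the `address`/`netmask` keys and sample values are illustrative assumptions, not taken from a real machine:

```python
# Minimal sketch: parse hccn.conf-style "key_N=value" lines into a per-device map.
# The keys shown (address_N, netmask_N) are assumptions about the common layout;
# adjust to whatever your /etc/hccn.conf actually contains.
sample = """\
address_0=10.20.0.20
netmask_0=255.255.0.0
address_1=10.20.0.21
netmask_1=255.255.0.0
"""

def parse_hccn(text):
    devices = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        name, _, idx = key.rpartition("_")
        if idx.isdigit():
            devices.setdefault(int(idx), {})[name] = value
    return devices

print(parse_hccn(sample))
```

A map like this makes it easy to script the cross-node PING test in step 3 over every remote NPU address.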
## Generate Ranktable

The rank table is a JSON file that specifies the mapping of Ascend NPU ranks to nodes. For more details please refer to the [vllm-ascend examples](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/README.md). Execute the following commands for reference.

```shell
cd vllm-ascend/examples/disaggregated_prefill_v1/
bash gen_ranktable.sh --ips <prefiller_node1_local_ip> <prefiller_node2_local_ip> <decoder_node1_local_ip> <decoder_node2_local_ip> \
  --npus-per-node <npu_chips> --network-card-name <nic_name> --prefill-device-cnt <prefiller_npu_chips> --decode-device-cnt <decode_npu_chips> \
  [--local-device-ids <id_1>,<id_2>,<id_3>...]
```
Assume we use devices 0,1 on the prefiller server node and devices 6,7 on both decoder server nodes. Take the following commands as an example. (`--local-device-ids` is required if you use only certain NPU devices on the local server.)

```shell
# On the prefiller node
cd vllm-ascend/examples/disaggregated_prefill_v1/
bash gen_ranktable.sh --ips 192.0.0.1 192.0.0.2 192.0.0.3 \
  --npus-per-node 2 --network-card-name eth0 --prefill-device-cnt 2 --decode-device-cnt 4 --local-device-ids 0,1

# On decoder 1
cd vllm-ascend/examples/disaggregated_prefill_v1/
bash gen_ranktable.sh --ips 192.0.0.1 192.0.0.2 192.0.0.3 \
  --npus-per-node 2 --network-card-name eth0 --prefill-device-cnt 2 --decode-device-cnt 4 --local-device-ids 6,7

# On decoder 2
cd vllm-ascend/examples/disaggregated_prefill_v1/
bash gen_ranktable.sh --ips 192.0.0.1 192.0.0.2 192.0.0.3 \
  --npus-per-node 2 --network-card-name eth0 --prefill-device-cnt 2 --decode-device-cnt 4 --local-device-ids 6,7
```
The rank table will be generated at `/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json`.

| Parameter | Meaning |
| --- | --- |
| `--ips` | Each node's local IP (prefiller nodes must come before decoder nodes) |
| `--npus-per-node` | Number of NPU chips on each node |
| `--network-card-name` | The physical machine's NIC |
| `--prefill-device-cnt` | Number of NPU chips used for prefill |
| `--decode-device-cnt` | Number of NPU chips used for decode |
| `--local-device-ids` | Optional. Not needed if using all devices on the local node. |
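As a quick sanity check on these arguments: the prefill and decode device counts must fit within what the listed nodes provide. A small sketch of that constraint (the helper is ours, not part of `gen_ranktable.sh`):

```python
def check_ranktable_args(ips, npus_per_node, prefill_cnt, decode_cnt):
    """Sanity-check gen_ranktable.sh arguments: total devices requested
    must not exceed what the listed nodes provide."""
    total = npus_per_node * len(ips)
    assert prefill_cnt + decode_cnt <= total, (
        f"requested {prefill_cnt + decode_cnt} devices, "
        f"but {len(ips)} nodes x {npus_per_node} NPUs = {total}")
    return total

# The 1P2D example above: 3 nodes, 2 NPUs each, 2 prefill + 4 decode devices.
print(check_ranktable_args(["192.0.0.1", "192.0.0.2", "192.0.0.3"], 2, 2, 4))  # 6
```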
## Prefiller / Decoder Deployment

We can run the following scripts to launch a server on the prefiller and decoder nodes respectively.

:::::{tab-set}
::::{tab-item} Prefiller node

```shell
export HCCL_IF_IP=192.0.0.1 # node ip
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH="/path/to/your/generated/ranktable.json"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1

vllm serve /model/Qwen3-30B-A3B \
  --host 0.0.0.0 \
  --port 13700 \
  --tensor-parallel-size 2 \
  --no-enable-prefix-caching \
  --seed 1024 \
  --served-model-name qwen3-moe \
  --max-model-len 6144 \
  --max-num-batched-tokens 6144 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --enable-expert-parallel \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistCMgrConnector",
  "kv_buffer_device": "npu",
  "kv_role": "kv_producer",
  "kv_parallel_size": 1,
  "kv_port": "20001",
  "engine_id": "0",
  "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }' \
  --additional-config \
  '{"torchair_graph_config": {"enabled":false, "enable_multistream_shared_expert":false}, "ascend_scheduler_config":{"enabled":true, "enable_chunked_prefill":false}}' \
  --enforce-eager
```

::::
::::{tab-item} Decoder node 1

```shell
export HCCL_IF_IP=192.0.0.2 # node ip
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH="/path/to/your/generated/ranktable.json"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1

vllm serve /model/Qwen3-30B-A3B \
  --host 0.0.0.0 \
  --port 13700 \
  --no-enable-prefix-caching \
  --tensor-parallel-size 2 \
  --seed 1024 \
  --served-model-name qwen3-moe \
  --max-model-len 6144 \
  --max-num-batched-tokens 6144 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --enable-expert-parallel \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistCMgrConnector",
  "kv_buffer_device": "npu",
  "kv_role": "kv_consumer",
  "kv_parallel_size": 1,
  "kv_port": "20001",
  "engine_id": "0",
  "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }' \
  --additional-config \
  '{"torchair_graph_config": {"enabled":false, "enable_multistream_shared_expert":false}, "ascend_scheduler_config":{"enabled":true, "enable_chunked_prefill":false}}'
```

::::
::::{tab-item} Decoder node 2

```shell
export HCCL_IF_IP=192.0.0.3 # node ip
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH="/path/to/your/generated/ranktable.json"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1

vllm serve /model/Qwen3-30B-A3B \
  --host 0.0.0.0 \
  --port 13700 \
  --no-enable-prefix-caching \
  --tensor-parallel-size 2 \
  --seed 1024 \
  --served-model-name qwen3-moe \
  --max-model-len 6144 \
  --max-num-batched-tokens 6144 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --enable-expert-parallel \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistCMgrConnector",
  "kv_buffer_device": "npu",
  "kv_role": "kv_consumer",
  "kv_parallel_size": 1,
  "kv_port": "20001",
  "engine_id": "0",
  "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }' \
  --additional-config \
  '{"torchair_graph_config": {"enabled":false, "enable_multistream_shared_expert":false}, "ascend_scheduler_config":{"enabled":true, "enable_chunked_prefill":false}}'
```

::::

:::::
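Because `--kv-transfer-config` is passed as one inline JSON string, a stray quote or comma makes the server fail at startup. The string can be validated with the standard library before launching; here is the producer-side value from the prefiller command:

```python
import json

# The producer-side value passed to --kv-transfer-config on the prefiller node.
kv_transfer_config = '''{"kv_connector": "LLMDataDistCMgrConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_producer",
"kv_parallel_size": 1,
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}'''

cfg = json.loads(kv_transfer_config)  # raises ValueError if the string is malformed
print(cfg["kv_role"])                 # kv_producer on the prefiller, kv_consumer on decoders
```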
## Example Proxy for Deployment

Run a proxy server on the same node as the prefiller service instance. You can get the proxy program from the repository's examples: [load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
```shell
python load_balance_proxy_server_example.py \
  --host 192.0.0.1 \
  --port 8080 \
  --prefiller-hosts 192.0.0.1 \
  --prefiller-port 13700 \
  --decoder-hosts 192.0.0.2 192.0.0.3 \
  --decoder-ports 13700 13700
```
## Verification

Check service health using the proxy server endpoint.
```shell
curl http://192.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-moe",
    "prompt": "Who are you?",
    "max_tokens": 100,
    "temperature": 0
  }'
```
docs/source/tutorials/multi_node_qwen3vl.md (156 lines, new file)
@@ -0,0 +1,156 @@
# Multi-Node-DP (Qwen3-VL-235B-A22B)

## Verify Multi-Node Communication Environment

Refer to [multi_node.md](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html#verification-process).

## Run with Docker

Assume you have two Atlas 800 A3 (64G*16) nodes (or 2 * A2), and want to deploy the `Qwen3-VL-235B-A22B-Instruct` model across multiple nodes.
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Run the following scripts on the two nodes respectively.

:::{note}
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::
node0

```shell
#!/bin/sh

# nic_name is the network interface name corresponding to local_ip
# both can be obtained through ifconfig
nic_name="xxxx"
local_ip="xxxx"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024

vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --data-parallel-size 2 \
  --api-server-count 2 \
  --data-parallel-size-local 1 \
  --data-parallel-address $local_ip \
  --data-parallel-rpc-port 13389 \
  --seed 1024 \
  --served-model-name qwen3vl \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-num-seqs 16 \
  --max-model-len 32768 \
  --max-num-batched-tokens 4096 \
  --trust-remote-code \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.8
```
node1

```shell
#!/bin/sh

nic_name="xxxx"
local_ip="xxxx"
node0_ip="xxxx"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024

vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --headless \
  --data-parallel-size 2 \
  --data-parallel-size-local 1 \
  --data-parallel-start-rank 1 \
  --data-parallel-address $node0_ip \
  --data-parallel-rpc-port 13389 \
  --seed 1024 \
  --tensor-parallel-size 8 \
  --served-model-name qwen3vl \
  --max-num-seqs 16 \
  --max-model-len 32768 \
  --max-num-batched-tokens 4096 \
  --enable-expert-parallel \
  --trust-remote-code \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.8
```
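The flag combination above can be sanity-checked with a little arithmetic: each data-parallel rank spans a tensor-parallel group of 8 NPUs, and with `--data-parallel-size 2` split as one local rank per node, the deployment consumes 16 NPUs in total. A small sketch (the helper names are ours):

```python
def world_size(dp, tp, pp=1):
    """Total devices a deployment consumes: DP x PP x TP."""
    return dp * tp * pp

def ranks_per_node(dp_size_local, tp):
    """Devices one node contributes when it hosts dp_size_local DP ranks."""
    return dp_size_local * tp

# Two nodes, one DP rank each, TP=8 inside every rank.
print(world_size(dp=2, tp=8))   # 16 NPUs across the cluster
print(ranks_per_node(1, 8))     # 8 NPUs per node
```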
If the service starts successfully, the following information will be displayed on node0:

```shell
INFO: Started server process [44610]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Started server process [44611]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
Once your server is started, you can query the model with input prompts:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3vl",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "text", "text": "What is the text in the illustration?"}
      ]}
    ]
  }'
```
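If you prefer to build the request programmatically, the same multimodal body can be assembled with the standard library; this mirrors the curl payload above (OpenAI-compatible chat schema, mixing an `image_url` part with a `text` part in one user turn):

```python
import json

# Assemble the chat request body: one system turn plus a user turn whose
# content is a list of typed parts (image_url + text).
body = {
    "model": "qwen3vl",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
            {"type": "text", "text": "What is the text in the illustration?"},
        ]},
    ],
}
payload = json.dumps(body)  # this string is what goes into the POST body
print(len(json.loads(payload)["messages"]))  # 2
```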
docs/source/tutorials/multi_node_ray.md (182 lines, new file)
@@ -0,0 +1,182 @@
# Multi-Node-Ray (Qwen/Qwen3-235B-A22B)

Multi-node inference is suitable for scenarios where the model cannot be deployed on a single machine. In such cases, the model can be distributed using tensor parallelism or pipeline parallelism. The specific parallelism strategies are covered in the following sections. To successfully deploy multi-node inference, the following three steps need to be completed:

* **Verify Multi-Node Communication Environment**
* **Set Up and Start the Ray Cluster**
* **Start the Online Inference Service on Multi-Node**
## Verify Multi-Node Communication Environment

### Physical Layer Requirements:

* The physical machines must be located on the same LAN, with network connectivity.
* All NPUs are connected with optical modules, and the connection status must be normal.

### Verification Process:

Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
```
### NPU Interconnect Verification:

#### 1. Get NPU IP Addresses

```bash
for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
```

#### 2. Cross-Node PING Test

```bash
# Execute on the target node (replace with the actual IP)
hccn_tool -i 0 -ping -g address 10.20.0.20
```
## Set Up and Start the Ray Cluster

### Setting Up the Basic Container

To ensure a consistent execution environment across all nodes, including the model path and Python environment, it is recommended to use Docker images.

For setting up a multi-node inference cluster with Ray, **containerized deployment** is the preferred approach. Containers should be started on both the master and worker nodes, with the `--net=host` option to enable proper network connectivity.

Below is an example container setup command, which should be executed on **all nodes**:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.nju.edu.cn/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend

# Run the container using the defined variables
# Note: if you are running a bridge network with docker, please expose available ports for multi-node communication in advance
# IMPORTANT: /path/to/shared/cache must be a shared directory accessible by all nodes
docker run --rm \
--name $NAME \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /path/to/shared/cache:/root/.cache \
-it $IMAGE bash
```
### Start Ray Cluster

After setting up the containers and installing vllm-ascend on each node, follow the steps below to start the Ray cluster and execute inference tasks.

Choose one machine as the head node and the others as worker nodes. Before proceeding, use `ip addr` to check your `nic_name` (network interface name).

Set the `ASCEND_RT_VISIBLE_DEVICES` environment variable to specify the NPU devices to use. For Ray versions above 2.1, also set the `RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` variable to avoid device recognition issues.

Below are the commands for the head and worker nodes:

**Head node**:

:::{note}
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect.
Updating the environment variables requires restarting the Ray cluster.
:::
```shell
# Head node
export HCCL_IF_IP={local_ip}
export GLOO_SOCKET_IFNAME={nic_name}
export TP_SOCKET_IFNAME={nic_name}
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --head
```
**Worker node**:

:::{note}
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect. Updating the environment variables requires restarting the Ray cluster.
:::
```shell
# Worker node
export HCCL_IF_IP={local_ip}
export GLOO_SOCKET_IFNAME={nic_name}
export TP_SOCKET_IFNAME={nic_name}
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --address='{head_node_ip}:6379' --node-ip-address={local_ip}
```

Once the cluster is started on multiple nodes, execute `ray status` and `ray list nodes` to verify the Ray cluster's status. You should see the correct number of nodes and NPUs listed.
## Start the Online Inference Service on Multi-Node

In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster.

**You only need to run the vllm command on one node.**

To set up parallelism, the common practice is to set `tensor-parallel-size` to the number of NPUs per node, and `pipeline-parallel-size` to the number of nodes.

For example, with 16 NPUs across 2 nodes (8 NPUs per node), set the tensor parallel size to 8 and the pipeline parallel size to 2:
```shell
vllm serve Qwen/Qwen3-235B-A22B \
  --distributed-executor-backend ray \
  --pipeline-parallel-size 2 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --seed 1024 \
  --max-model-len 8192 \
  --max-num-seqs 25 \
  --served-model-name qwen \
  --trust-remote-code \
  --gpu-memory-utilization 0.9
```
Alternatively, if you want to use only tensor parallelism, set the tensor parallel size to the total number of NPUs in the cluster. For example, with 16 NPUs across 2 nodes, set the tensor parallel size to 16:

```shell
vllm serve Qwen/Qwen3-235B-A22B \
  --distributed-executor-backend ray \
  --tensor-parallel-size 16 \
  --enable-expert-parallel \
  --seed 1024 \
  --max-model-len 8192 \
  --max-num-seqs 25 \
  --served-model-name qwen \
  --trust-remote-code \
  --gpu-memory-utilization 0.9
```
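Both layouts cover the same 16 NPUs; the underlying constraint is simply that tensor parallel size times pipeline parallel size equals the device count. A small sketch of that check (the helper is ours):

```python
def valid_layout(total_npus, tp, pp=1):
    """A parallel layout is feasible when TP x PP exactly covers the NPUs."""
    return tp * pp == total_npus

# 2 nodes x 8 NPUs = 16 devices.
print(valid_layout(16, tp=8, pp=2))   # True: TP within a node, PP across nodes
print(valid_layout(16, tp=16))        # True: pure tensor parallelism
print(valid_layout(16, tp=8))         # False: covers only one node
```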
Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "prompt": "tell me how to sleep well",
    "max_tokens": 100,
    "temperature": 0
  }'
```
docs/source/tutorials/multi_npu_qwen3_next.md (156 lines, new file)
@@ -0,0 +1,156 @@
# Multi-NPU (Qwen3-Next)

```{note}
Qwen3 Next uses [Triton Ascend](https://gitee.com/ascend/triton-ascend), which is currently experimental. Future versions may bring behavioral changes around stability, accuracy, and performance.
```
## Run vllm-ascend on Multi-NPU with Qwen3 Next

Run the docker container:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend-qwen3 \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Set up environment variables:

```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
```
### Install Triton Ascend

:::::{tab-set}
::::{tab-item} Linux (aarch64)

[Triton Ascend](https://gitee.com/ascend/triton-ascend) is required to run Qwen3 Next. Please follow the instructions below to install it and its dependencies.

Install the Ascend BiSheng toolkit:

```bash
wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/Ascend-BiSheng-toolkit_aarch64.run
chmod a+x Ascend-BiSheng-toolkit_aarch64.run
./Ascend-BiSheng-toolkit_aarch64.run --install
source /usr/local/Ascend/8.3.RC1/bisheng_toolkit/set_env.sh
```

Install Triton Ascend:

```bash
wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/triton_ascend-3.2.0.dev20250914-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl
pip install triton_ascend-3.2.0.dev20250914-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl
```

::::

::::{tab-item} Linux (x86_64)

Coming soon ...

::::
:::::
### Inference on Multi-NPU

Please make sure you have already executed the command:

```bash
source /usr/local/Ascend/8.3.RC1/bisheng_toolkit/set_env.sh
```
:::::{tab-set}
::::{tab-item} Online Inference

Run the following script to start the vLLM server on Multi-NPU.

For an Atlas A2 with 64GB of NPU card memory, `tensor-parallel-size` should be at least 4; for 32GB of memory, it should be at least 8.

```bash
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --max-model-len 4096 --gpu-memory-utilization 0.7 --enforce-eager
```
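That sizing rule can be approximated by dividing the model's weight footprint across cards. A rough sketch only: the ~150 GB bf16 weight size and the 0.7 utilization factor are our assumptions, and KV cache and activation overhead are ignored entirely:

```python
def min_tensor_parallel(model_mem_gb, card_mem_gb, util=0.7):
    """Smallest power-of-two TP degree whose pooled memory can hold the weights.
    Rough sizing sketch: ignores KV cache and activation overhead."""
    usable = card_mem_gb * util
    tp = 1
    while tp * usable < model_mem_gb:
        tp *= 2
    return tp

# Assumption: an 80B-parameter model in bf16 weighs roughly 150 GB (2 bytes/param).
print(min_tensor_parallel(150, 64))  # 4
print(min_tensor_parallel(150, 32))  # 8
```

The results match the guidance above: at least TP=4 on 64GB cards and TP=8 on 32GB cards.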
Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-Next-80B-A3B-Instruct",
  "messages": [
    {"role": "user", "content": "Who are you?"}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 32
}'
```

::::
::::{tab-item} Offline Inference

Run the following script to execute offline inference on Multi-NPU:
```python
import gc

import torch

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)

def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()

if __name__ == '__main__':
    prompts = [
        "Who are you?",
    ]
    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
    llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct",
              tensor_parallel_size=4,
              enforce_eager=True,
              distributed_executor_backend="mp",
              gpu_memory_utilization=0.7,
              max_model_len=4096)

    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    del llm
    clean_up()
```
If you run this script successfully, you can see the info shown below:

```bash
Prompt: 'Who are you?', Generated text: ' What do you know about me?\n\nHello! I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am'
```

::::
:::::