[Doc] Add multi node data parallel doc (#1685)

### What this PR does / why we need it? add multi node data parallel doc ### Does this PR introduce _any_ user-facing change? add multi node data parallel doc ### How was this patch tested? - vLLM version: v0.9.1 - vLLM main: 805d62ca88 Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-10 09:36:37 +08:00
parent b4b19ea588
commit 0c4aa2b4f1
2 changed files with 113 additions and 110 deletions
--- a/docs/source/assets/multi_node_dp.png
+++ b/docs/source/assets/multi_node_dp.png
--- a/docs/source/tutorials/multi_node.md
+++ b/docs/source/tutorials/multi_node.md
@@ -1,11 +1,19 @@
-# Multi-Node (DeepSeek)
+# Multi-Node-DP (DeepSeek)
-Multi-node inference is suitable for scenarios where the model cannot be deployed on a single NPU. In such cases, the model can be distributed using tensor parallelism and pipeline parallelism. The specific parallelism strategies will be covered in the following sections. To successfully deploy multi-node inference, the following three steps need to be completed:
+## Getting Start
 vLLM-Ascend now supports Data Parallel (DP) deployment, enabling model weights to be replicated across multiple NPUs or instances, each processing independent batches of requests. This is particularly useful for scaling throughput across devices while maintaining high resource utilization.
-* **Verify Multi-Node Communication Environment** 
+Each DP rank is deployed as a separate “core engine” process which communicates with front-end process(es) via ZMQ sockets. Data Parallel can be combined with Tensor Parallel, in which case each DP engine owns a number of per-NPU worker processes equal to the TP size.
 * **Set Up and Start the Ray Cluster**
 * **Start the Online Inference Service on multinode**
 For Mixture-of-Experts (MoE) models — especially advanced architectures like DeepSeek that utilize Multi-head Latent Attention (MLA) — a hybrid parallelism approach is recommended:
    - Use **Data Parallelism (DP)** for attention layers, which are replicated across devices and handle separate batches.
    - Use **Expert or Tensor Parallelism (EP/TP)** for expert layers, which are sharded across devices to distribute the computation.
 This division enables attention layers to be replicated across Data Parallel (DP) ranks, enabling them to process different batches independently. Meanwhile, expert layers are partitioned (sharded) across devices using Expert or Tensor Parallelism(DP*TP), maximizing hardware utilization and efficiency.
 In these cases the data parallel ranks are not completely independent, forward passes must be aligned and expert layers across all ranks are required to synchronize during every forward pass, even if there are fewer requests to be processed than DP ranks.
 For MoE models, when any requests are in progress in any rank, we must ensure that empty “dummy” forward passes are performed in all ranks which don’t currently have any requests scheduled. This is handled via a separate DP `Coordinator` process which communicates with all of the ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).
 ## Verify Multi-Node Communication Environment
@@ -45,24 +53,19 @@ for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
 hccn_tool -i 0 -ping -g address 10.20.0.20
 ```
-## Set Up and Start the Ray Cluster
+## Run with docker
-### Setting Up the Basic Container
+Assume you have two Altas 800 A2(64G*8) nodes, and want to deploy the `deepseek-v3-w8a8` quantitative model across multi-node.
 To ensure a consistent execution environment across all nodes, including the model path and Python environment, it is recommended to use Docker images.
 For setting up a multi-node inference cluster with Ray, **containerized deployment** is the preferred approach. Containers should be started on both the master and worker nodes, with the `--net=host` option to enable proper network connectivity.
 Below is the example container setup command, which should be executed on **all nodes** :
 ```shell
 # Define the image and container name
-export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
+export IMAGE=quay.io/ascend/vllm-ascend:main
 export NAME=vllm-ascend
 # Run the container using the defined variables
 # Note if you are running bridge network with docker, Please expose available ports for multiple nodes communication in advance
 docker run --rm \
 --name $NAME \
 --net=host \
 --device /dev/davinci0 \
 --device /dev/davinci1 \
 --device /dev/davinci2 \
@@ -75,121 +78,121 @@ docker run --rm \
 --device /dev/devmm_svm \
 --device /dev/hisi_hdc \
 -v /usr/local/dcmi:/usr/local/dcmi \
 -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
 -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
 -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
 -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
 -v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
+-v /mnt/sfs_turbo/.cache:/root/.cache \
 -p 8000:8000 \
 -it $IMAGE bash
 ```
 ### Start Ray Cluster
 After setting up the containers and installing vllm-ascend on each node, follow the steps below to start the Ray cluster and execute inference tasks.
 Choose one machine as the head node and the others as worker nodes. Before proceeding, use `ip addr` to check your `nic_name` (network interface name).
 Set the `ASCEND_RT_VISIBLE_DEVICES` environment variable to specify the NPU devices to use. For Ray versions above 2.1, also set the `RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` variable to avoid device recognition issues. The `--num-gpus` parameter defines the number of NPUs to be used on each node.
 Below are the commands for the head and worker nodes:
 **Head node**:
 :::{note}
-When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect. 
+Before launch the inference server, ensure some environment variables are set for multi node communication
 Updating the environment variables requires restarting the Ray cluster.
 :::
 Run the following scripts on two nodes respectively
 **node0**
 ```shell
-# Head node
+#!/bin/sh
 export HCCL_IF_IP={local_ip}
 export GLOO_SOCKET_IFNAME={nic_name}
 export TP_SOCKET_IFNAME={nic_name}
 export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
 export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
 ray start --head --num-gpus=8
 ```
 **Worker node**:
-:::{note}
+# this obtained through ifconfig
-When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect. Updating the environment variables requires restarting the Ray cluster.
+# nic_name is the network interface name corresponding to local_ip
-:::
+nic_name="xxxx"
 local_ip="xxxx"
-```shell
+export HCCL_IF_IP=$local_ip
-# Worker node
+export GLOO_SOCKET_IFNAME=$nic_name
-export HCCL_IF_IP={local_ip}
+export TP_SOCKET_IFNAME=$nic_name
-export GLOO_SOCKET_IFNAME={nic_name}
+export HCCL_SOCKET_IFNAME=$nic_name
-export TP_SOCKET_IFNAME={nic_name}
+export OMP_PROC_BIND=false
-export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1 
+export OMP_NUM_THREADS=100
-export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+export VLLM_USE_V1=1
-ray start --address='{head_node_ip}:{port_num}' --num-gpus=8 --node-ip-address={local_ip}
+export HCCL_BUFFSIZE=1024
 ```
 :::{tip}
 Before starting the Ray cluster, set the `export ASCEND_PROCESS_LOG_PATH={plog_save_path}` environment variable on each node to redirect the Ascend plog, which helps in debugging issues during multi-node execution.
 :::
-
+# The w8a8 weight can obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3-W8A8
-Once the cluster is started on multiple nodes, execute `ray status` and `ray list nodes` to verify the Ray cluster's status. You should see the correct number of nodes and NPUs listed.
+# If you want to the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/quantization.html
-
+vllm serve /root/.cache/ds_v3 \
-
+--host 0.0.0.0 \
-## Start the Online Inference Service on multinode
+--port 8004 \
-In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster. You only need to run the vllm command on one node. 
+--data-parallel-size 4 \
-
+--data-parallel-size-local 2 \
-To set up parallelism, the common practice is to set the `tensor-parallel-size` to the number of NPUs per node, and the `pipeline-parallel-size` to the number of nodes.
+--data-parallel-address $local_ip \
-
+--data-parallel-rpc-port 13389 \
-For example, with 16 NPUs across 2 nodes (8 NPUs per node), set the tensor parallel size to 8 and the pipeline parallel size to 2:
+--tensor-parallel-size 4 \
-
+--seed 1024 \
-```shell
+--served-model-name deepseek_v3 \
-python -m vllm.entrypoints.openai.api_server \
+--enable-expert-parallel \
-       --model="Deepseek/DeepSeek-V2-Lite-Chat" \
+--max-num-seqs 16 \
-       --trust-remote-code \
+--max-model-len 32768 \
-       --enforce-eager \
+--quantization ascend \
-       --distributed_executor_backend "ray" \
+--max-num-batched-tokens 4096 \
-       --tensor-parallel-size 8 \
+--trust-remote-code \
-       --pipeline-parallel-size 2 \
+--no-enable-prefix-caching \
-       --disable-frontend-multiprocessing \
+--gpu-memory-utilization 0.9 \
-       --port {port_num}
+--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
 ```
 :::{note}
 Pipeline parallelism currently requires AsyncLLMEngine, hence the `--disable-frontend-multiprocessing`  is set.
 :::
 Alternatively, if you want to use only tensor parallelism, set the tensor parallel size to the total number of NPUs in the cluster. For example, with 16 NPUs across 2 nodes, set the tensor parallel size to 16:
 ```shell
 python -m vllm.entrypoints.openai.api_server \
       --model="Deepseek/DeepSeek-V2-Lite-Chat" \
       --trust-remote-code \
       --distributed_executor_backend "ray" \
       --enforce-eager \
       --tensor-parallel-size 16 \
       --port {port_num}
 ```
-:::{note}
+**node1**
-If you're running DeepSeek V3/R1, please remove `quantization_config` section in `config.json` file since it's not supported by vllm-ascend currently.
+```shell
-:::
+#!/bin/sh
 nic_name="xxx"
 local_ip="xxx"
 export HCCL_IF_IP=$local_ip
 export GLOO_SOCKET_IFNAME=$nic_name
 export TP_SOCKET_IFNAME=$nic_name
 export HCCL_SOCKET_IFNAME=$nic_name
 export OMP_PROC_BIND=false
 export OMP_NUM_THREADS=100
 export VLLM_USE_V1=1
 export HCCL_BUFFSIZE=1024
 vllm serve /root/.cache/ds_v3 \
 --host 0.0.0.0 \
 --port 8004 \
 --headless \
 --data-parallel-size 4 \
 --data-parallel-size-local 2 \
 --data-parallel-start-rank 2 \
 --data-parallel-address { node0 ip } \
 --data-parallel-rpc-port 13389 \
 --tensor-parallel-size 4 \
 --seed 1024 \
 --quantization ascend \
 --served-model-name deepseek_v3 \
 --max-num-seqs 16 \
 --max-model-len 32768 \
 --max-num-batched-tokens 4096 \
 --enable-expert-parallel \
 --trust-remote-code \
 --no-enable-prefix-caching \
 --gpu-memory-utilization 0.92 \
 --additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
 ```
 The Deployment view looks like: 
 ![alt text](../assets/multi_node_dp.png)
 Once your server is started, you can query the model with input prompts:
 ```shell
-curl -X POST http://127.0.0.1:{prot_num}/v1/completions  \
+curl http://{ node0 ip:8004 }/v1/completions \
-     -H "Content-Type: application/json" \
+    -H "Content-Type: application/json" \
-     -d '{
+    -d '{
-         "model": "Deepseek/DeepSeek-V2-Lite-Chat",
+        "model": "/root/.cache/ds_v3",
-         "prompt": "The future of AI is",
+        "prompt": "The future of AI is",
-         "max_tokens": 24
+        "max_tokens": 50,
-     }'
+        "temperature": 0
    }'
 ```
-If you query the server successfully, you can see the info shown below (client):
+## Run benchmarks
-
+For details please refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks)
-```
+```shell
-{"id":"cmpl-6dfb5a8d8be54d748f0783285dd52303","object":"text_completion","created":1739957835,"model":"/home/data/DeepSeek-V2-Lite-Chat/","choices":[{"index":0,"text":" heavily influenced by neuroscience and cognitiveGuionistes. The goalochondria is to combine the efforts of researchers, technologists,","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":6,"total_tokens":30,"completion_tokens":24,"prompt_tokens_details":null}}
+vllm bench serve --model /root/.cache/ds_v3  --served-model-name deepseek_v3 \
-```
+--dataset-name random --random-input-len 128 --random-output-len 128 \
-
+--num-prompts 200  --trust-remote-code --base-url "http://{ node0 ip }:8004" --request-rate 1
 Logs of the vllm server:
 ```
 INFO:     127.0.0.1:59384 - "POST /v1/completions HTTP/1.1" 200 OK
 INFO 02-19 17:37:35 metrics.py:453 Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.9 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, NPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
 ```