[Doc] Add multi node data parallel doc (#1685)
### What this PR does / why we need it?
add multi node data parallel doc
### Does this PR introduce _any_ user-facing change?
add multi node data parallel doc
### How was this patch tested?
- vLLM version: v0.9.1
- vLLM main:
805d62ca88
Signed-off-by: wangli <wangli858794774@gmail.com>
This commit is contained in:
BIN
docs/source/assets/multi_node_dp.png
Normal file
BIN
docs/source/assets/multi_node_dp.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 115 KiB |
@@ -1,11 +1,19 @@
|
|||||||
# Multi-Node (DeepSeek)
|
# Multi-Node-DP (DeepSeek)
|
||||||
|
|
||||||
Multi-node inference is suitable for scenarios where the model cannot be deployed on a single NPU. In such cases, the model can be distributed using tensor parallelism and pipeline parallelism. The specific parallelism strategies will be covered in the following sections. To successfully deploy multi-node inference, the following three steps need to be completed:
|
## Getting Start
|
||||||
|
vLLM-Ascend now supports Data Parallel (DP) deployment, enabling model weights to be replicated across multiple NPUs or instances, each processing independent batches of requests. This is particularly useful for scaling throughput across devices while maintaining high resource utilization.
|
||||||
|
|
||||||
* **Verify Multi-Node Communication Environment**
|
Each DP rank is deployed as a separate “core engine” process which communicates with front-end process(es) via ZMQ sockets. Data Parallel can be combined with Tensor Parallel, in which case each DP engine owns a number of per-NPU worker processes equal to the TP size.
|
||||||
* **Set Up and Start the Ray Cluster**
|
|
||||||
* **Start the Online Inference Service on multinode**
|
|
||||||
|
|
||||||
|
For Mixture-of-Experts (MoE) models — especially advanced architectures like DeepSeek that utilize Multi-head Latent Attention (MLA) — a hybrid parallelism approach is recommended:
|
||||||
|
- Use **Data Parallelism (DP)** for attention layers, which are replicated across devices and handle separate batches.
|
||||||
|
- Use **Expert or Tensor Parallelism (EP/TP)** for expert layers, which are sharded across devices to distribute the computation.
|
||||||
|
|
||||||
|
This division enables attention layers to be replicated across Data Parallel (DP) ranks, enabling them to process different batches independently. Meanwhile, expert layers are partitioned (sharded) across devices using Expert or Tensor Parallelism(DP*TP), maximizing hardware utilization and efficiency.
|
||||||
|
|
||||||
|
In these cases the data parallel ranks are not completely independent, forward passes must be aligned and expert layers across all ranks are required to synchronize during every forward pass, even if there are fewer requests to be processed than DP ranks.
|
||||||
|
|
||||||
|
For MoE models, when any requests are in progress in any rank, we must ensure that empty “dummy” forward passes are performed in all ranks which don’t currently have any requests scheduled. This is handled via a separate DP `Coordinator` process which communicates with all of the ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).
|
||||||
|
|
||||||
## Verify Multi-Node Communication Environment
|
## Verify Multi-Node Communication Environment
|
||||||
|
|
||||||
@@ -45,24 +53,19 @@ for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
|
|||||||
hccn_tool -i 0 -ping -g address 10.20.0.20
|
hccn_tool -i 0 -ping -g address 10.20.0.20
|
||||||
```
|
```
|
||||||
|
|
||||||
## Set Up and Start the Ray Cluster
|
## Run with docker
|
||||||
### Setting Up the Basic Container
|
Assume you have two Altas 800 A2(64G*8) nodes, and want to deploy the `deepseek-v3-w8a8` quantitative model across multi-node.
|
||||||
To ensure a consistent execution environment across all nodes, including the model path and Python environment, it is recommended to use Docker images.
|
|
||||||
|
|
||||||
For setting up a multi-node inference cluster with Ray, **containerized deployment** is the preferred approach. Containers should be started on both the master and worker nodes, with the `--net=host` option to enable proper network connectivity.
|
|
||||||
|
|
||||||
Below is the example container setup command, which should be executed on **all nodes** :
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
```shell
|
```shell
|
||||||
# Define the image and container name
|
# Define the image and container name
|
||||||
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
|
export IMAGE=quay.io/ascend/vllm-ascend:main
|
||||||
export NAME=vllm-ascend
|
export NAME=vllm-ascend
|
||||||
|
|
||||||
# Run the container using the defined variables
|
# Run the container using the defined variables
|
||||||
|
# Note if you are running bridge network with docker, Please expose available ports for multiple nodes communication in advance
|
||||||
docker run --rm \
|
docker run --rm \
|
||||||
--name $NAME \
|
--name $NAME \
|
||||||
|
--net=host \
|
||||||
--device /dev/davinci0 \
|
--device /dev/davinci0 \
|
||||||
--device /dev/davinci1 \
|
--device /dev/davinci1 \
|
||||||
--device /dev/davinci2 \
|
--device /dev/davinci2 \
|
||||||
@@ -75,121 +78,121 @@ docker run --rm \
|
|||||||
--device /dev/devmm_svm \
|
--device /dev/devmm_svm \
|
||||||
--device /dev/hisi_hdc \
|
--device /dev/hisi_hdc \
|
||||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||||
|
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
|
||||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||||
-v /root/.cache:/root/.cache \
|
-v /mnt/sfs_turbo/.cache:/root/.cache \
|
||||||
-p 8000:8000 \
|
|
||||||
-it $IMAGE bash
|
-it $IMAGE bash
|
||||||
```
|
```
|
||||||
|
|
||||||
### Start Ray Cluster
|
|
||||||
After setting up the containers and installing vllm-ascend on each node, follow the steps below to start the Ray cluster and execute inference tasks.
|
|
||||||
|
|
||||||
Choose one machine as the head node and the others as worker nodes. Before proceeding, use `ip addr` to check your `nic_name` (network interface name).
|
|
||||||
|
|
||||||
Set the `ASCEND_RT_VISIBLE_DEVICES` environment variable to specify the NPU devices to use. For Ray versions above 2.1, also set the `RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` variable to avoid device recognition issues. The `--num-gpus` parameter defines the number of NPUs to be used on each node.
|
|
||||||
|
|
||||||
Below are the commands for the head and worker nodes:
|
|
||||||
|
|
||||||
**Head node**:
|
|
||||||
|
|
||||||
:::{note}
|
:::{note}
|
||||||
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect.
|
Before launch the inference server, ensure some environment variables are set for multi node communication
|
||||||
Updating the environment variables requires restarting the Ray cluster.
|
|
||||||
:::
|
:::
|
||||||
|
|
||||||
|
Run the following scripts on two nodes respectively
|
||||||
|
|
||||||
|
**node0**
|
||||||
```shell
|
```shell
|
||||||
# Head node
|
#!/bin/sh
|
||||||
export HCCL_IF_IP={local_ip}
|
|
||||||
export GLOO_SOCKET_IFNAME={nic_name}
|
|
||||||
export TP_SOCKET_IFNAME={nic_name}
|
|
||||||
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
|
|
||||||
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
|
|
||||||
ray start --head --num-gpus=8
|
|
||||||
```
|
|
||||||
**Worker node**:
|
|
||||||
|
|
||||||
:::{note}
|
# this obtained through ifconfig
|
||||||
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect. Updating the environment variables requires restarting the Ray cluster.
|
# nic_name is the network interface name corresponding to local_ip
|
||||||
:::
|
nic_name="xxxx"
|
||||||
|
local_ip="xxxx"
|
||||||
|
|
||||||
```shell
|
export HCCL_IF_IP=$local_ip
|
||||||
# Worker node
|
export GLOO_SOCKET_IFNAME=$nic_name
|
||||||
export HCCL_IF_IP={local_ip}
|
export TP_SOCKET_IFNAME=$nic_name
|
||||||
export GLOO_SOCKET_IFNAME={nic_name}
|
export HCCL_SOCKET_IFNAME=$nic_name
|
||||||
export TP_SOCKET_IFNAME={nic_name}
|
export OMP_PROC_BIND=false
|
||||||
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
|
export OMP_NUM_THREADS=100
|
||||||
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
|
export VLLM_USE_V1=1
|
||||||
ray start --address='{head_node_ip}:{port_num}' --num-gpus=8 --node-ip-address={local_ip}
|
export HCCL_BUFFSIZE=1024
|
||||||
```
|
|
||||||
:::{tip}
|
|
||||||
Before starting the Ray cluster, set the `export ASCEND_PROCESS_LOG_PATH={plog_save_path}` environment variable on each node to redirect the Ascend plog, which helps in debugging issues during multi-node execution.
|
|
||||||
:::
|
|
||||||
|
|
||||||
|
# The w8a8 weight can obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3-W8A8
|
||||||
Once the cluster is started on multiple nodes, execute `ray status` and `ray list nodes` to verify the Ray cluster's status. You should see the correct number of nodes and NPUs listed.
|
# If you want to the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/quantization.html
|
||||||
|
vllm serve /root/.cache/ds_v3 \
|
||||||
|
--host 0.0.0.0 \
|
||||||
## Start the Online Inference Service on multinode
|
--port 8004 \
|
||||||
In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster. You only need to run the vllm command on one node.
|
--data-parallel-size 4 \
|
||||||
|
--data-parallel-size-local 2 \
|
||||||
To set up parallelism, the common practice is to set the `tensor-parallel-size` to the number of NPUs per node, and the `pipeline-parallel-size` to the number of nodes.
|
--data-parallel-address $local_ip \
|
||||||
|
--data-parallel-rpc-port 13389 \
|
||||||
For example, with 16 NPUs across 2 nodes (8 NPUs per node), set the tensor parallel size to 8 and the pipeline parallel size to 2:
|
--tensor-parallel-size 4 \
|
||||||
|
--seed 1024 \
|
||||||
```shell
|
--served-model-name deepseek_v3 \
|
||||||
python -m vllm.entrypoints.openai.api_server \
|
--enable-expert-parallel \
|
||||||
--model="Deepseek/DeepSeek-V2-Lite-Chat" \
|
--max-num-seqs 16 \
|
||||||
--trust-remote-code \
|
--max-model-len 32768 \
|
||||||
--enforce-eager \
|
--quantization ascend \
|
||||||
--distributed_executor_backend "ray" \
|
--max-num-batched-tokens 4096 \
|
||||||
--tensor-parallel-size 8 \
|
--trust-remote-code \
|
||||||
--pipeline-parallel-size 2 \
|
--no-enable-prefix-caching \
|
||||||
--disable-frontend-multiprocessing \
|
--gpu-memory-utilization 0.9 \
|
||||||
--port {port_num}
|
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
|
||||||
```
|
|
||||||
:::{note}
|
|
||||||
Pipeline parallelism currently requires AsyncLLMEngine, hence the `--disable-frontend-multiprocessing` is set.
|
|
||||||
:::
|
|
||||||
|
|
||||||
Alternatively, if you want to use only tensor parallelism, set the tensor parallel size to the total number of NPUs in the cluster. For example, with 16 NPUs across 2 nodes, set the tensor parallel size to 16:
|
|
||||||
```shell
|
|
||||||
python -m vllm.entrypoints.openai.api_server \
|
|
||||||
--model="Deepseek/DeepSeek-V2-Lite-Chat" \
|
|
||||||
--trust-remote-code \
|
|
||||||
--distributed_executor_backend "ray" \
|
|
||||||
--enforce-eager \
|
|
||||||
--tensor-parallel-size 16 \
|
|
||||||
--port {port_num}
|
|
||||||
```
|
```
|
||||||
|
|
||||||
:::{note}
|
**node1**
|
||||||
If you're running DeepSeek V3/R1, please remove `quantization_config` section in `config.json` file since it's not supported by vllm-ascend currently.
|
```shell
|
||||||
:::
|
#!/bin/sh
|
||||||
|
|
||||||
|
nic_name="xxx"
|
||||||
|
local_ip="xxx"
|
||||||
|
|
||||||
|
export HCCL_IF_IP=$local_ip
|
||||||
|
export GLOO_SOCKET_IFNAME=$nic_name
|
||||||
|
export TP_SOCKET_IFNAME=$nic_name
|
||||||
|
export HCCL_SOCKET_IFNAME=$nic_name
|
||||||
|
export OMP_PROC_BIND=false
|
||||||
|
export OMP_NUM_THREADS=100
|
||||||
|
export VLLM_USE_V1=1
|
||||||
|
export HCCL_BUFFSIZE=1024
|
||||||
|
|
||||||
|
vllm serve /root/.cache/ds_v3 \
|
||||||
|
--host 0.0.0.0 \
|
||||||
|
--port 8004 \
|
||||||
|
--headless \
|
||||||
|
--data-parallel-size 4 \
|
||||||
|
--data-parallel-size-local 2 \
|
||||||
|
--data-parallel-start-rank 2 \
|
||||||
|
--data-parallel-address { node0 ip } \
|
||||||
|
--data-parallel-rpc-port 13389 \
|
||||||
|
--tensor-parallel-size 4 \
|
||||||
|
--seed 1024 \
|
||||||
|
--quantization ascend \
|
||||||
|
--served-model-name deepseek_v3 \
|
||||||
|
--max-num-seqs 16 \
|
||||||
|
--max-model-len 32768 \
|
||||||
|
--max-num-batched-tokens 4096 \
|
||||||
|
--enable-expert-parallel \
|
||||||
|
--trust-remote-code \
|
||||||
|
--no-enable-prefix-caching \
|
||||||
|
--gpu-memory-utilization 0.92 \
|
||||||
|
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
|
||||||
|
```
|
||||||
|
|
||||||
|
The Deployment view looks like:
|
||||||
|

|
||||||
|
|
||||||
Once your server is started, you can query the model with input prompts:
|
Once your server is started, you can query the model with input prompts:
|
||||||
|
|
||||||
```shell
|
```shell
|
||||||
curl -X POST http://127.0.0.1:{prot_num}/v1/completions \
|
curl http://{ node0 ip:8004 }/v1/completions \
|
||||||
-H "Content-Type: application/json" \
|
-H "Content-Type: application/json" \
|
||||||
-d '{
|
-d '{
|
||||||
"model": "Deepseek/DeepSeek-V2-Lite-Chat",
|
"model": "/root/.cache/ds_v3",
|
||||||
"prompt": "The future of AI is",
|
"prompt": "The future of AI is",
|
||||||
"max_tokens": 24
|
"max_tokens": 50,
|
||||||
}'
|
"temperature": 0
|
||||||
|
}'
|
||||||
```
|
```
|
||||||
|
|
||||||
If you query the server successfully, you can see the info shown below (client):
|
## Run benchmarks
|
||||||
|
For details please refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks)
|
||||||
```
|
```shell
|
||||||
{"id":"cmpl-6dfb5a8d8be54d748f0783285dd52303","object":"text_completion","created":1739957835,"model":"/home/data/DeepSeek-V2-Lite-Chat/","choices":[{"index":0,"text":" heavily influenced by neuroscience and cognitiveGuionistes. The goalochondria is to combine the efforts of researchers, technologists,","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":6,"total_tokens":30,"completion_tokens":24,"prompt_tokens_details":null}}
|
vllm bench serve --model /root/.cache/ds_v3 --served-model-name deepseek_v3 \
|
||||||
```
|
--dataset-name random --random-input-len 128 --random-output-len 128 \
|
||||||
|
--num-prompts 200 --trust-remote-code --base-url "http://{ node0 ip }:8004" --request-rate 1
|
||||||
Logs of the vllm server:
|
|
||||||
|
|
||||||
```
|
|
||||||
INFO: 127.0.0.1:59384 - "POST /v1/completions HTTP/1.1" 200 OK
|
|
||||||
INFO 02-19 17:37:35 metrics.py:453 Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.9 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, NPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
|
|
||||||
```
|
```
|
||||||
|
|||||||
Reference in New Issue
Block a user