diff --git a/docs/source/tutorials/multi_node_ray.md b/docs/source/tutorials/multi_node_ray.md
index 827976a..ad1a8d6 100644
--- a/docs/source/tutorials/multi_node_ray.md
+++ b/docs/source/tutorials/multi_node_ray.md
@@ -91,7 +91,7 @@ After setting up the containers and installing vllm-ascend on each node, follow
 
 Choose one machine as the head node and the others as worker nodes. Before proceeding, use `ip addr` to check your `nic_name` (network interface name).
 
-Set the `ASCEND_RT_VISIBLE_DEVICES` environment variable to specify the NPU devices to use. For Ray versions above 2.1, also set the `RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` variable to avoid device recognition issues. The `--num-gpus` parameter defines the number of NPUs to be used on each node.
+Set the `ASCEND_RT_VISIBLE_DEVICES` environment variable to specify the NPU devices to use. For Ray versions above 2.1, also set the `RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` variable to avoid device recognition issues.
 
 Below are the commands for the head and worker nodes:
 
@@ -109,7 +109,7 @@ export HCCL_IF_IP={local_ip}
 export GLOO_SOCKET_IFNAME={nic_name}
 export TP_SOCKET_IFNAME={nic_name}
 export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
 export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
-ray start --head --num-gpus=8
+ray start --head
 ```
 
 **Worker node**:
@@ -125,20 +125,22 @@ export HCCL_IF_IP={local_ip}
 export GLOO_SOCKET_IFNAME={nic_name}
 export TP_SOCKET_IFNAME={nic_name}
 export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
 export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
-ray start --address='{head_node_ip}:6379' --num-gpus=8 --node-ip-address={local_ip}
+ray start --address='{head_node_ip}:6379' --node-ip-address={local_ip}
 ```
 
 Once the cluster is started on multiple nodes, execute `ray status` and `ray list nodes` to verify the Ray cluster's status. You should see the correct number of nodes and NPUs listed.
 
-## Start the Online Inference Service on multinode
-In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster. You only need to run the vllm command on one node.
+## Start the Online Inference Service in a Multi-Node Scenario
+In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster.
+
+**You only need to run the vllm command on one node.**
 
 To set up parallelism, the common practice is to set the `tensor-parallel-size` to the number of NPUs per node, and the `pipeline-parallel-size` to the number of nodes. For example, with 16 NPUs across 2 nodes (8 NPUs per node), set the tensor parallel size to 8 and the pipeline parallel size to 2:
 
 ```shell
-vllm Qwen/Qwen3-235B-A22B \
+vllm serve Qwen/Qwen3-235B-A22B \
 --distributed-executor-backend ray \
 --pipeline-parallel-size 2 \
 --tensor-parallel-size 8 \
@@ -154,7 +156,7 @@ vllm Qwen/Qwen3-235B-A22B \
 Alternatively, if you want to use only tensor parallelism, set the tensor parallel size to the total number of NPUs in the cluster. For example, with 16 NPUs across 2 nodes, set the tensor parallel size to 16:
 
 ```shell
-vllm Qwen/Qwen3-235B-A22B \
+vllm serve Qwen/Qwen3-235B-A22B \
 --distributed-executor-backend ray \
 --tensor-parallel-size 16 \
 --enable-expert-parallel \