fix wrong --num-gpus parameter requirements, and avoid ambiguity (#3116)

fix the problem of https://github.com/vllm-project/vllm-ascend/issues/3114 - vLLM version: v0.10.2 - vLLM main: 5aeb925452 Signed-off-by: Jianwei Mao <maojianwei2012@126.com>
2025-09-23 11:58:44 +08:00
parent 39a85c49fa
commit d586255678
1 changed files with 9 additions and 7 deletions
--- a/docs/source/tutorials/multi_node_ray.md
+++ b/docs/source/tutorials/multi_node_ray.md
@@ -91,7 +91,7 @@ After setting up the containers and installing vllm-ascend on each node, follow
 Choose one machine as the head node and the others as worker nodes. Before proceeding, use `ip addr` to check your `nic_name` (network interface name).
-Set the `ASCEND_RT_VISIBLE_DEVICES` environment variable to specify the NPU devices to use. For Ray versions above 2.1, also set the `RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` variable to avoid device recognition issues. The `--num-gpus` parameter defines the number of NPUs to be used on each node.
+Set the `ASCEND_RT_VISIBLE_DEVICES` environment variable to specify the NPU devices to use. For Ray versions above 2.1, also set the `RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` variable to avoid device recognition issues.
 Below are the commands for the head and worker nodes:
@@ -109,7 +109,7 @@ export GLOO_SOCKET_IFNAME={nic_name}
 export TP_SOCKET_IFNAME={nic_name}
 export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
 export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
-ray start --head --num-gpus=8
+ray start --head
 ```
 **Worker node**:
@@ -125,20 +125,22 @@ export GLOO_SOCKET_IFNAME={nic_name}
 export TP_SOCKET_IFNAME={nic_name}
 export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
 export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
-ray start --address='{head_node_ip}:6379' --num-gpus=8 --node-ip-address={local_ip}
+ray start --address='{head_node_ip}:6379' --node-ip-address={local_ip}
 ```
 Once the cluster is started on multiple nodes, execute `ray status` and `ray list nodes` to verify the Ray cluster's status. You should see the correct number of nodes and NPUs listed.
-## Start the Online Inference Service on multinode
+## Start the Online Inference Service on multinode scenario
-In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster. You only need to run the vllm command on one node.
+In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster.
 **You only need to run the vllm command on one node.**
 To set up parallelism, the common practice is to set the `tensor-parallel-size` to the number of NPUs per node, and the `pipeline-parallel-size` to the number of nodes.
 For example, with 16 NPUs across 2 nodes (8 NPUs per node), set the tensor parallel size to 8 and the pipeline parallel size to 2:
 ```shell
-vllm Qwen/Qwen3-235B-A22B \
+vllm serve Qwen/Qwen3-235B-A22B \
  --distributed-executor-backend ray \
  --pipeline-parallel-size 2 \
  --tensor-parallel-size 8 \
@@ -154,7 +156,7 @@ vllm Qwen/Qwen3-235B-A22B \
 Alternatively, if you want to use only tensor parallelism, set the tensor parallel size to the total number of NPUs in the cluster. For example, with 16 NPUs across 2 nodes, set the tensor parallel size to 16:
 ```shell
-vllm Qwen/Qwen3-235B-A22B \
+vllm serve Qwen/Qwen3-235B-A22B \
  --distributed-executor-backend ray \
  --tensor-parallel-size 16 \
  --enable-expert-parallel \