diff --git a/docs/source/tutorials/index.md b/docs/source/tutorials/index.md
index c2236a0..971e6e0 100644
--- a/docs/source/tutorials/index.md
+++ b/docs/source/tutorials/index.md
@@ -16,6 +16,7 @@ multi_npu_quantization
 single_node_300i
 multi_node
 multi_node_kimi
+multi_node_qwen3vl
 multi_node_pd_disaggregation
 multi_node_ray
 :::
diff --git a/docs/source/tutorials/multi_node_qwen3vl.md b/docs/source/tutorials/multi_node_qwen3vl.md
new file mode 100644
index 0000000..40a4d2a
--- /dev/null
+++ b/docs/source/tutorials/multi_node_qwen3vl.md
@@ -0,0 +1,156 @@
+# Multi-Node-DP (Qwen3-VL-235B-A22B)
+
+## Verify Multi-Node Communication Environment
+
+Refer to the verification process in [multi_node.md](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html#verification-process).
+
+## Run with docker
+Assume you have two Atlas 800 A3 (64G*16) nodes (or 2 * A2 nodes) and want to deploy the `Qwen3-VL-235B-A22B-Instruct` model across them.
+
+```{code-block} bash
+ :substitutions:
+# Update the vllm-ascend image
+export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
+docker run --rm \
+--name vllm-ascend \
+--net=host \
+--device /dev/davinci0 \
+--device /dev/davinci1 \
+--device /dev/davinci2 \
+--device /dev/davinci3 \
+--device /dev/davinci4 \
+--device /dev/davinci5 \
+--device /dev/davinci6 \
+--device /dev/davinci7 \
+--device /dev/davinci8 \
+--device /dev/davinci9 \
+--device /dev/davinci10 \
+--device /dev/davinci11 \
+--device /dev/davinci12 \
+--device /dev/davinci13 \
+--device /dev/davinci14 \
+--device /dev/davinci15 \
+--device /dev/davinci_manager \
+--device /dev/devmm_svm \
+--device /dev/hisi_hdc \
+-v /usr/local/dcmi:/usr/local/dcmi \
+-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
+-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /root/.cache:/root/.cache \
+-p 8000:8000 \
+-it $IMAGE bash
+```
+
+Run the following scripts on the two nodes respectively.
+
+:::{note}
+Before launching the inference server, ensure the following environment variables are set for multi-node communication.
+:::
+
+node0
+
+```shell
+#!/bin/sh
+# Obtain these values via ifconfig;
+# nic_name is the network interface name corresponding to local_ip
+nic_name="xxxx"
+local_ip="xxxx"
+
+export HCCL_IF_IP=$local_ip
+export GLOO_SOCKET_IFNAME=$nic_name
+export TP_SOCKET_IFNAME=$nic_name
+export HCCL_SOCKET_IFNAME=$nic_name
+export OMP_PROC_BIND=false
+export OMP_NUM_THREADS=100
+export VLLM_USE_V1=1
+export HCCL_BUFFSIZE=1024
+
+vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
+--host 0.0.0.0 \
+--port 8000 \
+--data-parallel-size 2 \
+--api-server-count 2 \
+--data-parallel-size-local 1 \
+--data-parallel-address $local_ip \
+--data-parallel-rpc-port 13389 \
+--seed 1024 \
+--served-model-name qwen3vl \
+--tensor-parallel-size 8 \
+--enable-expert-parallel \
+--max-num-seqs 16 \
+--max-model-len 32768 \
+--max-num-batched-tokens 4096 \
+--trust-remote-code \
+--no-enable-prefix-caching \
+--gpu-memory-utilization 0.8
+```
+
+node1
+
+```shell
+#!/bin/sh
+
+nic_name="xxxx"
+local_ip="xxxx"
+node0_ip="xxxx"
+
+export HCCL_IF_IP=$local_ip
+export GLOO_SOCKET_IFNAME=$nic_name
+export TP_SOCKET_IFNAME=$nic_name
+export HCCL_SOCKET_IFNAME=$nic_name
+export OMP_PROC_BIND=false
+export OMP_NUM_THREADS=100
+export VLLM_USE_V1=1
+export HCCL_BUFFSIZE=1024
+
+vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
+--host 0.0.0.0 \
+--port 8000 \
+--headless \
+--data-parallel-size 2 \
+--data-parallel-size-local 1 \
+--data-parallel-start-rank 1 \
+--data-parallel-address $node0_ip \
+--data-parallel-rpc-port 13389 \
+--seed 1024 \
+--tensor-parallel-size 8 \
+--served-model-name qwen3vl \
+--max-num-seqs 16 \
+--max-model-len 32768 \
+--max-num-batched-tokens 4096 \
+--enable-expert-parallel \
+--trust-remote-code \
+--no-enable-prefix-caching \
+--gpu-memory-utilization 0.8
+```
+
+If the service starts successfully, the following information will be displayed on node0:
+
+```shell
+INFO: Started server process [44610]
+INFO: Waiting for application startup.
+INFO: Application startup complete.
+INFO: Started server process [44611]
+INFO: Waiting for application startup.
+INFO: Application startup complete.
+```
+
+Once the server is started, you can query the model with input prompts:
+
+```shell
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "qwen3vl",
+    "messages": [
+      {"role": "system", "content": "You are a helpful assistant."},
+      {"role": "user", "content": [
+        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
+        {"type": "text", "text": "What is the text in the illustration?"}
+      ]}
+    ]
+  }'
+```
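As a side note on the parallel layout in this patch: the tutorial's serve commands combine data parallelism across the nodes with tensor parallelism inside each node, so the total rank count is their product. A minimal sketch of that arithmetic, using the `--data-parallel-size` and `--tensor-parallel-size` values from the commands above:

```shell
# Each node hosts one DP replica (--data-parallel-size-local 1) sharded
# across 8 NPUs (--tensor-parallel-size 8); there are 2 replicas in total
# (--data-parallel-size 2), one per node.
dp=2
tp=8
total=$((dp * tp))
echo "total NPUs required: $total"   # prints: total NPUs required: 16
```

That is 8 NPUs per node across the two nodes; scaling out further means raising `--data-parallel-size` (and `--data-parallel-start-rank` on the extra headless nodes) while the per-node product `data_parallel_size_local * tensor_parallel_size` stays within the node's NPU count.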