[doc] Update GLM4.x.md, add GLM4.x multi-node deploy tutorial (#6872)
### What this PR does / why we need it?
This PR updates the GLM4.x documentation by adding a multi-node deployment
tutorial, e.g. for 2 × Atlas 800 A2 (64G × 8).
- **What changed**: Added instructions for deploying GLM-4.x models
across multiple nodes, including environment variables and example
commands.
- **Why needed**: Although the previous tutorial stated that multi-node
deployment on Atlas 800 A2 (64GB × 8) is **not recommended**, we still
face situations where GLM-4.7 must be deployed on 2 × Atlas 800 A2
(64G × 8). We have successfully run GLM-4.7 on 2 nodes and it works fine,
so we think it is time to update this part.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Verified that the new documentation renders correctly in Markdown
format.
- Tested the multi-node deployment steps on 2 × Atlas 800 A2 (64G × 8)
to ensure the commands work as described.
- Confirmed that existing GLM4.x documentation links and structure
remain intact.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: ZKSU <zksu@outlook.com>
@@ -98,7 +98,7 @@ vllm serve /weight/glm4.5_w8a8_with_float_mtp \
    --gpu-memory-utilization 0.9 \
    --speculative-config '{"num_speculative_tokens": 1, "model":"/weight/glm4.5_w8a8_with_float_mtp", "method":"mtp"}' \
    --compilation-config '{"cudagraph_capture_sizes": [1,2,4,8,16,32], "cudagraph_mode": "FULL_DECODE_ONLY"}' \
-   --async-scheduling \
+   --async-scheduling
```

**Notice:**
@@ -109,7 +109,103 @@ The parameters are explained as follows:

### Multi-node Deployment

Although the previous tutorial said "Not recommended to deploy multi-node on Atlas 800 A2 (64G × 8)", if you must deploy a GLM-4.x model across multiple nodes, e.g. 2 × Atlas 800 A2 (64G × 8), run the following scripts on the two nodes respectively.
**Node 0**

```shell
#!/bin/sh

# Obtained via ifconfig:
# nic_name is the network interface name corresponding to local_ip on the current node
nic_name="xxxx"
local_ip="xxxx"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_BALANCE_SCHEDULING=1
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
export VLLM_USE_MODELSCOPE=True

vllm serve ZhipuAI/GLM-4.7 \
    --host 0.0.0.0 \
    --port 30000 \
    --data-parallel-size 4 \
    --data-parallel-size-local 2 \
    --data-parallel-address $local_ip \
    --data-parallel-rpc-port 13389 \
    --tensor-parallel-size 4 \
    --seed 1024 \
    --async-scheduling \
    --max-num-seqs 16 \
    --max-model-len 16384 \
    --max-num-batched-tokens 4096 \
    --gpu-memory-utilization 0.92 \
    --enable-auto-tool-choice \
    --reasoning-parser glm45 \
    --tool-call-parser glm47 \
    --speculative-config '{"num_speculative_tokens":3,"method":"mtp"}' \
    --compilation-config '{"cudagraph_capture_sizes":[4,16,32,48,64], "cudagraph_mode": "FULL_DECODE_ONLY"}' \
    --trust-remote-code \
    --served-model-name glm47
```
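As a sanity check on the topology (an illustrative sketch, not part of the deployment itself): with `--data-parallel-size 4`, `--tensor-parallel-size 4`, and `--data-parallel-size-local 2`, the deployment spans 16 NPUs in total, 8 per node, which matches 2 × Atlas 800 A2 (64G × 8).

```python
# Illustrative arithmetic for the parallelism flags above (plain Python,
# not a vLLM API): total ranks = data-parallel size x tensor-parallel size.
data_parallel_size = 4     # --data-parallel-size
tensor_parallel_size = 4   # --tensor-parallel-size
data_parallel_local = 2    # --data-parallel-size-local (DP ranks per node)

total_npus = data_parallel_size * tensor_parallel_size
npus_per_node = data_parallel_local * tensor_parallel_size
num_nodes = data_parallel_size // data_parallel_local

print(total_npus, npus_per_node, num_nodes)  # 16 8 2
```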

**Node 1**

```shell
#!/bin/sh

# Obtained via ifconfig:
# nic_name is the network interface name corresponding to local_ip on the current node
nic_name="xxxx"
local_ip="xxxx"
node0_ip="xxxx" # same as the local_ip address on node 0

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_BALANCE_SCHEDULING=1
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
export VLLM_USE_MODELSCOPE=True

vllm serve ZhipuAI/GLM-4.7 \
    --host 0.0.0.0 \
    --port 30000 \
    --headless \
    --data-parallel-size 4 \
    --data-parallel-size-local 2 \
    --data-parallel-start-rank 2 \
    --data-parallel-address $node0_ip \
    --data-parallel-rpc-port 13389 \
    --tensor-parallel-size 4 \
    --seed 1024 \
    --async-scheduling \
    --max-num-seqs 16 \
    --max-model-len 16384 \
    --max-num-batched-tokens 4096 \
    --gpu-memory-utilization 0.92 \
    --enable-auto-tool-choice \
    --reasoning-parser glm45 \
    --tool-call-parser glm47 \
    --speculative-config '{"num_speculative_tokens":3,"method":"mtp"}' \
    --compilation-config '{"cudagraph_capture_sizes":[4,16,32,48,64], "cudagraph_mode": "FULL_DECODE_ONLY"}' \
    --trust-remote-code \
    --served-model-name glm47
```
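
Once both nodes are up, you can sanity-check the deployment through the OpenAI-compatible endpoint exposed on node 0 (port 30000 per the scripts above). A minimal sketch; `<node0-ip>` stands for the `local_ip` you configured on node 0:

```shell
#!/bin/sh
# Build a minimal chat-completions request for the model served as "glm47".
BODY='{"model": "glm47", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'

# Send it to node 0 once the cluster is ready (replace <node0-ip> first):
# curl -s http://<node0-ip>:30000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$BODY"
echo "$BODY"
```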
### Prefill-Decode Disaggregation