diff --git a/docs/source/tutorials/models/GLM5.md b/docs/source/tutorials/models/GLM5.md
index ddfff28e..c0c28e2d 100644
--- a/docs/source/tutorials/models/GLM5.md
+++ b/docs/source/tutorials/models/GLM5.md
@@ -26,11 +26,21 @@ It is recommended to download the model weight to the shared directory of multip
 vLLM and vLLM-ascend only support GLM-5 on our main branches. you can use our official docker images and upgrade vllm and vllm-ascend for inference.
+:::::{tab-set}
+:sync-group: install
+
+::::{tab-item} A3 series
+:sync: A3
+
+Start the docker container on each node.
+
 ```{code-block} bash
+ :substitutions:
+
 # Update --device according to your device (Atlas A3:/dev/davinci[0-15]).
 # Update the vllm-ascend image according to your environment.
 # Note you should download the weight to /root/.cache in advance.
-# Update the vllm-ascend image, alm5-a3 can be replaced by: glm5;glm5-openeuler;glm5-a3-openeuler
+# Update the vllm-ascend image, glm5-a3 can be replaced by: glm5;glm5-openeuler;glm5-a3-openeuler
 export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:glm5-a3
 export NAME=vllm-ascend
@@ -69,6 +79,44 @@ docker run --rm \
 -it $IMAGE bash
 ```
 
+::::
+::::{tab-item} A2 series
+:sync: A2
+
+Start the docker container on each node.
+
+```{code-block} bash
+ :substitutions:
+
+export IMAGE=quay.io/ascend/vllm-ascend:glm5
+docker run --rm \
+ --name vllm-ascend \
+ --shm-size=1g \
+ --net=host \
+ --device /dev/davinci0 \
+ --device /dev/davinci1 \
+ --device /dev/davinci2 \
+ --device /dev/davinci3 \
+ --device /dev/davinci4 \
+ --device /dev/davinci5 \
+ --device /dev/davinci6 \
+ --device /dev/davinci7 \
+ --device /dev/davinci_manager \
+ --device /dev/devmm_svm \
+ --device /dev/hisi_hdc \
+ -v /usr/local/dcmi:/usr/local/dcmi \
+ -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
+ -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+ -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+ -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+ -v /etc/ascend_install.info:/etc/ascend_install.info \
+ -v /root/.cache:/root/.cache \
+ -it $IMAGE bash
+```
+
+::::
+:::::
+
 In addition, if you don't want to use the docker image as above, you can also build all from source:
 
 - Install `vllm-ascend` from source, refer to [installation](https://docs.vllm.ai/projects/ascend/en/latest/installation.html).
@@ -99,17 +147,18 @@ If you want to deploy multi-node environment, you need to set up environment on
 
 ### Single-node Deployment
 
-**A2 series**
+:::::{tab-set}
+:sync-group: install
 
-Not test yet.
-
-**A3 series**
+::::{tab-item} A3 series
+:sync: A3
 
 - Quantized model `glm-5-w4a8` can be deployed on 1 Atlas 800 A3 (64G × 16) .
 
 Run the following script to execute online inference.
 
-```shell
+```{code-block} bash
+ :substitutions:
 export HCCL_OP_EXPANSION_MODE="AIV"
 export OMP_PROC_BIND=false
 export OMP_NUM_THREADS=10
@@ -140,6 +189,49 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-w4a8 \
 --speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
 ```
 
+::::
+::::{tab-item} A2 series
+:sync: A2
+
+- Quantized model `glm-5-w4a8` can be deployed on 1 Atlas 800 A2 (64G × 8).
+
+Run the following script to execute online inference.
+
+```{code-block} bash
+ :substitutions:
+export HCCL_OP_EXPANSION_MODE="AIV"
+export OMP_PROC_BIND=false
+export OMP_NUM_THREADS=10
+export VLLM_USE_V1=1
+export HCCL_BUFFSIZE=200
+export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
+export VLLM_ASCEND_BALANCE_SCHEDULING=1
+
+vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM-5-w4a8 \
+--host 0.0.0.0 \
+--port 8077 \
+--data-parallel-size 1 \
+--tensor-parallel-size 8 \
+--enable-expert-parallel \
+--seed 1024 \
+--served-model-name glm-5 \
+--max-num-seqs 2 \
+--max-model-len 32768 \
+--max-num-batched-tokens 4096 \
+--trust-remote-code \
+--gpu-memory-utilization 0.95 \
+--quantization ascend \
+--enable-chunked-prefill \
+--enable-prefix-caching \
+--async-scheduling \
+--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
+--additional-config '{"multistream_overlap_shared_expert":true}' \
+--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
+```
+
+::::
+:::::
+
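+Once the server reports that it is ready, you can send a quick request to the OpenAI-compatible endpoint to confirm the deployment works. A minimal sketch for the single-node deployments above, assuming the settings from the commands shown (`--port 8077`, `--served-model-name glm-5`); the prompt and `max_tokens` below are only illustrative:
+
+```bash
+# Smoke test against vLLM's OpenAI-compatible chat completions API
+curl http://localhost:8077/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+        "model": "glm-5",
+        "messages": [{"role": "user", "content": "Hello, who are you?"}],
+        "max_tokens": 64
+      }'
+```
+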
 **Notice:**
 
 The parameters are explained as follows:
@@ -148,11 +240,11 @@
 
 ### Multi-node Deployment
 
-**A2 series**
+:::::{tab-set}
+:sync-group: install
 
-Not test yet.
-
-**A3 series**
+::::{tab-item} A3 series
+:sync: A3
 
 - `glm-5-bf16`: require at least 2 Atlas 800 A3 (64G × 16).
@@ -160,7 +252,8 @@ Run the following scripts on two nodes respectively.
 
 **node 0**
 
-```shell
+```{code-block} bash
+ :substitutions:
 # this obtained through ifconfig
 # nic_name is the network interface name corresponding to local_ip of the current node
 nic_name="xxx"
@@ -204,7 +297,8 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-bf16 \
 
 **node 1**
 
-```shell
+```{code-block} bash
+ :substitutions:
 # this obtained through ifconfig
 # nic_name is the network interface name corresponding to local_ip of the current node
 nic_name="xxx"
@@ -248,6 +342,111 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-bf16 \
 --speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
 ```
 
+::::
+::::{tab-item} A2 series
+:sync: A2
+
+Run the following scripts on two nodes respectively.
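+
+The scripts below leave `nic_name` and `local_ip` as placeholders. A small sketch for discovering them with standard Linux tools; the correct IP is the one the other node can reach, so verify the output before copying it into the scripts:
+
+```bash
+# Pick the first host IP, then look up the interface that owns it (sketch only)
+local_ip=$(hostname -I | awk '{print $1}')
+nic_name=$(ip -o -4 addr show | awk -v ip="$local_ip" '$4 ~ ip {print $2; exit}')
+echo "nic_name=$nic_name local_ip=$local_ip"
+```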
+
+**node 0**
+
+```{code-block} bash
+ :substitutions:
+# This is obtained through ifconfig
+# nic_name is the network interface name corresponding to local_ip of the current node
+nic_name="xxx"
+local_ip="xxx"
+
+# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
+node0_ip="xxx"
+
+export HCCL_OP_EXPANSION_MODE="AIV"
+
+export HCCL_IF_IP=$local_ip
+export GLOO_SOCKET_IFNAME=$nic_name
+export TP_SOCKET_IFNAME=$nic_name
+export HCCL_SOCKET_IFNAME=$nic_name
+export OMP_PROC_BIND=false
+export OMP_NUM_THREADS=10
+export VLLM_USE_V1=1
+export HCCL_BUFFSIZE=200
+export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
+
+vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM-5-w4a8 \
+--host 0.0.0.0 \
+--port 8077 \
+--data-parallel-size 2 \
+--data-parallel-size-local 1 \
+--data-parallel-address $node0_ip \
+--data-parallel-rpc-port 13389 \
+--tensor-parallel-size 8 \
+--quantization ascend \
+--seed 1024 \
+--served-model-name glm-5 \
+--enable-expert-parallel \
+--max-num-seqs 2 \
+--max-model-len 131072 \
+--max-num-batched-tokens 4096 \
+--trust-remote-code \
+--no-enable-prefix-caching \
+--gpu-memory-utilization 0.95 \
+--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
+--additional-config '{"multistream_overlap_shared_expert":true}' \
+--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
+```
+
+**node 1**
+
+```{code-block} bash
+ :substitutions:
+# This is obtained through ifconfig
+# nic_name is the network interface name corresponding to local_ip of the current node
+nic_name="xxx"
+local_ip="xxx"
+
+# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
+node0_ip="xxx"
+
+export HCCL_OP_EXPANSION_MODE="AIV"
+
+export HCCL_IF_IP=$local_ip
+export GLOO_SOCKET_IFNAME=$nic_name
+export TP_SOCKET_IFNAME=$nic_name
+export HCCL_SOCKET_IFNAME=$nic_name
+export OMP_PROC_BIND=false
+export OMP_NUM_THREADS=10
+export VLLM_USE_V1=1
+export HCCL_BUFFSIZE=200
+export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
+
+vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM-5-w4a8 \
+--host 0.0.0.0 \
+--port 8077 \
+--headless \
+--data-parallel-size 2 \
+--data-parallel-size-local 1 \
+--data-parallel-start-rank 1 \
+--data-parallel-address $node0_ip \
+--data-parallel-rpc-port 13389 \
+--tensor-parallel-size 8 \
+--quantization ascend \
+--seed 1024 \
+--served-model-name glm-5 \
+--enable-expert-parallel \
+--max-num-seqs 2 \
+--max-model-len 131072 \
+--max-num-batched-tokens 4096 \
+--trust-remote-code \
+--no-enable-prefix-caching \
+--gpu-memory-utilization 0.95 \
+--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
+--additional-config '{"multistream_overlap_shared_expert":true}' \
+--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
+```
+
+::::
+:::::
+
 ### Prefill-Decode Disaggregation
 
 Not test yet.
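+
+If a multi-node deployment above hangs while initializing HCCL, a common first check is NPU-to-NPU network connectivity between the nodes. A rough sketch using `hccn_tool` (mounted into the container by the docker commands above); the exact flags can vary across driver versions, and the peer address below is only a placeholder:
+
+```bash
+# Device-side IP and link status of NPU 0 on this node
+hccn_tool -i 0 -ip -g
+hccn_tool -i 0 -link -g
+# Ping the device-side IP of NPU 0 on the peer node (replace with the address reported there)
+hccn_tool -i 0 -ping -g address 192.168.100.101
+```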