diff --git a/docs/source/tutorials/models/GLM5.md b/docs/source/tutorials/models/GLM5.md index 435a40ee..86981206 100644 --- a/docs/source/tutorials/models/GLM5.md +++ b/docs/source/tutorials/models/GLM5.md @@ -17,15 +17,15 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea ### Model Weight - `GLM-5`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-5). -- `GLM-5-w4a8`(Quantized version without MTP quant): [Download model weight](https://modelscope.cn/models/Eco-Tech/GLM-5-w4a8). -- `GLM-5-w4a8`(Quantized version with MTP quant): [Download model weight](https://modelscope.cn/models/Eco-Tech/GLM-5-w4a8-mtp-QuaRot). +- `GLM-5-w4a8`: [Download model weight](https://modelscope.cn/models/Eco-Tech/GLM-5-w4a8). +- `GLM-5-w8a8`: [Download model weight](https://ai.gitcode.com/Eco-Tech/GLM-5-w8a8/tree/main). - You can use [msmodelslim](https://gitcode.com/Ascend/msmodelslim) to quantify the model naively. It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/` ### Installation -vLLM and vLLM-ascend only support GLM-5 on our main branches. you can use our glm5 docker images for inference. +You can use our official docker image to run GLM-5 directly. :::::{tab-set} :sync-group: install @@ -38,11 +38,7 @@ Start the docker image on your each node. ```{code-block} bash :substitutions: -# Update --device according to your device (Atlas A3:/dev/davinci[0-15]). -# Update the vllm-ascend image according to your environment. -# Note you should download the weight to /root/.cache in advance. -# Update the vllm-ascend image, glm5-a3 can be replaced by: glm5;glm5-openeuler;glm5-a3-openeuler -export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:glm5-a3 +export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3 export NAME=vllm-ascend # Run the container using the defined variables @@ -89,7 +85,7 @@ Start the docker image on your each node. ```{code-block} bash :substitutions: -export IMAGE=quay.io/ascend/vllm-ascend:glm5 +export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version| docker run --rm \ --name vllm-ascend \ --shm-size=1g \ @@ -122,26 +118,6 @@ In addition, if you don't want to use the docker image as above, you can also bu - Install `vllm-ascend` from source, refer to [installation](https://docs.vllm.ai/projects/ascend/en/latest/installation.html). -- After install `vllm-ascend` from source, you should upgrade vllm、vllm-ascend、transformers to main branches: - -```shell -# upgrade vllm -git clone https://github.com/vllm-project/vllm.git -cd vllm -git checkout 978a37c82387ce4a40aaadddcdbaf4a06fc4d590 -VLLM_TARGET_DEVICE=empty pip install -v . - -# upgrade vllm-ascend -git clone https://github.com/vllm-project/vllm-ascend.git -cd vllm-ascend -git checkout ff3a50d011dcbea08f87ebed69ff1bf156dbb01e -git submodule update --init --recursive -pip install -v . - -# reinstall transformers -pip install git+https://github.com/huggingface/transformers.git -``` - If you want to deploy multi-node environment, you need to set up environment on each node. ## Deployment @@ -162,8 +138,7 @@ Run the following script to execute online inference. 
:substitutions: export HCCL_OP_EXPANSION_MODE="AIV" export OMP_PROC_BIND=false -export OMP_NUM_THREADS=10 -export VLLM_USE_V1=1 +export OMP_NUM_THREADS=1 export HCCL_BUFFSIZE=200 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export VLLM_ASCEND_BALANCE_SCHEDULING=1 @@ -185,7 +160,43 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-w4a8 \ --enable-chunked-prefill \ --enable-prefix-caching \ --async-scheduling \ ---additional-config '{"multistream_overlap_shared_expert":true}' \ +--additional-config '{"enable_npugraph_ex": true,"fuse_muls_add":true,"multistream_overlap_shared_expert":true}' \ +--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ +--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}' +``` + +- Quantized model `glm-5-w8a8` can be deployed on 1 Atlas 800 A3 (64G × 16) . + +Run the following script to execute online inference. + +```{code-block} bash + :substitutions: +export HCCL_OP_EXPANSION_MODE="AIV" +export OMP_PROC_BIND=false +export OMP_NUM_THREADS=1 +export HCCL_BUFFSIZE=200 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export VLLM_ASCEND_BALANCE_SCHEDULING=1 +export VLLM_ASCEND_ENABLE_MLAPO=1 + +vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-w8a8 \ +--host 0.0.0.0 \ +--port 8077 \ +--data-parallel-size 1 \ +--tensor-parallel-size 16 \ +--enable-expert-parallel \ +--seed 1024 \ +--served-model-name glm-5 \ +--max-num-seqs 8 \ +--max-model-len 40960 \ +--max-num-batched-tokens 4096 \ +--trust-remote-code \ +--gpu-memory-utilization 0.95 \ +--quantization ascend \ +--enable-chunked-prefill \ +--enable-prefix-caching \ +--async-scheduling \ +--additional-config '{"enable_npugraph_ex": true,"fuse_muls_add":true,"multistream_overlap_shared_expert":true}' \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ --speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}' ``` @@ -202,8 +213,7 @@ Run the following script to execute online inference. 
:substitutions: export HCCL_OP_EXPANSION_MODE="AIV" export OMP_PROC_BIND=false -export OMP_NUM_THREADS=10 -export VLLM_USE_V1=1 +export OMP_NUM_THREADS=1 export HCCL_BUFFSIZE=200 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export VLLM_ASCEND_BALANCE_SCHEDULING=1 @@ -226,7 +236,7 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM-5-w4a8 \ --enable-prefix-caching \ --async-scheduling \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ ---additional-config '{"multistream_overlap_shared_expert":true}' \ +--additional-config '{"enable_npugraph_ex": true,"fuse_muls_add":true,"multistream_overlap_shared_expert":true}' \ --speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}' ``` @@ -272,8 +282,7 @@ export GLOO_SOCKET_IFNAME=$nic_name export TP_SOCKET_IFNAME=$nic_name export HCCL_SOCKET_IFNAME=$nic_name export OMP_PROC_BIND=false -export OMP_NUM_THREADS=10 -export VLLM_USE_V1=1 +export OMP_NUM_THREADS=1 export HCCL_BUFFSIZE=200 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True @@ -317,8 +326,7 @@ export GLOO_SOCKET_IFNAME=$nic_name export TP_SOCKET_IFNAME=$nic_name export HCCL_SOCKET_IFNAME=$nic_name export OMP_PROC_BIND=false -export OMP_NUM_THREADS=10 -export VLLM_USE_V1=1 +export OMP_NUM_THREADS=1 export HCCL_BUFFSIZE=200 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True @@ -370,8 +378,7 @@ export GLOO_SOCKET_IFNAME=$nic_name export TP_SOCKET_IFNAME=$nic_name export HCCL_SOCKET_IFNAME=$nic_name export OMP_PROC_BIND=false -export OMP_NUM_THREADS=10 -export VLLM_USE_V1=1 +export OMP_NUM_THREADS=1 export HCCL_BUFFSIZE=200 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True @@ -394,7 +401,7 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM-5-w4a8 \ --no-enable-prefix-caching \ --gpu-memory-utilization 0.95 \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ ---additional-config '{"multistream_overlap_shared_expert":true}' \ +--additional-config '{"enable_npugraph_ex": true, "fuse_muls_add":true,"multistream_overlap_shared_expert":true}' \ --speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}' ``` @@ -417,8 +424,7 @@ export GLOO_SOCKET_IFNAME=$nic_name export TP_SOCKET_IFNAME=$nic_name export HCCL_SOCKET_IFNAME=$nic_name export OMP_PROC_BIND=false -export OMP_NUM_THREADS=10 -export VLLM_USE_V1=1 +export OMP_NUM_THREADS=1 export HCCL_BUFFSIZE=200 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True @@ -443,7 +449,7 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM-5-w4a8 \ --no-enable-prefix-caching \ --gpu-memory-utilization 0.95 \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ ---additional-config '{"multistream_overlap_shared_expert":true}' \ +--additional-config '{"enable_npugraph_ex": true, "fuse_muls_add":true,"multistream_overlap_shared_expert":true}' \ --speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}' ``` @@ -508,9 +514,799 @@ if __name__ == "__main__": json.dump(json_data, f, indent=2) ``` +:::::{tab-set} +:sync-group: install + +::::{tab-item} A3 series +:sync: A3 + +- `glm-5-w8a8`: require 2 Atlas 800 A3 (64G × 16). + +Run the following scripts on two nodes respectively. 
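Before launching the two scripts, it is worth confirming that the nodes can reach each other over the NIC you will pass as `nic_name`, and that each NPU NIC has an IP configured; otherwise HCCL initialization may hang. A minimal pre-flight sketch is shown below; the interface name and peer address are placeholders you must replace with your own values.

```bash
# Placeholders (assumptions): replace with your service NIC and the other node's IP.
NIC_NAME="enp23s0f3"
PEER_IP="192.0.2.2"

# Check reachability of the peer node over the chosen interface.
ping -c 3 -I "$NIC_NAME" "$PEER_IP"

# Each Atlas A3 NPU NIC (devices 0-15) should report an IP address.
for i in $(seq 0 15); do
    hccn_tool -i "$i" -ip -g
done
```
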
+ +**node 0** + +```{code-block} bash + :substitutions: +# this obtained through ifconfig +# nic_name is the network interface name corresponding to local_ip of the current node +nic_name="xxx" +local_ip="xxx" + +# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node) +node0_ip="xxxx" + +export HCCL_OP_EXPANSION_MODE="AIV" + +export HCCL_IF_IP=$local_ip +export GLOO_SOCKET_IFNAME=$nic_name +export TP_SOCKET_IFNAME=$nic_name +export HCCL_SOCKET_IFNAME=$nic_name +export OMP_PROC_BIND=false +export OMP_NUM_THREADS=1 +export HCCL_BUFFSIZE=200 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export VLLM_ASCEND_ENABLE_MLAPO=1 + +vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-w8a8 \ +--host 0.0.0.0 \ +--port 8077 \ +--data-parallel-size 2 \ +--data-parallel-size-local 1 \ +--data-parallel-address $node0_ip \ +--data-parallel-rpc-port 12890 \ +--tensor-parallel-size 16 \ +--seed 1024 \ +--served-model-name glm-5 \ +--enable-expert-parallel \ +--max-num-seqs 16 \ +--max-model-len 65536 \ +--max-num-batched-tokens 4096 \ +--trust-remote-code \ +--gpu-memory-utilization 0.95 \ +--quantization ascend \ +--enable-chunked-prefill \ +--enable-prefix-caching \ +--async-scheduling \ +--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ +--additional-config '{"enable_npugraph_ex": true,"fuse_muls_add":true,"multistream_overlap_shared_expert":true}' \ +--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}' +``` + +**node 1** + +```{code-block} bash + :substitutions: +# this obtained through ifconfig +# nic_name is the network interface name corresponding to local_ip of the current node +nic_name="xxx" +local_ip="xxx" + +# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node) +node0_ip="xxxx" + +export HCCL_OP_EXPANSION_MODE="AIV" + +export HCCL_IF_IP=$local_ip +export GLOO_SOCKET_IFNAME=$nic_name +export TP_SOCKET_IFNAME=$nic_name +export HCCL_SOCKET_IFNAME=$nic_name +export OMP_PROC_BIND=false +export OMP_NUM_THREADS=1 +export HCCL_BUFFSIZE=200 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True +export VLLM_ASCEND_ENABLE_MLAPO=1 + +vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-w8a8 \ +--host 0.0.0.0 \ +--port 8077 \ +--headless \ +--data-parallel-size 2 \ +--data-parallel-size-local 1 \ +--data-parallel-start-rank 1 \ +--data-parallel-address $node0_ip \ +--data-parallel-rpc-port 12890 \ +--tensor-parallel-size 16 \ +--seed 1024 \ +--served-model-name glm-5 \ +--enable-expert-parallel \ +--max-num-seqs 16 \ +--max-model-len 65536 \ +--max-num-batched-tokens 4096 \ +--trust-remote-code \ +--gpu-memory-utilization 0.95 \ +--quantization ascend \ +--enable-chunked-prefill \ +--enable-prefix-caching \ +--async-scheduling \ +--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ +--additional-config '{"enable_npugraph_ex": true,"fuse_muls_add":true,"multistream_overlap_shared_expert":true}' \ +--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}' +``` + +:::: +::::: + ### Prefill-Decode Disaggregation -Not test yet. +We'd like to show the deployment guide of `GLM-5` on multi-node environment with 1P1D for better performance. + +Before you start, please + +1. 
prepare the script `launch_online_dp.py` on each node: + + ```python + import argparse + import multiprocessing + import os + import subprocess + import sys + + def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--dp-size", + type=int, + required=True, + help="Data parallel size." + ) + parser.add_argument( + "--tp-size", + type=int, + default=1, + help="Tensor parallel size." + ) + parser.add_argument( + "--dp-size-local", + type=int, + default=-1, + help="Local data parallel size." + ) + parser.add_argument( + "--dp-rank-start", + type=int, + default=0, + help="Starting rank for data parallel." + ) + parser.add_argument( + "--dp-address", + type=str, + required=True, + help="IP address for data parallel master node." + ) + parser.add_argument( + "--dp-rpc-port", + type=str, + default=12345, + help="Port for data parallel master node." + ) + parser.add_argument( + "--vllm-start-port", + type=int, + default=9000, + help="Starting port for the engine." + ) + return parser.parse_args() + + args = parse_args() + dp_size = args.dp_size + tp_size = args.tp_size + dp_size_local = args.dp_size_local + if dp_size_local == -1: + dp_size_local = dp_size + dp_rank_start = args.dp_rank_start + dp_address = args.dp_address + dp_rpc_port = args.dp_rpc_port + vllm_start_port = args.vllm_start_port + + def run_command(visible_devices, dp_rank, vllm_engine_port): + command = [ + "bash", + "./run_dp_template.sh", + visible_devices, + str(vllm_engine_port), + str(dp_size), + str(dp_rank), + dp_address, + dp_rpc_port, + str(tp_size), + ] + subprocess.run(command, check=True) + + if __name__ == "__main__": + template_path = "./run_dp_template.sh" + if not os.path.exists(template_path): + print(f"Template file {template_path} does not exist.") + sys.exit(1) + + processes = [] + num_cards = dp_size_local * tp_size + for i in range(dp_size_local): + dp_rank = dp_rank_start + i + vllm_engine_port = vllm_start_port + i + visible_devices = ",".join(str(x) for x in range(i * tp_size, (i + 1) * tp_size)) + process = multiprocessing.Process(target=run_command, + args=(visible_devices, dp_rank, + vllm_engine_port)) + processes.append(process) + process.start() + + for process in processes: + process.join() + + ``` + +2. prepare the script `run_dp_template.sh` on each node. + + 1. 
Prefill node 0 + + ```shell + nic_name="xxxx" # change to your own nic name + local_ip="xxxx" # change to your own ip + + export HCCL_OP_EXPANSION_MODE="AIV" + + export HCCL_IF_IP=$local_ip + export GLOO_SOCKET_IFNAME=$nic_name + export TP_SOCKET_IFNAME=$nic_name + export HCCL_SOCKET_IFNAME=$nic_name + + export OMP_PROC_BIND=false + export OMP_NUM_THREADS=1 + export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + export HCCL_BUFFSIZE=256 + + export ASCEND_AGGREGATE_ENABLE=1 + export ASCEND_TRANSPORT_PRINT=1 + export ACL_OP_INIT_MODE=1 + export ASCEND_A3_ENABLE=1 + export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000 + + export ASCEND_RT_VISIBLE_DEVICES=$1 + + export VLLM_ASCEND_ENABLE_FLASHCOMM1=1 + + export VLLM_ASCEND_ENABLE_FUSED_MC2=1 + export VLLM_ASCEND_ENABLE_MLAPO=1 + export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib + + vllm serve /root/.cache/glm5-w8a8 \ + --host 0.0.0.0 \ + --port $2 \ + --data-parallel-size $3 \ + --data-parallel-rank $4 \ + --data-parallel-address $5 \ + --data-parallel-rpc-port $6 \ + --tensor-parallel-size $7 \ + --enable-expert-parallel \ + --speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp"}' \ + --profiler-config \ + '{"profiler": "torch", + "torch_profiler_dir": "./vllm_profile", + "torch_profiler_with_stack": false}' \ + --seed 1024 \ + --served-model-name glm-5 \ + --max-model-len 131072 \ + --additional-config '{"enable_npugraph_ex": true, "fuse_muls_add":true,"multistream_overlap_shared_expert":true,"recompute_scheduler_enable" : true}' \ + --max-num-batched-tokens 4096 \ + --trust-remote-code \ + --max-num-seqs 64 \ + --quantization ascend \ + --gpu-memory-utilization 0.95 \ + --enforce-eager \ + --enable-auto-tool-choice \ + --tool-call-parser glm47 \ + --reasoning-parser glm45 \ + --kv-transfer-config \ + '{"kv_connector": "MooncakeConnectorV1", + "kv_role": "kv_producer", + "kv_port": "30000", + "engine_id": "0", + "kv_connector_extra_config": { + "use_ascend_direct": true, + "prefill": { + "dp_size": 4, + "tp_size": 8 + }, + "decode": { + "dp_size": 16, + "tp_size": 4 + } + } + }' + + ``` + + 2. 
Prefill node 1 + + ```shell + nic_name="xxxx" # change to your own nic name + local_ip="xxxx" # change to your own ip + + export HCCL_OP_EXPANSION_MODE="AIV" + + export HCCL_IF_IP=$local_ip + export GLOO_SOCKET_IFNAME=$nic_name + export TP_SOCKET_IFNAME=$nic_name + export HCCL_SOCKET_IFNAME=$nic_name + + export OMP_PROC_BIND=false + export OMP_NUM_THREADS=1 + export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + export HCCL_BUFFSIZE=256 + + export ASCEND_AGGREGATE_ENABLE=1 + export ASCEND_TRANSPORT_PRINT=1 + export ACL_OP_INIT_MODE=1 + export ASCEND_A3_ENABLE=1 + export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000 + + export ASCEND_RT_VISIBLE_DEVICES=$1 + export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + + export VLLM_ASCEND_ENABLE_FLASHCOMM1=1 + + export VLLM_ASCEND_ENABLE_FUSED_MC2=1 + export VLLM_ASCEND_ENABLE_MLAPO=1 + export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib + + vllm serve /root/.cache/glm5-w8a8 \ + --host 0.0.0.0 \ + --port $2 \ + --data-parallel-size $3 \ + --data-parallel-rank $4 \ + --data-parallel-address $5 \ + --data-parallel-rpc-port $6 \ + --tensor-parallel-size $7 \ + --enable-expert-parallel \ + --speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp"}' \ + --profiler-config \ + '{"profiler": "torch", + "torch_profiler_dir": "./vllm_profile", + "torch_profiler_with_stack": false}' \ + --seed 1024 \ + --served-model-name glm-5 \ + --max-model-len 131072 \ + --additional-config '{"enable_npugraph_ex": true, "fuse_muls_add":true,"multistream_overlap_shared_expert":true,"recompute_scheduler_enable" : true}' \ + --max-num-batched-tokens 4096 \ + --trust-remote-code \ + --max-num-seqs 64 \ + --gpu-memory-utilization 0.95 \ + --quantization ascend \ + --enforce-eager \ + --enable-auto-tool-choice \ + --tool-call-parser glm47 \ + --reasoning-parser glm45 \ + --kv-transfer-config \ + '{"kv_connector": "MooncakeConnectorV1", + "kv_role": "kv_producer", + "kv_port": "30000", + "engine_id": "0", + "kv_connector_extra_config": { + "use_ascend_direct": true, + "prefill": { + "dp_size": 4, + "tp_size": 8 + }, + "decode": { + "dp_size": 16, + "tp_size": 4 + } + } + }' + ``` + + 3. 
Decode node 0 + + ```shell + nic_name="xxxx" # change to your own nic name + local_ip="xxxx" # change to your own ip + + export HCCL_OP_EXPANSION_MODE="AIV" + + export HCCL_IF_IP=$local_ip + export GLOO_SOCKET_IFNAME=$nic_name + export TP_SOCKET_IFNAME=$nic_name + export HCCL_SOCKET_IFNAME=$nic_name + + #Mooncake + export OMP_PROC_BIND=false + export OMP_NUM_THREADS=1 + + export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + export HCCL_BUFFSIZE=256 + + + export ASCEND_AGGREGATE_ENABLE=1 + export ASCEND_TRANSPORT_PRINT=1 + export ACL_OP_INIT_MODE=1 + export ASCEND_A3_ENABLE=1 + export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000 + + export TASK_QUEUE_ENABLE=1 + + export ASCEND_RT_VISIBLE_DEVICES=$1 + + export VLLM_ASCEND_ENABLE_FUSED_MC2=1 + export VLLM_ASCEND_ENABLE_MLAPO=1 + export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib + + vllm serve /root/.cache/glm5-w8a8 \ + --host 0.0.0.0 \ + --port $2 \ + --data-parallel-size $3 \ + --data-parallel-rank $4 \ + --data-parallel-address $5 \ + --data-parallel-rpc-port $6 \ + --tensor-parallel-size $7 \ + --enable-expert-parallel \ + --speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp"}' \ + --profiler-config \ + '{"profiler": "torch", + "torch_profiler_dir": "./vllm_profile", + "torch_profiler_with_stack": false}' \ + --seed 1024 \ + --served-model-name glm-5 \ + --max-model-len 200000 \ + --max-num-batched-tokens 32 \ + --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[4, 8, 12, 16,20,24,28, 32]}' \ + --additional-config '{"enable_npugraph_ex": true, "fuse_muls_add":true,"multistream_overlap_shared_expert":true,"recompute_scheduler_enable" : true}' \ + --trust-remote-code \ + --max-num-seqs 8 \ + --gpu-memory-utilization 0.92 \ + --async-scheduling \ + --quantization ascend \ + --enable-auto-tool-choice \ + --tool-call-parser glm47 \ + --reasoning-parser glm45 \ + --kv-transfer-config \ + '{"kv_connector": "MooncakeConnectorV1", + "kv_role": "kv_consumer", + "kv_port": "30100", + "engine_id": "1", + "kv_connector_extra_config": { + "use_ascend_direct": true, + "prefill": { + "dp_size": 4, + "tp_size": 8 + }, + "decode": { + "dp_size": 16, + "tp_size": 4 + } + } + }' + ``` + + 4. 
Decode node 1 + + ```shell + nic_name="xxxx" # change to your own nic name + local_ip="xxxx" # change to your own ip + + export HCCL_OP_EXPANSION_MODE="AIV" + + export HCCL_IF_IP=$local_ip + export GLOO_SOCKET_IFNAME=$nic_name + export TP_SOCKET_IFNAME=$nic_name + export HCCL_SOCKET_IFNAME=$nic_name + + #Mooncake + export OMP_PROC_BIND=false + export OMP_NUM_THREADS=1 + + export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + export HCCL_BUFFSIZE=256 + + export ASCEND_AGGREGATE_ENABLE=1 + export ASCEND_TRANSPORT_PRINT=1 + export ACL_OP_INIT_MODE=1 + export ASCEND_A3_ENABLE=1 + export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000 + + export TASK_QUEUE_ENABLE=1 + + export ASCEND_RT_VISIBLE_DEVICES=$1 + + export VLLM_ASCEND_ENABLE_FUSED_MC2=1 + export VLLM_ASCEND_ENABLE_MLAPO=1 + export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib + + vllm serve /root/.cache/glm5-w8a8 \ + --host 0.0.0.0 \ + --port $2 \ + --data-parallel-size $3 \ + --data-parallel-rank $4 \ + --data-parallel-address $5 \ + --data-parallel-rpc-port $6 \ + --tensor-parallel-size $7 \ + --enable-expert-parallel \ + --speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp"}' \ + --profiler-config \ + '{"profiler": "torch", + "torch_profiler_dir": "./vllm_profile", + "torch_profiler_with_stack": false}' \ + --seed 1024 \ + --served-model-name glm-5 \ + --max-model-len 200000 \ + --max-num-batched-tokens 32 \ + --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[4, 8, 12, 16,20,24,28, 32]}' \ + --additional-config '{"enable_npugraph_ex": true, "fuse_muls_add":true,"multistream_overlap_shared_expert":true,"recompute_scheduler_enable" : true}' \ + --trust-remote-code \ + --max-num-seqs 8 \ + --gpu-memory-utilization 0.92 \ + --async-scheduling \ + --quantization ascend \ + --enable-auto-tool-choice \ + --tool-call-parser glm47 \ + --reasoning-parser glm45 \ + --kv-transfer-config \ + '{"kv_connector": "MooncakeConnectorV1", + "kv_role": "kv_consumer", + "kv_port": "30100", + "engine_id": "1", + "kv_connector_extra_config": { + "use_ascend_direct": true, + "prefill": { + "dp_size": 4, + "tp_size": 8 + }, + "decode": { + "dp_size": 16, + "tp_size": 4 + } + } + }' + ``` + + 5. 
Decode node 2 + + ```shell + nic_name="xxxx" # change to your own nic name + local_ip="xxxx" # change to your own ip + + export HCCL_OP_EXPANSION_MODE="AIV" + + export HCCL_IF_IP=$local_ip + export GLOO_SOCKET_IFNAME=$nic_name + export TP_SOCKET_IFNAME=$nic_name + export HCCL_SOCKET_IFNAME=$nic_name + + #Mooncake + export OMP_PROC_BIND=false + export OMP_NUM_THREADS=1 + + export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + export HCCL_BUFFSIZE=256 + + export ASCEND_AGGREGATE_ENABLE=1 + export ASCEND_TRANSPORT_PRINT=1 + export ACL_OP_INIT_MODE=1 + export ASCEND_A3_ENABLE=1 + export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000 + + export TASK_QUEUE_ENABLE=1 + + export ASCEND_RT_VISIBLE_DEVICES=$1 + + export VLLM_ASCEND_ENABLE_FUSED_MC2=1 + export VLLM_ASCEND_ENABLE_MLAPO=1 + export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib + + vllm serve /root/.cache/glm5-w8a8 \ + --host 0.0.0.0 \ + --port $2 \ + --data-parallel-size $3 \ + --data-parallel-rank $4 \ + --data-parallel-address $5 \ + --data-parallel-rpc-port $6 \ + --tensor-parallel-size $7 \ + --enable-expert-parallel \ + --speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp"}' \ + --profiler-config \ + '{"profiler": "torch", + "torch_profiler_dir": "./vllm_profile", + "torch_profiler_with_stack": false}' \ + --seed 1024 \ + --served-model-name glm-5 \ + --max-model-len 200000 \ + --max-num-batched-tokens 32 \ + --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[4, 8, 12, 16,20,24,28, 32]}' \ + --additional-config '{"enable_npugraph_ex": true, "fuse_muls_add":true,"multistream_overlap_shared_expert":true,"recompute_scheduler_enable" : true}' \ + --trust-remote-code \ + --max-num-seqs 8 \ + --gpu-memory-utilization 0.92 \ + --async-scheduling \ + --quantization ascend \ + --enable-auto-tool-choice \ + --tool-call-parser glm47 \ + --reasoning-parser glm45 \ + --kv-transfer-config \ + '{"kv_connector": "MooncakeConnectorV1", + "kv_role": "kv_consumer", + "kv_port": "30100", + "engine_id": "1", + "kv_connector_extra_config": { + "use_ascend_direct": true, + "prefill": { + "dp_size": 4, + "tp_size": 8 + }, + "decode": { + "dp_size": 16, + "tp_size": 4 + } + } + }' + ``` + + 6. 
Decode node 3 + + ```shell + nic_name="xxxx" # change to your own nic name + local_ip="xxxx" # change to your own ip + + export HCCL_OP_EXPANSION_MODE="AIV" + + export HCCL_IF_IP=$local_ip + export GLOO_SOCKET_IFNAME=$nic_name + export TP_SOCKET_IFNAME=$nic_name + export HCCL_SOCKET_IFNAME=$nic_name + + #Mooncake + export OMP_PROC_BIND=false + export OMP_NUM_THREADS=1 + + export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + export HCCL_BUFFSIZE=256 + + export ASCEND_AGGREGATE_ENABLE=1 + export ASCEND_TRANSPORT_PRINT=1 + export ACL_OP_INIT_MODE=1 + export ASCEND_A3_ENABLE=1 + export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000 + + export TASK_QUEUE_ENABLE=1 + + export ASCEND_RT_VISIBLE_DEVICES=$1 + + export VLLM_ASCEND_ENABLE_FUSED_MC2=1 + export VLLM_ASCEND_ENABLE_MLAPO=1 + export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib + + vllm serve /root/.cache/glm5-w8a8 \ + --host 0.0.0.0 \ + --port $2 \ + --data-parallel-size $3 \ + --data-parallel-rank $4 \ + --data-parallel-address $5 \ + --data-parallel-rpc-port $6 \ + --tensor-parallel-size $7 \ + --enable-expert-parallel \ + --speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp"}' \ + --profiler-config \ + '{"profiler": "torch", + "torch_profiler_dir": "./vllm_profile", + "torch_profiler_with_stack": false}' \ + --seed 1024 \ + --served-model-name glm-5 \ + --max-model-len 200000 \ + --max-num-batched-tokens 32 \ + --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[4, 8, 12, 16,20,24,28, 32]}' \ + --additional-config '{"enable_npugraph_ex": true, "fuse_muls_add":true,"multistream_overlap_shared_expert":true,"recompute_scheduler_enable" : true}' \ + --trust-remote-code \ + --max-num-seqs 8 \ + --gpu-memory-utilization 0.92 \ + --async-scheduling \ + --quantization ascend \ + --enable-auto-tool-choice \ + --tool-call-parser glm47 \ + --reasoning-parser glm45 \ + --kv-transfer-config \ + '{"kv_connector": "MooncakeConnectorV1", + "kv_role": "kv_consumer", + "kv_port": "30100", + "engine_id": "1", + "kv_connector_extra_config": { + "use_ascend_direct": true, + "prefill": { + "dp_size": 4, + "tp_size": 8 + }, + "decode": { + "dp_size": 16, + "tp_size": 4 + } + } + }' + ``` + +Once the preparation is done, you can start the server with the following command on each node: + +1. Prefill node 0 + +```shell +# change ip to your own +python launch_online_dp.py --dp-size 4 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address $node_p0_ip --dp-rpc-port 10521 --vllm-start-port 6700 +``` + +2. Prefill node 1 + +```shell +# change ip to your own +python launch_online_dp.py --dp-size 4 --tp-size 8 --dp-size-local 2 --dp-rank-start 2 --dp-address $node_p0_ip --dp-rpc-port 10521 --vllm-start-port 6700 +``` + +3. Decode node 0 + +```shell +# change ip to your own +python launch_online_dp.py --dp-size 16 --tp-size 4 --dp-size-local 4 --dp-rank-start 0 --dp-address $node_d0_ip --dp-rpc-port 10523 --vllm-start-port 6721 +``` + +4. Decode node 1 + +```shell +# change ip to your own +python launch_online_dp.py --dp-size 16 --tp-size 4 --dp-size-local 4 --dp-rank-start 4 --dp-address $node_d0_ip --dp-rpc-port 10523 --vllm-start-port 6721 +``` + +5. Decode node 2 + +```shell +# change ip to your own +python launch_online_dp.py --dp-size 16 --tp-size 4 --dp-size-local 4 --dp-rank-start 8 --dp-address $node_d0_ip --dp-rpc-port 10523 --vllm-start-port 6721 +``` + +6. 
Decode node 3 + +```shell +# change ip to your own +python launch_online_dp.py --dp-size 16 --tp-size 4 --dp-size-local 4 --dp-rank-start 12 --dp-address $node_d0_ip --dp-rpc-port 10523 --vllm-start-port 6721 +``` + +### Request Forwarding + +To set up request forwarding, run the following script on any machine. You can get the proxy program in the repository's examples: [load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py) + +```shell +unset http_proxy +unset https_proxy + +python load_balance_proxy_server_example.py \ + --port 8000 \ + --host 0.0.0.0 \ + --prefiller-hosts \ + $node_p0_ip \ + $node_p0_ip \ + $node_p1_ip \ + $node_p1_ip \ + --prefiller-ports \ + 6700 6701 \ + 6700 6701 \ + --decoder-hosts \ + $node_d0_ip \ + $node_d0_ip \ + $node_d0_ip \ + $node_d0_ip \ + $node_d1_ip \ + $node_d1_ip \ + $node_d1_ip \ + $node_d1_ip \ + $node_d2_ip \ + $node_d2_ip \ + $node_d2_ip \ + $node_d2_ip \ + $node_d3_ip \ + $node_d3_ip \ + $node_d3_ip \ + $node_d3_ip \ + --decoder-ports \ + 6721 6722 6723 6724 \ + 6721 6722 6723 6724 \ + 6721 6722 6723 6724 \ + 6721 6722 6723 6724 +``` ## Accuracy Evaluation diff --git a/docs/source/user_guide/release_notes.md b/docs/source/user_guide/release_notes.md index 89b97078..30e9a0fd 100644 --- a/docs/source/user_guide/release_notes.md +++ b/docs/source/user_guide/release_notes.md @@ -17,7 +17,7 @@ This is the first release candidate of v0.17.0 for vLLM Ascend. Please follow th - FlashLB algorithm for EPLB: supports per-step heat collection and multi-stage load balancing for better expert parallelism efficiency. [#6477](https://github.com/vllm-project/vllm-ascend/pull/6477) - LoRA with tensor parallel and `--fully-sharded-loras` is now fixed and working. [#6650](https://github.com/vllm-project/vllm-ascend/pull/6650) - LMCacheAscendConnector is added as a new KV cache pooling solution for Ascend. [#6882](https://github.com/vllm-project/vllm-ascend/pull/6882) -- W8A8C8 quantization is now supported for DeepSeek-V3.2 and GLM5 in PD-mix scenario. [#7029](https://github.com/vllm-project/vllm-ascend/pull/7029) +- W8A8C8 quantization is now supported for DeepSeek-V3.2 in PD-mix scenario. [#7029](https://github.com/vllm-project/vllm-ascend/pull/7029) - [Experimental] Minimax-m2.5 model is now supported on Ascend NPU. [#7105](https://github.com/vllm-project/vllm-ascend/pull/7105) - [Experimental] Mooncake Layerwise Connector now supports hybrid attention manager with multiple KV cache groups. [#7022](https://github.com/vllm-project/vllm-ascend/pull/7022) - [Experimental] Prefix cache is now supported in hybrid model. 
[#7103](https://github.com/vllm-project/vllm-ascend/pull/7103)
diff --git a/tests/e2e/nightly/single_node/models/configs/GLM-5.yaml b/tests/e2e/nightly/single_node/models/configs/GLM-5.yaml
new file mode 100644
index 00000000..8be988cb
--- /dev/null
+++ b/tests/e2e/nightly/single_node/models/configs/GLM-5.yaml
@@ -0,0 +1,83 @@
+# ==========================================
+# Shared Configurations
+# ==========================================
+
+_envs: &envs
+  HCCL_BUFFSIZE: "200"
+  SERVER_PORT: "DEFAULT_PORT"
+  HCCL_OP_EXPANSION_MODE: "AIV"
+  OMP_PROC_BIND: "false"
+  OMP_NUM_THREADS: "1"
+  PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
+  VLLM_ASCEND_BALANCE_SCHEDULING: "1"
+
+_server_cmd: &server_cmd
+  - "--enable-expert-parallel"
+  - "--tensor-parallel-size"
+  - "16"
+  - "--data-parallel-size"
+  - "1"
+  - "--port"
+  - "$SERVER_PORT"
+  - "--max-model-len"
+  - "8192"
+  - "--max-num-batched-tokens"
+  - "4096"
+  - "--trust-remote-code"
+  - "--gpu-memory-utilization"
+  - "0.95"
+  - "--max-num-seqs"
+  - "8"
+  - "--quantization"
+  - "ascend"
+  - "--async-scheduling"
+  - "--additional-config"
+  - '{"enable_npugraph_ex": true,"fuse_muls_add":true,"multistream_overlap_shared_expert":true}'
+  - "--speculative-config"
+  - '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
+
+_benchmarks: &benchmarks
+  acc:
+    case_type: accuracy
+    dataset_path: vllm-ascend/gsm8k-lite
+    request_conf: vllm_api_general_chat
+    dataset_conf: gsm8k/gsm8k_gen_0_shot_cot_chat_prompt
+    max_out_len: 4096
+    batch_size: 8
+    baseline: 95
+    threshold: 5
+  perf:
+    case_type: performance
+    dataset_path: vllm-ascend/GSM8K-in3500-bs400
+    request_conf: vllm_api_stream_chat
+    dataset_conf: gsm8k/gsm8k_gen_0_shot_cot_str_perf
+    num_prompts: 16
+    max_out_len: 1500
+    batch_size: 8
+    request_rate: 0
+    baseline: 1
+    threshold: 0.97
+
+# ==========================================
+# ACTUAL TEST CASES
+# ==========================================
+
+test_cases:
+  - name: "GLM-5-TP16-DP1-decodegraph"
+    model: "Eco-Tech/GLM-5-w4a8"
+    envs:
+      <<: *envs
+    server_cmd: *server_cmd
+    server_cmd_extra:
+      - "--compilation-config"
+      - '{"cudagraph_capture_sizes": [4,8,12,16,20,24,28,32], "cudagraph_mode":"FULL_DECODE_ONLY"}'
+    benchmarks:
+      <<: *benchmarks
+
+  - name: "GLM-5-TP16-DP1-eager"
+    model: "Eco-Tech/GLM-5-w4a8"
+    envs:
+      <<: *envs
+    server_cmd: *server_cmd
+    benchmarks:
+      <<: *benchmarks
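+
+# ------------------------------------------
+# Hypothetical extension (kept as a comment, not part of the tested matrix):
+# the GLM-5-w8a8 weight documented above could be covered with an analogous
+# case by reusing the shared anchors. The model id below is an assumption --
+# verify where the w8a8 weight is actually hosted before enabling it.
+#
+#  - name: "GLM-5-w8a8-TP16-DP1-eager"
+#    model: "Eco-Tech/GLM-5-w8a8"
+#    envs:
+#      <<: *envs
+#    server_cmd: *server_cmd
+#    benchmarks:
+#      <<: *benchmarks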