From e38fab011d0b81f3a8e40d9bbe263c283dd4129b Mon Sep 17 00:00:00 2001
From: hucong <33891520+underfituu@users.noreply.github.com>
Date: Mon, 4 Aug 2025 20:30:53 +0800
Subject: [PATCH] [Doc][PD] Restore the default configuration items in
 examples/disaggregate_prefill_v1/README.md (#2165)

### What this PR does / why we need it?
- In the D node, the `max-num-batched-tokens` parameter can be set to a smaller value, since the D node processes at most `max-num-seqs` sequences concurrently. As the `profile_run` only needs to handle `max-num-seqs` sequences at a time, we can safely set `max-num-batched-tokens` equal to `max-num-seqs`. This optimization helps reduce activation memory consumption.
- Restore the default configuration items for PD separation.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vLLM version: v0.10.0
- vLLM main: https://github.com/vllm-project/vllm/commit/61dcc280faf305778c0c44597e823f40063aaed6

Signed-off-by: underfituu
---
 examples/disaggregated_prefill_v1/README.md | 100 ++++++++++++--------
 1 file changed, 58 insertions(+), 42 deletions(-)

diff --git a/examples/disaggregated_prefill_v1/README.md b/examples/disaggregated_prefill_v1/README.md
index 14def1d..a77e3e2 100644
--- a/examples/disaggregated_prefill_v1/README.md
+++ b/examples/disaggregated_prefill_v1/README.md
@@ -6,7 +6,7 @@ This demo document provides instructions for running a disaggregated vLLM-ascend
 ## Prerequisites
 - Ascend NPU environment with vLLM 0.9.1 installed
 - Network interfaces configured for distributed communication (e.g. eth0)
-- Model weights located at `/data01/deepseek_r1_w8a8_zhw`
+- Model weights located at `/models/deepseek_r1_w8a8`
 
 ## Rank table generation
 The rank table is a JSON file that specifies the mapping of Ascend NPU ranks to nodes. The following command generates a rank table for all nodes, with 16 cards for prefill and 16 cards for decode.
@@ -15,11 +15,15 @@ Run the following command on every node to generate the rank table:
 ```shell
 cd vllm-ascend/examples/disaggregate_prefill_v1/
 bash gen_ranktable.sh --ips 172.19.32.175 172.19.241.49 172.19.123.51 172.19.190.36 \
-    --npus-per-node 8 --network-card-name enp189s0f0 --prefill-device-cnt 16 --decode-device-cnt 16
+    --npus-per-node 8 --network-card-name eth0 --prefill-device-cnt 16 --decode-device-cnt 16
 ```
 The rank table will be generated at `/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json`.
 
-## Start disaggregated vLLM-ascend service
+## Start disaggregated vLLM-ascend service
+For demonstration purposes, we will use the quantized version of Deepseek-R1.
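+Before launching the servers below, it can help to verify basic cross-node connectivity over the chosen interface. This is a generic sanity check with standard Linux tools; the interface name and IPs are the ones used throughout this guide, so adjust them to your cluster:
+```shell
+# Sanity check (optional): confirm the NIC is up and every peer node answers.
+ip addr show eth0
+for ip in 172.19.32.175 172.19.241.49 172.19.123.51 172.19.190.36; do
+  ping -c 1 -W 1 "$ip" > /dev/null && echo "$ip reachable" || echo "$ip UNREACHABLE"
+done
+```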
+Recommended Parallelization Strategies:
+- P-node: DP2-TP8-EP16 (Data Parallelism 2, Tensor Parallelism 8, Expert Parallelism 16)
+- D-node: DP4-TP4-EP16 (Data Parallelism 4, Tensor Parallelism 4, Expert Parallelism 16)
+
 Execution Sequence
 - The 4 configured node IPs are: 172.19.32.175 172.19.241.49 172.19.123.51 172.19.190.36
 - Start Prefill on Node 1 (P1)
@@ -28,7 +32,7 @@ Execution Sequence
 - Start Decode on Node 2 (D2)
 - Start proxy server on Node 1
 
-* Run prefill server P1 on first node
+Run prefill server P1 on the first node:
 ```shell
 export HCCL_IF_IP=172.19.32.175 # node ip
 export GLOO_SOCKET_IFNAME="eth0" # network card name
@@ -38,7 +42,9 @@ export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/example
 export OMP_PROC_BIND=false
 export OMP_NUM_THREADS=100
 export VLLM_USE_V1=1
-vllm serve /data01/deepseek_r1_w8a8_zhw \
+export VLLM_LLMDD_RPC_PORT=5559
+
+vllm serve /models/deepseek_r1_w8a8 \
     --host 0.0.0.0 \
     --port 20002 \
     --data-parallel-size 2 \
@@ -47,11 +53,12 @@ vllm serve /data01/deepseek_r1_w8a8_zhw \
     --data-parallel-address 172.19.32.175 \
     --data-parallel-rpc-port 13356 \
     --tensor-parallel-size 8 \
-    --no-enable-prefix-caching \
+    --enable-expert-parallel \
     --seed 1024 \
     --served-model-name deepseek \
-    --max-model-len 6144 \
-    --max-num-batched-tokens 6144 \
+    --max-model-len 32768 \
+    --max-num-batched-tokens 32768 \
+    --max-num-seqs 256 \
     --trust-remote-code \
     --enforce-eager \
     --gpu-memory-utilization 0.9 \
@@ -65,10 +72,10 @@ vllm serve /data01/deepseek_r1_w8a8_zhw \
     "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
     }' \
     --additional-config \
-    '{"torchair_graph_config": {"enabled": false, "enable_multistream_shared_expert": false}, "ascend_scheduler_config":{"enabled":false}}'
+    '{"chunked_prefill_for_mla":true}'
 ```
 
-* Run prefill server P2 on second node
+Run prefill server P2 on the second node:
 ```shell
 export HCCL_IF_IP=172.19.241.49
 export GLOO_SOCKET_IFNAME="eth0"
@@ -78,7 +85,9 @@ export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/example
 export OMP_PROC_BIND=false
 export OMP_NUM_THREADS=100
 export VLLM_USE_V1=1
-vllm serve /data01/deepseek_r1_w8a8_zhw \
+export VLLM_LLMDD_RPC_PORT=5659
+
+vllm serve /models/deepseek_r1_w8a8 \
     --host 0.0.0.0 \
     --port 20002 \
     --headless \
@@ -88,11 +97,12 @@ vllm serve /data01/deepseek_r1_w8a8_zhw \
     --data-parallel-address 172.19.32.175 \
     --data-parallel-rpc-port 13356 \
     --tensor-parallel-size 8 \
-    --no-enable-prefix-caching \
+    --enable-expert-parallel \
     --seed 1024 \
     --served-model-name deepseek \
-    --max-model-len 6144 \
-    --max-num-batched-tokens 6144 \
+    --max-model-len 32768 \
+    --max-num-batched-tokens 32768 \
+    --max-num-seqs 256 \
     --trust-remote-code \
     --enforce-eager \
     --gpu-memory-utilization 0.9 \
@@ -106,10 +116,12 @@ vllm serve /data01/deepseek_r1_w8a8_zhw \
     "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
     }' \
     --additional-config \
-    '{"torchair_graph_config": {"enabled": false, "enable_multistream_shared_expert": false}, "ascend_scheduler_config":{"enabled":false}}'
+    '{"chunked_prefill_for_mla":true}'
 ```
 
-* Run decode server d1 on third node
+Run decode server D1 on the third node:
+
+* In the D node, the `max-num-batched-tokens` parameter can be set to a smaller value, since the D node processes at most `max-num-seqs` sequences concurrently. As the `profile_run` only needs to handle `max-num-seqs` sequences at a time, we can safely set `max-num-batched-tokens` equal to `max-num-seqs`. This optimization will help reduce activation memory consumption; the sketch below summarizes the resulting settings.
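+A quick side-by-side of the scheduler limits used in this guide (values copied from the serve commands on this page; illustrative, not an extra command to run):
+```shell
+# P node: a prefill step needs room for full prompt chunks,
+# so batched tokens track the model length:
+#   --max-model-len 32768  --max-num-batched-tokens 32768  --max-num-seqs 256
+# D node: a decode step emits one token per running sequence,
+# so batched tokens only need to cover the sequence count:
+MAX_NUM_SEQS=256
+MAX_NUM_BATCHED_TOKENS=$MAX_NUM_SEQS   # 256 instead of 32768
+```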
 ```shell
 export HCCL_IF_IP=172.19.123.51
 export GLOO_SOCKET_IFNAME="eth0"
@@ -119,22 +131,24 @@ export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/example
 export OMP_PROC_BIND=false
 export OMP_NUM_THREADS=100
 export VLLM_USE_V1=1
-vllm serve /data01/deepseek_r1_w8a8_zhw \
+export VLLM_LLMDD_RPC_PORT=5759
+
+vllm serve /models/deepseek_r1_w8a8 \
     --host 0.0.0.0 \
     --port 20002 \
-    --data-parallel-size 2 \
-    --data-parallel-size-local 1 \
+    --data-parallel-size 4 \
+    --data-parallel-size-local 2 \
     --api-server-count 2 \
     --data-parallel-address 172.19.123.51 \
     --data-parallel-rpc-port 13356 \
-    --tensor-parallel-size 8 \
-    --no-enable-prefix-caching \
+    --tensor-parallel-size 4 \
+    --enable-expert-parallel \
     --seed 1024 \
     --served-model-name deepseek \
-    --max-model-len 6144 \
-    --max-num-batched-tokens 6144 \
+    --max-model-len 32768 \
+    --max-num-batched-tokens 256 \
+    --max-num-seqs 256 \
     --trust-remote-code \
-    --enforce-eager \
     --gpu-memory-utilization 0.9 \
     --kv-transfer-config \
     '{"kv_connector": "LLMDataDistCMgrConnector",
@@ -146,10 +160,10 @@ vllm serve /data01/deepseek_r1_w8a8_zhw \
     "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
     }' \
     --additional-config \
-    '{"torchair_graph_config": {"enabled": false, "enable_multistream_shared_expert": false}, "ascend_scheduler_config":{"enabled":false}}'
+    '{"torchair_graph_config": {"enabled":true}}'
 ```
 
-* Run decode server d2 on last node
+Run decode server D2 on the last node:
 ```shell
 export HCCL_IF_IP=172.19.190.36
 export GLOO_SOCKET_IFNAME="eth0"
@@ -159,23 +173,25 @@ export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/example
 export OMP_PROC_BIND=false
 export OMP_NUM_THREADS=100
 export VLLM_USE_V1=1
-vllm serve /data01/deepseek_r1_w8a8_zhw \
+export VLLM_LLMDD_RPC_PORT=5859
+
+vllm serve /models/deepseek_r1_w8a8 \
     --host 0.0.0.0 \
     --port 20002 \
     --headless \
-    --data-parallel-size 2 \
-    --data-parallel-start-rank 1 \
-    --data-parallel-size-local 1 \
+    --data-parallel-size 4 \
+    --data-parallel-start-rank 2 \
+    --data-parallel-size-local 2 \
     --data-parallel-address 172.19.123.51 \
     --data-parallel-rpc-port 13356 \
-    --tensor-parallel-size 8 \
-    --no-enable-prefix-caching \
+    --tensor-parallel-size 4 \
+    --enable-expert-parallel \
     --seed 1024 \
     --served-model-name deepseek \
-    --max-model-len 6144 \
-    --max-num-batched-tokens 6144 \
+    --max-model-len 32768 \
+    --max-num-batched-tokens 256 \
+    --max-num-seqs 256 \
     --trust-remote-code \
-    --enforce-eager \
     --gpu-memory-utilization 0.9 \
     --kv-transfer-config \
     '{"kv_connector": "LLMDataDistCMgrConnector",
@@ -187,16 +203,16 @@ vllm serve /data01/deepseek_r1_w8a8_zhw \
     "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
     }' \
     --additional-config \
-    '{"torchair_graph_config": {"enabled": false, "enable_multistream_shared_expert": false}, "ascend_scheduler_config":{"enabled":false}}'
+    '{"torchair_graph_config": {"enabled":true}}'
 ```
 
-* Run proxy server on the first node
+Run the proxy server on the first node:
 ```shell
 cd /vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1
 python toy_proxy_server.py --host 172.19.32.175 --port 1025 --prefiller-hosts 172.19.241.49 --prefiller-port 20002 --decoder-hosts 172.19.123.51 --decoder-ports 20002
 ```
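+Optionally, confirm the proxy is accepting connections before sending any traffic; this is a generic check using standard Linux tools, not part of the vLLM tooling:
+```shell
+# Optional: verify the proxy is listening on its configured port (1025 here)
+ss -ltn | grep 1025
+```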
 
-* Verification
+Verification
 Check service health using the proxy server endpoint:
 ```shell
 curl http://localhost:1025/v1/completions \
@@ -209,8 +225,8 @@ curl http://localhost:1025/v1/completions \
   }'
 ```
-* Performance
-Test performance with vllm benchmark
+Performance
+Test performance with the vLLM benchmark script:
 ```shell
 cd /vllm-workspace/vllm/benchmarks
 python3 benchmark_serving.py \
@@ -221,9 +237,9 @@ python3 benchmark_serving.py \
     --backend vllm \
     --dataset-name random \
     --random-input-len 4096 \
    --random-output-len 1536 \
    --num-prompts 256 \
    --ignore-eos \
    --model deepseek \
-    --tokenizer /data01/deepseek_r1_w8a8_zhw \
+    --tokenizer /models/deepseek_r1_w8a8 \
    --host localhost \
-    --port 8000 \
+    --port 1025 \
    --endpoint /v1/completions \
    --max-concurrency 4 \
    --request-rate 4
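 ```

+While the benchmark runs, you can optionally watch device utilization on each node with the standard Ascend `npu-smi` tool (assuming it is available in `PATH`):
+```shell
+# Optional: refresh the NPU utilization view every second during the benchmark
+watch -n 1 npu-smi info
+```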