[Doc][PD] Restore the default configuration items in examples/disaggregate_prefill_v1/README.md (#2165)

### What this PR does / why we need it?
- On the D node, the max-num-batched-tokens parameter can be set to a
smaller value because the D node processes at most max-num-seqs sequences
concurrently. Since profile_run only needs to handle max-num-seqs
sequences at a time, max-num-batched-tokens can safely be set equal to
max-num-seqs. This reduces activation memory consumption.
- Restore the default configuration items for PD separation.
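
To make the D-node sizing rule concrete, here is a minimal sketch. It assumes each decode step emits one new token per running sequence (no speculative decoding), and the shell variable names are hypothetical, introduced only for illustration. The two flags are the ones used in the README's decode commands.

```shell
# Illustrative decode-node sizing; the variable names are hypothetical.
MAX_NUM_SEQS=256

# Assumption: a decode step schedules one token per running sequence,
# so the per-step token budget does not need to exceed the sequence cap.
MAX_NUM_BATCHED_TOKENS="$MAX_NUM_SEQS"

# Flags as they would be appended to the decode node's `vllm serve` command.
DECODE_ARGS="--max-num-seqs ${MAX_NUM_SEQS} --max-num-batched-tokens ${MAX_NUM_BATCHED_TOKENS}"
echo "$DECODE_ARGS"
```

Deriving one value from the other keeps the two flags in sync if max-num-seqs is tuned later. The P nodes are not covered by this rule and keep a larger token budget in the README.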
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.0
- vLLM main: 61dcc280fa

Signed-off-by: underfituu <hzhucong@163.com>
hucong
2025-08-04 20:30:53 +08:00
committed by GitHub
parent 957c7f108d
commit e38fab011d

@@ -6,7 +6,7 @@ This demo document provides instructions for running a disaggregated vLLM-ascend
## Prerequisites ## Prerequisites
- Ascend NPU environment with vLLM 0.9.1 installed - Ascend NPU environment with vLLM 0.9.1 installed
- Network interfaces configured for distributed communication (eg: eth0) - Network interfaces configured for distributed communication (eg: eth0)
- Model weights located at `/data01/deepseek_r1_w8a8_zhw` - Model weights located at `/models/deepseek_r1_w8a8`
## Rank table generation ## Rank table generation
The rank table is a JSON file that specifies the mapping of Ascend NPU ranks to nodes. The following command generates a rank table for all nodes with 16 cards prefill and 16 cards decode: The rank table is a JSON file that specifies the mapping of Ascend NPU ranks to nodes. The following command generates a rank table for all nodes with 16 cards prefill and 16 cards decode:
@@ -15,11 +15,15 @@ Run the following command on every node to generate the rank table:
```shell ```shell
cd vllm-ascend/examples/disaggregate_prefill_v1/ cd vllm-ascend/examples/disaggregate_prefill_v1/
bash gen_ranktable.sh --ips 172.19.32.175 172.19.241.49 172.19.123.51 172.19.190.36 \ bash gen_ranktable.sh --ips 172.19.32.175 172.19.241.49 172.19.123.51 172.19.190.36 \
--npus-per-node 8 --network-card-name enp189s0f0 --prefill-device-cnt 16 --decode-device-cnt 16 --npus-per-node 8 --network-card-name eth0 --prefill-device-cnt 16 --decode-device-cnt 16
``` ```
Rank table will generated at `/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json` Rank table will generated at `/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json`
## Start disaggregated vLLM-ascend service ## Start disaggregated vLLM-ascend service
For demonstration purposes, we will utilize the quantized version of Deepseek-R1. Recommended Parallelization Strategies:
- P-node: DP2-TP8-EP16 (Data Parallelism 2, Tensor Parallelism 8, Expert Parallelism 16)
- D-node: DP4-TP4-EP16 (Data Parallelism 4, Tensor Parallelism 4, Expert Parallelism 16)
Execution Sequence Execution Sequence
- 4 configured node ip are: 172.19.32.175 172.19.241.49 172.19.123.51 172.19.190.36 - 4 configured node ip are: 172.19.32.175 172.19.241.49 172.19.123.51 172.19.190.36
- Start Prefill on Node 1 (P1) - Start Prefill on Node 1 (P1)
@@ -28,7 +32,7 @@ Execution Sequence
- Start Decode on Node 2 (D2) - Start Decode on Node 2 (D2)
- Start proxy server on Node1 - Start proxy server on Node1
* Run prefill server P1 on first node Run prefill server P1 on first node:
```shell ```shell
export HCCL_IF_IP=172.19.32.175 # node ip export HCCL_IF_IP=172.19.32.175 # node ip
export GLOO_SOCKET_IFNAME="eth0" # network card name export GLOO_SOCKET_IFNAME="eth0" # network card name
@@ -38,7 +42,9 @@ export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/example
export OMP_PROC_BIND=false export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100 export OMP_NUM_THREADS=100
export VLLM_USE_V1=1 export VLLM_USE_V1=1
vllm serve /data01/deepseek_r1_w8a8_zhw \ export VLLM_LLMDD_RPC_PORT=5559
vllm serve /models/deepseek_r1_w8a8 \
--host 0.0.0.0 \ --host 0.0.0.0 \
--port 20002 \ --port 20002 \
--data-parallel-size 2 \ --data-parallel-size 2 \
@@ -47,11 +53,12 @@ vllm serve /data01/deepseek_r1_w8a8_zhw \
--data-parallel-address 172.19.32.175 \ --data-parallel-address 172.19.32.175 \
--data-parallel-rpc-port 13356 \ --data-parallel-rpc-port 13356 \
--tensor-parallel-size 8 \ --tensor-parallel-size 8 \
--no-enable-prefix-caching \ --enable-expert-parallel \
--seed 1024 \ --seed 1024 \
--served-model-name deepseek \ --served-model-name deepseek \
--max-model-len 6144 \ --max-model-len 32768 \
--max-num-batched-tokens 6144 \ --max-num-batched-tokens 32768 \
--max-num-seqs 256 \
--trust-remote-code \ --trust-remote-code \
--enforce-eager \ --enforce-eager \
--gpu-memory-utilization 0.9 \ --gpu-memory-utilization 0.9 \
@@ -65,10 +72,10 @@ vllm serve /data01/deepseek_r1_w8a8_zhw \
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector" "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}' \ }' \
--additional-config \ --additional-config \
'{"torchair_graph_config": {"enabled": false, "enable_multistream_shared_expert": false}, "ascend_scheduler_config":{"enabled":false}}' '{"chunked_prefill_for_mla":true}'
``` ```
* Run prefill server P2 on second node Run prefill server P2 on second node:
```shell ```shell
export HCCL_IF_IP=172.19.241.49 export HCCL_IF_IP=172.19.241.49
export GLOO_SOCKET_IFNAME="eth0" export GLOO_SOCKET_IFNAME="eth0"
@@ -78,7 +85,9 @@ export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/example
export OMP_PROC_BIND=false export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100 export OMP_NUM_THREADS=100
export VLLM_USE_V1=1 export VLLM_USE_V1=1
vllm serve /data01/deepseek_r1_w8a8_zhw \ export VLLM_LLMDD_RPC_PORT=5659
vllm serve /models/deepseek_r1_w8a8 \
--host 0.0.0.0 \ --host 0.0.0.0 \
--port 20002 \ --port 20002 \
--headless \ --headless \
@@ -88,11 +97,12 @@ vllm serve /data01/deepseek_r1_w8a8_zhw \
--data-parallel-address 172.19.32.175 \ --data-parallel-address 172.19.32.175 \
--data-parallel-rpc-port 13356 \ --data-parallel-rpc-port 13356 \
--tensor-parallel-size 8 \ --tensor-parallel-size 8 \
--no-enable-prefix-caching \ --enable-expert-parallel \
--seed 1024 \ --seed 1024 \
--served-model-name deepseek \ --served-model-name deepseek \
--max-model-len 6144 \ --max-model-len 32768 \
--max-num-batched-tokens 6144 \ --max-num-batched-tokens 32768 \
--max-num-seqs 256 \
--trust-remote-code \ --trust-remote-code \
--enforce-eager \ --enforce-eager \
--gpu-memory-utilization 0.9 \ --gpu-memory-utilization 0.9 \
@@ -106,10 +116,12 @@ vllm serve /data01/deepseek_r1_w8a8_zhw \
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector" "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}' \ }' \
--additional-config \ --additional-config \
'{"torchair_graph_config": {"enabled": false, "enable_multistream_shared_expert": false}, "ascend_scheduler_config":{"enabled":false}}' '{"chunked_prefill_for_mla":true}'
``` ```
* Run decode server d1 on third node Run decode server d1 on third node:
* In the D node, the `max-num-batched-tokens` parameter can be set to a smaller value since the D node processes at most `max-num-seqs` batches concurrently. As the `profile_run` only needs to handle `max-num-seqs` sequences at a time, we can safely set `max-num-batched-tokens` equal to `max-num-seqs`. This optimization will help reduce activation memory consumption.
```shell ```shell
export HCCL_IF_IP=172.19.123.51 export HCCL_IF_IP=172.19.123.51
export GLOO_SOCKET_IFNAME="eth0" export GLOO_SOCKET_IFNAME="eth0"
@@ -119,22 +131,24 @@ export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/example
export OMP_PROC_BIND=false export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100 export OMP_NUM_THREADS=100
export VLLM_USE_V1=1 export VLLM_USE_V1=1
vllm serve /data01/deepseek_r1_w8a8_zhw \ export VLLM_LLMDD_RPC_PORT=5759
vllm serve /models/deepseek_r1_w8a8 \
--host 0.0.0.0 \ --host 0.0.0.0 \
--port 20002 \ --port 20002 \
--data-parallel-size 2 \ --data-parallel-size 4 \
--data-parallel-size-local 1 \ --data-parallel-size-local 2 \
--api-server-count 2 \ --api-server-count 2 \
--data-parallel-address 172.19.123.51 \ --data-parallel-address 172.19.123.51 \
--data-parallel-rpc-port 13356 \ --data-parallel-rpc-port 13356 \
--tensor-parallel-size 8 \ --tensor-parallel-size 4 \
--no-enable-prefix-caching \ --enable-expert-parallel \
--seed 1024 \ --seed 1024 \
--served-model-name deepseek \ --served-model-name deepseek \
--max-model-len 6144 \ --max-model-len 32768 \
--max-num-batched-tokens 6144 \ --max-num-batched-tokens 256 \
--max-num-seqs 256 \
--trust-remote-code \ --trust-remote-code \
--enforce-eager \
--gpu-memory-utilization 0.9 \ --gpu-memory-utilization 0.9 \
--kv-transfer-config \ --kv-transfer-config \
'{"kv_connector": "LLMDataDistCMgrConnector", '{"kv_connector": "LLMDataDistCMgrConnector",
@@ -146,10 +160,10 @@ vllm serve /data01/deepseek_r1_w8a8_zhw \
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector" "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}' \ }' \
--additional-config \ --additional-config \
'{"torchair_graph_config": {"enabled": false, "enable_multistream_shared_expert": false}, "ascend_scheduler_config":{"enabled":false}}' '{"torchair_graph_config": {"enabled":true}}'
``` ```
* Run decode server d2 on last node Run decode server d2 on last node:
```shell ```shell
export HCCL_IF_IP=172.19.190.36 export HCCL_IF_IP=172.19.190.36
export GLOO_SOCKET_IFNAME="eth0" export GLOO_SOCKET_IFNAME="eth0"
@@ -159,23 +173,25 @@ export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/example
export OMP_PROC_BIND=false export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100 export OMP_NUM_THREADS=100
export VLLM_USE_V1=1 export VLLM_USE_V1=1
vllm serve /data01/deepseek_r1_w8a8_zhw \ export VLLM_LLMDD_RPC_PORT=5859
vllm serve /models/deepseek_r1_w8a8 \
--host 0.0.0.0 \ --host 0.0.0.0 \
--port 20002 \ --port 20002 \
--headless \ --headless \
--data-parallel-size 2 \ --data-parallel-size 4 \
--data-parallel-start-rank 1 \ --data-parallel-start-rank 2 \
--data-parallel-size-local 1 \ --data-parallel-size-local 2 \
--data-parallel-address 172.19.123.51 \ --data-parallel-address 172.19.123.51 \
--data-parallel-rpc-port 13356 \ --data-parallel-rpc-port 13356 \
--tensor-parallel-size 8 \ --tensor-parallel-size 4 \
--no-enable-prefix-caching \ --enable-expert-parallel \
--seed 1024 \ --seed 1024 \
--served-model-name deepseek \ --served-model-name deepseek \
--max-model-len 6144 \ --max-model-len 32768 \
--max-num-batched-tokens 6144 \ --max-num-batched-tokens 256 \
--max-num-seqs 256 \
--trust-remote-code \ --trust-remote-code \
--enforce-eager \
--gpu-memory-utilization 0.9 \ --gpu-memory-utilization 0.9 \
--kv-transfer-config \ --kv-transfer-config \
'{"kv_connector": "LLMDataDistCMgrConnector", '{"kv_connector": "LLMDataDistCMgrConnector",
@@ -187,16 +203,16 @@ vllm serve /data01/deepseek_r1_w8a8_zhw \
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector" "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}' \ }' \
--additional-config \ --additional-config \
'{"torchair_graph_config": {"enabled": false, "enable_multistream_shared_expert": false}, "ascend_scheduler_config":{"enabled":false}}' '{"torchair_graph_config": {"enabled":true}}'
``` ```
* Run proxy server on the first node Run proxy server on the first node:
```shell ```shell
cd /vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1 cd /vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1
python toy_proxy_server.py --host 172.19.32.175 --port 1025 --prefiller-hosts 172.19.241.49 --prefiller-port 20002 --decoder-hosts 172.19.123.51 --decoder-ports 20002 python toy_proxy_server.py --host 172.19.32.175 --port 1025 --prefiller-hosts 172.19.241.49 --prefiller-port 20002 --decoder-hosts 172.19.123.51 --decoder-ports 20002
``` ```
* Verification Verification
Check service health using the proxy server endpoint: Check service health using the proxy server endpoint:
```shell ```shell
curl http://localhost:1025/v1/completions \ curl http://localhost:1025/v1/completions \
@@ -209,8 +225,8 @@ curl http://localhost:1025/v1/completions \
}' }'
``` ```
* Performance Performance
Test performance with vllm benchmark Test performance with vllm benchmark:
```shell ```shell
cd /vllm-workspace/vllm/benchmarks cd /vllm-workspace/vllm/benchmarks
python3 benchmark_serving.py \ python3 benchmark_serving.py \
@@ -221,9 +237,9 @@ python3 benchmark_serving.py \
--num-prompts 256 \ --num-prompts 256 \
--ignore-eos \ --ignore-eos \
--model deepseek \ --model deepseek \
--tokenizer /data01/deepseek_r1_w8a8_zhw \ --tokenizer /models/deepseek_r1_w8a8 \
--host localhost \ --host localhost \
--port 8000 \ --port 1025 \
--endpoint /v1/completions \ --endpoint /v1/completions \
--max-concurrency 4 \ --max-concurrency 4 \
--request-rate 4 --request-rate 4