### What this PR does / why we need it?
This PR adopt `LLMDataDist` for kv cache register and `pull_blocks`
style disaggregate prefill implementation. The interface implementation
mainly follows the design of NIXL PR
https://github.com/vllm-project/vllm/pull/17751/files#diff-7eaad0b7dee0626bf29d10081b0f0c5e3ea15a4af97e7b182a4e0d35f8346953
.
This PR can be test with the following step:
- Generate the rank table for all machine.
- execute`toy_proxy.py` to launch the disaggregate prefill proxy server,
specify the prefill ip, port and the decode ip, port
- Run the prefill server and decode server.
- send the request to the disaggregate prefill proxy
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.9.2
- vLLM main:
8d0a01a5f2
---------
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
Signed-off-by: liziyu179 <3475441767@qq.com>
Signed-off-by: underfitc <hucong24@huawei.com>
Signed-off-by: zouyida2052 <zouyida@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: underfituu <hzhucong@163.com>
Co-authored-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
Co-authored-by: liziyu179 <3475441767@qq.com>
Co-authored-by: underfitc <hucong24@huawei.com>
Co-authored-by: zouyida2052 <zouyida@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
Co-authored-by: underfituu <hzhucong@163.com>
7.7 KiB
7.7 KiB
Disaggregated Prefill-Decode Deployment Guide
Overview
This demo document provides instructions for running a disaggregated vLLM-ascend service with separate prefill and decode stages across 4 nodes, uses 16 Ascend NPUs for two prefill nodes (P1/P2) and 16 Ascend NPUS for two decode nodes (D1/D2).
Prerequisites
- Ascend NPU environment with vLLM 0.9.1 installed
- Network interfaces configured for distributed communication (eg: eth0)
- Model weights located at
/data01/deepseek_r1_w8a8_zhw
Rank table generation
The rank table is a JSON file that specifies the mapping of Ascend NPU ranks to nodes. The following command generates a rank table for all nodes with 16 cards prefill and 16 cards decode:
Run the following command on every node to generate the rank table:
cd vllm-ascend/examples/disaggregate_prefill_v1/
bash gen_ranktable.sh --ips 172.19.32.175 172.19.241.49 172.19.123.51 172.19.190.36 \
--npus-per-node 8 --network-card-name enp189s0f0 --prefill-device-cnt 16 --decode-device-cnt 16
Rank table will generated at /vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
Start disaggregated vLLM-ascend service
Execution Sequence
- 4 configured node ip are: 172.19.32.175 172.19.241.49 172.19.123.51 172.19.190.36
- Start Prefill on Node 1 (P1)
- Start Prefill on Node 2 (P2)
- Start Decode on Node 1 (D1)
- Start Decode on Node 2 (D2)
- Start proxy server on Node1
- Run prefill server P1 on first node
export HCCL_IF_IP=172.19.32.175 # node ip
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
vllm serve /data01/deepseek_r1_w8a8_zhw \
--host 0.0.0.0 \
--port 20002 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--api-server-count 2 \
--data-parallel-address 172.19.32.175 \
--data-parallel-rpc-port 13356 \
--tensor-parallel-size 8 \
--no-enable-prefix-caching \
--seed 1024 \
--served-model-name deepseek \
--max-model-len 6144 \
--max-num-batched-tokens 6144 \
--trust-remote-code \
--enforce-eager \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "LLMDataDistCMgrConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_producer",
"kv_parallel_size": 1,
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}' \
--additional-config \
'{"torchair_graph_config": {"enabled": false, "enable_multistream_shared_expert": false}, "ascend_scheduler_config":{"enabled":false}}'
- Run prefill server P2 on second node
export HCCL_IF_IP=172.19.241.49
export GLOO_SOCKET_IFNAME="eth0"
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
vllm serve /data01/deepseek_r1_w8a8_zhw \
--host 0.0.0.0 \
--port 20002 \
--headless \
--data-parallel-size 2 \
--data-parallel-start-rank 1 \
--data-parallel-size-local 1 \
--data-parallel-address 172.19.32.175 \
--data-parallel-rpc-port 13356 \
--tensor-parallel-size 8 \
--no-enable-prefix-caching \
--seed 1024 \
--served-model-name deepseek \
--max-model-len 6144 \
--max-num-batched-tokens 6144 \
--trust-remote-code \
--enforce-eager \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "LLMDataDistCMgrConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_producer",
"kv_parallel_size": 1,
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}' \
--additional-config \
'{"torchair_graph_config": {"enabled": false, "enable_multistream_shared_expert": false}, "ascend_scheduler_config":{"enabled":false}}'
- Run decode server d1 on third node
export HCCL_IF_IP=172.19.123.51
export GLOO_SOCKET_IFNAME="eth0"
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
vllm serve /data01/deepseek_r1_w8a8_zhw \
--host 0.0.0.0 \
--port 20002 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--api-server-count 2 \
--data-parallel-address 172.19.123.51 \
--data-parallel-rpc-port 13356 \
--tensor-parallel-size 8 \
--no-enable-prefix-caching \
--seed 1024 \
--served-model-name deepseek \
--max-model-len 6144 \
--max-num-batched-tokens 6144 \
--trust-remote-code \
--enforce-eager \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "LLMDataDistCMgrConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_consumer",
"kv_parallel_size": 1,
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}' \
--additional-config \
'{"torchair_graph_config": {"enabled": false, "enable_multistream_shared_expert": false}, "ascend_scheduler_config":{"enabled":false}}'
- Run decode server d2 on last node
export HCCL_IF_IP=172.19.190.36
export GLOO_SOCKET_IFNAME="eth0"
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
vllm serve /data01/deepseek_r1_w8a8_zhw \
--host 0.0.0.0 \
--port 20002 \
--headless \
--data-parallel-size 2 \
--data-parallel-start-rank 1 \
--data-parallel-size-local 1 \
--data-parallel-address 172.19.123.51 \
--data-parallel-rpc-port 13356 \
--tensor-parallel-size 8 \
--no-enable-prefix-caching \
--seed 1024 \
--served-model-name deepseek \
--max-model-len 6144 \
--max-num-batched-tokens 6144 \
--trust-remote-code \
--enforce-eager \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "LLMDataDistCMgrConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_consumer",
"kv_parallel_size": 1,
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}' \
--additional-config \
'{"torchair_graph_config": {"enabled": false, "enable_multistream_shared_expert": false}, "ascend_scheduler_config":{"enabled":false}}'
- Run proxy server on the first node
cd /vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1
python toy_proxy_server.py --host 172.19.32.175 --port 1025 --prefiller-hosts 172.19.241.49 --prefiller-port 20002 --decoder-hosts 172.19.123.51 --decoder-ports 20002
- Verification Check service health using the proxy server endpoint:
curl http://localhost:1025/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek",
"prompt": "Who are you?",
"max_tokens": 100,
"temperature": 0
}'
- Performance Test performance with vllm benchmark
cd /vllm-workspace/vllm/benchmarks
python3 benchmark_serving.py \
--backend vllm \
--dataset-name random \
--random-input-len 4096 \
--random-output-len 1536 \
--num-prompts 256 \
--ignore-eos \
--model deepseek \
--tokenizer /data01/deepseek_r1_w8a8_zhw \
--host localhost \
--port 8000 \
--endpoint /v1/completions \
--max-concurrency 4 \
--request-rate 4