# Prefill-Decode Disaggregation Mooncake Verification (Qwen)
## Getting Started
vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide provides step-by-step instructions to verify these features with constrained resources.
Taking the Qwen3-235B model as an example, we use four Atlas 800T A3 servers to deploy a "2P1D" architecture. Assume the IPs of the prefiller servers are 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and those of the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2); the two decoder servers together host a single decode instance. On each server, 8 NPUs (16 chips) are used to deploy one service instance.
## Verify Multi-Node Communication Environment
### Physical Layer Requirements
- The physical machines must be located on the same LAN, with network connectivity between them.
- All NPUs must be interconnected. Intra-node connectivity is via HCCS, and inter-node connectivity is via RDMA.
### Verification Process
- Single-node verification:
Execute the following commands on each node in sequence. All results must be `success` and all statuses must be `UP`:
```bash
# Check the remote switch ports
for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..15}; do hccn_tool -i $i -link -g; done
# Check the network health status
for i in {0..15}; do hccn_tool -i $i -net_health -g; done
# View the network-detected IP configuration
for i in {0..15}; do hccn_tool -i $i -netdetect -g; done
# View the gateway configuration
for i in {0..15}; do hccn_tool -i $i -gateway -g; done
```
- Check NPU network configuration:
```bash
# Ensure that the hccn.conf file exists in the environment. If using Docker, mount it into the container.
cat /etc/hccn.conf
```
- Get the NPU IP addresses:
```bash
for i in {0..15}; do hccn_tool -i $i -ip -g | grep ipaddr; done
```
- Cross-node ping test:
```bash
# Execute on the target node (replace x.x.x.x with the actual NPU IP address)
for i in {0..15}; do hccn_tool -i $i -ping -g address x.x.x.x; done
```
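If there are many devices and peer addresses to cover, the ping test can be wrapped in a small loop. The sketch below is illustrative only: `PEER_IPS` is a placeholder list that you fill in with the NPU IPs printed on the peer node, and it assumes, as in the checks above, that a successful ping reports `success`.

```bash
# Ping every peer NPU IP from every local device (illustrative sketch).
PEER_IPS=(192.0.1.1 192.0.1.2)   # placeholder: NPU IPs reported by the peer node
for i in {0..15}; do
  for ip in "${PEER_IPS[@]}"; do
    if hccn_tool -i $i -ping -g address "$ip" | grep -q success; then
      echo "device $i -> $ip: ok"
    else
      echo "device $i -> $ip: FAILED"
    fi
  done
done
```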
## Install Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. First, obtain the Mooncake project:
```bash
git clone https://github.com/kvcache-ai/Mooncake.git
```
Update the package index and install Python 3:
```bash
apt-get update
apt-get install python3
```
Modify the Mooncake compilation options to enable the ADXL engine:
```bash
cd Mooncake
vi mooncake-common/common.cmake
# Find this line and set USE_ASCEND_DIRECT to ON:
option(USE_ASCEND_DIRECT "option for using ascend npu with adxl engine" ON)
```
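If you would rather apply the change non-interactively (for example in a setup script), a one-line substitution works as well. This is only a sketch and assumes the `option(USE_ASCEND_DIRECT ...)` line still matches the form shown above:

```bash
# Flip USE_ASCEND_DIRECT to ON without opening an editor (assumes the line format above).
sed -i 's|option(USE_ASCEND_DIRECT.*|option(USE_ASCEND_DIRECT "option for using ascend npu with adxl engine" ON)|' mooncake-common/common.cmake
# Verify the change took effect
grep USE_ASCEND_DIRECT mooncake-common/common.cmake
```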
Install MPI:
```bash
apt-get install mpich libmpich-dev -y
```
Install the remaining dependencies (installing Go is not required):
```bash
bash dependencies.sh -y
```
Compile and install Mooncake:
```bash
mkdir build
cd build
cmake ..
make -j
make install
```
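After `make install`, it can help to refresh the dynamic linker cache so the freshly installed libraries are found at runtime. This is a generic post-install step, and the grep pattern below is only a guess at the library names, so adjust it for your build:

```bash
# Refresh the loader cache and (optionally) confirm the installed libraries are visible.
ldconfig
ldconfig -p | grep -iE 'mooncake|transfer_engine'   # library names may differ per build
```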
## Prefiller/Decoder Deployment
We can run the following scripts to launch a server on the prefiller and decoder nodes, respectively. Note that each P/D node occupies the ports from kv_port to kv_port + num_chips to initialize its socket listeners, so choose kv_port values that do not conflict with other services or instances; a quick way to confirm that a planned range is free is sketched below. Additionally, make sure each node's engine_id is unique to avoid conflicts.
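In the sketch below, `KV_PORT` and `NUM_CHIPS` are illustrative values that you match to your own `kv_port` setting and chip count:

```bash
# Report any port in [KV_PORT, KV_PORT + NUM_CHIPS) that already has a listener.
KV_PORT=30000
NUM_CHIPS=16
for ((p = KV_PORT; p < KV_PORT + NUM_CHIPS; p++)); do
  if ss -ltn | awk '{print $4}' | grep -q ":$p\$"; then
    echo "port $p is already in use"
  fi
done
```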
### Layerwise
:::::{tab-set}
::::{tab-item} Prefiller node 1
```bash
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.1
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
vllm serve /model/Qwen3-235B-A22B-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--api-server-count 1 \
--data-parallel-size 2 \
--data-parallel-size-local 2 \
--data-parallel-address 192.0.0.1 \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--seed 1024 \
--enforce-eager \
--distributed-executor-backend mp \
--served-model-name qwen3-moe \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
::::
::::{tab-item} Prefiller node 2
```bash
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.2
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
vllm serve /model/Qwen3-235B-A22B-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--api-server-count 1 \
--data-parallel-size 2 \
--data-parallel-size-local 2 \
--data-parallel-address 192.0.0.2 \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--seed 1024 \
--enforce-eager \
--distributed-executor-backend mp \
--served-model-name qwen3-moe \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_producer",
"kv_port": "30100",
"engine_id": "1",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
::::
::::{tab-item} Decoder node 1 (master node)
```bash
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.3
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=2048
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
vllm serve /model/Qwen3-235B-A22B-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--api-server-count 1 \
--data-parallel-size 32 \
--data-parallel-size-local 16 \
--data-parallel-address 192.0.0.3 \
--data-parallel-rpc-port 5964 \
--tensor-parallel-size 1 \
--enable-expert-parallel \
--seed 1024 \
--distributed-executor-backend mp \
--served-model-name qwen3-moe \
--max-model-len 32768 \
--max-num-batched-tokens 512 \
--max-num-seqs 16 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--compilation-config '{"cudagraph_capture_sizes":[16]}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_consumer",
"kv_port": "30200",
"engine_id": "2",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
::::
::::{tab-item} Decoder node 2 (worker node)
```bash
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.4
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=2048
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
vllm serve /model/Qwen3-235B-A22B-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--headless \
--data-parallel-size 32 \
--data-parallel-size-local 16 \
--data-parallel-start-rank 16 \
--data-parallel-address 192.0.0.3 \
--data-parallel-rpc-port 5964 \
--tensor-parallel-size 1 \
--enable-expert-parallel \
--seed 1024 \
--distributed-executor-backend mp \
--served-model-name qwen3-moe \
--max-model-len 32768 \
--max-num-batched-tokens 512 \
--max-num-seqs 16 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--compilation-config '{"cudagraph_capture_sizes":[16]}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_consumer",
"kv_port": "30200",
"engine_id": "2",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
::::
:::::
### Non-layerwise
:::::{tab-set}
::::{tab-item} Prefiller node 1
```bash
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.1
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
vllm serve /model/Qwen3-235B-A22B-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--api-server-count 1 \
--data-parallel-size 2 \
--data-parallel-size-local 2 \
--data-parallel-address 192.0.0.1 \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--seed 1024 \
--enforce-eager \
--distributed-executor-backend mp \
--served-model-name qwen3-moe \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnector",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
::::
::::{tab-item} Prefiller node 2
```bash
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.2
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
vllm serve /model/Qwen3-235B-A22B-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--api-server-count 1 \
--data-parallel-size 2 \
--data-parallel-size-local 2 \
--data-parallel-address 192.0.0.2 \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--seed 1024 \
--enforce-eager \
--distributed-executor-backend mp \
--served-model-name qwen3-moe \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnector",
"kv_role": "kv_producer",
"kv_port": "30100",
"engine_id": "1",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
::::
::::{tab-item} Decoder node 1 (master node)
```bash
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.3
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=2048
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
vllm serve /model/Qwen3-235B-A22B-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--api-server-count 1 \
--data-parallel-size 32 \
--data-parallel-size-local 16 \
--data-parallel-address 192.0.0.3 \
--data-parallel-rpc-port 5964 \
--tensor-parallel-size 1 \
--enable-expert-parallel \
--seed 1024 \
--distributed-executor-backend mp \
--served-model-name qwen3-moe \
--max-model-len 32768 \
--max-num-batched-tokens 512 \
--max-num-seqs 16 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--compilation-config '{"cudagraph_capture_sizes":[16]}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnector",
"kv_role": "kv_consumer",
"kv_port": "30200",
"engine_id": "2",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
::::
::::{tab-item} Decoder node 2 (worker node)
```bash
unset ftp_proxy
unset https_proxy
unset http_proxy
export HCCL_IF_IP=192.0.0.4
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=2048
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
vllm serve /model/Qwen3-235B-A22B-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--headless \
--data-parallel-size 32 \
--data-parallel-size-local 16 \
--data-parallel-start-rank 16 \
--data-parallel-address 192.0.0.3 \
--data-parallel-rpc-port 5964 \
--tensor-parallel-size 1 \
--enable-expert-parallel \
--seed 1024 \
--distributed-executor-backend mp \
--served-model-name qwen3-moe \
--max-model-len 32768 \
--max-num-batched-tokens 512 \
--max-num-seqs 16 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--compilation-config '{"cudagraph_capture_sizes":[16]}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnector",
"kv_role": "kv_consumer",
"kv_port": "30200",
"engine_id": "2",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
::::
:::::
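Before starting the proxy, it is worth confirming that every API server has finished loading the model. vLLM exposes a `/health` endpoint on the serving port, so a quick loop over the two prefiller nodes and the decoder master node (the headless decoder node does not run an API server) is enough:

```bash
# Each server should return HTTP 200 once it is ready to accept requests.
for host in 192.0.0.1 192.0.0.2 192.0.0.3; do
  curl -s -o /dev/null -w "$host: %{http_code}\n" "http://$host:8004/health"
done
```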
## Example Proxy for Deployment
Run a proxy server on the same node as the prefiller service instance. You can find the proxy programs in the repository's examples: load_balance_proxy_layerwise_server_example.py or load_balance_proxy_server_example.py.
:::::{tab-set}
::::{tab-item} Layerwise
```bash
python load_balance_proxy_layerwise_server_example.py \
--host 192.0.0.1 \
--port 8080 \
--prefiller-hosts 192.0.0.1 192.0.0.2 \
--prefiller-ports 8004 8004 \
--decoder-hosts 192.0.0.3 \
--decoder-ports 8004
```
::::
::::{tab-item} Non-layerwise
```bash
python load_balance_proxy_server_example.py \
--host 192.0.0.1 \
--port 8080 \
--prefiller-hosts 192.0.0.1 192.0.0.2 \
--prefiller-ports 8004 8004 \
--decoder-hosts 192.0.0.3 \
--decoder-ports 8004
```
::::
:::::
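To keep the proxy running after the shell exits, a standard background-with-log pattern can be used. This is generic shell practice rather than anything specific to the example scripts (shown here with the non-layerwise proxy; the layerwise one works the same way):

```bash
# Run the proxy in the background and capture its output for later inspection.
nohup python load_balance_proxy_server_example.py \
    --host 192.0.0.1 \
    --port 8080 \
    --prefiller-hosts 192.0.0.1 192.0.0.2 \
    --prefiller-ports 8004 8004 \
    --decoder-hosts 192.0.0.3 \
    --decoder-ports 8004 > proxy.log 2>&1 &
```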
## Verification
Check that the whole deployment works by sending a request through the proxy server endpoint:
```bash
curl http://192.0.0.1:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-moe",
"prompt": "Who are you?",
"max_tokens": 100,
"temperature": 0
}'
```
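If the proxy in your checkout also forwards the OpenAI-compatible chat endpoint (verify this in your copy of the example script), the same check can be made against `/v1/chat/completions`:

```bash
curl http://192.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-moe",
"messages": [{"role": "user", "content": "Who are you?"}],
"max_tokens": 100,
"temperature": 0
}'
```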