# PD-Colocated with Mooncake Multi-Instance

## Getting Started

vLLM-Ascend now supports PD-colocated deployment with Mooncake features. This guide provides step-by-step instructions to test these features with constrained resources.

Using the Qwen2.5-72B-Instruct model as an example, this guide demonstrates how to use vllm-ascend v0.11.0 (with vLLM v0.11.0) on two Atlas 800T A2 nodes to deploy two vLLM instances. Each instance occupies 4 NPU cards and uses PD-colocated deployment.

## Verify Multi-Node Communication Environment

### Physical Layer Requirements

- The two Atlas 800T A2 nodes must be physically interconnected via a RoCE network. Without RoCE interconnection, cross-node KV Cache access performance will be significantly degraded.
- All NPU cards must communicate properly. Intra-node communication uses HCCS, while inter-node communication uses the RoCE network.

### Verification Process

The following process serves as a reference example. Please modify parameters such as IP addresses according to your actual environment.

1. Single Node Verification:

   Execute the following commands sequentially. The results must all be `success` and the status must be `UP`:

   ```bash
   # Check the remote switch ports
   for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
   # Get the link status of the Ethernet ports (UP or DOWN)
   for i in {0..7}; do hccn_tool -i $i -link -g ; done
   # Check the network health status
   for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
   # View the network detected IP configuration
   for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
   # View gateway configuration
   for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
   ```

2. Check NPU HCCN Configuration:

   Ensure that the hccn.conf file exists in the environment. If using Docker, mount it into the container.

   ```bash
   cat /etc/hccn.conf
   ```

3. Get NPU IP Addresses:

   ```bash
   for i in {0..7}; do hccn_tool -i $i -ip -g; done
   ```

4. Cross-Node PING Test:

   ```bash
   # Execute the following command on each node, replacing x.x.x.x
   # with the target node's NPU card address.
   for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x; done
   ```

5. Check NPU TLS Configuration:

   ```bash
   # The tls settings should be consistent across all nodes
   for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
   ```

## Run with Docker

Start a Docker container on each node.

```bash
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0
export NAME=vllm-ascend

# Run the container using the defined variables
# This test uses four NPU cards to create the container.
# Mount the hccn.conf file from the host node into the container.
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```

## (Optional) Install Mooncake

Mooncake is pre-installed and functional in the v0.11.0 image. The following installation steps are optional.

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and compilation guide: .

First, obtain the Mooncake project using the following command:

```bash
git clone -b v0.3.8.post1 --depth 1 https://github.com/kvcache-ai/Mooncake.git
cd Mooncake
git submodule update --init --recursive
```

Install MPI:

```bash
apt-get install mpich libmpich-dev -y
```

Install the relevant dependencies (Go installation is not required):

```bash
bash dependencies.sh -y
```

Compile and install:

```bash
mkdir build
cd build
cmake .. -DUSE_ASCEND_DIRECT=ON
make -j
make install
```

After installation, verify that Mooncake is installed correctly:

```bash
python -c "import mooncake; print(mooncake.__file__)"
# Expected output path:
# /usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake/__init__.py
```

## Start Mooncake Master Service

To start the Mooncake master service in one of the node containers, use the following command:

```bash
docker exec -it vllm-ascend bash
cd /vllm-workspace/Mooncake
mooncake_master --port 50088 \
  --eviction_high_watermark_ratio 0.95 \
  --eviction_ratio 0.05
```

| Parameter                     | Value | Explanation                          |
| ----------------------------- | ----- | ------------------------------------ |
| port                          | 50088 | Port for the master service          |
| eviction_high_watermark_ratio | 0.95  | High watermark ratio (95% threshold) |
| eviction_ratio                | 0.05  | Percentage to evict when full (5%)   |

## Create a Mooncake Configuration File Named mooncake.json

The template for the mooncake.json file is as follows:

```json
{
    "metadata_server": "P2PHANDSHAKE",
    "protocol": "ascend",
    "device_name": "",
    "use_ascend_direct": true,
    "master_server_address": ":50088",
    "global_segment_size": 107374182400
}
```

| Parameter             | Value                         | Explanation                   |
| --------------------- | ----------------------------- | ----------------------------- |
| metadata_server       | P2PHANDSHAKE                  | Point-to-point handshake mode |
| protocol              | ascend                        | Ascend proprietary protocol   |
| use_ascend_direct     | true                          | Enable direct hardware access |
| master_server_address | 90.90.100.188:50088 (example) | Master server address         |
| global_segment_size   | 107374182400                  | Size per segment (100 GiB)    |

## vLLM Instance Deployment

Create containers on both Node 1 and Node 2, and launch the Qwen2.5-72B-Instruct model service in each to test the reusability and performance of cross-node, cross-instance KV Cache. Instance 1 uses NPU cards [0-3] on the first Atlas 800T A2 server, while Instance 2 uses cards [0-3] on the second server.

### Deploy Instance 1

Replace file paths, host, and port parameters based on your actual environment configuration.

```bash
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export MOONCAKE_CONFIG_PATH="/vllm-workspace/mooncake.json"
# NPU buffer pool: quantity:size(MB)
# Allocates 4 buffers of 8MB each for KV transfer
export ASCEND_BUFFER_POOL=4:8

vllm serve /Qwen2.5-72B-Instruct/ \
  --served-model-name qwen \
  --dtype bfloat16 \
  --max-model-len 25600 \
  --tensor-parallel-size 4 \
  --host \
  --port 8002 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.9 \
  --kv-transfer-config '{
    "kv_connector": "MooncakeConnectorStoreV1",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
      "use_layerwise": false,
      "mooncake_rpc_port": "0",
      "load_async": true,
      "register_buffer": true
    }
  }'
```

### Deploy Instance 2

The deployment method for Instance 2 is identical to that of Instance 1. Simply modify the `--host` and `--port` parameters according to your Instance 2 configuration.

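Because the two instances differ only in `--host` and `--port`, the launch command can be generated from one template. The following is a minimal sketch and not part of vLLM or vllm-ascend: `build_serve_command` is a hypothetical helper, and `NODE1_IP` and `NODE2_IP` are placeholders for your actual addresses. The flags and connector settings are copied from the Instance 1 command.

```python
import json
import shlex

# Connector settings copied from the Instance 1 command.
KV_TRANSFER_CONFIG = {
    "kv_connector": "MooncakeConnectorStoreV1",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "use_layerwise": False,
        "mooncake_rpc_port": "0",
        "load_async": True,
        "register_buffer": True,
    },
}


def build_serve_command(host, port):
    """Return the `vllm serve` argument list for one instance (hypothetical helper)."""
    return [
        "vllm", "serve", "/Qwen2.5-72B-Instruct/",
        "--served-model-name", "qwen",
        "--dtype", "bfloat16",
        "--max-model-len", "25600",
        "--tensor-parallel-size", "4",
        "--host", host,
        "--port", str(port),
        "--max-num-batched-tokens", "4096",
        "--gpu-memory-utilization", "0.9",
        "--kv-transfer-config", json.dumps(KV_TRANSFER_CONFIG),
    ]


if __name__ == "__main__":
    # NODE1_IP / NODE2_IP are placeholders; substitute your node addresses.
    for host in ("NODE1_IP", "NODE2_IP"):
        print(shlex.join(build_serve_command(host, 8002)))
```

The printed commands still rely on the `LD_LIBRARY_PATH`, `MOONCAKE_CONFIG_PATH`, and `ASCEND_BUFFER_POOL` exports shown for Instance 1, so set those in each container before running them.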
### Configuration Parameters

| Parameter         | Value                    | Explanation                      |
| ----------------- | ------------------------ | -------------------------------- |
| kv_connector      | MooncakeConnectorStoreV1 | Use StoreV1 version              |
| kv_role           | kv_both                  | Enable both produce and consume  |
| use_layerwise     | false                    | Transfer entire cache (see note) |
| mooncake_rpc_port | 0                        | Automatic port assignment        |
| load_async        | true                     | Enable asynchronous loading      |
| register_buffer   | true                     | Required for PD-colocated mode   |

**Note on use_layerwise:**

- `false`: Transfer entire KV Cache (suitable for cross-node with sufficient bandwidth)
- `true`: Layer-by-layer transfer (suitable for single-node memory constraints)

## Benchmark

We recommend using the **aisbench** tool to assess performance. The test uses **Dataset A**, consisting of fully random data, with the following configuration:

- Input/output tokens: 1024/10
- Total requests: 100
- Concurrency: 25

The test procedure consists of three steps:

### Step 1: Baseline (No Cache)

Send Dataset A to Instance 1 on Node 1 and record the Time to First Token (TTFT) as **TTFT1**.

### Preparation for Step 2

Before Step 2, send a fully random Dataset B to Instance 1. Due to the unified HBM/DRAM KV Cache with LRU (Least Recently Used) eviction policy, Dataset B's cache evicts Dataset A's cache from HBM, leaving Dataset A's cache only in Node 1's DRAM.

### Step 2: Local DRAM Hit

Send Dataset A to Instance 1 again to measure the performance when hitting the KV Cache in local DRAM. Record the TTFT as **TTFT2**.

### Step 3: Cross-Node DRAM Hit

Send Dataset A to Instance 2. With the Mooncake KV Cache pool, this results in a cross-node KV Cache hit from Node 1's DRAM. Record the TTFT as **TTFT3**.

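The reasoning behind the preparation step can be illustrated with a toy model. The sketch below is purely illustrative and does not reflect how Mooncake or vLLM-Ascend implement caching: `TwoTierCache`, the capacity of 2, and the block names are assumptions chosen for the example. It models HBM as a small LRU tier and DRAM as a larger tier that keeps every block, so a block evicted from HBM remains available in DRAM only.

```python
from collections import OrderedDict


class TwoTierCache:
    """Toy model: a small LRU 'HBM' tier backed by a larger 'DRAM' tier."""

    def __init__(self, hbm_capacity):
        self.hbm_capacity = hbm_capacity
        self.hbm = OrderedDict()  # insertion order tracks recency
        self.dram = set()

    def put(self, block):
        # Every block is stored in DRAM and in HBM.
        self.dram.add(block)
        self.hbm[block] = True
        self.hbm.move_to_end(block)
        # Evict the least recently used blocks from HBM only.
        while len(self.hbm) > self.hbm_capacity:
            self.hbm.popitem(last=False)

    def lookup(self, block):
        if block in self.hbm:
            self.hbm.move_to_end(block)
            return "hbm"
        if block in self.dram:
            return "dram"
        return "miss"


cache = TwoTierCache(hbm_capacity=2)
for block in ("A1", "A2"):  # Step 1: Dataset A fills HBM
    cache.put(block)
for block in ("B1", "B2"):  # Preparation: Dataset B pushes A out of HBM
    cache.put(block)
print(cache.lookup("A1"))  # Step 2: Dataset A is now served from DRAM
```

The model covers a single node only. Step 3 differs in that the DRAM tier holding Dataset A belongs to Node 1 while the request arrives at Instance 2 on Node 2.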
**Model Configuration**:

```python
from ais_bench.benchmark.models import VLLMCustomAPIChatStream
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr='vllm-api-stream-chat',
        path="/Qwen2.5-72B-Instruct",
        model="qwen",
        request_rate=0,
        retry=2,
        host_ip="",
        host_port=8002,
        max_out_len=10,
        batch_size=25,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0,
            ignore_eos=True,
        ),
    )
]
```

**Performance Benchmarking Commands**:

```shell
ais_bench --models vllm_api_stream_chat \
  --datasets gsm8k_gen_0_shot_cot_str_perf \
  --debug --summarizer default_perf --mode perf
```

### Test Results

| Requests | Concur | TTFT1 (ms) | TTFT2 (ms) | TTFT3 (ms) |
| -------- | ------ | ---------- | ---------- | ---------- |
| 100      | 25     | 2322       | 739        | 948        |
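
To put the table in relative terms, the short sketch below computes the TTFT reduction against the no-cache baseline. It uses only the three values from the table; the `reduction` helper is introduced here for illustration.

```python
# TTFT values in milliseconds, copied from the results table.
ttft_ms = {"TTFT1": 2322, "TTFT2": 739, "TTFT3": 948}


def reduction(baseline, value):
    """Fractional TTFT reduction relative to the baseline."""
    return (baseline - value) / baseline


baseline = ttft_ms["TTFT1"]
for name in ("TTFT2", "TTFT3"):
    value = ttft_ms[name]
    print(f"{name}: {reduction(baseline, value):.1%} lower, "
          f"{baseline / value:.2f}x faster than TTFT1")

# Extra latency of a cross-node DRAM hit compared with a local DRAM hit.
print(f"Cross-node overhead: {ttft_ms['TTFT3'] - ttft_ms['TTFT2']} ms")
```

In this run, a local DRAM hit cuts TTFT by roughly 68% and a cross-node DRAM hit by roughly 59%, with the cross-node path adding about 209 ms over the local one.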