Files
xc-llm-ascend/docs/source/tutorials/pd_colocated_mooncake_multi_instance.md
zhangmuzhi_yuwan 6c1a685b30 [Doc] add new doc for mooncake: PD-Colocated cross-node multi-instance validation of Mooncake's KV Cache reuse and performance. (#5415)
### What this PR does / why we need it?
This documentation provides a comprehensive technical guide for
deploying **vLLM-Ascend** using a **Prefill-Decode (PD) colocated
architecture** integrated with **Mooncake**, a high-performance
distributed KV Cache transfer engine. As Large Language Model (LLM)
serving scales, managing KV Cache efficiently across distributed nodes
is essential for reducing latency and optimizing hardware utilization.

The tutorial focuses on a multi-instance setup using Huawei **Atlas 800T
A2** nodes. By leveraging Mooncake’s distributed memory pooling, vLLM
instances can achieve seamless **cross-node KV Cache reuse**. This
capability allows an instance to retrieve precomputed cache from a
remote node's DRAM via high-speed **RoCE** networks, effectively
bypassing redundant prefill computations.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: release/v0.13.0
- vLLM main:
0bfd7484fd

---------

Signed-off-by: zhangmuzhibangde <1037640609@qq.com>
Signed-off-by: zhangmuzhi_yuwan <1037640609@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2026-01-05 14:19:57 +08:00


PD-Colocated with Mooncake Multi-Instance

Getting Started

vLLM-Ascend now supports PD-colocated deployment with Mooncake. This guide provides step-by-step instructions for testing these features with constrained resources.

Using the Qwen2.5-72B-Instruct model as an example, this guide demonstrates how to use vllm-ascend v0.11.0 (with vLLM v0.11.0) on two Atlas 800T A2 nodes to deploy two vLLM instances, each occupying four NPU cards and running in PD-colocated mode.
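
For orientation, the topology used throughout this guide is sketched below. The Mooncake master service (started later in this guide) can run in either node's container; Node 1 is assumed here:

Node 1 (Atlas 800T A2)                       Node 2 (Atlas 800T A2)
+----------------------------+               +----------------------------+
| Instance 1: NPU 0-3 (TP=4) | <--- RoCE --> | Instance 2: NPU 0-3 (TP=4) |
| mooncake_master :50088     |               |                            |
+----------------------------+               +----------------------------+
          shared Mooncake KV Cache pool (DRAM on both nodes)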

Verify Multi-Node Communication Environment

Physical Layer Requirements

  • The two Atlas 800T A2 nodes must be physically interconnected via a RoCE network. Without RoCE interconnection, cross-node KV Cache access performance will be significantly degraded.
  • All NPU cards must communicate properly. Intra-node communication uses HCCS, while inter-node communication uses the RoCE network.

Verification Process

The following process serves as a reference example; adjust parameters such as IP addresses to match your actual environment. (A consolidated sweep over all checks is sketched after the steps.)

  1. Single Node Verification:

    Execute the following commands sequentially. Every check must report success, and every link status must be UP:

    # Check the remote switch ports
    for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
    # Get the link status of the Ethernet ports (UP or DOWN)
    for i in {0..7}; do hccn_tool -i $i -link -g ; done
    # Check the network health status
    for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
    # View the network detected IP configuration
    for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
    # View gateway configuration
    for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
    
  2. Check NPU Network Configuration:

    Ensure that the hccn.conf file exists in the environment. If using Docker, mount it into the container.

    cat /etc/hccn.conf
    
  3. Get NPU IP Addresses:

    for i in {0..7}; do hccn_tool -i $i -ip -g; done
    
  4. Cross-Node PING Test:

    # Execute the following command on each node, replacing x.x.x.x
    # with the target node's NPU card address.
    for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x; done
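
The per-card checks in steps 1-4 above can also be wrapped into a single sweep. The following is a minimal sketch; PEER_IPS is a placeholder array that you must fill with the remote node's NPU IP addresses obtained in step 3:

# Consolidated sweep over the link, health, and cross-node ping checks.
# Fill PEER_IPS with the peer node's NPU IPs (from step 3) before running.
PEER_IPS=(x.x.x.1 x.x.x.2 x.x.x.3 x.x.x.4 x.x.x.5 x.x.x.6 x.x.x.7 x.x.x.8)
for i in {0..7}; do
    echo "=== NPU card $i ==="
    hccn_tool -i $i -link -g                          # must report UP
    hccn_tool -i $i -net_health -g                    # must report success
    hccn_tool -i $i -ping -g address ${PEER_IPS[$i]}  # cross-node ping
done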
    

Run with Docker

Start a Docker container on each node.

# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0
export NAME=vllm-ascend

# Run the container using the defined variables
# This test uses four NPU cards to create the container.
# Mount the hccn.conf file from the host node into the container.
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
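
After the container starts, it is worth confirming that the four mounted NPU devices and host tools are visible inside it, for example:

# Inside the container: the four mounted NPUs should appear in the device list
npu-smi info
# The mounted hccn_tool should also work, e.g. query card 0's IP
/usr/local/Ascend/driver/tools/hccn_tool -i 0 -ip -g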

(Optional) Install Mooncake

Mooncake is pre-installed and functional in the v0.11.0 image. The following installation steps are optional.

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and compilation guide: https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries.

First, obtain the Mooncake project using the following command:

git clone -b v0.3.7.post2 --depth 1 https://github.com/kvcache-ai/Mooncake.git
cd Mooncake
git submodule update --init --recursive

Install MPI:

apt-get install mpich libmpich-dev -y

Install the relevant dependencies (Go installation is not required):

bash dependencies.sh -y

Compile and install:

mkdir build
cd build
cmake .. -DUSE_ASCEND_DIRECT=ON
make -j
make install

After installation, verify that Mooncake is installed correctly:

python -c "import mooncake; print(mooncake.__file__)"
# Expected output path:
# /usr/local/Ascend/ascend-toolkit/latest/python/
# site-packages/mooncake/__init__.py

Start Mooncake Master Service

To start the Mooncake master service in one of the node containers, use the following command:

docker exec -it vllm-ascend bash
cd /vllm-workspace/Mooncake
mooncake_master --port 50088 \
  --eviction_high_watermark_ratio 0.95 \
  --eviction_ratio 0.05

Parameter                       Value   Explanation
port                            50088   Port the master service listens on
eviction_high_watermark_ratio   0.95    Eviction is triggered once memory usage reaches 95%
eviction_ratio                  0.05    Fraction of entries evicted per eviction round (5%)
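
Before pointing the vLLM instances at the master, you can verify that the service is listening and reachable from the peer node. A minimal check, assuming the master runs on <your_server_ip> (nc must be available on the peer node):

# On the master node: confirm mooncake_master is listening on port 50088
ss -ltn | grep 50088
# From the other node: confirm the port is reachable over the network
nc -zv <your_server_ip> 50088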

Create a Mooncake Configuration File Named mooncake.json

The template for the mooncake.json file is as follows:

{
    "metadata_server": "P2PHANDSHAKE",
    "protocol": "ascend",
    "device_name": "",
    "use_ascend_direct": true,
    "master_server_address": "<your_server_ip>:50088",
    "global_segment_size": 107374182400
}

Parameter               Value                    Explanation
metadata_server         P2PHANDSHAKE             Point-to-point handshake mode
protocol                ascend                   Ascend proprietary protocol
use_ascend_direct       true                     Enable direct hardware access
master_server_address   <your_server_ip>:50088   Master service address (e.g., 90.90.100.188:50088)
global_segment_size     107374182400             Size per segment (100 GiB)
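
The same mooncake.json must be present on both nodes, at the path later exported as MOONCAKE_CONFIG_PATH. A quick way to catch syntax errors before launching the instances:

# Validate the JSON syntax; prints the parsed config or an error location
python -m json.tool /vllm-workspace/mooncake.json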

vLLM Instance Deployment

With the containers running on both Node 1 and Node 2, launch the Qwen2.5-72B-Instruct model service in each to test the reusability and performance of cross-node, cross-instance KV Cache. Instance 1 uses NPU cards [0-3] on the first Atlas 800T A2 server, and Instance 2 uses cards [0-3] on the second.

Deploy Instance 1

Replace file paths, host, and port parameters based on your actual environment configuration.

export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export MOONCAKE_CONFIG_PATH="/vllm-workspace/mooncake.json"
# NPU buffer pool: quantity:size(MB)
# Allocates 4 buffers of 8MB each for KV transfer
export ASCEND_BUFFER_POOL=4:8

vllm serve <path_to_your_model>/Qwen2.5-72B-Instruct/ \
--served-model-name qwen \
--dtype bfloat16 \
--max-model-len 25600 \
--tensor-parallel-size 4 \
--host <your_server_ip> \
--port 8002 \
--max-num-batched-tokens 4096 \
--gpu-memory-utilization 0.9 \
--kv-transfer-config '{
      "kv_connector": "MooncakeConnectorStoreV1",
      "kv_role": "kv_both",
      "kv_connector_extra_config": {
          "use_layerwise": false,
          "mooncake_rpc_port": "0",
          "load_async": true,
          "register_buffer": true
      }
  }'
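
Once the server logs show that startup has completed, a quick smoke test against vLLM's OpenAI-compatible API confirms the instance is serving (substitute your actual host and port):

# The served model name "qwen" (set via --served-model-name) should be listed
curl http://<your_server_ip>:8002/v1/models
# Send a short completion request
curl http://<your_server_ip>:8002/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen", "prompt": "Hello, Mooncake!", "max_tokens": 8}'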

Deploy Instance 2

The deployment method for Instance 2 is identical to Instance 1. Simply modify the --host and --port parameters according to your Instance 2 configuration.

Configuration Parameters

Parameter           Value                      Explanation
kv_connector        MooncakeConnectorStoreV1   Use the StoreV1 Mooncake connector
kv_role             kv_both                    Instance both produces and consumes KV Cache
use_layerwise       false                      Transfer the entire KV Cache at once (see note below)
mooncake_rpc_port   0                          Automatic port assignment
load_async          true                       Enable asynchronous KV loading
register_buffer     true                       Required for PD-colocated mode

Note on use_layerwise:

  • false: Transfer entire KV Cache (suitable for cross-node with sufficient bandwidth)
  • true: Layer-by-layer transfer (suitable for single-node memory constraints)

Benchmark

We recommend using the ais_bench tool to assess performance. The test uses Dataset A, a fully random dataset, with the following configuration:

  • Input/output tokens: 1024/10
  • Total requests: 100
  • Concurrency: 25

The test procedure consists of three steps:

Step 1: Baseline (No Cache)

Send Dataset A to Instance 1 on Node 1 and record the Time to First Token (TTFT) as TTFT1.

Preparation for Step 2

Before Step 2, send a fully random Dataset B to Instance 1. Due to the unified HBM/DRAM KV Cache with LRU (Least Recently Used) eviction policy, Dataset B's cache evicts Dataset A's cache from HBM, leaving Dataset A's cache only in Node 1's DRAM.

Step 2: Local DRAM Hit

Send Dataset A to Instance 1 again to measure the performance when hitting the KV Cache in local DRAM. Record the TTFT as TTFT2.

Step 3: Cross-Node DRAM Hit

Send Dataset A to Instance 2. With the Mooncake KV Cache pool, this results in a cross-node KV Cache hit from Node 1's DRAM. Record the TTFT as TTFT3.

Model Configuration:

from ais_bench.benchmark.models import VLLMCustomAPIChatStream
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr="vllm-api-stream-chat",
        path="<path_to_your_model>/Qwen2.5-72B-Instruct",
        model="qwen",
        request_rate=0,
        retry=2,
        host_ip="<your_server_ip>",
        host_port=8002,
        max_out_len=10,   # matches the 10 output tokens configured above
        batch_size=25,    # matches the concurrency of 25
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0,
            ignore_eos=True,
        ),
    )
]

Performance Benchmarking Commands

ais_bench --models vllm_api_stream_chat \
  --datasets gsm8k_gen_0_shot_cot_str_perf \
  --debug --summarizer default_perf --mode perf

Test Results

Requests   Concurrency   TTFT1 (ms)   TTFT2 (ms)   TTFT3 (ms)
100        25            2322         739          948

Relative to the no-cache baseline (TTFT1), the local DRAM hit reduces TTFT by roughly 68%, and the cross-node DRAM hit through Mooncake still reduces it by roughly 59%, confirming that remote KV Cache reuse avoids most of the prefill cost.