Files
xc-llm-ascend/docs/source/tutorials/pd_colocated_mooncake_multi_instance.md
zhangmuzhi_yuwan 6c1a685b30 [Doc] add new doc for mooncake: PD-Colocated cross-node multi-instance validation of Mooncake's KV Cache reuse and performance. (#5415)
### What this PR does / why we need it?
This documentation provides a comprehensive technical guide for
deploying **vLLM-Ascend** using a **Prefill-Decode (PD) colocated
architecture** integrated with **Mooncake**, a high-performance
distributed KV Cache transfer engine. As Large Language Model (LLM)
serving scales, managing KV Cache efficiently across distributed nodes
is essential for reducing latency and optimizing hardware utilization.

The tutorial focuses on a multi-instance setup using Huawei **Atlas 800T
A2** nodes. By leveraging Mooncake’s distributed memory pooling, vLLM
instances can achieve seamless **cross-node KV Cache reuse**. This
capability allows an instance to retrieve precomputed cache from a
remote node's DRAM via high-speed **RoCE** networks, effectively
bypassing redundant prefill computations.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: release/v0.13.0
- vLLM main:
0bfd7484fd

---------

Signed-off-by: zhangmuzhibangde <1037640609@qq.com>
Signed-off-by: zhangmuzhi_yuwan <1037640609@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2026-01-05 14:19:57 +08:00


PD-Colocated with Mooncake Multi-Instance

Getting Started

vLLM-Ascend now supports PD-colocated deployment with Mooncake. This guide provides step-by-step instructions for testing these features with constrained resources.

Using the Qwen2.5-72B-Instruct model as an example, this guide demonstrates how to use vllm-ascend v0.11.0 (with vLLM v0.11.0) on two Atlas 800T A2 nodes to deploy two vLLM instances, each occupying four NPU cards and running in PD-colocated mode.
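
For orientation, the topology used throughout this guide is sketched below. The Mooncake master service (started later in this guide) can run in either node's container; Node 1 is assumed here:

Node 1 (Atlas 800T A2)                       Node 2 (Atlas 800T A2)
+----------------------------+               +----------------------------+
| Instance 1: NPU 0-3 (TP=4) | <--- RoCE --> | Instance 2: NPU 0-3 (TP=4) |
| mooncake_master :50088     |               |                            |
+----------------------------+               +----------------------------+
          shared Mooncake KV Cache pool (DRAM on both nodes)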

Verify Multi-Node Communication Environment

Physical Layer Requirements

  • The two Atlas 800T A2 nodes must be physically interconnected via a RoCE network. Without RoCE interconnection, cross-node KV Cache access performance will be significantly degraded.
  • All NPU cards must communicate properly. Intra-node communication uses HCCS, while inter-node communication uses the RoCE network.

Verification Process

The following process serves as a reference example; adjust parameters such as IP addresses to match your actual environment. (A consolidated sweep over all checks is sketched after the steps.)

  1. Single Node Verification:

    Execute the following commands sequentially. Every check must report success, and every link status must be UP:

    # Check the remote switch ports
    for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
    # Get the link status of the Ethernet ports (UP or DOWN)
    for i in {0..7}; do hccn_tool -i $i -link -g ; done
    # Check the network health status
    for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
    # View the network detected IP configuration
    for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
    # View gateway configuration
    for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
    
  2. Check NPU Network Configuration:

    Ensure that the hccn.conf file exists in the environment. If using Docker, mount it into the container.

    cat /etc/hccn.conf
    
  3. Get NPU IP Addresses:

    for i in {0..7}; do hccn_tool -i $i -ip -g; done
    
  4. Cross-Node PING Test:

    # Execute the following command on each node, replacing x.x.x.x
    # with the target node's NPU card address.
    for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x; done
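
The per-card checks in steps 1-4 above can also be wrapped into a single sweep. The following is a minimal sketch; PEER_IPS is a placeholder array that you must fill with the remote node's NPU IP addresses obtained in step 3:

# Consolidated sweep over the link, health, and cross-node ping checks.
# Fill PEER_IPS with the peer node's NPU IPs (from step 3) before running.
PEER_IPS=(x.x.x.1 x.x.x.2 x.x.x.3 x.x.x.4 x.x.x.5 x.x.x.6 x.x.x.7 x.x.x.8)
for i in {0..7}; do
    echo "=== NPU card $i ==="
    hccn_tool -i $i -link -g                          # must report UP
    hccn_tool -i $i -net_health -g                    # must report success
    hccn_tool -i $i -ping -g address ${PEER_IPS[$i]}  # cross-node ping
done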
    

Run with Docker

Start a Docker container on each node.

# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0
export NAME=vllm-ascend

# Run the container using the defined variables
# This test uses four NPU cards to create the container.
# Mount the hccn.conf file from the host node into the container.
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
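
After the container starts, it is worth confirming that the four mounted NPU devices and host tools are visible inside it, for example:

# Inside the container: the four mounted NPUs should appear in the device list
npu-smi info
# The mounted hccn_tool should also work, e.g. query card 0's IP
/usr/local/Ascend/driver/tools/hccn_tool -i 0 -ip -g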

(Optional) Install Mooncake

Mooncake is pre-installed and functional in the v0.11.0 image. The following installation steps are optional.

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and compilation guide: https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries.

First, obtain the Mooncake project using the following command:

git clone -b v0.3.7.post2 --depth 1 https://github.com/kvcache-ai/Mooncake.git
cd Mooncake
git submodule update --init --recursive

Install MPI:

apt-get install mpich libmpich-dev -y

Install the relevant dependencies (Go installation is not required):

bash dependencies.sh -y

Compile and install:

mkdir build
cd build
cmake .. -DUSE_ASCEND_DIRECT=ON
make -j
make install

After installation, verify that Mooncake is installed correctly:

python -c "import mooncake; print(mooncake.__file__)"
# Expected output path:
# /usr/local/Ascend/ascend-toolkit/latest/python/
# site-packages/mooncake/__init__.py

Start Mooncake Master Service

To start the Mooncake master service in one of the node containers, use the following command:

docker exec -it vllm-ascend bash
cd /vllm-workspace/Mooncake
mooncake_master --port 50088 \
  --eviction_high_watermark_ratio 0.95 \
  --eviction_ratio 0.05

Parameter                       Value   Explanation
port                            50088   Port the master service listens on
eviction_high_watermark_ratio   0.95    Eviction is triggered once memory usage reaches 95%
eviction_ratio                  0.05    Fraction of entries evicted per eviction round (5%)
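
Before pointing the vLLM instances at the master, you can verify that the service is listening and reachable from the peer node. A minimal check, assuming the master runs on <your_server_ip> (nc must be available on the peer node):

# On the master node: confirm mooncake_master is listening on port 50088
ss -ltn | grep 50088
# From the other node: confirm the port is reachable over the network
nc -zv <your_server_ip> 50088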

Create a Mooncake Configuration File Named mooncake.json

The template for the mooncake.json file is as follows:

{
    "metadata_server": "P2PHANDSHAKE",
    "protocol": "ascend",
    "device_name": "",
    "use_ascend_direct": true,
    "master_server_address": "<your_server_ip>:50088",
    "global_segment_size": 107374182400
}

Parameter               Value                    Explanation
metadata_server         P2PHANDSHAKE             Point-to-point handshake mode
protocol                ascend                   Ascend proprietary protocol
use_ascend_direct       true                     Enable direct hardware access
master_server_address   <your_server_ip>:50088   Master service address (e.g., 90.90.100.188:50088)
global_segment_size     107374182400             Size per segment (100 GiB)
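
The same mooncake.json must be present on both nodes, at the path later exported as MOONCAKE_CONFIG_PATH. A quick way to catch syntax errors before launching the instances:

# Validate the JSON syntax; prints the parsed config or an error location
python -m json.tool /vllm-workspace/mooncake.json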

vLLM Instance Deployment

With the containers running on both Node 1 and Node 2, launch the Qwen2.5-72B-Instruct model service in each to test the reusability and performance of cross-node, cross-instance KV Cache. Instance 1 uses NPU cards [0-3] on the first Atlas 800T A2 server, and Instance 2 uses cards [0-3] on the second.

Deploy Instance 1

Replace file paths, host, and port parameters based on your actual environment configuration.

export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export MOONCAKE_CONFIG_PATH="/vllm-workspace/mooncake.json"
# NPU buffer pool: quantity:size(MB)
# Allocates 4 buffers of 8MB each for KV transfer
export ASCEND_BUFFER_POOL=4:8

vllm serve <path_to_your_model>/Qwen2.5-72B-Instruct/ \
--served-model-name qwen \
--dtype bfloat16 \
--max-model-len 25600 \
--tensor-parallel-size 4 \
--host <your_server_ip> \
--port 8002 \
--max-num-batched-tokens 4096 \
--gpu-memory-utilization 0.9 \
--kv-transfer-config '{
      "kv_connector": "MooncakeConnectorStoreV1",
      "kv_role": "kv_both",
      "kv_connector_extra_config": {
          "use_layerwise": false,
          "mooncake_rpc_port": "0",
          "load_async": true,
          "register_buffer": true
      }
  }'
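
Once the server logs show that startup has completed, a quick smoke test against vLLM's OpenAI-compatible API confirms the instance is serving (substitute your actual host and port):

# The served model name "qwen" (set via --served-model-name) should be listed
curl http://<your_server_ip>:8002/v1/models
# Send a short completion request
curl http://<your_server_ip>:8002/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen", "prompt": "Hello, Mooncake!", "max_tokens": 8}'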

Deploy Instance 2

The deployment method for Instance 2 is identical to Instance 1. Simply modify the --host and --port parameters according to your Instance 2 configuration.

Configuration Parameters

Parameter           Value                      Explanation
kv_connector        MooncakeConnectorStoreV1   Use the StoreV1 Mooncake connector
kv_role             kv_both                    Instance both produces and consumes KV Cache
use_layerwise       false                      Transfer the entire KV Cache at once (see note below)
mooncake_rpc_port   0                          Automatic port assignment
load_async          true                       Enable asynchronous KV loading
register_buffer     true                       Required for PD-colocated mode

Note on use_layerwise:

  • false: Transfer entire KV Cache (suitable for cross-node with sufficient bandwidth)
  • true: Layer-by-layer transfer (suitable for single-node memory constraints)

Benchmark

We recommend using the ais_bench tool to assess performance. The test uses Dataset A, a fully random dataset, with the following configuration:

  • Input/output tokens: 1024/10
  • Total requests: 100
  • Concurrency: 25

The test procedure consists of three steps:

Step 1: Baseline (No Cache)

Send Dataset A to Instance 1 on Node 1 and record the Time to First Token (TTFT) as TTFT1.

Preparation for Step 2

Before Step 2, send a fully random Dataset B to Instance 1. Due to the unified HBM/DRAM KV Cache with LRU (Least Recently Used) eviction policy, Dataset B's cache evicts Dataset A's cache from HBM, leaving Dataset A's cache only in Node 1's DRAM.

Step 2: Local DRAM Hit

Send Dataset A to Instance 1 again to measure the performance when hitting the KV Cache in local DRAM. Record the TTFT as TTFT2.

Step 3: Cross-Node DRAM Hit

Send Dataset A to Instance 2. With the Mooncake KV Cache pool, this results in a cross-node KV Cache hit from Node 1's DRAM. Record the TTFT as TTFT3.

Model Configuration:

from ais_bench.benchmark.models import VLLMCustomAPIChatStream
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr="vllm-api-stream-chat",
        path="<path_to_your_model>/Qwen2.5-72B-Instruct",
        model="qwen",
        request_rate=0,
        retry=2,
        host_ip="<your_server_ip>",
        host_port=8002,
        max_out_len=10,   # matches the 10 output tokens configured above
        batch_size=25,    # matches the concurrency of 25
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0,
            ignore_eos=True,
        ),
    )
]

Performance Benchmarking Commands

ais_bench --models vllm_api_stream_chat \
  --datasets gsm8k_gen_0_shot_cot_str_perf \
  --debug --summarizer default_perf --mode perf

Test Results

Requests   Concurrency   TTFT1 (ms)   TTFT2 (ms)   TTFT3 (ms)
100        25            2322         739          948

Relative to the no-cache baseline (TTFT1), the local DRAM hit reduces TTFT by roughly 68%, and the cross-node DRAM hit through Mooncake still reduces it by roughly 59%, confirming that remote KV Cache reuse avoids most of the prefill cost.