[Doc] add new doc for mooncake: PD-Colocated cross-node multi-instance validation of Mooncake's KV Cache reuse and performance. (#5415)

### What this PR does / why we need it?
This documentation provides a comprehensive technical guide for
deploying **vLLM-Ascend** using a **Prefill-Decode (PD) colocated
architecture** integrated with **Mooncake**, a high-performance
distributed KV Cache transfer engine. As Large Language Model (LLM)
serving scales, managing KV Cache efficiently across distributed nodes
is essential for reducing latency and optimizing hardware utilization.

The tutorial focuses on a multi-instance setup using Huawei **Atlas 800T
A2** nodes. By leveraging Mooncake’s distributed memory pooling, vLLM
instances can achieve seamless **cross-node KV Cache reuse**. This
capability allows an instance to retrieve precomputed cache from a
remote node's DRAM via high-speed **RoCE** networks, effectively
bypassing redundant prefill computations.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: release/v0.13.0
- vLLM main:
0bfd7484fd

---------

Signed-off-by: zhangmuzhibangde <1037640609@qq.com>
Signed-off-by: zhangmuzhi_yuwan <1037640609@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Authored by zhangmuzhi_yuwan, committed by GitHub on 2026-01-05 14:19:57 +08:00
parent 549be94397
commit 6c1a685b30
2 changed files with 337 additions and 0 deletions


@@ -20,6 +20,7 @@ DeepSeek-V3.1.md
DeepSeek-V3.2.md
DeepSeek-R1.md
Kimi-K2-Thinking
pd_colocated_mooncake_multi_instance
pd_disaggregation_mooncake_single_node
pd_disaggregation_mooncake_multi_node
long_sequence_context_parallel_single_node


@@ -0,0 +1,336 @@
# PD-Colocated with Mooncake Multi-Instance
## Getting Started
vLLM-Ascend now supports PD-colocated deployment with Mooncake features.
This guide provides step-by-step instructions to test these features with
constrained resources.
Using the Qwen2.5-72B-Instruct model as an example, this guide demonstrates
how to deploy two PD-colocated vLLM instances with vllm-ascend v0.11.0
(built on vLLM v0.11.0) across two Atlas 800T A2 nodes, with each instance
occupying 4 NPU cards.
## Verify Multi-Node Communication Environment
### Physical Layer Requirements
- The two Atlas 800T A2 nodes must be physically interconnected via a RoCE
network. Without RoCE interconnection, cross-node KV Cache access
performance will be significantly degraded.
- All NPU cards must communicate properly. Intra-node communication uses HCCS,
while inter-node communication uses the RoCE network.
### Verification Process
The following process serves as a reference example. Please modify parameters
such as IP addresses according to your actual environment.
1. Single Node Verification:
Execute the following commands sequentially. The results must all be
`success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
```
2. Check NPU Network Configuration:
Ensure that the `/etc/hccn.conf` file exists on the host. If using Docker,
mount it into the container.
```bash
cat /etc/hccn.conf
```
3. Get NPU IP Addresses:
```bash
for i in {0..7}; do hccn_tool -i $i -ip -g; done
```
4. Cross-Node PING Test:
```bash
# Execute the following command on each node, replacing x.x.x.x
# with the target node's NPU card address.
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x; done
```
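Once the per-port output has been collected, it can help to gate the rest of the deployment on a single pass/fail summary. The sketch below is illustrative (the helper name and keyword matching are not part of any Ascend tooling); it follows the expected results stated above (`UP` link status, `success` health status), so adjust the matching if your driver version prints a different format.

```python
# Sketch: summarize the hccn_tool checks above before proceeding.
# Keyword matching ("UP", "success") follows the expected results in this
# guide; adjust if your driver version prints a different format.

def all_ports_healthy(link_lines, health_lines):
    """True only if every port reports link UP and net health success."""
    links_ok = all("UP" in line for line in link_lines)
    health_ok = all("success" in line.lower() for line in health_lines)
    return links_ok and health_ok

# Example with captured output for two ports:
links = ["link status: UP", "link status: UP"]
health = ["net health status: Success", "net health status: Success"]
print(all_ports_healthy(links, health))  # True
```

If any port reports `DOWN` or a failed health check, fix the fabric before continuing; a degraded RoCE link will silently slow cross-node KV Cache transfers rather than fail loudly.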
## Run with Docker
Start a Docker container on each node.
```bash
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0
export NAME=vllm-ascend
# Run the container using the defined variables
# This test uses four NPU cards to create the container.
# Mount the hccn.conf file from the host node into the container.
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
## (Optional) Install Mooncake
Mooncake is pre-installed and functional in the v0.11.0 image.
The following installation steps are optional.
Mooncake is the serving platform for Kimi, a leading LLM service provided by
Moonshot AI. Installation and compilation guide:
<https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.
First, obtain the Mooncake project using the following command:
```bash
git clone -b v0.3.7.post2 --depth 1 https://github.com/kvcache-ai/Mooncake.git
cd Mooncake
git submodule update --init --recursive
```
Install MPI:
```bash
apt-get install mpich libmpich-dev -y
```
Install the relevant dependencies (Go installation is not required):
```bash
bash dependencies.sh -y
```
Compile and install:
```bash
mkdir build
cd build
cmake .. -DUSE_ASCEND_DIRECT=ON
make -j
make install
```
After installation, verify that Mooncake is installed correctly:
```bash
python -c "import mooncake; print(mooncake.__file__)"
# Expected output path:
# /usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake/__init__.py
```
## Start Mooncake Master Service
To start the Mooncake master service in one of the node containers, use the
following command:
```bash
docker exec -it vllm-ascend bash
cd /vllm-workspace/Mooncake
mooncake_master --port 50088 \
--eviction_high_watermark_ratio 0.95 \
--eviction_ratio 0.05
```
| Parameter | Value | Explanation |
| ----------------------------- | ----- | ------------------------------------- |
| port | 50088 | Port for the master service |
| eviction_high_watermark_ratio | 0.95 | High watermark ratio (95% threshold) |
| eviction_ratio | 0.05 | Percentage to evict when full (5%) |
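Before pointing vLLM instances at the master, it is worth confirming from each node that the master port actually accepts connections. A minimal reachability check (the helper name is illustrative, not part of Mooncake):

```python
import socket

def master_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Check that the Mooncake master port accepts TCP connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Substitute your master node's IP before running:
# print(master_reachable("<your_server_ip>", 50088))
```

Run this from every node that will join the pool; a `False` result usually points at a firewall rule or a wrong `--port` value rather than a Mooncake problem.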
## Create a Mooncake Configuration File Named mooncake.json
The template for the mooncake.json file is as follows:
```json
{
"metadata_server": "P2PHANDSHAKE",
"protocol": "ascend",
"device_name": "",
"use_ascend_direct": true,
"master_server_address": "<your_server_ip>:50088",
"global_segment_size": 107374182400
}
```
| Parameter | Value | Explanation |
| --------------| ------------------------| -----------------------------------|
| metadata_server | P2PHANDSHAKE | Point-to-point handshake mode |
| protocol | ascend | Ascend proprietary protocol |
| use_ascend_direct | true | Enable direct hardware access |
| master_server_address | 90.90.100.188:50088 (example) | Master server address |
| global_segment_size | 107374182400 | Size per segment (100 GB) |
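A malformed `mooncake.json` only surfaces once an instance tries to register, so a quick pre-flight validation can save a restart cycle. The sketch below is illustrative (the helper and key set are assumptions based on the template above, not a Mooncake API):

```python
# Sketch of a pre-flight check for mooncake.json; helper names are
# illustrative, not part of Mooncake itself.
import json

REQUIRED_KEYS = {
    "metadata_server", "protocol", "use_ascend_direct",
    "master_server_address", "global_segment_size",
}

def validate_mooncake_config(cfg: dict) -> dict:
    """Fail fast on missing keys or a non-positive segment size."""
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"mooncake.json missing keys: {sorted(missing)}")
    if cfg["global_segment_size"] <= 0:
        raise ValueError("global_segment_size must be positive")
    return cfg

# Usage: validate_mooncake_config(json.load(open("mooncake.json")))
# Note that 107374182400 bytes is exactly 100 GiB:
print(107374182400 == 100 * 1024**3)  # True
```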
## vLLM Instance Deployment
Create containers on both Node 1 and Node 2, and launch the
Qwen2.5-72B-Instruct model service in each to test the reusability and
performance of cross-node, cross-instance KV Cache. Instance 1 utilizes NPU
cards [0-3] on the first Atlas 800T A2 server, while Instance 2 utilizes
cards [0-3] on the second server.
### Deploy Instance 1
Replace file paths, host, and port parameters based on your actual environment
configuration.
```bash
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export MOONCAKE_CONFIG_PATH="/vllm-workspace/mooncake.json"
# NPU buffer pool: quantity:size(MB)
# Allocates 4 buffers of 8MB each for KV transfer
export ASCEND_BUFFER_POOL=4:8
vllm serve <path_to_your_model>/Qwen2.5-72B-Instruct/ \
--served-model-name qwen \
--dtype bfloat16 \
--max-model-len 25600 \
--tensor-parallel-size 4 \
--host <your_server_ip> \
--port 8002 \
--max-num-batched-tokens 4096 \
--gpu-memory-utilization 0.9 \
--kv-transfer-config '{
"kv_connector": "MooncakeConnectorStoreV1",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"use_layerwise": false,
"mooncake_rpc_port": "0",
"load_async": true,
"register_buffer": true
}
}'
```
### Deploy Instance 2
The deployment method for Instance 2 is identical to Instance 1. Simply
modify the `--host` and `--port` parameters according to your Instance 2
configuration.
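Loading a 72B model over 4 NPUs takes several minutes, so confirm each instance is actually serving before starting the benchmark. A minimal readiness poll against vLLM's OpenAI-compatible `/v1/models` endpoint (a sketch; substitute your instance's host and port):

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(base_url: str, timeout_s: float = 300.0,
                     poll_s: float = 2.0) -> bool:
    """Poll the OpenAI-compatible /v1/models endpoint until it answers."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models",
                                        timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(poll_s)
    return False

# Example: wait_until_ready("http://<your_server_ip>:8002")
```

Run it once per instance; only proceed to the benchmark after both return `True`.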
### Configuration Parameters
| Parameter | Value | Explanation |
| ----------------- | ----------------------| -------------------------------- |
| kv_connector | MooncakeConnectorStoreV1 | Use StoreV1 version |
| kv_role | kv_both | Enable both produce and consume |
| use_layerwise | false | Transfer entire cache (see note) |
| mooncake_rpc_port | 0 | Automatic port assignment |
| load_async | true | Enable asynchronous loading |
| register_buffer | true | Required for PD-colocated mode |
**Note on use_layerwise:**
- `false`: Transfer entire KV Cache (suitable for cross-node with sufficient
bandwidth)
- `true`: Layer-by-layer transfer (suitable for single-node memory
constraints)
## Benchmark
We recommend using the **aisbench** tool to assess performance. The test uses
**Dataset A**, consisting of fully random data, with the following
configuration:
- Input/output tokens: 1024/10
- Total requests: 100
- Concurrency: 25
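Because the datasets must be fully random (so Dataset B shares no prefix with Dataset A and genuinely evicts its cache), they can be generated from independent seeds. The sketch below is an assumption about how such a dataset might be built, not part of aisbench; it approximates one token per short random word, which is rough, so adjust for your tokenizer if exact prompt lengths matter.

```python
import random
import string

def make_random_dataset(num_requests: int, approx_tokens: int,
                        seed: int) -> list:
    """Build fully random prompts; distinct seeds yield datasets that
    share no common prefix, so Dataset B cannot hit Dataset A's cache."""
    rng = random.Random(seed)
    prompts = []
    for _ in range(num_requests):
        # ~1 token per 5-letter random word is a rough approximation.
        words = ("".join(rng.choices(string.ascii_lowercase, k=5))
                 for _ in range(approx_tokens))
        prompts.append(" ".join(words))
    return prompts

dataset_a = make_random_dataset(num_requests=100, approx_tokens=1024, seed=0)
dataset_b = make_random_dataset(num_requests=100, approx_tokens=1024, seed=1)
print(len(dataset_a))  # 100
```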
The test procedure consists of three steps:
### Step 1: Baseline (No Cache)
Send Dataset A to Instance 1 on Node 1 and record the Time to First Token
(TTFT) as **TTFT1**.
### Preparation for Step 2
Before Step 2, send a fully random Dataset B to Instance 1. Due to the
unified HBM/DRAM KV Cache with LRU (Least Recently Used) eviction policy,
Dataset B's cache evicts Dataset A's cache from HBM, leaving Dataset A's
cache only in Node 1's DRAM.
### Step 2: Local DRAM Hit
Send Dataset A to Instance 1 again to measure the performance when hitting
the KV Cache in local DRAM. Record the TTFT as **TTFT2**.
### Step 3: Cross-Node DRAM Hit
Send Dataset A to Instance 2. With the Mooncake KV Cache pool, this results
in a cross-node KV Cache hit from Node 1's DRAM. Record the TTFT as
**TTFT3**.
**Model Configuration**:
```python
from ais_bench.benchmark.models import VLLMCustomAPIChatStream
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content
models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr="vllm-api-stream-chat",
        path="<path_to_your_model>/Qwen2.5-72B-Instruct",
        model="qwen",
        request_rate=0,
        retry=2,
        host_ip="<your_server_ip>",
        host_port=8002,
        max_out_len=10,
        batch_size=25,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0,
            ignore_eos=True,
        ),
    )
]
```
**Performance Benchmarking Commands**
```shell
ais_bench --models vllm_api_stream_chat \
--datasets gsm8k_gen_0_shot_cot_str_perf \
--debug --summarizer default_perf --mode perf
```
### Test Results
| Requests | Concurrency | TTFT1 (ms) | TTFT2 (ms) | TTFT3 (ms) |
| -------- | ----------- | ---------- | ---------- | ---------- |
| 100      | 25          | 2322       | 739        | 948        |
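The measured TTFT values translate directly into cache-hit speedups over the no-cache baseline:

```python
# Derive cache-hit speedups from the measured TTFT values above.
ttft_no_cache = 2322  # TTFT1: full prefill, no cache
ttft_local = 739      # TTFT2: local DRAM hit
ttft_remote = 948     # TTFT3: cross-node DRAM hit over RoCE

local_speedup = ttft_no_cache / ttft_local
remote_speedup = ttft_no_cache / ttft_remote
print(f"local DRAM hit:      {local_speedup:.2f}x faster TTFT")
print(f"cross-node DRAM hit: {remote_speedup:.2f}x faster TTFT")
```

In this run, the local DRAM hit cuts TTFT by roughly 3.1x, and the cross-node hit still achieves about 2.4x, showing that retrieving KV Cache from a remote node's DRAM over RoCE remains far cheaper than recomputing the prefill.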