xc-llm-ascend/docs/source/tutorials/pd_colocated_mooncake_multi_instance.md

# PD-Colocated with Mooncake Multi-Instance

## Getting Started

vLLM-Ascend now supports PD-colocated deployment with Mooncake features.
This guide provides step-by-step instructions to test these features with
constrained resources.

Using the Qwen2.5-72B-Instruct model as an example, this guide demonstrates
how to use vllm-ascend v0.11.0 (with vLLM v0.11.0) on two Atlas 800T A2
nodes to deploy two vLLM instances. Each instance occupies 4 NPU cards and
uses PD-colocated deployment.

## Verify Multi-Node Communication Environment

### Physical Layer Requirements

- The two Atlas 800T A2 nodes must be physically interconnected via a RoCE
  network. Without RoCE interconnection, cross-node KV Cache access
  performance will be significantly degraded.
- All NPU cards must communicate properly. Intra-node communication uses HCCS,
  while inter-node communication uses the RoCE network.

### Verification Process

The following process serves as a reference example. Please modify parameters
such as IP addresses according to your actual environment.

1. Single Node Verification:

   Execute the following commands sequentially. The results must all be
   `success` and the status must be `UP`:

   ```bash
   # Check the remote switch ports
   for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
   # Get the link status of the Ethernet ports (UP or DOWN)
   for i in {0..7}; do hccn_tool -i $i -link -g ; done
   # Check the network health status
   for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
   # View the network detected IP configuration
   for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
   # View gateway configuration
   for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
   ```

2. Check NPU HCCN Configuration:

   Ensure that the hccn.conf file exists in the environment. If using Docker,
   mount it into the container.

   ```bash
   cat /etc/hccn.conf
   ```

3. Get NPU IP Addresses:

   ```bash
   for i in {0..7}; do hccn_tool -i $i -ip -g; done
   ```

4. Cross-Node PING Test:

   ```bash
   # Execute the following command on each node, replacing x.x.x.x
   # with the target node's NPU card address.
   for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x; done
   ```

5. Check NPU TLS Configuration

   ```bash
   # The tls settings should be consistent across all nodes
   for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
   ```

## Run with Docker

Start a Docker container on each node.

```bash
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0
export NAME=vllm-ascend

# Run the container using the defined variables
# This test uses four NPU cards to create the container.
# Mount the hccn.conf file from the host node into the container.
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:\
/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```

## (Optional) Install Mooncake

Mooncake is pre-installed and functional in the v0.11.0 image.
The following installation steps are optional.

Mooncake is the serving platform for Kimi, a leading LLM service provided by
Moonshot AI. Installation and compilation guide:
<https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.

First, obtain the Mooncake project using the following command:

```bash
git clone -b v0.3.8.post1 --depth 1 https://github.com/kvcache-ai/Mooncake.git
cd Mooncake
git submodule update --init --recursive
```

Install MPI:

```bash
apt-get install mpich libmpich-dev -y
```

Install the relevant dependencies (Go installation is not required):

```bash
bash dependencies.sh -y
```

Compile and install:

```bash
mkdir build
cd build
cmake .. -DUSE_ASCEND_DIRECT=ON
make -j
make install
```

After installation, verify that Mooncake is installed correctly:

```bash
python -c "import mooncake; print(mooncake.__file__)"
# Expected output path:
# /usr/local/Ascend/ascend-toolkit/latest/python/
# site-packages/mooncake/__init__.py
```

## Start Mooncake Master Service

To start the Mooncake master service in one of the node containers, use the
following command:

```bash
docker exec -it vllm-ascend bash
cd /vllm-workspace/Mooncake
mooncake_master --port 50088 \
  --eviction_high_watermark_ratio 0.95 \
  --eviction_ratio 0.05
```

| Parameter                     | Value | Explanation                           |
| ----------------------------- | ----- | ------------------------------------- |
| port                          | 50088 | Port for the master service           |
| eviction_high_watermark_ratio | 0.95  | High watermark ratio (95% threshold)  |
| eviction_ratio                | 0.05  | Percentage to evict when full (5%)    |

## Create a Mooncake Configuration File Named mooncake.json

The template for the mooncake.json file is as follows:

```json
{
    "metadata_server": "P2PHANDSHAKE",
    "protocol": "ascend",
    "device_name": "",
    "use_ascend_direct": true,
    "master_server_address": "<your_server_ip>:50088",
    "global_segment_size": 107374182400
}
```

| Parameter   | Value                  | Explanation                           |
| --------------| ------------------------| -----------------------------------|
| metadata_server | P2PHANDSHAKE              | Point-to-point handshake mode  |
| protocol              | ascend              | Ascend proprietary protocol    |
| use_ascend_direct     | true                | Enable direct hardware access  |
| master_server_address | 90.90.100.188:50088(for example) | Master server address|
| global_segment_size   | 107374182400    | Size per segment (100 GB)      |

## vLLM Instance Deployment

Create containers on both Node 1 and Node 2, and launch the
Qwen2.5-72B-Instruct model service in each to test the reusability and
performance of cross-node, cross-instance KV Cache. Instance 1 utilizes NPU
cards [0-3] on the first Atlas 800T A2 server, while Instance 2 utilizes
cards [0-3] on the second server.

### Deploy Instance 1

Replace file paths, host, and port parameters based on your actual environment
configuration.

```bash
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/\
latest/python/site-packages:$LD_LIBRARY_PATH
export MOONCAKE_CONFIG_PATH="/vllm-workspace/mooncake.json"
# NPU buffer pool: quantity:size(MB)
# Allocates 4 buffers of 8MB each for KV transfer
export ASCEND_BUFFER_POOL=4:8

vllm serve <path_to_your_model>/Qwen2.5-72B-Instruct/ \
--served-model-name qwen \
--dtype bfloat16 \
--max-model-len 25600 \
--tensor-parallel-size 4 \
--host <your_server_ip> \
--port 8002 \
--max-num-batched-tokens 4096 \
--gpu-memory-utilization 0.9 \
--kv-transfer-config '{
      "kv_connector": "MooncakeConnectorStoreV1",
      "kv_role": "kv_both",
      "kv_connector_extra_config": {
          "use_layerwise": false,
          "mooncake_rpc_port": "0",
          "load_async": true,
          "register_buffer": true
      }
  }'
```

### Deploy Instance 2

The deployment method for Instance 2 is identical to Instance 1. Simply
modify the `--host` and `--port` parameters according to your Instance 2
configuration.

### Configuration Parameters

| Parameter         | Value                 | Explanation                      |
| ----------------- | ----------------------| -------------------------------- |
| kv_connector      | MooncakeConnectorStoreV1 | Use StoreV1 version           |
| kv_role         | kv_both                | Enable both produce and consume  |
| use_layerwise     | false                | Transfer entire cache (see note) |
| mooncake_rpc_port | 0                    | Automatic port assignment        |
| load_async        | true                 | Enable asynchronous loading      |
| register_buffer   | true                 | Required for PD-colocated mode   |

**Note on use_layerwise:**

- `false`: Transfer entire KV Cache (suitable for cross-node with sufficient
  bandwidth)
- `true`: Layer-by-layer transfer (suitable for single-node memory
  constraints)

## Benchmark

We recommend using the **aisbench** tool to assess performance. The test uses
**Dataset A**, consisting of fully random data, with the following
configuration:

- Input/output tokens: 1024/10
- Total requests: 100
- Concurrency: 25

The test procedure consists of three steps:

### Step 1: Baseline (No Cache)

Send Dataset A to Instance 1 on Node 1 and record the Time to First Token
(TTFT) as **TTFT1**.

### Preparation for Step 2

Before Step 2, send a fully random Dataset B to Instance 1. Due to the
unified HBM/DRAM KV Cache with LRU (Least Recently Used) eviction policy,
Dataset B's cache evicts Dataset A's cache from HBM, leaving Dataset A's
cache only in Node 1's DRAM.

### Step 2: Local DRAM Hit

Send Dataset A to Instance 1 again to measure the performance when hitting
the KV Cache in local DRAM. Record the TTFT as **TTFT2**.

### Step 3: Cross-Node DRAM Hit

Send Dataset A to Instance 2. With the Mooncake KV Cache pool, this results
in a cross-node KV Cache hit from Node 1's DRAM. Record the TTFT as
**TTFT3**.

**Model Configuration**:

```python
from ais_bench.benchmark.models import VLLMCustomAPIChatStream
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr='vllm-api-stream-chat',
        path="<path_to_your_model>/Qwen2.5-72B-Instruct",
        model="qwen",
        request_rate = 0,
        retry = 2,
        host_ip = "<your_server_ip>",
        host_port = 8002,
        max_out_len = 10,
        batch_size= 25,
        trust_remote_code=False,
        generation_kwargs = dict(
            temperature = 0,
            ignore_eos = True,
        ),
    )
]
```

**Performance Benchmarking Commands**：

```shell
ais_bench --models vllm_api_stream_chat \
  --datasets gsm8k_gen_0_shot_cot_str_perf \
  --debug --summarizer default_perf --mode perf
```

### Test Results

| Requests | Concur | TTFT1 (ms) | TTFT2 (ms) | TTFT3 (ms) |
| -------- | ------ | ---------- | ---------- | ---------- |
| 100      | 25     | 2322       | 739        | 948        |
-												[Doc] add new doc for mooncake: PD-Colocated cross-node multi-instance validation of Mooncake's KV Cache reuse and performance. (#5415)

### What this PR does / why we need it?
This documentation provides a comprehensive technical guide for
deploying **vLLM-Ascend** using a **Prefill-Decode (PD) colocated
architecture** integrated with **Mooncake**, a high-performance
distributed KV Cache transfer engine. As Large Language Model (LLM)
serving scales, managing KV Cache efficiently across distributed nodes
is essential for reducing latency and optimizing hardware utilization.

The tutorial focuses on a multi-instance setup using Huawei **Atlas 800T
A2** nodes. By leveraging Mooncake’s distributed memory pooling, vLLM
instances can achieve seamless **cross-node KV Cache reuse**. This
capability allows an instance to retrieve precomputed cache from a
remote node's DRAM via high-speed **RoCE** networks, effectively
bypassing redundant prefill computations.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: release/v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/pull/31477/commits/0bfd7484fd91ea13f2de1c73c78730df043a3300

---------

Signed-off-by: zhangmuzhibangde <1037640609@qq.com>
Signed-off-by: zhangmuzhi_yuwan <1037640609@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
											
										
										
											2026-01-05 14:19:57 +08:00
+								# PD-Colocated with Mooncake Multi-Instance
 								## Getting Started
 								vLLM-Ascend now supports PD-colocated deployment with Mooncake features.
 								This guide provides step-by-step instructions to test these features with
 								constrained resources.
 								Using the Qwen2.5-72B-Instruct model as an example, this guide demonstrates
 								how to use vllm-ascend v0.11.0 (with vLLM v0.11.0) on two Atlas 800T A2
 								nodes to deploy two vLLM instances. Each instance occupies 4 NPU cards and
 								uses PD-colocated deployment.
 								## Verify Multi-Node Communication Environment
 								### Physical Layer Requirements
 								- The two Atlas 800T A2 nodes must be physically interconnected via a RoCE
 								  network. Without RoCE interconnection, cross-node KV Cache access
 								  performance will be significantly degraded.
 								- All NPU cards must communicate properly. Intra-node communication uses HCCS,
 								  while inter-node communication uses the RoCE network.
 								### Verification Process
 								The following process serves as a reference example. Please modify parameters
 								such as IP addresses according to your actual environment.
 . Single Node Verification:
 								   Execute the following commands sequentially. The results must all be
 								   `success` and the status must be `UP`:
 								   ```bash
 								   # Check the remote switch ports
 								   for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
 								   # Get the link status of the Ethernet ports (UP or DOWN)
 								   for i in {0..7}; do hccn_tool -i $i -link -g ; done
 								   # Check the network health status
 								   for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
 								   # View the network detected IP configuration
 								   for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
 								   # View gateway configuration
 								   for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
 								   ```
-												[Doc] add tls check to pd disaggregation readme  (#5638)

### What this PR does / why we need it?

update pd disaggregation multi_node readme, update the environment check
command for A3, add tls check
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/8be6432bdaf6275664d857b1e5e9bf8ed1ce299e

Signed-off-by: liziyu <liziyu16@huawei.com>
											
										
										
											2026-01-12 15:49:18 +08:00
+. Check NPU HCCN Configuration:
-												[Doc] add new doc for mooncake: PD-Colocated cross-node multi-instance validation of Mooncake's KV Cache reuse and performance. (#5415)

### What this PR does / why we need it?
This documentation provides a comprehensive technical guide for
deploying **vLLM-Ascend** using a **Prefill-Decode (PD) colocated
architecture** integrated with **Mooncake**, a high-performance
distributed KV Cache transfer engine. As Large Language Model (LLM)
serving scales, managing KV Cache efficiently across distributed nodes
is essential for reducing latency and optimizing hardware utilization.

The tutorial focuses on a multi-instance setup using Huawei **Atlas 800T
A2** nodes. By leveraging Mooncake’s distributed memory pooling, vLLM
instances can achieve seamless **cross-node KV Cache reuse**. This
capability allows an instance to retrieve precomputed cache from a
remote node's DRAM via high-speed **RoCE** networks, effectively
bypassing redundant prefill computations.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: release/v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/pull/31477/commits/0bfd7484fd91ea13f2de1c73c78730df043a3300

---------

Signed-off-by: zhangmuzhibangde <1037640609@qq.com>
Signed-off-by: zhangmuzhi_yuwan <1037640609@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
											
										
										
											2026-01-05 14:19:57 +08:00
 								   Ensure that the hccn.conf file exists in the environment. If using Docker,
 								   mount it into the container.
 								   ```bash
 								   cat /etc/hccn.conf
 								   ```
 . Get NPU IP Addresses:
 								   ```bash
 								   for i in {0..7}; do hccn_tool -i $i -ip -g; done
 								   ```
 . Cross-Node PING Test:
 								   ```bash
 								   # Execute the following command on each node, replacing x.x.x.x
 								   # with the target node's NPU card address.
 								   for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x; done
 								   ```
-												[Doc] add tls check to pd disaggregation readme  (#5638)

### What this PR does / why we need it?

update pd disaggregation multi_node readme, update the environment check
command for A3, add tls check
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/8be6432bdaf6275664d857b1e5e9bf8ed1ce299e

Signed-off-by: liziyu <liziyu16@huawei.com>
											
										
										
											2026-01-12 15:49:18 +08:00
+. Check NPU TLS Configuration
 								   ```bash
 								   # The tls settings should be consistent across all nodes
 								   for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
 								   ```
-												[Doc] add new doc for mooncake: PD-Colocated cross-node multi-instance validation of Mooncake's KV Cache reuse and performance. (#5415)

### What this PR does / why we need it?
This documentation provides a comprehensive technical guide for
deploying **vLLM-Ascend** using a **Prefill-Decode (PD) colocated
architecture** integrated with **Mooncake**, a high-performance
distributed KV Cache transfer engine. As Large Language Model (LLM)
serving scales, managing KV Cache efficiently across distributed nodes
is essential for reducing latency and optimizing hardware utilization.

The tutorial focuses on a multi-instance setup using Huawei **Atlas 800T
A2** nodes. By leveraging Mooncake’s distributed memory pooling, vLLM
instances can achieve seamless **cross-node KV Cache reuse**. This
capability allows an instance to retrieve precomputed cache from a
remote node's DRAM via high-speed **RoCE** networks, effectively
bypassing redundant prefill computations.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: release/v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/pull/31477/commits/0bfd7484fd91ea13f2de1c73c78730df043a3300

---------

Signed-off-by: zhangmuzhibangde <1037640609@qq.com>
Signed-off-by: zhangmuzhi_yuwan <1037640609@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
											
										
										
											2026-01-05 14:19:57 +08:00
+								## Run with Docker
 								Start a Docker container on each node.
 								```bash
 								# Update the vllm-ascend image
 								export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0
 								export NAME=vllm-ascend
 								# Run the container using the defined variables
 								# This test uses four NPU cards to create the container.
 								# Mount the hccn.conf file from the host node into the container.
 								docker run --rm \
 								--name $NAME \
 								--net=host \
 								--shm-size=1g \
 								--device /dev/davinci0 \
 								--device /dev/davinci1 \
 								--device /dev/davinci2 \
 								--device /dev/davinci3 \
 								--device /dev/davinci_manager \
 								--device /dev/devmm_svm \
 								--device /dev/hisi_hdc \
 								-v /usr/local/dcmi:/usr/local/dcmi \
 								-v /usr/local/Ascend/driver/tools/hccn_tool:\
 								/usr/local/Ascend/driver/tools/hccn_tool \
 								-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
 								-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
 								-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
 								-v /etc/ascend_install.info:/etc/ascend_install.info \
 								-v /etc/hccn.conf:/etc/hccn.conf \
 								-v /root/.cache:/root/.cache \
 								-it $IMAGE bash
 								```
 								## (Optional) Install Mooncake
 								Mooncake is pre-installed and functional in the v0.11.0 image.
 								The following installation steps are optional.
 								Mooncake is the serving platform for Kimi, a leading LLM service provided by
 								Moonshot AI. Installation and compilation guide:
 								<https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.
 								First, obtain the Mooncake project using the following command:
 								```bash
-												[Misc] Bump mooncake version to v0.3.8.post1 (#6110)

### What this PR does / why we need it?
Since the mooncake has the newer
[release](https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.8.post1),
we pin the tag to latest release

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/d68209402ddab3f54a09bc1f4de9a9495a283b60

Signed-off-by: wangli <wangli858794774@gmail.com>
											
										
										
											2026-01-22 11:03:16 +08:00
+								git clone -b v0.3.8.post1 --depth 1 https://github.com/kvcache-ai/Mooncake.git
-												[Doc] add new doc for mooncake: PD-Colocated cross-node multi-instance validation of Mooncake's KV Cache reuse and performance. (#5415)

### What this PR does / why we need it?
This documentation provides a comprehensive technical guide for
deploying **vLLM-Ascend** using a **Prefill-Decode (PD) colocated
architecture** integrated with **Mooncake**, a high-performance
distributed KV Cache transfer engine. As Large Language Model (LLM)
serving scales, managing KV Cache efficiently across distributed nodes
is essential for reducing latency and optimizing hardware utilization.

The tutorial focuses on a multi-instance setup using Huawei **Atlas 800T
A2** nodes. By leveraging Mooncake’s distributed memory pooling, vLLM
instances can achieve seamless **cross-node KV Cache reuse**. This
capability allows an instance to retrieve precomputed cache from a
remote node's DRAM via high-speed **RoCE** networks, effectively
bypassing redundant prefill computations.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: release/v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/pull/31477/commits/0bfd7484fd91ea13f2de1c73c78730df043a3300

---------

Signed-off-by: zhangmuzhibangde <1037640609@qq.com>
Signed-off-by: zhangmuzhi_yuwan <1037640609@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
											
										
										
											2026-01-05 14:19:57 +08:00
+								cd Mooncake
 								git submodule update --init --recursive
 								```
 								Install MPI:
 								```bash
 								apt-get install mpich libmpich-dev -y
 								```
 								Install the relevant dependencies (Go installation is not required):
 								```bash
 								bash dependencies.sh -y
 								```
 								Compile and install:
 								```bash
 								mkdir build
 								cd build
 								cmake .. -DUSE_ASCEND_DIRECT=ON
 								make -j
 								make install
 								```
 								After installation, verify that Mooncake is installed correctly:
 								```bash
 								python -c "import mooncake; print(mooncake.__file__)"
 								# Expected output path:
 								# /usr/local/Ascend/ascend-toolkit/latest/python/
 								# site-packages/mooncake/__init__.py
 								```
 								## Start Mooncake Master Service
 								To start the Mooncake master service in one of the node containers, use the
 								following command:
 								```bash
 								docker exec -it vllm-ascend bash
 								cd /vllm-workspace/Mooncake
 								mooncake_master --port 50088 \
 								  --eviction_high_watermark_ratio 0.95 \
 								  --eviction_ratio 0.05
 								```
 								| Parameter                     | Value | Explanation                           |
 								| ----------------------------- | ----- | ------------------------------------- |
 								| port                          | 50088 | Port for the master service           |
 								| eviction_high_watermark_ratio | 0.95  | High watermark ratio (95% threshold)  |
 								| eviction_ratio                | 0.05  | Percentage to evict when full (5%)    |
 								## Create a Mooncake Configuration File Named mooncake.json
 								The template for the mooncake.json file is as follows:
 								```json
 								{
 								    "metadata_server": "P2PHANDSHAKE",
 								    "protocol": "ascend",
 								    "device_name": "",
 								    "use_ascend_direct": true,
 								    "master_server_address": "<your_server_ip>:50088",
 								    "global_segment_size": 107374182400
 								}
 								```
 								| Parameter   | Value                  | Explanation                           |
 								| --------------| ------------------------| -----------------------------------|
 								| metadata_server | P2PHANDSHAKE              | Point-to-point handshake mode  |
 								| protocol              | ascend              | Ascend proprietary protocol    |
 								| use_ascend_direct     | true                | Enable direct hardware access  |
 								| master_server_address | 90.90.100.188:50088(for example) | Master server address|
 								| global_segment_size   | 107374182400    | Size per segment (100 GB)      |
 								## vLLM Instance Deployment
 								Create containers on both Node 1 and Node 2, and launch the
 								Qwen2.5-72B-Instruct model service in each to test the reusability and
 								performance of cross-node, cross-instance KV Cache. Instance 1 utilizes NPU
 								cards [0-3] on the first Atlas 800T A2 server, while Instance 2 utilizes
 								cards [0-3] on the second server.
 								### Deploy Instance 1
 								Replace file paths, host, and port parameters based on your actual environment
 								configuration.
 								```bash
 								export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/\
 								latest/python/site-packages:$LD_LIBRARY_PATH
 								export MOONCAKE_CONFIG_PATH="/vllm-workspace/mooncake.json"
 								# NPU buffer pool: quantity:size(MB)
 								# Allocates 4 buffers of 8MB each for KV transfer
 								export ASCEND_BUFFER_POOL=4:8
 								vllm serve <path_to_your_model>/Qwen2.5-72B-Instruct/ \
 								--served-model-name qwen \
 								--dtype bfloat16 \
 								--max-model-len 25600 \
 								--tensor-parallel-size 4 \
 								--host <your_server_ip> \
 								--port 8002 \
 								--max-num-batched-tokens 4096 \
 								--gpu-memory-utilization 0.9 \
 								--kv-transfer-config '{
 								      "kv_connector": "MooncakeConnectorStoreV1",
 								      "kv_role": "kv_both",
 								      "kv_connector_extra_config": {
 								          "use_layerwise": false,
 								          "mooncake_rpc_port": "0",
 								          "load_async": true,
 								          "register_buffer": true
 								      }
 								  }'
 								```
 								### Deploy Instance 2
 								The deployment method for Instance 2 is identical to Instance 1. Simply
 								modify the `--host` and `--port` parameters according to your Instance 2
 								configuration.
 								### Configuration Parameters
 								| Parameter         | Value                 | Explanation                      |
 								| ----------------- | ----------------------| -------------------------------- |
 								| kv_connector      | MooncakeConnectorStoreV1 | Use StoreV1 version           |
 								| kv_role         | kv_both                | Enable both produce and consume  |
 								| use_layerwise     | false                | Transfer entire cache (see note) |
 								| mooncake_rpc_port | 0                    | Automatic port assignment        |
 								| load_async        | true                 | Enable asynchronous loading      |
 								| register_buffer   | true                 | Required for PD-colocated mode   |
 								**Note on use_layerwise:**
 								- `false`: Transfer entire KV Cache (suitable for cross-node with sufficient
 								  bandwidth)
 								- `true`: Layer-by-layer transfer (suitable for single-node memory
 								  constraints)
 								## Benchmark
 								We recommend using the **aisbench** tool to assess performance. The test uses
 								**Dataset A**, consisting of fully random data, with the following
 								configuration:
 								- Input/output tokens: 1024/10
 								- Total requests: 100
 								- Concurrency: 25
 								The test procedure consists of three steps:
 								### Step 1: Baseline (No Cache)
 								Send Dataset A to Instance 1 on Node 1 and record the Time to First Token
 								(TTFT) as **TTFT1**.
 								### Preparation for Step 2
 								Before Step 2, send a fully random Dataset B to Instance 1. Due to the
 								unified HBM/DRAM KV Cache with LRU (Least Recently Used) eviction policy,
 								Dataset B's cache evicts Dataset A's cache from HBM, leaving Dataset A's
 								cache only in Node 1's DRAM.
 								### Step 2: Local DRAM Hit
 								Send Dataset A to Instance 1 again to measure the performance when hitting
 								the KV Cache in local DRAM. Record the TTFT as **TTFT2**.
 								### Step 3: Cross-Node DRAM Hit
 								Send Dataset A to Instance 2. With the Mooncake KV Cache pool, this results
 								in a cross-node KV Cache hit from Node 1's DRAM. Record the TTFT as
 								**TTFT3**.
 								**Model Configuration**:
 								```python
 								from ais_bench.benchmark.models import VLLMCustomAPIChatStream
 								from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content
 								models = [
 								    dict(
 								        attr="service",
 								        type=VLLMCustomAPIChatStream,
 								        abbr='vllm-api-stream-chat',
 								        path="<path_to_your_model>/Qwen2.5-72B-Instruct",
 								        model="qwen",
 								        request_rate = 0,
 								        retry = 2,
 								        host_ip = "<your_server_ip>",
 								        host_port = 8002,
 								        max_out_len = 10,
 								        batch_size= 25,
 								        trust_remote_code=False,
 								        generation_kwargs = dict(
 								            temperature = 0,
 								            ignore_eos = True,
 								        ),
 								    )
 								]
 								```
 								**Performance Benchmarking Commands**：
 								```shell
 								ais_bench --models vllm_api_stream_chat \
 								  --datasets gsm8k_gen_0_shot_cot_str_perf \
 								  --debug --summarizer default_perf --mode perf
 								```
 								### Test Results
 								| Requests | Concur | TTFT1 (ms) | TTFT2 (ms) | TTFT3 (ms) |
 								| -------- | ------ | ---------- | ---------- | ---------- |
 								| 100      | 25     | 2322       | 739        | 948        |