diff --git a/docs/source/tutorials/index.md b/docs/source/tutorials/index.md
index 5f924209..3c7979a7 100644
--- a/docs/source/tutorials/index.md
+++ b/docs/source/tutorials/index.md
@@ -20,6 +20,7 @@ DeepSeek-V3.1.md
 DeepSeek-V3.2.md
 DeepSeek-R1.md
 Kimi-K2-Thinking
+pd_colocated_mooncake_multi_instance
 pd_disaggregation_mooncake_single_node
 pd_disaggregation_mooncake_multi_node
 long_sequence_context_parallel_single_node
diff --git a/docs/source/tutorials/pd_colocated_mooncake_multi_instance.md b/docs/source/tutorials/pd_colocated_mooncake_multi_instance.md
new file mode 100644
index 00000000..4319e2d1
--- /dev/null
+++ b/docs/source/tutorials/pd_colocated_mooncake_multi_instance.md
@@ -0,0 +1,336 @@
+# PD-Colocated with Mooncake Multi-Instance
+
+## Getting Started
+
+vLLM-Ascend now supports PD-colocated deployment with Mooncake features.
+This guide provides step-by-step instructions to test these features with
+constrained resources.
+
+Using the Qwen2.5-72B-Instruct model as an example, this guide demonstrates
+how to use vllm-ascend v0.11.0 (with vLLM v0.11.0) on two Atlas 800T A2
+nodes to deploy two vLLM instances. Each instance occupies 4 NPU cards and
+uses PD-colocated deployment.
+
+## Verify Multi-Node Communication Environment
+
+### Physical Layer Requirements
+
+- The two Atlas 800T A2 nodes must be physically interconnected via a RoCE
+  network. Without RoCE interconnection, cross-node KV Cache access
+  performance will be significantly degraded.
+- All NPU cards must communicate properly. Intra-node communication uses HCCS,
+  while inter-node communication uses the RoCE network.
+
+### Verification Process
+
+The following process serves as a reference example. Please modify parameters
+such as IP addresses according to your actual environment.
+
+1. Single Node Verification:
+
+   Execute the following commands sequentially. The results must all be
+   `success` and the status must be `UP`:
+
+   ```bash
+   # Check the remote switch ports
+   for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
+   # Get the link status of the Ethernet ports (UP or DOWN)
+   for i in {0..7}; do hccn_tool -i $i -link -g ; done
+   # Check the network health status
+   for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
+   # View the network detected IP configuration
+   for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
+   # View gateway configuration
+   for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
+   ```
+
+2. Check NPU Network Configuration:
+
+   Ensure that the hccn.conf file exists in the environment. If using Docker,
+   mount it into the container.
+
+   ```bash
+   cat /etc/hccn.conf
+   ```
+
+3. Get NPU IP Addresses:
+
+   ```bash
+   for i in {0..7}; do hccn_tool -i $i -ip -g; done
+   ```
+
+4. Cross-Node PING Test:
+
+   ```bash
+   # Execute the following command on each node, replacing x.x.x.x
+   # with the target node's NPU card address.
+   for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x; done
+   ```
+
+## Run with Docker
+
+Start a Docker container on each node.
+
+```bash
+# Update the vllm-ascend image
+export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0
+export NAME=vllm-ascend
+
+# Run the container using the defined variables
+# This test uses four NPU cards to create the container.
+# Mount the hccn.conf file from the host node into the container.
+docker run --rm \
+--name $NAME \
+--net=host \
+--shm-size=1g \
+--device /dev/davinci0 \
+--device /dev/davinci1 \
+--device /dev/davinci2 \
+--device /dev/davinci3 \
+--device /dev/davinci_manager \
+--device /dev/devmm_svm \
+--device /dev/hisi_hdc \
+-v /usr/local/dcmi:/usr/local/dcmi \
+-v /usr/local/Ascend/driver/tools/hccn_tool:\
+/usr/local/Ascend/driver/tools/hccn_tool \
+-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /etc/hccn.conf:/etc/hccn.conf \
+-v /root/.cache:/root/.cache \
+-it $IMAGE bash
+```
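+
+After the container is running on each node, a quick optional check is to
+confirm that the NPUs and the mounted hccn.conf file are visible inside the
+container before proceeding (both tools are mounted by the command above):
+
+```bash
+# Optional sanity check inside the container
+npu-smi info
+cat /etc/hccn.conf
+```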
+
+## (Optional) Install Mooncake
+
+Mooncake is pre-installed and functional in the v0.11.0 image.
+The following installation steps are optional.
+
+Mooncake is the serving platform for Kimi, a leading LLM service provided by
+Moonshot AI. Refer to the Mooncake repository for its installation and
+compilation guide.
+
+First, obtain the Mooncake project using the following commands:
+
+```bash
+git clone -b v0.3.7.post2 --depth 1 https://github.com/kvcache-ai/Mooncake.git
+cd Mooncake
+git submodule update --init --recursive
+```
+
+Install MPI:
+
+```bash
+apt-get install mpich libmpich-dev -y
+```
+
+Install the relevant dependencies (Go installation is not required):
+
+```bash
+bash dependencies.sh -y
+```
+
+Compile and install:
+
+```bash
+mkdir build
+cd build
+cmake .. -DUSE_ASCEND_DIRECT=ON
+make -j
+make install
+```
+
+After installation, verify that Mooncake is installed correctly:
+
+```bash
+python -c "import mooncake; print(mooncake.__file__)"
+# Expected output path:
+# /usr/local/Ascend/ascend-toolkit/latest/python/
+# site-packages/mooncake/__init__.py
+```
+
+## Start Mooncake Master Service
+
+To start the Mooncake master service in one of the node containers, use the
+following commands:
+
+```bash
+docker exec -it vllm-ascend bash
+cd /vllm-workspace/Mooncake
+mooncake_master --port 50088 \
+    --eviction_high_watermark_ratio 0.95 \
+    --eviction_ratio 0.05
+```
+
+| Parameter                      | Value | Explanation                                            |
+| ------------------------------ | ----- | ------------------------------------------------------ |
+| port                           | 50088 | Port for the master service                            |
+| eviction_high_watermark_ratio  | 0.95  | High watermark ratio (95% threshold)                   |
+| eviction_ratio                 | 0.05  | Percentage evicted once the watermark is reached (5%)  |
+
+## Create a Mooncake Configuration File Named mooncake.json
+
+The template for the `mooncake.json` file is as follows:
+
+```json
+{
+    "metadata_server": "P2PHANDSHAKE",
+    "protocol": "ascend",
+    "device_name": "",
+    "use_ascend_direct": true,
+    "master_server_address": "<master_node_ip>:50088",
+    "global_segment_size": 107374182400
+}
+```
+
+| Parameter             | Value                             | Explanation                   |
+| --------------------- | --------------------------------- | ----------------------------- |
+| metadata_server       | P2PHANDSHAKE                      | Point-to-point handshake mode |
+| protocol              | ascend                            | Ascend proprietary protocol   |
+| use_ascend_direct     | true                              | Enable direct hardware access |
+| master_server_address | 90.90.100.188:50088 (for example) | Master server address         |
+| global_segment_size   | 107374182400                      | Size per segment (100 GiB)    |
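+
+The `global_segment_size` value of 107374182400 bytes is 100 GiB
+(100 × 1024³ bytes), i.e. the segment size listed above. As an optional
+sanity check, once the master address placeholder is filled in, you can
+confirm the file parses as valid JSON; the path below assumes the file is
+saved where the deployment step later points `MOONCAKE_CONFIG_PATH`:
+
+```bash
+# Optional: verify mooncake.json is well-formed before starting vLLM
+python -c "import json; print(json.load(open('/vllm-workspace/mooncake.json')))"
+```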
+
+## vLLM Instance Deployment
+
+Create containers on both Node 1 and Node 2, and launch the
+Qwen2.5-72B-Instruct model service in each to test the reusability and
+performance of cross-node, cross-instance KV Cache. Instance 1 utilizes NPU
+cards [0-3] on the first Atlas 800T A2 server, while Instance 2 utilizes
+cards [0-3] on the second server.
+
+### Deploy Instance 1
+
+Replace file paths, host, and port parameters based on your actual environment
+configuration.
+
+```bash
+export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/\
+latest/python/site-packages:$LD_LIBRARY_PATH
+export MOONCAKE_CONFIG_PATH="/vllm-workspace/mooncake.json"
+# NPU buffer pool: quantity:size(MB)
+# Allocates 4 buffers of 8MB each for KV transfer
+export ASCEND_BUFFER_POOL=4:8
+
+vllm serve /Qwen2.5-72B-Instruct/ \
+--served-model-name qwen \
+--dtype bfloat16 \
+--max-model-len 25600 \
+--tensor-parallel-size 4 \
+--host <instance_1_host_ip> \
+--port 8002 \
+--max-num-batched-tokens 4096 \
+--gpu-memory-utilization 0.9 \
+--kv-transfer-config '{
+    "kv_connector": "MooncakeConnectorStoreV1",
+    "kv_role": "kv_both",
+    "kv_connector_extra_config": {
+        "use_layerwise": false,
+        "mooncake_rpc_port": "0",
+        "load_async": true,
+        "register_buffer": true
+    }
+}'
+```
+
+### Deploy Instance 2
+
+The deployment method for Instance 2 is identical to Instance 1. Simply
+modify the `--host` and `--port` parameters according to your Instance 2
+configuration.
+
+### Configuration Parameters
+
+| Parameter         | Value                    | Explanation                       |
+| ----------------- | ------------------------ | --------------------------------- |
+| kv_connector      | MooncakeConnectorStoreV1 | Use the StoreV1 connector         |
+| kv_role           | kv_both                  | Act as both producer and consumer |
+| use_layerwise     | false                    | Transfer entire cache (see note)  |
+| mooncake_rpc_port | 0                        | Automatic port assignment         |
+| load_async        | true                     | Enable asynchronous loading       |
+| register_buffer   | true                     | Required for PD-colocated mode    |
+
+**Note on use_layerwise:**
+
+- `false`: Transfer the entire KV Cache (suitable for cross-node transfer with
+  sufficient bandwidth)
+- `true`: Layer-by-layer transfer (suitable for single-node memory
+  constraints)
+
+## Benchmark
+
+We recommend using the **aisbench** tool to assess performance. The test uses
+**Dataset A**, consisting of fully random data, with the following
+configuration:
+
+- Input/output tokens: 1024/10
+- Total requests: 100
+- Concurrency: 25
+
+The test procedure consists of three steps:
+
+### Step 1: Baseline (No Cache)
+
+Send Dataset A to Instance 1 on Node 1 and record the Time to First Token
+(TTFT) as **TTFT1**.
+
+### Preparation for Step 2
+
+Before Step 2, send a fully random Dataset B to Instance 1. Because the
+unified HBM/DRAM KV Cache uses an LRU (Least Recently Used) eviction policy,
+Dataset B's cache evicts Dataset A's cache from HBM, leaving Dataset A's
+cache only in Node 1's DRAM.
+
+### Step 2: Local DRAM Hit
+
+Send Dataset A to Instance 1 again to measure the performance when hitting
+the KV Cache in local DRAM. Record the TTFT as **TTFT2**.
+
+### Step 3: Cross-Node DRAM Hit
+
+Send Dataset A to Instance 2. With the Mooncake KV Cache pool, this results
+in a cross-node KV Cache hit from Node 1's DRAM. Record the TTFT as
+**TTFT3**.
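+
+Before driving these steps with aisbench, it can be useful to confirm that
+each instance responds through vLLM's OpenAI-compatible API. The snippet
+below is a minimal smoke test; replace the host placeholder with the address
+the instance was started with, and note that the model name matches
+`--served-model-name` above:
+
+```bash
+# Minimal smoke test against one instance
+curl http://<instance_host_ip>:8002/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{"model": "qwen", "prompt": "Hello", "max_tokens": 10}'
+```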
+
+**Model Configuration**:
+
+```python
+from ais_bench.benchmark.models import VLLMCustomAPIChatStream
+from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content
+
+models = [
+    dict(
+        attr="service",
+        type=VLLMCustomAPIChatStream,
+        abbr='vllm-api-stream-chat',
+        path="/Qwen2.5-72B-Instruct",
+        model="qwen",
+        request_rate=0,
+        retry=2,
+        host_ip="",  # set to the IP address of the instance under test
+        host_port=8002,
+        max_out_len=10,
+        batch_size=25,
+        trust_remote_code=False,
+        generation_kwargs=dict(
+            temperature=0,
+            ignore_eos=True,
+        ),
+    )
+]
+```
+
+**Performance Benchmarking Commands**:
+
+```shell
+ais_bench --models vllm_api_stream_chat \
+    --datasets gsm8k_gen_0_shot_cot_str_perf \
+    --debug --summarizer default_perf --mode perf
+```
+
+### Test Results
+
+| Requests | Concurrency | TTFT1 (ms) | TTFT2 (ms) | TTFT3 (ms) |
+| -------- | ----------- | ---------- | ---------- | ---------- |
+| 100      | 25          | 2322       | 739        | 948        |
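+
+In this run, hitting Dataset A's KV Cache in Node 1's local DRAM reduced TTFT
+from 2322 ms to 739 ms (roughly a 3.1x speedup), and the cross-node hit from
+Instance 2 through the Mooncake pool still reduced it to 948 ms (roughly
+2.4x), which matches the expectation that local DRAM reuse is fastest while
+cross-node reuse remains far cheaper than recomputing the prefill.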