[Doc] add new doc for mooncake: PD-Colocated cross-node multi-instance validation of Mooncake's KV Cache reuse and performance. (#5415)

### What this PR does / why we need it?
This PR adds a tutorial for deploying **vLLM-Ascend** in a
**Prefill-Decode (PD) colocated architecture** integrated with
**Mooncake**, a high-performance distributed KV Cache transfer engine.
As Large Language Model (LLM) serving scales out, managing KV Cache
efficiently across distributed nodes is essential for reducing latency
and improving hardware utilization.

The tutorial covers a multi-instance setup on Huawei **Atlas 800T A2**
nodes. With Mooncake’s distributed memory pooling, vLLM instances can
reuse **KV Cache across nodes**: an instance retrieves precomputed cache
from a remote node's DRAM over a high-speed **RoCE** network and skips
the redundant prefill computation.
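
The reuse idea described above can be illustrated with a minimal,
self-contained sketch. It is a toy model, not Mooncake's or vLLM's API:
`ToyInstance`, `block_keys`, and `BLOCK_SIZE` are hypothetical names, and a
plain Python dict stands in for the distributed memory pool. It only shows
why a second instance can skip prefill for blocks that another instance has
already stored.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block (illustrative value, an assumption)


def block_keys(tokens):
    """Return one chained hash key per full block of the token sequence.

    Each key depends on all preceding blocks, so a key identifies a prefix.
    """
    keys, prev = [], ""
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        block = tokens[i:i + BLOCK_SIZE]
        prev = hashlib.sha256((prev + repr(block)).encode()).hexdigest()
        keys.append(prev)
    return keys


class ToyInstance:
    """Hypothetical serving instance sharing one pool with its peers."""

    def __init__(self, name, pool):
        self.name = name
        self.pool = pool  # shared dict standing in for the distributed pool

    def serve(self, tokens):
        """Return (blocks reused from the pool, blocks computed locally)."""
        reused = computed = 0
        for key in block_keys(tokens):
            if key in self.pool:
                reused += 1  # cache hit: prefill for this block is skipped
            else:
                self.pool[key] = f"kv-from-{self.name}"  # placeholder payload
                computed += 1
        return reused, computed


pool = {}
node_a = ToyInstance("node-a", pool)
node_b = ToyInstance("node-b", pool)
prompt = list(range(10))  # 10 tokens -> 2 full blocks

first = node_a.serve(prompt)   # node-a computes and publishes both blocks
second = node_b.serve(prompt)  # node-b finds both blocks in the pool
print(first, second)
```

In a real deployment the pool lookup would be a network transfer between
nodes rather than a dict access, so the benefit depends on transfer cost
being lower than recomputing prefill.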

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: release/v0.13.0
- vLLM main: 0bfd7484fd

---------

Signed-off-by: zhangmuzhibangde <1037640609@qq.com>
Signed-off-by: zhangmuzhi_yuwan <1037640609@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>

zhangmuzhi_yuwan, 2026-01-05 14:19:57 +08:00, committed by GitHub

parent 549be94397

commit 6c1a685b30

2 changed files with 337 additions and 0 deletions

docs/source/tutorials/index.md (1 line changed)

@@ -20,6 +20,7 @@ DeepSeek-V3.1.md
 DeepSeek-V3.2.md
 DeepSeek-R1.md
 Kimi-K2-Thinking
+pd_colocated_mooncake_multi_instance
 pd_disaggregation_mooncake_single_node
 pd_disaggregation_mooncake_multi_node
 long_sequence_context_parallel_single_node
