[Doc] add new doc for mooncake: PD-Colocated cross-node multi-instance validation of Mooncake's KV Cache reuse and performance. (#5415)

### What this PR does / why we need it?
This documentation provides a comprehensive technical guide for
deploying **vLLM-Ascend** using a **Prefill-Decode (PD) colocated
architecture** integrated with **Mooncake**, a high-performance
distributed KV Cache transfer engine. As Large Language Model (LLM)
serving scales, managing KV Cache efficiently across distributed nodes
is essential for reducing latency and optimizing hardware utilization.

The tutorial focuses on a multi-instance setup using Huawei **Atlas 800T
A2** nodes. By leveraging Mooncake’s distributed memory pooling, vLLM
instances can achieve seamless **cross-node KV Cache reuse**. This
capability allows an instance to retrieve precomputed cache from a
remote node's DRAM via high-speed **RoCE** networks, effectively
bypassing redundant prefill computations.
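
The reuse idea above can be sketched as a toy model. This is only an
illustration under stated assumptions, not Mooncake's or vLLM-Ascend's
actual API: `ToySharedPool`, `ToyInstance`, `block_keys`, and
`BLOCK_SIZE` are hypothetical names, the pool is an in-process dict
rather than remote DRAM reached over RoCE, and the stored "KV" values are
placeholder strings.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

BLOCK_SIZE = 4  # tokens per cache block (illustrative value, not a real default)


def block_keys(tokens: List[int]) -> List[str]:
    """Return one key per full block, hashing the whole prefix up to that block."""
    keys = []
    for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
        prefix = ",".join(map(str, tokens[:end])).encode()
        keys.append(hashlib.sha256(prefix).hexdigest())
    return keys


class ToySharedPool:
    """Hypothetical stand-in for a distributed KV pool (a plain dict, no networking)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)

    def put(self, key: str, value: str) -> None:
        self._store[key] = value


class ToyInstance:
    """Hypothetical serving instance that consults the shared pool before prefill."""

    def __init__(self, name: str, pool: ToySharedPool) -> None:
        self.name = name
        self.pool = pool

    def prefill(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (blocks_reused, blocks_computed) for this prompt."""
        reused = computed = 0
        prefix_hit = True  # reuse is only valid for a contiguous leading run of hits
        for key in block_keys(tokens):
            if prefix_hit and self.pool.get(key) is not None:
                reused += 1
            else:
                prefix_hit = False
                self.pool.put(key, f"kv-from-{self.name}")  # placeholder payload
                computed += 1
        return reused, computed


if __name__ == "__main__":
    pool = ToySharedPool()
    node_a = ToyInstance("node-a", pool)
    node_b = ToyInstance("node-b", pool)
    prompt = list(range(10))  # two full blocks; the trailing 2 tokens are not cached
    print(node_a.prefill(prompt))  # first instance computes every block
    print(node_b.prefill(prompt))  # second instance reuses what the first stored
```

In this sketch the second instance skips prefill for every block whose
prefix hash is already in the pool, which is the effect the tutorial
validates across nodes.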

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: release/v0.13.0
- vLLM main: 0bfd7484fd

---------

Signed-off-by: zhangmuzhibangde <1037640609@qq.com>
Signed-off-by: zhangmuzhi_yuwan <1037640609@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
zhangmuzhi_yuwan
2026-01-05 14:19:57 +08:00
committed by GitHub
parent 549be94397
commit 6c1a685b30
2 changed files with 337 additions and 0 deletions

@@ -20,6 +20,7 @@ DeepSeek-V3.1.md
DeepSeek-V3.2.md
DeepSeek-R1.md
Kimi-K2-Thinking
pd_colocated_mooncake_multi_instance
pd_disaggregation_mooncake_single_node
pd_disaggregation_mooncake_multi_node
long_sequence_context_parallel_single_node