xc-llm-ascend/docs/source/tutorials/index.md
zhangmuzhi_yuwan 6c1a685b30 [Doc] add new doc for mooncake: PD-Colocated cross-node multi-instance validation of Mooncake's KV Cache reuse and performance. (#5415)
### What this PR does / why we need it?
This document is a technical guide to deploying **vLLM-Ascend** in a
**Prefill-Decode (PD) colocated architecture** integrated with
**Mooncake**, a high-performance distributed KV Cache transfer engine.
As Large Language Model (LLM) serving scales, managing KV Cache
efficiently across distributed nodes is essential for reducing latency
and improving hardware utilization.

The tutorial focuses on a multi-instance setup using Huawei **Atlas 800T
A2** nodes. With Mooncake’s distributed memory pooling, vLLM instances
can achieve **cross-node KV Cache reuse**: an instance retrieves
precomputed cache from a remote node's DRAM over a high-speed **RoCE**
network and skips the redundant prefill computation.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: release/v0.13.0
- vLLM main: 0bfd7484fd

---------

Signed-off-by: zhangmuzhibangde <1037640609@qq.com>
Signed-off-by: zhangmuzhi_yuwan <1037640609@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2026-01-05 14:19:57 +08:00


# Tutorials

:::{toctree}
:caption: Deployment
:maxdepth: 1
Qwen2.5-Omni.md
Qwen2.5-7B
Qwen3-Dense
Qwen-VL-Dense.md
Qwen3-30B-A3B.md
Qwen3-235B-A22B.md
Qwen3-VL-235B-A22B-Instruct.md
Qwen3-Coder-30B-A3B
Qwen3_embedding
Qwen3_reranker
Qwen3-8B-W4A8
Qwen3-32B-W4A4
Qwen3-Next
DeepSeek-V3.1.md
DeepSeek-V3.2.md
DeepSeek-R1.md
Kimi-K2-Thinking
pd_colocated_mooncake_multi_instance
pd_disaggregation_mooncake_single_node
pd_disaggregation_mooncake_multi_node
long_sequence_context_parallel_single_node
long_sequence_context_parallel_multi_node
ray
310p
:::