### What this PR does / why we need it?
This PR adds a technical guide for
deploying **vLLM-Ascend** using a **Prefill-Decode (PD) colocated
architecture** integrated with **Mooncake**, a high-performance
distributed KV Cache transfer engine. As Large Language Model (LLM)
serving scales, managing KV Cache efficiently across distributed nodes
is essential for reducing latency and optimizing hardware utilization.
The tutorial focuses on a multi-instance setup using Huawei **Atlas 800T
A2** nodes. By leveraging Mooncake’s distributed memory pooling, vLLM
instances can achieve seamless **cross-node KV Cache reuse**. This
capability allows an instance to retrieve precomputed cache from a
remote node's DRAM via high-speed **RoCE** networks, effectively
bypassing redundant prefill computations.
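As a rough sketch of what the guide covers, each colocated instance is launched with a KV transfer config pointing at the Mooncake store. The connector name, role value, and model here are illustrative assumptions; the exact JSON fields and per-node settings are in the tutorial itself.

```shell
# Hypothetical launch sketch (names are assumptions, not the tutorial's
# exact values): each PD-colocated vLLM instance registers with the
# shared Mooncake KV store so prefix caches computed on one node can be
# fetched by another over RoCE instead of being recomputed.
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --kv-transfer-config \
  '{"kv_connector": "MooncakeStoreConnector", "kv_role": "kv_both"}'
```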
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: release/v0.13.0
- vLLM main: 0bfd7484fd
---------
Signed-off-by: zhangmuzhibangde <1037640609@qq.com>
Signed-off-by: zhangmuzhi_yuwan <1037640609@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Tutorials
:::{toctree}
:caption: Deployment
:maxdepth: 1
Qwen2.5-Omni.md
Qwen2.5-7B
Qwen3-Dense
Qwen-VL-Dense.md
Qwen3-30B-A3B.md
Qwen3-235B-A22B.md
Qwen3-VL-235B-A22B-Instruct.md
Qwen3-Coder-30B-A3B
Qwen3_embedding
Qwen3_reranker
Qwen3-8B-W4A8
Qwen3-32B-W4A4
Qwen3-Next
DeepSeek-V3.1.md
DeepSeek-V3.2.md
DeepSeek-R1.md
Kimi-K2-Thinking
pd_colocated_mooncake_multi_instance
pd_disaggregation_mooncake_single_node
pd_disaggregation_mooncake_multi_node
long_sequence_context_parallel_single_node
long_sequence_context_parallel_multi_node
ray
310p
:::