Files
xc-llm-ascend/vllm_ascend
lidenghui1110 0563106477 [Feature] mooncake connector support GQA transport (#2947)
### What this PR does / why we need it?
The previous implementation of the Mooncake connector only supported
scenarios where the Tensor Parallel sizes for the Prefill and Decode
phases were the same for MLA and GQA/MHA.

For heterogeneous TP scenarios, a single rank on a decode node needs to
pull the KV cache from multiple ranks on the prefill nodes and then
merge them (only support prefill TP >= decode TP now). During this
merge, a transpose operation is required because the layouts of the KV
caches are different. To minimize transpose overhead, we use the
npu_paged_cache_load operation to extract the blocks corresponding to
the request from the KV cache. After performing the transpose, we use
_npu_reshape_and_cache to write the blocks back to their original
positions.

This process is illustrated in the diagram below.

b means block_size, this diagram illustrates transpose kv cache layout
for one block. In the implementation, we transpose kv cache by layer for
one request.

<img width="1464" height="916" alt="image"
src="https://github.com/user-attachments/assets/09d96a98-e41c-4733-9535-05544163081a"
/>

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested

- vLLM version: v0.11.0
---------

Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Signed-off-by: zzy-ContiLearn <1831242919@qq.com>
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: Kurumi5210 <jaychou1620@gmail.com>
Co-authored-by: zzy-ContiLearn <1831242919@qq.com>
Co-authored-by: chenxiao <cx02308786@antgroup.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: zzhx1 <zzh_201018@outlook.com>
2025-10-13 15:48:37 +08:00
..
2025-10-09 10:28:38 +08:00
2025-10-09 19:22:46 +08:00
2025-09-30 03:25:58 +08:00