[Perf][PCP][DCP] add multi-stream for GQA to enable computation-communication overlap (#5382)

### What this PR does / why we need it?
This PR adds multi-stream execution for GQA so that computation and
communication can overlap. For chunked prefill, this reduces TTFT by
approximately 4%.
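
The overlap idea can be illustrated with a minimal, hardware-free sketch. This is not the code in this PR: the PR presumably relies on device streams, while the sketch uses a single Python worker thread as a stand-in for a second stream. The names `all_gather_kv`, `local_attention`, and `run_overlapped` are hypothetical, and the sleeps only simulate communication and kernel latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def all_gather_kv(chunk, delay=0.01):
    """Hypothetical stand-in for a communication step (e.g. gathering KV)."""
    time.sleep(delay)  # simulated communication latency
    return list(chunk)


def local_attention(chunk, delay=0.01):
    """Hypothetical stand-in for the compute step on one gathered chunk."""
    time.sleep(delay)  # simulated kernel time
    return sum(chunk)


def run_overlapped(chunks):
    """Fetch chunk i+1 on a side worker while computing on chunk i."""
    if not chunks:
        return []
    outputs = []
    # Assumption: one worker thread plays the role of a communication stream.
    with ThreadPoolExecutor(max_workers=1) as comm_stream:
        pending = comm_stream.submit(all_gather_kv, chunks[0])
        for i in range(len(chunks)):
            gathered = pending.result()  # wait for this chunk's communication
            if i + 1 < len(chunks):
                # Launch the next communication before starting compute,
                # so the two proceed concurrently.
                pending = comm_stream.submit(all_gather_kv, chunks[i + 1])
            outputs.append(local_attention(gathered))
    return outputs
```

With this structure, the communication for the next chunk runs while the current chunk is being computed, so only the first communication step is fully exposed. How much of that carries over to real hardware depends on the kernels and interconnect involved.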

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: release/v0.13.0
- vLLM main: bc0a5a0c08

---------

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
Qiu
2026-01-04 16:33:18 +08:00
committed by GitHub
parent 37fd48bee5
commit 7c210225a2
5 changed files with 276 additions and 224 deletions

@@ -63,6 +63,7 @@ class AscendMetadataForPrefill:
cp_kv_recover_idx_for_chunk: Optional[list[int]] = None
kv_inverse_idx_for_chunk: Optional[list[int]] = None
batch_chunk_seq_mask: Optional[list[bool]] = None
local_total_toks: Optional[int] = None
""" Prefill Specific Metadata for Ascend"""
pcp_metadata: Optional[AscendPCPMetadata] = None