[Perf][PCP][DCP] add multi-stream for GQA to enable computation-communication overlap (#5382)
### What this PR does / why we need it?
This PR adds multi-stream execution for grouped-query attention (GQA) to enable
computation-communication overlap. For chunked prefill, this reduces TTFT
(time to first token) by approximately 4%.
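The overlap idea can be sketched as follows: attention compute is issued on a side stream while a collective (e.g. the KV all-gather used by context parallelism) proceeds on the default stream, with stream events enforcing ordering. This is a minimal illustration, not the PR's actual code; `gqa_with_overlap`, `comm_fn`, and the plain-softmax attention are hypothetical stand-ins, and the sketch falls back to sequential execution when no CUDA device is present (the real change targets Ascend NPU streams).

```python
import torch


def gqa_with_overlap(q, k, v, comm_fn):
    """Overlap attention compute with a communication op.

    Illustrative only: `comm_fn` stands in for a collective such as an
    all-gather of KV shards; the attention here is plain scaled-dot-product
    on already-grouped heads, not vllm-ascend's kernel.
    """
    def attention():
        scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)
        return torch.softmax(scores, dim=-1) @ v

    if torch.cuda.is_available():
        side = torch.cuda.Stream()
        # Side stream must see everything already queued on the main stream.
        side.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(side):
            attn = attention()            # compute on the side stream...
        comm_out = comm_fn()              # ...while comm runs on the default stream
        # Re-join before anyone consumes `attn` on the default stream.
        torch.cuda.current_stream().wait_stream(side)
        return attn, comm_out

    # CPU fallback: no streams, run the two phases sequentially.
    return attention(), comm_fn()
```

On a device, the two phases occupy different streams and can execute concurrently; on CPU the function simply preserves the same results, which makes the pattern easy to unit-test.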
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: release/v0.13.0
- vLLM main: bc0a5a0c08
---------
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
```diff
@@ -1771,9 +1771,6 @@ class NPUModelRunner(GPUModelRunner):
                     kv_cache_group_id].get_device_tensor()
             slot_mapping = self.input_batch.block_table[
                 kv_cache_group_id].slot_mapping
             self.cp_kv_recover_idx = torch.zeros(self.max_num_tokens,
                                                  dtype=torch.int32,
                                                  device=self.device)
             long_seq_metadata = None if self.pcp_size * self.dcp_size == 1 else self.pcp_manager.generate_pcp_metadata(
                 num_tokens, self.query_lens, self.attn_mask,
                 self.input_batch)
```