[Perf][PCP][DCP] add multi-stream for GQA to enable computation-communication overlap (#5382)
### What this PR does / why we need it?
This PR adds multi-stream execution for grouped-query attention (GQA) to enable
computation-communication overlap. For chunked prefill, this reduces TTFT
(time to first token) by approximately 4%.
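The overlap idea can be sketched as follows: attention compute is issued on a side stream while a collective (e.g. the KV all-gather used by context parallelism) proceeds on the default stream, with stream events enforcing ordering. This is a minimal illustration, not the PR's actual code; `gqa_with_overlap`, `comm_fn`, and the plain-softmax attention are hypothetical stand-ins, and the sketch falls back to sequential execution when no CUDA device is present (the real change targets Ascend NPU streams).

```python
import torch


def gqa_with_overlap(q, k, v, comm_fn):
    """Overlap attention compute with a communication op.

    Illustrative only: `comm_fn` stands in for a collective such as an
    all-gather of KV shards; the attention here is plain scaled-dot-product
    on already-grouped heads, not vllm-ascend's kernel.
    """
    def attention():
        scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)
        return torch.softmax(scores, dim=-1) @ v

    if torch.cuda.is_available():
        side = torch.cuda.Stream()
        # Side stream must see everything already queued on the main stream.
        side.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(side):
            attn = attention()            # compute on the side stream...
        comm_out = comm_fn()              # ...while comm runs on the default stream
        # Re-join before anyone consumes `attn` on the default stream.
        torch.cuda.current_stream().wait_stream(side)
        return attn, comm_out

    # CPU fallback: no streams, run the two phases sequentially.
    return attention(), comm_fn()
```

On a device, the two phases occupy different streams and can execute concurrently; on CPU the function simply preserves the same results, which makes the pattern easy to unit-test.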
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: release/v0.13.0
- vLLM main: bc0a5a0c08
---------
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
```diff
@@ -1771,9 +1771,6 @@ class NPUModelRunner(GPUModelRunner):
                     kv_cache_group_id].get_device_tensor()
             slot_mapping = self.input_batch.block_table[
                 kv_cache_group_id].slot_mapping
             self.cp_kv_recover_idx = torch.zeros(self.max_num_tokens,
                                                  dtype=torch.int32,
                                                  device=self.device)
             long_seq_metadata = None if self.pcp_size * self.dcp_size == 1 else self.pcp_manager.generate_pcp_metadata(
                 num_tokens, self.query_lens, self.attn_mask,
                 self.input_batch)
```