[BugFix] [dcp] Fix GQA model error when both DP and DCP are enabled (#7012)
### What this PR does / why we need it?
For GQA models, when both DP and DCP are enabled (with PCP disabled), the key and value tensors were not captured correctly. This PR fixes that by bounding the key and value slices at `attn_metadata.num_actual_tokens_pcp_padded`.

Signed-off-by: dsxsteven <dsxsteven@sina.com>
@@ -938,8 +938,8 @@ class AscendAttentionCPImpl(AscendAttentionBackendImpl):
     prefill_query = query[self.pcp_size * num_decode_tokens :]
 else:
     prefill_query = query[num_decode_tokens:num_actual_tokens_pcp_padded].contiguous()
-key = key[self.pcp_size * num_decode_tokens :].contiguous()
-value = value[self.pcp_size * num_decode_tokens :].contiguous()
+key = key[self.pcp_size * num_decode_tokens : attn_metadata.num_actual_tokens_pcp_padded].contiguous()
+value = value[self.pcp_size * num_decode_tokens : attn_metadata.num_actual_tokens_pcp_padded].contiguous()
 
 if has_chunked_context:
     # all_gather q for chunked prefill // overlap the computation inner current chunk
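The effect of the added slice bound in the change above can be sketched with plain Python lists standing in for tensors. This is only an illustration under stated assumptions: the sizes, the row labels, and the premise that the buffer holds extra rows beyond `num_actual_tokens_pcp_padded` are hypothetical and not taken from the PR.

```python
# Hypothetical sizes, chosen only for illustration (not from the PR).
pcp_size = 1                      # PCP disabled
num_decode_tokens = 2
num_actual_tokens_pcp_padded = 6  # rows that hold real tokens
# Assumption: the buffer carries 2 extra trailing rows beyond the real tokens.
query = ["d0", "d1", "p0", "p1", "p2", "p3", "extra", "extra"]
key = list(query)                 # stand-in for the key tensor, one label per row

# With pcp_size == 1 the query takes the bounded `else` branch from the diff.
prefill_query = query[num_decode_tokens:num_actual_tokens_pcp_padded]

# Old slice: open-ended, so any trailing rows are kept.
old_key = key[pcp_size * num_decode_tokens:]

# New slice: bounded at num_actual_tokens_pcp_padded, as in the diff.
new_key = key[pcp_size * num_decode_tokens:num_actual_tokens_pcp_padded]

print(len(prefill_query), len(old_key), len(new_key))
```

In this sketch the open-ended slice leaves the key with more rows than the prefill query, while the bounded slice keeps the two the same length. The same reasoning applies to `value`.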