[Misc] Remove redundant CP variables now that the FIA operator is enabled for CANN 8.5 (#6013)

### What this PR does / why we need it?

PCP/DCP splits the kv-cache across different cards. With the parameter `cp-kv-cache-interleave-size`, the first `cp-kv-cache-interleave-size` tokens are cached on card 0, the next group on card 1, and so on. However, if a sequence has too few tokens, some cards store no key-value pairs for it, which results in values of 0, corrupted values, and precision issues. Until now, additional operations were needed to avoid this precision problem. Now that the FIA operator is integrated in `mla_cp._forward_decode` and CANN has been updated to 8.5.0, these additional operations can be removed.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Passed all CI with CANN 8.5.0.

- vLLM version: v0.13.0
- vLLM main: 2c24bc6996

Signed-off-by: dsxsteven <dsxsteven@sina.com>
Signed-off-by: dsxsteven <36877507+dsxsteven@users.noreply.github.com>
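
To illustrate the situation the description refers to, here is a minimal sketch of how an interleaved split can leave some cards without any cached tokens. It assumes a simple round-robin assignment of `interleave_size`-token chunks across `cp_size` cards, which is inferred from the description above and is not taken from the vllm-ascend code. The function names `tokens_per_rank` and `empty_ranks` are hypothetical.

```python
from typing import List


def tokens_per_rank(seq_len: int, cp_size: int, interleave_size: int) -> List[int]:
    """Count how many tokens of one sequence land on each card.

    Assumption: chunks of `interleave_size` tokens are assigned to cards
    in round-robin order (chunk 0 -> card 0, chunk 1 -> card 1, ...).
    """
    counts = [0] * cp_size
    for start in range(0, seq_len, interleave_size):
        chunk = min(interleave_size, seq_len - start)  # last chunk may be partial
        rank = (start // interleave_size) % cp_size
        counts[rank] += chunk
    return counts


def empty_ranks(seq_len: int, cp_size: int, interleave_size: int) -> List[int]:
    """Return the cards that hold no kv-cache entries for this sequence."""
    counts = tokens_per_rank(seq_len, cp_size, interleave_size)
    return [rank for rank, n in enumerate(counts) if n == 0]


if __name__ == "__main__":
    # A short sequence: 5 tokens, 4 cards, interleave size 4.
    print(tokens_per_rank(5, 4, 4))  # [4, 1, 0, 0]
    print(empty_ranks(5, 4, 4))      # [2, 3]
```

Under this assumed layout, cards 2 and 3 hold nothing for the 5-token sequence. This is the case that previously needed extra handling, such as the `batch_seq_mask` field removed in the diff below.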
2026-01-23 14:13:12 +08:00
parent 418a43e2a2
commit 8378bc28b0
8 changed files with 78 additions and 57 deletions
--- a/vllm_ascend/attention/mla_v1.py
+++ b/vllm_ascend/attention/mla_v1.py
@@ -134,7 +134,6 @@ class AscendMLADecodeMetadata:
    sin: torch.Tensor = None
    cos: torch.Tensor = None
    cp_seq_len: torch.Tensor = None
-    batch_seq_mask: torch.Tensor = None


@dataclass
@@ -577,7 +576,7 @@ class AscendMLAMetadataBuilder(MLACommonMetadataBuilder[AscendMLAMetadata]):
            self.block_table = self.block_table[:self.graph_pad_size, ...]
        seq_lens_list = self.seq_lens.tolist()

-        cp_seq_len, batch_seq_mask = None, None
+        cp_seq_len = None

        if self.graph_pad_size > num_reqs:
            if self.speculative_config.disable_padded_drafter_batch:
@@ -638,8 +637,7 @@ class AscendMLAMetadataBuilder(MLACommonMetadataBuilder[AscendMLAMetadata]):
            actual_seq_lengths_q=actual_seq_lengths_q,
            sin=sin[:self.num_decode_tokens, ...],
            cos=cos[:self.num_decode_tokens, ...],
-            cp_seq_len=cp_seq_len,
-            batch_seq_mask=batch_seq_mask)
+            cp_seq_len=cp_seq_len)
        return decode_metadata

    def build_for_graph_capture(