[v0.11.0][Perf] Eliminating the zerolike operator through patch (#3632)

### What this PR does / why we need it?

A zero-like operator runs before the attention operation in each decoding stage. Analysis shows that this operator can be eliminated, so this PR removes it to improve performance.

---------

Signed-off-by: ZYang6263 <zy626375@gmail.com>
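For readers unfamiliar with the two operations involved, the sketch below contrasts allocating a fresh zero tensor with zeroing an existing buffer in place. It is not taken from the PR: `profiling_return` is a hypothetical helper that only mirrors the `return output.fill_(0)` line in the `sfa_v1.py` hunk, and the assumption that the caller's buffer may arrive uninitialized is an assumption that the PR text does not state.

```python
import torch


def profiling_return(output: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: zero the caller-provided buffer in place and
    # return that same tensor object, as `output.fill_(0)` does.
    return output.fill_(0)


# Assumption: the output buffer is preallocated but not yet initialized.
buf = torch.empty(4, 8)
result = profiling_return(buf)

# In-place fill: same tensor object, same storage, every element is zero.
same_object = result is buf
same_storage = result.data_ptr() == buf.data_ptr()
all_zero = bool((buf == 0).all())

# zeros_like, by contrast, allocates a separate tensor with its own storage.
fresh = torch.zeros_like(buf)
distinct_storage = fresh.data_ptr() != buf.data_ptr()
```

The in-place form reuses the buffer the caller already owns, whereas `torch.zeros_like` creates a new tensor each time it is called.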
2025-10-23 14:49:28 +08:00
parent 74903af460
commit 6975d46627
9 changed files with 111 additions and 6 deletions
--- a/vllm_ascend/attention/sfa_v1.py
+++ b/vllm_ascend/attention/sfa_v1.py
@@ -808,7 +808,7 @@ class AscendSFAImpl(MLAAttentionImpl):
        assert output is not None, "Output tensor must be provided."
        if attn_metadata is None:
            # Profiling run.
-            return output
+            return output.fill_(0)
        num_actual_tokens = attn_metadata.num_actual_tokens
        assert attn_metadata.num_decodes is not None and \
        attn_metadata.num_prefills is not None and \