[Core]Append padding logic for Attention (#3256)

### What this PR does / why we need it?

This PR adds padding logic for `seq_lens` and `block_tables` when running in the full decode scenario. Before this PR, the number of input tokens after padding could exceed the number covered by the corresponding `seq_lens`. For example, in the full decode scenario:

```
input_ids : [1, 3, 0, 0]
seq_lens: [2, 1]
query_start_loc: [0, 1, 2]
```

Here, `input_ids` is padded with 2 tokens while `seq_lens` and `query_start_loc` are not. This mismatch between `input_ids` and `seq_lens`/`query_start_loc` may cause bugs. This PR changes the metadata to:

```
input_ids : [1, 3, 0, 0]
seq_lens: [2, 1, 1, 1]
query_start_loc: [0, 1, 2, 3, 4]
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-10-17 21:56:01 +08:00
parent b154a8e22c
commit 9547d6f0d9
5 changed files with 30 additions and 2 deletions
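
The padding described in the commit message can be illustrated with a minimal sketch. This is not the code from the PR: `pad_decode_metadata` is a hypothetical helper that works on plain Python lists, and it assumes a full decode batch where every request contributes exactly one query token, so each padding token is treated as a dummy request with `seq_len` 1. Padding of `block_tables` is not covered here.

```python
from typing import List, Tuple


def pad_decode_metadata(
    seq_lens: List[int],
    query_start_loc: List[int],
    num_input_tokens: int,
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: pad decode metadata to match padded input_ids.

    Assumes full decode, i.e. one query token per request.
    """
    num_actual_tokens = query_start_loc[-1]
    if num_actual_tokens != len(seq_lens):
        raise ValueError("expected exactly one query token per request")

    num_pad = num_input_tokens - num_actual_tokens
    if num_pad < 0:
        raise ValueError("num_input_tokens is smaller than the actual token count")

    # Each padding token becomes a dummy request with sequence length 1.
    padded_seq_lens = list(seq_lens) + [1] * num_pad

    # Extend the cumulative offsets by one token per dummy request.
    padded_query_start_loc = list(query_start_loc) + [
        num_actual_tokens + i for i in range(1, num_pad + 1)
    ]
    return padded_seq_lens, padded_query_start_loc


# Values taken from the example in the commit message.
seq_lens, query_start_loc = pad_decode_metadata([2, 1], [0, 1, 2], 4)
print(seq_lens)         # [2, 1, 1, 1]
print(query_start_loc)  # [0, 1, 2, 3, 4]
```

After padding, `len(seq_lens)` and `query_start_loc[-1]` both equal the padded length of `input_ids`, which is the invariant the new `num_input_tokens` field makes it possible to enforce.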
--- a/vllm_ascend/attention/utils.py
+++ b/vllm_ascend/attention/utils.py
@@ -64,6 +64,10 @@ class AscendCommonAttentionMetadata:

    graph_pad_size: int = -1

+    # num_input_tokens refers to total number of tokens including
+    # padding tokens. It is used to handle some padding operations.
+    num_input_tokens: int = 0
+
    # NOTE: This is a temporary solution for rotary embedding in MLA
    cos: torch.Tensor = None
    sin: torch.Tensor = None