[Cherry-pick][0.11.0] Adapted to torch_npu.npu_fused_infer_attention_score (#4202)

### What this PR does / why we need it?
Fixes a compatibility bug with torch_npu.npu_fused_infer_attention_score,
which is described in
https://github.com/vllm-project/vllm-ascend/issues/4020.
The solution was suggested by @momo609.
Cherry-pick of: https://github.com/vllm-project/vllm-ascend/pull/4025

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with newly added and existing tests.

Signed-off-by: Icey <1790571317@qq.com>
This commit is contained in:
Icey
2025-11-17 10:56:23 +08:00
committed by GitHub
parent a7eb42cf0a
commit 378e92a2a2
2 changed files with 2 additions and 2 deletions


```diff
@@ -115,7 +115,7 @@ class AscendAttentionBackend(AttentionBackend):
     @staticmethod
     def get_supported_block_size() -> list[int]:
-        return [64]
+        return [128]

 class AscendAttentionState(Enum):
```


```diff
@@ -51,7 +51,7 @@ def verify_and_update_config(cls, vllm_config) -> None:
         block_size=model_config.max_model_len,
     ).page_size_bytes
-    block_alignment_bytes = 64
+    block_alignment_bytes = 128
     # some attention backends (e.g. FA) only support setting
     # block size to multiple of 16, so let's suggest a value
```
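The change above bumps the alignment from 64 to 128 so that cache-block sizes satisfy the granularity that torch_npu.npu_fused_infer_attention_score expects. As a rough illustration of the idea (this is a hypothetical sketch, not the actual vllm-ascend helper; the function name and signature are assumptions), suggesting an aligned block size amounts to rounding up to the nearest multiple of the alignment:

```python
def suggest_block_size(preferred: int, alignment: int = 128) -> int:
    """Round ``preferred`` up to the nearest multiple of ``alignment``.

    Hypothetical helper illustrating the alignment constraint described in
    this PR: kernels such as npu_fused_infer_attention_score require the
    block size to be a multiple of a fixed granularity (128 here).
    """
    if preferred <= 0:
        raise ValueError("block size must be positive")
    # Ceiling division, then scale back up to the aligned multiple.
    return ((preferred + alignment - 1) // alignment) * alignment
```

For example, a preferred size of 100 would be rounded up to 128, while an already-aligned 256 is returned unchanged.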