[Cherry-pick][0.11.0] Adapted to torch_npu.npu_fused_infer_attention_score (#4202)
### What this PR does / why we need it?
Fixes a compatibility bug with `torch_npu.npu_fused_infer_attention_score`, described in https://github.com/vllm-project/vllm-ascend/issues/4020. @momo609 suggested this solution.

cherry-pick: https://github.com/vllm-project/vllm-ascend/pull/4025

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with newly added and existing tests.

Signed-off-by: Icey <1790571317@qq.com>
@@ -115,7 +115,7 @@ class AscendAttentionBackend(AttentionBackend):

     @staticmethod
     def get_supported_block_size() -> list[int]:
-        return [64]
+        return [128]


 class AscendAttentionState(Enum):
@@ -51,7 +51,7 @@ def verify_and_update_config(cls, vllm_config) -> None:
            block_size=model_config.max_model_len,
        ).page_size_bytes

-        block_alignment_bytes = 64
+        block_alignment_bytes = 128

        # some attention backends (e.g. FA) only support setting
        # block size to multiple of 16, so let's suggest a value
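The effect of the alignment change can be sketched with a small round-up helper. This is a hypothetical illustration, not the vLLM-Ascend source: `align_block_size` and its arguments are made up here to show why a 64-token block no longer satisfies the 128-token alignment that `torch_npu.npu_fused_infer_attention_score` is assumed to require.

```python
def align_block_size(num_tokens: int, alignment: int = 128) -> int:
    """Round ``num_tokens`` up to the nearest multiple of ``alignment``.

    Illustrative only: mirrors the idea behind raising the block
    alignment from 64 to 128 in this PR.
    """
    return ((num_tokens + alignment - 1) // alignment) * alignment


print(align_block_size(64))   # a 64-token block must grow to 128
print(align_block_size(200))  # rounds up to the next 128 multiple
print(align_block_size(256))  # already aligned, unchanged
```

With `alignment=64` (the pre-PR value) a 64-token block would have passed unchanged; with 128 it is doubled, which is consistent with `get_supported_block_size()` now returning `[128]`.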