[Attention] Temporarily add back pa for small batch sizes. (#4765)
### What this PR does / why we need it?
This PR temporarily adds back paged attention (pa, i.e. `_npu_paged_attention`) for small-batch-size
scenarios for performance reasons. pa will be removed again once fused infer attention
(fia, i.e. `npu_fused_infer_attention_score`) outperforms it in all scenarios.
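A minimal usage sketch, assuming vLLM's offline `LLM` entry point forwards `additional_config` to the Ascend plugin; the model name is a placeholder, and only the `pa_shape_list` key comes from this PR's diff:

```python
# Hypothetical usage sketch: the model path and list values are placeholders;
# only the "pa_shape_list" key is taken from this PR's diff.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    additional_config={
        # Batch sizes for which _npu_paged_attention is still used;
        # defaults to [1, 2, 3, 4] per the diff below.
        "pa_shape_list": [1, 2, 3, 4],
    },
)
```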
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
@@ -153,6 +153,13 @@ class AscendConfig:
             raise NotImplementedError(
                 "This feature is still in the experiment and will be supported soon."
             )
+        # We find that _npu_paged_attention still performs better than
+        # npu_fused_infer_attention_score in some cases. We allow executing
+        # _npu_paged_attention in these cases. This should be removed once
+        # npu_fused_infer_attention_score performs better in all scenarios.
+        self.pa_shape_list = additional_config.get("pa_shape_list",
+                                                   [1, 2, 3, 4])
+
         kv_cfg = vllm_config.kv_transfer_config
         if kv_cfg is not None and not getattr(kv_cfg, "_engine_id_patched",
                                               False):
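For illustration, a hedged sketch of the dispatch decision that `pa_shape_list` implies; `select_attention_impl` and both runner callables are hypothetical names for this sketch, not vllm-ascend APIs:

```python
# Illustrative sketch only: select_attention_impl, run_paged_attention, and
# run_fused_infer_attention are hypothetical names. It mirrors the intent of
# pa_shape_list: keep _npu_paged_attention for the small batch sizes listed
# there, and use npu_fused_infer_attention_score otherwise.
from typing import Callable, List


def select_attention_impl(
    batch_size: int,
    pa_shape_list: List[int],
    run_paged_attention: Callable[[], None],
    run_fused_infer_attention: Callable[[], None],
) -> Callable[[], None]:
    """Pick the attention kernel for the current decode batch size."""
    if batch_size in pa_shape_list:
        # Small batches: paged attention still performs better per this commit.
        return run_paged_attention
    # Everything else goes through the fused infer attention path.
    return run_fused_infer_attention
```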