[Attention] Temporarily add back pa for small batch sizes. (#4765)
### What this PR does / why we need it?
This PR temporarily adds back paged attention (pa, i.e. `_npu_paged_attention`) for small-batch-size
scenarios for performance reasons. pa will be removed again once fused infer attention
(fia, i.e. `npu_fused_infer_attention_score`) outperforms it in all scenarios.
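A minimal usage sketch, assuming vLLM's offline `LLM` entry point forwards `additional_config` to the Ascend plugin; the model name is a placeholder, and only the `pa_shape_list` key comes from this PR's diff:

```python
# Hypothetical usage sketch: the model path and list values are placeholders;
# only the "pa_shape_list" key is taken from this PR's diff.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    additional_config={
        # Batch sizes for which _npu_paged_attention is still used;
        # defaults to [1, 2, 3, 4] per the diff below.
        "pa_shape_list": [1, 2, 3, 4],
    },
)
```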
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
@@ -153,6 +153,13 @@ class AscendConfig:
             raise NotImplementedError(
                 "This feature is still in the experiment and will be supported soon."
             )
+        # We find that _npu_paged_attention still performs better than
+        # npu_fused_infer_attention_score in some cases. We allow executing
+        # _npu_paged_attention in these cases. This should be removed once
+        # npu_fused_infer_attention_score performs better in all scenarios.
+        self.pa_shape_list = additional_config.get("pa_shape_list",
+                                                   [1, 2, 3, 4])
+
         kv_cfg = vllm_config.kv_transfer_config
         if kv_cfg is not None and not getattr(kv_cfg, "_engine_id_patched",
                                               False):
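For illustration, a hedged sketch of the dispatch decision that `pa_shape_list` implies; `select_attention_impl` and both runner callables are hypothetical names for this sketch, not vllm-ascend APIs:

```python
# Illustrative sketch only: select_attention_impl, run_paged_attention, and
# run_fused_infer_attention are hypothetical names. It mirrors the intent of
# pa_shape_list: keep _npu_paged_attention for the small batch sizes listed
# there, and use npu_fused_infer_attention_score otherwise.
from typing import Callable, List


def select_attention_impl(
    batch_size: int,
    pa_shape_list: List[int],
    run_paged_attention: Callable[[], None],
    run_fused_infer_attention: Callable[[], None],
) -> Callable[[], None]:
    """Pick the attention kernel for the current decode batch size."""
    if batch_size in pa_shape_list:
        # Small batches: paged attention still performs better per this commit.
        return run_paged_attention
    # Everything else goes through the fused infer attention path.
    return run_fused_infer_attention
```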