[Feature] Support for cross-attention and whisper model (#5592)

### What this PR does / why we need it?
Fixes https://github.com/vllm-project/vllm-ascend/issues/2262 by adding:

- support for cross-attention when the model is encoder-decoder
- support for the Whisper model (see the usage sketch below)
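
A minimal offline-inference sketch for transcribing audio with Whisper, adapted from vLLM's upstream audio example; the demo asset, `max_model_len`, and sampling values are illustrative and not taken from this PR:

```python
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

# Illustrative values; not verified against this exact branch.
llm = LLM(model="openai/whisper-large-v3", max_model_len=448)

prompt = {
    "prompt": "<|startoftranscript|>",
    "multi_modal_data": {
        # Bundled demo clip from vLLM's asset helpers.
        "audio": AudioAsset("mary_had_lamb").audio_and_sample_rate,
    },
}

outputs = llm.generate(prompt, SamplingParams(temperature=0, max_tokens=200))
print(outputs[0].outputs[0].text)
```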

- vLLM version: v0.13.0
- vLLM main: 7157596103

Signed-off-by: gh924 <guihao2@huawei.com>
Co-authored-by: Aoxuan Chen <43376869+chenaoxuan@users.noreply.github.com>

```diff
@@ -238,6 +238,14 @@ class NPUPlatform(Platform):
         if compilation_config.cudagraph_mode == CUDAGraphMode.FULL_AND_PIECEWISE:
             compilation_config.cudagraph_mode = CUDAGraphMode.PIECEWISE
+        # encoder-decoder models currently only support piecewise mode
+        if model_config and model_config.is_encoder_decoder:
+            if compilation_config.cudagraph_mode == CUDAGraphMode.FULL_DECODE_ONLY:
+                logger.warning(
+                    "encoder-decoder models do not support FULL_DECODE_ONLY; "
+                    "falling back to PIECEWISE"
+                )
+                compilation_config.cudagraph_mode = CUDAGraphMode.PIECEWISE
         # get custom compile backend for graph fusion
         compilation_config.oot_compiler = cls.get_compile_backend()
```
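
For context, a minimal sketch of how a user would hit this fallback path, assuming vLLM's standard `compilation_config` engine argument; the model name and config values are illustrative, not taken from this PR:

```python
from vllm import LLM

# Illustrative sketch: requesting full-graph capture for an
# encoder-decoder model on the Ascend platform. With this change, the
# platform hook above downgrades FULL_DECODE_ONLY to PIECEWISE and
# logs the warning instead of failing during graph capture.
llm = LLM(
    model="openai/whisper-large-v3",  # any encoder-decoder model
    compilation_config={"cudagraph_mode": "FULL_DECODE_ONLY"},
)
```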