[Feature] Support for cross-attention and whisper model (#5592)

### What this PR does / why we need it?
Fixes https://github.com/vllm-project/vllm-ascend/issues/2262 by adding:

- support for cross-attention when the model is encoder-decoder
- support for the Whisper model (see the usage sketch below)
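
A minimal offline-inference sketch for transcribing audio with Whisper, adapted from vLLM's upstream audio example; the demo asset, `max_model_len`, and sampling values are illustrative and not taken from this PR:

```python
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

# Illustrative values; not verified against this exact branch.
llm = LLM(model="openai/whisper-large-v3", max_model_len=448)

prompt = {
    "prompt": "<|startoftranscript|>",
    "multi_modal_data": {
        # Bundled demo clip from vLLM's asset helpers.
        "audio": AudioAsset("mary_had_lamb").audio_and_sample_rate,
    },
}

outputs = llm.generate(prompt, SamplingParams(temperature=0, max_tokens=200))
print(outputs[0].outputs[0].text)
```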

- vLLM version: v0.13.0
- vLLM main: 7157596103

Signed-off-by: gh924 <guihao2@huawei.com>
Co-authored-by: Aoxuan Chen <43376869+chenaoxuan@users.noreply.github.com>

```diff
@@ -238,6 +238,14 @@ class NPUPlatform(Platform):
         if compilation_config.cudagraph_mode == CUDAGraphMode.FULL_AND_PIECEWISE:
             compilation_config.cudagraph_mode = CUDAGraphMode.PIECEWISE
+        # encoder-decoder models currently only support piecewise mode
+        if model_config and model_config.is_encoder_decoder:
+            if compilation_config.cudagraph_mode == CUDAGraphMode.FULL_DECODE_ONLY:
+                logger.warning(
+                    "encoder-decoder models do not support FULL_DECODE_ONLY; "
+                    "falling back to PIECEWISE"
+                )
+                compilation_config.cudagraph_mode = CUDAGraphMode.PIECEWISE
         # get custom compile backend for graph fusion
         compilation_config.oot_compiler = cls.get_compile_backend()
```
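
For context, a minimal sketch of how a user would hit this fallback path, assuming vLLM's standard `compilation_config` engine argument; the model name and config values are illustrative, not taken from this PR:

```python
from vllm import LLM

# Illustrative sketch: requesting full-graph capture for an
# encoder-decoder model on the Ascend platform. With this change, the
# platform hook above downgrades FULL_DECODE_ONLY to PIECEWISE and
# logs the warning instead of failing during graph capture.
llm = LLM(
    model="openai/whisper-large-v3",  # any encoder-decoder model
    compilation_config={"cudagraph_mode": "FULL_DECODE_ONLY"},
)
```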