[Feature] Support for cross-attention and whisper model (#5592)
### What this PR does / why we need it?
Fixes https://github.com/vllm-project/vllm-ascend/issues/2262
- Support cross-attention for encoder-decoder models
- Support the Whisper model
- vLLM version: v0.13.0
- vLLM main:
7157596103
Signed-off-by: gh924 <guihao2@huawei.com>
Co-authored-by: Aoxuan Chen <43376869+chenaoxuan@users.noreply.github.com>
```diff
@@ -238,6 +238,14 @@ class NPUPlatform(Platform):
         if compilation_config.cudagraph_mode == CUDAGraphMode.FULL_AND_PIECEWISE:
             compilation_config.cudagraph_mode = CUDAGraphMode.PIECEWISE

+        # encoder-decoder models currently only support piecewise mode
+        if model_config and model_config.is_encoder_decoder is True:
+            if compilation_config.cudagraph_mode == CUDAGraphMode.FULL_DECODE_ONLY:
+                logger.warning(
+                    "encoder-decoder model doesn't support FULL_DECODE_ONLY, fallback to PIECEWISE "
+                )
+                compilation_config.cudagraph_mode = CUDAGraphMode.PIECEWISE
+
         # get custom compile backend for graph fusion
         compilation_config.oot_compiler = cls.get_compile_backend()
```
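The fallback logic in this hunk can be sketched as a small standalone function. This is a minimal illustration only: the `CUDAGraphMode` enum below is a hypothetical stand-in for vLLM's real enum (which may define different members and values), and `resolve_cudagraph_mode` is an invented helper name, not an actual vllm-ascend API.

```python
from enum import Enum


class CUDAGraphMode(Enum):
    # Hypothetical stand-in for vLLM's CUDAGraphMode; member names mirror
    # those referenced in the diff, values are illustrative.
    NONE = 0
    PIECEWISE = 1
    FULL_DECODE_ONLY = 2
    FULL_AND_PIECEWISE = 3


def resolve_cudagraph_mode(mode: CUDAGraphMode,
                           is_encoder_decoder: bool) -> CUDAGraphMode:
    """Sketch of the NPUPlatform fallback: FULL_AND_PIECEWISE always drops
    to PIECEWISE on this platform, and encoder-decoder models additionally
    cannot use FULL_DECODE_ONLY, so that also falls back to PIECEWISE."""
    if mode == CUDAGraphMode.FULL_AND_PIECEWISE:
        mode = CUDAGraphMode.PIECEWISE
    if is_encoder_decoder and mode == CUDAGraphMode.FULL_DECODE_ONLY:
        # Cross-attention in encoder-decoder models (e.g. Whisper) only
        # supports piecewise graph capture per this PR.
        mode = CUDAGraphMode.PIECEWISE
    return mode
```

For example, a Whisper-style encoder-decoder model requesting `FULL_DECODE_ONLY` would end up running with `PIECEWISE`, while a decoder-only model keeps `FULL_DECODE_ONLY` unchanged.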