[v0.18.0][Feature] support qkv_rmsnorm_mrope for qwen3vl (#7852)
### What this PR does / why we need it?
Qwen3-VL full attention supports enabling the split_qkv_rmsnorm_mrope fusion operator.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- [x] Run the Qwen3-VL dense model with the fusion operator and verify correct output
- [x] Run the Qwen3-VL MoE model with the fusion operator and verify correct output

---------

Signed-off-by: jiangmengyu18 <451528648@qq.com>
Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
@@ -701,4 +701,23 @@
# Let vLLM support triton ops dispatch.
# Future Plan:
# Remove this patch when vLLM supports the dispatch function.
#
# ** 27. File: worker/patch_qwen3vl.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.qwen3.Qwen3Attention.forward` and
#    `vllm.model_executor.models.qwen3_moe.Qwen3MoeAttention.forward`
# Why:
# Support the triton_split_qkv_rmsnorm_mrope fused kernel for Qwen3Attention and Qwen3MoeAttention.
# How:
# Override the forward method with the triton_split_qkv_rmsnorm_mrope fused kernel when using mrope.
# Future Plan:
# Remove this patch when vllm-ascend supports pattern matching for this fused kernel.
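For context, here is a minimal, self-contained sketch of the override-with-fallback pattern entry 27 describes. The toy class, attribute names, and result strings are illustrative assumptions, not the vllm-ascend implementation; the real patch replaces `Qwen3Attention.forward` and `Qwen3MoeAttention.forward` with a path that calls the fused kernel.

```python
class ToyAttention:
    """Stand-in for Qwen3Attention / Qwen3MoeAttention, for illustration."""

    use_mrope = True  # hypothetical flag standing in for the mrope check

    def forward(self, hidden_states):
        return f"separate qkv-split / rmsnorm / rope ops on {hidden_states}"


# Keep a handle to the original method so the non-mrope path still works.
_orig_forward = ToyAttention.forward


def fused_forward(self, hidden_states):
    # Take the fused path only when mrope is in use; otherwise fall back
    # to the stock implementation, mirroring the patch's condition.
    if self.use_mrope:
        return f"fused split_qkv+rmsnorm+mrope on {hidden_states}"
    return _orig_forward(self, hidden_states)


# The override itself -- what the patch applies to the upstream classes.
ToyAttention.forward = fused_forward

print(ToyAttention().forward("hidden"))
# -> fused split_qkv+rmsnorm+mrope on hidden
```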
# ** 28. File: worker/patch_qwen3vl.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.qwen3_vl.Qwen3VLForConditionalGeneration._get_deepstack_input_embeds`
# Why:
# Support flash comm v1 for Qwen3-VL.
# How:
# Override the _get_deepstack_input_embeds method with the flash comm v1 implementation.
# Future Plan:
# Remove this patch when https://github.com/vllm-project/vllm-ascend/issues/5712 is completed.
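Entry 28 uses the same method-override mechanics; a hedged sketch is below. Only the patching step is shown: the flash comm v1 body is elided, and the `num_tokens` parameter is an assumption based on the upstream method, not a statement of the actual vllm-ascend code.

```python
# Sketch of the method override described in entry 28 (illustrative only).
from vllm.model_executor.models.qwen3_vl import Qwen3VLForConditionalGeneration


def _get_deepstack_input_embeds_flashcomm_v1(self, num_tokens: int):
    # Flash comm v1 idea: shard the deepstack input embeddings across the
    # tensor-parallel group and gather them collectively, rather than have
    # every rank materialize the full tensor. Body elided in this sketch.
    raise NotImplementedError("illustrative placeholder")


# Swap in the replacement, as worker/patch_qwen3vl.py does.
Qwen3VLForConditionalGeneration._get_deepstack_input_embeds = (
    _get_deepstack_input_embeds_flashcomm_v1
)
```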