Add patch_qwen3_5 for triton ops fused_recurrent_gated_delta_rule (#7109)

### What this PR does / why we need it?

The op `torch_npu.npu_recurrent_gated_delta_rule` currently does not
support `ssm_state` inputs in float32 format, so we temporarily retain
the Triton-based `_forward_core` implementation for Qwen3_5.
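The patch follows the usual monkeypatch pattern: replace `_forward_core` on the subclass only, so Qwen3Next keeps its existing path. A minimal sketch of that pattern, with stand-in classes and return values rather than vLLM's real symbols or kernels:

```python
# Stand-ins for the real vLLM classes; the names mirror the PR description
# but the bodies are illustrative only.

class Qwen3NextGatedDeltaNet:
    def _forward_core(self, ssm_state):
        # Placeholder for the AscendC kernel path, which (per this PR)
        # rejects float32 ssm_state inputs.
        return "ascendc"


class Qwen3_5GatedDeltaNet(Qwen3NextGatedDeltaNet):
    # Inherits _forward_core from Qwen3NextGatedDeltaNet.
    pass


def _forward_core_triton(self, ssm_state):
    # Placeholder for the Triton path (e.g. a call to
    # fused_recurrent_gated_delta_rule), which handles float32 ssm_state.
    return "triton"


# Patch only the subclass; the parent keeps the AscendC implementation.
Qwen3_5GatedDeltaNet._forward_core = _forward_core_triton
```

With the patch applied, `Qwen3NextGatedDeltaNet()._forward_core(...)` still takes the original path while `Qwen3_5GatedDeltaNet()._forward_core(...)` takes the Triton one.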

---------

Signed-off-by: pppeng <zepengliu912@qq.com>
Signed-off-by: pppeng <60355449+ppppeng@users.noreply.github.com>
Author: pppeng
Date: 2026-03-10 23:28:58 +08:00
Committed by: GitHub
Parent: a78a00e0b1
Commit: 0f289fa2a8
4 changed files with 275 additions and 0 deletions


@@ -332,3 +332,14 @@
# https://github.com/vllm-project/vllm/pull/36225
# Future Plan:
# Remove this patch when vLLM merges the PR.
#
# ** 17. File: worker/patch_qwen3_5.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.qwen3_5.Qwen3_5GatedDeltaNet._forward_core`
# Why:
# The class Qwen3_5GatedDeltaNet reuses the `_forward_core` method of Qwen3NextGatedDeltaNet,
# but the AscendC ops used by Qwen3NextGatedDeltaNet do not support `ssm_state` in float32 format.
# How:
# Patch Qwen3_5GatedDeltaNet._forward_core to use Triton ops such as `fused_recurrent_gated_delta_rule`.
# Future Plan:
# Remove this patch when all ops in _forward_core support both Qwen3_5 and Qwen3Next.