[Feature] Optimize Qwen3.5/Qwen3Next GDN prefill by prebuilding chunk metadata (#7487)

### What this PR does / why we need it? This PR optimizes the Qwen3.5 and Qwen3Next GDN prefill path on Ascend by reducing host/device synchronization overhead. The current implementation of the `chunk_gated_delta_rule` path for variable-length sequences prepares chunk metadata during the forward pass. This approach triggers frequent CPU intervention and host/device round-trips. When running prefill-heavy workloads with asynchronous scheduling enabled, these synchronizations result in execution "bubbles" and prefill stalling (stuttering). **Note that this does not cause asynchronous scheduling to fail; rather, it prevents the system from reaching its theoretical throughput due to these unnecessary stalls.** To resolve this, the patch moves metadata preparation out of the hot path: - **Prebuilt Metadata:** All non-speculative varlen chunk metadata for GDN is now prebuilt on the CPU. - **Asynchronous Transfer:** Staging buffers are kept in pinned memory and transferred to the NPU asynchronously. - **Integration:** The prebuilt bundle is attached to GDN attention metadata via `patch_gdn_attn.py` and passed into Triton wrappers. - **Backward Compatibility:** Triton wrappers fall back to the legacy preparation path if no prebuilt metadata is provided. - vLLM version: v0.17.0 - vLLM main: 8b6325758c --------- Signed-off-by: maoxx241 <maomaoyu870@gmail.com>
2026-03-22 23:09:23 +08:00
parent b2e71b7930
commit 9bf9b4b267
13 changed files with 824 additions and 21 deletions
--- a/vllm_ascend/patch/init.py
+++ b/vllm_ascend/patch/init.py
@@ -259,6 +259,20 @@
 #       Remove this patch when bool is supported in 'torch.argsort' func of npu.
 #       Make 'torch.argsort' in `vllm.v1.attention.backends.gdn_attn` be stable.
 #
+# ** 7a. File: worker/patch_gdn_attn.py**
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#   1. `vllm.v1.attention.backends.gdn_attn.GDNAttentionMetadataBuilder.build`
+#    Why:
+#       Qwen3.5/Qwen3Next GDN prefill on NPU needs prebuilt varlen chunk metadata
+#       to avoid forward-time host round-trips that break async scheduling.
+#    How：
+#       Monkey-patch the upstream builder in-place, keep upstream code untouched,
+#       and attach prebuilt device metadata bundle onto the returned attention
+#       metadata object for Ascend-specific consumers.
+#    Future Plan:
+#       Remove this patch when upstream exposes a backend hook for extending GDN
+#       metadata or when the optimization is accepted upstream directly.
+#
 # ** 8. File: worker/patch_qwen3_next.py**
 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 #   1. `vllm.model_executor.models.qwen3_next.Qwen3NextGatedDeltaNet.forward`