[Feature] Optimize Qwen3.5/Qwen3Next GDN prefill by prebuilding chunk metadata (#7487)

### What this PR does / why we need it?

This PR optimizes the Qwen3.5 and Qwen3Next GDN prefill path on Ascend by reducing host/device synchronization overhead.

The current implementation of the `chunk_gated_delta_rule` path for variable-length sequences prepares chunk metadata during the forward pass. This approach triggers frequent CPU intervention and host/device round-trips. When running prefill-heavy workloads with asynchronous scheduling enabled, these synchronizations result in execution "bubbles" and prefill stalling (stuttering).

**Note that this does not cause asynchronous scheduling to fail; rather, it prevents the system from reaching its theoretical throughput due to these unnecessary stalls.**

To resolve this, the patch moves metadata preparation out of the hot path:

- **Prebuilt Metadata:** All non-speculative varlen chunk metadata for GDN is now prebuilt on the CPU.
- **Asynchronous Transfer:** Staging buffers are kept in pinned memory and transferred to the NPU asynchronously.
- **Integration:** The prebuilt bundle is attached to GDN attention metadata via `patch_gdn_attn.py` and passed into Triton wrappers.
- **Backward Compatibility:** Triton wrappers fall back to the legacy preparation path if no prebuilt metadata is provided.

- vLLM version: v0.17.0
- vLLM main: 8b6325758c

---------

Signed-off-by: maoxx241 <maomaoyu870@gmail.com>
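The prebuild-then-fall-back pattern described above can be illustrated with a minimal pure-Python sketch. It is not the code from this PR: `build_chunk_indices_host` and `resolve_chunk_indices` are hypothetical names, and the `(sequence_id, chunk_id)` pair layout is an assumption about what `prepare_chunk_indices` produces. The pinned-memory staging and asynchronous NPU transfer are left out so the sketch stays self-contained.

```python
from typing import List, Optional, Tuple

ChunkIndices = List[Tuple[int, int]]


def build_chunk_indices_host(cu_seqlens: List[int], chunk_size: int = 64) -> ChunkIndices:
    """Hypothetical host-side helper: one (sequence_id, chunk_id) pair per chunk.

    Assumption: this mirrors the layout that prepare_chunk_indices returns.
    """
    pairs: ChunkIndices = []
    for seq_id in range(len(cu_seqlens) - 1):
        seq_len = cu_seqlens[seq_id + 1] - cu_seqlens[seq_id]
        num_chunks = -(-seq_len // chunk_size)  # ceiling division
        pairs.extend((seq_id, chunk_id) for chunk_id in range(num_chunks))
    return pairs


def resolve_chunk_indices(
    cu_seqlens: Optional[List[int]],
    chunk_indices: Optional[ChunkIndices] = None,
    chunk_size: int = 64,
) -> Optional[ChunkIndices]:
    """Reuse prebuilt metadata when provided; otherwise build it on demand."""
    if cu_seqlens is not None and chunk_indices is None:
        chunk_indices = build_chunk_indices_host(cu_seqlens, chunk_size)
    return chunk_indices


# Two sequences of lengths 100 and 64, packed back to back.
cu_seqlens = [0, 100, 164]
prebuilt = build_chunk_indices_host(cu_seqlens)       # built once, outside the hot path
in_forward = resolve_chunk_indices(cu_seqlens, prebuilt)  # hot path does no extra work
legacy = resolve_chunk_indices(cu_seqlens)            # fallback when nothing is prebuilt
```

In this sketch the hot path only checks whether metadata was passed in, so the per-sequence loop runs once per batch rather than once per kernel wrapper call.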
2026-03-22 23:09:23 +08:00
parent b2e71b7930
commit 9bf9b4b267
13 changed files with 824 additions and 21 deletions
--- a/vllm_ascend/ops/triton/fla/chunk_scaled_dot_kkt.py
+++ b/vllm_ascend/ops/triton/fla/chunk_scaled_dot_kkt.py
@@ -85,6 +85,7 @@ def chunk_scaled_dot_kkt_fwd(
    beta: torch.Tensor,
    g_cumsum: torch.Tensor | None = None,
    cu_seqlens: torch.LongTensor | None = None,
+    chunk_indices: torch.Tensor | None = None,
    chunk_size: int = 64,
    output_dtype: torch.dtype = torch.float32,
 ) -> torch.Tensor:
@@ -115,13 +116,8 @@ def chunk_scaled_dot_kkt_fwd(

    H = beta.shape[-1]
    BT = chunk_size
-    if cu_seqlens is not None:
-        cu_seqlens = cu_seqlens.cpu()
-        chunk_indices = prepare_chunk_indices(cu_seqlens, BT) if cu_seqlens is not None else None
-        chunk_indices = chunk_indices.npu()
-        cu_seqlens = cu_seqlens.npu()
-    else:
-        chunk_indices = None
+    if cu_seqlens is not None and chunk_indices is None:
+        chunk_indices = prepare_chunk_indices(cu_seqlens, BT)
    NT = triton.cdiv(T, BT) if cu_seqlens is None else len(chunk_indices)
    A = torch.empty(B, T, H, BT, device=k.device, dtype=output_dtype)