[Feature] Optimize Qwen3.5/Qwen3Next GDN prefill by prebuilding chunk metadata (#7487)
### What this PR does / why we need it?
This PR optimizes the Qwen3.5 and Qwen3Next GDN prefill path on Ascend
by reducing host/device synchronization overhead.
The current implementation of the `chunk_gated_delta_rule` path for
variable-length sequences prepares chunk metadata during the forward
pass. This approach triggers frequent CPU intervention and host/device
round-trips. In prefill-heavy workloads with asynchronous scheduling
enabled, these synchronizations create execution "bubbles" and make
prefill stall (stutter). **Note that this does not cause asynchronous
scheduling to fail; the stalls only keep the system from reaching its
theoretical throughput.**
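
To illustrate why this work can move off the hot path: varlen chunk metadata of this kind typically depends only on the sequence lengths, which the host already knows before the forward pass. The following is a minimal pure-Python sketch, not the code from this PR. The chunk size of 64 and the helper names `build_chunk_indices` and `build_chunk_offsets` are assumptions chosen for illustration.

```python
from typing import List, Tuple


def build_chunk_indices(cu_seqlens: List[int], chunk_size: int) -> List[Tuple[int, int]]:
    """Return one (sequence_index, chunk_index_within_sequence) pair per chunk.

    Hypothetical helper: `cu_seqlens` holds cumulative sequence lengths,
    so sequence i spans tokens cu_seqlens[i] .. cu_seqlens[i + 1].
    """
    indices = []
    for seq_idx in range(len(cu_seqlens) - 1):
        seq_len = cu_seqlens[seq_idx + 1] - cu_seqlens[seq_idx]
        num_chunks = -(-seq_len // chunk_size)  # ceiling division
        for chunk_idx in range(num_chunks):
            indices.append((seq_idx, chunk_idx))
    return indices


def build_chunk_offsets(cu_seqlens: List[int], chunk_size: int) -> List[int]:
    """Return cumulative chunk counts, one entry per sequence boundary."""
    offsets = [0]
    for seq_idx in range(len(cu_seqlens) - 1):
        seq_len = cu_seqlens[seq_idx + 1] - cu_seqlens[seq_idx]
        offsets.append(offsets[-1] + -(-seq_len // chunk_size))
    return offsets


# Three sequences of lengths 100, 64 and 6 tokens (illustrative values).
cu_seqlens = [0, 100, 164, 170]
chunk_indices = build_chunk_indices(cu_seqlens, 64)
chunk_offsets = build_chunk_offsets(cu_seqlens, 64)
```

Because nothing here reads device memory, the same computation can run on the CPU ahead of the forward pass.
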
To resolve this, the patch moves metadata preparation out of the hot
path:
- **Prebuilt Metadata:** All non-speculative varlen chunk metadata for
GDN is now prebuilt on the CPU.
- **Asynchronous Transfer:** Staging buffers are kept in pinned memory
and transferred to the NPU asynchronously.
- **Integration:** The prebuilt bundle is attached to GDN attention
metadata via `patch_gdn_attn.py` and passed into Triton wrappers.
- **Backward Compatibility:** Triton wrappers fall back to the legacy
preparation path if no prebuilt metadata is provided.
- vLLM version: v0.17.0
- vLLM main: 8b6325758c
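
The backward-compatibility behavior described in the list above can be sketched as a wrapper that prefers a prebuilt bundle and otherwise falls back to preparing metadata itself. This is a hedged illustration in plain Python, not the actual Triton wrapper. `PrebuiltChunkMeta`, `legacy_prepare`, and `chunk_kernel_wrapper` are hypothetical names, and the chunk-size consistency check is an added assumption. Pinned staging buffers and the asynchronous NPU copy are omitted because they depend on the device runtime.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PrebuiltChunkMeta:
    """Hypothetical bundle of metadata computed before the forward pass."""
    chunk_size: int
    chunks_per_seq: List[int]


def legacy_prepare(seq_lens: List[int], chunk_size: int) -> List[int]:
    """Stand-in for the legacy forward-time preparation path."""
    return [-(-n // chunk_size) for n in seq_lens]  # ceiling division


def chunk_kernel_wrapper(
    seq_lens: List[int],
    chunk_size: int = 64,
    prebuilt: Optional[PrebuiltChunkMeta] = None,
) -> Tuple[List[int], str]:
    """Use prebuilt metadata when available, else fall back to legacy prep."""
    # Assumption: a bundle built for a different chunk size is ignored.
    if prebuilt is not None and prebuilt.chunk_size == chunk_size:
        return prebuilt.chunks_per_seq, "prebuilt"
    return legacy_prepare(seq_lens, chunk_size), "legacy"


seq_lens = [100, 64, 6]
meta = PrebuiltChunkMeta(chunk_size=64, chunks_per_seq=legacy_prepare(seq_lens, 64))
fast = chunk_kernel_wrapper(seq_lens, prebuilt=meta)  # takes the prebuilt path
slow = chunk_kernel_wrapper(seq_lens)                 # takes the legacy path
```

Both paths return the same metadata, so callers that never supply a bundle keep working unchanged.
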
---------
Signed-off-by: maoxx241 <maomaoyu870@gmail.com>
@@ -259,6 +259,20 @@
# Remove this patch when bool is supported in 'torch.argsort' func of npu.
# Make 'torch.argsort' in `vllm.v1.attention.backends.gdn_attn` be stable.
#
# ** 7a. File: worker/patch_gdn_attn.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.attention.backends.gdn_attn.GDNAttentionMetadataBuilder.build`
# Why:
# Qwen3.5/Qwen3Next GDN prefill on NPU needs prebuilt varlen chunk metadata
# to avoid forward-time host round-trips that break async scheduling.
# How:
# Monkey-patch the upstream builder in-place, keep upstream code untouched,
# and attach prebuilt device metadata bundle onto the returned attention
# metadata object for Ascend-specific consumers.
# Future Plan:
# Remove this patch when upstream exposes a backend hook for extending GDN
# metadata or when the optimization is accepted upstream directly.
#
# ** 8. File: worker/patch_qwen3_next.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.qwen3_next.Qwen3NextGatedDeltaNet.forward`
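
The "How" note in the patch comments above (monkey-patch the builder in place and attach a bundle to the returned metadata object) can be sketched as follows. This is an illustration of the general pattern only, not the contents of `patch_gdn_attn.py`. `UpstreamBuilder`, `_prebuild_bundle`, `apply_patch`, and the `prebuilt_chunk_meta` attribute are hypothetical stand-ins, and the chunk size of 64 is an assumption.

```python
import functools
from types import SimpleNamespace


class UpstreamBuilder:
    """Hypothetical stand-in for an upstream metadata builder class."""

    def build(self, seq_lens):
        return SimpleNamespace(seq_lens=list(seq_lens))


def _prebuild_bundle(seq_lens, chunk_size=64):
    """Hypothetical host-side prebuild: chunks needed per sequence."""
    return {
        "chunk_size": chunk_size,
        "chunks_per_seq": [-(-n // chunk_size) for n in seq_lens],
    }


def apply_patch(builder_cls):
    """Wrap `build` so its result carries an extra prebuilt bundle."""
    original_build = builder_cls.build
    if getattr(original_build, "_is_patched", False):
        return  # already patched; keep the patch idempotent

    @functools.wraps(original_build)
    def patched_build(self, *args, **kwargs):
        # Call the untouched upstream logic first, then extend its result.
        metadata = original_build(self, *args, **kwargs)
        metadata.prebuilt_chunk_meta = _prebuild_bundle(metadata.seq_lens)
        return metadata

    patched_build._is_patched = True
    builder_cls.build = patched_build


apply_patch(UpstreamBuilder)
apply_patch(UpstreamBuilder)  # second call is a no-op
patched_metadata = UpstreamBuilder().build([100, 64, 6])
```

Wrapping rather than replacing the method keeps the upstream code path intact, which makes the patch easy to drop once an official extension hook exists.
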