[Bugfix] Qwen3Next support FlashComm1 (#6830)
### What this PR does / why we need it?
Support FlashComm1 for Qwen3-Next, fix padding problems in Sequence
Parallel (SP), and resolve precision problems in `shared_out` when both
FlashComm1 and SP are enabled.
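For context, SP requires the flattened token count to be divisible by the tensor-parallel world size so each rank gets an equal slice; a minimal sketch of that padding rule (the names `pad_for_sp` and `tp_size` are illustrative, not the vLLM-Ascend API):

```python
def pad_for_sp(num_tokens: int, tp_size: int) -> int:
    """Pad the token count up to the next multiple of tp_size so each
    rank receives an equal slice under Sequence Parallel (illustrative,
    not the actual vLLM-Ascend helper)."""
    remainder = num_tokens % tp_size
    if remainder == 0:
        return num_tokens
    return num_tokens + (tp_size - remainder)
```

For example, 10 tokens on a 4-rank TP group would be padded to 12, and the extra slots must be masked out later, which is where the padding bugs this PR fixes can creep in.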
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
---------
Signed-off-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>
Co-authored-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>
```diff
@@ -1214,7 +1214,12 @@ class NPUModelRunner(GPUModelRunner):
         # Currently, Graph Mode and SP will both pad num_tokens,
         # Another possible condition is num_tokens_padded != num_tokens_unpadded
         # but this scope is way too big and the consequences are unpredictable
         old_num_reqs_padded = num_reqs_padded
         num_reqs_padded = self._pad_query_start_loc_for_fia(num_tokens_padded, num_reqs_padded, num_reqs)
         if enable_sp() and num_tokens_padded == num_tokens_unpadded:
             if num_reqs_padded > old_num_reqs_padded:
                 num_reqs_padded = old_num_reqs_padded
                 self.query_start_loc.np[num_reqs_padded + 1] = 0

         (attn_metadata, spec_decode_common_attn_metadata) = self._build_attention_metadata(
             num_tokens=num_tokens_unpadded
```
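The guard in the hunk above reverts the extra request padding added for FIA when SP is enabled but the token count was not actually padded (`num_tokens_padded == num_tokens_unpadded`), since the extra padded request slots would otherwise misalign `query_start_loc`. A standalone sketch of that decision, with illustrative names (`adjust_num_reqs_padded` is not the real method):

```python
def adjust_num_reqs_padded(num_reqs_padded_fia: int,
                           old_num_reqs_padded: int,
                           sp_enabled: bool,
                           tokens_were_padded: bool) -> int:
    """Revert FIA request padding when SP is on but num_tokens was not
    padded, mirroring the guard in the diff (illustrative sketch)."""
    if (sp_enabled
            and not tokens_were_padded
            and num_reqs_padded_fia > old_num_reqs_padded):
        # Extra padded requests would misalign query_start_loc: drop them.
        return old_num_reqs_padded
    return num_reqs_padded_fia
```

With SP on and unpadded tokens, a FIA-inflated request count falls back to the old value; in every other case the FIA result is kept.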