[Bugfix] Qwen3Next support FlashComm1 (#6830)
### What this PR does / why we need it?
Support FlashComm1 for Qwen3-Next, fix padding problems in Sequence
Parallel (SP), and resolve precision problems in `shared_out` when both
FlashComm1 and SP are enabled.
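For context, SP requires the flattened token count to be divisible by the tensor-parallel world size so each rank gets an equal slice; a minimal sketch of that padding rule (the names `pad_for_sp` and `tp_size` are illustrative, not the vLLM-Ascend API):

```python
def pad_for_sp(num_tokens: int, tp_size: int) -> int:
    """Pad the token count up to the next multiple of tp_size so each
    rank receives an equal slice under Sequence Parallel (illustrative,
    not the actual vLLM-Ascend helper)."""
    remainder = num_tokens % tp_size
    if remainder == 0:
        return num_tokens
    return num_tokens + (tp_size - remainder)
```

For example, 10 tokens on a 4-rank TP group would be padded to 12, and the extra slots must be masked out later, which is where the padding bugs this PR fixes can creep in.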
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
---------
Signed-off-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>
Co-authored-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>
```diff
@@ -1214,7 +1214,12 @@ class NPUModelRunner(GPUModelRunner):
         # Currently, Graph Mode and SP will both pad num_tokens,
         # Another possible condition is num_tokens_padded != num_tokens_unpadded
         # but this scope is way too big and the consequences are unpredictable
         old_num_reqs_padded = num_reqs_padded
         num_reqs_padded = self._pad_query_start_loc_for_fia(num_tokens_padded, num_reqs_padded, num_reqs)
         if enable_sp() and num_tokens_padded == num_tokens_unpadded:
             if num_reqs_padded > old_num_reqs_padded:
                 num_reqs_padded = old_num_reqs_padded
                 self.query_start_loc.np[num_reqs_padded + 1] = 0

         (attn_metadata, spec_decode_common_attn_metadata) = self._build_attention_metadata(
             num_tokens=num_tokens_unpadded
```
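The guard in the hunk above reverts the extra request padding added for FIA when SP is enabled but the token count was not actually padded (`num_tokens_padded == num_tokens_unpadded`), since the extra padded request slots would otherwise misalign `query_start_loc`. A standalone sketch of that decision, with illustrative names (`adjust_num_reqs_padded` is not the real method):

```python
def adjust_num_reqs_padded(num_reqs_padded_fia: int,
                           old_num_reqs_padded: int,
                           sp_enabled: bool,
                           tokens_were_padded: bool) -> int:
    """Revert FIA request padding when SP is on but num_tokens was not
    padded, mirroring the guard in the diff (illustrative sketch)."""
    if (sp_enabled
            and not tokens_were_padded
            and num_reqs_padded_fia > old_num_reqs_padded):
        # Extra padded requests would misalign query_start_loc: drop them.
        return old_num_reqs_padded
    return num_reqs_padded_fia
```

With SP on and unpadded tokens, a FIA-inflated request count falls back to the old value; in every other case the FIA result is kept.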