[Feat][SP] Support SP for VL MoE models (#7044)

### What this PR does / why we need it?

This is the second PR for https://github.com/vllm-project/vllm-ascend/issues/5712. It extends SP to VL MoE models.

### Does this PR introduce _any_ user-facing change?

Yes. It removes `sp_threshold` from the additional config and reuses `sp_min_token_num` from vLLM instead.

### How was this patch tested?

- Model: Qwen3-VL-30B-A3B
- TP4 DP2
- 100 reqs
- max concurrency 1

| Seq length | Mean TTFT (ms) main | Mean TTFT (ms) this PR |
|------------|---------------------|------------------------|
| 4k | 429.40 | 323.3 |
| 16k | 1297.01 | 911.74 |

- vLLM version: v0.16.0
- vLLM main: 4034c3d32e

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2026-03-24 17:16:00 +08:00
parent 9615bc33fd
commit 5d12446573
21 changed files with 947 additions and 54 deletions
--- a/vllm_ascend/patch/init.py
+++ b/vllm_ascend/patch/init.py
@@ -167,15 +167,12 @@
 #   1. `vllm.distributed.parallel_state.GroupCoordinator`
 #    Why:
 #       vllm doesn't support all_to_all for GroupCoordinator.
-#       all_reduce in vLLM not is a customop, which will make MatmulAllReduceAddRMSNorm fusion failure.
 #    How：
 #       Add all_to_all implementation for GroupCoordinator.
-#       make all_reduce as a customop.
 #    Related PR (if no, explain why):
 #       No, we should use vlLM all2all manager to support all_to_all for npu.
 #    Future Plan:
 #       Remove this patch when the refactor of all2all manager is done.
-#       Remove this patch when vLLM support all_reduce as customop.
 #
 # ** 2. File: worker/patch_multimodal_merge.py**
 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--- a/vllm_ascend/patch/worker/patch_distributed.py
+++ b/vllm_ascend/patch/worker/patch_distributed.py
@@ -84,7 +84,7 @@ class GroupCoordinatorPatch(GroupCoordinator):
        if use_message_queue_broadcaster and self.world_size > 1:
            self.mq_broadcaster = MessageQueue.create_from_process_group(self.cpu_group, 1 << 22, 6)

-        self.use_custom_op_call = False
+        self.use_custom_op_call = True
        self.use_cpu_custom_send_recv = False

    def all_to_all(
@@ -106,10 +106,5 @@ class GroupCoordinatorPatch(GroupCoordinator):
        assert self.device_communicator is not None, "device_communicator should be initialized when world_size > 1"
        return self.device_communicator.all_to_all(input_, scatter_dim, gather_dim, scatter_sizes, gather_sizes)

-    def all_reduce(self, input_):
-        if self.world_size == 1:
-            return input_
-        return torch.ops.vllm.all_reduce(input_, group_name=self.unique_name)
-

 vllm.distributed.parallel_state.GroupCoordinator = GroupCoordinatorPatch
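
The change above drops the patched `all_reduce` override and instead sets `use_custom_op_call = True`. A minimal sketch of why these two approaches can be equivalent follows. It assumes the upstream base class's `all_reduce` consults `use_custom_op_call` to choose between a registered custom op and a direct call; that base-class code is not part of this diff. All class and method names below (`ToyGroupCoordinator`, `ToyPatchedCoordinator`, `_custom_op_all_reduce`, `_direct_all_reduce`) are hypothetical stand-ins, not vLLM or vllm-ascend APIs, and no real communication is performed.

```python
class ToyGroupCoordinator:
    """Hypothetical stand-in for a base coordinator (not the real vLLM class)."""

    def __init__(self, world_size, use_custom_op_call):
        self.world_size = world_size
        self.use_custom_op_call = use_custom_op_call
        self.calls = []  # records which path each all_reduce took

    def _custom_op_all_reduce(self, input_):
        # Assumed to correspond to dispatching through a registered custom op.
        self.calls.append("custom_op")
        return input_

    def _direct_all_reduce(self, input_):
        # Assumed to correspond to calling the communicator directly.
        self.calls.append("direct")
        return input_

    def all_reduce(self, input_):
        # A single-rank group has nothing to reduce.
        if self.world_size == 1:
            return input_
        if self.use_custom_op_call:
            return self._custom_op_all_reduce(input_)
        return self._direct_all_reduce(input_)


class ToyPatchedCoordinator(ToyGroupCoordinator):
    """Flips the flag instead of overriding all_reduce, mirroring the diff."""

    def __init__(self, world_size):
        super().__init__(world_size, use_custom_op_call=True)


group = ToyPatchedCoordinator(world_size=4)
group.all_reduce([1.0, 2.0])
print(group.calls)
```

Under that assumption, the subclass no longer needs its own `all_reduce`: the inherited method already takes the custom-op path once the flag is set, and the single-rank shortcut is preserved.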