[Fix] Fix DP-related padding logic (#2582)
### What this PR does / why we need it?
The determination of attention state, padding, and other forward
metadata has been moved to an earlier stage of input preparation, so
that a single all-reduce operation can synchronize everything across
ranks as early as possible. The logic for synchronizing metadata (such
as the number of tokens, prefill status, and DBO status) across data
parallel (DP) ranks is now unified and simplified.
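To make the single-all-reduce idea concrete, here is a minimal sketch (not the PR's exact code) of packing the per-rank metadata into one tensor and max-reducing it once across the DP group. The function name, the packing layout, and the "any rank" semantics for the boolean flags are assumptions for illustration:

```python
import torch
import torch.distributed as dist
# import torch_npu  # registers the "npu" device; assumes an Ascend environment

def sync_dp_metadata(num_tokens: int, with_prefill: bool, enable_dbo: bool,
                     dp_group: dist.ProcessGroup) -> tuple[int, bool, bool]:
    # Pack all per-step metadata into one tensor so a single collective
    # suffices: MAX yields the padded token count across ranks and gives
    # the boolean flags "true on any rank" semantics (an assumption here).
    packed = torch.tensor(
        [num_tokens, int(with_prefill), int(enable_dbo)],
        dtype=torch.int32, device="npu")
    dist.all_reduce(packed, op=dist.ReduceOp.MAX, group=dp_group)
    max_tokens, any_prefill, any_dbo = packed.tolist()
    return max_tokens, bool(any_prefill), bool(any_dbo)
```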
For performance, the all-reduce operation has been switched from the
`gloo` backend to the `npu` backend, which saves several milliseconds
per step (**approximately a 10% performance gain for TPOT!**).
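The difference is where the tensor lives: with `gloo` the metadata sits on the host and every step pays a device-to-host round-trip, while a device-backed group runs the same all-reduce on the NPU. A hedged sketch of the group setup, assuming `"hccl"` is the backend name torch_npu registers for Ascend devices:

```python
import torch.distributed as dist

def make_dp_group(dp_ranks: list[int]) -> dist.ProcessGroup:
    # Old path: gloo keeps the metadata tensor on the CPU, so each step's
    # all-reduce forces a device->host->device round-trip.
    # cpu_group = dist.new_group(ranks=dp_ranks, backend="gloo")
    # New path: an HCCL-backed group performs the all-reduce on device.
    return dist.new_group(ranks=dp_ranks, backend="hccl")
```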
Additionally, the multi-DP server hang issue has been resolved: the
server no longer hangs when `num_requests < dp_size`. At last, a relief.
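The assumed shape of that fix, sketched below: a DP rank with no real requests still executes a dummy forward pass whenever another rank has work, so the collectives inside the model stay matched across ranks. The helper name and the `_dummy_run` call are illustrative assumptions about the runner's API:

```python
def maybe_pad_with_dummy_run(model_runner, num_local_reqs: int,
                             max_tokens_across_dp: int) -> None:
    """Idle DP ranks still join collectives instead of deadlocking."""
    if num_local_reqs == 0 and max_tokens_across_dp > 0:
        # Some other DP rank has real work this step. Running a 1-token
        # dummy batch keeps the MoE all-gather/all-reduce calls aligned
        # across ranks; previously this was the num_requests < dp_size hang.
        model_runner._dummy_run(num_tokens=1)
```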
Finally, the miscalculated memory usage has been fixed by removing the
unnecessary `DummyCommImpl`, so the system now uses the real
communication method when profiling available memory.
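Why this matters for the estimate, in a hedged sketch: profiling through a no-op comm impl never allocates the real all-gather/MC2 workspaces, so free memory is overestimated. This assumes torch_npu mirrors the `torch.cuda` memory API (`mem_get_info`, `empty_cache`); the `_dummy_run` and `max_num_tokens` names are illustrative:

```python
import torch
# import torch_npu  # assumed to provide torch.npu.* mirroring torch.cuda.*

def profile_available_memory(model_runner) -> int:
    """Measure free memory after a dummy run through the real comm path."""
    torch.npu.empty_cache()
    # Dummy forward with the *real* MoE communication method, so its
    # workspace buffers are allocated and counted during profiling.
    model_runner._dummy_run(num_tokens=model_runner.max_num_tokens)
    free_bytes, _total = torch.npu.mem_get_info()
    return free_bytes  # what is genuinely left for the KV cache
```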
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Maybe we should add a test case for the multi-DP online server?
@MengqingCao
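A hedged sketch of what such a test could look like: start a DP-enabled online server and send fewer requests than `dp_size`, the exact case that used to hang. The launch command and test scaffolding are assumptions; the client calls use vLLM's OpenAI-compatible API:

```python
import openai  # the online server exposes an OpenAI-compatible API

def test_multi_dp_fewer_requests_than_ranks():
    # Assumes a server was started with data parallelism, e.g.:
    #   vllm serve <model> --data-parallel-size 2
    client = openai.OpenAI(base_url="http://localhost:8000/v1",
                           api_key="EMPTY")
    # A single request against dp_size=2: previously this could hang forever.
    out = client.completions.create(model="<model>", prompt="Hello",
                                    max_tokens=8)
    assert out.choices, "server returned no choices"
```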
- vLLM version: v0.10.1.1
- vLLM main:
c5d004aaaf
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
```diff
@@ -26,7 +26,6 @@ from vllm.model_executor.layers.fused_moe.layer import (
 from vllm_ascend.ascend_config import get_ascend_config
 from vllm_ascend.distributed.moe_comm_method import (AllGatherCommImpl,
-                                                     DummyCommImpl,
                                                      MC2CommImpl,
                                                      MoECommMethod)
 from vllm_ascend.distributed.parallel_state import get_mc2_group
@@ -230,7 +229,7 @@ class AscendFusedMoE(FusedMoE):
         self.moe_config.ep_group = get_ep_group()
         self.moe_config.mc2_group = get_mc2_group()
 
-        for method in {AllGatherCommImpl, DummyCommImpl, MC2CommImpl}:
+        for method in {AllGatherCommImpl, MC2CommImpl}:
             setattr(
                 self, method.__name__.lower(),
                 method(moe_config=self.moe_config))  # type: ignore[abstract]
@@ -241,8 +240,11 @@ class AscendFusedMoE(FusedMoE):
 
         forward_context = get_forward_context()
         moe_comm_method_name = forward_context.moe_comm_method_name
-        if not self.moe_config.use_ep and moe_comm_method_name != "dummycommimpl":
+
+        # TODO: Can we refactor this logic to model_runner?
+        if not self.moe_config.use_ep:
             moe_comm_method_name = "allgathercommimpl"
+
         forward_context.moe_comm_method = getattr(self, moe_comm_method_name)
 
         hidden_states, router_logits = forward_context.moe_comm_method.prepare(
```