[Fix] Fix DP-related padding logic (#2582)

### What this PR does / why we need it? The determination of attention state, padding, and other forward metadata has been moved to an earlier stage within the input preparation process. This change enables us to utilize a single all-reduce operation, maximizing synchronization efficiency as early as possible. The logic for synchronizing metadata—such as the number of tokens, prefill status, and DBO status—across data parallel (DP) ranks has now been unified and simplified. For performance improvements, the all-reduce operation has been switched from the `gloo` backend to the `npu` backend, which results in an reduction of several milliseconds per step (**approximately 10% performance gain for TPOT!**). Additionally, the multi-DP server hang issue has been resolved, ensuring no more hangs occur when `num_requests < dp_size`. Alas, a relief. Finally, the miscalculated memory usage issue has been addressed by removing the unnecessary `DummyCommImpl`, allowing the system to use the real communication method when determining available memory. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Maybe we should add an test case for multi-DP online server? @MengqingCao - vLLM version: v0.10.1.1 - vLLM main: c5d004aaaf --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-08-28 19:39:58 +08:00
parent 175f6bc445
commit dfc7eb39ad
5 changed files with 110 additions and 160 deletions
--- a/vllm_ascend/torchair/torchair_model_runner.py
+++ b/vllm_ascend/torchair/torchair_model_runner.py
@@ -70,7 +70,7 @@ class NPUTorchairModelRunner(NPUModelRunner):
        register_torchair_model()
        torchair_quant_method_register()

-    def _get_forward_metadata_across_dp_and_pad(
+    def _sync_metadata_across_dp(
            self, num_tokens: int, with_prefill: bool, enable_dbo: bool
    ) -> tuple[int, Optional[torch.Tensor], bool, bool]:
        """Override from NPUModelRunner to pad num_tokens"""
@@ -81,8 +81,17 @@ class NPUTorchairModelRunner(NPUModelRunner):
                return maybe_padded_num_tokens, None, with_prefill, enable_dbo
            return num_tokens, None, with_prefill, enable_dbo

-        num_tokens_across_dp, with_prefill, enable_dbo = self._get_forward_metadata_across_dp(
-            num_tokens, with_prefill, enable_dbo)
+        num_tokens_across_dp = torch.zeros(self.dp_size + 2,
+                                           dtype=torch.int32,
+                                           device="npu")
+        num_tokens_across_dp[self.dp_rank] = num_tokens
+        num_tokens_across_dp[-2] = int(with_prefill)
+        num_tokens_across_dp[-1] = int(not enable_dbo)
+        dist.all_reduce(num_tokens_across_dp,
+                        group=get_dp_group().device_group)
+        with_prefill = bool(num_tokens_across_dp[-2])
+        enable_dbo = not bool(num_tokens_across_dp[-1])
+        num_tokens_across_dp = num_tokens_across_dp[:-2]

        if not with_prefill:
            max_num_token = num_tokens_across_dp.max().item()