[main][BugFix] Avoided a bug of torch_npu.npu_mm_reduce_scatter_base when sp size >= 16 (#6168)

### What this PR does / why we need it?
If `sp` is enabled and `tp_size` >= 16,
`torch_npu.npu_mm_reduce_scatter_base` raises an exception.
After consulting the operator developer, we learned that the
operator does not work when `tp_size` >= 16.
So, we disable the operator when `tp_size` >= 16.
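
The fix amounts to gating the fused kernel behind a world-size check. A minimal illustrative sketch (not the actual vllm-ascend code; `MAX_SUPPORTED_TP` and `mmrs_fusion_supported` are hypothetical names, with the cutoff taken from the PR's `tp_world_size <= 8` check):

```python
# Assumed cutoff: the operator developer reports the fused kernel
# fails for tp_size >= 16, so the PR only enables it up to tp = 8.
MAX_SUPPORTED_TP = 8

def mmrs_fusion_supported(tp_world_size: int) -> bool:
    """True when the fused matmul + reduce-scatter kernel may be used."""
    return tp_world_size <= MAX_SUPPORTED_TP

if __name__ == "__main__":
    for tp in (1, 8, 16):
        print(tp, mmrs_fusion_supported(tp))
```

When the check fails, the code falls back to the unfused path (a plain matmul followed by a separate reduce-scatter), trading some performance for correctness.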

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?

We started a server with `sp` enabled and `tp` = 16.

It started successfully.

```text
(APIServer pid=1855938) INFO:     Started server process [1855938]
(APIServer pid=1855938) INFO:     Waiting for application startup.
(APIServer pid=1855938) INFO:     Application startup complete.
```

- vLLM version: v0.13.0
- vLLM main: d68209402d

Signed-off-by: drslark <slarksblood@qq.com>
Author: drslark
Date: 2026-01-23 21:12:23 +08:00
Parent: e90b14140b
Commit: 44a4ff6960


@@ -72,13 +72,16 @@ def set_ascend_forward_context(
         # due to multiple warmups before actual capturing
         forward_context.capturing = False
+        # TODO: remove it when torch_npu.npu_mm_reduce_scatter_base supports tp_size >= 16.
+        mmrs_fusion = tp_world_size <= 8
         # set for sequence parallelism, 1000 is the batch size concurrency threshold
         # for enabling the flashcomm_v1 or sequence_parallelism feature.
         # Currently, it is an empirical value. In normal scenarios, if the concurrency
         # exceeds this threshold, the performance benefits can be maximized.
         # Conversely, if the concurrency is below the threshold,
         # the performance may degrade due to the switching of communication methods.
-        mmrs_fusion = True
         # main model and drafter model may have different architecture
         is_context_moe_model = is_drafter_moe_model(vllm_config) if is_draft_model else is_moe_model(vllm_config)
         if is_context_moe_model:
@@ -86,6 +89,7 @@ def set_ascend_forward_context(
             mmrs_fusion = False
         else:
             sp_enabled = enable_sp(vllm_config) and num_tokens is not None and num_tokens > 1000
+        forward_context.mmrs_fusion = mmrs_fusion
+        forward_context.num_tokens = num_tokens
         forward_context.sp_enabled = sp_enabled
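
The control flow in the diff can be summarized as a standalone sketch. This is a hedged approximation, not the vllm-ascend API: `ForwardContext` and `set_flags` are illustrative stand-ins, and the MoE branch is simplified since the hunk is truncated in the diff.

```python
from dataclasses import dataclass
from typing import Optional

# Empirical concurrency threshold from the diff's comment: sequence
# parallelism is only worthwhile above roughly this many tokens.
SP_TOKEN_THRESHOLD = 1000

@dataclass
class ForwardContext:
    mmrs_fusion: bool = False
    sp_enabled: bool = False
    num_tokens: Optional[int] = None

def set_flags(ctx: ForwardContext, tp_world_size: int,
              num_tokens: Optional[int], is_moe_model: bool,
              sp_configured: bool) -> ForwardContext:
    # Fused matmul + reduce-scatter only below the reported tp-size limit.
    mmrs_fusion = tp_world_size <= 8
    sp_enabled = False
    if is_moe_model:
        # The MoE path in the diff disables the fusion outright.
        mmrs_fusion = False
    else:
        # Sequence parallelism kicks in only above the token threshold.
        sp_enabled = (sp_configured and num_tokens is not None
                      and num_tokens > SP_TOKEN_THRESHOLD)
    ctx.mmrs_fusion = mmrs_fusion
    ctx.num_tokens = num_tokens
    ctx.sp_enabled = sp_enabled
    return ctx
```

Under this reading, a dense model at `tp = 16` gets `mmrs_fusion = False` regardless of batch size, which is exactly the behavior the PR's test (server startup with `sp` enabled and `tp = 16`) exercises.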