[v0.11.0][Bugfix]Avoid using the fusion operator in the MOE model (#3837)

### What this PR does / why we need it?
The MatmulReduceScatter fusion operator currently suffers performance degradation in small-shape scenarios, so this PR decides whether to use the operator based on the size of the shape. For MoE models, the fusion path is turned off through a new `mmrs_fusion` flag on the forward context.

---------

Signed-off-by: ZYang6263 <zy626375@gmail.com>
2025-10-28 23:31:19 +08:00
parent e48ca0b6ec
commit 6188450269
2 changed files with 13 additions and 6 deletions
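
For readers skimming the hunk in `set_ascend_forward_context`, the gating can be restated as a small standalone function. This is only a sketch of the decision visible in the diff, not the actual vllm-ascend code: `ContextFlags`, `decide_flags`, `sp_requested`, and `DENSE_SP_TOKEN_THRESHOLD` are hypothetical names introduced for illustration, and `is_moe` / `sp_requested` stand in for the results of `is_moe_model(vllm_config)` and `enable_sp(vllm_config)`.

```python
from typing import NamedTuple, Optional

# Token threshold copied from the diff; the in-code comment there
# describes it as an empirical value.
DENSE_SP_TOKEN_THRESHOLD = 1000


class ContextFlags(NamedTuple):
    """Hypothetical container for the two flags set in the hunk."""
    sp_enabled: bool
    mmrs_fusion: bool


def decide_flags(is_moe: bool, sp_requested: bool, tp_world_size: int,
                 num_tokens: Optional[int]) -> ContextFlags:
    """Mirror the branch structure of the diff as a pure function."""
    # Conditions shared by both branches.
    base = sp_requested and tp_world_size > 1 and num_tokens is not None
    if is_moe:
        # MoE models: no token threshold, and the fusion operator is disabled.
        return ContextFlags(sp_enabled=bool(base), mmrs_fusion=False)
    # Non-MoE models: require more than the threshold; fusion keeps its default.
    dense_sp = bool(base and num_tokens > DENSE_SP_TOKEN_THRESHOLD)
    return ContextFlags(sp_enabled=dense_sp, mmrs_fusion=True)
```

Under these assumptions, `decide_flags(True, True, 4, 8)` keeps `sp_enabled` on but reports `mmrs_fusion=False`, while a non-MoE model with the same inputs leaves fusion at its default and skips sequence parallelism because the batch is below the threshold.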
--- a/vllm_ascend/ascend_forward_context.py
+++ b/vllm_ascend/ascend_forward_context.py
@@ -112,13 +112,16 @@ def set_ascend_forward_context(
        # Currently, it is an empirical value. In normal scenarios, if the concurrency exceeds this threshold,
        # the performance benefits can be maximized. Conversely, if the concurrency is below the threshold,
        # the performance may degrade due to the switching of communication methods.
+        mmrs_fusion = True
        if is_moe_model(vllm_config):
            sp_enabled = enable_sp(vllm_config) and \
                tp_world_size > 1 and num_tokens is not None
+            mmrs_fusion = False
        else:
            sp_enabled = enable_sp(vllm_config) and \
                tp_world_size > 1 and \
                num_tokens is not None and num_tokens > 1000
+        forward_context.mmrs_fusion = mmrs_fusion

        if sp_enabled:
            pad_size = (tp_world_size -