[Fix][MoE] Refine MoE communication strategy (#2734)

### What this PR does / why we need it?

Refactors the Mixture-of-Experts (MoE) communication method selection logic. The choice between all-gather, all-to-all, and mc2 is now determined by expert parallel configuration, SoC version (A2/A3), and token count for better performance.

### Does this PR introduce _any_ user-facing change?

None.

### How was this patch tested?

Added.

- vLLM version: v0.10.1.1
- vLLM main: eafa8dcde6

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-05 09:04:04 +08:00
parent 4c90fa79ca
commit 83eb40a51c
3 changed files with 123 additions and 9 deletions
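
For readers skimming the description, here is a minimal sketch of a selector driven by the three inputs the commit message names (expert parallel configuration, SoC version, token count). It is not the code from this PR. The function name, the `SocVersion` enum, the `mc2_token_capacity` default, and the branch order are all assumptions made for illustration. Only the `"allgathercommimpl"` name and the `ep_size < 16` threshold appear in the diff; the other two method names are hypothetical, formed by analogy.

```python
from enum import Enum


class SocVersion(Enum):
    """Hypothetical stand-in for the SoC version check (A2/A3)."""
    A2 = "A2"
    A3 = "A3"
    OTHER = "other"


def select_moe_comm_method(
    enable_expert_parallel: bool,
    ep_size: int,
    soc_version: SocVersion,
    num_tokens: int,
    mc2_token_capacity: int = 512,  # assumed value, not taken from the PR
    min_mc2_ep_size: int = 16,      # threshold seen in the removed check
) -> str:
    """Illustrative selector; the rules below are assumptions, not the PR's logic."""
    # Without expert parallelism there is nothing to exchange between experts,
    # so this sketch falls back to all-gather.
    if not enable_expert_parallel:
        return "allgathercommimpl"

    fits_mc2 = num_tokens <= mc2_token_capacity

    if soc_version is SocVersion.A2:
        # Assumption: on A2, mc2 needs a large enough EP group and a small batch.
        if fits_mc2 and ep_size >= min_mc2_ep_size:
            return "mc2commimpl"  # hypothetical name
        return "allgathercommimpl"

    if soc_version is SocVersion.A3:
        # Assumption: on A3, large batches use all-to-all instead of mc2.
        return "mc2commimpl" if fits_mc2 else "alltoallcommimpl"  # hypothetical names

    # Unknown hardware: use the most conservative option in this sketch.
    return "allgathercommimpl"
```

With a selector like this running before the forward pass, the layer only has to resolve the returned name with `getattr(self, moe_comm_method_name)`, which is why the `ep_size < 16` override can be dropped from `AscendFusedMoE` in the hunk below.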
--- a/vllm_ascend/ops/common_fused_moe.py
+++ b/vllm_ascend/ops/common_fused_moe.py
@@ -482,11 +482,6 @@ class AscendFusedMoE(FusedMoE):
        forward_context = get_forward_context()
        moe_comm_method_name = forward_context.moe_comm_method_name

-        # TODO: Can we refactor this logic to model_runner?
-        # TODO: Adjusted logic to differentiate between A2 and A3, we check ep_size here since mc2 only support ep_size >= 16 on A3 now
-        if self.moe_config.ep_size < 16:
-            moe_comm_method_name = "allgathercommimpl"
-
        forward_context.moe_comm_method = getattr(self, moe_comm_method_name)

        hidden_states, router_logits = forward_context.moe_comm_method.prepare(