[Fix][MoE] Refine MoE communication strategy (#2734)

### What this PR does / why we need it?
Refactors the selection logic for the Mixture-of-Experts (MoE)
communication method. The choice between all-gather, all-to-all, and mc2
now depends on the expert-parallel configuration, the SoC version
(A2/A3), and the token count, with the aim of better performance.
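
For readers who want a feel for this kind of selection, here is a minimal
sketch. It is not the code from this PR. The function name, the `SocVersion`
enum, the `mc2_token_capacity` threshold, the A2 fallback, and the
`"mc2commimpl"` / `"alltoallcommimpl"` strings are all assumptions made for
illustration. Only `"allgathercommimpl"` and the "mc2 needs ep_size >= 16 on
A3" constraint come from the code comment removed by this commit.

```python
from enum import Enum


class SocVersion(Enum):
    """Hypothetical enum for the two SoC generations named in the PR."""
    A2 = "A2"
    A3 = "A3"


def select_moe_comm_method(
    enable_expert_parallel: bool,
    ep_size: int,
    soc_version: SocVersion,
    num_tokens: int,
    mc2_token_capacity: int = 512,  # assumed threshold, not from the PR
) -> str:
    """Return the name of a MoE communication method (illustrative only)."""
    # Without expert parallelism there is nothing to dispatch across ranks,
    # so this sketch falls back to all-gather.
    if not enable_expert_parallel:
        return "allgathercommimpl"

    if soc_version is SocVersion.A3:
        # The removed comment states that mc2 only supports ep_size >= 16 on A3.
        if ep_size < 16:
            return "allgathercommimpl"
        # Assumption: mc2 for small batches, all-to-all once the token count
        # exceeds the assumed capacity.
        if num_tokens <= mc2_token_capacity:
            return "mc2commimpl"
        return "alltoallcommimpl"

    # Assumption: on A2 this sketch always uses all-gather.
    return "allgathercommimpl"


if __name__ == "__main__":
    print(select_moe_comm_method(True, 16, SocVersion.A3, num_tokens=128))
```

Making the decision once per forward pass, outside the layer, would let the
layer simply look up the chosen method by name, which is consistent with the
hard-coded `ep_size` check being dropped from `AscendFusedMoE` in this commit.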

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Added.


- vLLM version: v0.10.1.1
- vLLM main: eafa8dcde6

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
yiz-liu
2025-09-05 09:04:04 +08:00
committed by GitHub
parent 4c90fa79ca
commit 83eb40a51c
3 changed files with 123 additions and 9 deletions

@@ -482,11 +482,6 @@ class AscendFusedMoE(FusedMoE):
        forward_context = get_forward_context()
        moe_comm_method_name = forward_context.moe_comm_method_name
        # TODO: Can we refactor this logic to model_runner?
        # TODO: Adjusted logic to differentiate between A2 and A3, we check ep_size here since mc2 only support ep_size >= 16 on A3 now
        if self.moe_config.ep_size < 16:
            moe_comm_method_name = "allgathercommimpl"
        forward_context.moe_comm_method = getattr(self, moe_comm_method_name)
        hidden_states, router_logits = forward_context.moe_comm_method.prepare(