[bugfix] Fix dummy-run and multi-node issues in MoE routing and MTP (#4947)

### What this PR does / why we need it?

- Fix a premature `return` in `moe_init_routing_quant_v2.cpp` so the routing kernel completes correctly instead of exiting early in certain paths.
- Switch `FusedAlltoAllCommImpl` to use the MC2-based token dispatcher and prepare/finalize routines, aligning MoE communication with the MC2 algorithm optimized for Ascend devices.
- Add a temporary override in `MtpProposer` to map `FUSED_ALLTOALL` back to `ALLTOALL` until the MoE communication type selection logic is fully finalized, avoiding incorrect behavior in dummy-run flows.
- Simplify the MoE communication selection for Ascend 910-93 in `NPUModelRunner` by removing the EP-size guard on `FUSED_ALLTOALL`, which fixes failures in multi-node / larger-EP configurations while keeping MC2 routing under the configured token capacity.

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

Signed-off-by: mojave2 <chenchen145@huawei.com>
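
The temporary `MtpProposer` override described in the list above amounts to a one-way mapping: `FUSED_ALLTOALL` becomes `ALLTOALL`, and every other value passes through unchanged. The sketch below shows that mapping in isolation. The `MoECommType` enum here is a hypothetical stand-in (its members and values are assumptions, not the definition in `vllm_ascend`), and `downgrade_fused_alltoall` is an illustrative helper name that does not exist in the PR.

```python
from enum import Enum


class MoECommType(Enum):
    # Hypothetical stand-in for the real enum; members and values are assumed.
    MC2 = 0
    ALLTOALL = 1
    FUSED_ALLTOALL = 2


def downgrade_fused_alltoall(moe_comm_type: MoECommType) -> MoECommType:
    """Map FUSED_ALLTOALL back to ALLTOALL; leave other values untouched."""
    if moe_comm_type == MoECommType.FUSED_ALLTOALL:
        return MoECommType.ALLTOALL
    return moe_comm_type


if __name__ == "__main__":
    for comm_type in MoECommType:
        print(comm_type.name, "->", downgrade_fused_alltoall(comm_type).name)
```

Keeping the mapping at the call site, as the PR does, leaves the shared selection logic in the runner untouched, so the override can be deleted once that logic is finalized.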
2025-12-15 14:18:23 +08:00
parent cc7b302020
commit aa02a85e4d
3 changed files with 7 additions and 4 deletions
--- a/vllm_ascend/spec_decode/mtp_proposer.py
+++ b/vllm_ascend/spec_decode/mtp_proposer.py
@@ -734,6 +734,9 @@ class MtpProposer(Proposer):
             num_input_tokens, self.runner.with_prefill)

        moe_comm_type = self.runner._select_moe_comm_method(num_input_tokens)
+        # TODO: remove this after moe_comm_type selection logic is finalized
+        moe_comm_type = (MoECommType.ALLTOALL if moe_comm_type
+                         == MoECommType.FUSED_ALLTOALL else moe_comm_type)

        # Enable shared_expert_dp and MTP FULL graph may cause accuracy issues.
        if scheduler_output and not self.enable_shared_expert_dp: