[Bugfix] dynamic eplb doesn't use fused_alltoall (#4919)

### What this PR does / why we need it?
The fused alltoall operator itself was not designed or implemented to handle the scenario where tensors are lists, but the weights for dynamic load balancing are in list form. Therefore, we have disabled this operator when using dynamic load balancing.

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2025-12-16 10:59:30 +08:00
parent 195eac665b
commit 0918de58d5
1 changed files with 7 additions and 4 deletions
--- a/vllm_ascend/worker/model_runner_v1.py
+++ b/vllm_ascend/worker/model_runner_v1.py
@@ -1434,10 +1434,13 @@ class NPUModelRunner(GPUModelRunner):
                    moe_comm_type = MoECommType.ALLGATHER
        elif soc_version in {AscendDeviceType._910_93}:
-            moe_comm_type = (
-                MoECommType.MC2 if num_tokens <= mc2_tokens_capacity else
-                MoECommType.FUSED_ALLTOALL if quant_type == "w8a8_dynamic"
-                and get_ep_group().world_size <= 16 else MoECommType.ALLTOALL)
+            # TODO: drop the EP-size guard when dispatch_ffn_combine supports larger EP sizes
+            fused_all2all_enable = quant_type == "w8a8_dynamic" and get_ep_group(
+            ).world_size <= 16 and (not self.dynamic_eplb)
+            moe_comm_type = (MoECommType.MC2
+                             if num_tokens <= mc2_tokens_capacity else
+                             MoECommType.FUSED_ALLTOALL
+                             if fused_all2all_enable else MoECommType.ALLTOALL)
        else:
            raise ValueError(f"Unsupported soc_version: {soc_version}")
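
The selection logic in this hunk can be read in isolation as the sketch below. It is a standalone illustration, not the code in `model_runner_v1.py`: the `MoECommType` enum here is a stand-in with only the three members the hunk uses, and `select_moe_comm_type` with its plain arguments (`ep_world_size` in place of `get_ep_group().world_size`, `dynamic_eplb` in place of `self.dynamic_eplb`) is a hypothetical helper introduced for illustration.

```python
from enum import Enum, auto


class MoECommType(Enum):
    # Stand-in for the real enum in vllm_ascend; only the members
    # that appear in this hunk are reproduced here.
    MC2 = auto()
    FUSED_ALLTOALL = auto()
    ALLTOALL = auto()


def select_moe_comm_type(num_tokens: int,
                         mc2_tokens_capacity: int,
                         quant_type: str,
                         ep_world_size: int,
                         dynamic_eplb: bool) -> MoECommType:
    """Hypothetical helper mirroring the branch changed by this commit."""
    # The fused operator is only eligible for w8a8_dynamic quantization,
    # an EP world size of at most 16, and with dynamic EPLB turned off
    # (dynamic EPLB keeps its weights in list form).
    fused_all2all_enable = (quant_type == "w8a8_dynamic"
                            and ep_world_size <= 16
                            and not dynamic_eplb)
    # MC2 takes priority whenever the batch fits its token capacity.
    if num_tokens <= mc2_tokens_capacity:
        return MoECommType.MC2
    if fused_all2all_enable:
        return MoECommType.FUSED_ALLTOALL
    return MoECommType.ALLTOALL
```

With dynamic EPLB enabled, a batch that exceeds the MC2 capacity falls back to `ALLTOALL` even when the quantization type and EP size would otherwise allow the fused path. Batches within the MC2 capacity are unaffected by the change.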