[Bugfix] dynamic eplb doesn't use fused_alltoall (#4919)
### What this PR does / why we need it?
The fused alltoall operator was never designed or implemented to handle
tensors passed as lists, but the expert weights used for dynamic load
balancing are in list form.
This PR therefore disables the fused alltoall operator whenever dynamic
load balancing is enabled.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
```diff
@@ -1434,10 +1434,13 @@ class NPUModelRunner(GPUModelRunner):
                 moe_comm_type = MoECommType.ALLGATHER
         elif soc_version in {AscendDeviceType._910_93}:
-            moe_comm_type = (
-                MoECommType.MC2 if num_tokens <= mc2_tokens_capacity else
-                MoECommType.FUSED_ALLTOALL if quant_type == "w8a8_dynamic"
-                and get_ep_group().world_size <= 16 else MoECommType.ALLTOALL)
+            # TODO: drop the EP-size guard when dispatch_ffn_combine supports larger EP sizes
+            fused_all2all_enable = quant_type == "w8a8_dynamic" and get_ep_group(
+            ).world_size <= 16 and (not self.dynamic_eplb)
+            moe_comm_type = (MoECommType.MC2
+                             if num_tokens <= mc2_tokens_capacity else
+                             MoECommType.FUSED_ALLTOALL
+                             if fused_all2all_enable else MoECommType.ALLTOALL)
         else:
             raise ValueError(f"Unsupported soc_version: {soc_version}")
```
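In standalone form, the new selection logic reads as follows. This is a sketch for illustration only: the `select_moe_comm_type` function wrapper and its parameter list are hypothetical, with names borrowed from the diff (`num_tokens`, `mc2_tokens_capacity`, `quant_type`, `dynamic_eplb`, the EP world size), and the enum values are stubbed rather than taken from vLLM's real `MoECommType`.

```python
from enum import Enum


class MoECommType(Enum):
    # Stub of the enum referenced in the diff; actual values in vLLM may differ.
    MC2 = "mc2"
    FUSED_ALLTOALL = "fused_alltoall"
    ALLTOALL = "alltoall"


def select_moe_comm_type(num_tokens: int, mc2_tokens_capacity: int,
                         quant_type: str, ep_world_size: int,
                         dynamic_eplb: bool) -> MoECommType:
    """Hypothetical standalone version of the comm-type selection in the PR."""
    # The guard this PR adds: FUSED_ALLTOALL is only eligible for
    # w8a8_dynamic quantization, EP world size <= 16, and when dynamic
    # expert load balancing is disabled, because the fused kernel cannot
    # handle expert weights stored as lists.
    fused_all2all_enable = (quant_type == "w8a8_dynamic"
                            and ep_world_size <= 16
                            and not dynamic_eplb)
    if num_tokens <= mc2_tokens_capacity:
        return MoECommType.MC2
    if fused_all2all_enable:
        return MoECommType.FUSED_ALLTOALL
    return MoECommType.ALLTOALL
```

With dynamic EPLB on, the fused path is skipped even when the other conditions hold, which is exactly the behavioral change of this fix.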