[Bugfix] Dynamic EPLB doesn't use fused_alltoall (#4919)

### What this PR does / why we need it?
The fused alltoall operator was not designed or implemented to handle
inputs passed as lists of tensors, but dynamic load balancing (EPLB)
keeps its weights in list form.
This PR therefore disables the operator whenever dynamic load balancing
is enabled.

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
LI SHENGYONG
2025-12-16 10:59:30 +08:00
committed by GitHub
parent 195eac665b
commit 0918de58d5

@@ -1434,10 +1434,13 @@ class NPUModelRunner(GPUModelRunner):
                 moe_comm_type = MoECommType.ALLGATHER
         elif soc_version in {AscendDeviceType._910_93}:
-            moe_comm_type = (
-                MoECommType.MC2 if num_tokens <= mc2_tokens_capacity else
-                MoECommType.FUSED_ALLTOALL if quant_type == "w8a8_dynamic"
-                and get_ep_group().world_size <= 16 else MoECommType.ALLTOALL)
+            # TODO: drop the EP-size guard when dispatch_ffn_combine supports larger EP sizes
+            fused_all2all_enable = quant_type == "w8a8_dynamic" and get_ep_group(
+            ).world_size <= 16 and (not self.dynamic_eplb)
+            moe_comm_type = (MoECommType.MC2
+                             if num_tokens <= mc2_tokens_capacity else
+                             MoECommType.FUSED_ALLTOALL
+                             if fused_all2all_enable else MoECommType.ALLTOALL)
         else:
             raise ValueError(f"Unsupported soc_version: {soc_version}")
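
The selection logic in the `_910_93` branch of this hunk can be sketched as a standalone function, which makes the precedence of the chained conditional explicit. This is a minimal illustration and not the project's code: the `MoECommType` enum below is a stand-in limited to the members visible in the diff, and the function name `select_moe_comm_type_910_93` and its plain parameters (`ep_world_size`, `dynamic_eplb`) are hypothetical replacements for `get_ep_group().world_size` and `self.dynamic_eplb`.

```python
from enum import Enum


class MoECommType(Enum):
    # Stand-in enum (assumption): only the members that appear in the diff.
    ALLGATHER = "allgather"
    MC2 = "mc2"
    ALLTOALL = "alltoall"
    FUSED_ALLTOALL = "fused_alltoall"


def select_moe_comm_type_910_93(
    num_tokens: int,
    mc2_tokens_capacity: int,
    quant_type: str,
    ep_world_size: int,
    dynamic_eplb: bool,
) -> MoECommType:
    """Hypothetical helper mirroring the post-fix branch for _910_93."""
    # The fused operator is allowed only for w8a8_dynamic quantization,
    # an EP group of at most 16 ranks, and with dynamic EPLB turned off.
    fused_all2all_enable = (
        quant_type == "w8a8_dynamic"
        and ep_world_size <= 16
        and not dynamic_eplb
    )
    # `A if c1 else B if c2 else C` parses as `A if c1 else (B if c2 else C)`,
    # so the MC2 capacity check always takes priority.
    if num_tokens <= mc2_tokens_capacity:
        return MoECommType.MC2
    if fused_all2all_enable:
        return MoECommType.FUSED_ALLTOALL
    return MoECommType.ALLTOALL
```

With dynamic EPLB enabled, a request that exceeds the MC2 token capacity falls back to `ALLTOALL` even when the quantization and EP-size conditions would otherwise permit the fused operator.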