[Bugfix] dynamic eplb doesn't use fused_alltoall (#4919)

### What this PR does / why we need it?
The fused alltoall operator was neither designed nor implemented to handle
list-valued tensors, but the weights used for dynamic expert load balancing
(dynamic EPLB) are stored as lists.
We therefore disable this operator whenever dynamic load balancing is
enabled.

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Authored by LI SHENGYONG, committed via GitHub on 2025-12-16 10:59:30 +08:00
(commit 0918de58d5, parent 195eac665b).


```diff
@@ -1434,10 +1434,13 @@ class NPUModelRunner(GPUModelRunner):
             moe_comm_type = MoECommType.ALLGATHER
         elif soc_version in {AscendDeviceType._910_93}:
-            moe_comm_type = (
-                MoECommType.MC2 if num_tokens <= mc2_tokens_capacity else
-                MoECommType.FUSED_ALLTOALL if quant_type == "w8a8_dynamic"
-                and get_ep_group().world_size <= 16 else MoECommType.ALLTOALL)
+            # TODO: drop the EP-size guard when dispatch_ffn_combine supports larger EP sizes
+            fused_all2all_enable = quant_type == "w8a8_dynamic" and get_ep_group(
+            ).world_size <= 16 and (not self.dynamic_eplb)
+            moe_comm_type = (MoECommType.MC2
+                             if num_tokens <= mc2_tokens_capacity else
+                             MoECommType.FUSED_ALLTOALL
+                             if fused_all2all_enable else MoECommType.ALLTOALL)
         else:
             raise ValueError(f"Unsupported soc_version: {soc_version}")
```
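The selection logic above can be isolated into a small, testable sketch. This is not the actual vLLM Ascend code; the function name `select_moe_comm_type` and its parameter list are hypothetical, and the enum is stubbed locally so the example runs stand-alone. It mirrors the diff's rule: MC2 when the token count fits the MC2 capacity, fused alltoall only for `w8a8_dynamic` quantization with an EP world size of at most 16 and dynamic EPLB disabled, plain alltoall otherwise.

```python
from enum import Enum


class MoECommType(Enum):
    # Local stub of the real enum for illustration only.
    MC2 = "mc2"
    FUSED_ALLTOALL = "fused_alltoall"
    ALLTOALL = "alltoall"


def select_moe_comm_type(num_tokens: int,
                         mc2_tokens_capacity: int,
                         quant_type: str,
                         ep_world_size: int,
                         dynamic_eplb: bool) -> MoECommType:
    """Sketch of the comm-type choice from the diff (names are assumptions).

    Fused alltoall cannot consume the list-valued weights used by dynamic
    EPLB, so it is only eligible when dynamic_eplb is False.
    """
    if num_tokens <= mc2_tokens_capacity:
        return MoECommType.MC2
    fused_all2all_enable = (quant_type == "w8a8_dynamic"
                            and ep_world_size <= 16
                            and not dynamic_eplb)
    return (MoECommType.FUSED_ALLTOALL
            if fused_all2all_enable else MoECommType.ALLTOALL)
```

For example, with dynamic EPLB off and a small EP group, an over-capacity batch picks the fused path; turning dynamic EPLB on with the same inputs falls back to plain alltoall.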