[Bugfix] dynamic eplb doesn't use fused_alltoall (#4919)
### What this PR does / why we need it?
The fused alltoall operator was never designed or implemented to handle
tensors passed as lists, but the expert weights used for dynamic load
balancing are in list form.
This PR therefore disables the fused alltoall operator whenever dynamic
load balancing is enabled.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
```diff
@@ -1434,10 +1434,13 @@ class NPUModelRunner(GPUModelRunner):
                 moe_comm_type = MoECommType.ALLGATHER
         elif soc_version in {AscendDeviceType._910_93}:
-            moe_comm_type = (
-                MoECommType.MC2 if num_tokens <= mc2_tokens_capacity else
-                MoECommType.FUSED_ALLTOALL if quant_type == "w8a8_dynamic"
-                and get_ep_group().world_size <= 16 else MoECommType.ALLTOALL)
+            # TODO: drop the EP-size guard when dispatch_ffn_combine supports larger EP sizes
+            fused_all2all_enable = quant_type == "w8a8_dynamic" and get_ep_group(
+            ).world_size <= 16 and (not self.dynamic_eplb)
+            moe_comm_type = (MoECommType.MC2
+                             if num_tokens <= mc2_tokens_capacity else
+                             MoECommType.FUSED_ALLTOALL
+                             if fused_all2all_enable else MoECommType.ALLTOALL)
         else:
             raise ValueError(f"Unsupported soc_version: {soc_version}")
```
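In standalone form, the new selection logic reads as follows. This is a sketch for illustration only: the `select_moe_comm_type` function wrapper and its parameter list are hypothetical, with names borrowed from the diff (`num_tokens`, `mc2_tokens_capacity`, `quant_type`, `dynamic_eplb`, the EP world size), and the enum values are stubbed rather than taken from vLLM's real `MoECommType`.

```python
from enum import Enum


class MoECommType(Enum):
    # Stub of the enum referenced in the diff; actual values in vLLM may differ.
    MC2 = "mc2"
    FUSED_ALLTOALL = "fused_alltoall"
    ALLTOALL = "alltoall"


def select_moe_comm_type(num_tokens: int, mc2_tokens_capacity: int,
                         quant_type: str, ep_world_size: int,
                         dynamic_eplb: bool) -> MoECommType:
    """Hypothetical standalone version of the comm-type selection in the PR."""
    # The guard this PR adds: FUSED_ALLTOALL is only eligible for
    # w8a8_dynamic quantization, EP world size <= 16, and when dynamic
    # expert load balancing is disabled, because the fused kernel cannot
    # handle expert weights stored as lists.
    fused_all2all_enable = (quant_type == "w8a8_dynamic"
                            and ep_world_size <= 16
                            and not dynamic_eplb)
    if num_tokens <= mc2_tokens_capacity:
        return MoECommType.MC2
    if fused_all2all_enable:
        return MoECommType.FUSED_ALLTOALL
    return MoECommType.ALLTOALL
```

With dynamic EPLB on, the fused path is skipped even when the other conditions hold, which is exactly the behavioral change of this fix.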