perf: use multicast to avoid padding decode request to prefill size (#1555)

### What this PR does / why we need it?
Use multicast so that decode requests no longer need to be padded to the prefill size, which improves performance.
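
To make the motivation concrete, here is a minimal sketch of the row-count arithmetic and not the PR's actual code. It assumes that a padded all-gather makes every rank contribute as many rows as the largest rank, and that a multicast-style exchange gives each rank a slice sized by its real token count. The function names and the example token counts are hypothetical.

```python
from itertools import accumulate


def padded_allgather_rows(tokens_per_rank):
    """Rows exchanged when every rank pads to the largest rank's token count."""
    return len(tokens_per_rank) * max(tokens_per_rank)


def multicast_layout(tokens_per_rank):
    """Return (total_rows, slices) for a shared buffer sized by real token counts.

    slices[i] is the (start, end) row range that rank i would fill.
    """
    ends = list(accumulate(tokens_per_rank))
    starts = [0] + ends[:-1]
    return ends[-1], list(zip(starts, ends))


# Hypothetical batch: one rank runs a prefill, three ranks run small decodes.
tokens = [4096, 8, 8, 8]
padded = padded_allgather_rows(tokens)
total, slices = multicast_layout(tokens)
print(padded, total, slices)
```

Under these assumptions the decode ranks stop paying for the prefill rank's size: the shared buffer holds only the tokens that actually exist.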

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main: 1fd471e957

Signed-off-by: boying <897013703@qq.com>
NeverRaR
2025-07-07 22:36:03 +08:00
committed by GitHub
parent f08c4f15a2
commit df84cceca8
3 changed files with 81 additions and 34 deletions

@@ -780,7 +780,9 @@ class AscendW8A8DynamicFusedMoEMethod:
                 log2phy=log2phy,
                 global_redundant_expert_num=global_redundant_expert_num,
                 shared_experts=shared_experts)
-        elif fused_moe_state == FusedMoEState.AllGather:
+        elif fused_moe_state in [
+                FusedMoEState.AllGather, FusedMoEState.NaiveMulticast
+        ]:
             return fused_experts(hidden_states=x,
                                  w1=layer.w13_weight,
                                  w1_scale=layer.w13_weight_scale,