perf: use multicast to avoid padding decode request to prefill size (#1555)

### What this PR does / why we need it?

perf: use multicast to avoid padding decode request to prefill size

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main: 1fd471e957

Signed-off-by: boying <897013703@qq.com>
2025-07-07 22:36:03 +08:00
parent f08c4f15a2
commit df84cceca8
3 changed files with 81 additions and 34 deletions
--- a/vllm_ascend/quantization/w8a8_dynamic.py
+++ b/vllm_ascend/quantization/w8a8_dynamic.py
@@ -780,7 +780,9 @@ class AscendW8A8DynamicFusedMoEMethod:
                log2phy=log2phy,
                global_redundant_expert_num=global_redundant_expert_num,
                shared_experts=shared_experts)
-        elif fused_moe_state == FusedMoEState.AllGather:
+        elif fused_moe_state in [
+                FusedMoEState.AllGather, FusedMoEState.NaiveMulticast
+        ]:
            return fused_experts(hidden_states=x,
                                 w1=layer.w13_weight,
                                 w1_scale=layer.w13_weight_scale,