[Perf] Enable npu_moe_gating_top_k_softmax in quantized scenarios (#2633)

### What this PR does / why we need it?
This PR enables `npu_moe_gating_top_k_softmax` when running quantized MoE (such as W8A8). The op makes no distinction between quantized and non-quantized scenarios, so it applies to both. Enabling it reduces TPOT by 3~4 ms.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main: ce30dca5c4

Signed-off-by: Angazenn <supperccell@163.com>
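For readers unfamiliar with the gating step, here is a minimal pure-Python sketch of what a top-k softmax gate computes. It assumes the op applies a softmax over the router logits of each token and then keeps the `k` largest probabilities. The function name `topk_softmax_gating` is hypothetical and is not part of `torch_npu` or vLLM Ascend. The sketch only touches router logits, which is presumably why the fused op behaves the same whether or not the expert weights are quantized.

```python
import math


def topk_softmax_gating(router_logits, k):
    """Illustrative reference only (hypothetical helper, not the NPU kernel).

    router_logits: list of per-token lists, one float per expert.
    Returns (topk_weights, topk_ids), each a list of per-token lists of length k.
    """
    topk_weights, topk_ids = [], []
    for row in router_logits:
        # Numerically stable softmax over the expert dimension.
        row_max = max(row)
        exps = [math.exp(v - row_max) for v in row]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Indices of the k most probable experts, highest first.
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        topk_ids.append(order)
        topk_weights.append([probs[i] for i in order])
    return topk_weights, topk_ids


# One token routed across four experts, keeping the top two.
weights, ids = topk_softmax_gating([[1.0, 3.0, 2.0, 0.0]], k=2)
```

Note that the inputs here are floating-point router logits. Nothing in the computation depends on the dtype or quantization scheme of the expert weights, which is consistent with the removal of the `is_unquantized` argument in the diff below.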
2025-09-03 09:14:17 +08:00
parent 24d4dad7b2
commit b84465c525
3 changed files with 33 additions and 15 deletions
--- a/vllm_ascend/ops/fused_moe.py
+++ b/vllm_ascend/ops/fused_moe.py
@@ -173,8 +173,7 @@ class AscendUnquantizedFusedMoEMethod(UnquantizedFusedMoEMethod):
            custom_routing_function=custom_routing_function,
            scoring_func=scoring_func,
            e_score_correction_bias=e_score_correction_bias,
-            global_num_experts=global_num_experts,
-            is_unquantized=True)
+            global_num_experts=global_num_experts)

        topk_weights = topk_weights.to(x.dtype)
        # this is a naive implementation for experts load balance so as