use npu_moe_gating_top_k_softmax (#1355)

### What this PR does / why we need it?
For non-DeepSeek models, `select_experts` is optimized by replacing the separate softmax + topk + to() sequence with the fused `npu_moe_gating_top_k_softmax` operator, cutting latency from 37us to 14us on bf16/fp16 Qwen3-235B.
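
For context, a minimal sketch of the two gating paths being compared. This is illustrative only: the wrapper function names are hypothetical, and the `torch_npu.npu_moe_gating_top_k_softmax` signature shown (logits, optional `finished` mask, `k`) may differ across torch_npu releases.

```python
import torch
import torch_npu  # Ascend NPU extension; assumed available


def select_experts_baseline(router_logits: torch.Tensor, top_k: int):
    # Separate-op path: softmax, topk, then dtype casts (~37us).
    scores = torch.softmax(router_logits, dim=-1, dtype=torch.float32)
    topk_weights, topk_ids = torch.topk(scores, top_k, dim=-1)
    return topk_weights.to(router_logits.dtype), topk_ids.to(torch.int32)


def select_experts_fused(router_logits: torch.Tensor, top_k: int):
    # Fused path: a single NPU kernel performs softmax + top-k
    # (~14us on bf16/fp16 Qwen3-235B, per the numbers above).
    topk_weights, topk_ids, _row_idx = torch_npu.npu_moe_gating_top_k_softmax(
        router_logits, None, k=top_k)
    return topk_weights, topk_ids
```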

- vLLM version: v0.9.2
- vLLM main: 1a4f35e2ea

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
ttanzhiqiang authored on 2025-07-11 08:55:06 +08:00, committed by GitHub
commit ee40d3d850 · parent 9d16c9982e
4 changed files with 107 additions and 14 deletions

@@ -117,6 +117,11 @@ env_variables: Dict[str, Callable[[], Any]] = {
# value to False to disable the optimized model.
"USE_OPTIMIZED_MODEL":
lambda: bool(int(os.getenv('USE_OPTIMIZED_MODEL', '1'))),
# SELECT_GATING_TOPK_SOTFMAX_EXPERTS enables a drop-in equivalent of select_experts in non-quantized scenarios.
# In theory, it should perform better than select_experts.
# Future versions will remove the SELECT_GATING_TOPK_SOTFMAX_EXPERTS flag and make this path the default.
"SELECT_GATING_TOPK_SOTFMAX_EXPERTS":
lambda: bool(int(os.getenv("SELECT_GATING_TOPK_SOTFMAX_EXPERTS", '0'))),
# The tolerance of the kv cache size, if the difference between the
# actual kv cache size and the cached kv cache size is less than this value,
# then the cached kv cache size will be used.
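
To try the new path, the flag can be enabled through the environment; a minimal sketch, assuming the lazy `os.getenv` lookup shown in the diff above:

```python
import os

# Opt in to the fused gating path. The env_variables entry reads the
# variable lazily via os.getenv, so setting this before engine
# initialization is sufficient.
os.environ["SELECT_GATING_TOPK_SOTFMAX_EXPERTS"] = "1"
```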