GMM custom operator optimization in small batch scenarios (#7100)

### What this PR does / why we need it?
Optimize the GMM (grouped matrix multiplication) custom operator for small-batch scenarios to reduce decode latency (TPOT). The hunk shown below switches the kernel invoked by `moe_grouped_matmul` from `aclnnMoeGroupedMatmulWeightNz` to `aclnnMoeGroupedMatmul`.

### How was this patch tested?

Benchmarked with Qwen3-30B, input 4k tokens, output 1k tokens (before -> after):

| Batch | TPOT (ms)    | Output Token Throughput (token/s) |
|-------|--------------|-----------------------------------|
| 1     | 7.9 -> 7.0   | 125.4651 -> 140.6278              |
| 2     | 9.4 -> 8.8   | 211.8187 -> 225.2254              |
| 16    | 13.6 -> 13.5 | 1159.8213 -> 1165.0982            |
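
As a rough sanity check on the table above (illustrative only, not part of this PR): if each decode step emits one token per request and takes TPOT milliseconds, aggregate output throughput is about `batch * 1000 / TPOT` token/s, which matches the measured numbers within roughly 2%.

```cpp
#include <cstdio>

// Illustrative sanity check, not part of this PR: approximate the aggregate
// output-token throughput from batch size and the post-change TPOT values
// above. Expected and measured agree within ~2%; the remaining gap is
// per-step overhead not captured by TPOT alone.
int main() {
  struct Row { int batch; double tpot_ms; double measured_tok_s; };
  const Row rows[] = {
      {1, 7.0, 140.6278},
      {2, 8.8, 225.2254},
      {16, 13.5, 1165.0982},
  };
  for (const Row& r : rows) {
    // batch tokens per step, one step every TPOT milliseconds.
    const double expected = r.batch * 1000.0 / r.tpot_ms;
    std::printf("batch %2d: expected %7.1f tok/s, measured %7.1f tok/s\n",
                r.batch, expected, r.measured_tok_s);
  }
  return 0;
}
```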

- vLLM version: v0.16.0
- vLLM main: 4034c3d32e

---------

Signed-off-by: chenxi-hh <chen464822955@163.com>
Author: chenxi-hh
Committed: 2026-03-19 16:10:30 +08:00 (by GitHub)
Commit: 42bcad7e9b (parent 8e0ebb470a)
3 changed files with 71 additions and 30 deletions


@@ -697,7 +697,7 @@ std::vector<at::Tensor> moe_grouped_matmul(
   y.emplace_back(y_0);
   at::TensorList result = at::TensorList(y);
-  EXEC_NPU_CMD(aclnnMoeGroupedMatmulWeightNz,
+  EXEC_NPU_CMD(aclnnMoeGroupedMatmul,
                x_list, weight_list, group_list, transpose_weight, result);
   return y;
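
For context, a minimal sketch of how a call site like the one above could pick between the two aclnn kernels based on token count. This is a hypothetical fragment, not the logic in this PR: the hunk shown replaces the call unconditionally, and `kSmallBatchThreshold` is an assumed tuning knob; `x_list`, `weight_list`, `group_list`, `transpose_weight`, and `result` are the same variables as in the diff.

```cpp
// Hypothetical fragment for the moe_grouped_matmul call site above; NOT the
// logic in this PR (the hunk simply replaces the WeightNz call).
constexpr int64_t kSmallBatchThreshold = 16;  // assumption, not from the PR
const int64_t num_tokens = x_list.empty() ? 0 : x_list[0].size(0);

if (num_tokens <= kSmallBatchThreshold) {
  // Small batches: plain grouped matmul (the kernel this PR switches to).
  EXEC_NPU_CMD(aclnnMoeGroupedMatmul,
               x_list, weight_list, group_list, transpose_weight, result);
} else {
  // Larger batches: keep the NZ-format weight kernel.
  EXEC_NPU_CMD(aclnnMoeGroupedMatmulWeightNz,
               x_list, weight_list, group_list, transpose_weight, result);
}
```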