GMM custom operator optimization in small batch scenarios (#7100)

### What this PR does / why we need it?
Optimize the GMM (grouped matrix multiplication) custom operator for small-batch scenarios to reduce decode latency (TPOT). The hunk shown below switches the kernel invoked by `moe_grouped_matmul` from `aclnnMoeGroupedMatmulWeightNz` to `aclnnMoeGroupedMatmul`.

### How was this patch tested?

Benchmarked with Qwen3-30B, input 4k tokens, output 1k tokens (before -> after):

| Batch | TPOT (ms)    | Output Token Throughput (token/s) |
|-------|--------------|-----------------------------------|
| 1     | 7.9 -> 7.0   | 125.4651 -> 140.6278              |
| 2     | 9.4 -> 8.8   | 211.8187 -> 225.2254              |
| 16    | 13.6 -> 13.5 | 1159.8213 -> 1165.0982            |
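
As a rough sanity check on the table above (illustrative only, not part of this PR): if each decode step emits one token per request and takes TPOT milliseconds, aggregate output throughput is about `batch * 1000 / TPOT` token/s, which matches the measured numbers within roughly 2%.

```cpp
#include <cstdio>

// Illustrative sanity check, not part of this PR: approximate the aggregate
// output-token throughput from batch size and the post-change TPOT values
// above. Expected and measured agree within ~2%; the remaining gap is
// per-step overhead not captured by TPOT alone.
int main() {
  struct Row { int batch; double tpot_ms; double measured_tok_s; };
  const Row rows[] = {
      {1, 7.0, 140.6278},
      {2, 8.8, 225.2254},
      {16, 13.5, 1165.0982},
  };
  for (const Row& r : rows) {
    // batch tokens per step, one step every TPOT milliseconds.
    const double expected = r.batch * 1000.0 / r.tpot_ms;
    std::printf("batch %2d: expected %7.1f tok/s, measured %7.1f tok/s\n",
                r.batch, expected, r.measured_tok_s);
  }
  return 0;
}
```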

- vLLM version: v0.16.0
- vLLM main: 4034c3d32e

---------

Signed-off-by: chenxi-hh <chen464822955@163.com>
Author: chenxi-hh
Committed: 2026-03-19 16:10:30 +08:00 (by GitHub)
Commit: 42bcad7e9b (parent 8e0ebb470a)
3 changed files with 71 additions and 30 deletions


@@ -697,7 +697,7 @@ std::vector<at::Tensor> moe_grouped_matmul(
   y.emplace_back(y_0);
   at::TensorList result = at::TensorList(y);
-  EXEC_NPU_CMD(aclnnMoeGroupedMatmulWeightNz,
+  EXEC_NPU_CMD(aclnnMoeGroupedMatmul,
                x_list, weight_list, group_list, transpose_weight, result);
   return y;
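
For context, a minimal sketch of how a call site like the one above could pick between the two aclnn kernels based on token count. This is a hypothetical fragment, not the logic in this PR: the hunk shown replaces the call unconditionally, and `kSmallBatchThreshold` is an assumed tuning knob; `x_list`, `weight_list`, `group_list`, `transpose_weight`, and `result` are the same variables as in the diff.

```cpp
// Hypothetical fragment for the moe_grouped_matmul call site above; NOT the
// logic in this PR (the hunk simply replaces the WeightNz call).
constexpr int64_t kSmallBatchThreshold = 16;  // assumption, not from the PR
const int64_t num_tokens = x_list.empty() ? 0 : x_list[0].size(0);

if (num_tokens <= kSmallBatchThreshold) {
  // Small batches: plain grouped matmul (the kernel this PR switches to).
  EXEC_NPU_CMD(aclnnMoeGroupedMatmul,
               x_list, weight_list, group_list, transpose_weight, result);
} else {
  // Larger batches: keep the NZ-format weight kernel.
  EXEC_NPU_CMD(aclnnMoeGroupedMatmulWeightNz,
               x_list, weight_list, group_list, transpose_weight, result);
}
```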