[Feature] Add token mask for DispatchGmmCombineDecode operator (#5171)

### What this PR does / why we need it?
In this PR, DispatchGmmCombineDecode adds an optional input,
x_active_mask. When it is provided,
only tokens whose mask value is True are dispatched and handled.
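
The mask behaviour described above can be sketched in plain C++ as follows. This is an illustration only, not the operator's implementation: `select_active_tokens` is a hypothetical helper, the mask is modelled as one boolean per token, and treating a missing mask as "all tokens active" is an assumption drawn from the input being optional.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper (not part of the operator): returns the indices of the
// tokens that would be dispatched under the mask semantics described above.
std::vector<std::size_t> select_active_tokens(
    std::size_t num_tokens,
    const std::optional<std::vector<bool>> &x_active_mask)
{
    std::vector<std::size_t> active;
    if (!x_active_mask.has_value()) {
        // Assumption: with no mask supplied, every token is dispatched.
        for (std::size_t i = 0; i < num_tokens; ++i) {
            active.push_back(i);
        }
        return active;
    }
    // Assumption: the mask carries exactly one entry per token.
    assert(x_active_mask->size() == num_tokens);
    for (std::size_t i = 0; i < num_tokens; ++i) {
        if ((*x_active_mask)[i]) {
            active.push_back(i);  // only tokens masked True are kept
        }
    }
    return active;
}
```

Under these assumptions, calling the helper with `std::nullopt` keeps every token, while a mask of `{true, false, true, false}` keeps only tokens 0 and 2.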


- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
wangqiankun13
2025-12-19 16:31:48 +08:00
committed by GitHub
parent 636265be6d
commit 118b0ed346
14 changed files with 292 additions and 96 deletions

@@ -161,8 +161,9 @@ std::tuple<at::Tensor, at::Tensor> dispatch_gmm_combine_decode_meta(
 const at::Tensor &gmm1_permuted_weight_scale,
 const at::Tensor &gmm2_weight,
 const at::Tensor &gmm2_weight_scale,
-const at::Tensor &expert_scales,
 const c10::optional<at::Tensor> &expert_smooth_scales,
+const c10::optional<at::Tensor> &expert_scales,
+const c10::optional<at::Tensor> &x_active_mask,
 c10::string_view group_ep,
 int64_t ep_rank_size,
 int64_t ep_rank_id,