Remove qwen3 moe MC2 cumsum & cast (#3126)
What this PR does / why we need it? The Qwen3 moe MC2 graph currently has two redundant computational operator implementations. After npu_moe_distribute_dispatch_v2, the cumsum and cast operations have been added. By using expert_token_nums_type=0 and not converting weight_scale to float32, these two operators can be eliminated, thereby improving inference performance. Does this PR introduce any user-facing change? No How was this patch tested? No need vLLM version: v0.10.2 vLLM main:f225ea7dd9- vLLM version: v0.10.2 - vLLM main:f225ea7dd9--------- Signed-off-by: florenceCH <gaoxiang120@huawei.com> Co-authored-by: florenceCH <gaoxiang120@huawei.com>
This commit is contained in:
@@ -79,8 +79,6 @@ def quant_apply_mlp(hidden_states: torch.Tensor,
|
||||
|
||||
is_mc2 = get_forward_context().moe_comm_type == MoECommType.MC2
|
||||
if w1_scale_bias is None and is_mc2:
|
||||
if w1_scale.dtype != torch.float32:
|
||||
w1_scale = w1_scale.to(torch.float32)
|
||||
if fusion:
|
||||
# gmm1: gate_up_proj & act_fn: swiglu
|
||||
hidden_states, swiglu_out_scale, _ = torch_npu.npu_grouped_matmul_swiglu_quant(
|
||||
@@ -90,6 +88,8 @@ def quant_apply_mlp(hidden_states: torch.Tensor,
|
||||
weight_scale=w1_scale,
|
||||
x_scale=pertoken_scale)
|
||||
else:
|
||||
if w1_scale.dtype != torch.float32:
|
||||
w1_scale = w1_scale.to(torch.float32)
|
||||
# gmm1: gate_up_proj
|
||||
hidden_states = torch_npu.npu_grouped_matmul(
|
||||
x=[hidden_states],
|
||||
|
||||
Reference in New Issue
Block a user