[Bugfix] Support Qwen3-MOE on aclgraph mode (#1381)

### What this PR does / why we need it?

Fix the shape of the `npu_moe_init_routing` input parameters to support aclgraph mode on Qwen3-MOE.

In addition to this PR, resolving the `gatherv3` error might be necessary. See related PRs:

- https://github.com/vllm-project/vllm-ascend/pull/1297
- https://github.com/vllm-project/vllm-ascend/pull/1446

Thanks to @yiz-liu for providing the idea.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested on Qwen3-30B-A3B

Closes: https://github.com/vllm-project/vllm-ascend/issues/1368

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-06 15:29:36 +08:00
parent 14373f65d7
commit c58accc15e
3 changed files with 21 additions and 3 deletions
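The change to `fused_experts` in the diff below comes down to how the `active_num` argument of `npu_moe_init_routing` is chosen. The following is a minimal sketch of that fallback in plain Python, with no NPU required. The helper name `resolve_active_num` is hypothetical and does not exist in the repository; the reading that a constant value is what aclgraph mode needs is an assumption based on the commit message's "fix the shape" wording.

```python
from typing import Optional


def resolve_active_num(num_tokens: int,
                       max_num_tokens: Optional[int] = None) -> int:
    """Hypothetical helper mirroring the fallback added in this commit.

    Returns max_num_tokens when the caller supplies one, otherwise the
    number of tokens in the current batch (the previous behaviour).
    """
    return max_num_tokens if max_num_tokens is not None else num_tokens


# Without a cap, the value follows the batch size and varies per step.
print([resolve_active_num(n) for n in (3, 7, 16)])
# prints [3, 7, 16]

# With a cap, the value is identical for every batch size.
print([resolve_active_num(n, max_num_tokens=16) for n in (3, 7, 16)])
# prints [16, 16, 16]
```

Because the new parameter defaults to `None`, callers that do not pass `max_num_tokens` keep the earlier `active_num=num_tokens` behaviour.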
--- a/vllm_ascend/ops/fused_moe.py
+++ b/vllm_ascend/ops/fused_moe.py
@@ -655,6 +655,7 @@ def fused_experts(
    top_k: int,
    expert_map: torch.Tensor = None,
    apply_router_weight_on_input: bool = False,
+    max_num_tokens: Optional[int] = None,
 ) -> torch.Tensor:
    """
    Fused experts with top-k routing.
@@ -748,11 +749,12 @@ def fused_experts(
                                dtype=torch.int32,
                                device=device).view(top_k, -1).permute(
                                    1, 0).contiguous())
+        active_num = max_num_tokens if max_num_tokens is not None else num_tokens
        sorted_hidden_states, expanded_row_idx, expanded_expert_idx = torch_npu.npu_moe_init_routing(
            hidden_states,
            row_idx=row_idx,
            expert_idx=topk_ids,
-            active_num=num_tokens)
+            active_num=active_num)

        expert_tokens = torch_npu.npu_moe_compute_expert_tokens(
            expanded_expert_idx, num_experts)