Remove qwen3 moe MC2 cumsum & cast (#3126)
What this PR does / why we need it?
The Qwen3 MoE MC2 graph currently contains two redundant operators: a cumsum and a cast inserted after `npu_moe_distribute_dispatch_v2`. By setting `expert_token_nums_type=0` and not converting `weight_scale` to float32, both operators can be eliminated, improving inference performance.

Does this PR introduce any user-facing change?
No

How was this patch tested?
No need.

- vLLM version: v0.10.2
- vLLM main: f225ea7dd9

Signed-off-by: florenceCH <gaoxiang120@huawei.com>
Co-authored-by: florenceCH <gaoxiang120@huawei.com>
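A minimal sketch of why the cumsum becomes redundant. The `expert_token_nums_type` / `group_list_type` semantics below are assumptions for illustration, not the vLLM-Ascend API: with `group_list_type == 1` the downstream grouped matmul expects cumulative token offsets, so an extra cumsum over the per-expert token counts is inserted; with `group_list_type == 0` it consumes the raw counts the dispatch op already emits, so no cumsum is needed. The function name `to_group_list` is hypothetical.

```python
# Illustrative sketch only; names and layout semantics are assumptions,
# not the actual npu_moe_distribute_dispatch_v2 interface.
from itertools import accumulate


def to_group_list(expert_token_nums, group_list_type):
    """Return the group list in the layout the grouped matmul expects."""
    if group_list_type == 1:
        # Cumulative offsets: requires an extra cumsum pass over the counts.
        return list(accumulate(expert_token_nums))
    # group_list_type == 0: per-expert counts are used as-is -- no cumsum.
    return list(expert_token_nums)


counts = [3, 0, 5, 2]            # tokens routed to each of 4 experts
print(to_group_list(counts, 1))  # cumulative offsets: [3, 3, 8, 10]
print(to_group_list(counts, 0))  # raw counts, unchanged: [3, 0, 5, 2]
```

Since the dispatch op can emit the counts directly when `expert_token_nums_type=0`, the cumsum pass (and, analogously, the float32 cast of `weight_scale`) drops out of the graph.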
@@ -98,7 +98,7 @@ class TestTokenDispatcherWithMC2(TestBase):
                                           self.row_idx, expert_map)
         mock_dispatch.assert_called_once()
         self.assertEqual(output["group_list_type"],
-                         1)  # group_list_type == 1
+                         0)  # group_list_type == 0

     def test_token_dispatch_with_shared_experts_and_quant(self):
         self.shared_experts = MagicMock()