Remove qwen3 moe MC2 cumsum & cast (#3126)
What this PR does / why we need it?
The Qwen3 MoE MC2 graph currently contains two redundant operators: a cumsum and a cast inserted after `npu_moe_distribute_dispatch_v2`. By setting `expert_token_nums_type=0` and not converting `weight_scale` to float32, both operators can be eliminated, improving inference performance.

Does this PR introduce any user-facing change?
No

How was this patch tested?
No need.

- vLLM version: v0.10.2
- vLLM main: f225ea7dd9

Signed-off-by: florenceCH <gaoxiang120@huawei.com>
Co-authored-by: florenceCH <gaoxiang120@huawei.com>
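A minimal sketch of why the cumsum becomes redundant. The `expert_token_nums_type` / `group_list_type` semantics below are assumptions for illustration, not the vLLM-Ascend API: with `group_list_type == 1` the downstream grouped matmul expects cumulative token offsets, so an extra cumsum over the per-expert token counts is inserted; with `group_list_type == 0` it consumes the raw counts the dispatch op already emits, so no cumsum is needed. The function name `to_group_list` is hypothetical.

```python
# Illustrative sketch only; names and layout semantics are assumptions,
# not the actual npu_moe_distribute_dispatch_v2 interface.
from itertools import accumulate


def to_group_list(expert_token_nums, group_list_type):
    """Return the group list in the layout the grouped matmul expects."""
    if group_list_type == 1:
        # Cumulative offsets: requires an extra cumsum pass over the counts.
        return list(accumulate(expert_token_nums))
    # group_list_type == 0: per-expert counts are used as-is -- no cumsum.
    return list(expert_token_nums)


counts = [3, 0, 5, 2]            # tokens routed to each of 4 experts
print(to_group_list(counts, 1))  # cumulative offsets: [3, 3, 8, 10]
print(to_group_list(counts, 0))  # raw counts, unchanged: [3, 0, 5, 2]
```

Since the dispatch op can emit the counts directly when `expert_token_nums_type=0`, the cumsum pass (and, analogously, the float32 cast of `weight_scale`) drops out of the graph.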
@@ -98,7 +98,7 @@ class TestTokenDispatcherWithMC2(TestBase):
                                           self.row_idx, expert_map)
         mock_dispatch.assert_called_once()
         self.assertEqual(output["group_list_type"],
-                         1)  # group_list_type == 1
+                         0)  # group_list_type == 0

     def test_token_dispatch_with_shared_experts_and_quant(self):
         self.shared_experts = MagicMock()