[2/N][Feat] Add MC2 communication method for MoE layers (#2469)

### What this PR does / why we need it?
This PR introduces the MC2 communication method for MoE layers, replacing the
previous all-gather approach when the number of input tokens is small.

The key changes include:
- A new `AscendFusedMoE` layer that handles token splitting, local
computation, and final aggregation via all-gather.
- Logic in the model runner to dynamically select between the new MC2
method and the existing all-gather method based on the number of input
tokens.
- Sharding of the MoE communication mask across tensor-parallel ranks; a
minimal sketch of the selection and sharding logic follows below.
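
The following is an illustrative Python sketch of the two pieces above. Every
name, the threshold value, and the slicing scheme are assumptions for
exposition, not the identifiers or values used in this PR:

```python
# Illustrative sketch only: MoECommMethod, MC2_TOKEN_THRESHOLD, and the
# helper names below are hypothetical, not the identifiers in this PR.
from enum import Enum

import torch


class MoECommMethod(Enum):
    MC2 = "mc2"                # new method, preferred for small token counts
    ALL_GATHER = "all_gather"  # previous approach, kept for larger batches


# Hypothetical cutoff; a real value would come from hardware profiling.
MC2_TOKEN_THRESHOLD = 512


def select_moe_comm_method(num_input_tokens: int) -> MoECommMethod:
    """Dispatch in the model runner based on the number of input tokens."""
    if num_input_tokens <= MC2_TOKEN_THRESHOLD:
        return MoECommMethod.MC2
    return MoECommMethod.ALL_GATHER


def shard_comm_mask(mask: torch.Tensor, tp_rank: int,
                    tp_size: int) -> torch.Tensor:
    """Keep only this tensor-parallel rank's slice of the communication mask."""
    assert mask.shape[0] % tp_size == 0
    chunk = mask.shape[0] // tp_size
    return mask[tp_rank * chunk:(tp_rank + 1) * chunk]
```

Note that in this sketch the dispatch depends only on the batch's token count,
so the runner can in principle switch methods per forward pass without
reconfiguring the layer.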

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
The existing unit test was fixed to account for the change (see the diff
excerpt below).


- vLLM version: v0.10.1.1
- vLLM main:
b00e69f8ca

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Commit: a6bb502e70 (parent: 5d8ec28009)
Author: yiz-liu
Date: 2025-08-26 19:05:23 +08:00
Committed by: GitHub
11 changed files with 506 additions and 410 deletions

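The updated unit test below reflects the change: the expected `register_oot`
call count rises from 8 to 9, presumably because the new `AscendFusedMoE`
custom op is registered in addition to the existing ones.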

```diff
@@ -289,13 +289,13 @@ class TestUtils(TestBase):
         # ascend custom op is not registered
         utils.register_ascend_customop()
         # should call register_oot three
-        self.assertEqual(mock_customop.register_oot.call_count, 8)
+        self.assertEqual(mock_customop.register_oot.call_count, 9)
         self.assertTrue(utils._ASCEND_CUSTOMOP_IS_REIGISTERED)
         # ascend custom op is already registered
         utils.register_ascend_customop()
         # should not register_oot again, thus only called three in this ut
-        self.assertEqual(mock_customop.register_oot.call_count, 8)
+        self.assertEqual(mock_customop.register_oot.call_count, 9)


 class TestProfileExecuteDuration(TestBase):
```