[2/N][Feat] Add MC2 communication method for MoE layers (#2469)

### What this PR does / why we need it?
This PR adds the MC2 communication method for MoE layers. MC2 replaces
the previous all-gather approach when the number of input tokens is small.

The key changes include:
- A new `AscendFusedMoE` layer that handles token splitting, local
computation, and final aggregation via all-gather.
- Logic in the model runner to dynamically select between the new MC2
method and the existing all-gather method based on the number of input
tokens.
- Sharding the MoE communication mask across tensor-parallel ranks.
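
A minimal sketch of the last two points, not the code in this PR. The helper names `select_moe_comm_method` and `shard_mask`, the `mc2_tokens_capacity` threshold, and the `"mc2commimpl"` string are assumptions for illustration; only `"allgathercommimpl"` appears in the diff below.

```python
from typing import List


def select_moe_comm_method(num_tokens: int, mc2_tokens_capacity: int) -> str:
    """Pick a communication method name from the number of input tokens.

    Hypothetical helper: the threshold and the "mc2commimpl" name are
    assumptions; the PR only states that MC2 is used for small token counts.
    """
    if num_tokens <= mc2_tokens_capacity:
        return "mc2commimpl"
    return "allgathercommimpl"


def shard_mask(mask: List[bool], tp_size: int, tp_rank: int) -> List[bool]:
    """Return the contiguous slice of `mask` owned by one tensor-parallel rank.

    Hypothetical helper: assumes the mask length divides evenly by tp_size.
    """
    if len(mask) % tp_size != 0:
        raise ValueError("mask length must be divisible by tp_size")
    chunk = len(mask) // tp_size
    return mask[tp_rank * chunk:(tp_rank + 1) * chunk]


# Small batches take the MC2 path, larger ones fall back to all-gather.
small = select_moe_comm_method(num_tokens=16, mc2_tokens_capacity=512)
large = select_moe_comm_method(num_tokens=1024, mc2_tokens_capacity=512)

# Each rank keeps only its own slice of the communication mask.
rank1_mask = shard_mask([True] * 4 + [False] * 4, tp_size=2, tp_rank=1)
```

In this sketch the choice is a pure function of the token count, so the model runner could make it once per forward pass and store the result in the forward context.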

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Test case fixed.


- vLLM version: v0.10.1.1
- vLLM main: b00e69f8ca

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
yiz-liu
2025-08-26 19:05:23 +08:00
committed by GitHub
parent 5d8ec28009
commit a6bb502e70
11 changed files with 506 additions and 410 deletions


@@ -497,6 +497,10 @@ class PanguProMoESparseMoeBlock(nn.Module):
        router_logits, _ = self.gate(hidden_states)
        global _ROUTER_SCALE
        _ROUTER_SCALE = self.router_scale
        # TODO(angazenn): Does not support MC2 currently
        get_forward_context().moe_comm_method_name = "allgathercommimpl"
        if not use_h2p():
            final_hidden_states = self.experts.forward_impl(
                hidden_states=hidden_states, router_logits=router_logits)