[Refactor] Add expert processed token count output for DispatchFFNCombine/DispatchFFNCombineBF16 (#6402)

### What this PR does / why we need it?

Add New Output for Expert Token Count

An additional output tensor `expert_token_nums` is added to both operators to meet the requirement of tracking token distribution among experts:

- Tensor Name: `expert_token_nums`
- Dimension: 1D tensor
- Shape: `(local_expert_num,)`
- Data Type: `int32`
- Semantics: Represents the number of tokens actually received by each expert on the current card.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main: dc917cceb8

---------

Signed-off-by: guanguan0308 <1546542263@qq.com>
Signed-off-by: guanguan0308 <162653673+guanguan0308@users.noreply.github.com>
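
The `expert_token_nums` semantics described above can be illustrated with a small pure-Python reference. This is only a sketch for reasoning about expected values, not the operator's implementation. The function name `count_local_expert_tokens`, its parameters, and the assumption that the local experts on a card occupy a contiguous range of global expert ids are all hypothetical and do not come from this PR.

```python
from typing import List, Sequence


def count_local_expert_tokens(
    topk_ids: Sequence[Sequence[int]],
    local_expert_start: int,
    local_expert_num: int,
) -> List[int]:
    """Count how many token copies are routed to each local expert.

    topk_ids: for every token (gathered from all ranks), the global ids
        of the experts it was routed to.
    local_expert_start: global id of the first expert on this card
        (assumption: local experts form a contiguous id range).
    local_expert_num: number of experts hosted on this card.
    """
    # One counter per local expert, mirroring the (local_expert_num,) shape.
    counts = [0] * local_expert_num
    for token_experts in topk_ids:
        for expert_id in token_experts:
            local_idx = expert_id - local_expert_start
            # Ignore tokens routed to experts hosted on other cards.
            if 0 <= local_idx < local_expert_num:
                counts[local_idx] += 1
    return counts


# Three tokens, top-2 routing, this card hosts global experts 2 and 3.
example_counts = count_local_expert_tokens([[0, 2], [2, 3], [1, 2]], 2, 2)
```

A reference like this might be useful in a unit test that compares the tensor returned by the fused operator against counts derived from the routing result, assuming the routing ids for all ranks are available to the test.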
2026-02-03 10:41:06 +08:00
parent 26b83f8bde
commit dffac6db73
18 changed files with 97 additions and 84 deletions
--- a/vllm_ascend/ops/fused_moe/moe_comm_method.py
+++ b/vllm_ascend/ops/fused_moe/moe_comm_method.py
@@ -315,6 +315,7 @@ class FusedMC2CommImpl(MoECommMethod):
        expert_tokens = None
        if envs_ascend.VLLM_ASCEND_ENABLE_FUSED_MC2 == 1:
            out = torch.empty_like(hidden_states)
+            expert_token_nums = torch.zeros([self.moe_config.num_local_experts], dtype=torch.int32)
            torch.ops._C_ascend.dispatch_ffn_combine(  # type: ignore
                x=hidden_states,
                weight1=w1,
@@ -326,7 +327,9 @@ class FusedMC2CommImpl(MoECommMethod):
                group=self.token_dispatcher.moe_all_to_all_group_name,
                max_output_size=65536,
                out=out,
+                expert_token_nums=expert_token_nums,
            )
+            expert_tokens = expert_token_nums
        elif envs_ascend.VLLM_ASCEND_ENABLE_FUSED_MC2 == 2:
            assert expert_map is not None, "expert_map cannot be None."
            group_list_type = 1