[Bugfix][DispatchFFNCombine] resolve vec error caused by unaligned UB access (#6707)

### What this PR does / why we need it?

1. Fix a vec error caused by unaligned UB access in the DispatchFFNCombine op.
2. Fix the `expert_token_nums` tensor being defined on the host instead of the NPU in `moe_comm_method.py`.
3. Fix the multi-core copy issue of `expert_token_nums` in the DispatchFFNCombine op (a single AIV copy is sufficient).

### Does this PR introduce _any_ user-facing change?

No, this PR does not introduce any user-facing changes. The fix only addresses internal memory access logic and does not modify any public APIs, interfaces, or user-visible behaviors.

### How was this patch tested?

`export VLLM_ASCEND_ENABLE_FUSED_MC2=1`

- vLLM version: v0.15.0
- vLLM main: 9562912cea

Signed-off-by: xulei_ict <xulei292@huawei.com>
Co-authored-by: xulei_ict <xulei292@huawei.com>
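
The single-core copy in item 3 can be sketched on the host as follows. This is only an illustrative model under stated assumptions: `CopyBuffer` and `RunGuardedCopy` are hypothetical names, the "cores" are simulated by a plain loop, and none of this is the AscendC `CopyGMToGM` API or the kernel's actual code. It shows that when every core would copy the same source into the same destination, guarding the copy with `coreIdx == 0` yields the same result with one write pass.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for a global-memory copy.
// Copies `count` elements from src to dst and returns how many were written.
std::size_t CopyBuffer(std::vector<int32_t>& dst,
                       const std::vector<int32_t>& src,
                       std::size_t count) {
    assert(count <= src.size() && count <= dst.size());
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = src[i];
    }
    return count;
}

// Simulates the same kernel body running on `coreNum` cores, one after another.
// Only core 0 performs the copy; all other cores skip it.
// Returns the total number of elements written across all simulated cores.
std::size_t RunGuardedCopy(std::vector<int32_t>& dst,
                           const std::vector<int32_t>& src,
                           uint32_t coreNum) {
    std::size_t written = 0;
    for (uint32_t coreIdx = 0; coreIdx < coreNum; ++coreIdx) {
        if (coreIdx == 0) {
            written += CopyBuffer(dst, src, src.size());
        }
    }
    return written;
}
```

With eight simulated cores and a four-element source, `RunGuardedCopy` writes four elements in total rather than thirty-two, and the destination still matches the source.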
2026-02-14 10:32:50 +08:00
parent e2175d9c7e
commit 1e77077788
3 changed files with 19 additions and 12 deletions
--- a/csrc/dispatch_ffn_combine_bf16/op_kernel/dispatch_ffn_combine_bf16_kernel.hpp
+++ b/csrc/dispatch_ffn_combine_bf16/op_kernel/dispatch_ffn_combine_bf16_kernel.hpp
@@ -756,8 +756,9 @@ CATLASS_DEVICE
        ExpertTokenNums.SetGlobalBuffer(reinterpret_cast<__gm__ int32_t*>(params.ptrExpertTokenNums));
        AscendC::GlobalTensor<int32_t> LcalCumsumMM;
        LcalCumsumMM.SetGlobalBuffer(reinterpret_cast<__gm__ int32_t*>(workspaceInfo.ptrcumsumMM + (params.EP - 1) * params.expertPerRank * sizeof(int32_t)));
-        CopyGMToGM(ExpertTokenNums, LcalCumsumMM, params.expertPerRank, params.ubMoveNum);
-        AscendC::SyncAll<true>();
+        if (coreIdx == 0) {
+            CopyGMToGM(ExpertTokenNums, LcalCumsumMM, params.expertPerRank, params.ubMoveNum);
+        }

        uint32_t curGroupOffset = 0;
        int32_t prevSumBeforeRank = 0;