xc-llm-ascend/csrc/dispatch_ffn_combine/op_kernel
lhchg 717d299ae5 [BugFix]bug fix for dispatch_ffn_combine (#6156)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?
Part of the fused operator's synchronization logic copies EP *
expertPerRank int32 values. This buffer carries both synchronization
signals and data.

On Ascend A3, all data that falls within the same 512B DataBlock is
written to HBM atomically.

For the DeepSeek model, where expertPerRank per device is 16, the buffer
is 512B-aligned in both the 16-device single-node scenario (16 * 16 int32
values = 1024B) and the 32-device two-node scenario (32 * 16 int32 values
= 2048B). We therefore check the first value of each 512B block. If it is
not 0, the whole 512B block has been sent.
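
The per-block check described above can be modelled on the host as follows. This is only an illustrative sketch, not the kernel code from this PR: the helper name `AllBlocksSent` and the constant `kInt32PerBlock` are hypothetical, and it assumes the sender always writes a non-zero value at the head of each 512B block.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of int32 values in one 512B DataBlock (512 / 4 = 128).
constexpr std::size_t kInt32PerBlock = 512 / sizeof(std::int32_t);

// Hypothetical host-side model of the check: returns true only if the first
// int32 of every 512B block in `buf` is non-zero.
// Assumption: a non-zero head value means the whole block has been written,
// because a block is described as being written to HBM atomically.
bool AllBlocksSent(const std::vector<std::int32_t>& buf) {
    for (std::size_t i = 0; i < buf.size(); i += kInt32PerBlock) {
        if (buf[i] == 0) {
            return false;  // this block has not been written yet
        }
    }
    return true;
}
```

With EP = 16 and expertPerRank = 16 the buffer holds 256 int32 values, so the loop inspects exactly two positions, index 0 and index 128.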

However, when expertPerRank per device is not 16, EP * expertPerRank
int32 values may not be a multiple of 512B. The per-block check described
above is then no longer reliable.

This PR therefore pads the EP * expertPerRank data length up to the next
multiple of 512B.
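
The padding arithmetic can be sketched like this. It is a minimal illustration under stated assumptions, not the implementation in this PR: the function name `PaddedElemCount` is hypothetical, and the length is expressed as a count of int32 elements rather than bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of int32 values in one 512B DataBlock (512 / 4 = 128).
constexpr std::size_t kInt32PerBlock = 512 / sizeof(std::int32_t);

// Hypothetical helper: round EP * expertPerRank int32 elements up to a whole
// number of 512B blocks and return the padded element count.
constexpr std::size_t PaddedElemCount(std::size_t ep, std::size_t expertPerRank) {
    return (ep * expertPerRank + kInt32PerBlock - 1) / kInt32PerBlock * kInt32PerBlock;
}

// Already aligned: 16 * 16 = 256 elements = 1024B, so no padding is added.
static_assert(PaddedElemCount(16, 16) == 256, "aligned case must be unchanged");
// Not aligned: 16 * 10 = 160 elements = 640B, padded to 256 elements = 1024B.
static_assert(PaddedElemCount(16, 10) == 256, "unaligned case must round up");
```

The expertPerRank value of 10 is just an example of an unaligned configuration. After padding, every 512B block in the buffer is complete, so the first-value check applies to each block uniformly.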

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main: d68209402d

---------

Signed-off-by: lhchg <lhao_cheng@163.com>
Co-authored-by: lihaocheng <lihaosheng1@h-partners.com>
2026-01-23 21:14:18 +08:00
..
2025-12-05 15:07:31 +08:00