[fix]: fix precision issue in dispatch_ffn_combine_bf16 and remove redundant sync (#7198)

### What this PR does / why we need it?
Fix the precision issue in the dispatch_ffn_combine_bf16 operator.
Remove redundant synchronization operations in the dispatch_ffn_combine
operator.

- vLLM version: v0.16.0
- vLLM main: 4034c3d32e

---------

Signed-off-by: guanguan0308 <1546542263@qq.com>
guanguan0308
2026-03-23 10:14:03 +08:00
committed by GitHub
parent e68464a1d6
commit 44ef9a36ac
8 changed files with 531 additions and 462 deletions

@@ -85,8 +85,8 @@ KernelMoeTokenUnpermute<T1, T2, T3, PROBS>::Init(GM_ADDR permuted_tokens, GM_ADD
GM_ADDR unpermuted_tokens,
const MoeTokenUnpermuteTilingData *__restrict tiling_data)
{
-    this->blockIdx = get_block_idx() + get_subblockid() * get_block_num();
-    this->blockNum = get_block_num() * get_subblockdim();
+    this->blockIdx = get_block_idx();
+    this->blockNum = get_block_num();
if (blockIdx >= blockNum) {
return;