[Bugfix] fix accuracy problem for quantized deepseek models (#768)

### What this PR does / why we need it?

The root cause of the bug is that arithmetic on NaN values cannot eliminate them: multiplying a NaN by a zero mask still yields NaN. We fixed this by using `masked_fill_` to overwrite the invalid entries with zero directly, while avoiding the extra memory allocation of a `torch.where`-based approach.
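A minimal pure-Python illustration of the failure mode (using plain floats rather than torch tensors, so the sketch has no dependencies): multiplying a NaN by a zero mask propagates the NaN instead of clearing it, whereas explicitly overwriting the masked positions with zero, as `masked_fill_` does, removes it.

```python
import math

# Simulated expert output with a NaN in an invalid (padded) slot.
down_out = [1.5, float("nan"), 2.0]
valid_mask = [True, False, True]

# Masking by multiplication does NOT clear the NaN: nan * 0.0 == nan.
mul_masked = [x * (1.0 if m else 0.0) for x, m in zip(down_out, valid_mask)]
assert math.isnan(mul_masked[1])

# Overwriting masked positions (the masked_fill_ idea) does clear it.
fill_masked = [x if m else 0.0 for x, m in zip(down_out, valid_mask)]
assert fill_masked == [1.5, 0.0, 2.0]
```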

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
This patch was tested with vllm v0.8.5 and vllm-ascend master. I ran the deepseek_v3 model with the offline inference scripts (examples/dp_offline/run_dp.sh & data_parallel.py).

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Commit 2cd036ee8e (parent d6e9417652), authored by linfeng-yuan on 2025-05-06 22:09:56 +08:00, committed by GitHub.


@@ -285,7 +285,8 @@ def fused_experts(hidden_states: torch.Tensor,
         valid_token_mask = torch.arange(
             0, sorted_token_indices.shape[0],
             device=device).unsqueeze(1) < num_valid_tokens
-        down_out_list.mul_(valid_token_mask)
+        down_out_list = down_out_list.masked_fill_(~valid_token_mask,
+                                                   0).to(dtype)
         final_hidden_states.index_add_(0, sorted_token_indices, down_out_list)
     else:
         # TODO: Reorder device memory 2 times here, replace the current
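The shape of the patched hunk can be sketched without torch: `valid_token_mask` is a column vector (one flag per sorted token slot) that broadcasts across the hidden dimension, and every row whose flag is False is overwritten with zeros. A pure-Python analogue, using hypothetical toy shapes (4 token slots, hidden size 2):

```python
# Hypothetical toy shapes: 4 sorted token slots, hidden size 2;
# the last slot is padding and may hold NaNs.
num_valid_tokens = 3
down_out_list = [[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0],
                 [float("nan"), float("nan")]]

# Analogue of: torch.arange(n, device=device).unsqueeze(1) < num_valid_tokens
valid_token_mask = [[i < num_valid_tokens] for i in range(len(down_out_list))]

# Analogue of: down_out_list.masked_fill_(~valid_token_mask, 0)
# The single-column mask broadcasts across each row's hidden dimension.
for row, (flag,) in zip(down_out_list, valid_token_mask):
    if not flag:
        for j in range(len(row)):
            row[j] = 0.0

assert down_out_list[3] == [0.0, 0.0]  # padded NaNs replaced by zeros
assert down_out_list[0] == [1.0, 2.0]  # valid rows untouched
```

Because the zeros are written in place rather than produced by `torch.where` (which allocates a new tensor for its result), the subsequent `index_add_` sees no NaNs and no extra device memory is consumed.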