From e04a5e3dd3d41a2f0757ae0eb3b6ba17cee965c5 Mon Sep 17 00:00:00 2001
From: Jade Zheng <zheng.shoujian@outlook.com>
Date: Mon, 20 Oct 2025 18:24:21 +0800
Subject: [PATCH] [Bugfix] Fix race condition in d2h transfer (#3372)

### What this PR does / why we need it?

A non-blocking device-to-host copy returns before the transfer has
finished. `torchair_fused_experts_with_all2all` converts
`token_counts_combined` to a NumPy array right after issuing the copy
with `non_blocking=True`, so the CPU can read the tensor before the data
has arrived, and later steps may then use wrong values. This patch makes
the copy blocking (`non_blocking=False`). The problem was observed in
the A3 environment during `profile_run`.
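
For illustration only, here is a minimal sketch of the two safe ways to
read a device tensor on the host. The helper name `to_host_numpy` and
its `sync` parameter are hypothetical and are not part of this patch.
The patch itself uses the first form, a blocking copy.

```python
import torch


def to_host_numpy(t: torch.Tensor, sync=None):
    """Copy `t` to the CPU and return it as a NumPy array.

    `sync` is a hypothetical hook: a callable that waits for the device
    stream that performs the copy (for example a stream's synchronize()).
    """
    if sync is None:
        # Blocking copy: the host data is valid once .to() returns.
        return t.to(torch.device("cpu"), non_blocking=False).numpy()
    # Non-blocking copy: the host buffer may not be filled yet on return,
    # so wait for the transfer before reading it.
    host = t.to(torch.device("cpu"), non_blocking=True)
    sync()
    return host.numpy()
```

Reading the result of a `non_blocking=True` copy without such a
synchronization step is the pattern this patch removes.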

### How was this patch tested?
CI passes.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
---
 vllm_ascend/torchair/quantization/torchair_w8a8_dynamic.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vllm_ascend/torchair/quantization/torchair_w8a8_dynamic.py b/vllm_ascend/torchair/quantization/torchair_w8a8_dynamic.py
index ceba1c4..5bd622e 100644
--- a/vllm_ascend/torchair/quantization/torchair_w8a8_dynamic.py
+++ b/vllm_ascend/torchair/quantization/torchair_w8a8_dynamic.py
@@ -459,7 +459,7 @@ def torchair_fused_experts_with_all2all(
         token_counts_combined = token_counts_combined.view(
             2, ep_group.world_size, -1).sum(dim=2)
         token_counts_combined_cpu = token_counts_combined.to(
-            torch.device("cpu"), non_blocking=True).numpy()
+            torch.device("cpu"), non_blocking=False).numpy()
         all_tokens = gather_sizes.sum()
 
         gathered_tokens = quantized_tokens.new_empty(all_tokens.item(),