### What this PR does / why we need it?

Using non-blocking operations for device-to-host transfers can lead to data corruption in later steps. The CPU tensor is read right after the transfer is launched, before the copy is guaranteed to have completed, so it may still hold stale or partially written data. This problem was observed in the A3 environment during `profile_run`.

### How was this patch tested?

CI passed.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
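
For reviewers who want to see the hazard in isolation, here is a minimal, stdlib-only sketch of the race described above. It is an analogy and not the code touched by this PR: `FakeAsyncCopy` is a hypothetical stand-in for a non-blocking device-to-host copy, a background thread plays the role of the device's copy queue, and an event keeps the copy pending until `synchronize()` is called so that the result is deterministic.

```python
import threading


class FakeAsyncCopy:
    """Hypothetical stand-in for a non-blocking device-to-host copy.

    Not a torch or Ascend API: a daemon thread models the device copy queue,
    and an event models the copy being queued but not yet executed.
    """

    def __init__(self, src, dst):
        self._src = src
        self._dst = dst
        self._release = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()  # returns immediately, like a non-blocking copy

    def _run(self):
        self._release.wait()       # copy is still pending on the "device"
        self._dst[:] = self._src   # the actual transfer into the host buffer

    def synchronize(self):
        """Let the pending copy run and wait until it has finished."""
        self._release.set()
        self._worker.join()


device_data = [1, 2, 3, 4]
host_buffer = [0, 0, 0, 0]

copy = FakeAsyncCopy(device_data, host_buffer)

# Reading the host buffer right after launching the copy sees stale contents.
read_too_early = list(host_buffer)

# Reading after synchronization sees the transferred data.
copy.synchronize()
read_after_sync = list(host_buffer)
```

In this sketch `read_too_early` still holds the initial zeros, while `read_after_sync` matches `device_data`. On real hardware the early read is not guaranteed to be wrong every time, which is why the issue can go unnoticed until it shows up in a specific environment.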