### What this PR does / why we need it?
This applies the same fix as https://github.com/vllm-project/vllm/pull/36013.

In `_update_states_after_model_execute`, `num_accepted_tokens` is copied from GPU to pinned CPU memory with `non_blocking=True`. The CPU-side numpy view is later read in `_build_attention_metadata` during the next `execute_model` call. With async scheduling, `_bookkeeping_sync` deliberately avoids any CUDA synchronization (that is the whole point of async scheduling), so there is no guarantee the DMA has landed before the CPU read.
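To illustrate the hazard pattern (not the actual vLLM code), here is a minimal stdlib sketch in which a background thread stands in for the in-flight device-to-host DMA and a `threading.Event` stands in for the missing synchronization; all names are illustrative:

```python
import threading
import time

def sketch():
    buf = [0]                    # stands in for the pinned-CPU numpy view
    done = threading.Event()     # stands in for a CUDA event / stream sync

    def dma_copy():
        time.sleep(0.05)         # the async copy is still in flight
        buf[0] = 42              # the copy finally lands
        done.set()

    threading.Thread(target=dma_copy).start()
    racy = buf[0]                # reading immediately observes stale data
    done.wait()                  # the fix: synchronize before reading
    safe = buf[0]
    return racy, safe
```

The fix in the PR follows the same shape: ensure the copy has completed before the CPU-side view is read on the next step.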
Signed-off-by: ppppeng <zepengliu912@qq.com>