### What this PR does / why we need it?

The same fix as https://github.com/vllm-project/vllm/pull/36013.

In `_update_states_after_model_execute`, `num_accepted_tokens` is copied from GPU to pinned CPU memory with `non_blocking=True`. The CPU-side numpy view is later read in `_build_attention_metadata` during the next `execute_model` call. With async scheduling, `_bookkeeping_sync` deliberately avoids any CUDA synchronization (that is the whole point of async scheduling), so there is no guarantee the DMA has completed before the CPU read.

Signed-off-by: ppppeng <zepengliu912@qq.com>
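The hazard can be sketched with a plain-Python analogy (this is not vLLM code: a background thread stands in for the CUDA copy stream, and `FakeDMA`, `buf` are made-up names for illustration only):

```python
import threading
import time

class FakeDMA:
    """Analogy for a non_blocking=True GPU->CPU copy: the call returns
    immediately, and the write to host memory lands later on a background
    'stream'."""
    def __init__(self):
        self.done = threading.Event()  # analogue of a CUDA event

    def copy_async(self, dst, value):
        def work():
            time.sleep(0.05)  # simulated DMA latency
            dst[0] = value
            self.done.set()
        threading.Thread(target=work).start()

buf = [0]                # stands in for the pinned-CPU numpy view
dma = FakeDMA()
dma.copy_async(buf, 42)  # "non_blocking" copy: returns before data lands

stale = buf[0]           # unsynchronized read: may still see the old value
dma.done.wait()          # analogue of synchronizing before the CPU read
fresh = buf[0]           # now guaranteed to observe the copied value
print(stale, fresh)
```

The fix corresponds to the `dma.done.wait()` step: the CPU-side read must be ordered after the copy's completion, which async scheduling's bookkeeping path does not otherwise guarantee.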