pppeng 9a0b786f2b [bugfix][0.18.0] Fix race in non-blocking num_accepted_tokens (#8764)
### What this PR does / why we need it?
This ports the same fix as https://github.com/vllm-project/vllm/pull/36013.
In `_update_states_after_model_execute`, `num_accepted_tokens` is copied
from GPU to pinned CPU memory with `non_blocking=True`. The CPU-side numpy
view is later read in `_build_attention_metadata` during the next
`execute_model` call. With async scheduling, `_bookkeeping_sync`
deliberately avoids any CUDA synchronization (that is the whole point of
async scheduling), so there is no guarantee the DMA has landed before the
CPU read occurs.
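The hazard can be sketched in plain Python, with a background thread standing in for the DMA engine and a `threading.Event` playing the role of a CUDA event (this is a hedged analogy of the pattern, not the actual vLLM/vllm-ascend code; all names below are illustrative):

```python
import threading
import time

class PinnedBuffer:
    """Stand-in for pinned host memory filled by an asynchronous copy."""
    def __init__(self):
        self.value = None
        self.copy_done = threading.Event()  # analogue of a CUDA event

    def async_copy(self, value, delay=0.05):
        # Analogue of tensor.copy_(src, non_blocking=True): returns
        # immediately while the "DMA" completes in the background.
        def dma():
            time.sleep(delay)     # copy still in flight
            self.value = value    # data finally lands in host memory
            self.copy_done.set()  # event recorded after the copy
        threading.Thread(target=dma, daemon=True).start()

def racy_read(buf):
    # Reads immediately, like the unsynchronized numpy view of
    # num_accepted_tokens: may observe stale data.
    return buf.value

def safe_read(buf):
    # Waits on the event first, the analogue of synchronizing on a
    # CUDA event before the CPU-side read.
    buf.copy_done.wait()
    return buf.value

buf = PinnedBuffer()
buf.async_copy(42)
stale = racy_read(buf)  # likely None: the "DMA" has not landed yet
fresh = safe_read(buf)  # guaranteed 42
```

The fix follows the safe pattern: make the consumer wait on an event recorded after the non-blocking copy, rather than assuming the copy has completed by the time of the next `execute_model` call.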

Signed-off-by: ppppeng <zepengliu912@qq.com>
2026-04-27 23:28:52 +08:00