[bugfix][0.18.0] Fix race in non-blocking num_accepted_tokens (#8764)
### What this PR does / why we need it?

The same fix as in https://github.com/vllm-project/vllm/pull/36013.

In `_update_states_after_model_execute`, `num_accepted_tokens` is copied from the GPU to pinned CPU memory with `non_blocking=True`. The CPU-side NumPy view is later read in `_build_attention_metadata` during the next `execute_model` call. With async scheduling, `_bookkeeping_sync` deliberately avoids any CUDA synchronization (that is the whole point of async scheduling), so there is no guarantee the DMA has landed before the CPU read.

Signed-off-by: ppppeng <zepengliu912@qq.com>
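The hazard described above can be illustrated without a GPU. In this hypothetical sketch (not the vLLM code), a background thread stands in for the non-blocking DMA transfer and a `threading.Event` stands in for the CUDA event the fix synchronizes on: a read before the wait can observe stale data, while a read after the wait is guaranteed to see the copied values.

```python
import threading
import time

cpu_view = [0, 0, 0, 0]        # stand-in for the pinned-memory NumPy view
copy_done = threading.Event()  # stand-in for the CUDA event

def fake_dma(src):
    # Mimics the in-flight non_blocking=True copy landing later.
    time.sleep(0.05)           # transfer latency
    cpu_view[:] = src
    copy_done.set()            # analogue of event.record() completing

threading.Thread(target=fake_dma, args=([3, 1, 2, 2],), daemon=True).start()

stale_read = list(cpu_view)    # the buggy path: CPU read with no synchronization
copy_done.wait()               # the fix: analogue of event.synchronize()
fresh_read = list(cpu_view)    # now guaranteed to hold the copied values
```

`stale_read` may still contain the old zeros depending on timing, which is exactly the race: the result of the unsynchronized read is undefined, while `fresh_read` is deterministic.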
```diff
@@ -2041,6 +2041,8 @@ class NPUModelRunner(GPUModelRunner):
         else:
             max_seq_len = self.seq_lens.np[:num_reqs].max().item()
         if use_spec_decode and self.need_accepted_tokens:
+            if self.num_accepted_tokens_event is not None:
+                self.num_accepted_tokens_event.synchronize()
             self.num_accepted_tokens.np[:num_reqs] = self.input_batch.num_accepted_tokens_cpu[:num_reqs]
             self.num_accepted_tokens.np[num_reqs:].fill(1)
             self.num_accepted_tokens.copy_to_gpu()
```
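The diff shows only the consumer side; the fix relies on the producer having recorded the event right after the non-blocking copy. The pairing can be sketched in a portable, hypothetical form where the method and attribute names mirror the diff but the event is a plain `threading.Event` rather than a CUDA/NPU event, so it runs anywhere:

```python
import threading

class RunnerSketch:
    """Hypothetical stand-in for the runner's accepted-token bookkeeping
    (names mirror the diff; this is not the actual NPUModelRunner)."""

    def __init__(self, max_reqs: int):
        self.num_accepted_tokens_cpu = [1] * max_reqs
        self.num_accepted_tokens_event = threading.Event()

    def update_states_after_model_execute(self, device_result: list[int]) -> None:
        # Producer side: start the "non-blocking" copy, then record completion.
        self.num_accepted_tokens_event.clear()

        def async_copy():
            self.num_accepted_tokens_cpu[:len(device_result)] = device_result
            self.num_accepted_tokens_event.set()   # event.record() analogue

        threading.Thread(target=async_copy, daemon=True).start()

    def build_attention_metadata(self, num_reqs: int) -> list[int]:
        # Consumer side: the added synchronize() guards the CPU read.
        self.num_accepted_tokens_event.wait()      # event.synchronize() analogue
        out = list(self.num_accepted_tokens_cpu)
        out[num_reqs:] = [1] * (len(out) - num_reqs)  # pad inactive slots with 1
        return out

runner = RunnerSketch(max_reqs=4)
runner.update_states_after_model_execute([2, 3])
metadata = runner.build_attention_metadata(num_reqs=2)
```

Waiting on a per-copy event is much cheaper than a full device synchronization: it blocks only until this one transfer lands, which is why it is compatible with async scheduling's goal of avoiding blanket CUDA syncs.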