[Bugfix] Synchronize only the current stream to avoid device sync (#6432)
### What this PR does / why we need it?
Following [PR #4233](https://github.com/vllm-project/vllm-ascend/pull/4233), a synchronization mechanism was introduced between steps in asynchronous scheduling with ACL Graph to address a hanging issue. However, full device-level synchronization is unnecessary: only the operations on the current stream need to be synchronized. Otherwise, if other background operations (such as send and recv) are running concurrently on other streams, the device-wide wait can hurt inference performance for the instance.

Synchronizing only the current stream is sufficient to resolve the hang.
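To illustrate why the narrower wait is enough, here is a minimal, torch-free sketch (not the actual vllm-ascend code) that models the distinction: a "device" hosts several independent work streams; device-level synchronization waits for all of them, while stream-level synchronization waits only for the stream you care about. The `Stream`/`Device` classes below are hypothetical stand-ins for the NPU runtime.

```python
# Hedged model of device-wide vs. stream-scoped synchronization.
# "Stream" and "Device" are illustrative stand-ins, not torch/NPU APIs.
import threading
import queue
import time

class Stream:
    """A single in-order work queue, like one device stream."""
    def __init__(self):
        self._q = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            task = self._q.get()
            task()
            self._q.task_done()

    def launch(self, task):
        self._q.put(task)

    def synchronize(self):
        # Wait only for work queued on THIS stream.
        self._q.join()

class Device:
    """A device hosting several independent streams."""
    def __init__(self, n_streams=2):
        self.streams = [Stream() for _ in range(n_streams)]

    def synchronize(self):
        # Device-wide sync: wait for EVERY stream, including
        # unrelated background work (e.g. send/recv).
        for s in self.streams:
            s.synchronize()

dev = Device()
compute, comm = dev.streams
done = []
compute.launch(lambda: done.append("step"))            # fast compute step
comm.launch(lambda: (time.sleep(0.5), done.append("recv")))  # slow background recv

compute.synchronize()  # returns once the compute stream drains
fast = ("step" in done) and ("recv" not in done)

dev.synchronize()      # device-wide sync also waits out the slow recv
```

After `compute.synchronize()`, the compute step is finished but the background recv is still in flight, which is exactly why the PR replaces the device-wide wait with a current-stream wait between steps.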
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
Signed-off-by: For_YL <zhangtangwei@huawei.com>
Co-authored-by: For_YL <zhangtangwei@huawei.com>
```diff
@@ -196,7 +196,7 @@ class ACLGraphWrapper:
             else False
         )
         if self.runtime_mode != CUDAGraphMode.FULL or not forward_context.is_draft_model or not use_eagle:
-            torch.npu.synchronize()
+            torch.npu.current_stream().synchronize()
         entry.aclgraph.replay()
         return entry.output
```