[Bugfix] Synchronize only the current stream to avoid device sync (#6432)

### What this PR does / why we need it?

Following [PR
#4233](https://github.com/vllm-project/vllm-ascend/pull/4233), a
synchronization point was introduced between steps in asynchronous
scheduling with ACL Graph to address a hanging issue. However, full
device-level synchronization is unnecessary: only the operations on the
current stream need to be synchronized. A device-wide sync waits for
every stream on the device, so background operations running
concurrently on other streams (such as send and recv) can delay the
step and degrade inference performance for the instance.

The original hang problem:

![c4bbfac9a9088acec0ad335b4c2af437](https://github.com/user-attachments/assets/b7c8c612-4d45-48ec-9465-954869f9643d)

Synchronizing only the current stream can also resolve the hang issue.
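For illustration, here is a minimal sketch of the difference between the two calls. It assumes the `torch_npu` plugin, which provides the `torch.npu` namespace; the background stream here is hypothetical, standing in for send/recv work:

```python
import torch
import torch_npu  # Ascend NPU plugin; registers the torch.npu namespace

# Hypothetical background stream, e.g. carrying send/recv transfers.
background = torch.npu.Stream()

# Device-wide sync: blocks the host until ALL streams on the device are
# idle, so it also waits for whatever is queued on `background`.
torch.npu.synchronize()

# Stream-scoped sync: blocks the host only until the work already
# enqueued on the current stream finishes; `background` keeps running.
torch.npu.current_stream().synchronize()
```

This is why the narrower call still fixes the hang (the current stream is fully drained before the next step) while no longer stalling on unrelated transfers.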

### Does this PR introduce any user-facing change?
No

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: For_YL <zhangtangwei@huawei.com>
Co-authored-by: For_YL <zhangtangwei@huawei.com>

```diff
@@ -196,7 +196,7 @@ class ACLGraphWrapper:
                 else False
             )
             if self.runtime_mode != CUDAGraphMode.FULL or not forward_context.is_draft_model or not use_eagle:
-                torch.npu.synchronize()
+                torch.npu.current_stream().synchronize()
             entry.aclgraph.replay()
             return entry.output
```