From 4453c602626c6bce50b376bbb6e803d7b0131a6e Mon Sep 17 00:00:00 2001 From: wangyongjun Date: Mon, 12 Jan 2026 16:46:21 +0800 Subject: [PATCH] [bugfix]limit graph replay sync (#5761) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ### What this PR does / why we need it? When graph mode is piecewise, replaying with a preceding synchronize will affect performance: the sync alone costs almost 250 us ![123](https://github.com/user-attachments/assets/04d2a1f3-1f57-4dbb-85ce-b250f2ee7ff0) ### Does this PR introduce _any_ user-facing change? Only synchronize when the graph mode contains full mode ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d --------- Signed-off-by: wangyongjun --- vllm_ascend/compilation/acl_graph.py | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/vllm_ascend/compilation/acl_graph.py b/vllm_ascend/compilation/acl_graph.py index 29ec5793..97bfb03d 100644 --- a/vllm_ascend/compilation/acl_graph.py +++ b/vllm_ascend/compilation/acl_graph.py @@ -192,12 +192,13 @@ class ACLGraphWrapper: f"got {new_input_addresses}") logger.info_once("Replaying aclgraph") - # In async scheduling or multi-threaded (MT) scenarios, it is possible that + # In async scheduling or multi-threaded (MT) scenarios when graph mode is FULL, it is possible that # the CPU's record event (from update_attn_params) for the iteration i completes # before the grph replay of iteration i-1. # To ensure proper ordering, we must call synchronize here before replaying, # so that update_attn_params only executes after the previous graph replay has fully completed. - torch.npu.synchronize() + if self.runtime_mode == CUDAGraphMode.FULL: + torch.npu.synchronize() entry.aclgraph.replay() return entry.output