[bugfix] limit graph replay sync (#5761)
### What this PR does / why we need it?
When the graph mode is piecewise, replaying with an explicit synchronize hurts
performance: each sync costs roughly 250us.

### Does this PR introduce _any_ user-facing change?
The synchronize before graph replay now runs only when the graph mode contains FULL mode.
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main: 2f4e6548ef
---------
Signed-off-by: wangyongjun <wangyongjun7@huawei.com>
```diff
@@ -192,12 +192,13 @@ class ACLGraphWrapper:
                 f"got {new_input_addresses}")

         logger.info_once("Replaying aclgraph")
-        # In async scheduling or multi-threaded (MT) scenarios, it is possible that
+        # In async scheduling or multi-threaded (MT) scenarios when graph mode is FULL, it is possible that
         # the CPU's record event (from update_attn_params) for the iteration i completes
         # before the graph replay of iteration i-1.
         # To ensure proper ordering, we must call synchronize here before replaying,
         # so that update_attn_params only executes after the previous graph replay has fully completed.
-        torch.npu.synchronize()
+        if self.runtime_mode == CUDAGraphMode.FULL:
+            torch.npu.synchronize()
         entry.aclgraph.replay()
         return entry.output

```
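The guard in the diff above can be sketched in isolation. This is a minimal, hypothetical model (the `GraphMode` enum, `GraphWrapperSketch` class, and `_synchronize` counter are stand-ins, not vLLM's actual `CUDAGraphMode` or `torch.npu.synchronize`), showing that the expensive device sync is paid only when the runtime mode is FULL:

```python
from enum import Enum


class GraphMode(Enum):
    # Stand-ins for vLLM's CUDAGraphMode values; names are illustrative only.
    PIECEWISE = "piecewise"
    FULL = "full"


class GraphWrapperSketch:
    """Toy model of the fix: pay the ~250us synchronize cost only in FULL mode."""

    def __init__(self, runtime_mode: GraphMode):
        self.runtime_mode = runtime_mode
        self.sync_calls = 0  # counts how often we synchronized

    def _synchronize(self) -> None:
        # Stand-in for torch.npu.synchronize(); here we only count calls.
        self.sync_calls += 1

    def replay(self) -> str:
        # With async scheduling, update_attn_params may record events for
        # iteration i before the replay of iteration i-1 finishes, so a
        # FULL-mode graph must synchronize first; piecewise graphs skip it.
        if self.runtime_mode is GraphMode.FULL:
            self._synchronize()
        return "replayed"


# Piecewise mode skips the sync entirely, avoiding the per-replay overhead.
pw = GraphWrapperSketch(GraphMode.PIECEWISE)
pw.replay()
print(pw.sync_calls)  # 0

# FULL mode synchronizes once before each replay for correct ordering.
full = GraphWrapperSketch(GraphMode.FULL)
full.replay()
print(full.sync_calls)  # 1
```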