[Perf] Avoid CPU sync in mrope_positions copy by using full tensor copy (#7014)
### What this PR does / why we need it?
The slice-indexed copy
`mrope_positions.gpu[:, :total_num_scheduled_tokens].copy_(...)` triggers
a CPU-NPU synchronization, which blocks subsequent operator dispatch and
causes bubbles visible in profiling.
This PR switches to a full tensor copy
(`mrope_positions.gpu.copy_(mrope_positions.cpu)`) to eliminate the sync
point. The trade-off is a negligible increase in memory usage since
`mrope_positions.cpu` is a small tensor.
**Result:** ~2-3% TPOT improvement with the profiling bubbles
eliminated.
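
For illustration only, here is a minimal sketch of the two copy forms, written against plain PyTorch rather than the actual runner. The buffer names, the `MAX_NUM_TOKENS` size, and the fallback to CPU when no accelerator is visible are all assumptions made for this example. It shows only that both forms leave the same values in the scheduled prefix; it does not reproduce the NPU synchronization behaviour, which is device specific.

```python
import torch

# Hypothetical sizes for illustration; the real runner sizes its buffers itself.
MAX_NUM_TOKENS = 16
total_num_scheduled_tokens = 5

# Assumption: use CUDA if present, otherwise stay on CPU so the sketch still runs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# M-RoPE keeps three position rows per token (temporal, height, width).
cpu_buf = torch.zeros((3, MAX_NUM_TOKENS), dtype=torch.int64)
dev_sliced = torch.zeros((3, MAX_NUM_TOKENS), dtype=torch.int64, device=device)
dev_full = torch.zeros((3, MAX_NUM_TOKENS), dtype=torch.int64, device=device)

# Only the scheduled prefix holds meaningful positions for this step.
n = total_num_scheduled_tokens
cpu_buf[:, :n] = torch.arange(n)

# Old form: copy only the scheduled prefix through sliced views.
dev_sliced[:, :n].copy_(cpu_buf[:, :n], non_blocking=True)

# New form: copy the whole preallocated buffer in one call.
dev_full.copy_(cpu_buf, non_blocking=True)

# Make sure asynchronous copies have finished before comparing results.
if device.type == "cuda":
    torch.cuda.synchronize()

# Both forms agree on the prefix that downstream code reads.
prefix_matches = torch.equal(dev_sliced[:, :n].cpu(), dev_full[:, :n].cpu())
print(prefix_matches)
```

The full copy also transfers the columns past `total_num_scheduled_tokens`, which is the extra cost the description above calls negligible.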
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Verified via profiling that the CPU sync bubble is eliminated and TPOT
is reduced by 2-3%.
- vLLM version: v0.16.0
- vLLM main: 15d76f74e2
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>
@@ -723,8 +723,8 @@ class NPUModelRunner(GPUModelRunner):
         if self.uses_mrope:
             # Only relevant for models using M-RoPE (e.g, Qwen2-VL)
             self._calc_mrope_positions(scheduler_output)
-            self.mrope_positions.gpu[:, :total_num_scheduled_tokens].copy_(
-                self.mrope_positions.cpu[:, :total_num_scheduled_tokens],
+            self.mrope_positions.gpu.copy_(
+                self.mrope_positions.cpu,
                 non_blocking=True,
             )
         elif self.uses_xdrope_dim > 0: