[bugfix] solve dp scenario Host-Device sync (#5298)
### What this PR does / why we need it?
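In the speculative decoding scenario, the original code builds
`query_start_loc` with `torch.tensor(..., device=self.device)` and assigns it
directly into `self.query_start_loc.gpu`. This performs a Host-Device
synchronization, which slows down the main model's execution. This PR builds
the tensor on the CPU, writes it into `self.query_start_loc.cpu`, and then
calls `self.query_start_loc.copy_to_gpu()`.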
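
A minimal sketch of the staging pattern this change moves to, for illustration only. `StagedBuffer` and `fill_query_start_loc` are hypothetical names and not the classes used by vLLM or vllm-ascend. The sketch assumes a CUDA device when one is available and falls back to the CPU otherwise; behavior on Ascend NPUs is assumed to be analogous and is not exercised here.

```python
import torch


class StagedBuffer:
    """Hypothetical paired host/device buffer (illustration, not the vLLM class)."""

    def __init__(self, size: int, dtype: torch.dtype, device: torch.device):
        # Pinned host memory is requested only when the target is a CUDA device,
        # since pinning is what allows a CUDA host-to-device copy to be queued
        # without blocking the host.
        pin = device.type == "cuda"
        self.cpu = torch.zeros(size, dtype=dtype, device="cpu", pin_memory=pin)
        self.gpu = torch.zeros(size, dtype=dtype, device=device)

    def copy_to_gpu(self, n: int) -> torch.Tensor:
        # non_blocking=True requests an asynchronous copy where the backend
        # supports it; on a CPU-only machine this is an ordinary copy.
        return self.gpu[:n].copy_(self.cpu[:n], non_blocking=True)


def fill_query_start_loc(buf: StagedBuffer, seq_ends: list, num_reqs: int) -> torch.Tensor:
    # Build the values on the host first, mirroring device="cpu" in the diff.
    values = torch.tensor([0] + seq_ends[:num_reqs], device="cpu", dtype=torch.int32)
    # Write into the host-side buffer, then issue one explicit copy to the device.
    buf.cpu[:num_reqs + 1] = values
    return buf.copy_to_gpu(num_reqs + 1)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
buf = StagedBuffer(8, torch.int32, device)
out = fill_query_start_loc(buf, [2, 5, 9, 12], num_reqs=3)
```

Keeping the host copy up to date has a side benefit in this sketch: code that needs the values on the CPU can read `buf.cpu` instead of reading back from the device.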
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main: ad32e3e19c
Signed-off-by: hwhaokun <haokun0405@163.com>
Co-authored-by: realliujiaxu <realliujiaxu@163.com>
@@ -1863,10 +1863,11 @@ class NPUModelRunner(GPUModelRunner):
 # QUESTION: Why do we separately set query_start_loc for spec in the first place?
 # While in _prepare_inputs we don't?
 if self.speculative_config:
-    self.query_start_loc.gpu[:num_reqs + 1] = torch.tensor(
+    self.query_start_loc.cpu[:num_reqs + 1] = torch.tensor(
         [0] + self.actual_seq_lengths_q[:num_reqs],
-        device=self.device,
+        device="cpu",
         dtype=torch.int32)
+    self.query_start_loc.copy_to_gpu()
 common_attn_metadata = AscendCommonAttentionMetadata(
     query_start_loc=self.query_start_loc.gpu[:num_reqs + 1],
     query_start_loc_cpu=self.query_start_loc.cpu[:num_reqs +