[bugfix] solve dp scenario Host-Device sync (#5298)

### What this PR does / why we need it?

In the speculative decoding scenario, the original code built `query_start_loc` directly on the device from a Python list. This performs a Host-Device synchronization, which slows down the main model's execution. This PR writes the values into the CPU buffer first and then calls `copy_to_gpu()`.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main: ad32e3e19c

Signed-off-by: hwhaokun <haokun0405@163.com>
Co-authored-by: realliujiaxu <realliujiaxu@163.com>
2025-12-27 10:36:59 +08:00
parent 69f96950e1
commit cb2fbf7df2
1 changed file with 3 additions and 2 deletions
--- a/vllm_ascend/worker/model_runner_v1.py
+++ b/vllm_ascend/worker/model_runner_v1.py
@@ -1863,10 +1863,11 @@ class NPUModelRunner(GPUModelRunner):
                # QUESTION: Why do we separately set query_start_loc for spec in the first place?
                # While in _prepare_inputs we don't?
                if self.speculative_config:
-                    self.query_start_loc.gpu[:num_reqs + 1] = torch.tensor(
+                    self.query_start_loc.cpu[:num_reqs + 1] = torch.tensor(
                        [0] + self.actual_seq_lengths_q[:num_reqs],
-                        device=self.device,
+                        device="cpu",
                        dtype=torch.int32)
+                    self.query_start_loc.copy_to_gpu()
                common_attn_metadata = AscendCommonAttentionMetadata(
                    query_start_loc=self.query_start_loc.gpu[:num_reqs + 1],
                    query_start_loc_cpu=self.query_start_loc.cpu[:num_reqs +
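
The pattern this hunk moves to (fill the host-side buffer, then issue one explicit copy to the device-side buffer) can be sketched without any accelerator. This is only an illustration under stated assumptions: `StagedBuffer` and `stage_query_start_loc` are hypothetical stand-ins, not the vLLM or vllm-ascend classes, both "sides" are plain host arrays, and the sketch assumes `actual_seq_lengths_q` holds cumulative query lengths, as the `[0] + ...` prefix in the diff suggests.

```python
from array import array


class StagedBuffer:
    """Hypothetical stand-in for a paired host/device buffer.

    Both sides are host arrays here; `gpu` only plays the role of the
    device-side copy so the staging order can be shown.
    """

    def __init__(self, size):
        self.cpu = array("i", [0] * size)  # host-side staging copy
        self.gpu = array("i", [0] * size)  # stands in for the device copy
        self.copies = 0                    # counts explicit host-to-device copies

    def copy_to_gpu(self):
        # One explicit transfer of the whole staging buffer.
        self.gpu[:] = self.cpu
        self.copies += 1
        return self.gpu


def stage_query_start_loc(buf, cumulative_q_lens, num_reqs):
    """Write the offsets on the host side first, then copy once."""
    # Assumption: cumulative_q_lens[i] is the end offset of request i.
    offsets = [0] + list(cumulative_q_lens[:num_reqs])
    buf.cpu[:num_reqs + 1] = array("i", offsets)
    buf.copy_to_gpu()
    return buf.gpu[:num_reqs + 1]


buf = StagedBuffer(8)
result = stage_query_start_loc(buf, [2, 4, 6, 8], num_reqs=3)
print(list(result))  # offsets for 3 requests, starting at 0
```

After the call, the host and device views hold the same offsets, which matches how the hunk passes both `query_start_loc.gpu` and `query_start_loc.cpu` slices to the attention metadata.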