[Bugfix] Fix the graph capture failure issue in the eagle3+full scenario. (#5553)

### What this PR does / why we need it? When launching the service in the scenario where the cudagraph_mode is set to FULL and Eagle3 acceleration is enabled for inference, an error in fia will cause graph capture to fail. This PR fixes the issue. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: 7157596103 Signed-off-by: WithHades <244036962@qq.com>
2026-01-07 15:57:16 +08:00
parent 2b8a9ce8bd
commit 1140789e83
2 changed files with 10 additions and 12 deletions
--- a/vllm_ascend/worker/model_runner_v1.py
+++ b/vllm_ascend/worker/model_runner_v1.py
@@ -1939,14 +1939,7 @@ class NPUModelRunner(GPUModelRunner):
                        [0] * dcp_world_size for _ in range(pcp_world_size)
                    ] for _ in range(num_tokens)]
                    long_seq_metadata.num_computed_tokens_of_pcp_dcp = num_computed_tokens_of_pcp_dcp
-                # QUESTION: Why do we separately set query_start_loc for spec in the first place?
-                # While in _prepare_inputs we don't?
-                if self.speculative_config:
-                    self.query_start_loc.cpu[:num_reqs + 1] = torch.tensor(
-                        [0] + self.actual_seq_lengths_q[:num_reqs],
-                        device="cpu",
-                        dtype=torch.int32)
-                    self.query_start_loc.copy_to_gpu()
+
                common_attn_metadata = AscendCommonAttentionMetadata(
                    query_start_loc=self.query_start_loc.gpu[:num_reqs + 1],
                    query_start_loc_cpu=self.query_start_loc.cpu[:num_reqs +