[Bugfix] Fix seq_lens reset issue causing performance degradation (#6158)
### What this PR does / why we need it?
Previously, `seq_lens` was not reset correctly after each step: the code
that clears stale sequence lengths was missing. As a result, when a
smaller batch was processed after a larger one, the `seq_lens` entries
from the larger batch were carried over. This caused the attention
operator to compute with unnecessarily large sequence lengths, increasing
the computation load and degrading performance.
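As a minimal illustration of the failure mode (not the vLLM code itself), the sketch below uses a hypothetical persistent NumPy buffer reused across steps: without zero-filling the tail past `num_reqs`, entries from a previous, larger batch survive into the next step.

```python
# Hypothetical sketch of the stale-buffer bug fixed by this PR: a
# per-step seq_lens buffer is reused, and a smaller batch leaves the
# previous batch's lengths behind unless the tail is cleared.
import numpy as np

MAX_REQS = 8
seq_lens = np.zeros(MAX_REQS, dtype=np.int32)  # persistent across steps

def prepare_step(lens, clear_tail):
    """Write this step's lengths; optionally zero the stale tail."""
    num_reqs = len(lens)
    seq_lens[:num_reqs] = lens
    if clear_tail:
        seq_lens[num_reqs:] = 0  # the fix: clear entries past num_reqs
    return seq_lens.copy()

# Large batch of 4, then small batch of 2 without clearing:
prepare_step([10, 20, 30, 40], clear_tail=False)
buggy = prepare_step([5, 6], clear_tail=False)
# Stale lengths 30 and 40 remain and would be read for padded slots.
fixed = prepare_step([5, 6], clear_tail=True)
```

With `clear_tail=True`, only the two live requests carry nonzero lengths, so a padded attention kernel no longer sees the leftover values.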
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: ZYang6263 <zy626375@gmail.com>
```diff
@@ -974,6 +974,8 @@ class NPUModelRunner(GPUModelRunner):
                 1:pad_size +
                 1] * self.uniform_decode_query_len + last_query_loc
             self.query_start_loc.copy_to_gpu(num_reqs_padded + 1)
+            self.seq_lens.np[num_reqs:].fill(0)
+            self.seq_lens.copy_to_gpu(num_reqs_padded)

             # So we are trying to simulate the behavior of GPUModelRunner's
             # prepare_inputs for uniform decode mode by padding query_start_loc
```