[Bugfix] Fix seq_lens reset issue causing performance degradation (#6158)

### What this PR does / why we need it?
Previously, `seq_lens` was not reset after each step because the code that clears the sequence-length buffer was missing. As a result, when a smaller batch was processed after a larger one, the stale `seq_lens` entries from the larger batch were carried over. The attention operator then computed with unnecessarily long sequence lengths, increasing the computation load and degrading performance.
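A minimal sketch (hypothetical buffer names, not the actual runner code) of how a persistent sequence-length buffer can carry stale values when a smaller batch follows a larger one, and why zeroing the tail fixes it:

```python
import numpy as np

# A persistent buffer sized for the maximum batch, reused across steps.
MAX_REQS = 8
seq_lens = np.zeros(MAX_REQS, dtype=np.int32)

# Step 1: a large batch of 6 requests fills the first 6 slots.
seq_lens[:6] = [128, 64, 256, 32, 512, 96]

# Step 2: a smaller batch of 2 requests overwrites only the first 2 slots.
# Without a reset, slots 2..5 still hold lengths from the previous batch.
num_reqs = 2
seq_lens[:num_reqs] = [40, 80]
stale_tail = seq_lens[num_reqs:].copy()  # still contains old lengths

# The fix mirrors the PR: zero everything past the active requests
# before the buffer is padded and copied to the device.
seq_lens[num_reqs:] = 0
```

If the padded portion of the buffer is later consumed by the attention kernel, any nonzero stale entries inflate the effective sequence lengths, which is the performance degradation this PR addresses.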



### Does this PR introduce _any_ user-facing change?


### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: ZYang6263 <zy626375@gmail.com>
Author: ZYang6263
Date: 2026-01-23 11:29:54 +08:00
Committed by: GitHub
Parent: 82a2b3bcc7
Commit: 418a43e2a2


```diff
@@ -974,6 +974,8 @@ class NPUModelRunner(GPUModelRunner):
                 1:pad_size +
                 1] * self.uniform_decode_query_len + last_query_loc
             self.query_start_loc.copy_to_gpu(num_reqs_padded + 1)
+            self.seq_lens.np[num_reqs:].fill(0)
+            self.seq_lens.copy_to_gpu(num_reqs_padded)
             # So we are trying to simulate the behavior of GPUModelRunner's
             # prepare_inputs for uniform decode mode by padding query_start_loc
```