[Bugfix] Fix seq_lens reset issue causing performance degradation (#6158)
### What this PR does / why we need it?
Previously, `seq_lens` was not reset correctly after each step: the code
that clears stale sequence lengths was missing. As a result, when a
smaller batch was processed after a larger one, the `seq_lens` entries
from the larger batch were carried over. This caused the attention
operator to compute with unnecessarily large sequence lengths, increasing
the computation load and degrading performance.
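As a minimal illustration of the failure mode (not the vLLM code itself), the sketch below uses a hypothetical persistent NumPy buffer reused across steps: without zero-filling the tail past `num_reqs`, entries from a previous, larger batch survive into the next step.

```python
# Hypothetical sketch of the stale-buffer bug fixed by this PR: a
# per-step seq_lens buffer is reused, and a smaller batch leaves the
# previous batch's lengths behind unless the tail is cleared.
import numpy as np

MAX_REQS = 8
seq_lens = np.zeros(MAX_REQS, dtype=np.int32)  # persistent across steps

def prepare_step(lens, clear_tail):
    """Write this step's lengths; optionally zero the stale tail."""
    num_reqs = len(lens)
    seq_lens[:num_reqs] = lens
    if clear_tail:
        seq_lens[num_reqs:] = 0  # the fix: clear entries past num_reqs
    return seq_lens.copy()

# Large batch of 4, then small batch of 2 without clearing:
prepare_step([10, 20, 30, 40], clear_tail=False)
buggy = prepare_step([5, 6], clear_tail=False)
# Stale lengths 30 and 40 remain and would be read for padded slots.
fixed = prepare_step([5, 6], clear_tail=True)
```

With `clear_tail=True`, only the two live requests carry nonzero lengths, so a padded attention kernel no longer sees the leftover values.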
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: ZYang6263 <zy626375@gmail.com>
```diff
@@ -974,6 +974,8 @@ class NPUModelRunner(GPUModelRunner):
                 1:pad_size +
                 1] * self.uniform_decode_query_len + last_query_loc
             self.query_start_loc.copy_to_gpu(num_reqs_padded + 1)
+            self.seq_lens.np[num_reqs:].fill(0)
+            self.seq_lens.copy_to_gpu(num_reqs_padded)

             # So we are trying to simulate the behavior of GPUModelRunner's
             # prepare_inputs for uniform decode mode by padding query_start_loc
```