[Fix] Refines decode mode padding condition for uniform queries (#5164)
### What this PR does / why we need it?
We cannot use `self.cudagraph_batch_sizes[-1]` because it is not actually the maximum number of tokens to pad to in `FULL_DECODE_ONLY` mode; it can be much larger. It is only trimmed down to `compilation_cases` right before capture, and this caused us a lot of trouble.
This PR updates the padding logic so that padding happens only when the number of input tokens falls within the valid uniform-decode query range, improving consistency and avoiding unnecessary padding in specific decode modes.
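As a minimal, standalone sketch of the new check (all values below are illustrative, and the names merely mirror those in the diff; this is not the actual `NPUModelRunner` code):

```python
# Illustrative sketch of the new uniform-decode padding condition.
# All concrete values are made up; names mirror the diff for readability.
uniform_decode_query_len = 4   # tokens per request in uniform decode (assumed)
max_num_seqs = 64              # stands in for scheduler_config.max_num_seqs
max_decode_tokens = max_num_seqs * uniform_decode_query_len  # 256

num_reqs = 30
num_input_tokens = 128         # assumed already padded to a capture size

# Pad only when the token count lies within the valid uniform-decode
# range, rather than comparing against cudagraph_batch_sizes[-1].
if uniform_decode_query_len <= num_input_tokens <= max_decode_tokens:
    num_reqs_padded = num_input_tokens // uniform_decode_query_len  # 32
    pad_size = num_reqs_padded - num_reqs                           # 2
    if pad_size > 0:
        print(f"pad with {pad_size} dummy request(s) to reach {num_reqs_padded}")
```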
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
None.
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
```diff
@@ -1008,8 +1008,9 @@ class NPUModelRunner(GPUModelRunner):
         # TODO: We should make this official ASAP. Also note that if we pad here,
         # the builders won’t need to add any extra padding.
+        max_decode_tokens = self.scheduler_config.max_num_seqs * self.uniform_decode_query_len
         if self.compilation_config.cudagraph_mode.decode_mode() == CUDAGraphMode.FULL and \
-                uniform_decode and num_input_tokens <= self.cudagraph_batch_sizes[-1]:
+                uniform_decode and self.uniform_decode_query_len <= num_input_tokens <= max_decode_tokens:
            num_reqs_padded = num_input_tokens // self.uniform_decode_query_len
            pad_size = num_reqs_padded - num_reqs
            if pad_size > 0:
```
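For intuition, here is a hypothetical comparison of the old and new checks; the concrete numbers, including the untrimmed last capture size, are made up purely for illustration:

```python
# Why the old bound was wrong: the last (untrimmed) capture size can
# exceed the real uniform-decode maximum, so the old check admitted
# token counts that no uniform-decode batch can actually produce.
uniform_decode_query_len = 4
max_num_seqs = 64
max_decode_tokens = max_num_seqs * uniform_decode_query_len   # 256
cudagraph_batch_sizes_last = 512                              # untrimmed (assumed)

num_input_tokens = 320  # > max_decode_tokens, but <= old bound

old_ok = num_input_tokens <= cudagraph_batch_sizes_last
new_ok = uniform_decode_query_len <= num_input_tokens <= max_decode_tokens
print(old_ok, new_ok)  # True False -> the new check rejects this case
```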