[main][bugfix] Fix fullgraph padding bug in mtp eagle refactor (#5692)

### What this PR does / why we need it?
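Fix the condition that decides whether to pad requests in full-graph mode
when MTP and PCP are combined. The upper bound on decode tokens is now
also capped by the largest captured graph size
(`self.cudagraph_batch_sizes[-1]`), which covers the corner case where
the graph capture sizes are specified manually.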
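
As a rough illustration of the changed condition, the sketch below reimplements the bound check as a standalone function. The function name `decode_pad_size`, the `clamp_to_captured` switch, and the sample numbers are hypothetical and exist only for this example; it also assumes `cudagraph_batch_sizes` is sorted in ascending order, so that its last element is the largest captured size.

```python
def decode_pad_size(num_input_tokens, num_reqs, max_num_seqs,
                    uniform_decode_query_len, cudagraph_batch_sizes,
                    clamp_to_captured=True):
    """Return how many dummy requests the uniform-decode branch would add.

    Hypothetical helper: mirrors the arithmetic in the diff, not the
    actual NPUModelRunner method.
    """
    # Old bound: scheduler limit times tokens per decode request.
    max_decode_tokens = max_num_seqs * uniform_decode_query_len
    if clamp_to_captured:
        # New bound: also never exceed the largest captured graph size
        # (assumes the list is sorted ascending).
        max_decode_tokens = min(max_decode_tokens, cudagraph_batch_sizes[-1])
    if not (uniform_decode_query_len <= num_input_tokens <= max_decode_tokens):
        return 0  # outside the full-graph decode range: no padding here
    num_reqs_padded = num_input_tokens // uniform_decode_query_len
    return max(num_reqs_padded - num_reqs, 0)


# Manually specified capture sizes that stop below max_num_seqs * query_len.
sizes = [2, 4, 8, 16]
old = decode_pad_size(24, 10, 16, 2, sizes, clamp_to_captured=False)
new = decode_pad_size(24, 10, 16, 2, sizes, clamp_to_captured=True)
print(old, new)
```

With these made-up numbers, the old bound (32) admits 24 input tokens and pads the batch even though no graph of that size was captured, while the clamped bound (16) skips the padding branch.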

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests (UT) and other tests.

- vLLM version: v0.13.0
- vLLM main: 2f4e6548ef

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
lilinsiman
2026-01-10 23:07:48 +08:00
committed by GitHub
parent 78b554dda9
commit c5744e2350


@@ -934,9 +934,13 @@ class NPUModelRunner(GPUModelRunner):
         # TODO: We should make this official ASAP. Also note that if we pad here,
         # the builders wont need to add any extra padding.
-        max_decode_tokens = self.scheduler_config.max_num_seqs * self.uniform_decode_query_len
         if self.compilation_config.cudagraph_mode.decode_mode() == CUDAGraphMode.FULL and \
-            uniform_decode and self.uniform_decode_query_len <= num_input_tokens <= max_decode_tokens:
-            num_reqs_padded = num_input_tokens // self.uniform_decode_query_len
-            pad_size = num_reqs_padded - num_reqs
-            if pad_size > 0:
+            uniform_decode:
+            max_decode_tokens = min(
+                self.scheduler_config.max_num_seqs *
+                self.uniform_decode_query_len,
+                self.cudagraph_batch_sizes[-1])
+            if self.uniform_decode_query_len <= num_input_tokens <= max_decode_tokens:
+                num_reqs_padded = num_input_tokens // self.uniform_decode_query_len
+                pad_size = num_reqs_padded - num_reqs
+                if pad_size > 0: