Yizhou ff3914e31a [Fix] Refines decode mode padding condition for uniform queries (#5164)
### What this PR does / why we need it?
We cannot use `self.cudagraph_batch_sizes[-1]` here because it is not actually
the maximum number of tokens to pad to in `FULL_DECODE_ONLY` mode; it can be
much larger. The list is only trimmed down to `compilation_cases` right before
capture, which has caused us a lot of trouble.

Updates the logic to ensure padding occurs only when the number of input
tokens falls within a valid uniform decode query range, improving
consistency and avoiding unnecessary padding in specific decode modes.
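The updated condition can be sketched roughly as follows. This is a minimal illustration, not the PR's actual diff: the function and parameter names (`should_pad_uniform_decode`, `uniform_decode_query_len`, `max_capture_size`) are hypothetical stand-ins for the real attributes in the model runner.

```python
def should_pad_uniform_decode(num_input_tokens: int,
                              num_reqs: int,
                              uniform_decode_query_len: int,
                              max_capture_size: int) -> bool:
    """Decide whether to pad the batch for a full-decode-only graph.

    In uniform decode, every request contributes exactly
    `uniform_decode_query_len` tokens, so a valid batch has
    `num_reqs * uniform_decode_query_len` input tokens. Padding is
    only useful when the token count also fits within the largest
    captured graph size; beyond that there is no graph to hit.
    """
    expected_tokens = num_reqs * uniform_decode_query_len
    return (num_input_tokens == expected_tokens
            and num_input_tokens <= max_capture_size)
```

With a check like this, batches outside the uniform decode query range (e.g. mixed prefill/decode, or token counts above the capture limit) fall through to the unpadded path instead of being padded needlessly.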

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-12-18 21:09:23 +08:00