[BugFix]Fix mtp torchair bug caused by #2719 (#3566)

### What this PR does / why we need it?
Fix the MTP torchair bug caused by #2719.
Because FIA needs extra space for padding, we must enforce
`self.max_num_seqs > self.scheduler_config.max_num_seqs` in the KV consumer
+ MTP case.
In other words, `self.max_num_seqs` must be strictly greater (**>**) than the
actual maximum number of requests (`self.scheduler_config.max_num_seqs`).
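
The constraint above can be sketched as a small standalone check. This is only an illustration, not code from this PR: `check_padding_headroom` and its parameter names are hypothetical, and the amount of padding that FIA actually needs is not stated here.

```python
def check_padding_headroom(runner_max_num_seqs: int,
                           scheduler_max_num_seqs: int) -> int:
    """Return the number of spare request slots left for padding.

    Hypothetical helper: mirrors the invariant
    `self.max_num_seqs > self.scheduler_config.max_num_seqs`
    described for the KV consumer + MTP case.
    """
    if runner_max_num_seqs <= scheduler_max_num_seqs:
        raise ValueError(
            "runner capacity must be strictly greater than the scheduler's "
            f"max_num_seqs ({runner_max_num_seqs} <= {scheduler_max_num_seqs})"
        )
    # Slots beyond the scheduler limit are the headroom available for padding.
    return runner_max_num_seqs - scheduler_max_num_seqs
```

For example, `check_padding_headroom(17, 16)` reports one spare slot, while equal values raise a `ValueError` because no room is left for padding.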

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
xuyexiong
2025-10-21 22:21:44 +08:00
committed by GitHub
parent 534f32d27c
commit 79821106e6
2 changed files with 9 additions and 4 deletions

@@ -2390,7 +2390,7 @@ class NPUModelRunner(LoRAModelRunnerMixin):
# for dummy run with LoRA so that the num_reqs collectively
# has num_tokens in total.
assert num_tokens <= self.scheduler_config.max_num_batched_tokens
-max_num_reqs = self.scheduler_config.max_num_seqs
+max_num_reqs = self.max_num_reqs
if uniform_decode:
num_reqs = cdiv(num_tokens, max_query_len)
num_scheduled_tokens_list = [max_query_len] * num_reqs
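
The uniform-decode branch in this hunk can be illustrated with a standalone sketch. The two lines shown in the diff (`cdiv` and the list construction) are reproduced. The remainder handling and the bound check against `max_num_reqs` are assumptions added for illustration, since the hunk is cut off at this point. `split_uniform_decode` is a hypothetical name, and `cdiv` is reimplemented locally as ceiling division.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for integers (local stand-in for the `cdiv` used in the diff)."""
    return -(a // -b)


def split_uniform_decode(num_tokens: int, max_query_len: int,
                         max_num_reqs: int) -> list[int]:
    """Split `num_tokens` into per-request token counts for a dummy run."""
    num_reqs = cdiv(num_tokens, max_query_len)
    # Assumption: the dummy batch must fit in the runner's request capacity.
    if num_reqs > max_num_reqs:
        raise ValueError(f"{num_reqs} requests exceed capacity {max_num_reqs}")
    num_scheduled_tokens_list = [max_query_len] * num_reqs
    # Assumption: the last request absorbs any remainder so the total matches.
    remainder = num_tokens % max_query_len
    if remainder:
        num_scheduled_tokens_list[-1] = remainder
    return num_scheduled_tokens_list
```

With `max_query_len = 2` (for instance, one decode token plus one speculative token), `split_uniform_decode(5, 2, 4)` yields three requests. This suggests why the capacity used for `max_num_reqs` matters: a limit that is too small would reject or truncate the dummy batch.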