### What this PR does / why we need it?

Fix the MTP torchair bug caused by #2719.

Since FIA needs extra space for padding, we need to enforce `self.max_num_seqs > self.scheduler_config.max_num_seqs` in KV consumer + MTP. This means that `self.max_num_seqs` must be strictly greater than (**>**) the actual maximum number of requests (`self.scheduler_config.max_num_seqs`).

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
@@ -2390,7 +2390,7 @@ class NPUModelRunner(LoRAModelRunnerMixin):
         # for dummy run with LoRA so that the num_reqs collectively
         # has num_tokens in total.
         assert num_tokens <= self.scheduler_config.max_num_batched_tokens
-        max_num_reqs = self.scheduler_config.max_num_seqs
+        max_num_reqs = self.max_num_reqs
         if uniform_decode:
             num_reqs = cdiv(num_tokens, max_query_len)
             num_scheduled_tokens_list = [max_query_len] * num_reqs
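To illustrate why the request cap in the hunk matters, here is a minimal standalone sketch of the uniform-decode split. It is not the runner's actual code: `split_uniform_decode`, the local `cdiv`, and all numeric values are hypothetical, and trimming the last request to the remainder is an assumption based on the comment that the requests collectively hold `num_tokens`.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division; a local stand-in for the cdiv helper seen in the hunk."""
    return -(-a // b)


def split_uniform_decode(num_tokens: int, max_query_len: int) -> list[int]:
    """Split num_tokens into dummy requests of at most max_query_len tokens."""
    # These two lines mirror the context lines of the hunk.
    num_reqs = cdiv(num_tokens, max_query_len)
    num_scheduled_tokens_list = [max_query_len] * num_reqs
    # Assumption (not shown in the hunk): trim the last request so that the
    # requests add up to exactly num_tokens.
    remainder = num_tokens % max_query_len
    if remainder != 0:
        num_scheduled_tokens_list[-1] = remainder
    return num_scheduled_tokens_list


# Hypothetical limits: the scheduler admits 4 requests, while the runner
# reserves 2 extra slots for padding, giving a runner-side cap of 6.
scheduler_max_num_seqs = 4
runner_max_num_reqs = 6

tokens_per_req = split_uniform_decode(num_tokens=12, max_query_len=2)
num_reqs = len(tokens_per_req)

# The dummy batch fits the runner-side cap but not the scheduler-side one.
print(num_reqs, num_reqs <= runner_max_num_reqs, num_reqs <= scheduler_max_num_seqs)
```

With these made-up numbers the dummy batch needs six request slots, which a cap taken from the scheduler's limit alone would not cover.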