From 28a15299ea4d9a46772cdb90bd0961df02a3c2d7 Mon Sep 17 00:00:00 2001
From: Angazenn <92204292+Angazenn@users.noreply.github.com>
Date: Wed, 12 Nov 2025 20:32:50 +0800
Subject: [PATCH] [cherry-pick][v0.11.0-dev][bugfix] Change seq_lens in dummy attn_metadata to max_query_len (#4099)

### What this PR does / why we need it?
This is a cherry-pick of #4097.

Currently, we set `seq_lens` in the dummy attn_metadata to `max_model_len` to get the maximum attention workspace during capturing. However, always setting it to `max_model_len` causes dummy_run to execute a long attention when running actual inference. For example, if there is a single request with `seq_lens` of [8] but `max_model_len` is 131072, the whole process is slowed down by dummy_run because it executes a fake long-sequence attention. Therefore, we set it to `max_query_len` instead, which is also consistent with the vLLM GPU implementation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

---------

Signed-off-by: Angazenn
---
 vllm_ascend/worker/model_runner_v1.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vllm_ascend/worker/model_runner_v1.py b/vllm_ascend/worker/model_runner_v1.py
index ffb2f44..5ad4340 100644
--- a/vllm_ascend/worker/model_runner_v1.py
+++ b/vllm_ascend/worker/model_runner_v1.py
@@ -2258,7 +2258,7 @@ class NPUModelRunner(LoRAModelRunnerMixin):
         attn_metadata = {}
-        seq_lens = self.model_config.max_model_len
+        seq_lens = max_query_len
         self.seq_lens_np[:num_reqs] = seq_lens
         self.seq_lens_np[num_reqs:] = 0
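
As a rough illustration of the description's example (one request of length 8 with `max_model_len` of 131072), the sketch below mimics the two slice assignments from the diff using plain Python lists. The helper names `fill_dummy_seq_lens` and `attended_tokens`, the buffer size of 4, and the choice of 8 for `max_query_len` are assumptions made for illustration and are not part of vllm-ascend. The sum of sequence lengths is only a crude proxy for attention work.

```python
def fill_dummy_seq_lens(buffer_size, num_reqs, seq_len):
    """Hypothetical helper mirroring the two slice assignments in the diff."""
    assert 0 <= num_reqs <= buffer_size
    seq_lens = [0] * buffer_size
    # First num_reqs slots get the dummy sequence length; the rest stay 0.
    seq_lens[:num_reqs] = [seq_len] * num_reqs
    return seq_lens


def attended_tokens(seq_lens):
    """Crude proxy for attention work: total key/value tokens attended to."""
    return sum(seq_lens)


MAX_MODEL_LEN = 131072  # value taken from the PR description's example
MAX_QUERY_LEN = 8       # assumed here to match the example request length

before = fill_dummy_seq_lens(buffer_size=4, num_reqs=1, seq_len=MAX_MODEL_LEN)
after = fill_dummy_seq_lens(buffer_size=4, num_reqs=1, seq_len=MAX_QUERY_LEN)

# Ratio of dummy attention lengths before and after the change.
ratio = attended_tokens(before) // attended_tokens(after)
print(before, after, ratio)
```

Under these assumptions the dummy run attends over a far shorter sequence after the change. The real saving depends on the attention kernel and on how `max_query_len` is chosen by the runner.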