From 28a15299ea4d9a46772cdb90bd0961df02a3c2d7 Mon Sep 17 00:00:00 2001
From: Angazenn <92204292+Angazenn@users.noreply.github.com>
Date: Wed, 12 Nov 2025 20:32:50 +0800
Subject: [PATCH] [cherry-pick][v0.11.0-dev][bugfix] Change seq_lens in dummy
 attn_metadata to max_query_len (#4099)

<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?
This is cherry-pick from #4097 .
Currently, we set `seq_lens` in dummy attn_metadata to be
`max_model_len` to get max workspace for attention during capturing.
However, setting it consistently to be `max_model_len` causing dummy_run
to execute a long attention when running actual inference. For example,
if there is a single req with `seqs_lens` as [8] but `max_model_len` is
131072, the whole process will be slow down by dummy_run as it execute a
fake long-seq attention. Therefore, we instead set it to max_query_len,
which is also consistent with vLLM gpu implementation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->

---------

Signed-off-by: Angazenn <supperccell@163.com>
---
 vllm_ascend/worker/model_runner_v1.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vllm_ascend/worker/model_runner_v1.py b/vllm_ascend/worker/model_runner_v1.py
index ffb2f44..5ad4340 100644
--- a/vllm_ascend/worker/model_runner_v1.py
+++ b/vllm_ascend/worker/model_runner_v1.py
@@ -2258,7 +2258,7 @@ class NPUModelRunner(LoRAModelRunnerMixin):
 
             attn_metadata = {}
 
-            seq_lens = self.model_config.max_model_len
+            seq_lens = max_query_len
             self.seq_lens_np[:num_reqs] = seq_lens
             self.seq_lens_np[num_reqs:] = 0