Angazenn fc7e5cd9dc [main][bugfix] Change seq_lens in dummy attn_metadata to max_query_len (#4097)
### What this PR does / why we need it?
Currently, we set `seq_lens` in the dummy attn_metadata to
`max_model_len` in order to reserve the maximum attention workspace during
capturing. However, always setting it to `max_model_len` causes dummy_run
to execute a long attention when running actual inference. For example,
if there is a single request with `seq_lens` of [8] but `max_model_len` is
131072, the whole process is slowed down by dummy_run because it executes a
fake long-seq attention. Therefore, we instead set it to `max_query_len`,
which is also consistent with the vLLM GPU implementation.
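
The idea can be sketched as follows. This is a minimal illustration, not the actual vllm-ascend code: the `DummyAttnMetadata` class and `build_dummy_attn_metadata` helper are hypothetical names used only to show the before/after behavior.

```python
from dataclasses import dataclass

@dataclass
class DummyAttnMetadata:
    """Hypothetical stand-in for the dummy attention metadata."""
    seq_lens: list[int]

def build_dummy_attn_metadata(num_reqs: int,
                              max_query_len: int,
                              max_model_len: int) -> DummyAttnMetadata:
    # Before this PR (illustrative): seq_lens was filled with max_model_len,
    # so dummy_run executed a fake long-seq attention (e.g. 131072 tokens)
    # even when the real requests were short.
    #   seq_lens = [max_model_len] * num_reqs
    # After this PR: use max_query_len instead, matching the vLLM GPU
    # implementation, so dummy_run stays cheap during actual inference.
    return DummyAttnMetadata(seq_lens=[max_query_len] * num_reqs)
```

With `max_query_len=8` and `max_model_len=131072`, the dummy metadata now carries `seq_lens=[8]` instead of `[131072]`, so the padded dummy attention no longer dominates runtime.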

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main: 83f478bb19

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-11-12 17:31:39 +08:00