Angazenn fc7e5cd9dc [main][bugfix] Change seq_lens in dummy attn_metadata to max_query_len (#4097)
### What this PR does / why we need it?
Currently, we set `seq_lens` in the dummy attn_metadata to
`max_model_len` in order to reserve the maximum attention workspace during
capturing. However, always setting it to `max_model_len` causes dummy_run
to execute a long attention when running actual inference. For example,
if there is a single request with `seq_lens` of [8] but `max_model_len` is
131072, the whole process is slowed down by dummy_run because it executes a
fake long-seq attention. Therefore, we instead set it to `max_query_len`,
which is also consistent with the vLLM GPU implementation.
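
The idea can be sketched as follows. This is a minimal illustration, not the actual vllm-ascend code: the `DummyAttnMetadata` class and `build_dummy_attn_metadata` helper are hypothetical names used only to show the before/after behavior.

```python
from dataclasses import dataclass

@dataclass
class DummyAttnMetadata:
    """Hypothetical stand-in for the dummy attention metadata."""
    seq_lens: list[int]

def build_dummy_attn_metadata(num_reqs: int,
                              max_query_len: int,
                              max_model_len: int) -> DummyAttnMetadata:
    # Before this PR (illustrative): seq_lens was filled with max_model_len,
    # so dummy_run executed a fake long-seq attention (e.g. 131072 tokens)
    # even when the real requests were short.
    #   seq_lens = [max_model_len] * num_reqs
    # After this PR: use max_query_len instead, matching the vLLM GPU
    # implementation, so dummy_run stays cheap during actual inference.
    return DummyAttnMetadata(seq_lens=[max_query_len] * num_reqs)
```

With `max_query_len=8` and `max_model_len=131072`, the dummy metadata now carries `seq_lens=[8]` instead of `[131072]`, so the padded dummy attention no longer dominates runtime.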

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main: 83f478bb19

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-11-12 17:31:39 +08:00