xc-llm-ascend

Yizhou 274b708e0c [Fix] Refactor dummy attention metadata creation (#3497)

### What this PR does / why we need it?
The `force_attention` parameter is designed for FlashInfer kernel
warmup; we don't actually need it on Ascend devices (at least for
now), and it tends to make things more complicated. So we replace the
`force_attention` parameter with `aclgraph_runtime_mode` in the
attention metadata creation logic.

This change makes the control flow more explicit by directly using the
graph runtime mode to determine how to build attention metadata, rather
than relying on an intermediate boolean flag. This simplification
removes redundant logic and clarifies the conditions for building
attention metadata for full decode graph mode.
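
As a rough illustration of the change described above, the sketch below contrasts a boolean-driven builder with one keyed on the runtime mode. Everything in it is hypothetical: `GraphRuntimeMode`, both builder functions, and the dict-shaped metadata are stand-ins chosen for this example, not the actual vllm-ascend code, and it assumes the old flag was derived by the caller from the same mode.

```python
from enum import Enum
from typing import Optional


class GraphRuntimeMode(Enum):
    """Hypothetical stand-in for the graph runtime mode (not the real enum)."""
    NONE = 0
    PIECEWISE = 1
    FULL = 2


def build_dummy_attn_metadata_old(force_attention: bool,
                                  num_tokens: int) -> Optional[dict]:
    # Before (assumed shape): the caller translates the runtime mode into an
    # intermediate boolean, and the builder only sees that flag.
    if force_attention:
        return {"num_tokens": num_tokens}
    return None


def build_dummy_attn_metadata_new(aclgraph_runtime_mode: GraphRuntimeMode,
                                  num_tokens: int) -> Optional[dict]:
    # After (assumed shape): the builder checks the runtime mode directly,
    # so the condition for building metadata is visible in one place.
    if aclgraph_runtime_mode == GraphRuntimeMode.FULL:
        return {"num_tokens": num_tokens}
    return None
```

In this sketch, a dummy run in full graph mode would call `build_dummy_attn_metadata_new(GraphRuntimeMode.FULL, num_tokens)`, while any other mode returns `None` and skips metadata creation.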

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
DP + `FULL_DECODE_ONLY` + online serving.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>

2025-10-21 00:00:42 +08:00

test_input_batch.py

[New model] Qwen3-next support (#2917)

2025-09-16 01:17:42 +08:00

test_model_runner_v1.py

Revert "[Disagg][Perf] Use NPU event sync instead of blocking tolist (#3194)

2025-09-26 06:17:36 +08:00

test_worker_v1.py

[Fix] Refactor dummy attention metadata creation (#3497)

2025-10-21 00:00:42 +08:00