Jade Zheng 0c6349610e [Feature] Reduce host memory usage for attention mask generation (#3048)
### What this PR does / why we need it?

Previously, the mask construction process created multiple tensors of
size (max_model_len, max_model_len). When max_model_len reached 128k,
host memory usage for a single GPU process exceeded hundreds of GB,
causing the process to crash with an OOM. This update optimizes the
mask generation to significantly reduce host memory consumption.
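To illustrate the scale of the problem, the following sketch (not the PR's actual code; `causal_mask_slice` is a hypothetical helper) shows why a full (max_model_len, max_model_len) mask is so costly and how building only the slice a batch needs avoids it:

```python
import numpy as np

MAX_MODEL_LEN = 128 * 1024  # 131072

# A single full (max_model_len, max_model_len) float16 mask would need:
full_bytes = MAX_MODEL_LEN * MAX_MODEL_LEN * 2  # ~32 GiB per tensor


def causal_mask_slice(query_start: int, num_queries: int, seq_len: int,
                      dtype=np.float16) -> np.ndarray:
    """Build only the (num_queries, seq_len) slice of the causal mask
    needed for the current batch, instead of materializing the full
    (max_model_len, max_model_len) matrix up front."""
    rows = np.arange(query_start, query_start + num_queries)[:, None]
    cols = np.arange(seq_len)[None, :]
    # Position j is visible to query i iff j <= i (causal attention).
    return np.where(cols <= rows, 0.0, -np.inf).astype(dtype)


# A decode step at position 4096 with a 4097-token context needs only:
m = causal_mask_slice(4096, 1, 4097)
print(m.shape, m.nbytes)  # (1, 4097), 8194 bytes vs ~32 GiB for the full mask
```

The memory footprint then scales with the live batch's sequence lengths rather than with max_model_len squared, which is the kind of reduction this change targets.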

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI pass.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-10-21 20:19:04 +08:00