xc-llm-ascend

Files

panchao-hub 1756efa5fd [Feat][Graph]Support FULL_DECEDE_ONLY mode for MLA models (#3125 )

### What this PR does / why we need it?
Adds support for capturing the Multi-Layer Attention (MLA) decode
operation into an ACL graph. This improves performance by compiling the
attention kernel for single-token decoding.

Key changes include:
- Implementing the graph capture logic for the MLA kernel, including
workspace management and parameter updates.
- Modifying the rotary embedding (RoPE) handling to use pre-allocated
tensors, which is a requirement for graph capture.
- Adding a `build_for_graph_capture` method to the MLA metadata builder
to create dummy metadata during the graph compilation phase.

Known issues:
- Currently, MTP is not supported in FULL_DECEDE_ONLY mode -- we're
working on a fix
- We are preparing to remove update_mla_attn_params with
auto_dispatch_capture

### Does this PR introduce _any_ user-facing change?
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
### How was this patch tested?


- vLLM version: v0.11.0

---------

Signed-off-by: panchao-hub <315134829@qq.com>
Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>

2025-10-10 16:31:20 +08:00

e2e

[Feat][Graph]Support FULL_DECEDE_ONLY mode for MLA models (#3125 )

2025-10-10 16:31:20 +08:00

[Feat][Graph]Support FULL_DECEDE_ONLY mode for MLA models (#3125 )

2025-10-10 16:31:20 +08:00

__init__.py

[SpecDecode] Add spec decode support (#500 )

2025-04-17 20:16:32 +08:00