[Feat][Graph] Support FULL_DECODE_ONLY mode for MLA models (#3125)

### What this PR does / why we need it?
Adds support for capturing the Multi-head Latent Attention (MLA) decode
operation into an ACL graph. This improves performance by compiling the
attention kernel for single-token decoding.

Key changes include:
- Implementing the graph capture logic for the MLA kernel, including
workspace management and parameter updates.
- Modifying the rotary embedding (RoPE) handling to use pre-allocated
tensors, which is a requirement for graph capture.
- Adding a `build_for_graph_capture` method to the MLA metadata builder
to create dummy metadata during the graph compilation phase.
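
The pre-allocation requirement above can be illustrated with a toy sketch. It is not the code from this PR: `ToyGraph`, `rope_like_kernel`, and `run_step` are hypothetical names, and plain Python lists stand in for device tensors. It only shows the general pattern that a captured graph is bound to specific buffers, so new `cos`/`sin` values have to be copied into those buffers in place before each replay.

```python
# Toy illustration only: a real captured graph records device kernels,
# not Python callables. All names here are hypothetical.

class ToyGraph:
    """Records operations bound to specific buffer objects, then replays them."""

    def __init__(self):
        self._ops = []

    def capture(self, fn, *buffers):
        # The graph keeps references to these exact buffer objects.
        self._ops.append((fn, buffers))

    def replay(self):
        for fn, buffers in self._ops:
            fn(*buffers)


MAX_TOKENS = 4

# Allocated once, before capture; never rebound afterwards.
cos_buf = [0.0] * MAX_TOKENS
sin_buf = [0.0] * MAX_TOKENS
out_buf = [0.0] * MAX_TOKENS


def rope_like_kernel(cos, sin, out):
    # Stand-in computation; a real kernel would apply the rotary embedding.
    for i in range(len(out)):
        out[i] = cos[i] + sin[i]


graph = ToyGraph()
graph.capture(rope_like_kernel, cos_buf, sin_buf, out_buf)


def run_step(new_cos, new_sin):
    """Copy fresh values into the captured buffers in place, then replay."""
    n = len(new_cos)
    cos_buf[:n] = new_cos  # in-place update keeps the captured reference valid
    sin_buf[:n] = new_sin
    graph.replay()
    return out_buf[:n]
```

Allocating a fresh list (or tensor) for each step would leave the captured graph reading the old buffers, which is why the update has to happen in place.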

Known issues:
- MTP is not yet supported in FULL_DECODE_ONLY mode; a fix is in
progress.
- We are preparing to remove `update_mla_attn_params` with
`auto_dispatch_capture`.

### Does this PR introduce _any_ user-facing change?
Yes. The new mode is enabled through the following `compilation_config` setting:
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
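
As a minimal sketch of how this setting might be passed in practice, assuming it is forwarded to vLLM's `LLM` constructor through its `compilation_config` argument; the model name is a placeholder, not a value from this PR:

```python
# Assumption: the setting is handed to vLLM's LLM constructor.
# "<your-mla-model>" is a placeholder, not a real model name.
compilation_config = {
    "cudagraph_mode": "FULL_DECODE_ONLY",
}

engine_kwargs = {
    "model": "<your-mla-model>",
    "compilation_config": compilation_config,
}

# Uncomment on a machine with vLLM and vllm-ascend installed:
# from vllm import LLM
# llm = LLM(**engine_kwargs)
```
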
### How was this patch tested?


- vLLM version: v0.11.0

---------

Signed-off-by: panchao-hub <315134829@qq.com>
Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>

panchao-hub, 2025-10-10 16:31:20 +08:00, committed by GitHub

parent ba19dd3183, commit 1756efa5fd

8 changed files with 303 additions and 50 deletions

									
vllm_ascend/attention/utils.py
												
@@ -63,6 +63,10 @@ class AscendCommonAttentionMetadata:
    graph_pad_size: int = -1

    # NOTE: This is a temporary solution for rotary embedding in MLA
    cos: torch.Tensor = None
    sin: torch.Tensor = None

def split_decodes_and_prefills(
    common_attn_metadata: AscendCommonAttentionMetadata,
