[Feat] Support running MTP in full graph mode (#3892)

### What this PR does / why we need it?
Currently, the MTP model still runs in eager mode even when full graph mode is
enabled. This PR adapts MTP to full-graph capture and execution: when the graph
mode is set to "FULL_DECODE_ONLY", MTP runs as a full graph to improve
performance.

The changes, covering both the `disable_padded_drafter_batch=True` and
`=False` cases, include:

1. Add `_mtp_graph_params` in `acl_graph.py` to isolate the main model's data
from the MTP's data.
2. Pad some metadata in `mla_v1.py` when in full-graph mode.
3. Fix the essential data addresses that are used in `model.forward`.
4. Adapt to the aclgraph capture framework:
    1). Rebuild the MTP model with `ACLGraphWrapper`.
    2). Add common attention metadata when capture starts in the MTP `dummy_run`.
    3). Add common attention metadata updates in MTP.
    4). Adapt the data updates when `num_speculative_tokens > 1`.
5. Add an MTP patch to adapt to vLLM v0.11.0.
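The data-update adaptation for `num_speculative_tokens > 1` (item 4) hinges on one invariant visible in the diff below: in MTP decode, each request occupies `num_speculative_tokens + 1` token slots, so the cumulative query lengths advance in that multiple. A minimal sketch of that computation (the function name here is illustrative, not the actual vllm-ascend code):

```python
def spec_actual_seq_lengths_q(runtime_shape: int,
                              num_speculative_tokens: int) -> list[int]:
    """Cumulative query-token boundaries when each decode request
    contributes 1 verified token plus num_speculative_tokens drafts."""
    spec_multiple = num_speculative_tokens + 1
    return [
        spec_multiple * (i + 1)
        for i in range(runtime_shape // spec_multiple)
    ]
```

For example, with `runtime_shape=8` and `num_speculative_tokens=3`, each request takes 4 slots, so the boundaries are `[4, 8]`.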

Existing Issues:
1. When `disable_padded_drafter_batch=True` and running in FullGraph mode,
the data of the first-round requests in MTP is abnormal. We still need to
identify the cause.
2. When `disable_padded_drafter_batch=False` and running in FullGraph mode,
the acceptance rate of the second and third tokens decreases (for example,
with `num_speculative_tokens=3`, the acceptance rate of the first token is
90%, but the second is only 50% instead of 60%, and the third only 20%
instead of 30%). The cause is that the data processed after the model runs
does not match; this comes from another PR. It works fine in eager and
PIECEWISE modes but misbehaves in FullGraph mode. Once we have a solution,
we will submit a bugfix.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Committed by anon189Ty on 2025-11-20 20:34:54 +08:00 (via GitHub).
Parent: 15c1eb025c · Commit: 5c9f4a40c6
8 changed files with 536 additions and 42 deletions


```diff
@@ -241,7 +241,10 @@ def update_attn_params(update_stream, forward_context, runtime_shape):
 def update_mla_attn_params(update_stream, forward_context, runtime_shape,
                            speculative_config):
-    graph_params = get_graph_params()
+    if forward_context.is_mtp_model:
+        graph_params = get_mtp_graph_params()
+    else:
+        graph_params = get_graph_params()
     # FIXME: Behold! We are using a temporary hack here to update the args
     # for each layer's attention op in the graph.
     with torch.npu.stream(update_stream):
```
@@ -257,7 +260,8 @@ def update_mla_attn_params(update_stream, forward_context, runtime_shape,
softmax_lse) = param
seq_lens_list = forward_context.attn_metadata[
key].decode.seq_lens_list
if speculative_config and speculative_config.method == "deepseek_mtp":
if speculative_config and speculative_config.method == "deepseek_mtp" \
and not forward_context.is_mtp_model:
actual_seq_lengths = forward_context.attn_metadata[
key].decode.actual_seq_lengths_q
spec_multiple = speculative_config.num_speculative_tokens + 1
@@ -267,6 +271,13 @@ def update_mla_attn_params(update_stream, forward_context, runtime_shape,
spec_multiple * (i + 1)
for i in range(runtime_shape // spec_multiple)
]
elif forward_context.is_mtp_model:
actual_seq_lengths = forward_context.attn_metadata[
key].decode.actual_seq_lengths_q
block_table = forward_context.attn_metadata[
key].decode.block_table
seq_lens_list = seq_lens_list + [0] * (
len(actual_seq_lengths) - len(seq_lens_list))
else:
seq_lens_list = seq_lens_list + [0] * (runtime_shape -
len(seq_lens_list))
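The padding in the MTP branch above can be illustrated in isolation: full-graph replay reuses tensors whose shapes were fixed at capture time, so a shorter live batch is zero-padded up to the captured length rather than resized. A minimal sketch (the helper name is illustrative):

```python
def pad_seq_lens(seq_lens_list: list[int], captured_len: int) -> list[int]:
    """Zero-pad live per-request sequence lengths to the captured size.

    Graph replay requires the metadata to keep the exact shape it had at
    capture time, so unused request slots are filled with zeros.
    """
    assert len(seq_lens_list) <= captured_len, "batch exceeds captured size"
    return seq_lens_list + [0] * (captured_len - len(seq_lens_list))
```

For example, `pad_seq_lens([5, 3], 4)` returns `[5, 3, 0, 0]`, matching a graph captured for 4 request slots.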
```diff
@@ -443,3 +454,32 @@ def update_graph_params_workspaces(num_tokens: int, workspace: int):
 def get_graph_params():
     return _graph_params
+
+
+_mtp_graph_params: Optional[GraphParams] = None
+
+
+def set_mtp_graph_params(aclgraph_capture_sizes: set[int]):
+    global _mtp_graph_params
+    if _mtp_graph_params is not None:
+        raise ValueError("MTPGraph parameters have already been set!")
+    _mtp_graph_params = GraphParams(
+        {size: []
+         for size in aclgraph_capture_sizes},
+        {size: None
+         for size in aclgraph_capture_sizes},
+        {size: []
+         for size in aclgraph_capture_sizes},
+        {size: []
+         for size in aclgraph_capture_sizes},
+    )
+
+
+def update_mtp_graph_params_workspaces(num_tokens: int, workspace: Any):
+    global _mtp_graph_params
+    if _mtp_graph_params is not None:
+        _mtp_graph_params.workspaces[num_tokens] = workspace
+
+
+def get_mtp_graph_params():
+    return _mtp_graph_params
```
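The intended lifecycle of these helpers appears to be: initialize the MTP container once when capture sizes are known, fill in a workspace per capture size, and read the container back at replay time. A self-contained sketch of that lifecycle, using a simplified stand-in for `GraphParams` (only the `workspaces` mapping is modeled here):

```python
from typing import Any, Optional


class GraphParams:
    # Simplified stand-in for the real container: one workspace slot
    # per aclgraph capture size.
    def __init__(self, sizes: set[int]):
        self.workspaces: dict[int, Any] = {s: None for s in sizes}


_mtp_graph_params: Optional[GraphParams] = None


def set_mtp_graph_params(aclgraph_capture_sizes: set[int]) -> None:
    global _mtp_graph_params
    if _mtp_graph_params is not None:
        # One-time initialization, mirroring the guard in the diff above.
        raise ValueError("MTPGraph parameters have already been set!")
    _mtp_graph_params = GraphParams(aclgraph_capture_sizes)


def update_mtp_graph_params_workspaces(num_tokens: int, workspace: Any) -> None:
    if _mtp_graph_params is not None:
        _mtp_graph_params.workspaces[num_tokens] = workspace


def get_mtp_graph_params() -> Optional[GraphParams]:
    return _mtp_graph_params


# Lifecycle: set once, fill workspaces per capture size, read at replay.
set_mtp_graph_params({4, 8})
update_mtp_graph_params_workspaces(4, "workspace-for-4-tokens")
try:
    set_mtp_graph_params({4})
except ValueError:
    pass  # double initialization is rejected
```

Keeping this state in a separate `_mtp_graph_params` global (rather than reusing `_graph_params`) is what lets the MTP graphs and the main-model graphs hold independent per-size workspaces.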