[Feature] Support using fullgraph with eagle (#5118)

### What this PR does / why we need it?

This PR adds support for using full graph (fullgraph) mode with EAGLE speculative decoding.

Change list:
1. Distinguish between handling `graph_params` and `draft_graph_params` in `attention_v1`.
2. Adapt `eagle_proposer` to full-graph mode, including:
    1. If full graph is used, wrap the model with the fullgraph wrapper at model-load time.
    2. Build new metadata, set the running mode to FULL, and mark the attention update in `dummy_run` when in fullgraph mode.
    3. Fix and fill attention metadata fields such as `attn_metadata.slot_mapping`.
    4. Add a descriptor.
    5. Set the running mode and trigger the metadata update.
3. Rename `is_mtp_model` to `is_draft_model`, and update the workspace accordingly.

NOTE:
When `async_scheduling=True` is set, the draft model is forced to execute in eager mode.
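The NOTE above matches the condition visible in the diff below: the draft model falls back to eager whenever eager execution is forced or async scheduling is enabled. A hedged sketch (the function name and parameters are illustrative, not the vLLM API):

```python
def draft_model_runs_eager(enforce_eager: bool,
                           use_async_scheduling: bool) -> bool:
    # Mirrors the NOTE: async scheduling forces the draft model into
    # eager mode regardless of the fullgraph configuration.
    return enforce_eager or use_async_scheduling
```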

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
This commit is contained in:
anon189Ty, 2025-12-29 09:54:51 +08:00 (committed by GitHub)
parent f81cf694b2, commit 3e67e8276c
11 changed files with 348 additions and 103 deletions


```diff
@@ -89,8 +89,8 @@ from vllm_ascend.attention.utils import (AscendCommonAttentionMetadata,
 # yapf conflicts with isort for this block
 # yapf: disable
 from vllm_ascend.compilation.acl_graph import (ACLGraphWrapper,
+                                               set_draft_graph_params,
                                                set_graph_params,
-                                               set_mtp_graph_params,
                                                update_attn_dcp_pcp_params,
                                                update_attn_params,
                                                update_mla_attn_dcp_pcp_params,
```
```diff
@@ -1104,7 +1104,8 @@ class NPUModelRunner(GPUModelRunner):
                 self.spec_decode_common_attn_metadata is None:
             self.spec_decode_common_attn_metadata = common_attn_metadata
         if self.speculative_config.method in ("eagle", "eagle3") and \
-                self.vllm_config.compilation_config.cudagraph_mode.has_full_cudagraphs():
+                (self.vllm_config.speculative_config.enforce_eager \
+                 or self.use_async_scheduling):
             self.spec_decode_common_attn_metadata = \
                 self.spec_decode_common_attn_metadata.unpadded(
                     total_num_scheduled_tokens, base_num_reqs)
```
```diff
@@ -2916,7 +2917,7 @@ class NPUModelRunner(GPUModelRunner):
         # we set the graph params right before initializing the keys.
         set_graph_params(self.cudagraph_batch_sizes)
         if self.speculative_config:
-            set_mtp_graph_params(self.cudagraph_batch_sizes)
+            set_draft_graph_params(self.cudagraph_batch_sizes)
         self.cudagraph_dispatcher.initialize_cudagraph_keys(
             self.compilation_config.cudagraph_mode,
```
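Filling fields such as `attn_metadata.slot_mapping` (change-list item 2.3) typically means padding per-token metadata up to the bucket size the full graph was captured with, so replay always sees tensors of the expected shape. A minimal sketch; `PAD_SLOT_ID`, the list-based types, and the trim behavior are illustrative assumptions, not the vllm_ascend implementation:

```python
PAD_SLOT_ID = -1  # assumed dummy slot; real padding value may differ

def pad_slot_mapping(slot_mapping: list[int], padded_len: int) -> list[int]:
    """Pad (or trim) slot_mapping to the captured bucket size so a
    full-graph replay sees the shape it was captured with; padded
    entries point at a dummy slot."""
    if len(slot_mapping) >= padded_len:
        return slot_mapping[:padded_len]
    return slot_mapping + [PAD_SLOT_ID] * (padded_len - len(slot_mapping))
```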