[Feat] Merge the multi eagle graphs to one graph (#5940)
### What this PR does / why we need it?
This PR merges all steps of the draft model in fullgraph mode, avoiding the
synchronization between each graph and reducing the bubble time.
#### Key ideas:
- The "model forward" of step 0 (the first step) and of the remaining steps
is captured together as a single "Callable", rather than capturing each
step's model individually.
- "update_attn_params" is moved outside the entire graph: all the
"attn_metadata" required by every step is constructed before "replay", and
the "attn_params" of all steps are updated at once.
- Remove the synchronization between the main-model graph and the
draft-model graph.
#### Key params/functions:
- params: draft_attn_metadatas, attn_metadata_multi_steps,
slot_mapping_group
- functions: _run_merged_draft, attn_update_stack_num_spec_norm,
update_attn_params, _propose, dummy_run
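The key ideas above can be sketched without any framework code. This is an illustrative, simplified model of the control flow only, not the actual vllm-ascend implementation: the names `build_attn_metadata_multi_steps` and `merged_draft_forward` are hypothetical stand-ins for the real helpers, and the "forward pass" is faked with arithmetic.

```python
# Hypothetical sketch: merge all draft-model steps into one callable that a
# single graph capture/replay can cover, instead of one graph per step.

def build_attn_metadata_multi_steps(num_steps):
    """Construct the attention metadata for *every* draft step up front,
    so no metadata work (and no per-step sync) happens during replay."""
    return [{"step": i, "slot_mapping": list(range(i, i + 4))}
            for i in range(num_steps)]

def merged_draft_forward(hidden, attn_metadata_multi_steps):
    """One callable covering step 0 and all remaining steps.  Capturing
    this as a single graph removes the synchronization between steps."""
    draft_tokens = []
    for _meta in attn_metadata_multi_steps:
        hidden = hidden + 1          # stand-in for one draft forward pass
        draft_tokens.append(hidden)  # token proposed at this step
    return draft_tokens

# Before "replay": update the attn params of all steps at once ...
metadata = build_attn_metadata_multi_steps(num_steps=3)
# ... then run the merged callable as one unit, with no syncs in between.
tokens = merged_draft_forward(0, metadata)
print(tokens)  # [1, 2, 3]
```

The point of the sketch is structural: because every step's metadata exists before the merged callable runs, the whole multi-step proposal can be replayed as one graph.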
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
11b6af5280
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
```diff
@@ -295,6 +295,7 @@ class TestACLGraphWrapper(TestBase):
         mock_current_platform.get_global_graph_pool.return_value = self.mock_graph_pool
         mock_get_forward_context.return_value = self.mock_forward_context
         self.mock_forward_context.cudagraph_runtime_mode = CUDAGraphMode.FULL
+        self.mock_forward_context.is_draft_model = False

         # Mock torch.npu.NPUGraph
         mock_npu_graph = MagicMock()
```
```diff
@@ -366,6 +367,7 @@ class TestACLGraphWrapper(TestBase):
         mock_current_platform.get_global_graph_pool.return_value = self.mock_graph_pool
         mock_get_forward_context.return_value = self.mock_forward_context
         self.mock_forward_context.cudagraph_runtime_mode = CUDAGraphMode.FULL
+        self.mock_forward_context.is_draft_model = False

         # Mock torch.npu.NPUGraph
         mock_npu_graph = MagicMock()
```