[Refactor][EAGLE] 7/N Merged PCP and disable_padded interface (#6811)
### What this PR does / why we need it?
[Refactor][EAGLE] 7/N Merged PCP and disable_padded interface into
eagle_proposer.py
This pull request refactors speculative decoding by merging the
Prefill Context Parallel (PCP) and Multi-Token Prediction (MTP) code
paths directly into eagle_proposer.py. The goal is correct and efficient
distributed speculative decoding, in particular letting the Eagle
proposer work with the disable_padded interface. Doing so requires
adjustments to attention metadata, input/output processing, and state
management so that the proposer operates correctly under parallel
execution.
1. Migrate the PCP and MTP features into eagle_proposer.py.
2. Integrate the Eagle and PCP features.
3. Enable the Eagle feature to use the disable_padded interface.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tests and UT
- vLLM version: v0.15.0
- vLLM main: 83b47f67b1
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
@@ -39,11 +39,7 @@ class MtpProposer(EagleProposer):
         # Currently, both GLM and DS encounter issues when enabling the fullgraph mode and running on EagleProposer.
         # Therefore, we temporarily bypass this problem by adding a conditional check for fullgraph.
         # TODO: this conditional check should be removed after bug fixing.
-        if (
-            self.pcp_size * self.dcp_size == 1
-            and not self.speculative_config.disable_padded_drafter_batch
-            and not self.vllm_config.compilation_config.cudagraph_mode.has_full_cudagraphs()
-        ):
+        if not self.vllm_config.compilation_config.cudagraph_mode.has_full_cudagraphs():
             super().dummy_run(
                 num_tokens,
                 with_prefill,
@@ -175,11 +171,7 @@ class MtpProposer(EagleProposer):
         # Currently, both GLM and DS encounter issues when enabling the fullgraph mode and running on EagleProposer.
         # Therefore, we temporarily bypass this problem by adding a conditional check for fullgraph.
         # TODO: this conditional check should be removed after bug fixing.
-        if (
-            self.pcp_size * self.dcp_size == 1
-            and not self.speculative_config.disable_padded_drafter_batch
-            and not self.vllm_config.compilation_config.cudagraph_mode.has_full_cudagraphs()
-        ):
+        if not self.vllm_config.compilation_config.cudagraph_mode.has_full_cudagraphs():
             draft_token_ids = super()._propose(
                 target_token_ids,
                 target_positions,