[Refactor][EAGLE] 7/N Merged PCP and disable_padded interface (#6811)
### What this PR does / why we need it?
[Refactor][EAGLE] 7/N Merged PCP and disable_padded interface into
eagle_proposer.py
This pull request refactors speculative decoding by moving the prefill
context parallel (PCP) and Multi-Token Prediction (MTP) logic into
eagle_proposer.py. The goal is to make distributed speculative decoding
more efficient and correct, in particular by letting the Eagle feature
work with the disable_padded interface. This requires changes to
attention metadata, input/output processing, and state management so
that the proposer operates correctly in parallel environments.
1. Migrate the PCP and MTP features into eagle_proposer.py
2. Integrate the Eagle and PCP features
3. Enable the Eagle feature to use the disable_padded interface
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tests and UT
- vLLM version: v0.15.0
- vLLM main: 83b47f67b1
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
@@ -118,6 +118,7 @@ class TestAscendAttentionCPImpl(TestBase):
    attn_metadata = MagicMock()
    attn_metadata.decode_meta = MagicMock()
    attn_metadata.num_decodes_flatten = 5
    attn_metadata.decode_meta.batch_seq_mask = torch.tensor(
        [1, 0], dtype=torch.bool)
    output = self.impl._forward_decode_pcp_dcp(query, attn_metadata)
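The hunk above pins `attn_metadata.num_decodes_flatten` to a concrete integer on the mock. The PR does not say why, so the following is only an illustrative sketch of general `unittest.mock` behaviour, not a description of what `_forward_decode_pcp_dcp` does internally: an attribute that is never assigned on a `MagicMock` is itself a `MagicMock`, so code that expects an `int` (for slicing or arithmetic) needs the value set explicitly. The `rows` list is a hypothetical stand-in for per-token data.

```python
from unittest.mock import MagicMock

# Hypothetical stand-in for the attention metadata built in the test.
attn_metadata = MagicMock()
attn_metadata.decode_meta = MagicMock()

# An attribute that was never assigned is auto-created as another MagicMock,
# not as an int.
auto_attr = attn_metadata.num_decodes_flatten
is_mock_before = isinstance(auto_attr, MagicMock)

# Pinning the attribute to an int gives code under test a real number to use.
attn_metadata.num_decodes_flatten = 5

# Illustrative use only: take the first `num_decodes_flatten` entries.
rows = list(range(8))
decode_rows = rows[:attn_metadata.num_decodes_flatten]
```

After the assignment, `decode_rows` holds the first five entries of `rows`, which is the kind of deterministic value a unit test can assert against.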