[Bugfix] Fix padding logic in eagle proposer for kimi25 (#7348)
### What this PR does / why we need it?
This PR fixes the padding logic in the eagle proposer for kimi25. The main
changes are:
1. modify how the draft model's attention builder and backend are obtained
2. add block table padding and the related tensor slicing to the common metadata
when `draft_step > 1`, fixing a fia verification error
3. replace the block table in `update_graph_params`, also fixing the fia
verification error
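The block-table padding described in change 2 can be sketched as follows. This is a minimal illustration of the general technique, not the actual vllm-ascend code; `pad_block_table`, `target_rows`, `target_cols`, and `pad_id` are hypothetical names:

```python
# Hypothetical sketch of block-table padding for graph replay.
# Captured graphs require static shapes, so when draft_step > 1 the
# per-step block table must be padded out to the shape the graph was
# captured with (and later sliced back to the real size).

def pad_block_table(block_table, target_rows, target_cols, pad_id=0):
    """Pad a 2-D block table (list of lists) to a fixed captured shape.

    Rows beyond target_rows are dropped; short rows are right-padded
    with pad_id; missing rows are filled entirely with pad_id.
    """
    padded = []
    for row in block_table[:target_rows]:
        padded.append(row[:target_cols] + [pad_id] * (target_cols - len(row)))
    while len(padded) < target_rows:
        padded.append([pad_id] * target_cols)
    return padded
```

The same fixed-shape constraint is why the related tensors (sequence lengths, query lengths) must be sliced consistently per draft step.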
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: Zetong Li <slippersss@126.com>
```
@@ -495,10 +495,12 @@ class AscendAttentionBackendImpl(AttentionImpl):
    draft_step = attn_count // num_layers
    seq_lens = attn_metadata[draft_step][key].seq_lens_list
    actual_seq_lengths_q = attn_metadata[draft_step][key].actual_seq_lengths_q
    block_tables = attn_metadata[draft_step][key].block_tables
    attn_count = attn_count + 1
else:
    seq_lens = attn_metadata[key].seq_lens_list
    actual_seq_lengths_q = attn_metadata[key].actual_seq_lengths_q
    block_tables = attn_metadata[key].block_tables

torch.npu.graph_task_update_begin(update_stream, handle)
torch_npu.npu_fused_infer_attention_score.out(
```
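The hunk above indexes the attention metadata first by draft step (`attn_count // num_layers`) and then by layer key when multiple draft steps are run, and by layer key alone otherwise. A minimal, self-contained sketch of that lookup; the `AttnMeta` dataclass and `select_meta` helper are illustrative stand-ins, not vllm-ascend APIs:

```python
# Sketch of per-draft-step metadata selection, assuming attn_metadata is
# either {draft_step: {layer_key: meta}} (multi-step) or {layer_key: meta}.
from dataclasses import dataclass


@dataclass
class AttnMeta:
    seq_lens_list: list
    actual_seq_lengths_q: list
    block_tables: list


def select_meta(attn_metadata, key, attn_count, num_layers, multi_step):
    """Return (meta, new_attn_count) for the current attention call.

    In multi-step drafting, each attention call advances attn_count, so
    attn_count // num_layers recovers which draft step is executing and
    which slice of metadata (seq lens, block tables) it should see.
    """
    if multi_step:
        draft_step = attn_count // num_layers
        return attn_metadata[draft_step][key], attn_count + 1
    return attn_metadata[key], attn_count
```

Keying the metadata by draft step is what lets `update_graph_params` hand each replayed step its own (padded) block table instead of reusing step 0's.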