[Feat] Support async_scheduler and disable_padded_drafter_batch in eagle (#4893)

### What this PR does / why we need it?

We refactored eagle_proposer.py to follow the structure of eagle.py in vLLM v0.12.0, so that it supports the padded drafter batch logic and the async scheduler.

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

---------

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: drslark <slarksblood@qq.com>
2025-12-16 22:06:40 +08:00
parent cee521bad5
commit 5b1da4e914
6 changed files with 577 additions and 403 deletions
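The `slot_indices` change in attention_v1.py slices the slot mapping to `num_actual_tokens`, so that it has the same length as the sliced key and value tensors. The pure-Python sketch below illustrates why that alignment matters when a batch is padded. It is not the `torch_npu` kernel: `scatter_kv`, the list-based cache, and the `-1` padding value are hypothetical stand-ins chosen for illustration.

```python
def scatter_kv(cache, values, slot_indices):
    """Write values[i] into cache[slot_indices[i]].

    Hypothetical stand-in for a KV-cache scatter; it insists that
    values and slot_indices have the same length.
    """
    if len(values) != len(slot_indices):
        raise ValueError(
            f"length mismatch: {len(values)} values vs {len(slot_indices)} slots"
        )
    for value, slot in zip(values, slot_indices):
        cache[slot] = value
    return cache


# A batch padded to 5 entries, of which only the first 3 are real tokens.
num_actual_tokens = 3
key = ["k0", "k1", "k2", "pad", "pad"]
slots = [4, 0, 2, -1, -1]  # -1 is an assumed padding value
cache = [None] * 6

# Slicing both inputs to num_actual_tokens keeps them aligned and
# keeps the padding entries out of the cache.
scatter_kv(cache, key[:num_actual_tokens], slots[:num_actual_tokens])
```

Passing the unsliced `slots` together with the sliced `key` would raise the length-mismatch error in this sketch. Whether the real kernel rejects such input or handles it differently is not stated in this commit.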
--- a/vllm_ascend/attention/attention_v1.py
+++ b/vllm_ascend/attention/attention_v1.py
@@ -730,6 +730,9 @@ class AscendAttentionBackendImpl(AttentionImpl):
                self.key_cache, self.value_cache = kv_cache[0], kv_cache[1]
            slots = attn_metadata.slot_mapping
            if get_ascend_device_type() == AscendDeviceType._910_95:
+                # TODO: Once eagle reaches this path, it may raise an error because of the size of dim 0 of slot_mapping.
+                # Check whether dim 0 of slot_mapping must equal dim 0 of key.
+                # If so, the slots should be sliced.
                torch_npu.npu_scatter_pa_kv_cache(
                    key=key[:attn_metadata.num_actual_tokens],
                    value=value[:attn_metadata.num_actual_tokens].contiguous(),
@@ -742,7 +745,7 @@ class AscendAttentionBackendImpl(AttentionImpl):
                    value=value[:attn_metadata.num_actual_tokens],
                    key_cache=self.key_cache,
                    value_cache=self.value_cache,
-                    slot_indices=slots)
+                    slot_indices=slots[:attn_metadata.num_actual_tokens])
        return key, value

    def forward_impl(