Hybrid kv cache for LLaMA4 (#6563)

Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: tarinkk <rt572@physics.rutger.edu>
Co-authored-by: tarinkk <rt572@rutgers.physics.edu>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
tarinkk
2025-06-27 21:58:55 -04:00
committed by GitHub
parent 357921aa51
commit eb6c2c1663
11 changed files with 519 additions and 59 deletions

@@ -433,9 +433,7 @@ class DecodePreallocQueue:
else 0
)
-available_size = self.token_to_kv_pool_allocator.available_size()
-allocatable_tokens = available_size - max(
+allocatable_tokens = self.token_to_kv_pool_allocator.available_size() - max(
# preserve some space for future decode
self.num_reserved_decode_tokens
* (