[bugfix] Fixing KV Pool Memory Retention and Performance Degradation Issues (#5751)

### What this PR does / why we need it?
1. Fixed memory retention on certain GPUs caused by missing PUT
operations.

2. Fixed performance degradation resulting from architectural
incompatibilities introduced by the underlying refactor.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: fems14 <1804143737@qq.com>
This commit is contained in:
fems14
2026-01-09 17:46:23 +08:00
committed by GitHub
parent 3ba064f804
commit ff4c1a47b3
6 changed files with 27 additions and 22 deletions


@@ -134,6 +134,12 @@ class KVPoolWorker:
                              self.use_mla, partitions)
         real_backend = backend_map.get(self.backend.lower())
+        # be removed later
+        if self.backend == "mooncake":
+            self.head_or_tp_rank = self.tp_rank
+            self.put_step = 1
         self.m_store = real_backend(  # type: ignore[misc]
             parallel_config)
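The hunk above special-cases the `mooncake` backend before instantiating the store. A minimal sketch of that selection pattern, assuming placeholder backend classes (the real store classes and `parallel_config` type in vllm-ascend differ):

```python
# Illustrative sketch of backend selection as in the hunk above.
# MooncakeStore and the config dict are hypothetical stand-ins.
class MooncakeStore:
    def __init__(self, parallel_config):
        self.parallel_config = parallel_config

backend_map = {"mooncake": MooncakeStore}

def pick_backend(backend: str, parallel_config):
    # Normalize the name the same way the diff does, then construct the store.
    real_backend = backend_map.get(backend.lower())
    if real_backend is None:
        raise ValueError(f"unknown KV pool backend: {backend}")
    return real_backend(parallel_config)
```

Failing fast on an unknown backend name is an addition here for clarity; the original code relies on `real_backend` being present in the map.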
@@ -245,7 +251,7 @@ class KVPoolWorker:
             token_len = request.load_spec.kvpool_cached_tokens + 1
         else:
             token_len = request.load_spec.kvpool_cached_tokens
-        request.token_len_chunk = token_len
+        request.load_spec.token_len = token_len
         if self.use_layerwise:
             layerwise_retriever = self.retrieve_layer(request)
             next(layerwise_retriever)  # first layer load
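The change above moves the resolved token length onto the load spec itself, so later consumers read the same value the loader computed. A minimal sketch of that logic, assuming simplified stand-in dataclasses (the real `Request`/`load_spec` objects carry more state):

```python
from dataclasses import dataclass

# Hypothetical stand-ins mirroring the fields used in the hunk above.
@dataclass
class LoadSpec:
    kvpool_cached_tokens: int
    token_len: int = 0

@dataclass
class Request:
    load_spec: LoadSpec

def set_token_len(request: Request, need_extra_token: bool) -> int:
    """Mirror of the hunk's logic: compute the token length from the
    cached-token count and store it on the load spec, not on a separate
    per-chunk field, so every downstream reader sees one value."""
    if need_extra_token:
        token_len = request.load_spec.kvpool_cached_tokens + 1
    else:
        token_len = request.load_spec.kvpool_cached_tokens
    request.load_spec.token_len = token_len
    return token_len
```

Keeping the value on `load_spec` avoids the old split where one code path wrote `token_len_chunk` while others read from the spec.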