[bugfix] Fixing KV Pool Memory Retention and Performance Degradation Issues (#5751)

### What this PR does / why we need it?

1. Fixed memory retention on certain GPUs caused by missing PUT operations.
2. Fixed performance degradation resulting from architectural incompatibilities in the underlying refactor.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main: 2f4e6548ef

---------

Signed-off-by: fems14 <1804143737@qq.com>
2026-01-09 17:46:23 +08:00
parent 3ba064f804
commit ff4c1a47b3
6 changed files with 27 additions and 22 deletions
--- a/vllm_ascend/distributed/kvpool/pool_worker.py
+++ b/vllm_ascend/distributed/kvpool/pool_worker.py
@@ -134,6 +134,12 @@ class KVPoolWorker:
                                                   self.use_mla, partitions)

        real_backend = backend_map.get(self.backend.lower())
+
+        # be removed later
+        if self.backend == "mooncake":
+            self.head_or_tp_rank = self.tp_rank
+            self.put_step = 1
+
        self.m_store = real_backend(  # type: ignore[misc]
            parallel_config)

@@ -245,7 +251,7 @@ class KVPoolWorker:
                token_len = request.load_spec.kvpool_cached_tokens + 1
            else:
                token_len = request.load_spec.kvpool_cached_tokens
-            request.token_len_chunk = token_len
+            request.load_spec.token_len = token_len
            if self.use_layerwise:
                layerwise_retriever = self.retrieve_layer(request)
                next(layerwise_retriever)  # first layer load