[bugfix] [main] Fix KV cache query inconsistency across different TP ranks in the KV Pool (#5030)
### What this PR does / why we need it?
In the current KV Pool scenario, models such as MLA and GQA produce identical
KV caches across different TP ranks, so the system is designed to store only a
single copy. Previously, each card dynamically queried whether it needed to
store a given block, but inconsistent query results across cards led to
incorrect storage. To fix this, the new solution pre-allocates storage
responsibilities: each card now simply stores its pre-assigned blocks,
bypassing the inconsistent query step and ensuring data correctness.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: fems14 <1804143737@qq.com>
This commit is contained in:
@@ -19,7 +19,7 @@ class MemcacheBackend(Backend):
|
||||
|
||||
def __init__(self, parallel_config: ParallelConfig):
|
||||
try:
|
||||
from memcache import DistributedObjectStore # type: ignore
|
||||
from memcache_hybrid import DistributedObjectStore # type: ignore
|
||||
except ImportError as e:
|
||||
raise ImportError(
|
||||
"Please install memcache by following the instructions at "
|
||||
@@ -43,10 +43,7 @@ class MemcacheBackend(Backend):
|
||||
torch.npu.set_device(device)
|
||||
|
||||
def register_buffer(self, ptrs: list[int], sizes: list[int]):
|
||||
for ptr, size in zip(ptrs, sizes):
|
||||
ret_value = self.store.register_buffer(ptr, size)
|
||||
if ret_value != 0:
|
||||
raise RuntimeError("Memcache memory registration failed.")
|
||||
pass
|
||||
|
||||
def exists(self, keys: list[str]) -> list[int]:
|
||||
return self.store.batch_is_exist(keys)
|
||||
|
||||
Reference in New Issue
Block a user