[bugfix] [main] Fix KV cache query inconsistency across different TP ranks in the KV Pool (#5030)
### What this PR does / why we need it?
In the KV Pool, for models that use MLA or GQA attention, different TP
ranks generate identical KV caches, so the system is designed to store
only a single copy. Previously, each card queried the pool at runtime to
decide what it needed to store, but the query results could differ
across cards, which led to incorrect storage. This change pre-assigns
storage responsibilities instead: each card stores only its
pre-assigned blocks, which bypasses the inconsistent query step and
ensures data correctness.
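
The idea of pre-assigned storage can be sketched as below. This is only an illustration under stated assumptions: the helper name `blocks_for_rank` and the round-robin split are hypothetical and are not taken from this PR; the point is that every rank derives its share from the same deterministic rule, so no runtime query is needed and each block is stored exactly once.

```python
from typing import List


def blocks_for_rank(block_ids: List[int], tp_rank: int, tp_size: int) -> List[int]:
    """Return the blocks one TP rank is responsible for storing.

    Hypothetical round-robin split (an assumption, not the PR's actual rule):
    block i in the list goes to rank i % tp_size.
    """
    if tp_size <= 0 or not 0 <= tp_rank < tp_size:
        raise ValueError("tp_rank must satisfy 0 <= tp_rank < tp_size")
    # Slicing with a stride gives every tp_size-th block, starting at tp_rank.
    return block_ids[tp_rank::tp_size]


if __name__ == "__main__":
    blocks = list(range(10))
    for rank in range(4):
        print(rank, blocks_for_rank(blocks, rank, 4))
```

Because the rule depends only on `tp_rank` and `tp_size`, all ranks agree on the split without communicating.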
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c
---------
Signed-off-by: fems14 <1804143737@qq.com>
@@ -310,8 +310,8 @@ class LookupKeyClient:
         self.socket.close(linger=0)
 
 
-def get_zmq_rpc_path_lookup(
-        vllm_config: Optional["VllmConfig"] = None, ) -> str:
+def get_zmq_rpc_path_lookup(vllm_config: "VllmConfig") -> str:
+    dp_rank = vllm_config.parallel_config.data_parallel_rank
     base_url = envs.VLLM_RPC_BASE_PATH
     # Default to 0 if not configured
     rpc_port = 0
@@ -325,4 +325,4 @@ def get_zmq_rpc_path_lookup(
                 "It is recommended to use the lookup_rpc_port, as the mooncake_rpc_port will be removed in the future."
             )
     logger.debug("Base URL: %s, RPC Port: %s", base_url, rpc_port)
-    return f"ipc://{base_url}/lookup_rpc_port_{rpc_port}"
+    return f"ipc://{base_url}/lookup_rpc_port_{rpc_port}_dp_rank{dp_rank}"
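
The effect of the path change in the diff can be illustrated with a small standalone sketch. The helper `lookup_ipc_path` is a hypothetical stand-in that takes plain arguments instead of a `VllmConfig`; only the path format string is taken from the diff. It shows that two DP ranks sharing the same base path and RPC port now resolve to different IPC endpoints.

```python
def lookup_ipc_path(base_url: str, rpc_port: int, dp_rank: int) -> str:
    """Build the lookup IPC path using the format from the diff.

    Hypothetical helper: the real function reads these values from
    the vLLM config and environment rather than taking them as arguments.
    """
    return f"ipc://{base_url}/lookup_rpc_port_{rpc_port}_dp_rank{dp_rank}"


if __name__ == "__main__":
    # Same base path and port, different DP ranks: the paths no longer collide.
    print(lookup_ipc_path("/tmp", 0, 0))
    print(lookup_ipc_path("/tmp", 0, 1))
```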