[Bugfix][PD] Auto-clear producer KV cache if no pull notification (#2174)

### What this PR does / why we need it?

This PR fixes a critical issue in which a failure on Node D (decode) causes Node P (prefill) to hang because Node P cannot release its KV cache.

**Trigger Scenarios:**
1. Node D fails mid-inference (e.g., network disconnection)
2. Node D rejects requests at a certain stage (e.g., via API server)
3. Load-test script termination causes Node P or D to abort queued requests

**Root Cause Analysis:**
1. Currently, Node D sends a "KV cache pull complete, release approved" message to Node P
2. This message is transmitted via the worker connector. If the PD connection breaks or requests are rejected upstream, Node D cannot send the message
3. Node P never releases the KV cache unless it receives this message

**Solution:**
Following the vLLM community's approach (NIXL connector timeout mechanism), we're implementing:
- A timeout mechanism with comprehensive warnings
- Updated README documentation
- Reference: vLLM's optimization PR [#20139](https://github.com/vllm-project/vllm/pull/20139)

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

None

- vLLM version: v0.10.2
- vLLM main: 9607d5eb44

---------

Signed-off-by: underfituu <hzhucong@163.com>
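
For readers who want to see the timeout-based release from the Solution in isolation, here is a minimal standalone sketch. It is not the connector's actual code: `DelayedFreeTracker`, `add_delayed_request`, `mark_pulled`, and `pop_expired` are hypothetical names, and the `timeout_s` argument stands in for the `VLLM_NIXL_ABORT_REQUEST_TIMEOUT` value that the diff reads from `vllm.envs`. It assumes requests are queued in arrival order, so only the head of the queue needs to be checked.

```python
import time
from collections import deque


class DelayedFreeTracker:
    """Hypothetical stand-in for the producer-side (Node P) bookkeeping."""

    def __init__(self, timeout_s):
        # Assumption: timeout_s plays the role of the configured abort timeout.
        self.timeout_s = timeout_s
        # Entries are (request_id, delay_start_time), oldest first.
        self.delayed_free_requests = deque()

    def add_delayed_request(self, request_id, now=None):
        """Start the release timer for a request whose KV cache awaits a pull."""
        start = time.monotonic() if now is None else now
        self.delayed_free_requests.append((request_id, start))

    def mark_pulled(self, request_id):
        """Node D confirmed the pull, so the request no longer needs a timer."""
        self.delayed_free_requests = deque(
            (rid, start)
            for rid, start in self.delayed_free_requests
            if rid != request_id
        )

    def pop_expired(self, now=None):
        """Return ids whose pull notification never arrived within the timeout."""
        current_time = time.monotonic() if now is None else now
        expired = set()
        while self.delayed_free_requests:
            request_id, start = self.delayed_free_requests[0]
            if current_time - start > self.timeout_s:
                self.delayed_free_requests.popleft()
                expired.add(request_id)
            else:
                # The queue is ordered by start time, so later entries
                # cannot have expired either.
                break
        return expired


# Illustrative usage with explicit timestamps; 120.0 is an arbitrary value.
tracker = DelayedFreeTracker(timeout_s=120.0)
tracker.add_delayed_request("req-a", now=0.0)
tracker.add_delayed_request("req-b", now=50.0)
tracker.mark_pulled("req-b")
force_freed = tracker.pop_expired(now=121.0)
```

In this sketch the caller would release the KV cache blocks of every id returned by `pop_expired`, which is what lets Node P recover when Node D never reports back.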
2025-09-23 09:53:34 +08:00
parent 704467cd9a
commit 8dd53c8860
3 changed files with 32 additions and 10 deletions
--- a/vllm_ascend/distributed/mooncake_connector.py
+++ b/vllm_ascend/distributed/mooncake_connector.py
@@ -19,6 +19,7 @@ import numpy.typing as npt
 import torch
 import zmq
 from mooncake.engine import TransferEngine  # type: ignore
+from vllm import envs
 from vllm.config import VllmConfig
 from vllm.distributed.kv_transfer.kv_connector.v1.base import (
    KVConnectorBase_V1, KVConnectorMetadata, KVConnectorRole)
@@ -100,7 +101,7 @@ class KVCacheTaskTracker:
        while self.delayed_free_requests:
            request_id, delay_start_time = self.delayed_free_requests[0]
            if (current_time - delay_start_time
-                    > envs_ascend.VLLM_ASCEND_KVCACHE_DELAY_FREE_TIMEOUT):
+                    > envs.VLLM_NIXL_ABORT_REQUEST_TIMEOUT):
                self.delayed_free_requests.popleft()
                expired_requests.add(request_id)
                logger.info("Force freed request: %s", request_id)