[Bugfix][PD] Auto-clear producer KV cache if no pull notification (#2174)

### What this PR does / why we need it?

This PR fixes a critical issue in which a failure on Node D (the decode
node) causes Node P (the prefill node) to hang because it can never
release its KV cache.

**Trigger Scenarios:**  
1. Node D fails mid-inference (e.g., network disconnection)  
2. Node D rejects requests at a certain stage (e.g., via API server)  
3. Load-test script termination causes Node P or D to abort queued
requests

**Root Cause Analysis:**  
1. Currently, Node D sends a "KV cache pull complete, release approved"
message to Node P
2. This message is transmitted via the worker connector. If the P-D
connection breaks or requests are rejected upstream, Node D cannot send
the message
3. Node P never releases the KV cache unless it receives this message  

**Solution:**  
Following the vLLM community's approach (the NIXL connector timeout
mechanism), this PR implements:
- A timeout mechanism with comprehensive warnings  
- Updated README documentation  
- Reference: vLLM's optimization PR
[#20139](https://github.com/vllm-project/vllm/pull/20139)
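
The timeout idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `DelayedFreeTracker` and its methods are hypothetical names, and the timeout is passed in as a plain argument rather than read from an environment variable.

```python
import time
from collections import deque


class DelayedFreeTracker:
    """Hypothetical sketch of producer-side delayed KV cache release."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the logic can be tested
        # (request_id, start_time) pairs, oldest first.
        self.delayed = deque()

    def add_delayed(self, request_id):
        # Prefill finished; wait for the consumer to pull the KV cache.
        self.delayed.append((request_id, self.clock()))

    def confirm_pull(self, request_id):
        # The consumer reported a completed pull, so stop tracking.
        self.delayed = deque(
            entry for entry in self.delayed if entry[0] != request_id)

    def collect_expired(self):
        # Return requests whose pull notification never arrived in time.
        now = self.clock()
        expired = set()
        while self.delayed:
            request_id, start_time = self.delayed[0]
            if now - start_time > self.timeout_s:
                self.delayed.popleft()
                expired.add(request_id)
            else:
                # Entries are ordered by start time, so the rest are newer.
                break
        return expired
```

Because entries are appended in arrival order, only the head of the queue needs checking, so each sweep costs time proportional to the number of expired requests.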

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

None


- vLLM version: v0.10.2
- vLLM main: 9607d5eb44

---------

Signed-off-by: underfituu <hzhucong@163.com>
hucong
2025-09-23 09:53:34 +08:00
committed by GitHub
parent 704467cd9a
commit 8dd53c8860
3 changed files with 32 additions and 10 deletions

@@ -19,6 +19,7 @@ import numpy.typing as npt
 import torch
 import zmq
 from mooncake.engine import TransferEngine  # type: ignore
+from vllm import envs
 from vllm.config import VllmConfig
 from vllm.distributed.kv_transfer.kv_connector.v1.base import (
     KVConnectorBase_V1, KVConnectorMetadata, KVConnectorRole)
@@ -100,7 +101,7 @@ class KVCacheTaskTracker:
         while self.delayed_free_requests:
             request_id, delay_start_time = self.delayed_free_requests[0]
             if (current_time - delay_start_time
-                    > envs_ascend.VLLM_ASCEND_KVCACHE_DELAY_FREE_TIMEOUT):
+                    > envs.VLLM_NIXL_ABORT_REQUEST_TIMEOUT):
                 self.delayed_free_requests.popleft()
                 expired_requests.add(request_id)
                 logger.info("Force freed request: %s", request_id)