[P/D]mooncake_connector adapted to 0.10.1 (#2664)

### What this PR does / why we need it?

In vLLM 0.10.1, a new KVOutputAggregator was added to the executor, moving aggregation into the executor (https://github.com/vllm-project/vllm/pull/19555). This broke mooncake_connector. This change fixes that breakage and also adds a policy that forcibly releases the KV cache when the prefill node times out.

This PR is currently linked to a vLLM PR (https://github.com/vllm-project/vllm/pull/23917), which modifies the finish and send count confirmation for heterogeneous TP setups.

Many UTs are deleted because much of the communication code has been removed, so the remaining UTs are more concise overall.

- vLLM version: v0.10.1.1
- vLLM main: fa4311d85f

---------

Signed-off-by: baxingpiaochong <771405853@qq.com>
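The forced-release policy described above can be pictured with a minimal sketch. It is illustrative only and is not the connector's actual code: `DelayedFreeTracker`, its methods, and `read_timeout` are hypothetical names. Only the environment variable `VLLM_ASCEND_KVCACHE_DELAY_FREE_TIMEOUT` and its default of 250 seconds come from this change.

```python
import os
import time
from typing import Callable, Dict, List


def read_timeout() -> int:
    # Mirrors the env var added in vllm_ascend/envs.py (default: 250 seconds).
    return int(os.getenv("VLLM_ASCEND_KVCACHE_DELAY_FREE_TIMEOUT", 250))


class DelayedFreeTracker:
    """Hypothetical helper: tracks requests whose KV blocks await release."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested
        self._deadlines: Dict[str, float] = {}

    def mark_delayed(self, req_id: str) -> None:
        # The request finished prefill; its blocks stay allocated until the
        # transfer is confirmed or the deadline passes.
        self._deadlines[req_id] = self._clock() + self._timeout_s

    def confirm_sent(self, req_id: str) -> None:
        # Normal path: the transfer was confirmed, so no forced release.
        self._deadlines.pop(req_id, None)

    def pop_expired(self) -> List[str]:
        # Timeout path: return requests whose blocks should be force-freed.
        now = self._clock()
        expired = [r for r, d in self._deadlines.items() if d <= now]
        for r in expired:
            del self._deadlines[r]
        return expired
```

In this sketch a caller would poll `pop_expired()` periodically and free the blocks of every returned request ID. A monotonic clock is used so that wall-clock adjustments do not affect the deadlines.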
2025-09-04 08:22:10 +08:00
parent 07d44ade19
commit df88a2ecc8
3 changed files with 130 additions and 319 deletions
--- a/vllm_ascend/envs.py
+++ b/vllm_ascend/envs.py
@@ -139,6 +139,11 @@ env_variables: Dict[str, Callable[[], Any]] = {
    # caused by the initialization of the Mooncake connector.
    "PHYSICAL_DEVICES":
    lambda: os.getenv("PHYSICAL_DEVICES", None),
+    # Timeout (in seconds) for delayed KVCache block release. In the prefill
+    # node, if a request is marked for delayed KV block release and the blocks
+    # are not freed within this timeout, they will be forcibly released.
+    "VLLM_ASCEND_KVCACHE_DELAY_FREE_TIMEOUT":
+    lambda: int(os.getenv("VLLM_ASCEND_KVCACHE_DELAY_FREE_TIMEOUT", 250)),
 }

 # end-env-vars-definition