
UnifiedCacheManager 195eac665b [Core][Worker] Add UCMConnector for KV Cache Offloading (#4411)

### What this PR does / why we need it?

This PR introduces the initial integration of **UCM (Unified Cache
Management)** into the vllm-ascend distributed KV-cache system.

Specifically, it adds:
- A new `UCMConnector` implementation under the distributed KV-transfer
framework.
- Support for offloading KV-cache blocks to external UCM backends (DRAM
/ NFS / Localdisk, depending on UCM configuration).
- Integration with vLLM V1 KV connector interface, including metadata
handling and role registration.

**Why it is needed:**
- UCM provides a unified, high-performance storage layer for KV-cache
externalization.
- This enables vllm-ascend to support out-of-core KV-cache workloads,
improve memory efficiency, and leverage hardware-accelerated storage
paths (RDMA / NFS / hybrid modes).
- This connector is a required component to allow future work on
multi-node inference + UCM-based scaling.

---

### Does this PR introduce _any_ user-facing change?

Yes, but limited:

- A new `kv_connector=UCMConnector` option becomes available through the
configuration interface.
- When selected, vllm-ascend workers may initialize UCM and offload
KV-cache blocks externally.
- No default behaviors are changed. Users must explicitly enable this
connector.

This PR does **not** modify:
- existing APIs,
- default execution paths,
- model runner behavior,
- user workflow unless `UCMConnector` is configured.

---

### How was this patch tested?

---

### Prefix Caching Benchmark

We provide preliminary TTFT (ms) measurements from the vLLM benchmark.
Tests ran on 2 × Ascend 910B3, vllm-ascend 0.11.0, tensor parallel size
2, with UCM (Localdisk) enabled.

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>

2025-12-16 10:53:30 +08:00


UCM-Enhanced Prefix Caching Deployment Guide

Overview

Unified Cache Management (UCM) provides an external KV-cache storage layer designed for prefix-caching scenarios in vLLM/vLLM-Ascend. Unlike KV Pooling, which expands prefix-cache capacity only by aggregating device memory and therefore remains limited by HBM/DRAM size and lacks persistence, UCM decouples compute from storage and adopts a tiered design. Each node uses local DRAM as a fast cache, while a shared backend—such as 3FS or enterprise-grade storage—serves as the persistent KV store. This approach removes the capacity ceiling imposed by device memory, enables durable and reliable prefix caching, and allows cache capacity to scale with the storage system rather than with compute resources.

Prerequisites

OS: Linux
Hardware: a machine with Ascend NPUs, typically from the Atlas 800 A2 series.
vLLM: main branch
vLLM Ascend: main branch

UCM Installation

Please refer to the official UCM installation guide for Ascend NPU.

Configure UCM for Prefix Caching

Modify the UCM configuration file to specify which UCM connector to use and where KV blocks should be stored.
You may directly edit the example file at:

unified-cache-management/examples/ucm_config_example.yaml

For up-to-date configuration options, please refer to the official UCM documentation for prefix caching.

A minimal configuration looks like this:

ucm_connectors:
  - ucm_connector_name: "UcmNfsStore"
    ucm_connector_config:
      storage_backends: "/mnt/test"
      use_direct: false

load_only_first_rank: false

Explanation:

ucm_connector_name: Selects the UCM connector. This example uses "UcmNfsStore".
storage_backends: The directory where UCM stores KV blocks. It can be a local directory or an NFS-mounted path. ⚠️ Make sure to replace "/mnt/test" with your actual storage directory.
use_direct: Whether to enable direct I/O (optional). Defaults to false.
load_only_first_rank: Controls whether only rank 0 loads the KV cache and broadcasts it to the other ranks. This feature is not currently supported on Ascend, so it must be set to false (every rank loads and dumps independently).
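The constraints above can be checked before launching the server. The following is a minimal sketch, not part of UCM: the helper name `check_ucm_config` is hypothetical, and it assumes the YAML file has already been parsed into a plain dictionary with the same keys as the example (for instance with a YAML library of your choice).

```python
import os
import tempfile


def check_ucm_config(config: dict) -> list:
    """Return human-readable problems found in a parsed UCM config (hypothetical helper)."""
    problems = []
    connectors = config.get("ucm_connectors") or []
    if not connectors:
        problems.append("ucm_connectors is missing or empty")
    for entry in connectors:
        name = entry.get("ucm_connector_name", "<unnamed>")
        settings = entry.get("ucm_connector_config") or {}
        path = settings.get("storage_backends")
        if not path:
            problems.append(f"{name}: storage_backends is not set")
        elif not os.path.isdir(path):
            problems.append(f"{name}: {path} is not an existing directory")
        elif not os.access(path, os.W_OK):
            problems.append(f"{name}: {path} is not writable")
    # The guide states that rank-0-only loading is unsupported on Ascend,
    # so the key is required to be explicitly false here.
    if config.get("load_only_first_rank") is not False:
        problems.append("load_only_first_rank must be set to false on Ascend")
    return problems


if __name__ == "__main__":
    # A temporary directory stands in for the real storage path.
    with tempfile.TemporaryDirectory() as storage_dir:
        example = {
            "ucm_connectors": [
                {
                    "ucm_connector_name": "UcmNfsStore",
                    "ucm_connector_config": {
                        "storage_backends": storage_dir,
                        "use_direct": False,
                    },
                }
            ],
            "load_only_first_rank": False,
        }
        print(check_ucm_config(example))
```

An empty list means the sketch found nothing to complain about. It does not verify that UCM itself accepts the file.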

Launching Inference

This guide covers online inference with vLLM and the UCM connector, deployed as an OpenAI-compatible server. For best performance with UCM, set block_size to 128.

To start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model, run:

vllm serve Qwen/Qwen2.5-14B-Instruct \
--max-model-len 20000 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.87 \
--block-size 128 \
--trust-remote-code \
--port 7800 \
--enforce-eager \
--no-enable-prefix-caching \
--kv-transfer-config \
'{
    "kv_connector": "UCMConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {"UCM_CONFIG_FILE": "/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml"}
}'

⚠️ Make sure to replace "/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml" with your actual config file path.
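Hand-quoting the JSON passed to --kv-transfer-config is easy to get wrong. As a minimal sketch, the command can instead be assembled in Python so that the JSON is always well formed. The function names here are hypothetical. The flag values mirror the command above, and the flags are written in their hyphenated spelling.

```python
import json
import shlex


def build_kv_transfer_config(ucm_config_file: str) -> str:
    """Serialize the connector settings used in this guide into a JSON string."""
    return json.dumps(
        {
            "kv_connector": "UCMConnector",
            "kv_role": "kv_both",
            "kv_connector_extra_config": {"UCM_CONFIG_FILE": ucm_config_file},
        }
    )


def build_serve_command(model: str, ucm_config_file: str, port: int = 7800) -> list:
    """Return the argv list for the `vllm serve` invocation shown in the guide."""
    return [
        "vllm", "serve", model,
        "--max-model-len", "20000",
        "--tensor-parallel-size", "2",
        "--gpu-memory-utilization", "0.87",
        "--block-size", "128",
        "--trust-remote-code",
        "--port", str(port),
        "--enforce-eager",
        "--no-enable-prefix-caching",
        "--kv-transfer-config", build_kv_transfer_config(ucm_config_file),
    ]


if __name__ == "__main__":
    command = build_serve_command(
        "Qwen/Qwen2.5-14B-Instruct",
        "/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml",
    )
    # shlex.join quotes the JSON argument so the line can be pasted into a shell.
    print(shlex.join(command))
```

The list returned by `build_serve_command` can also be passed directly to `subprocess.run`, which avoids shell quoting altogether.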

If the server log shows output like the following:

INFO:     Started server process [1049932]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Congratulations, you have successfully started the vLLM server with UCM connector!
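Instead of watching the log, readiness can also be probed over HTTP. The sketch below assumes the server exposes the OpenAI-compatible `/v1/models` route on the host and port used in this guide. The helper names `server_is_ready` and `wait_until_ready` are hypothetical.

```python
import http.client
import time
import urllib.request


def server_is_ready(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if GET /v1/models answers with HTTP 200."""
    url = f"http://{host}:{port}/v1/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error responses.
        return False


def wait_until_ready(host: str, port: int, attempts: int = 60, delay: float = 5.0) -> bool:
    """Poll the server until it responds or the attempts are exhausted."""
    for _ in range(attempts):
        if server_is_ready(host, port):
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    # A single probe against the port used in this guide.
    print(server_is_ready("127.0.0.1", 7800))
```

Model loading can take several minutes, so a polling loop such as `wait_until_ready` is usually more practical than a single probe in scripts.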

Evaluating UCM Prefix Caching Performance

After launching the vLLM server with UCMConnector enabled, the easiest way to observe the prefix caching effect is to run the built-in vllm bench CLI. Executing the following command twice in a separate terminal shows the improvement clearly.

vllm bench serve \
--backend vllm \
--model Qwen/Qwen2.5-14B-Instruct \
--host 127.0.0.1 \
--port 7800 \
--dataset-name random \
--num-prompts 12 \
--random-input-len 16000 \
--random-output-len 2 \
--request-rate inf \
--seed 123456 \
--percentile-metrics "ttft,tpot,itl,e2el" \
--metric-percentiles "90,99" \
--ignore-eos

After the first execution

The vllm bench terminal prints the benchmark result:

---------------Time to First Token----------------
Mean TTFT (ms):                           15323.87

Inspecting the vLLM server logs reveals entries like:

INFO ucm_connector.py:228: request_id: xxx, total_blocks_num: 125, hit hbm: 0, hit external: 0

This indicates that during the first run, UCM did not hit any cached KV blocks. As a result, the full 16K-token prefill had to be computed, which explains the relatively large TTFT.

After the second execution

Running the same benchmark again produces:

---------------Time to First Token----------------
Mean TTFT (ms):                            1920.68

The vLLM server logs now contain entries like:

INFO ucm_connector.py:228: request_id: xxx, total_blocks_num: 125, hit hbm: 0, hit external: 125

This indicates that during the second run, UCM retrieved all 125 cached KV blocks from the storage backend. Loading the fully cached prefix instead of recomputing it reduces mean TTFT from 15323.87 ms to 1920.68 ms, roughly an 8× improvement over the first run.
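The numbers in the two runs can be cross-checked with a short script. This is a minimal sketch: the regular expression assumes the log line format shown above, and the helper names are hypothetical. It also confirms that 125 blocks of 128 tokens match the 16000-token random input length used by the benchmark.

```python
import re

# Matches the counters in the ucm_connector.py log lines quoted in this guide.
LOG_PATTERN = re.compile(
    r"total_blocks_num: (?P<total>\d+), "
    r"hit hbm: (?P<hbm>\d+), "
    r"hit external: (?P<external>\d+)"
)


def parse_hit_stats(line: str):
    """Extract block counters from one log line, or return None if absent."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


def external_hit_ratio(stats: dict) -> float:
    """Fraction of the request's blocks that were served from external storage."""
    return stats["external"] / stats["total"] if stats["total"] else 0.0


if __name__ == "__main__":
    first = parse_hit_stats(
        "INFO ucm_connector.py:228: request_id: xxx, "
        "total_blocks_num: 125, hit hbm: 0, hit external: 0"
    )
    second = parse_hit_stats(
        "INFO ucm_connector.py:228: request_id: xxx, "
        "total_blocks_num: 125, hit hbm: 0, hit external: 125"
    )
    print("first run hit ratio:", external_hit_ratio(first))
    print("second run hit ratio:", external_hit_ratio(second))

    block_size = 128  # value passed to the server in this guide
    print("tokens covered:", second["total"] * block_size)

    # Mean TTFT values reported by the two benchmark runs, in milliseconds.
    speedup = 15323.87 / 1920.68
    print(f"TTFT speedup: {speedup:.2f}x")
```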
