[docs] Add kv pool developer guide (#3752)

### What this PR does / why we need it?
Add kv pool developer guide

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
vLLM version: v0.11.0rc3
vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Signed-off-by: pz1116 <zpbzpb123123@gmail.com>
This commit is contained in:
pz1116
2025-11-05 18:03:36 +08:00
committed by GitHub
parent 1ba158567c
commit e0c23cb011
4 changed files with 177 additions and 84 deletions


@@ -0,0 +1,83 @@
# KV Cache Pool
## Why KV Cache Pool?
Prefix caching is an important feature in LLM inference that can drastically reduce prefill computation time.
However, the performance gain from prefix caching depends heavily on the cache hit rate, and the hit rate is limited when only HBM is used for KV cache storage.
Hence, KV Cache Pool is proposed to pool several types of storage, including HBM, DRAM, and SSD, for KV cache storage, while making request prefixes visible across all nodes, which increases the cache hit rate for all requests.
vLLM Ascend currently supports [MooncakeStore](https://github.com/kvcache-ai/Mooncake), one of the most widely recognized KV cache storage engines.
While one can use Mooncake Store in the vLLM V1 engine by setting it as a remote backend of LMCache on GPU (see [Tutorial](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)), we find it better to integrate a connector that supports Mooncake Store directly and can adopt the data transfer strategy best suited to Huawei NPU hardware.
Hence, we propose to integrate Mooncake Store through a brand-new **MooncakeStoreConnectorV1**, which is largely inspired by **LMCacheConnectorV1** (see the `How is MooncakestoreConnectorV1 Implemented?` section).
## Usage
vLLM Ascend currently supports Mooncake Store for KV Cache Pool. To enable Mooncake Store, configure `kv-transfer-config` and choose `MooncakeStoreConnector` as the KV Connector.
For step-by-step deployment and configuration, please refer to the KV Pool User Guide at `vllm-ascend/docs/source/user_guide/feature_guide/kv_pool_mooncake.md`.
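As a minimal sketch of what such a configuration can look like, the snippet below builds the JSON value passed to `--kv-transfer-config` for a colocated deployment. The connector name and keys are copied from the example in the user guide touched by this commit (which spells the connector `MooncakeConnectorStoreV1`); the helper `build_kv_transfer_config` is a hypothetical name used only for illustration, and the user guide remains the authoritative reference.

```python
import json


def build_kv_transfer_config(rpc_port: int = 0) -> str:
    """Return a JSON string for --kv-transfer-config (hypothetical helper)."""
    config = {
        "kv_connector": "MooncakeConnectorStoreV1",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {
            "use_layerwise": False,
            # Each instance needs its own RPC port (given as a string in the guide).
            "mooncake_rpc_port": str(rpc_port),
        },
    }
    return json.dumps(config)


if __name__ == "__main__":
    print(build_kv_transfer_config())
```

The resulting string can be passed as the single-quoted argument of `--kv-transfer-config` when launching the API server.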
## How Does It Work?
The KV Cache Pool integrates multiple memory tiers (HBM, DRAM, SSD, etc.) through a connector-based architecture.
Each connector implements a unified interface for storing, retrieving, and transferring KV blocks between tiers, depending on access frequency and hardware bandwidth.
When combined with vLLM's Prefix Caching mechanism, the pool enables efficient caching both locally (in HBM) and globally (via Mooncake), ensuring that frequently used prefixes remain hot while less frequently accessed KV data can spill over to lower-cost memory.
### 1. Combining KV Cache Pool with HBM Prefix Caching
Prefix Caching with HBM is already supported by the vLLM V1 Engine.
By introducing KV Connector V1, users can seamlessly combine HBM-based Prefix Caching with a Mooncake-backed KV Pool.
The user can enable both features simply by enabling Prefix Caching, which is on by default in vLLM V1 unless the `--no_enable_prefix_caching` flag is set, and setting up the KV Connector for the KV Pool (e.g. the MooncakeStoreConnector).
**Workflow**:
1. The engine first checks for prefix hits in the HBM cache.
2. After getting the number of hit tokens in HBM, it queries the KV Pool via the connector. If there are additional hits in the KV Pool, we fetch **only the additional blocks** from the KV Pool and take the rest directly from HBM, minimizing data transfer latency.
3. After the KV caches from the KV Pool are loaded into HBM, the remaining process is the same as Prefix Caching in HBM.
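The workflow above can be sketched with plain dictionaries standing in for HBM and the KV Pool. This is only an illustration under assumptions: `BLOCK_SIZE`, `count_leading_hits`, and `load_prefix` are hypothetical names, and the pool is assumed to be looked up from the start of the prefix, with the HBM hits subtracted afterwards.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 16  # tokens per KV block; an illustrative value


def count_leading_hits(block_keys: List[str], cache: Dict[str, bytes]) -> int:
    """Count how many leading blocks are present in the given cache."""
    hits = 0
    for key in block_keys:
        if key not in cache:
            break
        hits += 1
    return hits


def load_prefix(block_keys: List[str], hbm: Dict[str, bytes],
                kv_pool: Dict[str, bytes]) -> Tuple[int, int]:
    """Return (tokens hit in HBM, extra tokens loaded from the pool)."""
    # Step 1: prefix hits that are already resident in HBM.
    hbm_hits = count_leading_hits(block_keys, hbm)
    # Step 2: look up the pool and keep only the hits beyond the HBM prefix.
    pool_hits = count_leading_hits(block_keys, kv_pool)
    extra = max(0, pool_hits - hbm_hits)
    # Step 3: copy just the additional blocks into HBM; the HBM hits stay put.
    for key in block_keys[hbm_hits:hbm_hits + extra]:
        hbm[key] = kv_pool[key]
    return hbm_hits * BLOCK_SIZE, extra * BLOCK_SIZE
```

With one block already in HBM and three leading blocks in the pool, only the second and third blocks are transferred; the fourth block is a miss and has to be computed by prefill.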
### 2. Combining KV Cache Pool with Mooncake PD Disaggregation
When used together with Mooncake PD (Prefill-Decode) Disaggregation, the KV Cache Pool can further decouple prefill and decode stages across devices or nodes.
Currently, we only perform KV Pool put and get operations on **Prefill Nodes**; Decode Nodes get their KV Cache from the Mooncake P2P KV Connector, i.e. MooncakeConnector.
The key benefit is that Prefill Nodes keep the performance gain of computing less, thanks to Prefix Caching from HBM and the KV Pool, without sacrificing data transfer efficiency between Prefill and Decode Nodes, because the P2P KV Connector transfers KV Caches directly between NPU devices.
To enable this feature, we need to set up both the Mooncake Connector and the Mooncake Store Connector with a Multi Connector, a KV Connector class provided by vLLM that can call multiple KV Connectors in a specific order.
For details, please also refer to the Mooncake Connector Store Deployment Guide.
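For orientation, the sketch below assembles a Multi Connector configuration that mirrors the prefill-node example in the user guide changed by this commit: the P2P connector comes first and the store connector second. The helper `build_multi_connector_config` is a hypothetical name; the keys and connector names are copied from that example rather than from a formal schema, so treat them as illustrative.

```python
import json


def build_multi_connector_config(kv_role: str, kv_port: int, rpc_port: int) -> str:
    """Return a MultiConnector JSON config (hypothetical helper)."""
    # P2P connector that moves KV caches between prefill and decode nodes.
    p2p = {
        "kv_connector": "MooncakeConnectorV1",
        "kv_role": kv_role,
        "kv_port": str(kv_port),
        "kv_connector_extra_config": {
            "prefill": {"dp_size": 1, "tp_size": 1},
            "decode": {"dp_size": 1, "tp_size": 1},
        },
    }
    # Store connector that talks to the KV Pool.
    store = {
        "kv_connector": "MooncakeConnectorStoreV1",
        "kv_role": kv_role,
        "mooncake_rpc_port": str(rpc_port),
    }
    config = {
        "kv_connector": "MultiConnector",
        "kv_role": kv_role,
        "kv_connector_extra_config": {
            "use_layerwise": False,
            "connectors": [p2p, store],  # order matters: P2P first, then store
        },
    }
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(build_multi_connector_config("kv_producer", 20001, 0))
```

In the guide's example, the prefill node uses `kv_producer` with port `20001` and RPC port `0`, while the decode node uses `kv_consumer` with port `20002` and RPC port `1`.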
## How is MooncakestoreConnectorV1 Implemented?
**MooncakeStoreConnectorV1** inherits the KV Connector V1 class in vLLM V1: by implementing the required methods defined in the KV Connector V1 base class, one can integrate a third-party KV cache transfer/storage backend into the vLLM framework.
MooncakeStoreConnectorV1 is also largely inspired by LMCacheConnectorV1 in terms of the `Lookup Engine`/`Lookup Client` design for looking up KV cache keys, the `ChunkedTokenDatabase` class for processing tokens into prefix-aware hashes, and other hashing-related designs. On top of this, we have added our own designs, including `KVTransferThread`, which allows asynchronous `get` and `put` of KV caches with multi-threading, and NPU-related data transfer optimizations such as removing the `LocalBuffer` used in LMCache to eliminate redundant data transfer.
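To make "prefix-aware hashes" concrete, here is a minimal sketch of the general idea: each full chunk of tokens is hashed together with the hash of everything before it, so two requests produce the same key for a chunk only if their entire prefixes match. This is not the actual `ChunkedTokenDatabase` code; the function name, the chunk size, and the use of SHA-256 are assumptions made for illustration.

```python
import hashlib
from typing import List, Sequence


def chunk_prefix_hashes(token_ids: Sequence[int], chunk_size: int = 4) -> List[str]:
    """Return one key per full chunk; each key depends on the whole prefix."""
    keys: List[str] = []
    prev_digest = b""
    # Only full chunks get a key; a partial tail is ignored.
    full_len = len(token_ids) - len(token_ids) % chunk_size
    for start in range(0, full_len, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        payload = prev_digest + b"|" + ",".join(map(str, chunk)).encode()
        digest = hashlib.sha256(payload)
        prev_digest = digest.digest()  # chain into the next chunk's hash
        keys.append(digest.hexdigest())
    return keys
```

Two prompts that share their first chunk get the same first key, while a chunk with identical tokens but a different preceding prefix gets a different key, which is what lets a lookup stop at the first mismatch.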
The KV Connector methods that need to be implemented fall into scheduler-side methods, which are called in the V1 scheduler, and worker-side methods, which are called in the V1 worker, namely:
### KV Connector Scheduler-Side Methods:
`get_num_new_matched_tokens`: Get the prefix cache hit, in number of tokens, by looking up the KV pool.
`update_state_after_alloc`: Update the KVConnector state after temporary buffer allocation.
`build_connector_meta`: Attach the connector metadata to the request object.
`request_finished`: Once a request is finished, determine whether request blocks should be freed now or will be sent asynchronously and freed later.
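The toy class below sketches how the scheduler-side methods could fit together. It deliberately uses simplified signatures and plain Python types, so it is not the vLLM base-class interface; `ToyRequest`, `ToySchedulerConnector`, and the idea of returning a plain dict as metadata are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ToyRequest:
    request_id: str
    block_keys: List[str]  # one prefix-aware key per full block


@dataclass
class ToySchedulerConnector:
    pool_keys: Set[str]  # keys currently known to exist in the KV pool
    block_size: int = 16
    _to_load: Dict[str, int] = field(default_factory=dict)

    def get_num_new_matched_tokens(self, request: ToyRequest,
                                   num_computed_tokens: int) -> int:
        """Tokens the pool can supply beyond what is already computed locally."""
        hit_blocks = 0
        for key in request.block_keys:
            if key not in self.pool_keys:
                break  # stop at the first miss
            hit_blocks += 1
        return max(0, hit_blocks * self.block_size - num_computed_tokens)

    def update_state_after_alloc(self, request: ToyRequest,
                                 num_external_tokens: int) -> None:
        """Remember which requests need a load once blocks are allocated."""
        if num_external_tokens > 0:
            self._to_load[request.request_id] = num_external_tokens

    def build_connector_meta(self) -> Dict[str, int]:
        """Hand the pending loads to the worker side and reset the state."""
        meta = dict(self._to_load)
        self._to_load.clear()
        return meta

    def request_finished(self, request: ToyRequest) -> bool:
        """Return True if block freeing must be delayed (never, in this toy)."""
        return False
```

For a request whose first three blocks are in the pool and whose first block (16 tokens) is already computed, `get_num_new_matched_tokens` reports 32 extra tokens, which then appear in the metadata built for that scheduling step.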
### KV Connector Worker-Side Methods:
`register_kv_caches`: Register KV cache buffers needed for KV cache transfer.
`start_load_kv`: Perform KV cache load operation that transfers KV cache from storage to device.
`wait_for_layer_load`: Optional; wait for the layer load in the layerwise + async KV load scenario.
`save_kv_layer`: Optional; do a layerwise KV cache put into the KV Pool.
`wait_for_save`: Wait for the KV save to finish when the KV cache save/put is asynchronous.
`get_finished`: Get the requests that have finished KV transfer: `done_sending` if `put` finished, `done_reciving` if `get` finished.
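The asynchronous save path behind `wait_for_save` and `get_finished` can be pictured as a background thread draining a queue. The sketch below is a toy model in that spirit, not the actual `KVTransferThread`: the class name, the dict-backed pool, and the one-task-per-request granularity are all assumptions.

```python
import queue
import threading
from typing import Dict, Set


class ToyTransferThread(threading.Thread):
    """Background worker that drains put requests into a dict-backed pool."""

    def __init__(self, pool: Dict[str, bytes]) -> None:
        super().__init__(daemon=True)
        self.pool = pool
        self.tasks: queue.Queue = queue.Queue()
        self._done: Set[str] = set()
        self._lock = threading.Lock()

    def submit(self, req_id: str, blocks: Dict[str, bytes]) -> None:
        """Queue all blocks of one request for an asynchronous put."""
        self.tasks.put((req_id, blocks))

    def run(self) -> None:
        while True:
            req_id, blocks = self.tasks.get()
            self.pool.update(blocks)  # stands in for the real put
            with self._lock:
                self._done.add(req_id)
            self.tasks.task_done()

    def wait_for_save(self) -> None:
        """Block until every queued put has been processed."""
        self.tasks.join()

    def get_finished(self) -> Set[str]:
        """Return and clear the ids of requests whose put has completed."""
        with self._lock:
            finished, self._done = self._done, set()
        return finished
```

A caller would start the thread once, call `submit` from the forward pass, and poll `get_finished` each step to learn which requests have finished sending.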
## DFX
1. When looking up a key in the KV Pool, if the key cannot be found, there is no cache hit for that block; we return no hit for this block and do not look up further blocks for the current request.
2. Similarly, when we try to put a block into the KV Pool and the put fails, we do not put further blocks (subject to change).
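Both rules amount to "stop at the first failure". A minimal sketch, assuming hypothetical `exists` and `put` callables that stand in for the KV Pool client:

```python
from typing import Callable, List, Tuple


def lookup_prefix(keys: List[str], exists: Callable[[str], bool]) -> int:
    """Return the number of leading blocks found; stop at the first miss."""
    hits = 0
    for key in keys:
        if not exists(key):
            break  # later blocks are not looked up for this request
        hits += 1
    return hits


def put_blocks(blocks: List[Tuple[str, bytes]],
               put: Callable[[str, bytes], bool]) -> int:
    """Return the number of blocks stored; stop at the first failed put."""
    stored = 0
    for key, data in blocks:
        if not put(key, data):
            break  # do not attempt the remaining blocks
        stored += 1
    return stored
```

Stopping at the first miss keeps the cached prefix contiguous, so a block that happens to exist after a gap is never reported as a hit.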
## Limitation
1. Currently, Mooncake Store for vLLM-Ascend only supports DRAM as the storage for KV Cache pool.
2. For now, if we successfully look up a key and find that it exists but then fail to get it when calling the KV Pool's get function, we only log that the get operation failed and keep going; hence, the accuracy of that specific request may be affected. We plan to handle this situation by falling back the request and recomputing everything as if there were no prefix cache hit (or, even better, reverting only the failed block and continuing to use the prefix caches before it).
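The finer-grained fallback mentioned in the second limitation could look roughly like the sketch below: keep the blocks that were fetched before the first failed get and recompute from there. This describes a possible future behaviour, not what the connector does today; `usable_prefix` and the `get` callable returning `None` on failure are hypothetical.

```python
from typing import Callable, List, Optional


def usable_prefix(keys: List[str],
                  get: Callable[[str], Optional[bytes]]) -> List[bytes]:
    """Return the blocks fetched before the first failed get."""
    loaded: List[bytes] = []
    for key in keys:
        data = get(key)
        if data is None:
            # Lookup reported a hit but the get failed: keep what we already
            # have and let prefill recompute from this block onwards.
            break
        loaded.append(data)
    return loaded
```

The number of tokens to recompute is then the request length minus `len(loaded)` times the block size, instead of silently continuing with a missing block.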


@@ -9,4 +9,5 @@ patch
ModelRunner_prepare_inputs
Multi_Token_Prediction
ACL_Graph
KV_Cache_Pool_Guide
:::


@@ -13,4 +13,5 @@ lora
eplb_swift_balancer
netloader
dynamic_batch
kv_pool_mooncake
:::


@@ -8,14 +8,23 @@
* PyTorch == 2.7.1, torch-npu == 2.7.1
* vLLM: main branch
* vLLM-Ascend: main branch
* Mooncake: [AscendTransport/Mooncake at pooling-async-memcpy](https://github.com/AscendTransport/Mooncake/tree/pooling-async-memcpy) (currently available branch code, continuously updated)
Installation and Compilation Guide: https://github.com/AscendTransport/Mooncake/tree/pooling-async-memcpy?tab=readme-ov-file#build-and-use-binaries
* Mooncake: main branch
Installation and Compilation Guide: https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries
Make sure to build with `-DUSE_ASCEND_DIRECT` to enable ADXL engine.
An example command for compiling with ADXL enabled:
`rm -rf build && mkdir -p build && cd build && cmake .. -DCMAKE_INSTALL_PREFIX=/opt/transfer-engine/ -DCMAKE_POLICY_VERSION_MINIMUM=3.5 -DUSE_ASCEND_DIRECT=ON -DBUILD_SHARED_LIBS=ON -DBUILD_UNIT_TESTS=OFF && make -j && make install`
Also, you need to set the library path so that the built libraries can be found, e.g. `export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib64/python3.11/site-packages/mooncake`, or copy the .so files to the `/usr/local/lib64` directory after compilation.
### KV Pooling Parameter Description
**kv_connector_extra_config**: Additional configurable parameters for pooling
**mooncake_rpc_port**: Port for RPC communication between the pooling scheduler process and the worker process. Each instance requires a unique port.
**load_async**: Whether to enable asynchronous loading. The default value is false.
**register_buffer**: Whether to register video memory with the backend. Registration is not required when used with MooncakeConnectorV1; it is required in all other cases. The default value is false.
**kv_connector_extra_config**: Additional configurable parameters for pooling.
**mooncake_rpc_port**: Port for RPC communication between the pooling scheduler process and the worker process. Each instance requires a unique port.
**load_async**: Whether to enable asynchronous loading. The default value is false.
**register_buffer**: Whether to register video memory with the backend. Registration is not required when used with MooncakeConnectorV1; it is required in all other cases. The default value is false.
## run mooncake master
@@ -29,26 +38,32 @@ The environment variable **MOONCAKE_CONFIG_PATH** is configured to the full path
"metadata_server": "P2PHANDSHAKE",
"protocol": "ascend",
"device_name": "",
"use_ascend_direct": true,
"alloc_in_same_node": true,
"master_server_address": "xx.xx.xx.xx:50088",
"global_segment_size": 30000000000
}
```
**local_hostname**: Configured as the IP address of the current master node,
**metadata_server**: Configured as **P2PHANDSHAKE**,
**protocol:** Configured for Ascend to use Mooncake's HCCL communication,
**device_name**: ""
**master_server_address**: Configured with the IP and port of the master service
**global_segment_size**: Expands the kvcache size registered by the PD node to the master
**local_hostname**: Configured as the IP address of the current master node.
**metadata_server**: Configured as **P2PHANDSHAKE**.
**protocol:** Configured for Ascend to use Mooncake's HCCL communication.
**device_name**: ""
**use_ascend_direct**: Indicator for using ADXL engine.
**alloc_in_same_node**: Indicator for preferring local buffer allocation strategy.
**master_server_address**: Configured with the IP and port of the master service.
**global_segment_size**: Expands the kvcache size registered by the PD node to the master.
### 2. Start mooncake_master
Under the mooncake folder:
```
mooncake_master --port 50088
mooncake_master --port 50088 --eviction_high_watermark_ratio 0.95 --eviction_ratio 0.05
```
`eviction_high_watermark_ratio` determines the watermark at which Mooncake Store will perform eviction, and `eviction_ratio` determines the portion of stored objects that will be evicted.
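As a rough mental model of how the two flags interact, the toy function below decides how many objects to evict for a given usage level. It only restates the description above in code; it is not Mooncake's eviction logic, and the function name, the byte-based usage measure, and the rounding are assumptions.

```python
def objects_to_evict(used_bytes: int, capacity_bytes: int, num_objects: int,
                     high_watermark_ratio: float = 0.95,
                     eviction_ratio: float = 0.05) -> int:
    """Toy model: evict a fixed share of objects once usage passes the watermark."""
    if capacity_bytes <= 0:
        return 0
    if used_bytes / capacity_bytes < high_watermark_ratio:
        return 0  # below the watermark: nothing to evict
    # At or above the watermark: evict eviction_ratio of the stored objects.
    return max(1, int(num_objects * eviction_ratio))
```

With the values from the command above, a store at 96% usage holding 1000 objects would evict about 50 of them in this model, while a store at 90% usage would evict none.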
## Pooling and Prefill Decode Disaggregate Scenario
### 1. Run `prefill` Node and `decode` Node
@@ -69,11 +84,9 @@ export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export MOONCAKE_CONFIG_PATH="/xxxxxx/mooncake.json"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export ASCEND_TRANSPORT_PRINT=1
export ACL_OP_INIT_MODE=1
# The environment variable above switches Mooncake memory swap logging: 1 enables it and 0 disables it.
export ASCEND_AGGREGATE_ENABLE=1
# The environment variable above switches the Mooncake aggregation function: 1 turns it on and 0 turns it off.
export ASCEND_BUFFER_POOL=4:8
# ASCEND_BUFFER_POOL configures the number and size of the buffers on the NPU device used for aggregation and KV transfer; the value 4:8 means we allocate 4 buffers of 8 MB each.
python3 -m vllm.entrypoints.openai.api_server \
--model /xxxxx/Qwen2.5-7B-Instruct \
@@ -88,34 +101,34 @@ python3 -m vllm.entrypoints.openai.api_server \
--max-num-batched-tokens 4096 \
--kv-transfer-config \
'{
"kv_connector": "MultiConnector",
"kv_role": "kv_producer",
"kv_connector_extra_config": {
"use_layerwise": false,
"connectors": [
{
"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_port": "20001",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 1,
"tp_size": 1
},
"decode": {
"dp_size": 1,
"tp_size": 1
}
}
},
{
"kv_connector": "MooncakeConnectorStoreV1",
"kv_role": "kv_producer",
"kv_connector": "MultiConnector",
"kv_role": "kv_producer",
"kv_connector_extra_config": {
"use_layerwise": false,
"connectors": [
{
"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_port": "20001",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 1,
"tp_size": 1
},
"decode": {
"dp_size": 1,
"tp_size": 1
}
}
},
{
"kv_connector": "MooncakeConnectorStoreV1",
"kv_role": "kv_producer",
"mooncake_rpc_port":"0"
}
]
}
}' > p.log 2>&1
}
]
}
}' > p.log 2>&1
```
`decode` Node
@@ -133,10 +146,7 @@ export MOONCAKE_CONFIG_PATH="/xxxxx/mooncake.json"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7
export ACL_OP_INIT_MODE=1
export ASCEND_TRANSPORT_PRINT=1
# The environment variable above switches Mooncake memory swap logging: 1 enables it and 0 disables it.
export ASCEND_AGGREGATE_ENABLE=1
# The environment variable above switches the Mooncake aggregation function: 1 turns it on and 0 turns it off.
export ASCEND_BUFFER_POOL=4:8
python3 -m vllm.entrypoints.openai.api_server \
--model /xxxxx/Qwen2.5-7B-Instruct \
@@ -151,33 +161,34 @@ python3 -m vllm.entrypoints.openai.api_server \
--max-num-batched-tokens 4096 \
--kv-transfer-config \
'{
"kv_connector": "MultiConnector",
"kv_role": "kv_consumer",
"kv_connector_extra_config": {
"use_layerwise": false,
"connectors": [
{
"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_port": "20002",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 1,
"tp_size": 1
},
"decode": {
"dp_size": 1,
"tp_size": 1
}
}
},
{
"kv_connector": "MooncakeConnectorStoreV1",
"kv_role": "kv_consumer",
"kv_connector": "MultiConnector",
"kv_role": "kv_consumer",
"kv_connector_extra_config": {
"use_layerwise": false,
"connectors": [
{
"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_port": "20002",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 1,
"tp_size": 1
},
"decode": {
"dp_size": 1,
"tp_size": 1
}
}
},
{
"kv_connector": "MooncakeConnectorStoreV1",
"kv_role": "kv_consumer",
"mooncake_rpc_port":"1"
}
]
}
}
]
}
}' > d.log 2>&1
```
@@ -234,10 +245,7 @@ export MOONCAKE_CONFIG_PATH="/xxxxxx/mooncake.json"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export ACL_OP_INIT_MODE=1
export ASCEND_TRANSPORT_PRINT=1
# The environment variable above switches Mooncake memory swap logging: 1 enables it and 0 disables it.
export ASCEND_AGGREGATE_ENABLE=1
# The environment variable above switches the Mooncake aggregation function: 1 turns it on and 0 turns it off.
export ASCEND_BUFFER_POOL=4:8
python3 -m vllm.entrypoints.openai.api_server \
--model /xxxxx/Qwen2.5-7B-Instruct \
@@ -252,12 +260,12 @@ python3 -m vllm.entrypoints.openai.api_server \
--max-num-batched-tokens 4096 \
--kv-transfer-config \
'{
"kv_connector": "MooncakeConnectorStoreV1",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"use_layerwise": false,
"kv_connector": "MooncakeConnectorStoreV1",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"use_layerwise": false,
"mooncake_rpc_port":"0"
}
}
}' > mix.log 2>&1
```
@@ -275,4 +283,4 @@ Long question:
```
curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_tokens": 256, "temperature":0.0 }'
```
```