[Doc][KV Pool] Revise KV Pool User Guide (#7434)

### What this PR does / why we need it?
Revise the KV Pool user guide:
1. Revise Mooncake environment variables and kv connector extra configs.
2. Delete `use_ascend_direct` from the kv connector extra config, as it is
deprecated.
3. Delete `kv_buffer_device` and `kv_rank` from the P2P Mooncake config.
4. Unify the default `max-model-len` and `max-num-batched-tokens` in the
given examples.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main: 4497431df6

---------

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: Chao Lei <leichao139636@163.com>
Commit 3effc4bc70 (parent ab9cd2e305), authored by pz1116 on 2026-03-19 10:13:13 +08:00, committed by GitHub.
8 changed files with 58 additions and 86 deletions


@@ -124,7 +124,6 @@ vllm serve /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp \
"kv_port": "30000",
"engine_id": "0",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 1,
"tp_size": 16
@@ -192,7 +191,6 @@ vllm serve /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp \
"kv_port": "30000",
"engine_id": "1",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 1,
"tp_size": 16


@@ -185,7 +185,6 @@ The template for the mooncake.json file is as follows:
"metadata_server": "P2PHANDSHAKE",
"protocol": "ascend",
"device_name": "",
"use_ascend_direct": true,
"master_server_address": "<your_server_ip>:50088",
"global_segment_size": 107374182400
}
@@ -195,7 +194,6 @@ The template for the mooncake.json file is as follows:
| --------------| ------------------------| -----------------------------------|
| metadata_server | P2PHANDSHAKE | Point-to-point handshake mode |
| protocol | ascend | Ascend proprietary protocol |
| use_ascend_direct | true | Enable direct hardware access |
| master_server_address | 90.90.100.188:50088 (for example) | Master server address |
| global_segment_size | 107374182400 | Size per segment (100 GB) |


@@ -564,7 +564,6 @@ Before you start, please
"kv_port": "30000",
"engine_id": "0",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 16
@@ -639,7 +638,6 @@ Before you start, please
"kv_port": "30000",
"engine_id": "0",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 16
@@ -716,7 +714,6 @@ Before you start, please
"kv_port": "30100",
"engine_id": "1",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 16
@@ -793,7 +790,6 @@ Before you start, please
"kv_port": "30100",
"engine_id": "1",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 16


@@ -448,7 +448,6 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
"kv_port": "30000",
"engine_id": "0",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 8
@@ -513,7 +512,6 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
"kv_port": "30100",
"engine_id": "1",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 8
@@ -579,7 +577,6 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
"kv_port": "30100",
"engine_id": "1",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 8


@@ -3,18 +3,24 @@
## Environmental Dependencies
* Software:
* Python >= 3.10, < 3.12
* CANN == 8.3.rc2
* PyTorch == 2.8.0, torch-npu == 2.8.0
* CANN >= 8.5.0
* vLLM: main branch
* vLLM-Ascend: main branch
* mooncake >= 0.3.9
### KV Pool Parameter Description
**kv_connector_extra_config**: Additional Configurable Parameters for Pooling.
**lookup_rpc_port**: Port for RPC Communication Between Pooling Scheduler Process and Worker Process: Each Instance Requires a Unique Port Configuration.
**load_async**: Whether to Enable Asynchronous Loading. The default value is false.
**backend**: Set the storage backend for kvpool, with the default being mooncake.
#### `kv_connector_extra_config`: Additional Configurable Parameters for Pooling
| Parameter | Description |
| :--- | :--- |
| `lookup_rpc_port` | Port for RPC communication between the pooling scheduler process and the worker process; each instance requires a unique port. |
| `load_async` | Whether to enable asynchronous loading. Default: `false`. |
| `backend` | Storage backend for the KV pool. Default: `mooncake`. |
| `consumer_is_to_put` | Whether the Decode node puts KV cache into the KV pool. Default: `false`. |
| `consumer_is_to_load` | Whether the Decode node loads KV cache from the KV pool. Default: `false`. |
| `prefill_pp_size` | Prefill PP size; must be set when the Prefill node enables PP. |
| `prefill_pp_layer_partition` | Prefill PP layer partition; must be set when the Prefill node enables PP. |
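Taken together, these extra parameters sit under `kv_connector_extra_config` in the kv-transfer-config. A minimal sketch is shown below; the connector name and `kv_role` follow the AscendStoreConnector examples later in this guide, and the values are illustrative defaults, not a recommended deployment:

```json
{
  "kv_connector": "AscendStoreConnector",
  "kv_role": "kv_both",
  "kv_connector_extra_config": {
    "lookup_rpc_port": "0",
    "load_async": false,
    "backend": "mooncake",
    "consumer_is_to_put": false,
    "consumer_is_to_load": false
  }
}
```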
### Environment Variable Configuration
@@ -87,12 +93,11 @@ export PYTHONHASHSEED=0
### Environment Variables Description
`export ASCEND_ENABLE_USE_FABRIC_MEM=1`: Enables the unified memory address direct transmission scheme; it can only be used for the 800 I/T A3 series. The required supporting hardware versions are as follows:
HDK >= 26.0
CANN >= 9.0
`export ASCEND_BUFFER_POOL=4:8`: ASCEND_BUFFER_POOL configures the number and size of buffers on the NPU device for aggregation and KV transfer; the value 4:8 means we allocate 4 buffers of 8 MB each. It can only be used for the 800 I/T A2 series.
| Hardware | HDK & CANN versions | Export Command | Description |
| :--- | :--- | :--- | :--- |
| 800 I/T A3 series | HDK >= 26.0.0<br>CANN >= 9.0.0 | `export ASCEND_ENABLE_USE_FABRIC_MEM=1` | **Recommended**. Enables unified memory address direct transmission scheme. |
| 800 I/T A3 series | 25.5.0<=HDK<26.0.0 | `export ASCEND_BUFFER_POOL=4:8` | Configures the number and size of buffers on the NPU Device for aggregation and KV transfer (e.g., `4:8` means 4 buffers of 8MB). |
| 800 I/T A2 series | N/A | `export HCCL_INTRA_ROCE_ENABLE=1` | Required by the direct transmission scheme on the 800 I/T A2 series |
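As a quick sanity check of the `ASCEND_BUFFER_POOL` value format in the table above, the snippet below parses a `<count>:<size_mb>` setting and reports the total device memory it implies. This is only an illustrative helper, not part of the toolchain; `4:8` is the example value from this guide.

```shell
# Parse ASCEND_BUFFER_POOL=<count>:<size_mb>, e.g. 4:8 -> 4 buffers of 8 MB each.
pool="${ASCEND_BUFFER_POOL:-4:8}"
count="${pool%%:*}"
size_mb="${pool##*:}"
total_mb=$((count * size_mb))
echo "buffers=${count} size_mb=${size_mb} total_mb=${total_mb}"
```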
### Run Mooncake Master
@@ -114,7 +119,7 @@ The environment variable **MOONCAKE_CONFIG_PATH** is configured to the full path
**protocol:** Must be set to 'ascend' on the NPU.
**device_name**: ""
**master_server_address**: Configured with the IP and port of the master service.
**global_segment_size**: Registered memory size per card to the KV Pool.
**global_segment_size**: Registered memory size per card to the KV Pool. **Needs to be aligned to 1GB.**
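Since `global_segment_size` must be aligned to 1 GB, a candidate value can be checked before it goes into mooncake.json. A minimal sketch, using the 107374182400-byte (100 GiB) value from the template above:

```shell
# Check that global_segment_size is a multiple of 1 GiB before writing mooncake.json.
GIB=$((1024 * 1024 * 1024))
seg=107374182400   # 100 GiB, as in the template above
if [ $((seg % GIB)) -eq 0 ]; then
  echo "global_segment_size=${seg} is $((seg / GIB)) GiB (aligned)"
else
  echo "global_segment_size=${seg} is not 1 GiB-aligned" >&2
fi
```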
#### 2. Start mooncake_master
@@ -147,9 +152,10 @@ export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export MOONCAKE_CONFIG_PATH="/xxxxxx/mooncake.json"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export ACL_OP_INIT_MODE=1
# ASCEND_BUFFER_POOL configures the number and size of buffers on the NPU device for aggregation and KV transfer; the value 4:8 means we allocate 4 buffers of 8 MB each.
export ASCEND_BUFFER_POOL=4:8
#A3
export ASCEND_ENABLE_USE_FABRIC_MEM=1
#A2
#export HCCL_INTRA_ROCE_ENABLE=1
# Unit: ms. The timeout for one-sided communication connection establishment is set to 10 seconds by default (see PR: https://github.com/kvcache-ai/Mooncake/pull/1039). Users can adjust this value based on their specific setup.
# The recommended formula is: ASCEND_CONNECT_TIMEOUT = connection_time_per_card (typically within 500ms) × total_number_of_Decode_cards.
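# The recommended formula above can be computed directly. In this sketch,
# decode_cards=16 is a hypothetical deployment size, not a value from this guide.
per_card_ms=500      # typical upper bound per card, per the note above
decode_cards=16      # hypothetical total number of Decode cards
timeout_ms=$((per_card_ms * decode_cards))
echo "export ASCEND_CONNECT_TIMEOUT=${timeout_ms}"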
@@ -164,12 +170,12 @@ python3 -m vllm.entrypoints.openai.api_server \
--port 8100 \
--trust-remote-code \
--enforce-eager \
--no_enable_prefix_caching \
--no-enable-prefix-caching \
--tensor-parallel-size 1 \
--data-parallel-size 1 \
--max-model-len 10000 \
--max-model-len 32768 \
--block-size 128 \
--max-num-batched-tokens 4096 \
--max-num-batched-tokens 16384 \
--kv-transfer-config \
'{
"kv_connector": "MultiConnector",
@@ -181,7 +187,6 @@ python3 -m vllm.entrypoints.openai.api_server \
"kv_role": "kv_producer",
"kv_port": "20001",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 1,
"tp_size": 1
@@ -220,7 +225,10 @@ export PYTHONHASHSEED=0 
export MOONCAKE_CONFIG_PATH="/xxxxx/mooncake.json"
export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7
export ACL_OP_INIT_MODE=1
export ASCEND_BUFFER_POOL=4:8
#A3
export ASCEND_ENABLE_USE_FABRIC_MEM=1
#A2
#export HCCL_INTRA_ROCE_ENABLE=1
export ASCEND_CONNECT_TIMEOUT=10000
export ASCEND_TRANSFER_TIMEOUT=10000
@@ -229,12 +237,12 @@ python3 -m vllm.entrypoints.openai.api_server \
--port 8200 \
--trust-remote-code \
--enforce-eager \
--no_enable_prefix_caching \
--no-enable-prefix-caching \
--tensor-parallel-size 1 \
--data-parallel-size 1 \
--max-model-len 10000 \
--max-model-len 32768 \
--block-size 128 \
--max-num-batched-tokens 4096 \
--max-num-batched-tokens 16384 \
--kv-transfer-config \
'{
"kv_connector": "MultiConnector",
@@ -331,7 +339,10 @@ export MOONCAKE_CONFIG_PATH="/xxxxxx/mooncake.json"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export PYTHONHASHSEED=0 
export ACL_OP_INIT_MODE=1
export ASCEND_BUFFER_POOL=4:8
#A3
export ASCEND_ENABLE_USE_FABRIC_MEM=1
#A2
#export HCCL_INTRA_ROCE_ENABLE=1
export ASCEND_CONNECT_TIMEOUT=10000
export ASCEND_TRANSFER_TIMEOUT=10000
@@ -340,12 +351,12 @@ python3 -m vllm.entrypoints.openai.api_server \
--port 8100 \
--trust-remote-code \
--enforce-eager \
--no_enable_prefix_caching \
--no-enable-prefix-caching \
--tensor-parallel-size 1 \
--data-parallel-size 1 \
--max-model-len 10000 \
--max-model-len 32768 \
--block-size 128 \
--max-num-batched-tokens 4096 \
--max-num-batched-tokens 16384 \
--kv-transfer-config \
'{
"kv_connector": "AscendStoreConnector",
@@ -616,13 +627,12 @@ vllm serve xxxxxxx/Qwen3-32B \
--tensor-parallel-size 4 \
--seed 1024 \
--served-model-name qwen3 \
--max-model-len 65536 \
--max-model-len 32768 \
--max-num-batched-tokens 16384 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--max-num-seqs 20 \
--no-enable-prefix-caching \
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false}' \
--kv-transfer-config \
'{
"kv_connector": "MultiConnector",
@@ -633,11 +643,8 @@ vllm serve xxxxxxx/Qwen3-32B \
{
"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_buffer_device": "npu",
"kv_rank": 0,
"kv_port": "20001",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 4
@@ -699,13 +706,12 @@ vllm serve xxxxxxx/Qwen3-32B \
--tensor-parallel-size 4 \
--seed 1024 \
--served-model-name qwen3 \
--max-model-len 65536 \
--max-model-len 32768 \
--max-num-batched-tokens 16384 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--max-num-seqs 20 \
--no-enable-prefix-caching \
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false}' \
--kv-transfer-config \
'{
"kv_connector": "MultiConnector",
@@ -715,11 +721,8 @@ vllm serve xxxxxxx/Qwen3-32B \
{
"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_buffer_device": "npu",
"kv_rank": 1,
"kv_port": "20002",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 4
@@ -774,10 +777,9 @@ python -m vllm.entrypoints.openai.api_server \
--data-parallel-size 2 \
--tensor-parallel-size 8 \
--port 30050 \
--max-num-seqs 28 \
--max-model-len 16384 \
--max-num-seqs 20 \
--max-model-len 32768 \
--max-num-batched-tokens 16384 \
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false}' \
--enable_expert_parallel \
--quantization ascend \
--gpu-memory-utilization 0.90 \
@@ -792,11 +794,8 @@ python -m vllm.entrypoints.openai.api_server \
{
"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_buffer_device": "npu",
"kv_rank": 0,
"kv_port": "20001",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 8
@@ -846,15 +845,14 @@ python -m vllm.entrypoints.openai.api_server \
--data-parallel-size 2 \
--tensor-parallel-size 8 \
--port 30060 \
--max-model-len 16384 \
--max-num-batched-tokens 5200 \
--max-model-len 32768 \
--max-num-batched-tokens 16384 \
--enforce-eager \
--quantization ascend \
--no-enable-prefix-caching \
--max-num-seqs 28 \
--max-num-seqs 20 \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--enable_expert_parallel \
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false}' \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{
@@ -865,11 +863,8 @@ python -m vllm.entrypoints.openai.api_server \
{
"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_buffer_device": "npu",
"kv_rank": 1,
"kv_port": "20002",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 8
@@ -905,7 +900,7 @@ python -m vllm.entrypoints.openai.api_server \
The deepseek model needs to be run in a two-node cluster.
**Run_hunbu_1.sh:**
**Run_pd_mix_1.sh:**
```shell
rm -rf /root/ascend/log/*
@@ -948,7 +943,7 @@ vllm serve xxxxxxx/DeepSeek-R1 \
--tensor-parallel-size 8 \
--seed 1024 \
--served-model-name deepseek \
--max-model-len 65536 \
--max-model-len 32768 \
--max-num-batched-tokens 16384 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
@@ -956,7 +951,6 @@ vllm serve xxxxxxx/DeepSeek-R1 \
--max-num-seqs 20 \
--enable-expert-parallel \
--no-enable-prefix-caching \
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false}' \
--kv-transfer-config \
'{
"kv_connector": "AscendStoreConnector",
@@ -965,11 +959,11 @@ vllm serve xxxxxxx/DeepSeek-R1 \
"backend": "memcache",
"lookup_rpc_port":"0"
}
}' > log_hunbu_1.log 2>&1
}' > log_pd_mix_1.log 2>&1
```
**Run_hunbu_2.sh:**
**Run_pd_mix_2.sh:**
```shell
rm -rf /root/ascend/log/*
@@ -1014,7 +1008,7 @@ vllm serve xxxxxxx/DeepSeek-R1 \
--tensor-parallel-size 8 \
--seed 1024 \
--served-model-name deepseek \
--max-model-len 65536 \
--max-model-len 32768 \
--max-num-batched-tokens 16384 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
@@ -1022,16 +1016,15 @@ vllm serve xxxxxxx/DeepSeek-R1 \
--max-num-seqs 20 \
--enable-expert-parallel \
--no-enable-prefix-caching \
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false, "chunked_prefill_for_mla":true}' \
--kv-transfer-config \
'{
"kv_connector": "AscendStoreConnector",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"backend": "memcache",
"mooncake_rpc_port":"0"
"lookup_rpc_port":"0"
}
}' > log_hunbu_2.log 2>&1
}' > log_pd_mix_2.log 2>&1
```
@@ -1069,12 +1062,11 @@ python -m vllm.entrypoints.openai.api_server \
-dp 2 \
-tp 8 \
--port 30050 \
--max-num-seqs 28 \
--max-model-len 16384 \
--max-num-seqs 20 \
--max-model-len 32768 \
--max-num-batched-tokens 16384 \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--compilation_config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false, "chunked_prefill_for_mla":true}' \
--enable_expert_parallel \
--quantization ascend \
--gpu-memory-utilization 0.90 \
@@ -1085,10 +1077,8 @@ python -m vllm.entrypoints.openai.api_server \
"kv_role": "kv_both",
"kv_connector_extra_config": {
"backend": "memcache",
"mooncake_rpc_port":"0"
"lookup_rpc_port":"0"
}
}' > log_hunbu.log 2>&1
}' > log_pd_mix.log 2>&1
```
#### [2.Run Inference](#2run-inference)


@@ -60,7 +60,6 @@ deployment:
"kv_port": "30000",
"engine_id": "0",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 16
@@ -106,7 +105,6 @@ deployment:
"kv_port": "30000",
"engine_id": "0",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 16
@@ -152,7 +150,6 @@ deployment:
"kv_port": "30200",
"engine_id": "1",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 16
@@ -200,7 +197,6 @@ deployment:
"kv_port": "30200",
"engine_id": "1",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 16


@@ -45,7 +45,6 @@ deployment:
"kv_port": "30000",
"engine_id": "0",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 8
@@ -84,7 +83,6 @@ deployment:
"kv_port": "30100",
"engine_id": "1",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 8


@@ -45,7 +45,6 @@ mooncake_json = {
"metadata_server": "P2PHANDSHAKE",
"protocol": "ascend",
"device_name": "",
"use_ascend_direct": True,
"master_server_address": "",
"global_segment_size": 30000000000
}