### What this PR does / why we need it?
Revise the KV Pool user guide:
1. Revise Mooncake environment variables and kvconnector extra configs.
2. Delete `use_ascend_direct` in kv connector extra config as it is
deprecated
3. Delete `kv_buffer_device` and `kv_rank` in P2P mooncake config
4. Unify the default `max-model-len` and `max-num-batched-tokens` in the
examples given.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main: 4497431df6
---------
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: Chao Lei <leichao139636@163.com>
#### `kv_connector_extra_config`: Additional Configurable Parameters for Pooling
| Parameter | Description |
| :--- | :--- |
| `lookup_rpc_port` | Port for RPC communication between the pooling scheduler process and the worker process. Each instance requires a unique port. |
| `load_async` | Whether to enable asynchronous loading. The default value is false. |
| `backend` | Storage backend for the KV Pool. The default is mooncake. |
| `consumer_is_to_put` | Whether the Decode node puts KV cache into the KV Pool. The default value is false. |
| `consumer_is_to_load` | Whether the Decode node loads KV cache from the KV Pool. The default value is false. |
| `prefill_pp_size` | Prefill PP size. Must be set when the Prefill node enables PP. |
| `prefill_pp_layer_partition` | Prefill PP layer partition. Must be set when the Prefill node enables PP. |
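
As an illustration of how the parameters in this table fit together, the following is a minimal sketch that assembles a `kv_connector_extra_config` JSON fragment using the defaults listed above. The port value, the variable names `LOOKUP_RPC_PORT` and `KV_CONNECTOR_EXTRA_CONFIG`, and the choice to pass the port as a string are assumptions made for this example; they are not taken from the guide.

```shell
# Hypothetical port value; each instance must use its own unique port.
LOOKUP_RPC_PORT=0

# Assemble the fragment with the documented defaults.
# Assumption: the port is quoted as a string; adjust if your setup expects a number.
KV_CONNECTOR_EXTRA_CONFIG=$(cat <<EOF
{
  "lookup_rpc_port": "${LOOKUP_RPC_PORT}",
  "load_async": false,
  "backend": "mooncake",
  "consumer_is_to_put": false,
  "consumer_is_to_load": false
}
EOF
)

echo "$KV_CONNECTOR_EXTRA_CONFIG"
```

`prefill_pp_size` and `prefill_pp_layer_partition` are left out because they only apply when the Prefill node enables PP.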
### Environment Variable Configuration
@@ -87,12 +93,11 @@ export PYTHONHASHSEED=0
### Environment Variables Description
`export ASCEND_ENABLE_USE_FABRIC_MEM=1`: Enables the unified memory address direct transmission scheme. It can only be used on the 800 I/T A3 series. The required supporting hardware versions are as follows:
HDK >= 26.0
CANN >= 9.0
`export ASCEND_BUFFER_POOL=4:8`: `ASCEND_BUFFER_POOL` configures the number and size of buffers on the NPU device for aggregation and KV transfer; the value `4:8` allocates 4 buffers of 8 MB each. It can only be used on the 800 I/T A2 series.
| 800 I/T A3 series | 25.5.0<=HDK<26.0.0 | `export ASCEND_BUFFER_POOL=4:8` | Configures the number and size of buffers on the NPU Device for aggregation and KV transfer (e.g., `4:8` means 4 buffers of 8MB). |
| 800 I/T A2 series | N/A | `export HCCL_INTRA_ROCE_ENABLE=1` | Required by the direct transmission scheme on 800 I/T A2 series. |
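
To make the `ASCEND_BUFFER_POOL` format concrete, here is a small sketch that splits a `<count>:<size in MB>` value and computes the total buffer memory it implies. The helper variable names (`buf_count`, `buf_size_mb`, `total_mb`) are illustrative only.

```shell
# Example value from the description above: 4 buffers of 8 MB each.
export ASCEND_BUFFER_POOL=4:8

# Split "<count>:<size_mb>" with POSIX parameter expansion.
buf_count=${ASCEND_BUFFER_POOL%%:*}
buf_size_mb=${ASCEND_BUFFER_POOL##*:}

# Total memory taken by these buffers on the NPU device, in MB.
total_mb=$(( buf_count * buf_size_mb ))
echo "buffers=${buf_count} size_mb=${buf_size_mb} total_mb=${total_mb}"
```

With `4:8` this reports 32 MB in total.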
### Run Mooncake Master
@@ -114,7 +119,7 @@ The environment variable **MOONCAKE_CONFIG_PATH** is configured to the full path
**protocol**: Must be set to 'Ascend' on the NPU.
**device_name**: ""
**master_server_address**: Configured with the IP and port of the master service.
**global_segment_size**: Registered memory size per card to the KV Pool. **Needs to be aligned to 1GB.**
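
The alignment requirement can be checked with a little shell arithmetic. The sketch below rounds a requested size up to the next 1 GB boundary. It assumes that "1GB" means 1 GiB (1073741824 bytes) and that `global_segment_size` is given in bytes; neither is stated here, so confirm both against your Mooncake version. The requested size is a hypothetical value.

```shell
# Assumption: "1GB" means 1 GiB = 1073741824 bytes.
GIB=1073741824

# Hypothetical requested size: 30.5 GiB expressed in bytes.
requested_bytes=32749125632

# Round up to the next multiple of 1 GiB.
aligned_bytes=$(( (requested_bytes + GIB - 1) / GIB * GIB ))
echo "global_segment_size=${aligned_bytes}"
```

Here the 30.5 GiB request is rounded up to 31 GiB (33285996544 bytes).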
# ASCEND_BUFFER_POOL configures the number and size of buffers on the NPU device for aggregation and KV transfer; the value 4:8 allocates 4 buffers of 8 MB each.
export ASCEND_BUFFER_POOL=4:8
# A3
export ASCEND_ENABLE_USE_FABRIC_MEM=1
# A2
# export HCCL_INTRA_ROCE_ENABLE=1
# Unit: ms. The timeout for one-sided communication connection establishment is set to 10 seconds by default (see PR: https://github.com/kvcache-ai/Mooncake/pull/1039). Users can adjust this value based on their specific setup.
# The recommended formula is: ASCEND_CONNECT_TIMEOUT = connection_time_per_card (typically within 500ms) × total_number_of_Decode_cards.
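
The recommended formula can be evaluated directly in the shell. This is a sketch under stated assumptions: the Decode card count is a hypothetical value, the per-card time uses the 500 ms upper bound mentioned above, and `PER_CARD_CONNECT_MS` and `DECODE_CARDS` are names invented for this example.

```shell
# Upper bound on per-card connection time, in ms (from the guidance above).
PER_CARD_CONNECT_MS=500
# Hypothetical total number of Decode cards; replace with your own count.
DECODE_CARDS=16

# ASCEND_CONNECT_TIMEOUT is expressed in milliseconds.
export ASCEND_CONNECT_TIMEOUT=$(( PER_CARD_CONNECT_MS * DECODE_CARDS ))
echo "ASCEND_CONNECT_TIMEOUT=${ASCEND_CONNECT_TIMEOUT}"
```

For 16 Decode cards this gives 8000 ms, slightly below the 10-second default.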