From 14411e911e905b08d664fab925a0b454156ef5be Mon Sep 17 00:00:00 2001
From: pz1116 <47019764+Pz1116@users.noreply.github.com>
Date: Tue, 31 Mar 2026 20:17:03 +0800
Subject: [PATCH] [Doc][0.18.0][KV Pool]add mooncake rdma timeout (#7784)

### What this PR does / why we need it?

- Add `--default_kv_lease_ttl` to the `mooncake_master` launch example in the KV Pool guide.
- Document `default_kv_lease_ttl` semantics and clarify that it should be larger than `ASCEND_CONNECT_TIMEOUT` and `ASCEND_TRANSFER_TIMEOUT`.
- Add an explanation of `HCCL_RDMA_TIMEOUT`, the Mooncake RDMA retransmission timeout, including the recommended constraint note.
- Add `HCCL_RDMA_TIMEOUT=17` to the relevant KV Pool environment setup examples for consistency.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

---------

Signed-off-by: Pz1116
---
 docs/source/user_guide/feature_guide/kv_pool.md | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/docs/source/user_guide/feature_guide/kv_pool.md b/docs/source/user_guide/feature_guide/kv_pool.md
index 0950ce09..6239f758 100644
--- a/docs/source/user_guide/feature_guide/kv_pool.md
+++ b/docs/source/user_guide/feature_guide/kv_pool.md
@@ -99,6 +99,11 @@ export PYTHONHASHSEED=0
 | 800 I/T A3 series | 25.5.0<=HDK<26.0.0 | `export ASCEND_BUFFER_POOL=4:8` | Configures the number and size of buffers on the NPU Device for aggregation and KV transfer (e.g., `4:8` means 4 buffers of 8MB). |
 | 800 I/T A2 series | N/A | `export HCCL_INTRA_ROCE_ENABLE=1` | Required by direct transmission scheme on 800 I/T A2 series|
 
+### FAQ for HIXL (ascend_direct) backend
+
+For common troubleshooting and issue localization guidance for HIXL (ascend_direct), see:
+
+
 ### Run Mooncake Master
 
 #### 1.Configure mooncake.json
@@ -126,10 +131,11 @@ The environment variable **MOONCAKE_CONFIG_PATH** is configured to the full path
 Under the mooncake folder:
 
 ```shell
-mooncake_master --port 50088 --eviction_high_watermark_ratio 0.9 --eviction_ratio 0.1
+mooncake_master --port 50088 --eviction_high_watermark_ratio 0.9 --eviction_ratio 0.1 --default_kv_lease_ttl 11000
 ```
 
 `eviction_high_watermark_ratio` determines the watermark where Mooncake Store will perform eviction,and `eviction_ratio` determines the portion of stored objects that would be evicted.
+`default_kv_lease_ttl` sets the default lease TTL for KV objects, in milliseconds. Configure it via `--default_kv_lease_ttl` and keep it larger than both `ASCEND_CONNECT_TIMEOUT` and `ASCEND_TRANSFER_TIMEOUT`.
 
 ### PD Disaggregation Scenario
 
@@ -157,6 +163,11 @@ export ASCEND_ENABLE_USE_FABRIC_MEM=1
 #A2
 #export HCCL_INTRA_ROCE_ENABLE=1
 
+# Minimum RDMA retransmission timeout, equal to 4.096 μs * 2^HCCL_RDMA_TIMEOUT.
+# Must satisfy ASCEND_TRANSFER_TIMEOUT > RDMA_TIMEOUT * 7, where 7 is the default number of retries for an RDMA transfer.
+# HCCL_RDMA_TIMEOUT also affects collective communication behavior and should be configured carefully.
+export HCCL_RDMA_TIMEOUT=17
+
 # Unit: ms. The timeout for one-sided communication connection establishment is set to 10 seconds by default (see PR: https://github.com/kvcache-ai/Mooncake/pull/1039). Users can adjust this value based on their specific setup.
 # The recommended formula is: ASCEND_CONNECT_TIMEOUT = connection_time_per_card (typically within 500ms) × total_number_of_Decode_cards.
 # This ensures that even in the worst-case scenario—where all Decode cards simultaneously attempt to connect to the same Prefill card the connection will not time out.
@@ -229,6 +240,7 @@ export ACL_OP_INIT_MODE=1
 export ASCEND_ENABLE_USE_FABRIC_MEM=1
 #A2
 #export HCCL_INTRA_ROCE_ENABLE=1
+export HCCL_RDMA_TIMEOUT=17
 export ASCEND_CONNECT_TIMEOUT=10000
 export ASCEND_TRANSFER_TIMEOUT=10000
 
@@ -343,6 +355,7 @@ export ACL_OP_INIT_MODE=1
 export ASCEND_ENABLE_USE_FABRIC_MEM=1
 #A2
 #export HCCL_INTRA_ROCE_ENABLE=1
+export HCCL_RDMA_TIMEOUT=17
 export ASCEND_CONNECT_TIMEOUT=10000
 export ASCEND_TRANSFER_TIMEOUT=10000
 
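
Reviewer note: the three timeout constraints introduced by this patch can be checked with integer arithmetic before launching. The sketch below is not part of the patch. It assumes that `RDMA_TIMEOUT` in the documented inequality means 4.096 μs * 2^`HCCL_RDMA_TIMEOUT`, and that the lease TTL and both `ASCEND_*` timeouts are in milliseconds. The function names `rdma_timeout_ns`, `rdma_retry_budget_ms`, and `check_timeouts` are hypothetical helpers, not Mooncake or HCCL commands.

```shell
#!/bin/sh
# Hypothetical helper functions for sanity-checking the documented timeouts.

# rdma_timeout_ns EXP: 4.096 us * 2^EXP, expressed in nanoseconds (4096 ns << EXP).
rdma_timeout_ns() {
  echo $(( 4096 << $1 ))
}

# rdma_retry_budget_ms EXP: time for 7 retransmissions, rounded up to whole ms.
# 7 is the default RDMA retry count stated in the guide.
rdma_retry_budget_ms() {
  rrb_total_ns=$(( $(rdma_timeout_ns "$1") * 7 ))
  echo $(( (rrb_total_ns + 999999) / 1000000 ))
}

# check_timeouts EXP CONNECT_MS TRANSFER_MS LEASE_TTL_MS
# Returns 0 when all documented constraints hold, 1 otherwise.
check_timeouts() {
  ct_budget_ms=$(rdma_retry_budget_ms "$1")
  if [ "$3" -le "$ct_budget_ms" ]; then
    echo "ASCEND_TRANSFER_TIMEOUT ($3 ms) must exceed RDMA retry budget ($ct_budget_ms ms)" >&2
    return 1
  fi
  if [ "$4" -le "$2" ] || [ "$4" -le "$3" ]; then
    echo "default_kv_lease_ttl ($4 ms) must exceed both ASCEND timeouts ($2 ms, $3 ms)" >&2
    return 1
  fi
  return 0
}

# Values taken from the examples in this patch.
check_timeouts 17 10000 10000 11000 && echo "timeouts consistent"
```

With the values from the patch, 2^17 * 4.096 μs is about 537 ms per attempt, so seven attempts need roughly 3.76 s, which fits inside the 10 s transfer timeout. To check a live environment, the exported variables can be passed in place of the literals, for example `check_timeouts "$HCCL_RDMA_TIMEOUT" "$ASCEND_CONNECT_TIMEOUT" "$ASCEND_TRANSFER_TIMEOUT" 11000`.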