[Doc][0.18.0][KV Pool]add mooncake rdma timeout (#7784)

### What this PR does / why we need it?

- Add `default_kv_lease_ttl` to the `mooncake.json` example in the KV Pool guide.
- Document `default_kv_lease_ttl` semantics and clarify that it should be larger than `ASCEND_CONNECT_TIMEOUT` and `ASCEND_TRANSFER_TIMEOUT`.
- Add `HCCL_RDMA_TIMEOUT` explanation for Mooncake RDMA retransmission timeout, including the recommended constraint note.
- Add `HCCL_RDMA_TIMEOUT=16` to relevant KV Pool environment setup examples for consistency.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

---------

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
2026-03-31 20:17:03 +08:00
parent a63dd5868d
commit 14411e911e
1 changed files with 14 additions and 1 deletions
--- a/docs/source/user_guide/feature_guide/kv_pool.md
+++ b/docs/source/user_guide/feature_guide/kv_pool.md
@@ -99,6 +99,11 @@ export PYTHONHASHSEED=0
 | 800 I/T A3 series | 25.5.0<=HDK<26.0.0 | `export ASCEND_BUFFER_POOL=4:8` | Configures the number and size of buffers on the NPU Device for aggregation and KV transfer (e.g., `4:8` means 4 buffers of 8MB). |
 | 800 I/T A2 series | N/A | `export HCCL_INTRA_ROCE_ENABLE=1` | Required by the direct transmission scheme on 800 I/T A2 series. |

+### FAQ for HIXL (ascend_direct) backend
+
+For common troubleshooting and issue localization guidance for HIXL (ascend_direct), see:
+<https://gitcode.com/cann/hixl/wiki/HIXL%E5%B8%B8%E8%A7%81%E9%97%AE%E9%A2%98%E5%AE%9A%E4%BD%8D%E6%89%8B%E5%86%8C.md>
+
 ### Run Mooncake Master

 #### 1.Configure mooncake.json
@@ -126,10 +131,11 @@ The environment variable **MOONCAKE_CONFIG_PATH** is configured to the full path
 Under the mooncake folder:

 ```shell
-mooncake_master --port 50088 --eviction_high_watermark_ratio 0.9 --eviction_ratio 0.1
+mooncake_master --port 50088 --eviction_high_watermark_ratio 0.9 --eviction_ratio 0.1 --default_kv_lease_ttl 11000
 ```

 `eviction_high_watermark_ratio` determines the watermark at which Mooncake Store performs eviction, and `eviction_ratio` determines the portion of stored objects that is evicted.
+`default_kv_lease_ttl` sets the default lease TTL for KV objects, in milliseconds; configure it via `--default_kv_lease_ttl` and keep it larger than both `ASCEND_CONNECT_TIMEOUT` and `ASCEND_TRANSFER_TIMEOUT`.

 ### PD Disaggregation Scenario

@@ -157,6 +163,11 @@ export ASCEND_ENABLE_USE_FABRIC_MEM=1
 #A2
 #export HCCL_INTRA_ROCE_ENABLE=1

+# Minimum RDMA retransmission timeout, equal to 4.096 μs * 2^timeout.
+# Must satisfy the inequality ASCEND_TRANSFER_TIMEOUT > RDMA_TIMEOUT * 7, where RDMA_TIMEOUT is the retransmission timeout above and 7 is the default number of RDMA transfer retries.
+# HCCL_RDMA_TIMEOUT also affects collective communication behavior, so configure it carefully.
+export HCCL_RDMA_TIMEOUT=17
+
 # Unit: ms. The timeout for one-sided communication connection establishment is set to 10 seconds by default (see PR: https://github.com/kvcache-ai/Mooncake/pull/1039). Users can adjust this value based on their specific setup.
 # The recommended formula is: ASCEND_CONNECT_TIMEOUT = connection_time_per_card (typically within 500ms) × total_number_of_Decode_cards.
 # This ensures that even in the worst-case scenario, where all Decode cards simultaneously attempt to connect to the same Prefill card, the connection will not time out.
@@ -229,6 +240,7 @@ export ACL_OP_INIT_MODE=1
 export ASCEND_ENABLE_USE_FABRIC_MEM=1
 #A2
 #export HCCL_INTRA_ROCE_ENABLE=1
+export HCCL_RDMA_TIMEOUT=17
 export ASCEND_CONNECT_TIMEOUT=10000
 export ASCEND_TRANSFER_TIMEOUT=10000

@@ -343,6 +355,7 @@ export ACL_OP_INIT_MODE=1
 export ASCEND_ENABLE_USE_FABRIC_MEM=1
 #A2
 #export HCCL_INTRA_ROCE_ENABLE=1
+export HCCL_RDMA_TIMEOUT=17
 export ASCEND_CONNECT_TIMEOUT=10000
 export ASCEND_TRANSFER_TIMEOUT=10000
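
The timeout constraints documented in this change can be sanity-checked before launch. The following is a minimal sketch, not part of the patch: `rdma_timeout_ns`, `check_kv_pool_timeouts`, and the variables `DEFAULT_KV_LEASE_TTL` and `RDMA_RETRIES` are hypothetical names introduced here for illustration. It assumes the formula and retry count quoted in the guide (4.096 μs * 2^timeout, 7 retries) and uses the example values from the diff.

```shell
#!/bin/sh
# Example values taken from this change; adjust them to your deployment.
HCCL_RDMA_TIMEOUT=17
ASCEND_CONNECT_TIMEOUT=10000    # ms
ASCEND_TRANSFER_TIMEOUT=10000   # ms
DEFAULT_KV_LEASE_TTL=11000      # ms, the value passed as --default_kv_lease_ttl (hypothetical variable)
RDMA_RETRIES=7                  # default RDMA retry count quoted in the guide (hypothetical variable)

# 4.096 us is 4096 ns, so the minimum retransmission timeout is 4096 * 2^timeout ns.
rdma_timeout_ns() {
  echo $(( 4096 * (1 << $1) ))
}

# Hypothetical helper: prints "ok" and returns 0 when every documented constraint holds.
check_kv_pool_timeouts() {
  retry_budget_ns=$(( $(rdma_timeout_ns "$HCCL_RDMA_TIMEOUT") * RDMA_RETRIES ))
  transfer_ns=$(( ASCEND_TRANSFER_TIMEOUT * 1000000 ))
  if [ "$transfer_ns" -le "$retry_budget_ns" ]; then
    echo "ASCEND_TRANSFER_TIMEOUT must exceed RDMA timeout * $RDMA_RETRIES" >&2
    return 1
  fi
  if [ "$DEFAULT_KV_LEASE_TTL" -le "$ASCEND_CONNECT_TIMEOUT" ]; then
    echo "default_kv_lease_ttl must exceed ASCEND_CONNECT_TIMEOUT" >&2
    return 1
  fi
  if [ "$DEFAULT_KV_LEASE_TTL" -le "$ASCEND_TRANSFER_TIMEOUT" ]; then
    echo "default_kv_lease_ttl must exceed ASCEND_TRANSFER_TIMEOUT" >&2
    return 1
  fi
  echo "ok"
}

check_kv_pool_timeouts
```

With `HCCL_RDMA_TIMEOUT=17`, the minimum retransmission timeout is about 537 ms, so seven retries take roughly 3.76 s. That stays below the 10 s `ASCEND_TRANSFER_TIMEOUT`, which in turn stays below the 11 s lease TTL.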