[Perf] Supports compute-communication overlap in the forward of sfa_v1 in the Sharded-CP feature. (#5701)

### What this PR does / why we need it?

> Extracted from PR #5513

Based on the Sharded-CP feature PR: #4702; RFC: https://github.com/vllm-project/vllm/issues/30055

### All-gather KV Cache for Communication Overlap:

- This PR adjusts the calculation order in the SFA.
- Split `index_select` into `indexer_select_pre_process` and `indexer_select_post_process`.
- Combine `nope`, `rope` and `index-k` into a tensor to perform asynchronous all-gather.

### Benchmark:

input=40k && num_batch_token=20k

- before:

```
Mean TTFT (ms): 2614.52
Median TTFT (ms): 3148.03
P50 TTFT (ms): 3148.03
P90 TTFT (ms): 3163.48
P99 TTFT (ms): 3170.20
```

- after:

```
Mean TTFT (ms): 2529.92
Median TTFT (ms): 3051.69
P50 TTFT (ms): 3051.69
P90 TTFT (ms): 3067.31
P99 TTFT (ms): 3072.15
```

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main: 2f4e6548ef

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
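The pack, gather and overlap pattern described in the commit message can be illustrated with a small pure-Python simulation. This is only a sketch under stated assumptions: `pack`, `unpack`, `fake_all_gather` and `forward_with_overlap` are hypothetical names, plain lists stand in for tensors, and a worker thread stands in for the asynchronous collective. It is not the code in this PR.

```python
from concurrent.futures import ThreadPoolExecutor


def pack(nope, rope, index_k):
    """Concatenate the three per-token feature rows into one row per token."""
    return [n + r + k for n, r, k in zip(nope, rope, index_k)]


def unpack(packed, nope_dim, rope_dim):
    """Slice gathered rows back into nope, rope and index-k parts."""
    nope = [row[:nope_dim] for row in packed]
    rope = [row[nope_dim:nope_dim + rope_dim] for row in packed]
    index_k = [row[nope_dim + rope_dim:] for row in packed]
    return nope, rope, index_k


def fake_all_gather(shards):
    """Stand-in for a collective: concatenate every rank's rows in rank order."""
    return [row for shard in shards for row in shard]


def forward_with_overlap(per_rank_inputs, independent_work):
    """per_rank_inputs: one (nope, rope, index_k) triple per simulated rank."""
    nope_dim = len(per_rank_inputs[0][0][0])
    rope_dim = len(per_rank_inputs[0][1][0])
    shards = [pack(*triple) for triple in per_rank_inputs]
    with ThreadPoolExecutor(max_workers=1) as pool:
        handle = pool.submit(fake_all_gather, shards)  # launch, do not block
        side_result = independent_work()  # work that does not need gathered KV
        gathered = handle.result()  # wait only when the data is needed
    return unpack(gathered, nope_dim, rope_dim), side_result
```

Packing the three pieces into one buffer means a single collective is issued instead of three, and the wait is deferred until the gathered data is actually consumed, which is where any overlap with independent compute would come from.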
2026-01-11 09:47:27 +08:00
parent c5744e2350
commit db12c1e2c8
3 changed files with 124 additions and 61 deletions
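One of the changed files, `vllm_ascend/utils.py`, rewrites `enable_dsa_cp` so that it chains `hasattr` checks instead of explicit `None` guards, and additionally requires `enable_sp()`. The sketch below mirrors that shape with stand-in objects to show two plain-Python consequences: a `None` `model_config` still yields `False`, whereas a `None` config object would raise `AttributeError`, and a zero-argument function under `lru_cache(maxsize=1)` is evaluated only once. `is_ds_v32` as a function, `make_gate`, and the `SimpleNamespace` configs are hypothetical illustrations, not vLLM APIs.

```python
from functools import lru_cache
from types import SimpleNamespace


def is_ds_v32(vllm_config) -> bool:
    """Same hasattr chain as the new code, applied to a stand-in config."""
    model_config = vllm_config.model_config  # raises if vllm_config is None
    return hasattr(model_config, "hf_text_config") and hasattr(
        model_config.hf_text_config, "index_topk")


def make_gate(get_config, sp_enabled):
    """Build a cached zero-argument gate from two hypothetical callables."""

    @lru_cache(maxsize=1)
    def gate() -> bool:
        return is_ds_v32(get_config()) and sp_enabled()

    return gate
```

Because the result is cached after the first call, such a gate reflects whatever configuration was current at that moment; later configuration changes are not observed unless the cache is cleared with `gate.cache_clear()`.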
--- a/vllm_ascend/utils.py
+++ b/vllm_ascend/utils.py
@@ -1172,20 +1172,13 @@ def singleton(cls):
@lru_cache(maxsize=1)
 def enable_dsa_cp() -> bool:
    from vllm.config import get_current_vllm_config
-
    vllm_config = get_current_vllm_config()
-    if vllm_config is None:
-        return False
-
-    model_config = getattr(vllm_config, "model_config", None)
-    if model_config is None:
-        return False
-
-    hf_text_config = getattr(model_config, "hf_text_config", None)
-    if hf_text_config is None:
-        return False
-
-    return hasattr(hf_text_config, "index_topk")
+    is_ds_v32 = hasattr(
+        vllm_config.model_config, "hf_text_config") and hasattr(
+            vllm_config.model_config.hf_text_config, "index_topk")
+    if is_ds_v32 and enable_sp():
+        return True
+    return False


@lru_cache(maxsize=1)