[Perf] Supports compute-communication overlap in the forward of sfa_v1 in the Sharded-CP feature. (#5701)

### What this PR does / why we need it?
> Extracted from PR #5513.
> Based on the Sharded-CP feature PR: #4702.
> RFC: https://github.com/vllm-project/vllm/issues/30055

### All-gather KV Cache for Communication Overlap:
- Adjust the computation order in the SFA forward pass.
- Split `index_select` into `indexer_select_pre_process` and
`indexer_select_post_process`.
- Combine `nope`, `rope`, and `index-k` into a single tensor so that they
are gathered by one asynchronous all-gather.
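
The pattern described in the list above (pack, launch the gather, compute, wait, unpack) can be sketched as follows. This is a minimal single-process illustration and not the code from this PR: `pack`, `unpack`, `fake_all_gather`, and `forward_with_overlap` are hypothetical names, plain Python lists stand in for tensors, and a worker thread stands in for the asynchronous collective. In a real multi-device run the gather would presumably be a collective such as `torch.distributed.all_gather_into_tensor(..., async_op=True)`, which is an assumption here.

```python
from concurrent.futures import ThreadPoolExecutor


def pack(nope, rope, index_k):
    """Concatenate the three per-token feature lists into one row per token."""
    return [n + r + k for n, r, k in zip(nope, rope, index_k)]


def unpack(rows, nope_dim, rope_dim):
    """Split gathered rows back into nope, rope, and index-k parts."""
    nope = [row[:nope_dim] for row in rows]
    rope = [row[nope_dim:nope_dim + rope_dim] for row in rows]
    index_k = [row[nope_dim + rope_dim:] for row in rows]
    return nope, rope, index_k


def fake_all_gather(rows_per_rank):
    """Hypothetical stand-in for a collective: concatenate rows in rank order."""
    gathered = []
    for rows in rows_per_rank:
        gathered.extend(rows)
    return gathered


def forward_with_overlap(shards, queries):
    """shards holds one (nope, rope, index_k) triple per simulated rank."""
    packed = [pack(*shard) for shard in shards]
    nope_dim = len(shards[0][0][0])
    rope_dim = len(shards[0][1][0])
    with ThreadPoolExecutor(max_workers=1) as pool:
        # 1. Launch a single gather for the packed buffer.
        handle = pool.submit(fake_all_gather, packed)
        # 2. Do work that does not depend on the gathered data.
        q_out = [2.0 * q for q in queries]
        # 3. Wait for the gather only when its result is needed.
        gathered = handle.result()
    return q_out, unpack(gathered, nope_dim, rope_dim)


shards = [
    ([[1.0, 2.0]], [[3.0]], [[4.0]]),  # simulated rank 0
    ([[5.0, 6.0]], [[7.0]], [[8.0]]),  # simulated rank 1
]
q_out, (nope, rope, index_k) = forward_with_overlap(shards, [0.5, 1.5])
print(q_out, nope, rope, index_k)
```

Packing the three parts into one buffer means a single communication call is issued instead of three, and the independent computation in step 2 runs while that call is in flight.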

### Benchmark:
input=40k and num_batch_token=20k
- before:
```
Mean TTFT (ms):                          2614.52
Median TTFT (ms):                        3148.03
P50 TTFT (ms):                           3148.03
P90 TTFT (ms):                           3163.48
P99 TTFT (ms):                           3170.20
```

- after:
```
Mean TTFT (ms):                          2529.92
Median TTFT (ms):                        3051.69
P50 TTFT (ms):                           3051.69
P90 TTFT (ms):                           3067.31
P99 TTFT (ms):                           3072.15
```
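
The relative change between the two runs can be computed directly from the figures above. A minimal sketch, with the values copied from the benchmark output; `reduction_pct` is a hypothetical helper name:

```python
# TTFT figures in milliseconds, copied from the benchmark output above.
before = {"mean": 2614.52, "median": 3148.03, "p90": 3163.48, "p99": 3170.20}
after = {"mean": 2529.92, "median": 3051.69, "p90": 3067.31, "p99": 3072.15}


def reduction_pct(old, new):
    """Relative reduction of new versus old, in percent."""
    return (old - new) / old * 100.0


improvement = {name: reduction_pct(before[name], after[name]) for name in before}
for name, pct in improvement.items():
    print(f"{name} TTFT reduced by {pct:.2f}%")
```

With these figures, every reported metric improves by roughly 3%.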

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
zzhxxx
2026-01-11 09:47:27 +08:00
committed by GitHub
parent c5744e2350
commit db12c1e2c8
3 changed files with 124 additions and 61 deletions


@@ -1172,20 +1172,13 @@ def singleton(cls):
@lru_cache(maxsize=1)
def enable_dsa_cp() -> bool:
    from vllm.config import get_current_vllm_config
    vllm_config = get_current_vllm_config()
    if vllm_config is None:
        return False
    model_config = getattr(vllm_config, "model_config", None)
    if model_config is None:
        return False
    hf_text_config = getattr(model_config, "hf_text_config", None)
    if hf_text_config is None:
        return False
    return hasattr(hf_text_config, "index_topk")
    is_ds_v32 = hasattr(
        vllm_config.model_config, "hf_text_config") and hasattr(
            vllm_config.model_config.hf_text_config, "index_topk")
    if is_ds_v32 and enable_sp():
        return True
    return False
@lru_cache(maxsize=1)