xc-llm-ascend

Files

zzhxxx db12c1e2c8 [Perf] Supports compute-communication overlap in the forward of sfa_v1 in the Sharded-CP feature. (#5701 )

### What this PR does / why we need it?
> Extracted from PR #5513
Based on the Sharded-CP feature PR:#4702;
RFC:https://github.com/vllm-project/vllm/issues/30055

### All-gather KV Cache for Communication Overlap:
- This PR adjusts the calculation order in the SFA.
- split `index_select` into `indexer_select_pre_process` and
`indexer_select_post_process`.
- Combine `nope`, `rope` and `index-k` into a tensor to perform
asynchronous all-gather.

### benchmark:
input=40k && num_batch_token=20k
- before:
```
Mean TTFT (ms):                          2614.52
Median TTFT (ms):                        3148.03
P50 TTFT (ms):                           3148.03
P90 TTFT (ms):                           3163.48
P99 TTFT (ms):                           3170.20
```

- after:
```
Mean TTFT (ms):                          2529.92
Median TTFT (ms):                        3051.69
P50 TTFT (ms):                           3051.69
P90 TTFT (ms):                           3067.31
P99 TTFT (ms):                           3072.15
```

### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>

2026-01-11 09:47:27 +08:00

context_parallel

[feature]dcp&pcp support mlapo (#5672 )

2026-01-08 23:49:23 +08:00

__init__.py

[Core] Make V1 work and enable V1 engine test (#389 )

2025-03-28 19:34:23 +08:00

attention_mask.py

[Refactor] Fix AttentionMaskBuilder singleton and remove redundant pcp_prefill_mask (#4870 )

2026-01-07 17:09:52 +08:00

attention_v1.py

[Feat] flashcomm2+oshard Generalized (#4723 )