xc-llm-ascend/vllm_ascend at db12c1e2c840a98570bf8a793f2a9786999e8c3f

EngineX/xc-llm-ascend

zzhxxx db12c1e2c8 [Perf] Supports compute-communication overlap in the forward of sfa_v1 in the Sharded-CP feature. (#5701 )

### What this PR does / why we need it?
> Extracted from PR #5513.
Based on the Sharded-CP feature PR: #4702
RFC: https://github.com/vllm-project/vllm/issues/30055

### All-gather KV Cache for Communication Overlap:
- Reorder the computation in SFA so that the KV all-gather can overlap with compute.
- Split `index_select` into `indexer_select_pre_process` and `indexer_select_post_process`.
- Combine `nope`, `rope` and `index-k` into a single tensor so that one asynchronous all-gather covers all three.
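
The pack, gather, overlap, unpack pattern described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: `pack`, `unpack`, `fake_all_gather` and `forward` are hypothetical names, nested lists stand in for device tensors, and a thread pool stands in for an asynchronous collective (on real hardware this role would be played by something like an all-gather launched with `async_op=True` in `torch.distributed`).

```python
from concurrent.futures import ThreadPoolExecutor


def pack(nope, rope, index_k):
    """Concatenate the three per-token features into one row per token."""
    return [n + r + k for n, r, k in zip(nope, rope, index_k)]


def unpack(packed, nope_dim, rope_dim):
    """Split gathered rows back into nope, rope and index-k parts."""
    nope = [row[:nope_dim] for row in packed]
    rope = [row[nope_dim:nope_dim + rope_dim] for row in packed]
    index_k = [row[nope_dim + rope_dim:] for row in packed]
    return nope, rope, index_k


def fake_all_gather(shards):
    """Stand-in for a collective: concatenate every rank's rows in rank order."""
    return [row for shard in shards for row in shard]


def forward(shards, nope_dim, rope_dim, independent_compute):
    """Start the gather, run unrelated compute meanwhile, then wait and unpack."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        handle = pool.submit(fake_all_gather, shards)  # "communication" starts
        local_result = independent_compute()           # overlapped work
        gathered = handle.result()                     # wait for the gather
    return unpack(gathered, nope_dim, rope_dim), local_result


# Two simulated ranks, two tokens each: nope has 2 values, rope 1, index-k 1.
rank0 = pack([[1, 2], [3, 4]], [[5], [6]], [[7], [8]])
rank1 = pack([[9, 10], [11, 12]], [[13], [14]], [[15], [16]])

(nope, rope, index_k), local = forward(
    [rank0, rank1], nope_dim=2, rope_dim=1,
    independent_compute=lambda: sum(range(10)),
)
```

Packing the three features into one buffer means a single collective is issued instead of three, and the only synchronization point is the `handle.result()` call just before the gathered values are needed.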

### benchmark:
input=40k && num_batch_token=20k
- before:
```
Mean TTFT (ms):                          2614.52
Median TTFT (ms):                        3148.03
P50 TTFT (ms):                           3148.03
P90 TTFT (ms):                           3163.48
P99 TTFT (ms):                           3170.20
```

- after:
```
Mean TTFT (ms):                          2529.92
Median TTFT (ms):                        3051.69
P50 TTFT (ms):                           3051.69
P90 TTFT (ms):                           3067.31
P99 TTFT (ms):                           3072.15
```
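
To put the two runs side by side, the relative TTFT reduction can be computed from the figures above. The snippet is a small helper written for this comparison (the names `before`, `after` and `reduction_pct` are illustrative); with these numbers the reduction comes out at roughly 3% for every statistic.

```python
# TTFT in milliseconds, copied from the benchmark output above.
before = {"mean": 2614.52, "median": 3148.03, "p90": 3163.48, "p99": 3170.20}
after = {"mean": 2529.92, "median": 3051.69, "p90": 3067.31, "p99": 3072.15}


def reduction_pct(old, new):
    """Relative reduction of `new` with respect to `old`, in percent."""
    return (old - new) / old * 100.0


reductions = {name: reduction_pct(before[name], after[name]) for name in before}

for name, pct in reductions.items():
    print(f"{name}: {pct:.2f}% lower TTFT")
```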

### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main: 2f4e6548ef

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>

2026-01-11 09:47:27 +08:00

..

_cann_ops_custom

[Kernel] add custom op GmmSwigluQuantWeightNzTensorList (#3804 )

2025-11-28 18:06:39 +08:00

[Perf] Supports compute-communication overlap in the forward of sfa_v1 in the Sharded-CP feature. (#5701 )

2026-01-11 09:47:27 +08:00

[BugFix][Fusion] Fix graph fusion failure problem (#5676 )

2026-01-07 18:42:55 +08:00

[CI] fix lint (#5216 )

2025-12-20 17:03:25 +08:00

device_allocator

[Refactor] Cleanup platform (#5566 )

2026-01-07 09:25:55 +08:00

[Perf] Supports compute-communication overlap in the forward of sfa_v1 in the Sharded-CP feature. (#5701 )

2026-01-11 09:47:27 +08:00

[Bugfix] Revert pr4214 multi-stream collect expert hotpot (#5529 )

2026-01-07 11:26:47 +08:00

[BugFix] Fix npu-cpu offloading interface change bug. (#5290 )

2025-12-27 10:21:20 +08:00

[BufFix]Fix the error when using Ascend custom operators with rank=128 (#5394 )

2026-01-09 15:57:43 +08:00

[BugFix] NetLoader: No backend type associated with device type npu (#5700 )

2026-01-09 15:54:54 +08:00

[Feat] flashcomm2+oshard Generalized (#4723 )

2026-01-10 22:57:57 +08:00

[BugFix][Fusion] Fix graph fusion failure problem (#5676 )

2026-01-07 18:42:55 +08:00

adapt to minimax_m2 (#5624 )

2026-01-10 23:01:35 +08:00

[Feature] add the magicmtp speculative decoding acceleration algorithm (#5542 )

2026-01-08 09:15:55 +08:00

[feature]dcp&pcp support mlapo (#5672 )

2026-01-08 23:49:23 +08:00

[main][bugfix] Fix fullgraph padding bug in mtp eagle refactor (#5692 )

2026-01-10 23:07:48 +08:00

[BugFix] Xlite: Bypass the padding of the graph mode in non-MTP cases to obtain the correct decode num. (#5711 )

2026-01-09 15:55:30 +08:00

__init__.py

clean up model module (#4611 )

2025-12-02 17:35:47 +08:00

ascend_config.py

[CI]Add Disaggregated PD Nightly Test for Qwen3-235B and Qwen3-VL-235B (#5502 )

2026-01-09 16:25:20 +08:00

ascend_forward_context.py

[Feature]EPLB:Adapt DispatchGmmCombineDecode operator to eplb tensor list and expert token numbers (#5552 )

2026-01-07 11:23:42 +08:00

batch_invariant.py

[Feature] implement basic framework for batch invariant (#5517 )

2026-01-07 09:11:26 +08:00

cpu_binding.py

[main] support cpu binding (#3546 )

2025-10-21 09:17:03 +08:00

envs.py

[refactor] Refactor the interface for shard weight and remove the flashcomm2 o_shared interface. (#5181 )

2026-01-08 09:05:02 +08:00

flash_common3_context.py

[Perf]enable prefill flashcommon3 (#4065 )

2025-12-14 09:34:13 +08:00

meta_registration.py

Fix the bugs about operator registration by PyTorch Dispatcher (#2786 )

2025-09-13 11:58:52 +08:00

platform.py

Optimize the print info format when deprecated code is used in vllm-ascend (#5696 )

2026-01-08 09:26:49 +08:00

profiling_config.py

Drop ascend scheduler (#4623 )

2025-12-05 09:03:45 +08:00

utils.py

[Perf] Supports compute-communication overlap in the forward of sfa_v1 in the Sharded-CP feature. (#5701 )

2026-01-11 09:47:27 +08:00