xc-llm-ascend

Files

wangxiyuan fd7c929145 [perf] replace all_reduce for kv_consumer and support different num_tokens among all ranks (#4983 )

pick from https://github.com/vllm-project/vllm-ascend/pull/4736 to fix
the merge conflict

### What this PR does / why we need it?
Currently, the all_reduce operation in _sync_metadata_across_dp is
performed with gloo backend which is extremely time-consuming when
DPEngineCores are in different nodes. This operation cannot be ignored
by async scheduling in multi-node-scenarios with speculative decoding
(e.g., EAGLE, mtp).

This pr eliminates the all_reduce operation for D Nodes and change the
input parameter of MoEDispatch & MoeCombine operators to make MC2EP
support different num_tokens across all ranks.

### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested with PD disaggregation (2P: DP2TP8EP16 1D: DP8TP4EP32) scenarios
while enabling async scheduling. This pr can remove cross-node
all_reduce with gloo backend and further reduce latency with correct
accuracy.

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>

2025-12-13 18:59:54 +08:00

e2e

[bugfix] asyncscheduler bug fix (#4968 )

2025-12-13 17:04:54 +08:00

[perf] replace all_reduce for kv_consumer and support different num_tokens among all ranks (#4983 )

2025-12-13 18:59:54 +08:00

__init__.py

[SpecDecode] Add spec decode support (#500 )

2025-04-17 20:16:32 +08:00