xc-llm-ascend

Files

Ronald1995 32a9c5f694 [Feature]: implement the fusion of allreduce and matmul in prefill phase when tp is enabled (#1926 )

### What this PR does / why we need it?
vLLM's `RowParallelLinear` forward function executes the allreduce and the matmul separately. This PR's patched forward function uses `torch_npu.npu_mm_all_reduce_base` to execute the matmul and the allreduce in a single fused kernel. This yields a 20% performance improvement in eager mode.
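
The fused kernel replaces two steps with one, but it must produce the same result as the unfused path. The pure-Python sketch below shows the computation both paths perform in a row-parallel linear layer. It uses no torch or torch_npu, simulates the tensor-parallel ranks as list entries, and every helper name is hypothetical. Each rank multiplies its column shard of the input by its row shard of the weight. The allreduce then sums those partial products, which equals the full matmul. This is the invariant a fused matmul+allreduce kernel has to preserve. The sketch does not model how `npu_mm_all_reduce_base` overlaps compute and communication internally.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive dense matmul: (m x k) @ (k x n) -> (m x n)."""
    return [
        [sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split_cols(x: Matrix, parts: int) -> List[Matrix]:
    """Shard the input along its last (k) dimension, one shard per rank."""
    k = len(x[0])
    assert k % parts == 0, "k must be divisible by the TP size"
    step = k // parts
    return [[row[r * step:(r + 1) * step] for row in x] for r in range(parts)]


def split_rows(w: Matrix, parts: int) -> List[Matrix]:
    """Shard the weight along its first (k) dimension, one shard per rank."""
    assert len(w) % parts == 0, "k must be divisible by the TP size"
    step = len(w) // parts
    return [w[r * step:(r + 1) * step] for r in range(parts)]


def all_reduce_sum(partials: List[Matrix]) -> List[Matrix]:
    """Simulated allreduce: every rank receives the sum of all partials."""
    total = partials[0]
    for p in partials[1:]:
        total = add(total, p)
    return [total for _ in partials]


def row_parallel_linear(x: Matrix, w: Matrix, tp: int) -> Matrix:
    """Unfused path: a local matmul on each rank, then a separate allreduce."""
    partials = [matmul(xi, wi) for xi, wi in zip(split_cols(x, tp), split_rows(w, tp))]
    return all_reduce_sum(partials)[0]  # all ranks hold the same result
```

For example, with `x = [[1, 2, 3, 4], [5, 6, 7, 8]]` and `w = [[1, 0], [0, 1], [1, 1], [2, -1]]`, `row_parallel_linear(x, w, tp=2)` returns `[[12, 1], [28, 5]]`. This is the same as `matmul(x, w)`.
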
### Does this PR introduce _any_ user-facing change?
This PR introduces a new environment variable, `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE`, that controls whether the feature is enabled.
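A gate of this kind might look like the sketch below. It assumes several things not stated here. The variable is read as an integer-like string (`"0"` or `"1"`) and defaults to off. The fused path is chosen only when tensor parallelism is active (TP size > 1), as the PR title describes. `matmul_allreduce_enabled` and `choose_forward` are hypothetical names, not functions from vllm-ascend, and the prefill-phase condition is not modeled.

```python
import os
from typing import Mapping, Optional


def matmul_allreduce_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical helper: read VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE.

    Assumption: the value is "0" or "1", and an unset variable means disabled.
    """
    env = os.environ if env is None else env
    return bool(int(env.get("VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE", "0")))


def choose_forward(tp_size: int, enabled: bool) -> str:
    """Hypothetical dispatch: use the fused kernel only if enabled and TP > 1."""
    return "fused" if enabled and tp_size > 1 else "unfused"
```

Under these assumptions, exporting `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE=1` before launching selects the fused path on multi-rank runs. Single-rank runs keep the original forward because there is no allreduce to fuse.
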

### How was this patch tested?
The patch is covered by a new unit test file, `test_patch_linear.py`.


- vLLM version: v0.10.0
- vLLM main: 7728dd77bb

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>

2025-07-28 15:13:37 +08:00

test_patch_distributed.py

Add UT for Patches (#1766 )

2025-07-23 16:07:20 +08:00

test_patch_linear.py

[Feature]: implement the fusion of allreduce and matmul in prefill phase when tp is enabled (#1926 )

2025-07-28 15:13:37 +08:00

test_patch_minicpm.py

Add UT for Patches (#1766 )

2025-07-23 16:07:20 +08:00

test_patch_sampler.py

[Perf] add patch to optimize apply_topk_topp (#1732 )

2025-07-11 15:32:02 +08:00

test_patch_utils.py

Add UT for Patches (#1766 )

2025-07-23 16:07:20 +08:00