xc-llm-ascend

Author	SHA1	Message	Date
xulei	8325528368	[Kernel]: Optimize DispatchFFNCombine performance (#6468 ) ### What this PR does / why we need it? This PR focuses on performance optimization for the DispatchFFNCombine operator. The key optimizations include: 1. Improving communication efficiency by merging the transmission of tokens and scales; 2. Decoupling multi-core dependencies and reducing waiting bubbles in the combine process through tile-granularity communication; 3. Optimizing the full-card synchronization overhead before the umpermute operation. These optimizations aim to reduce the overall execution latency of the DispatchFFNCombine operator and enhance the runtime performance of the model inference process on Ascend devices. ### Does this PR introduce _any_ user-facing change? No. This PR only involves internal performance optimization of the DispatchFFNCombine operator and does not introduce any changes to user-facing APIs, interfaces, or behaviors. ### How was this patch tested? 1. Enable the DispatchFFNCombine operator by setting the environment variable: ``` export VLLM_ASCEND_ENABLE_FUSED_MC2=1 ``` 2. Run the standard model inference test suite with the above environment variable enabled; 4. Verify the correctness of model outputs (ensuring no functional regression) and measure the performance improvement of the DispatchFFNCombine operator (reduced latency and improved throughput). - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: xulei_ict <xulei292@huawei.com> Co-authored-by: xulei_ict <xulei292@huawei.com>	2026-02-09 16:30:34 +08:00
lhchg	717d299ae5	[BugFix]bug fix for dispatch_ffn_combine (#6156 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? Some synchronization logic of the fusion operator copies EP * expertPerRank int32 values. This part of data contains synchronization signals and data. The 512B DataBlock of Ascend A3 writes all data in the same block atomically to the HBM. For the DeepSeek model, when expertPerRank per device is 16, the 512B alignment is met in both 16-device single-node and 32-device two-node scenarios. Therefore, we check the first position of each 512B data. If the value is not 0, it indicates that the current 512B data has been sent. However, for other cases where expertPerRank per device is not 16, EP * expertPerRank does not meet the 512B alignment. If the above logic is used for checking, there will be problems. Therefore, here we will pad the EP * expertPerRank data length to the length aligned to 512B. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: lhchg <lhao_cheng@163.com> Co-authored-by: lihaocheng <lihaosheng1@h-partners.com>	2026-01-23 21:14:18 +08:00
lhchg	dc99cfdc15	[CustomOp] support TensorList for dispatchFFNCombine (#5665 ) ### What this PR does / why we need it? To support tensorList for dispatch_ffn_combine, to adjust eplb ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Single Operator Testing - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: lhchg <lhao_cheng@163.com> Co-authored-by: lihaocheng <lihaosheng1@h-partners.com>	2026-01-09 15:56:29 +08:00
Chen Chen	ad0607f900	add `dispatch_gmm_combine` kernel (#3532 ) ### What this PR does / why we need it? This PR introduces the Ascend implementation of the `dispatch_ffn_combine` kernel and wires it into the vLLM-Ascend runtime, together with follow‑up fixes to ensure the kernel builds and runs correctly in CI. - Add full host and device implementation of the `dispatch_ffn_combine` kernel under `csrc/dispatch_ffn_combine`, including tiling logic, MOE routing helpers, and kernel utilities for quantized FFN dispatch. - Integrate the new kernel with the PyTorch binding (csrc/torch_binding.cpp, csrc/torch_binding_meta.cpp) and the Ascend runtime (vllm_ascend/ascend_forward_context.py, vllm_ascend/worker/model_runner_v1.py). - Extend fused MoE communication and token dispatch support in `vllm_ascend/ops/fused_moe`, adding methods/utilities needed by the new dispatch path. - Update quantization logic in vllm_ascend/quantization/w8a8_dynamic.py to support the new FFN dispatch flow. - Fix kernel build issues by adjusting `csrc/build_aclnn.sh`, CMake configuration, and include/namespace usage in the new kernel files. - Add an end‑to‑end nightly test `tests/e2e/nightly/ops/test_dispatch_ffn_combine.py` and helper utilities in `vllm_ascend/utils.py` to validate the new kernel. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0 --------- Signed-off-by: mojave2 <chenchen145@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-04 23:00:59 +08:00

4 Commits