### What this PR does / why we need it?
This PR focuses on performance optimization for the DispatchFFNCombine
operator. The key optimizations include:
1. Improving communication efficiency by merging the transmission of
tokens and scales;
2. Decoupling multi-core dependencies and reducing waiting bubbles in
the combine process through tile-granularity communication;
3. Reducing the full-card synchronization overhead before the
unpermute operation.
These optimizations aim to reduce the overall execution latency of the
DispatchFFNCombine operator and enhance the runtime performance of the
model inference process on Ascend devices.
### Does this PR introduce _any_ user-facing change?
No. This PR only involves internal performance optimization of the
DispatchFFNCombine operator and does not introduce any changes to
user-facing APIs, interfaces, or behaviors.
### How was this patch tested?
1. Enable the DispatchFFNCombine operator by setting the environment
variable:
```
export VLLM_ASCEND_ENABLE_FUSED_MC2=1
```
2. Run the standard model inference test suite with the above
environment variable enabled (a minimal smoke-test sketch follows this
list);
3. Verify the correctness of model outputs (ensuring no functional
regression) and measure the performance improvement of the
DispatchFFNCombine operator (reduced latency and improved throughput).
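For reference, a minimal smoke test could look like the following; the
model name and prompt are illustrative placeholders, not something this
PR prescribes:
```
export VLLM_ASCEND_ENABLE_FUSED_MC2=1
# Illustrative smoke test; model and prompt are placeholders:
python - <<'EOF'
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
EOF
```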
- vLLM version: v0.14.1
- vLLM main: dc917cceb8
Signed-off-by: xulei_ict <xulei292@huawei.com>
Co-authored-by: xulei_ict <xulei292@huawei.com>
### What this PR does / why we need it?
Translate remaining Chinese comments in the `dispatch_ffn_combine` code
to English and update the installation guide to remind users to
initialize submodules when building from source.
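For example, the source-build flow the guide now highlights looks
roughly like this (a sketch; the repository URL is assumed from the
project name, and the install command may differ from the guide):
```
# Clone and initialize submodules before building from source:
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
git submodule update --init --recursive
# Then build/install from source (exact command per the installation guide):
pip install -e .
```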
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c
---------
Signed-off-by: mojave2 <chenchen145@huawei.com>
Signed-off-by: Chen Chen <0109chenchen@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR introduces the Ascend implementation of the
`dispatch_ffn_combine` kernel and wires it into the vLLM-Ascend runtime,
together with follow‑up fixes to ensure the kernel builds and runs
correctly in CI.
- Add the full host and device implementation of the
`dispatch_ffn_combine` kernel under `csrc/dispatch_ffn_combine`,
including tiling logic, MoE routing helpers, and kernel utilities for
quantized FFN dispatch.
- Integrate the new kernel with the PyTorch binding
(`csrc/torch_binding.cpp`, `csrc/torch_binding_meta.cpp`) and the Ascend
runtime (`vllm_ascend/ascend_forward_context.py`,
`vllm_ascend/worker/model_runner_v1.py`).
- Extend fused MoE communication and token dispatch support in
`vllm_ascend/ops/fused_moe`, adding methods/utilities needed by the new
dispatch path.
- Update quantization logic in `vllm_ascend/quantization/w8a8_dynamic.py`
to support the new FFN dispatch flow.
- Fix kernel build issues by adjusting `csrc/build_aclnn.sh`, CMake
configuration, and include/namespace usage in the new kernel files.
- Add an end‑to‑end nightly test
`tests/e2e/nightly/ops/test_dispatch_ffn_combine.py` and helper
utilities in `vllm_ascend/utils.py` to validate the new kernel.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
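Via the end-to-end nightly test added in this PR; a typical local
invocation might be (assuming pytest as the test runner):
```
# Run the new nightly kernel test on an Ascend host
# (pytest assumed as the runner):
pytest -s tests/e2e/nightly/ops/test_dispatch_ffn_combine.py
```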
- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0
---------
Signed-off-by: mojave2 <chenchen145@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>