14 Commits

Author SHA1 Message Date
guanguan0308
44ef9a36ac [fix]: fix precision issue in dispatch_ffn_combine_bf16 and remove redundant sync (#7198)
### What this PR does / why we need it?
Fix the precision issue in dispatch_ffn_combine_bf16 operator.
Remove redundant synchronization operations in dispatch_ffn_combine
operator.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: guanguan0308 <1546542263@qq.com>
2026-03-23 10:14:03 +08:00
linfeng-yuan
5f3826b093 [Build] Add support for Ascend950 chip (#7151)
### What this PR does / why we need it?
This PR adds support for the Ascend950 chip. This includes:
- Updating build scripts (`CMakeLists.txt` and `setup.py`) to recognize
the Ascend950 chip and set appropriate compilation flags.
- Disabling a set of custom operators that are not yet supported on the
Ascend950 hardware target.
- Performing a codebase-wide refactoring of `pipe_barrier()` calls to
the namespaced `AscendC::PipeBarrier<>()` for improved code consistency
and adherence to the latest API standards.

Ascend950DT e2e passed (Qwen3-32B-MXFP8) and CI passed
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-12 10:25:51 +08:00
guanguan0308
4b4961ba5f [fix]Resolve compilation errors that occur when building versions subsequent to b020 (#7059)
### What this PR does / why we need it?
Resolve compilation errors that occur when building versions subsequent
to b020:
Root Cause
During operator compilation, we previously modified the names of structs
HcclOpResParam and HcclRankRelationResV2 in the moe_distribute_base.h
file. After version b020, moe_distribute_base.h was updated with
additional code that references these two structs. This resulted in
compilation errors, as renaming the structs alone broke the newly added
references to them.
Solution
we have added the moe_distribute_base.h file to the operator
implementation. This avoids compilation errors caused by updates to this
file in the CANN framework.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: guanguan0308 <1546542263@qq.com>
2026-03-09 16:09:35 +08:00
xulei
1e77077788 [Bugfix][DispatchFFNCombine] resolve vec error caused by unaligned UB access (#6707)
### What this PR does / why we need it?
1. Fix a vec error caused by unaligned UB accesss in the
DispatchFFNCombine;
2. Fix expert_token_nums tensor defined on host instead of NPU in
moe_comm_method.py
3. Fix multi-core copy issue of expert_token_nums in dispatchffnCombine
op (single aiv copy is sufficient)

### Does this PR introduce _any_ user-facing change?

No, this PR does not introduce any user-facing changes. The fix only
addresses internal memory access logic and does not modify any public
APIs, interfaces, or user-visible behaviors.

### How was this patch tested?

`export VLLM_ASCEND_ENABLE_FUSED_MC2=1`

vLLM version: v0.15.0

- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: xulei_ict <xulei292@huawei.com>
Co-authored-by: xulei_ict <xulei292@huawei.com>
2026-02-14 10:32:50 +08:00
xulei
8325528368 [Kernel]: Optimize DispatchFFNCombine performance (#6468)
### What this PR does / why we need it?

This PR focuses on performance optimization for the DispatchFFNCombine
operator. The key optimizations include:

1. Improving communication efficiency by merging the transmission of
tokens and scales;
2. Decoupling multi-core dependencies and reducing waiting bubbles in
the combine process through tile-granularity communication;
3. Optimizing the full-card synchronization overhead before the
umpermute operation.

These optimizations aim to reduce the overall execution latency of the
DispatchFFNCombine operator and enhance the runtime performance of the
model inference process on Ascend devices.

### Does this PR introduce _any_ user-facing change?

No. This PR only involves internal performance optimization of the
DispatchFFNCombine operator and does not introduce any changes to
user-facing APIs, interfaces, or behaviors.

### How was this patch tested?

1. Enable the DispatchFFNCombine operator by setting the environment
variable:
```
export VLLM_ASCEND_ENABLE_FUSED_MC2=1
```
2. Run the standard model inference test suite with the above
environment variable enabled;
4. Verify the correctness of model outputs (ensuring no functional
regression) and measure the performance improvement of the
DispatchFFNCombine operator (reduced latency and improved throughput).

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: xulei_ict <xulei292@huawei.com>
Co-authored-by: xulei_ict <xulei292@huawei.com>
2026-02-09 16:30:34 +08:00
guanguan0308
dffac6db73 [Refactor] Add expert processed token count output for DispatchFFNCombine/DispatchFFNCombineBF16 (#6402)
### What this PR does / why we need it?
Add New Output for Expert Token Count
An additional output tensor expert_token_nums is added to both operators
to meet the requirement of tracking token distribution among experts:

Tensor Name: expert_token_nums
Dimension: 1D tensor
Shape: (local_expert_num,)
Data Type: int32
Semantics: Represents the number of tokens actually received by each
expert on the current card.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: guanguan0308 <1546542263@qq.com>
Signed-off-by: guanguan0308 <162653673+guanguan0308@users.noreply.github.com>
2026-02-03 10:41:06 +08:00
LQLlulu
d1dcdfc408 [bugfix]fix some bug in dispatch_ffn_combine kernel (#6465)
### What this PR does / why we need it?
The kernel internals had an issue with maxoutputsize overflow in the
swiglu section, which has been fixed.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: LQLlulu <39671654+LQLlulu@users.noreply.github.com>
2026-02-02 08:32:42 +08:00
xulei
77ea873224 fix: resolve sync bug in DispathFFNCombine when expert num per card is 32 (#6416)
### What this PR does / why we need it?

Fix the synchronization deadlock issue in DispathFFNCombine module that
occurs on NPU cards when the number of experts per card exceeds 16 (the
bug manifests prominently when set to 32/128).

### Does this PR introduce _any_ user-facing change?
No, this is a bug fix for internal synchronization logic specific to NPU
expert dispatch, with no impact on external APIs, interfaces, or
end-user behaviors.


- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: xulei_ict <xulei292@huawei.com>
Co-authored-by: xulei_ict <xulei292@huawei.com>
2026-01-30 21:21:20 +08:00
Li Wang
c26ad78f86 [CI][lint] Add rule codespell back (#6236)
### What this PR does / why we need it?
After removing codepsell a while, we discovered that typo had a problem
correctly recognizing certain misspelled words, so I suggested adding it
back.

- vLLM version: v0.14.1
- vLLM main:
d68209402d

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-26 14:12:33 +08:00
lhchg
717d299ae5 [BugFix]bug fix for dispatch_ffn_combine (#6156)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?
Some synchronization logic of the fusion operator copies EP *
expertPerRank int32 values. This part of data contains synchronization
signals and data.

The 512B DataBlock of Ascend A3 writes all data in the same block
atomically to the HBM.

For the DeepSeek model, when expertPerRank per device is 16, the 512B
alignment is met in both 16-device single-node and 32-device two-node
scenarios. Therefore, we check the first position of each 512B data. If
the value is not 0, it indicates that the current 512B data has been
sent.

However, for other cases where expertPerRank per device is not 16, EP *
expertPerRank does not meet the 512B alignment. If the above logic is
used for checking, there will be problems.

Therefore, here we will pad the EP * expertPerRank data length to the
length aligned to 512B.

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: lhchg <lhao_cheng@163.com>
Co-authored-by: lihaocheng <lihaosheng1@h-partners.com>
2026-01-23 21:14:18 +08:00
lhchg
dc99cfdc15 [CustomOp] support TensorList for dispatchFFNCombine (#5665)
### What this PR does / why we need it?
To support tensorList for dispatch_ffn_combine, to adjust eplb

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
Single Operator Testing

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: lhchg <lhao_cheng@163.com>
Co-authored-by: lihaocheng <lihaosheng1@h-partners.com>
2026-01-09 15:56:29 +08:00
Chen Chen
aa02a85e4d [bugfix] Fix dummy-run and multi-node issues in MoE routing and MTP (#4947)
### What this PR does / why we need it?

- Fix a premature `return` in `moe_init_routing_quant_v2.cpp` so the
routing kernel completes correctly instead of exiting early in certain
paths.
- Switch `FusedAlltoAllCommImpl` to use the MC2-based token dispatcher
and prepare/finalize routines, aligning MoE communication with the MC2
algorithm optimized for Ascend devices.
- Add a temporary override in `MtpProposer` to map `FUSED_ALLTOALL` back
to `ALLTOALL` until the MoE communication type selection logic is fully
finalized, avoiding incorrect behavior in dummy-run flows.
- Simplify the MoE communication selection for Ascend 910-93 in
`NPUModelRunner` by removing the EP-size guard on `FUSED_ALLTOALL`,
which fixes failures in multi-node / larger-EP configurations while
keeping MC2 routing under the configured token capacity.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: mojave2 <chenchen145@huawei.com>
2025-12-15 14:18:23 +08:00
Chen Chen
7f33838e6e Update comment doc (#4731)
### What this PR does / why we need it?

Translate remaining Chinese comments in the `dispatch_ffn_combine` code
to English and update the installation guide to remind users to
initialize submodules when building from source.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: mojave2 <chenchen145@huawei.com>
Signed-off-by: Chen Chen <0109chenchen@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-05 15:07:31 +08:00
Chen Chen
ad0607f900 add dispatch_gmm_combine kernel (#3532)
### What this PR does / why we need it?

This PR introduces the Ascend implementation of the
`dispatch_ffn_combine` kernel and wires it into the vLLM-Ascend runtime,
together with follow‑up fixes to ensure the kernel builds and runs
correctly in CI.

- Add full host and device implementation of the `dispatch_ffn_combine`
kernel under `csrc/dispatch_ffn_combine`, including tiling logic, MOE
routing helpers, and kernel utilities for quantized FFN dispatch.
- Integrate the new kernel with the PyTorch binding
(csrc/torch_binding.cpp, csrc/torch_binding_meta.cpp) and the Ascend
runtime (vllm_ascend/ascend_forward_context.py,
vllm_ascend/worker/model_runner_v1.py).
- Extend fused MoE communication and token dispatch support in
`vllm_ascend/ops/fused_moe`, adding methods/utilities needed by the new
dispatch path.
- Update quantization logic in vllm_ascend/quantization/w8a8_dynamic.py
to support the new FFN dispatch flow.
- Fix kernel build issues by adjusting `csrc/build_aclnn.sh`, CMake
configuration, and include/namespace usage in the new kernel files.
- Add an end‑to‑end nightly test
`tests/e2e/nightly/ops/test_dispatch_ffn_combine.py` and helper
utilities in `vllm_ascend/utils.py` to validate the new kernel.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0

---------

Signed-off-by: mojave2 <chenchen145@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-04 23:00:59 +08:00