Commit Graph

22 Commits

Author SHA1 Message Date
linfeng-yuan
068ed706c8 [feat][torchair] support super kernel feat for quantized dsr1 (#3485)
### What this PR does / why we need it?
Port #1916 and #2157 to master branch to fuse operators in deepseek moe
layers, which can reduce scheduling overhead on devices. Note that this
feature is valid only when `tp_size = 1` and
`multistream_overlap_shared_expert` is enabled with torchair graph mode.

### Does this PR introduce _any_ user-facing change?
Users can enable this feature with `--additional-config
'{"torchair_graph_config":{"enabled":true, "enable_super_kernel":true},
"multistream_overlap_shared_expert":true}'`.

### How was this patch tested?
E2E deepseek serving with 2P1D disaggregated prefill scenarios.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-10-20 20:04:37 +08:00
offline893
6c9909c861 [Patch]patch of v1 executor when enable eplb. (#3511)
### What this PR does / why we need it?
when using dynamic eplb, patch v1 executor to avoid create child process
failed.

### How was this patch tested?
deepseek in v3.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-19 10:54:26 +08:00
yechao237
4750d45d86 [BugFix]Support redundant experts in EPLB (#3473)
This PR adds support for redundant experts in the EPLB. 

Key points: 
- Use global_num_experts = num_experts + num_redundant_experts
consistently.
- Backward compatible when num_redundant_experts=0. 

Tested 
On a 16-rank setup (W8A8) with static EPLB and expert_map_path,
verifying router logits shape and successful requests.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: yechao237 <yechao20180411@gmail.com>
2025-10-18 00:09:16 +08:00
offline893
5a3082cd15 [EPLB]Record expert map without dynamic eplb. (#3409)
What this PR does / why we need it?
1.Record expert map without dynamic eplb.
2.Add export PYTHONOPTIMIZE=1  when using dynamic eplb.
3.change eplb doc

Does this PR introduce any user-facing change?
How was this patch tested?
Qwen3_moe in A3.

- vLLM version: v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-15 14:21:15 +08:00
realliujiaxu
31682961af [Feat] enable hierarchical communication for mc2 ops on A2 (#3015)
Currently, when in A2, setting the environment variables
`HCCL_INTRA_PCIE_ENABLE=1` and `HCCL_INTRA_ROCE_ENABLE=0` can reduce
cross-machine communication traffic and significantly improve
communication performance.

For more details, please refer to
[document](https://www.hiascend.com/document/detail/zh/Pytorch/710/apiref/torchnpuCustomsapi/context/torch_npu-npu_moe_distribute_dispatch_v2.md)

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-10-13 16:13:17 +08:00
dsxsteven
847d12a389 [BugFix]Fix moe load problems in torchair when using dynamic eplb (#3381)
### What this PR does / why we need it?

When using dynamic eplb, moe load is not imported. We fix this problem
by modifying the return value of hidden states in torchair.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
DeepseekV3 in A3.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: daishixun <dsxsteven@sina.com>
2025-10-13 11:38:57 +08:00
offline893
82b6c846ca [BugFix]Fix eplb problems when using dynamic eplb. (#3364)
### What this PR does / why we need it?
When using dynamic eplb,it will be blocking by nz tensor.We fix these
prolems by clone src tensor and recv tensor.

### Does this PR introduce any user-facing change?

### How was this patch tested?
Qwen3_moe in A3.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-11 14:04:02 +08:00
zouyida2052
ee0a95e47f bugfix for mtp when running torchair in a2 (#3354)
### What this PR does / why we need it?
when ops torchair_fused_experts_with_mc2 is called, we need pass a tp
group, but now it only pass when quantized scenario, we need also pass
it when unquantized.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-10-10 23:07:24 +08:00
offline893
1c2c72af8d [bugfix]change log2phy map to npu (#3339)
### What this PR does / why we need it?
Resolved the issue of EPLB failure caused by changes in the log2phy map
due to device type modifications when using MTP rotation position
encoding.

### Does this PR introduce any user-facing change?

### How was this patch tested?
https://github.com/vllm-project/vllm/commit/releases/v0.11.0


- vLLM version: v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-10 08:47:55 +08:00
weichen
94dd832815 [MoE] [Refactor] Combine common_fused_moe and fused_moe (#3176)
### What this PR does / why we need it?
1. Move additional functionalities from fused_moe.py to
common_fused_moe.py and remove fused_moe.py
2. Remove unnecessary custom classes from qwen3_moe.py, and it will be
completely removed after we release vllm-ascend v0.11.0

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Qwen3-30B-A3B/Qwen3-30B-A3B-W8A8/DeepSeek-V3-W4A8-Pruing/deepseek-mtp/pangu-pro-moe-pruing:

1. Enable/Disable EP
3. Aclgraph & eager
4. SP


- vLLM version: v0.11.0

---------

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
2025-10-09 14:12:46 +08:00
wangxiyuan
f12f76d7ba Drop 0.10.2 (#3284)
Drop v0.10.2 support, we support vLLM 0.11.0rc3 now.
- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-09 10:28:38 +08:00
Mengqing Cao
4ff422c730 [CI][Bugfix] Quickfix for DPMetaData (#3234)
### What this PR does / why we need it?
Fix `dpmetadata` and `Qwen3MoeSparseMoeBlock` break introduced by
26a7a33b88 (diff-c1550d0a38469d039370567d8981969530cbfffc7302cd1778e7c2c8a9322dea)

NOTE: we maintain a different sp in vllm-ascend with vllm, thus we can
just use `cu_tokens_across_sp(1)` as `cu_tokens_across_dp_cpu`

close https://github.com/vllm-project/vllm-ascend/issues/3236,
https://github.com/vllm-project/vllm-ascend/issues/3239
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-09-28 21:11:22 +08:00
weijinqian0
6aa4253798 [Refactor] [SP]The sequence parallelism characteristics in the MoE and Dense models are integrated into a single solution. (#3085)
What this PR does / why we need it?

there are two sets of sp implementations for moe and dense models. One
is called sequence_parallelism, and the other is flashcomm_v1.
We did the following things:

Merge two sets of code with the same implementation into one.
Remove the implementation of sequence_parallelism, as this solution
cannot support aclgraph.
Does this PR introduce any user-facing change?

No

How was this patch tested?

e2e&ut

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-09-24 11:29:59 +08:00
linfeng-yuan
d01fd1d1c3 [misc][torchair] fix bugs around deepseek mtp, enable_shared_expert_dp and use_cached_kv_cache_bytes (#3074)
### What this PR does / why we need it?
This miscellaneous​ contains several small fixes:
1) fix initialization and forward bugs of DeepseekMTPLayer with
`shared_expert_dp` enabled.
2) fix a tensor shape mismatches after o_proj caused by a work-aroud
change in NPUModelRunner.
3) avoid unnecessary decline of kv_cache memory (default: 64MB) with
`use_cached_kv_cache_bytes` disabled.
4) fall back `fused_moe_state` from `MC2` to `All2All` since the padding
logic of `mc2_mask` is incompatible with input hidden_states when
`shared_expert_dp` enabled.

Once this PR is merged, users can launch disaggregated_prefill
deployments (large_ep) with `deepseek_mtp` and `shared_expert_dp` as
`v0.9.1-dev` branch. The remaining problem of kv_cache tokens decline
compared to `v0.9.1-dev` will be resolved by
https://github.com/vllm-project/vllm-ascend/pull/3073.
 
### Does this PR introduce _any_ user-facing change?

No.
### How was this patch tested?
E2E vllm serving about deepseek_mtp with torchair graph mode and
`enable_shared_expert_dp` with eager mode. Large ep deployments are also
tested with this PR.


- vLLM version: v0.10.2
- vLLM main:
5aeb925452

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-09-23 14:52:42 +08:00
Li Wang
12bcbd02bb [CI] Upgrade vLLM to 20250919 (6d8246aa) and fix some broken issue (#2907)
### What this PR does / why we need it?
1. This pr bump vllm commit to
6d8246aaff
2. fix upstream changes https://github.com/vllm-project/vllm/pull/24548
abort multi-modal kwargs, make vllm main and `v0.10.2` both adaptable
3. fix metadata_builder changes introduced by
https://github.com/vllm-project/vllm/pull/23693
4. fix `structured_outputs_config` changes introduced by
https://github.com/vllm-project/vllm/pull/22772
5. fix `moe_config` changes introduced by
https://github.com/vllm-project/vllm/pull/22537

Co-authored-by:  MengqingCao <cmq0113@163.com>
Co-authored-by:  Yikun Jiang <yikunkero@gmail.com>


- vLLM version: v0.10.2
- vLLM main:
c60e6137f0

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-09-20 17:37:57 +08:00
whx
0a526768f5 [Feature] Support moe multi-stream for aclgraph. (#2946)
This PR puts the calculation of shared experts into a separate stream,
overlaping with routing experts.

- vLLM version: v0.10.2
- vLLM main:
fbd6523ac0

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-09-19 11:06:45 +08:00
linfeng-yuan
79a910ef47 [bugfix][torchair] fix multistream_moe problems in torchair graph mode (#2681)
This pr fixes two problems while `multistream_moe` enabled in torchair
graph mode:
1. check `TorchairAscendW8A8DynamicFusedMoEMethod` instead of incorrect
`AscendW8A8DynamicFusedMoEMethod`
2. mc2_mask should be chunked no matter `replace_allreduce` is True or
False in forward function of `TorchairAscendFusedMoE`

- vLLM version: v0.10.2
- vLLM main:
0fb2551c23

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-09-18 17:35:04 +08:00
offline893
76844eec78 Dynamic Expert Load Balance with Zero-like-overhead (#2956)
### Motivation
Currently dynamically experts balancing would stop-the-world.
Asynchronously expert load balancing would be better without flowing
problems:

Host-bound latency:
There are many cpu operations during EPLB such as
eplb-algorithm、creating p2p ops、and log2phy expert converting would
spend long cpu time, as ~1s.
Communication latency: The transfer time would cost much in the
situation without nvlink. As the weight of an expert maybe transfer to
multiple new positions, thus N times send/recv for one expert, with
result long latency. We had tested that batch_isend_irecv cost more
100ms for 16 experts weight transmission in A2 server of ascend.

SwiftBalancer would not stop-the-world anymore, in out test on NPU 1~2ms
cost for each layer while benefit 5ms-8ms decode latency with ep_size =
64.
The following updates have been made:
1、expert distribution recording with lower cost.
2、async cpu computing for eplb algo and other python operator.
3、new eplb algo with less expert rebalancing while almost the same
effect.
### Proposed Change
We will gradually migrate the EPLB logic to the VLLM community and
implement a generalized design. Relevant RFC:
https://github.com/vllm-project/vllm/issues/22246
The overall workflow involves:
<img width="801" height="302"
alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c"
src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed"
/>
1. Record experts distribution during forward. We using expert_token_num
after disptach instead of topk_ids, thus we got much smaller tensor
shape to reduce cost of hbm recording and add-operator.
2. Do all-gather for experts distribution. Using all-gather instead of
all-reduce as less traffic volume.
3. Wake up eplb worker process with experts distribution when
num_iterations comes. Run eplb algorithm in eplb worker.
4. Generate p2p send/recv ops and other operator such as log2phy would
cost long cpu time.
5. Lanch ibatch_send_recv in async_stream before forward.
6. After forward, wait for the ibatch_send_recv finish, then do uapte
expert map and expert weights.
### Co-author
Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con
Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn
Co-authored-by: qmkakaxi wjh1594260677@qq.com
Co-authored-by: Skywalker-EP 173723846@qq.com


- vLLM version: v0.10.2
- vLLM main:
567939953b

---------

Signed-off-by: offline0806 <z00858301@china.huawei.com>
Co-authored-by: offline0806 <z00858301@china.huawei.com>
2025-09-17 10:36:43 +08:00
henryxuxu0716
51a2aec115 Delete redundant codes related to communication (#2717)
### What this PR does / why we need it?
Delete redundant codes related to communication

### Does this PR introduce _any_ user-facing change?
not involve

### How was this patch tested?
not involve

- vLLM version: v0.10.1.1
- vLLM main:
6c7af8110a

---------

Signed-off-by: 刘哲续 <liuzhexu1@huawei.com>
Co-authored-by: 刘哲续 <liuzhexu1@huawei.com>
2025-09-05 09:39:39 +08:00
xuyexiong
214b32a346 [V1][BUGFIX][0.10.1] FIX mtp on main branch (#2632)
### What this PR does / why we need it?
Fix MTP torchair bug caused by torchair refactor and moe refactor

Depends on PRs:
fused moe fix: https://github.com/vllm-project/vllm-ascend/pull/2627 
torchair multi DP fix:
https://github.com/vllm-project/vllm-ascend/pull/2626

### Does this PR introduce _any_ user-facing change?
when dp is enabled, to run mtp online server, need to disable server log
due to the current metrics does not support multi dp
`--disable-log-stats`
### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
7c8271cd1e

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-09-02 11:12:41 +08:00
weichen
3a5fc5ee01 [Refactor][MoE] remove redundant code after refactoring fused_moe (#2612)
### What this PR does / why we need it?
There are a lot of redundant codes related to moe here, and the
structure is not very clear.
We did the following things:

we have placed the relatively independent code related to apply_mlp into
a separate file;
removed the environment variables of alltoall_buffer and alltoall_seq.
Remove the code related to alltoall_buffer and alltoall_seq, and retain
the sole TokenDispatcher inheritance class.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e&ut

- vLLM version: v0.10.1.1
- vLLM main:
4071c76cf3

---------

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
2025-08-30 22:28:50 +08:00
Wang Yixuan
0f81e032f0 [1/N][refactor] torchair fused_moe refactor (#2438)
### What this PR does / why we need it?
Move torchair related fused_moe section into torchair_fused_moe to make
the code clear. Next step we'll remove all torchair related code outside
of torchair_fused_moe .

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vLLM version: v0.10.0
vLLM main:
08d5f7113a

- vLLM version: v0.10.1.1
- vLLM main:
170e8ea9ea

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-08-25 15:46:10 +08:00