Commit Graph

12 Commits

Author SHA1 Message Date
weichen
320edde2df [main] [refactor] refactor fused_moe.py to enable token_dispatchers (#2570)
### What this PR does / why we need it?
Enable token_dispatcher to replace fused_experts_with_xxx in eager mode
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
e2e & ut


- vLLM version: v0.10.1.1
- vLLM main:
704432af3c

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: sherie <963372609@qq.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
Co-authored-by: shiyuan680 <72335504+shiyuan680@users.noreply.github.com>
2025-08-28 10:13:35 +08:00
huangxialu
6881c19458 [main] convert the format of gmm to nz (#2474)
### What this PR does / why we need it?
convert the format of gmm to nz

### Does this PR introduce _any_ user-facing change?
not involved

### How was this patch tested?
ut: test_fused_ops.py and e2e: test_fused_moe.py

**performance**:
(qwen3 30B, 2k->20k)

base:
Total Token throughput (tok/s):          719.93

gmm nz:
Total Token throughput (tok/s):          728.52


- vLLM version: v0.10.1.1
- vLLM main:
bfc1edc9f5

Signed-off-by: huangxialu <huangxialu1@huawei.com>
2025-08-27 11:25:02 +08:00
s30076806
6a4ec186e7 [Qwen-moe] Remove the minor operation arange (#2373)
### What this PR does / why we need it?
Integrate the arange operator to reduce the time spent and improve
performance

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
56dcf4e7e9

---------

Signed-off-by: s30076806 <songjiayang2@h-partners.com>
2025-08-27 09:13:31 +08:00
Mengqing Cao
1327f9be1c Fix some ci issue and refactor modelrunner (#2445)
### What this PR does / why we need it?
Fix some ci issue and refactor modelrunner

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.0
- vLLM main:
4d9c61993a

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
Co-authored-by: weiguihua2 <weiguihua2@huawei.com>
2025-08-20 09:01:04 +08:00
shiyuan680
e14f2ef669 refactor select_experts of moe module (#2150)
### What this PR does / why we need it?
this pr refactor select_experts of moe module
i merge implementations of quantitative and non-quantitative method in a
new class
use such as vllm like ExpertsSelector.select_experts
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
test in qwen3-moe and all ut.

- vLLM version: v0.10.0
- vLLM main:
e18859298d

Signed-off-by: yangcheng <yangcheng104@huawei.com>
Co-authored-by: yangcheng (AJ) <y00806874@china.huawei.com>
2025-08-14 11:50:53 +08:00
Mengqing Cao
ad1083761f [CI][Quickfix] Fix AscendFusedMoE init error (#2268)
### What this PR does / why we need it?
Fix AscendFusedMoE init error. Use `super().__init__()` instead of
`super(FusedMoE, self).__init__()` to ensure the member variables in
base class could be called by the children class

### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with new existing test.


- vLLM version: v0.10.0
- vLLM main:
766bc8162c

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-08-08 10:20:23 +08:00
huangxialu
dceef080b1 [main] remove torch.cat and replace it by List[0] (#2153)
### What this PR does / why we need it?
torch_npu.npu_grouped_matmul:

https://www.hiascend.com/document/detail/zh/Pytorch/710/apiref/torchnpuCustomsapi/context/torch_npu-npu_grouped_matmul.md

According to the document, when `split_item` is 2 or 3,
`torch_npu.npu_grouped_matmul` will return a list which has one element.
Therefore, the `torch.cat` after `torch_npu.npu_grouped_matmul` is
unnecessary.

### Does this PR introduce _any_ user-facing change?
not involved

### How was this patch tested?
ut and e2e covered: `tests/ut/ops/test_fused_ops.py`,
`tests/e2e/singlecard/ops/test_fused_moe.py`

**performance**:
(qwen3 30B, 2k->20k)

base:
Total Token throughput (tok/s):          667.76 

remove cat:
Total Token throughput (tok/s):          680.82 


- vLLM version: v0.10.0
- vLLM main:
fa00c5d75b

Signed-off-by: huangxialu <huangxialu1@huawei.com>
2025-08-07 17:20:19 +08:00
wangxiyuan
36e450eb0f [Misc] Nit fix for disaggregated_prefill and ascend_forward_context (#2097)
we recently added disaggregated_prefill and ascend_forward_context
feature by
ba3dfbd59e
and
df0ec55162.
This PR fix some nit introduced by them to make the code clear.
1. drop `current_platform` usage. It'll lead unknown circular import
error in some case
2. update `set_ascend_forward_context` function to make the logic clear.
for example, remove V0 support in this function.
3. Remove useless `self.local_rank_across_dp` in worker
4. Remove `soc_info.py` to use `get_ascend_soc_version` instead.
 

- vLLM version: v0.10.0
- vLLM main:
02f82fe438

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-08-05 08:39:02 +08:00
huangxialu
9c9a7cd90b [main] adapt usage of npu_moe_gating_top_k_softmax and remove envs.SELECT_GATING_TOPK_SOTFMAX_EXPERTS (#2112)
backport of v0.9.1-dev:
https://github.com/vllm-project/vllm-ascend/pull/1902

origin main npu_moe_gating_top_k_softmax:
https://github.com/vllm-project/vllm-ascend/pull/1355

- vLLM version: v0.10.0
- vLLM main:
055bd3978e

Signed-off-by: huangxialu <huangxialu1@huawei.com>
2025-07-31 21:05:56 +08:00
zzzzwwjj
ba3dfbd59e [main][refactor] Refactoring forward_context and model_runner_v1 (#1979)
### What this PR does / why we need it?

A refactoring of forward_context and model_runner_v1, add some context
which is necessary in model inference into forward_context, and refactor
dummy_run logic, make it more reasonable.
Some details for this PR:

Add `ascend_forward_context`;
Update mc2_v2 op, and support `active_mask` param;
Update scripts in examples dir;
refactor `dummy_run` logic;
Add soc_version for A2 and A3;

### Does this PR introduce _any_ user-facing change?

No change at user-facing.

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
57c22e57f9

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-07-28 14:06:20 +08:00
Zac
2ffe051859 [Test]add ut for deepseek_v2. (#1964)
What this PR does / why we need it?
Add uts for deepseek_v2

Does this PR introduce any user-facing change?
No

How was this patch tested?
- vLLM version: v0.9.2
- vLLM main:
f3137cdd81

---------

Signed-off-by: 张帮政 <zhangbangzheng@huawei.com>
2025-07-24 10:27:50 +08:00
shiyuan680
ac0bf133f4 add ut of fused_moe.py (#1930)
### What this PR does / why we need it?
add unit test for fused_moe.py

- vLLM version: v0.9.2
- vLLM main:
2dec7c1a5d

Signed-off-by: yangcheng <yangcheng104@huawei.com>
Co-authored-by: yangcheng <yangcheng104@huawei.com>
2025-07-23 16:24:09 +08:00