### What this PR does / why we need it?
This PR repalce the vision tower in Qwen2.5-Omni-Thinker model,
Qwen2_5_VisionTransformer, with AscendQwen2_5_VisionTransformer, which
use QKV padding for padding performance.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: Ting FU <futing10@huawei.com>
This PR deletes model codes of deepseek_v2 and deepseek_v3 to reuse the
model file from vLLM.
vLLM Ascend now uses custom ops register way instead of model file
hard-coding.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
Seperate deepseek v3.2 modeling form deepseek v2
### How was this patch tested?
- CI passed with existing test.
- test deepseek v3.2 locally
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
1. Move additional functionalities from fused_moe.py to
common_fused_moe.py and remove fused_moe.py
2. Remove unnecessary custom classes from qwen3_moe.py, and it will be
completely removed after we release vllm-ascend v0.11.0
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Qwen3-30B-A3B/Qwen3-30B-A3B-W8A8/DeepSeek-V3-W4A8-Pruing/deepseek-mtp/pangu-pro-moe-pruing:
1. Enable/Disable EP
3. Aclgraph & eager
4. SP
- vLLM version: v0.11.0
---------
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
### What this PR does / why we need it?
This PR is for the adaptation and optimization of qwen3_vl and
qwen3_vl_moe on the Ascend platform.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
b1068903fd
---------
Signed-off-by: booker123456 <945658361@qq.com>
### What this PR does / why we need it?
This PR fused addrmsnorm op and w8a8 quant op to get better perf.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.10.2
- vLLM main:
0faf3cc3e8
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
This PR deletes ~2K lines of code about deepseek modeling. It falls back
CustomDeepseekV2 modules to original vllm implementations and adapts
some modifications in vllm about deepseek and moe.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E vllm serving with torchair graph mode and eager mode.
- vLLM version: v0.10.2
- vLLM main:
759ef49b15
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: yiz-liu <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
vllm-ascend won't contain model file anymore. Now pangu model file has
been moved to torchair module. The origin one can be removed.
Note: After this PR, pangu only works with torchair mode then.
- vLLM version: v0.10.1.1
- vLLM main:
8c892b1831
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Remove redundant imported `envs`, using `envs_ascend` instead.
```python
import vllm.envs as envs_vllm
import vllm_ascend.envs as envs_ascend
```
- vLLM version: v0.10.0
- vLLM main:
71683ca6f6
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
Optimizes the performance of the Qwen3 quantization model by registering
a custom model and adding the AddRmsNormQuant operation. Subsequent PRs
will focus on performance optimizations based on this custom model.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.9.2
- vLLM main:
8d0a01a5f2
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
Before patch, we can see
`vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM`, it seems
not friendly format.
```
WARNING 07-14 23:57:34 [registry.py:413] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 07-14 23:57:34 [registry.py:413] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 07-14 23:57:34 [registry.py:413] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Local Test.
- vLLM version: v0.9.2
- vLLM main:
bcdfb2a330
Signed-off-by: xleoken <xleoken@163.com>
### What this PR does / why we need it?
Support Pangu Pro MoE model (https://arxiv.org/abs/2505.21411)
### Does this PR introduce _any_ user-facing change?
Yes, new model supported
### How was this patch tested?
Test locally
---------
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Add unpadded Qwen2.5-VL for verl scenario.
When using vllm-ascend for verl scenario, set `USE_OPTIMIZED_QWEN2_5_VL`
(default `1`) to `0` to use unpadded Qwen2.5-VL to avoid errors.
This is cherry-picked from 0.7.3-dev
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
### What this PR does / why we need it?
Based on the design of dual-batch overlap proposed by Deepseek team and
also the implementation of fused moe in VLLM project, we implement the
multi-stream(also known as dual-batch) overlap for deepseek+mla on
Ascend NPU. We split the input batch of model into two microbatches and
then overlap the comp/comm ops in attention and moe layers using two
streams to improve the performance. Our approach can be easily extended
when adding dispatch/combine communications for moe layer.
Compared with the previously proposed
[draft](https://github.com/vllm-project/vllm-ascend/pull/842), we use
one stream for computation ops and the other for communication ops,
separately. In out opinions, it is beneficial for arranging the order of
executing different ops and thus avoiding the contention of
computation/communication resources.
ref: [overlap for
llama](https://github.com/vllm-project/vllm/pull/15787/files)
ref: [dbo in
sglang](https://github.com/sgl-project/sglang/pull/4068/files#diff-b4937569fc71f6ad215181b633b2f89c7183a2b4ac39e41fc22635599a9be7de)
### Does this PR introduce _any_ user-facing change?
Adding an env variable "VLLM_ENABLE_DBO". Users can enable dbo by
setting "VLLM_ASCEND_ENABLE_DBO=1"
See /examples/offline_dualbatch_overlap_npu.py for more info.
### How was this patch tested?
This patch can be tested with vllm-0.9.0 using its online service with
benchmark tests. We have decoupled the func of dbo from vllm and it
should be able to run without any modification to the code of vllm(some
modifications is better to implement in vllm though).
Any advice/discussion is welcome.
### Performance Benchmark
We have ran the benchmark_serving script of vllm to test the performance
after using dual-batch overlap.
`python -m vllm.entrypoints.openai.api_server \
--model=DeepSeek-R1-W8A8 \
--trust-remote-code \
--distributed-executor-backend=mp \
-tp=16 \
--port 8006 \
--max-num-seqs 390 \
--max-model-len 32768 \
--max-num-batched-tokens 65536 \
--block-size 128 \
--compilation_config 0 \
--gpu-memory-utilization 0.90 \
--disable-log-requests \
--additional-config
'{"expert_tensor_parallel_size":1,"enable_inter_dp_scheduling":true,"init_torchair_graph_batch_sizes":true,"trace_recompiles":true,"ascend_scheduler_config":{},"enable_graph_mode":false}'`
and run benchmark with the parameters of :
`--dataset-name random --random-input-len 4096 --random-output-len 1
--num-prompts 200 --max-concurrency 8 --request-rate 5
--metric-percentiles 90`
1. test with the version using allgather+allreduce in Ascend 910B (tp16
ep16 + deepseek r1 w8a8)
2. test with the version using alltoall:
prefill qps: 0.90 -> 1.01
Mean TTFT:8226->7432ms
The overlap approach when using alltoall communication can be further
optimized by overlapping micro-batch1's moe comp with micro-batch2's
dispatch a2a comm
---------
Signed-off-by: zhuohuan <zxdu1997@gmail.com>
Tweak packed_modules_mapping to support W8A8 weights.
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.
- Please clarify why the changes are needed. For instance, the use case
and bug description.
- Fixes #
-->
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Optimize qwen2_vl and qwen2_5_vl.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Testing this PR on 1080p picture with tp=1, bs=1 on Qwen2-VL and
Qwen2.5-VL, every fa op's during time lasting from 11ms to 9ms, got
roughly 22% perf boost.
---------
Signed-off-by: zouyida2052 <zouyida@huawei.com>
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
Co-authored-by: zouyida2052 <zouyida@huawei.com>