### What this PR does / why we need it?
Add unit tests for torchair graph mode on DeepSeek V3.
### How was this patch tested?
CI passed with the newly added test.
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
Add @jianzs as vLLM Ascend maintainer
----
I would like to nominate Shoujian Zheng (@jianzs
<https://github.com/jianzs>) as a maintainer, starting with my +1.
- He focuses on code quality and good design, with solid reviews in the
P/D disaggregation and DeepSeek improvement areas (30+ high-quality
reviews, such as #issuecomment-2811764833, #discussion_r2069927605 and
#pullrequestreview-2820996674). This is the most important reason I
nominated him: helping community developers complete PRs with high
quality and continuously ensuring the quality of the codebase is one of
the important responsibilities of a maintainer. We believe he is a great
addition.
- Shoujian's main expertise is distributed inference. He has a lot of
production experience in AI infra. He has very good habits, explains all
changes in great detail (#issue-3023082580), and shares results openly
(#issuecomment-2853140443). High-quality PRs: #706, #774, #852.
- Community involvement: actively involved in community discussions, he
is collaborative and helps users solve problems; involved in 30+ PRs and
issues, such as #issuecomment-2911934292 and #issuecomment-2833523571.
Reference:
[1] https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html
[2] https://vllm-ascend.readthedocs.io/en/latest/community/governance.html
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlapping being:
```
| q_rmsnorm | | kv_norm_rope_cache | | q_rope |
| matmul W_DQ | matmul W_DKV | index | index | matmul W_UQ | split | matmul W_KV_T |
```
Currently, the `IndexByTensor` operators introduced by the computation
of `cos` and `sin` can't be offloaded to the secondary stream due to a
known bug in the graph fusion optimization pass. So instead we keep them
in the main stream, only requiring that they be computed before `matmul
W_UQ` to avoid hindering later overlapping. The problem may be solved by
a later optimization (#993), which hoists the computation of `cos` and
`sin` up to the first layer.
### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, which
defaults to `False`.
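For illustration, a minimal sketch of turning this on through
`additional_config` (assuming `additional_config` is accepted by the
`LLM` entrypoint as in the serve commands elsewhere in these notes; the
model path is illustrative):
```python
from vllm import LLM

# Hedged sketch: enable multi-stream MLA via the torchair graph config.
# The nesting under additional_config follows the convention used by
# other torchair_graph_config options in these notes.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # illustrative
    additional_config={
        "torchair_graph_config": {"enable_multistream_mla": True},
    },
)
```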
### How was this patch tested?
Tested on a 1x16 910 node, with a tailored 2-layer DSKv2.
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
### What this PR does / why we need it?
Fix the CANN download URL.
### Does this PR introduce _any_ user-facing change?
No, this does not introduce any user-facing change.
### How was this patch tested?
Ran the **wget** command and confirmed the CANN package downloads
correctly.
---------
Signed-off-by: wan_danfeng <wonderful199082@126.com>
### What this PR does / why we need it?
Fix a bug in the 1p1d disaggregated_prefill example.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested with `python find_device_ips.py` and by running the
disaggregated_prefill example.
Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
### What this PR does / why we need it?
- Add a qwen2.5-7b performance benchmark; this is a sub-PR of #1099 for
the v1 test, and needs more verification
- Fix getting the commit time after checkout
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
This PR adds custom AscendC kernel vocabparallelembedding support in
vllm-ascend; the related CMakeLists and setuptools changes are also
included.
pytest -s benchmarks/ops/ben_vocabparallelembedding.py
pytest -s tests/ops/test_vocabparallelembedding.py
---------
Signed-off-by: ttanzhiqiang <389825161@qq.com>
This PR adds support for speculative decoding in the AscendScheduler.
It also includes partial support for disaggregated prefill; full support
will be merged in a follow-up PR.
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
1. Upgrade vllm to 0.9.1, since 0.9.0 is no longer supported on the main
branch. Keep the docs at 0.9.0 until we publish the first 0.9.1 release.
2. Disable the V0 test for PRs.
3. Move the actionlint check to the lint job.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Best performance for single-machine, 16-card DeepSeek R1: attention
(tp8/dp2) / moe (etp).
Relies on:
- vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
- vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
- https://github.com/vllm-project/vllm-ascend/pull/910 [Reduce
_npu_flash_attention mask to 128x128 for memory savings]
- https://github.com/vllm-project/vllm-ascend/pull/1100 [Reduce memory
usage by splitting tokens in fused_experts]
---------
Signed-off-by: ttanzhiqiang <389825161@qq.com>
Contains #1111 for completeness.
### What this PR does / why we need it?
Implement multi-stream parallelism for MoE layers with shared experts,
where the computation of shared experts is overlapped with the expert
token dispatch and combine. Also, when multi-stream is enabled, the
weights of shared experts are forced to be replicated across all cards,
regardless of any tensor parallelism configuration, to avoid AllReduce
operations.
With the expected overlapping being:
```
| shared gate_up | shared act | | shared down |
| dispatch | routed gate_up, act, down | combine |
```
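To make the dataflow concrete, here is a minimal sketch of this overlap
pattern (assuming the `torch.npu` stream API mirrors `torch.cuda`;
`shared_experts` and `routed_experts` are hypothetical callables, not
the PR's actual module names):
```python
import torch
import torch_npu  # noqa: F401  # provides the torch.npu stream API

second_stream = torch.npu.Stream()

def moe_forward(hidden_states, shared_experts, routed_experts):
    # Launch shared-expert computation on the secondary stream.
    second_stream.wait_stream(torch.npu.current_stream())
    with torch.npu.stream(second_stream):
        shared_out = shared_experts(hidden_states)
    # Meanwhile the main stream runs dispatch, routed experts, combine.
    routed_out = routed_experts(hidden_states)
    # Join the streams before mixing the two partial results.
    torch.npu.current_stream().wait_stream(second_stream)
    return routed_out + shared_out
```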
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested on a 1x16 910 node, with a tailored 2-layer DSKv2.
---------
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
### What this PR does / why we need it?
Improve the assertion for graph mode with MLA.
When running DeepSeek in graph mode, the fused MLA op only supports
`numHeads / numKvHeads ∈ {32, 64, 128}`, so we improve the assertion
message here to avoid confusing users.
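A sketch of what the improved check amounts to (variable and function
names are illustrative, not the actual code):
```python
SUPPORTED_RATIOS = (32, 64, 128)

def check_mla_graph_mode(num_heads: int, num_kv_heads: int) -> None:
    # The fused MLA op only supports a limited set of head ratios.
    ratio = num_heads // num_kv_heads
    assert ratio in SUPPORTED_RATIOS, (
        f"Graph mode MLA requires numHeads / numKvHeads in "
        f"{SUPPORTED_RATIOS}, got {ratio}; please adjust the tensor "
        f"parallel size."
    )
```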
### Does this PR introduce _any_ user-facing change?
Adjusting the TP size is required when running deepseek-v3/r1 in graph
mode. deepseek-v2-lite is not supported in graph mode.
### How was this patch tested?
Tested locally, as the CI machine cannot run V3 due to HBM limits.
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
The former PR https://github.com/vllm-project/vllm-ascend/pull/736
selected the valid tokens inside `input_ids` and `position_ids`, which
breaks the necessary padding required by torchair. In this PR, we move
the padding logic to after the multimodal part.
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Make sure the lint test passes before starting the e2e test, to save
compute resources.
Also updated the patch doc to make sure the CI works as expected.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
1. Add `__init__.py` for vllm_ascend/compilation to make sure it's a
Python module
2. Fix a model runner bug to stay consistent with vllm
3. Add the release note for 0.9.0rc2
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Make the accuracy CI and report work
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual review
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Update 0.9.0rc1 contributors info
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Fix an incompatibility problem for non-EPLB scenarios in #1116.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested with online serving and e2e CI.
Signed-off-by: linfeng-yuan <1102311262@qq.com>
1. Update 0.9.0rc1 release date
2. Update feature and model support list
3. Add DP known issue to release note
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add EPLB expert map import capabilities
### Does this PR introduce _any_ user-facing change?
When importing the EPLB expert map, you need to pass the expert map file
via the vLLM `additional_config` argument.
### How was this patch tested?
1. Collect the expert hotness and generate an expert placement file
based on the hotness and the EPLB algorithm, or directly use an existing
expert placement table.
2. When launching vLLM, enable EPLB and pass the configuration via the
command-line argument:
--additional-config '{"expert_map_path": "/xxx/xxx/xx.json"}'
Co-authored-by: songshanhu07 <1763685535@qq.com>
---------
Signed-off-by: songshanhu07 <1763685535@qq.com>
Signed-off-by: Yuxiao-Xu <664988918@qq.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: songshanhu07 <1763685535@qq.com>
Co-authored-by: Xu Yuxiao <xuyuxiao2@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Remove the `spec_decode.metrics` patch, as this has been resolved in
https://github.com/vllm-project/vllm/pull/16983 (included in vllm
`v0.9.0`).
The relevant docstring changes from "Returns a CUDA event recording when
the copy is complete" to "Returns a device event (NPU Event for
vllm-ascend) recording when the copy is complete."
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
With this PR, we can migrate to the native `data_parallel.py` in vllm
examples and remove the version in vllm-ascend.
At present, `ASCEND_RT_VISIBLE_DEVICES` introduces considerable
difficulties; therefore, we must employ a temporary workaround and
manually specify the device.
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Set `ACL_OP_INIT_MODE` env var default to `0`, since vllm-ascend may
have problems in some scenarios when setting it to `1`.
Plus, the guide https://github.com/vllm-project/vllm-ascend/issues/734
has also been updated.
Signed-off-by: shen-shanshan <467638484@qq.com>
Add unpadded Qwen2.5-VL for the verl scenario.
When using vllm-ascend in the verl scenario, set
`USE_OPTIMIZED_QWEN2_5_VL` (default `1`) to `0` to use the unpadded
Qwen2.5-VL and avoid errors.
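For example, a minimal sketch (the env variable comes from this PR; set
it before vllm-ascend initializes, which is an assumption here):
```python
import os

# Select the unpadded Qwen2.5-VL implementation for the verl scenario.
os.environ["USE_OPTIMIZED_QWEN2_5_VL"] = "0"
```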
This is cherry-picked from 0.7.3-dev
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
### What this PR does / why we need it?
Fix a typo in VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
vllm-ascend does not support multimodal models on vllm-ascend v1 yet.
This PR changes the `model_runner_v1.py` file to use the MRoPE feature,
among other changes, to support them. It is still not perfect, since the
Ascend operator does not support `window/full attn` to reduce Memcpy
operations, so it would run out of memory if the input embeddings are
too large; consequently, we can't use `self._profile_multimodal()` for
profiling, since it uses a big dummy input (i.e. images) as the
multimodal input.
Fixes: https://github.com/vllm-project/vllm-ascend/issues/514
### Does this PR introduce _any_ user-facing change?
No, this feature does not require any user-facing changes.
### How was this patch tested?
I tested this offline on my 910B3 machine with my own fork, and it works
well.
---------
Signed-off-by: cty <ctynb@qq.com>
### What this PR does / why we need it?
Based on the design of dual-batch overlap proposed by the DeepSeek team
and the implementation of fused MoE in the vLLM project, we implement
multi-stream (also known as dual-batch) overlap for deepseek+mla on
Ascend NPU. We split the input batch of the model into two micro-batches
and then overlap the comp/comm ops in the attention and moe layers using
two streams to improve performance. Our approach can be easily extended
when adding dispatch/combine communications for the moe layer.
Compared with the previously proposed
[draft](https://github.com/vllm-project/vllm-ascend/pull/842), we use
one stream for computation ops and the other for communication ops,
separately. In our opinion, this is beneficial for arranging the order
of executing different ops and thus avoiding contention for
computation/communication resources.
ref: [overlap for
llama](https://github.com/vllm-project/vllm/pull/15787/files)
ref: [dbo in
sglang](https://github.com/sgl-project/sglang/pull/4068/files#diff-b4937569fc71f6ad215181b633b2f89c7183a2b4ac39e41fc22635599a9be7de)
### Does this PR introduce _any_ user-facing change?
Adds the env variable `VLLM_ASCEND_ENABLE_DBO`. Users can enable DBO by
setting `VLLM_ASCEND_ENABLE_DBO=1`.
See /examples/offline_dualbatch_overlap_npu.py for more info.
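A minimal sketch of enabling DBO offline (the env variable comes from
this PR; the model name and engine arguments are illustrative, see the
example script above for the full version):
```python
import os

# Enable dual-batch overlap before creating the engine.
os.environ["VLLM_ASCEND_ENABLE_DBO"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="DeepSeek-R1-W8A8", tensor_parallel_size=16)  # illustrative
outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=16))
```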
### How was this patch tested?
This patch can be tested with vllm 0.9.0 using its online service with
benchmark tests. We have decoupled the DBO functionality from vllm, and
it should run without any modification to vllm's code (though some
modifications would be better implemented in vllm).
Any advice/discussion is welcome.
### Performance Benchmark
We ran the benchmark_serving script of vllm to test the performance
after enabling dual-batch overlap.
`python -m vllm.entrypoints.openai.api_server \
--model=DeepSeek-R1-W8A8 \
--trust-remote-code \
--distributed-executor-backend=mp \
-tp=16 \
--port 8006 \
--max-num-seqs 390 \
--max-model-len 32768 \
--max-num-batched-tokens 65536 \
--block-size 128 \
--compilation_config 0 \
--gpu-memory-utilization 0.90 \
--disable-log-requests \
--additional-config
'{"expert_tensor_parallel_size":1,"enable_inter_dp_scheduling":true,"init_torchair_graph_batch_sizes":true,"trace_recompiles":true,"ascend_scheduler_config":{},"enable_graph_mode":false}'`
and ran the benchmark with the following parameters:
`--dataset-name random --random-input-len 4096 --random-output-len 1
--num-prompts 200 --max-concurrency 8 --request-rate 5
--metric-percentiles 90`
1. Tested with the version using allgather+allreduce on Ascend 910B
(tp16 ep16 + deepseek r1 w8a8).
2. Tested with the version using alltoall:
   - prefill QPS: 0.90 -> 1.01
   - mean TTFT: 8226 ms -> 7432 ms
The overlap approach using alltoall communication can be further
optimized by overlapping micro-batch 1's moe computation with
micro-batch 2's dispatch (all-to-all) communication.
---------
Signed-off-by: zhuohuan <zxdu1997@gmail.com>
### What this PR does / why we need it?
The view optimization in torchair (enabled by default for Transpose ops
with any axis of size 1) prevents the weight Transpose from being fused
with the later GroupedMatmul, which decreases the performance of the MoE
layer when expert parallelism equals the total number of experts (e.g.
EP256 for DSKv3). This PR adds an option to solve the problem by
disabling the optimization.
### Does this PR introduce _any_ user-facing change?
Controlled by
`additional_config.torchair_graph_config.enable_view_optimize`, which
defaults to `True`.
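For instance, a hedged sketch of disabling it (assuming
`additional_config` is accepted by the `LLM` entrypoint; the model path
is illustrative):
```python
from vllm import LLM

# Disable the torchair view optimization so the weight Transpose can
# fuse with the later GroupedMatmul (relevant for full-EP deployments).
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # illustrative
    additional_config={
        "torchair_graph_config": {"enable_view_optimize": False},
    },
)
```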
### How was this patch tested?
Tested on a 1x16 910 node, with a tailored 2-layer DSKv2.
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
### What this PR does / why we need it?
- Set default values to fix spec decode
- To avoid OOM, we need to run the test in a single process
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- CI passed, especially the multicards CI
- For the spec decode test, the long-term CI passed
Closes: https://github.com/vllm-project/vllm-ascend/pull/1105
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
[CI] MoE alltoall communication optimization
The DeepSeek V3/R1 model has 256 routed experts. During parallel
inference, if the load on one EP rank is high, the overall communication
and computation time is slowed down; the unevenly distributed load
becomes a weakness of parallel inference. The data volume in the prefill
phase is large, and the inter-card communication and computation time
are closely tied to that data volume. Therefore, a small nonlinear
precision loss can be traded for a near-linear performance improvement.
During parallel inference, global synchronization occurs during
communication. As a result, a card with a low load completes its
computation first and then waits for the card with the highest load to
finish. Therefore, when the load is unbalanced, the most heavily loaded
card slows down the overall execution. Significant performance gains can
be achieved by discarding a small number of tokens, although this is
unacceptable in some precision-sensitive scenarios. However, similar to
quantization, it is a solution that accepts a tolerable precision loss
in some scenarios in exchange for performance. In addition, the
trade-off between performance and precision can be tuned by configuring
the proportion of discarded tokens.
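As a rough illustration of the idea, here is the standard
capacity-factor formulation of token dropping (a sketch only; the names
and the capacity rule are illustrative, not necessarily this PR's exact
logic):
```python
import torch

def capacity_keep_mask(expert_ids: torch.Tensor, num_experts: int,
                       capacity_factor: float = 1.0) -> torch.Tensor:
    """Given a 1-D tensor of per-token expert ids, return a bool mask;
    tokens routed beyond an expert's capacity are dropped (False)."""
    num_tokens = expert_ids.numel()
    # Each expert keeps at most its fair share, scaled by the factor.
    capacity = int(num_tokens / num_experts * capacity_factor)
    keep = torch.zeros(num_tokens, dtype=torch.bool)
    for expert in range(num_experts):
        token_idx = (expert_ids == expert).nonzero(as_tuple=True)[0]
        keep[token_idx[:capacity]] = True  # overflow tokens stay dropped
    return keep
```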
We performed the test on A3. The batch size is 8 (B), the prompt length
is 3.5K tokens (S), and the parallel configuration is AttnDP=2,
AttnTP=8, MoeTP=1, and MoeEP=16. In this scenario, we observed a 10%-15%
performance gain.
In the next version, we'll also provide an alltoallv MoE.
---------
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
### What this PR does / why we need it?
When profiling, it is often necessary to disable the call stack to
reduce profiling overhead, and adjust the profiler_level to level1 to
obtain more detailed operator and communication information.
Therefore, it is recommended to modify the default profiling
configuration.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No
Signed-off-by: ApsarasX <apsarax@outlook.com>
### What this PR does / why we need it?
Fix a bug in torch 2.5.1 that raises a segmentation fault when
`pin_memory` is enabled while creating a tensor using `torch.tensor`.
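A hypothetical repro of the failure mode and one plausible workaround
(assuming a device context where pinned host memory is available; this
is not necessarily the fix this PR applies):
```python
import torch

# On torch 2.5.1, the one-step pinned construction could segfault:
x = torch.tensor([1.0, 2.0, 3.0], pin_memory=True)

# A plausible workaround: create the tensor first, then pin it.
x = torch.tensor([1.0, 2.0, 3.0]).pin_memory()
```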
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
- Adds support for passing `prompt_embeds` to `LLM.generate`, as
```python
llm.generate({"prompt_embeds": input_embeds}, sampling_params)
```
or
```python
llm.generate(
[{"prompt_embeds": input_embeds} for input_embeds in inputs_embeds], sampling_params
)
```
- Add `prompt_embeds` to examples
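For context, one hedged way to produce such `input_embeds` (the model
name and the embedding lookup below are illustrative, not part of this
PR):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

token_ids = tokenizer("Please tell me about the capital of France.",
                      return_tensors="pt").input_ids
with torch.no_grad():
    # Look up prompt embeddings from the model's input embedding table.
    input_embeds = model.get_input_embeddings()(token_ids).squeeze(0)
```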
### How was this patch tested?
CI passed with newly added and existing tests.
I have also tested with the example script in this PR, and the output
looks good:
```bash
[Single Inference Output]
------------------------------
The capital of France is Paris. Paris is the largest city in France and is
------------------------------
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3966.87it/s]
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3.99it/s, est. speed input: 177.08 toks/s, output: 63.91 toks/s]
[Batch Inference Outputs]
------------------------------
Q1: Please tell me about the capital of France.
A1: The capital of France is Paris. It is located in the northern part of the
Q2: When is the day longest during the year?
A2: The day is longest during the year at the summer solstice. This typically occurs
Q3: Where is bigger, the moon or the sun?
A3: The sun is significantly bigger than the moon.
The sun has a diameter of
------------------------------
```
---------
Signed-off-by: wangli <wangli858794774@gmail.com>