Commit Graph

138 Commits

Author SHA1 Message Date
Shanshan Shen
4e2daf5ab7 [Doc] Add qwen2-audio eager mode tutorial (#1371)
### What this PR does / why we need it?
Add qwen2-audio eager mode tutorial.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-26 16:56:05 +08:00
leo-pony
1025344912 Doc Enhancement: Single NPU(Qwen3-8B) aclgraph mode + eager mode (#1374)
### What this PR does / why we need it?
Doc Enhancement: Single NPU(Qwen3-8B) aclgraph mode + eager mode.
Relate RFC: https://github.com/vllm-project/vllm-ascend/issues/1248

### Does this PR introduce _any_ user-facing change?
No changes.


### How was this patch tested?
Preview

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-06-26 16:52:54 +08:00
wangxiyuan
205cb85a1e [Doc] Fix doc typo (#1424)
1. Fix the typo
2. Fix a 404 URL
3. Update the graph mode and additional config user guides

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-25 19:28:26 +08:00
Li Wang
15df8be937 [Doc] Add sleep mode doc (#1295)
### What this PR does / why we need it?
Add sleep related doc and example

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-25 14:07:14 +08:00
wangxiyuan
e4e0b7af05 [Doc] Add patch doc (#1414)
1. Format the developer guide content to make it clearer
2. Add the patch doc to the developer guide

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-25 12:00:45 +08:00
Mengqing Cao
c1c5d56255 [Doc] Update FAQ and add test guidance (#1360)
### What this PR does / why we need it?
- Add test guidance
- Add reduce layer guidance
- Update FAQ on deterministic calculation

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-25 09:59:23 +08:00
Yikun Jiang
917c6b71af [TEST][DOC] Fix doctest and add system package installation (#1375)
### What this PR does / why we need it?
- Fix
[doctest](https://github.com/vllm-project/vllm-ascend/actions/workflows/vllm_ascend_doctest.yaml?query=event%3Aschedule)
- Add system package installation
- Add doc for running doctests
- Clean up all extra steps in .github/workflows/vllm_ascend_doctest.yaml
- Change the schedule job interval from 4 to 12 hours

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- doctest CI passed
- Local test with
`/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh`.

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-23 20:50:33 +08:00
Icey
08cfc7cb4b Modify installation.md for adding pip extra index of torch-npu (#1272)
### What this PR does / why we need it?
Modify installation.md to add the pip extra index for torch-npu

### How was this patch tested?
No need

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-06-23 15:37:50 +08:00
weiguihua2
e1123172d1 [Doc] Add reinstall instructions doc (#1303)
Add a new FAQ: if users reinstall vllm-ascend with pip, the `build`
folder should be removed first.

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
Signed-off-by: weiguihua <weiguihua2@huawei.com>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-06-23 14:06:27 +08:00
Pleaplusone
7e6efbf2a9 update torch-npu to 2.5.1.post1.dev20250619 (#1347)
### What this PR does / why we need it?
This PR updates torch_npu to the newest release version
2.5.1.post1.dev20250619.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI tests will guarantee the update.

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-06-23 09:02:09 +08:00
xleoken
4447e53d7a [Doc] Change not to no in faqs.md (#1357)
### What this PR does / why we need it?

Change not to no in faqs.md.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Local Test

Signed-off-by: xleoken <xleoken@163.com>
2025-06-23 09:01:00 +08:00
Yikun Jiang
2e5f312530 Cleanup unused doc (#1352)
### What this PR does / why we need it?
Clean up the unused doc for the MoGE model; we will add it back when the
MoGE model is ready.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-22 15:05:30 +08:00
Yikun Jiang
c30ddb8331 Bump v0.9.1rc1 release (#1349)
### What this PR does / why we need it?
Bump v0.9.1rc1 release

Closes: https://github.com/vllm-project/vllm-ascend/pull/1341
Closes: https://github.com/vllm-project/vllm-ascend/pull/1334

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed


---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: shen-shanshan <467638484@qq.com>
2025-06-22 13:15:36 +08:00
wangxiyuan
45be1aac0c [CI] Add codespell check for doc (#1314)
Add a codespell check test for doc-only PRs

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-20 16:48:14 +08:00
22dimensions
761bd3d9d7 Add user guide for quantization (#1206)
### What this PR does / why we need it?

Add user guide for quantization

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-06-20 15:53:25 +08:00
songshanhu07
ebb2a70dbb static EPLB fix bug, add unit test (#1186)
### What this PR does / why we need it?
1. Add static EPLB unit test
2. Fix bug: a Tensor cannot be directly evaluated in an `if` statement
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Run the unit test.

---------

Signed-off-by: songshanhu07 <1763685535@qq.com>
2025-06-18 19:46:56 +08:00
Yikun Jiang
05dec7eda9 [Doc] Refactor and init user story page (#1224)
### What this PR does / why we need it?
This PR refactors the user stories page:
- Move it to community
- Add initial info of LLaMA-Factory, Huggingface/trl, MindIE Turbo,
GPUStack, verl
- Add a new page for LLaMA-Factory

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview locally

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-17 09:36:35 +08:00
Yikun Jiang
9d3cbc0953 [Doctest] add installation doctest (#1179)
### What this PR does / why we need it?
Add installation doctest

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Related: https://github.com/vllm-project/vllm-ascend/pull/983

Co-authored-by: wangli <wangli858794774@gmail.com>

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-17 08:52:26 +08:00
Mengqing Cao
96fa7ff63b [DP][V1] Fix rank set in DP scenario & Bump torch-npu version to 2.5.1.post1.dev20250528 (#1235)
### What this PR does / why we need it?
1. Fix rank set in DP scenario. The new POC version of torch-npu supports
setting `ASCEND_RT_VISIBLE_DEVICES` dynamically, so we can use the
rank set in `DPEngineCoreProc` directly instead of calculating the local
rank across DP by hand in the patched `_init_data_parallel`.

Closes: https://github.com/vllm-project/vllm-ascend/issues/1170

2. Bump torch-npu version to 2.5.1.post1.dev20250528

Closes: https://github.com/vllm-project/vllm-ascend/pull/1242
Closes: https://github.com/vllm-project/vllm-ascend/issues/1232


### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: Icey <1790571317@qq.com>
2025-06-16 23:09:53 +08:00
22dimensions
0d2074a1ec [Doc] fix VLLM_USE_V1 value in graph mode docs (#1226)
`os.environ["VLLM_USE_V1"]` must be assigned a str, not any other type.


![image](https://github.com/user-attachments/assets/9d337ae5-00e5-4179-832e-c6c917dd5798)
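A minimal Python illustration of the constraint above: `os.environ` only accepts string values, so assigning an integer fails while the string `"1"` works. The value `"1"` is only an example here.

```python
import os

# os.environ maps str -> str; a non-str value raises TypeError.
try:
    os.environ["VLLM_USE_V1"] = 1  # wrong: int is rejected
except TypeError as exc:
    print(f"rejected: {exc}")

# Correct: assign the value as a string.
os.environ["VLLM_USE_V1"] = "1"
print(os.environ["VLLM_USE_V1"])
```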

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-06-15 15:41:11 +08:00
fems14
ab5d110fcc vllm-ascend support chunked prefill (#1172)
### What this PR does / why we need it?
vllm-ascend supports chunked prefill for MLA


---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-06-14 22:31:16 +08:00
Mengqing Cao
a3b5af8307 [CI/UT][Graph] Add ut for torchair graph mode (#1103)
### What this PR does / why we need it?
Add ut for torchair graph mode on DeepSeekV3

### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-06-14 16:59:00 +08:00
Yikun Jiang
94a52cf577 Add ShouJian Zheng (@jianzs) as vLLM Ascend maintainer (#1203)
### What this PR does / why we need it?

Add @jianzs as vLLM Ascend maintainer

@jianzs
----
I would like to nominate Shoujian Zheng (@jianzs
<https://github.com/jianzs>) as a maintainer, starting with my +1.

- He focuses on code quality and good design, with 30+ solid, high-quality
reviews in the P/D disaggregation and DeepSeek improvement areas, such as
#issuecomment-2811764833, #discussion_r2069927605 and
#pullrequestreview-2820996674. This is the most important reason I
nominated him, because helping community developers complete PRs with high
quality and continuously ensuring the quality of the codebase is one of the
important responsibilities of a maintainer. We believe he is a great
addition.
- Shoujian's main expertise is distributed inference, and he has extensive
production experience with AI infrastructure. He has very good habits: he
explains all changes in great detail (#issue-3023082580) and shares results
openly (#issuecomment-2853140443). High-quality PRs: #706, #774, #852.
- Community Involvement: Actively involved in community discussion; he is
collaborative, helps users solve problems, and has been involved in 30+ PRs
and issues, such as #issuecomment-2911934292 and #issuecomment-2833523571.

Reference:
[1] https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html
[2] https://vllm-ascend.readthedocs.io/en/latest/community/governance.html

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-13 18:25:50 +08:00
sdmyzlp
e72f94e38f Support multistream of MLA vector operations (#1135)
### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlapping being:
```
              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |
```

Currently, the `IndexByTensor` operators introduced by the computation of
`cos` and `sin` can't be offloaded to the secondary stream due to a
known bug in the graph fusion optimization pass. So we instead keep them in
the main stream, only requiring that they be computed before `matmul W_UQ` to
avoid hindering later overlapping. The problem may be solved by a later
optimization (#993), which hoists the computation of `cos` and `sin` up
to the first layer.

### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted
to False.

### How was this patch tested?
Tested on a 1x16 910 node with a tailored 2-layer DSKv2.

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-12 21:42:09 +08:00
Wan_Danfeng
55c0e68883 [Doc] Add Referer header for CANN package download url. (#1192)
### What this PR does / why we need it?
Fix the CANN download URL by adding a Referer header.

### Does this PR introduce _any_ user-facing change?
No, there is no user-facing change.

### How was this patch tested?
Ran the **wget** command, and the CANN package was downloaded correctly.

---------

Signed-off-by: wan_danfeng <wonderful199082@126.com>
2025-06-12 21:22:23 +08:00
chenwaner
e46dc142bf Enable kvcache_nz for the decode process in torchair graph mode (#1098)
### What this PR does / why we need it?
Enable kvcache_nz for the decode process in torchair graph mode, which
reduces the time consumed by FA in long sequences.

### Does this PR introduce _any_ user-facing change?
To enable kvcache_nz, set
`additional_config.torchair_graph_config.enable_kv_nz=True`.
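The dotted path above corresponds to a nested dictionary inside `additional_config`. The sketch below uses a hypothetical helper, `dotted_to_nested`, to show how such a path expands into the JSON object that would be passed to `--additional-config`; it is an illustration, not part of vllm-ascend.

```python
import json


def dotted_to_nested(dotted: str, value):
    """Expand 'a.b.c' into {'a': {'b': {'c': value}}} (hypothetical helper)."""
    nested = value
    # Build the dict from the innermost key outwards.
    for key in reversed(dotted.split(".")):
        nested = {key: nested}
    return nested


# The leading 'additional_config' segment names the argument itself, so it is dropped.
cfg = dotted_to_nested("torchair_graph_config.enable_kv_nz", True)
print(json.dumps(cfg))
```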

### How was this patch tested?
1. Tested on the DeepSeek model with batch size 64 and seq_len 1k+3k: the
total FA time over 61 layers improves from 20.80ms to 19.76ms.
2. Operator precision test:

[aclnnFusedInferAttentionScoreV3_result.csv](https://github.com/user-attachments/files/20664138/aclnnFusedInferAttentionScoreV3_result.csv)
3. TPOT test from @ttanzhiqiang; the result of a single curl request is normal.

https://github.com/vllm-project/vllm-ascend/pull/1098#issuecomment-2948542159

https://github.com/vllm-project/vllm-ascend/pull/1098#issuecomment-2954496588

---------

Signed-off-by: chenwaner <861645847@qq.com>
2025-06-11 14:09:28 +08:00
yz
4153a5091b [Doc] Fix the config parameter name "enable" in graph_mode.md. (#1159)
Fix the doc typo in graph_mode.md

Signed-off-by: yzim <43207690+yzim@users.noreply.github.com>
2025-06-11 11:03:37 +08:00
depeng1994
860a5ef7fd provide an e2e guide for execute duration profiling (#1113)
### What this PR does / why we need it?
Provide an e2e guide for execution duration profiling.


Signed-off-by: depeng1994 <depengzhang@foxmail.com>
2025-06-11 10:02:11 +08:00
sdmyzlp
7bdc606677 Support multistream of shared experts in FusedMoE (#997)
Contains the changes of #1111 for completeness.

### What this PR does / why we need it?
Implement multi-stream parallelism for MoE layers with shared experts,
where the computation of shared experts is overlapped with expert token
dispatch and combine. Also, when multi-stream is enabled, the weights of
shared experts are forced to be replicated across all cards, regardless
of the tensor parallelism configuration, to avoid AllReduce operations.

The expected overlapping is:
```
| shared gate_up | shared act |              | shared down |
|    dispatch    | routed gate_up, act, down |   combine   |
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested on a 1x16 910 node with a tailored 2-layer DSKv2.

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-11 09:18:38 +08:00
Mengqing Cao
8dd686dfa2 [MLA][Graph] Improve assertion on Graph mode with MLA (#933)
### What this PR does / why we need it?
Improve assertion on Graph mode with MLA.

When running DeepSeek in graph mode, the fused MLA op only supports
`numHeads / numKvHeads ∈ {32, 64, 128}`, so we improve the assertion
message here to avoid confusing users.

### Does this PR introduce _any_ user-facing change?
The TP size must be adjusted when running DeepSeek-V3/R1 in graph
mode. DeepSeek-V2-Lite is not supported in graph mode.
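The head-ratio constraint above can be checked before launch. This is a hypothetical sketch, not the vllm-ascend assertion itself: it assumes attention heads are split evenly across tensor-parallel ranks and that the ratio is computed per rank against a single KV head, as is typical for MLA. The head counts (128 for DeepSeek-V3/R1, 16 for DeepSeek-V2-Lite) come from the public model configs.

```python
ALLOWED_QUERIES_PER_KV = {32, 64, 128}  # fused MLA op constraint quoted above


def mla_graph_mode_ok(num_heads: int, kv_heads_per_rank: int, tp_size: int) -> bool:
    """Return True if the per-rank head ratio is accepted (hypothetical check)."""
    if num_heads % tp_size != 0:
        return False  # heads must split evenly across TP ranks
    heads_per_rank = num_heads // tp_size
    if heads_per_rank % kv_heads_per_rank != 0:
        return False  # only integer ratios are meaningful here
    return heads_per_rank // kv_heads_per_rank in ALLOWED_QUERIES_PER_KV


for name, heads in [("DeepSeek-V3", 128), ("DeepSeek-V2-Lite", 16)]:
    ok_tp = [tp for tp in (1, 2, 4, 8, 16) if mla_graph_mode_ok(heads, 1, tp)]
    print(name, "allowed TP sizes:", ok_tp)
```

Under these assumptions, DeepSeek-V2-Lite has no TP size that satisfies the constraint, which is consistent with it being unsupported in graph mode.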

### How was this patch tested?
Tested locally, as the CI machine cannot run V3 due to HBM limits.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-10 22:26:53 +08:00
wangxiyuan
b75cb788dd [Bugfix] add compilation/__init__.py to fix import error (#1152)
1. Add `__init__.py` to vllm_ascend/compilation to make sure it's a
Python module
2. Fix a model runner bug to keep it consistent with vLLM
3. Add the release note for 0.9.0rc2

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-10 17:14:25 +08:00
zhangxinyuehfad
e68e81f2ce [CI] Make accuracy CI and report work (#1078)
### What this PR does / why we need it?
Make the accuracy CI and report work

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual review

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-06-10 14:35:44 +08:00
Yikun Jiang
71aee6f97d Update 0.9.0rc1 contributors info (#1148)
### What this PR does / why we need it?
Update 0.9.0rc1 contributors info

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-10 13:29:09 +08:00
wangxiyuan
571f88f85e [Doc] Update 0.9.0rc1 release date (#1139)
1. Update 0.9.0rc1 release date
2. Update feature and model support list
3. Add the DP known issue to the release note

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-09 22:51:02 +08:00
wangxiyuan
5ac4872f5e [Doc] Add 0.9.0rc1 release note (#1106)
Add the release note for v0.9.0rc1

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-09 19:39:21 +08:00
Yuxiao-Xu
6b853f15fe Add static EPLB (#1116)
### What this PR does / why we need it?
Add EPLB expert map import capability
### Does this PR introduce _any_ user-facing change?
To import an EPLB expert map, pass the expert map file through the vLLM
`additional_config` argument.
### How was this patch tested?
1. Collect expert hotness and generate an expert placement file based on
the hotness and the EPLB algorithm, or directly use an existing expert
placement table.
2. When launching vLLM, enable EC2 and pass the configuration via the
command-line argument:
`--additional-config '{"expert_map_path": "/xxx/xxx/xx.json"}'`
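Shell quoting is easy to get wrong in this step: the JSON must reach vLLM as a single argument. A minimal sketch that builds the flag with `json.dumps` and `shlex.quote` from the Python standard library; the path keeps the placeholder from the text and is not a real file.

```python
import json
import shlex

# Placeholder path copied from the text above; replace with a real expert map file.
additional_config = {"expert_map_path": "/xxx/xxx/xx.json"}

# json.dumps yields valid JSON; shlex.quote wraps it so the shell sees one word.
arg = "--additional-config " + shlex.quote(json.dumps(additional_config))
print(arg)
```

Parsing the result back with `shlex.split` yields exactly two tokens, which is a quick way to confirm the quoting before pasting it into a launch command.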
Co-authored-by: songshanhu07 <1763685535@qq.com>

---------

Signed-off-by: songshanhu07 <1763685535@qq.com>
Signed-off-by: Yuxiao-Xu <664988918@qq.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: songshanhu07 <1763685535@qq.com>
Co-authored-by: Xu Yuxiao <xuyuxiao2@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-09 19:28:11 +08:00
Yikun Jiang
e63fc6f280 Init vLLM Ascend maintainers info (#1124)
### What this PR does / why we need it?
As a follow-up to https://github.com/vllm-project/vllm-ascend/pull/1070,
this patch adds the `Nominating and Removing Maintainers` section
(referencing some design from [PyTorch
Governance](https://docs.pytorch.org/docs/stable/community/governance.html)).

Below are key info about existing maintainers:

## @wangxiyuan: 
- Super active code and high quality reviewer [450+ PR
reviewed](https://github.com/vllm-project/vllm-ascend/pulls?q=commenter%3Awangxiyuan).
- One of the top contributors: he has actively contributed [50+
commits](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Aclosed+review%3Aapproved+author%3Awangxiyuan+)
of good quality, and he dares to [refactor the
code](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Awangxiyuan+is%3Aclosed+refactor),
which also shows his deep understanding of vLLM and vLLM Ascend.
- He led the [[RFC]: Hardware
pluggable](https://github.com/vllm-project/vllm/issues/11162) feature,
which made the vllm-ascend project possible.
- Actively involved in the community across WeChat groups, Slack and
GitHub issues. Involved in [150+
issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aopen%20commenter%3Awangxiyuan),
helping users. He was also a speaker at the vLLM Beijing meetup, helping
more users understand vLLM Ascend.
- Release manager of
[v0.7.1rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.1rc1),
[v0.7.3rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc1),
[v0.7.3rc2](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc2),
[v0.8.4rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.8.4rc1),
[v0.7.3.post1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3.post1).

## @Yikun: 
- Highly active code reviewer: [190+ PRs
reviewed](https://github.com/vllm-project/vllm-ascend/pulls?q=commenter%3AYikun),
especially helping new developers with onboarding.
- One of the top contributors with sustained contributions: [50+
commits](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Aclosed+review%3Aapproved+author%3AYikun+)
since the first day of vLLM Ascend.
- High-quality contributions around the vLLM compatibility guarantee; he
also maintains the
[CI](https://github.com/vllm-project/vllm-ascend/pull/1040) and the [test
framework](https://github.com/vllm-project/vllm-ascend/pull/730).
- Actively involved in the community across local groups and GitHub
issues. Involved in [170+
issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aopen%20commenter%3AYikun).
He is also the main organizer of the vLLM Beijing Meetup and a speaker at
[PyTorch Day China
2025](https://pytorchdaychina2025.sched.com/event/2401V/poster-session),
helping vLLM Ascend grow.
- Release manager of
[v0.8.4rc2](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.8.4rc2),
[v0.8.5rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.8.5rc1),
[v0.7.3](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3).

## @ganyi1996ppo 
- Highly active, high-quality code reviewer: [90+ PRs
reviewed](https://github.com/vllm-project/vllm-ascend/pulls?q=commenter%3Aganyi1996ppo).
He has a deep understanding of Ascend operators and can always find key
issues; he deeply understands the codebase, writes good-quality code and
has sound judgement.
- Major and high quality contributions: [10+
commits](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Aclosed+review%3Aapproved+author%3Aganyi1996ppo)
with high quality.
- He is the main contributor of [Custom AscendC op
support](https://github.com/vllm-project/vllm-ascend/pull/371),
[Deepseekv3 performance
optimization](https://github.com/vllm-project/vllm-ascend/pull/598).
- Community Involvement: Involved in [11+ issues, helping
users](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aopen%20commenter%3Aganyi1996ppo),
and shared a [custom ops
topic](https://www.bilibili.com/video/BV1Z25az3EqS/?share_source=copy_web&vd_source=72ef9c665af5f2f1370abe26ce1f719f&t=1342)
at the vLLM Ascend weekly meeting.


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-09 16:32:58 +08:00
sdmyzlp
3640c60b0e Avoid unfused Transpose in DeepSeekV3 EP256 MoE layer (#1091)
### What this PR does / why we need it?

View optimization in torchair (on by default for a Transpose with any of
its axes being 1) prevents the weight Transpose from being fused with the
later GroupedMatmul, which decreases the performance of the MoE layer when
expert parallelism equals the total number of experts (e.g. EP256 for
DSKv3). Add an option to solve this problem by disabling the optimization.

### Does this PR introduce _any_ user-facing change?

Controlled by
`additional_config.torchair_graph_config.enable_view_optimize`,
defaulted to `True`.

### How was this patch tested?

Tested on a 1x16 910 node with a tailored 2-layer DSKv2.

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-07 14:28:20 +08:00
wangxiyuan
0395ab30be [Doc] Add graph mode user doc (#1083)
Add graph mode user guide doc.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-06 21:14:34 +08:00
wangxiyuan
dab19d5dca [BugFix] Fix ascend config check (#1092)
Fix the ascend config check logic:
1. Refactor `check_ascend_config` to make it clearer:
    1. torchair graph should not work with `enforce_eager=True`
    2. aclgraph should not work with torchair graph
2. Add refresh config for the RLHF case
3. Fix a typo in the model runner
4. Change the `expert_tensor_parallel_size` default to 0 to keep the same
behavior as before
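The two incompatibility rules listed above can be sketched as a standalone validation function. The function name, argument names and error messages are hypothetical; this is not the actual `check_ascend_config` implementation, only a sketch of the same two rules.

```python
def check_graph_modes(torchair_graph_enabled: bool,
                      enforce_eager: bool,
                      aclgraph_enabled: bool) -> None:
    """Raise ValueError for mode combinations the rules above forbid (hypothetical)."""
    # Rule 1: torchair graph mode cannot be combined with eager execution.
    if torchair_graph_enabled and enforce_eager:
        raise ValueError("torchair graph mode does not work with enforce_eager=True")
    # Rule 2: aclgraph and torchair graph are mutually exclusive.
    if torchair_graph_enabled and aclgraph_enabled:
        raise ValueError("aclgraph does not work with torchair graph mode")


check_graph_modes(torchair_graph_enabled=True, enforce_eager=False, aclgraph_enabled=False)
try:
    check_graph_modes(torchair_graph_enabled=True, enforce_eager=True, aclgraph_enabled=False)
except ValueError as exc:
    print(f"invalid config: {exc}")
```

Failing fast at config time keeps the error message close to its cause instead of surfacing later as an obscure graph compilation failure.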

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-06 18:54:37 +08:00
depeng1994
6b094a2bd4 [ModelRunner]Add profile execute duration observation (#1013)
### What this PR does / why we need it?
We need to **observe the time consumed in each stage of inference
(including pre-processing, model forward, etc.), without any performance
loss**.
Therefore, we use the event timestamp mechanism of the NPU to mark any
stage during the execution of the NPU device (this marking operation is
executed asynchronously, with no performance loss).
Additionally, we provide a blocking synchronization API
`pop_captured_sync` to be called at an appropriate time, to print the
time consumed in all observed stages.

**Only 5 lines of model_runner_v1.py changed, all of which are
`ProfileExecuteDuration()` calls; nothing else was changed, although the
diff shows more changes due to an alignment issue.**

### Does this PR introduce _any_ user-facing change?
Use the env `VLLM_MODEL_EXECUTE_TIME_OBSERVE` to enable this feature.
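The mechanism described above (record markers without blocking, then synchronize once to read all durations) can be sketched in plain Python. This is a hypothetical CPU stand-in: `time.perf_counter()` replaces NPU event timestamps, and the class and method names only mirror `ProfileExecuteDuration` and `pop_captured_sync`; they are not the vllm-ascend API. The env-var name comes from the text, but treating `"0"` or empty as "off" is an assumption.

```python
import os
import time
from contextlib import contextmanager


class ExecuteDurationSketch:
    """CPU stand-in for event-based stage timing (hypothetical, not the vllm-ascend API)."""

    def __init__(self):
        # (tag, start, end) tuples recorded without any synchronization.
        self._captured = []
        # Assumption: unset, "" or "0" disables observation; anything else enables it.
        flag = os.environ.get("VLLM_MODEL_EXECUTE_TIME_OBSERVE", "0")
        self.enabled = flag not in ("", "0")

    @contextmanager
    def capture(self, tag):
        if not self.enabled:
            yield
            return
        start = time.perf_counter()  # on an NPU this would be a recorded event
        try:
            yield
        finally:
            self._captured.append((tag, start, time.perf_counter()))

    def pop_captured_sync(self):
        # On a device, this is where the host would block until all events complete.
        durations = {tag: (end - start) * 1000.0 for tag, start, end in self._captured}
        self._captured.clear()
        return durations


os.environ["VLLM_MODEL_EXECUTE_TIME_OBSERVE"] = "1"
profiler = ExecuteDurationSketch()
with profiler.capture("prepare input and forward"):
    inputs = list(range(10_000))  # stand-in for input preparation
    with profiler.capture("forward"):
        total = sum(inputs)  # stand-in for the model forward
stages = profiler.pop_captured_sync()
print(" ".join(f"[{tag}]:{ms:.2f}ms" for tag, ms in stages.items()))
```

Deferring all reads to one sync point is what keeps the marking itself cheap: on a device, each marker would only enqueue an event, and the host pays the synchronization cost once.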

### How was this patch tested?

Tested on the DeepSeek model; the output looks like this:
```
(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms
(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms
(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.81ms [prepare input and forward]:10.29ms [forward]:3.99ms
(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.10ms [prepare input and forward]:10.62ms [forward]:4.33ms
(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.65ms [prepare input and forward]:9.58ms [forward]:4.20ms
(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.43ms [prepare input and forward]:9.88ms [forward]:4.20ms
(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.89ms [prepare input and forward]:10.49ms [forward]:4.19ms
(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.14ms [prepare input and forward]:11.21ms [forward]:4.18ms
(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.71ms [prepare input and forward]:10.15ms [forward]:4.42ms
(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.31ms [forward]:4.25ms
(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.12ms [prepare input and forward]:10.33ms [forward]:4.24ms
(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.58ms [prepare input and forward]:10.85ms [forward]:4.32ms
(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.32ms [prepare input and forward]:9.79ms [forward]:4.28ms
(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:15.06ms [prepare input and forward]:9.89ms [forward]:4.32ms
(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.48ms [forward]:4.27ms
(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.60ms [prepare input and forward]:10.71ms [forward]:4.61ms
(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.21ms [prepare input and forward]:10.10ms [forward]:4.52ms
(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:15.03ms [prepare input and forward]:10.00ms [forward]:4.42ms

```
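To aggregate output like the above across workers, each line can be parsed into stage/duration pairs. A minimal sketch with a regular expression; it assumes only the `[stage]:<number>ms` format visible in these sample lines, and the `[Decode]:` label is skipped because no number follows it directly.

```python
import re
from statistics import mean

# Matches "[stage name]:12.34ms" pairs as printed in the sample lines above.
STAGE_RE = re.compile(r"\[([^\]]+)\]:([0-9.]+)ms")


def parse_durations(line: str) -> dict:
    """Return {stage: milliseconds} for one profile line."""
    return {stage: float(ms) for stage, ms in STAGE_RE.findall(line)}


sample = [
    "(IntegratedWorker pid=1502285) Profile execute duration [Decode]: "
    "[post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms",
    "(IntegratedWorker pid=1502285) Profile execute duration [Decode]: "
    "[post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms",
]
parsed = [parse_durations(line) for line in sample]
# Average each stage over the sampled lines.
print({stage: round(mean(p[stage] for p in parsed), 2) for stage in parsed[0]})
```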

---------

Signed-off-by: depeng1994 <depengzhang@foxmail.com>
2025-06-06 09:29:34 +08:00
wangxiyuan
e1ab6d318e [Misc] Refactor additional_config (#1029)
More and more config options are being added to additional_config. This PR
provides a new AscendConfig to manage these config options more easily and
make the code cleaner and more readable.

This PR also adds the `additional_config` doc for users.

Added test_ascend_config.py to make sure the new AscendConfig works
as expected.

TODO: Add e2e test with torchair and deepseek once the CI resource is
available.
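A minimal sketch of the idea: wrap the raw `additional_config` dict in typed objects with defaults, so callers read attributes instead of nested `dict.get` calls. The class layout is hypothetical and is not the vllm-ascend `AscendConfig`; the option names and defaults are taken from entries in this log (`enable_multistream_mla` and `enable_kv_nz` off, `enable_view_optimize` on, `expert_tensor_parallel_size` 0, `expert_map_path`), and placing the last two at the top level is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TorchairGraphConfigSketch:
    # Defaults as described in this commit log.
    enable_multistream_mla: bool = False
    enable_view_optimize: bool = True
    enable_kv_nz: bool = False


@dataclass
class AscendConfigSketch:
    torchair_graph_config: TorchairGraphConfigSketch = field(
        default_factory=TorchairGraphConfigSketch)
    expert_tensor_parallel_size: int = 0
    expert_map_path: Optional[str] = None

    @classmethod
    def from_additional_config(cls, raw: Optional[dict]) -> "AscendConfigSketch":
        raw = dict(raw or {})  # copy so the caller's dict is not mutated
        torchair = TorchairGraphConfigSketch(**raw.pop("torchair_graph_config", {}))
        # Unknown keys raise TypeError in either constructor, surfacing typos early.
        return cls(torchair_graph_config=torchair, **raw)


cfg = AscendConfigSketch.from_additional_config(
    {"torchair_graph_config": {"enable_kv_nz": True}, "expert_map_path": "/xxx/xxx/xx.json"})
print(cfg.torchair_graph_config.enable_kv_nz, cfg.torchair_graph_config.enable_view_optimize)
```

Rejecting unknown keys is a design choice: a misspelled option fails loudly instead of silently falling back to its default.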

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-05 16:28:01 +08:00
Yikun Jiang
fd136e6762 Add vLLM Ascend project governance docs (#1070)
### What this PR does / why we need it?
Add vLLM Ascend project governance and first contributors docs

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

Closes: https://github.com/vllm-project/vllm-ascend/issues/828
Closes: https://github.com/vllm-project/vllm-ascend/issues/929

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-05 11:56:51 +08:00
wangxiyuan
5903547d09 [doc] add 0.7.3.post1 release note (#1008)
Add release note for 0.7.3.post1
Add the missing release note back for 0.7.3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-29 17:38:34 +08:00
22dimensions
c464c32b81 add doc for offline quantization inference (#1009)
Add an example for offline inference with a quantized model

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-05-29 17:32:42 +08:00
yupeng
8ddc0a1002 [DOC] mark v1 multi-lora functional (#932)
### What this PR does / why we need it?
Update feature support for lora

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?
preview

Signed-off-by: paulyu <paulyu0307@gmail.com>
Co-authored-by: paulyu <paulyu0307@gmail.com>
2025-05-22 19:53:14 +08:00
hfadzxy
58b413752b [Doc] Support XLM-RoBERTa-based and MiniCPM3 model (#820)
### What this PR does / why we need it?
Support XLM-RoBERTa-based and MiniCPM3 models

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-05-21 15:44:54 +08:00
22dimensions
d5401a08be [DOC] update modelslim version (#908)
1. Update the modelslim version to fix DeepSeek-related issues
2. Add a note for `--quantization ascend`

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-05-21 09:12:02 +08:00
22dimensions
a8730e7a3c [Doc] update quantization docs with QwQ-32B-W8A8 example (#835)
1. Replace the deepseek-v2-lite model with the more practical QwQ-32B model
2. Fix some incorrect commands
3. Replace the modelslim version with a more formal tag

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-05-17 15:25:17 +08:00
hfadzxy
fd515cd60b [Doc][BugFix]Fix Release Compatibility Matrix (#865)
### What this PR does / why we need it?
Fix Release Compatibility Matrix

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-05-15 15:38:38 +08:00