xc-llm-ascend

Author	SHA1	Message	Date
Shanshan Shen	4e2daf5ab7	[Doc] Add qwen2-audio eager mode tutorial (#1371 ) ### What this PR does / why we need it? Add qwen2-audio eager mode tutorial. Signed-off-by: shen-shanshan <467638484@qq.com>	2025-06-26 16:56:05 +08:00
leo-pony	1025344912	Doc Enhancement: Single NPU(Qwen3-8B) aclgraph mode + eager mode (#1374 ) ### What this PR does / why we need it? Doc Enhancement: Single NPU(Qwen3-8B) aclgraph mode + eager mode. Related RFC: https://github.com/vllm-project/vllm-ascend/issues/1248 ### Does this PR introduce _any_ user-facing change? No changes. ### How was this patch tested? Preview Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-06-26 16:52:54 +08:00
wangxiyuan	205cb85a1e	[Doc] Fix doc typo (#1424 ) 1. Fix the typo 2. Fix 404 url 3. update graph mode and additional config user guide Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-25 19:28:26 +08:00
Li Wang	15df8be937	[Doc] Add sleep mode doc (#1295 ) ### What this PR does / why we need it? Add sleep related doc and example --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-06-25 14:07:14 +08:00
wangxiyuan	e4e0b7af05	[Doc] Add patch doc (#1414 ) 1. Format the developer guide content to make it more clear 2. Add the patch doc for developer guide Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-25 12:00:45 +08:00
Mengqing Cao	c1c5d56255	[Doc] Update FAQ and add test guidance (#1360 ) ### What this PR does / why we need it? - Add test guidance - Add reduce layer guidance - Update FAQ on deterministic calculation --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-25 09:59:23 +08:00
Yikun Jiang	917c6b71af	[TEST][DOC] Fix doctest and add system package installation (#1375 ) ### What this PR does / why we need it? - Fix [doctest](https://github.com/vllm-project/vllm-ascend/actions/workflows/vllm_ascend_doctest.yaml?query=event%3Aschedule) - add system package installation - Add doc for run doctests - Cleanup all extra steps in .github/workflows/vllm_ascend_doctest.yaml - Change schedule job from 4 ---> 12 hours ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - doctest CI passed - Local test with `/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh`. Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-23 20:50:33 +08:00
Icey	08cfc7cb4b	Modify installation.md for adding pip extra index of torch-npu (#1272 ) ### What this PR does / why we need it? Modify installation.md for adding pip extra index of torch-npu ### How was this patch tested? No need --------- Signed-off-by: Icey <1790571317@qq.com>	2025-06-23 15:37:50 +08:00
weiguihua2	e1123172d1	[Doc] Add reinstall instructions doc (#1303 ) Add a new FAQ, if users re-install vllm-ascend with pip, the `build` folder should be removed first --------- Signed-off-by: rjg-lyh <1318825571@qq.com> Signed-off-by: weiguihua <weiguihua2@huawei.com> Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-06-23 14:06:27 +08:00
Pleaplusone	7e6efbf2a9	update torch-npu to 2.5.1.post1.dev20250619 (#1347 ) ### What this PR does / why we need it? This PR updates torch_npu to the newest release version, 2.5.1.post1.dev20250619. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI testing will guarantee the update Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-06-23 09:02:09 +08:00
xleoken	4447e53d7a	[Doc] Change not to no in faqs.md (#1357 ) ### What this PR does / why we need it? Change not to no in faqs.md. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Local Test Signed-off-by: xleoken <xleoken@163.com>	2025-06-23 09:01:00 +08:00
Yikun Jiang	2e5f312530	Cleanup unused doc (#1352 ) ### What this PR does / why we need it? Cleanup unused doc for MoGE model; we will add this back when the MoGE model is ready. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-22 15:05:30 +08:00
Yikun Jiang	c30ddb8331	Bump v0.9.1rc1 release (#1349 ) ### What this PR does / why we need it? Bump v0.9.1rc1 release Closes: https://github.com/vllm-project/vllm-ascend/pull/1341 Closes: https://github.com/vllm-project/vllm-ascend/pull/1334 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed --------- Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: leo-pony <nengjunma@outlook.com> Co-authored-by: shen-shanshan <467638484@qq.com>	2025-06-22 13:15:36 +08:00
wangxiyuan	45be1aac0c	[CI] Add codespell check for doc (#1314 ) Add codespell check test for doc only PR Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-20 16:48:14 +08:00
22dimensions	761bd3d9d7	Add user guide for quantization (#1206 ) ### What this PR does / why we need it? Add user guide for quantization ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Preview Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-06-20 15:53:25 +08:00
songshanhu07	ebb2a70dbb	static EPLB fix bug, add unit test (#1186 ) ### What this PR does / why we need it? 1. Add static EPLB unit test 2. Fix bug: a Tensor cannot be directly judged by `if` statements ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Run the unit test. --------- Signed-off-by: songshanhu07 <1763685535@qq.com>	2025-06-18 19:46:56 +08:00
Yikun Jiang	05dec7eda9	[Doc] Refactor and init user story page (#1224 ) ### What this PR does / why we need it? This PR refactors the user stories page: - Move it to community - Add initial info of LLaMA-Factory, Huggingface/trl, MindIE Turbo, GPUStack, verl - Add a new page for LLaMA-Factory ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Preview locally Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-17 09:36:35 +08:00
Yikun Jiang	9d3cbc0953	[Doctest] add installation doctest (#1179 ) ### What this PR does / why we need it? Install doctest ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Related: https://github.com/vllm-project/vllm-ascend/pull/983 Co-authored-by: wangli <wangli858794774@gmail.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: wangli <wangli858794774@gmail.com>	2025-06-17 08:52:26 +08:00
Mengqing Cao	96fa7ff63b	[DP][V1] Fix rank set in DP scenario & Bump torch-npu version to 2.5.1.post1.dev20250528 (#1235 ) ### What this PR does / why we need it? 1. Fix rank set in DP scenario. The new poc version of torch-npu supports setting `ASCEND_RT_VISIBLE_DEVICES` dynamically, so we can use the rank set in `DPEngineCoreProc` directly instead of calculating the local rank across dp by hand in the patched `_init_data_parallel` Closes: https://github.com/vllm-project/vllm-ascend/issues/1170 2. Bump torch-npu version to 2.5.1.post1.dev20250528 Closes: https://github.com/vllm-project/vllm-ascend/pull/1242 Closes: https://github.com/vllm-project/vllm-ascend/issues/1232 ### How was this patch tested? CI passed with newly added test. --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Icey <1790571317@qq.com> Co-authored-by: Icey <1790571317@qq.com>	2025-06-16 23:09:53 +08:00
22dimensions	0d2074a1ec	[Doc] fix VLLM_USE_V1 value in graph mode docs (#1226 ) os.environ["VLLM_USE_V1"] must be assigned with str, not other type. ![image](https://github.com/user-attachments/assets/9d337ae5-00e5-4179-832e-c6c917dd5798) Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-06-15 15:41:11 +08:00
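The constraint behind #1226 comes from Python itself: `os.environ` accepts only string values, so assigning an integer raises `TypeError`. A minimal sketch follows; the value `"1"` is assumed here purely for illustration, and the graph mode docs remain the reference for which value to use.

```python
import os

# os.environ only accepts str values; assigning an int raises TypeError.
try:
    os.environ["VLLM_USE_V1"] = 1  # wrong: int, not str
    int_accepted = True
except TypeError:
    int_accepted = False

# Correct form: the value is a string ("1" is an illustrative value).
os.environ["VLLM_USE_V1"] = "1"
```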
fems14	ab5d110fcc	vllm-ascend support chunked prefill (#1172 ) ### What this PR does / why we need it? vllm-ascend support chunked prefill for MLA --------- Signed-off-by: fems14 <1804143737@qq.com>	2025-06-14 22:31:16 +08:00
Mengqing Cao	a3b5af8307	[CI/UT][Graph] Add ut for torchair graph mode (#1103 ) ### What this PR does / why we need it? Add ut for torchair graph mode on DeepSeekV3 ### How was this patch tested? CI passed with new added test. --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Mengqing Cao <cmq0113@163.com>	2025-06-14 16:59:00 +08:00
Yikun Jiang	94a52cf577	Add ShouJian Zheng (@jianzs) as vLLM Ascend maintainer (#1203 ) ### What this PR does / why we need it? Add @jianzs as vLLM Ascend maintainer @jianzs ---- I would like to nominate Shoujian Zheng (@jianzs <https://github.com/jianzs>) as a maintainer, starting with my +1. - He focuses on code quality and good design, with solid reviews in the P/D disaggregation and DeepSeek improvement areas: about 30+ high quality reviews, such as #issuecomment-2811764833, #discussion_r2069927605 and #pullrequestreview-2820996674. This is the most important reason why I nominated him, because helping community developers complete PRs with high quality and continuously ensuring the quality of the codebase is one of the important responsibilities of a maintainer. We believe he is a great addition. - Shoujian's main expertise is distributed inference. He has a lot of production experience in AI infra. He has very good habits, explains all changes in great detail (#issue-3023082580) and shares results openly: #issuecomment-2853140443. High quality PRs: #706, #774, #852. - Community Involvement: Actively involved in community discussion, he is collaborative and helps users solve problems, involved in 30+ PRs and issues, such as #issuecomment-2911934292 and #issuecomment-2833523571. Reference: [1] https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html [2] https://vllm-ascend.readthedocs.io/en/latest/community/governance.html Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-13 18:25:50 +08:00
sdmyzlp	e72f94e38f	Support multistream of MLA vector operations (#1135 ) ### What this PR does / why we need it? Move all vector operations to a secondary stream, with the expected overlapping being: ``` \| q_rmsnorm \| \| kv_norm_rope_cache \| \| q_rope \| \| matmul W_DQ \| matmul W_DKV \| index \| index \| matmul W_UQ \| split \| matmul W_KV_T \| ``` Currently, the `IndexByTensor` operators introduced by the computation of `cos` and `sin` can't be offloaded to the secondary stream due to a known bug in the graph fusion optimization pass. So we instead keep them in the main stream, only requiring that they be computed before `matmul W_UQ` to avoid hindering later overlapping. The problem may be solved by a later optimization (#993), which hoists the computation of `cos` and `sin` up to the first layer. ### Does this PR introduce _any_ user-facing change? Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted to False. ### How was this patch tested? Tested on 1x16 910 node, with tailored 2 layer DSKv2. Signed-off-by: sdmyzlp <lrwei2@petalmail.com>	2025-06-12 21:42:09 +08:00
Wan_Danfeng	55c0e68883	[Doc] Add Referer header for CANN package download url. (#1192 ) ### What this PR does / why we need it? fix the CANN download url ### Does this PR introduce _any_ user-facing change? no, do not have any user-facing change ### How was this patch tested? run the wget command and cann package is rightly downloaded. --------- Signed-off-by: wan_danfeng <wonderful199082@126.com>	2025-06-12 21:22:23 +08:00
chenwaner	e46dc142bf	Enable kvcache_nz for the decode process in torchair graph mode (#1098 ) What this PR does / why we need it? Enable kvcache_nz for the decode process in torchair graph mode, which reduces the time consumed by FA in long sequences. Does this PR introduce any user-facing change? If need to enable kvcache_nz, should set the additional_config.torchair_graph_config.enable_kv_nz=True How was this patch tested? 1. Tested in deepseek model: with batchsize 64 and seq_len 1k+3k, 61 layers FA total time improves 20.80ms -> 19.76ms 2. operator precision test: [aclnnFusedInferAttentionScoreV3_result.csv](https://github.com/user-attachments/files/20664138/aclnnFusedInferAttentionScoreV3_result.csv) 3. tpot test from @ttanzhiqiang, and curl one result is normal https://github.com/vllm-project/vllm-ascend/pull/1098#issuecomment-2948542159 https://github.com/vllm-project/vllm-ascend/pull/1098#issuecomment-2954496588 --------- Signed-off-by: chenwaner <861645847@qq.com>	2025-06-11 14:09:28 +08:00
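Dotted option names such as `additional_config.torchair_graph_config.enable_kv_nz` describe a nested mapping. As a rough sketch, the helper below (`nested_config` is a hypothetical name, not part of vllm-ascend) expands the part after `additional_config.` into the nested dictionary and prints it as JSON, the same shape of value that the #1116 entry passes on the command line with `--additional-config`.

```python
import json


def nested_config(dotted_key: str, value) -> dict:
    """Expand 'a.b.c' into {'a': {'b': {'c': value}}} (hypothetical helper)."""
    config = value
    for part in reversed(dotted_key.split(".")):
        config = {part: config}
    return config


# The leading "additional_config." is the name of the argument itself,
# so only the remainder of the dotted path is expanded.
additional_config = nested_config("torchair_graph_config.enable_kv_nz", True)
print(json.dumps(additional_config))
```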
yz	4153a5091b	[Doc] Fix the config parameter name "enable" in graph_mode.md. (#1159 ) Fix the doc typo in graph_mode.md Signed-off-by: yzim <43207690+yzim@users.noreply.github.com>	2025-06-11 11:03:37 +08:00
depeng1994	860a5ef7fd	provide an e2e guide for execute duration profiling (#1113 ) ### What this PR does / why we need it? provide an e2e guide for execute duration profiling Signed-off-by: depeng1994 <depengzhang@foxmail.com>	2025-06-11 10:02:11 +08:00
sdmyzlp	7bdc606677	Support multistream of shared experts in FusedMoE (#997 ) Contains on #1111 for completeness. ### What this PR does / why we need it? Implement multi-stream parallelism for MoE layers with shared experts, where computation of shared experts will be overlapped with expert token dispatch and combine. Also, when multi-stream is enabled, weights of shared experts will be forced to replicate across all cards, regardless of any tensor parallelism configurations, to avoid AllReduce operations. With the expected overlapping being: ``` \| shared gate_up \| shared act \| \| shared down \| \| dispatch \| routed gate_up, act, down \| combine \| ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested on 1x16 910 node, with tailored 2 layer DSKv2. --------- Signed-off-by: sdmyzlp <lrwei2@petalmail.com>	2025-06-11 09:18:38 +08:00
Mengqing Cao	8dd686dfa2	[MLA][Graph] Improve assertion on Graph mode with MLA (#933 ) ### What this PR does / why we need it? Improve assertion on Graph mode with MLA. When running deepseek with graph mode, the fused MLA op only supports `numHeads / numKvHeads ∈ {32, 64, 128}`, so we improve the assertion message here to avoid confusing users. ### Does this PR introduce _any_ user-facing change? Adjusting tp size is required when running deepseek-v3/r1 with graph mode. deepseek-v2-lite is not supported in graph mode. ### How was this patch tested? Tested locally, as the CI machine could not run V3 due to the HBM limits. --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-06-10 22:26:53 +08:00
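The constraint quoted in #933 can be expressed as a small predicate. This is only a sketch: the function name is hypothetical and it is not the assertion used in vllm-ascend. The head counts in the usage lines are illustrative values, assuming a model whose 128 attention heads are split evenly across tensor-parallel ranks with one KV head per rank.

```python
SUPPORTED_HEAD_RATIOS = {32, 64, 128}  # ratios quoted in the commit message


def mla_graph_ratio_ok(num_heads: int, num_kv_heads: int) -> bool:
    """Check numHeads / numKvHeads against the supported set (hypothetical helper)."""
    if num_kv_heads <= 0 or num_heads % num_kv_heads != 0:
        return False
    return num_heads // num_kv_heads in SUPPORTED_HEAD_RATIOS


# Illustrative only: 128 heads split across tp ranks, one KV head per rank.
for tp in (1, 2, 4, 8):
    print(tp, mla_graph_ratio_ok(128 // tp, 1))
```

Under those assumptions, a tensor-parallel size that leaves fewer than 32 heads per rank fails the check, which is consistent with the note that the tp size may need adjusting.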
wangxiyuan	b75cb788dd	[Bugfix] add compilation/__init__.py to fix import error (#1152 ) 1. Add `__init__.py` for vllm_ascend/compilation to make sure it's a python module 2. Fix model runner bug to keep the same with vllm 3. Add release note for 0.9.0rc2 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-10 17:14:25 +08:00
zhangxinyuehfad	e68e81f2ce	[CI] Make accuracy CI and report work (#1078 ) ### What this PR does / why we need it? Make accuracy CI and report work ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually reviewed Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-06-10 14:35:44 +08:00
Yikun Jiang	71aee6f97d	Update 0.9.0rc1 contributors info (#1148 ) ### What this PR does / why we need it? Update 0.9.0rc1 contributors info ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-10 13:29:09 +08:00
wangxiyuan	571f88f85e	[Doc] Update 0.9.0rc1 release date (#1139 ) 1. Update 0.9.0rc1 release date 2. Update feature and model support list 3. Add DP known issue to release note Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-09 22:51:02 +08:00
wangxiyuan	5ac4872f5e	[Doc] Add 0.9.0rc1 release note (#1106 ) Add the release note for v0.9.0rc1 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-09 19:39:21 +08:00
Yuxiao-Xu	6b853f15fe	Add static EPLB (#1116 ) ### What this PR does / why we need it? Add EPLB expert map import capabilities ### Does this PR introduce _any_ user-facing change? To import an EPLB expert map, pass the expert map file through the vllm argument additional_config ### How was this patch tested? 1. You need to collect expert hotness and generate an expert placement file based on the hotness and the EPLB algorithm, or you can directly use an existing expert placement table. 2. When launching vLLM, enable EC2 and pass the configuration via the command-line argument: --additional-config '{"expert_map_path": "/xxx/xxx/xx.json"}' --------- Signed-off-by: songshanhu07 <1763685535@qq.com> Signed-off-by: Yuxiao-Xu <664988918@qq.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: songshanhu07 <1763685535@qq.com> Co-authored-by: Xu Yuxiao <xuyuxiao2@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-09 19:28:11 +08:00
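Hand-writing the JSON for `--additional-config` makes it easy to drop a quote. One way to avoid that, sketched below, is to build the value with `json.dumps` and shell-quote it with `shlex.quote`; the helper name `additional_config_arg` is hypothetical, and the path is the placeholder from the commit message.

```python
import json
import shlex


def additional_config_arg(expert_map_path: str) -> str:
    """Build a shell-safe --additional-config argument (hypothetical helper)."""
    payload = json.dumps({"expert_map_path": expert_map_path})
    # shlex.quote wraps the JSON in single quotes so the shell passes it intact.
    return "--additional-config " + shlex.quote(payload)


print(additional_config_arg("/xxx/xxx/xx.json"))
```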
Yikun Jiang	e63fc6f280	Init vLLM Ascend maintainers info (#1124 ) ### What this PR does / why we need it? As a follow-up to https://github.com/vllm-project/vllm-ascend/pull/1070, this patch adds a `Nominating and Removing Maintainers` section (referencing some design from [PyTorch Governance](https://docs.pytorch.org/docs/stable/community/governance.html)) Below is key info about existing maintainers: ## @wangxiyuan: - Super active code and high quality reviewer [450+ PR reviewed](https://github.com/vllm-project/vllm-ascend/pulls?q=commenter%3Awangxiyuan). - One of the top contributors, he also actively contributes [50+ commits ](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Aclosed+review%3Aapproved+author%3Awangxiyuan+) with good quality, he dares to [refactor the code](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Awangxiyuan+is%3Aclosed+refactor), which also shows his deep understanding of vllm and vllm ascend. - He leads the [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162) feature, which made the vllm-ascend project possible. - Active community involvement across wechat group, slack, github issues. Involved in [150+ issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aopen%20commenter%3Awangxiyuan) and helps users. He is also the speaker of the vLLM Beijing meetup, helping more users understand vLLM Ascend. - Release manager of [v0.7.1rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.1rc1), [v0.7.3rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc1), [v0.7.3rc2](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc2), [v0.8.4rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.8.4rc1), [v0.7.3.post1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3.post1).
## @Yikun: - Highly active code reviewer: [190+ PR reviewed](https://github.com/vllm-project/vllm-ascend/pulls?q=commenter%3AYikun), especially for new developers to help them onboard. - One of the top contributors with sustained contributions: [50+ commits](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Aclosed+review%3Aapproved+author%3AYikun+) since the first day of vLLM Ascend. - High quality contributions around the vLLM compatibility guarantee; also maintains [CI ](https://github.com/vllm-project/vllm-ascend/pull/1040) and [test Framework](https://github.com/vllm-project/vllm-ascend/pull/730). - Active community involvement across local group, github issues. Involved in [170+ issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aopen%20commenter%3AYikun). He is also the main organizer of the vLLM Beijing Meetup and a speaker at [PyTorch Day China 2025](https://pytorchdaychina2025.sched.com/event/2401V/poster-session) to help vLLM Ascend grow. - Release manager of [v0.8.4rc2](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.8.4rc2), [v0.8.5rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.8.5rc1), [v0.7.3](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3). ## @ganyi1996ppo - Highly active code and high quality reviewer: [90+ PR reviewed](https://github.com/vllm-project/vllm-ascend/pulls?q=commenter%3Aganyi1996ppo), he has a deep understanding of Ascend operators and can always find key issues, with a deep understanding of the codebase, good code quality and qualified judgement. - Major and high quality contributions: [10+ commits](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Aclosed+review%3Aapproved+author%3Aganyi1996ppo) with high quality. - He is the main contributor of [Custom AscendC op support](https://github.com/vllm-project/vllm-ascend/pull/371), [Deepseekv3 performance optimization](https://github.com/vllm-project/vllm-ascend/pull/598).
- Community Involvement: Involved in [11+ issues and helps users](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aopen%20commenter%3Aganyi1996ppo), shared the [custom ops topic](https://www.bilibili.com/video/BV1Z25az3EqS/?share_source=copy_web&vd_source=72ef9c665af5f2f1370abe26ce1f719f&t=1342) at the vLLM Ascend Weekly meeting. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Preview Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-09 16:32:58 +08:00
sdmyzlp	3640c60b0e	Avoid unfused Transpose in DeepSeekV3 EP256 MoE layer (#1091 ) ### What this PR does / why we need it? View optimization in torchair (defaulted to on for Transpose with any of its axes being 1) prevents the weight Transpose from being fused with the later GroupedMatmul, which decreases the performance of the MoE layer when expert parallelism equals the total number of experts (e.g. EP256 for DSKv3). Add an option to solve this problem by disabling the optimization. ### Does this PR introduce _any_ user-facing change? Controlled by `additional_config.torchair_graph_config.enable_view_optimize`, defaulted to `True`. ### How was this patch tested? Tested on 1x16 910 node, with tailored 2 layer DSKv2. Signed-off-by: sdmyzlp <lrwei2@petalmail.com>	2025-06-07 14:28:20 +08:00
wangxiyuan	0395ab30be	[Doc] Add graph mode user doc (#1083 ) Add graph mode user guide doc. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-06 21:14:34 +08:00
wangxiyuan	dab19d5dca	[BugFix] Fix ascend config check (#1092 ) Fix the ascend config check logic: 1. refactor check_ascend_config to make it clear: 1. torchair graph should not work with enforce_eager=True 2. aclgraph should not work with torchair graph 3. add refresh config for rlhf case 4. fix a typo in model runner 5. change expert_tensor_parallel_size default to 0 to keep the same as before Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-06 18:54:37 +08:00
depeng1994	6b094a2bd4	[ModelRunner]Add profile execute duration observation (#1013 ) ### What this PR does / why we need it? We need to observe the time consumed in each stage of inference (including pre-processing, model forward, etc.), without any performance loss. Therefore, we use the event timestamp mechanism of the NPU to mark any stage during the execution of the NPU device (this marking operation is executed asynchronously, with no performance loss). Additionally, we provide a blocking synchronization API `pop_captured_sync` to be called at an appropriate time, to print the time consumed in all observed stages. Only 5 lines of model_runner_v1.py changed, all of which were `ProfileExecuteDuration()` calls, and nothing else was changed, although the diff shows more changes because of alignment. ### Does this PR introduce _any_ user-facing change? Use env `VLLM_MODEL_EXECUTE_TIME_OBSERVE` to enable this feature ### How was this patch tested? Tested on a deepseek model, which prints output like this: ``` 5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms 5695:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms 5697:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.81ms [prepare input and forward]:10.29ms [forward]:3.99ms 5701:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.10ms [prepare input and forward]:10.62ms [forward]:4.33ms 5705:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.65ms [prepare input and forward]:9.58ms [forward]:4.20ms 5709:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.43ms [prepare input and forward]:9.88ms [forward]:4.20ms 5711:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.89ms [prepare input and forward]:10.49ms [forward]:4.19ms 5715:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.14ms [prepare input and forward]:11.21ms [forward]:4.18ms 5719:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.71ms [prepare input and forward]:10.15ms [forward]:4.42ms 5723:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.31ms [forward]:4.25ms 5725:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.12ms [prepare input and forward]:10.33ms [forward]:4.24ms 5729:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.58ms [prepare input and forward]:10.85ms [forward]:4.32ms 5733:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.32ms [prepare input and forward]:9.79ms [forward]:4.28ms 5737:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:15.06ms [prepare input and forward]:9.89ms [forward]:4.32ms 5739:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.48ms [forward]:4.27ms 5743:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.60ms [prepare input and forward]:10.71ms [forward]:4.61ms 5747:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.21ms [prepare input and forward]:10.10ms [forward]:4.52ms 5751:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:15.03ms [prepare input and forward]:10.00ms [forward]:4.42ms ``` --------- Signed-off-by: depeng1994 <depengzhang@foxmail.com>	2025-06-06 09:29:34 +08:00
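The `Profile execute duration` lines printed in #1013 follow a regular `[stage]:<time>ms` pattern, so they are easy to post-process. The parser below is a sketch written against the sample output quoted in that commit message only; `parse_profile_line` is a hypothetical helper and not part of vllm-ascend.

```python
import re

# Matches "[stage name]:12.34ms"; "[Decode]:" is skipped because no number follows it.
STAGE_RE = re.compile(r"\[([^\]]+)\]:(\d+(?:\.\d+)?)ms")


def parse_profile_line(line: str) -> dict:
    """Return {stage: milliseconds} for one 'Profile execute duration' line."""
    _, _, tail = line.partition("Profile execute duration")
    return {name: float(ms) for name, ms in STAGE_RE.findall(tail)}


sample = (
    "5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: "
    "[post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms"
)
stages = parse_profile_line(sample)
print(stages)
```

Collecting these dictionaries over many lines makes it straightforward to average each stage per worker.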
wangxiyuan	e1ab6d318e	[Misc] Refactor additional_config (#1029 ) More and more config options are being added to additional_config. This PR provides a new AscendConfig to manage these config options in an easier way, making the code cleaner and more readable. This PR also adds the `additional_config` doc for users. Added test_ascend_config.py to make sure the new AscendConfig works as expected. TODO: Add e2e test with torchair and deepseek once the CI resource is available. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-05 16:28:01 +08:00
Yikun Jiang	fd136e6762	Add vLLM Ascend project governance docs (#1070 ) ### What this PR does / why we need it? Add vLLM Ascend project governance and first contributors docs ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Preview Closes: https://github.com/vllm-project/vllm-ascend/issues/828 Closes: https://github.com/vllm-project/vllm-ascend/issues/929 Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-05 11:56:51 +08:00
wangxiyuan	5903547d09	[doc] add 0.7.3.post1 release note (#1008 ) Add release note for 0.7.3.post1 Add the missing release note back for 0.7.3 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-05-29 17:38:34 +08:00
22dimensions	c464c32b81	add doc for offline quantization inference (#1009 ) add example for offline inference with quantized model Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-05-29 17:32:42 +08:00
yupeng	8ddc0a1002	[DOC] mark v1 multi-lora functional (#932 ) ### What this PR does / why we need it? Update feature support for lora ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? preview Signed-off-by: paulyu <paulyu0307@gmail.com> Co-authored-by: paulyu <paulyu0307@gmail.com>	2025-05-22 19:53:14 +08:00
hfadzxy	58b413752b	[Doc] Support XLM-RoBERTa-based and MiniCPM3 model (#820 ) ### What this PR does / why we need it? support XLM-RoBERTa-based and MiniCPM3 model --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-05-21 15:44:54 +08:00
22dimensions	d5401a08be	[DOC] update modelslim version (#908 ) 1. update modelslim version to fix deepseek related issues 2. add note for "--quantization ascend" Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-05-21 09:12:02 +08:00
22dimensions	a8730e7a3c	[Doc] update quantization docs with QwQ-32B-W8A8 example (#835 ) 1. replace deepseek-v2-lite model with more practical model QwQ 32B 2. fix some incorrect commands 3. replace modelslim version with a more formal tag Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-05-17 15:25:17 +08:00
hfadzxy	fd515cd60b	[Doc][BugFix]Fix Release Compatibility Matrix (#865 ) ### What this PR does / why we need it? Fix Release Compatibility Matrix Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-05-15 15:38:38 +08:00


138 Commits