xc-llm-ascend

Author	SHA1	Message	Date
offline893	76844eec78	Dynamic Expert Load Balance with Zero-like-overhead (#2956 ) ### Motivation Currently, dynamic expert load balancing stops the world. Asynchronous expert load balancing would be better, because it avoids the following problems: Host-bound latency: EPLB involves many CPU operations, such as the EPLB algorithm, creating p2p ops, and log2phy expert conversion, which take a long CPU time (~1s). Communication latency: The transfer is expensive without nvlink. Because the weights of one expert may be transferred to multiple new positions, one expert can require N send/recv operations, resulting in long latency. We measured that batch_isend_irecv costs more than 100ms to transmit the weights of 16 experts on an Ascend A2 server. SwiftBalancer no longer stops the world: in our test on NPU it costs 1~2ms per layer while improving decode latency by 5ms-8ms with ep_size = 64. The following updates have been made: 1. Expert distribution recording with lower cost. 2. Async CPU computing for the EPLB algorithm and other Python operators. 3. A new EPLB algorithm with less expert rebalancing and almost the same effect. ### Proposed Change We will gradually migrate the EPLB logic to the vLLM community and implement a generalized design. Relevant RFC: https://github.com/vllm-project/vllm/issues/22246 The overall workflow involves: <img width="801" height="302" alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c" src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed" /> 1. Record the expert distribution during forward. We use expert_token_num after dispatch instead of topk_ids, which gives a much smaller tensor shape and reduces the cost of HBM recording and the add operator. 2. Do an all-gather for the expert distribution. All-gather is used instead of all-reduce because it has less traffic volume. 3. Wake up the EPLB worker process with the expert distribution when num_iterations is reached. Run the EPLB algorithm in the EPLB worker. 4. Generate p2p send/recv ops and run other operators such as log2phy, which take a long CPU time. 5. Launch ibatch_send_recv in async_stream before forward. 6. After forward, wait for ibatch_send_recv to finish, then update the expert map and expert weights. ### Co-author Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn Co-authored-by: qmkakaxi wjh1594260677@qq.com Co-authored-by: Skywalker-EP 173723846@qq.com - vLLM version: v0.10.2 - vLLM main: `567939953b` --------- Signed-off-by: offline0806 <z00858301@china.huawei.com> Co-authored-by: offline0806 <z00858301@china.huawei.com>	2025-09-17 10:36:43 +08:00
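Step 1 of the workflow in the commit above records per-expert token counts (`expert_token_num`) rather than the raw `topk_ids`. The following is a minimal plain-Python sketch of why the recorded data shrinks: a `[num_tokens, top_k]` routing table collapses into one counter per expert. The helper name `count_tokens_per_expert` and the sample routing table are hypothetical, chosen for illustration; this is not the code from the PR.

```python
from collections import Counter

def count_tokens_per_expert(topk_ids, num_experts):
    """Collapse a [num_tokens, top_k] routing table into one count per expert.

    Hypothetical helper for illustration only; the PR obtains these counts
    from the dispatch step instead of recomputing them from topk_ids.
    """
    counts = Counter(expert for row in topk_ids for expert in row)
    return [counts.get(expert, 0) for expert in range(num_experts)]

# Assumed routing result: 4 tokens, top_k = 2, 4 experts.
topk_ids = [[0, 1], [1, 2], [1, 3], [0, 1]]
expert_token_num = count_tokens_per_expert(topk_ids, num_experts=4)

# 8 routing entries shrink to 4 counters; the size no longer grows with batch size.
print(expert_token_num)  # [2, 4, 1, 1]
```

The histogram has a fixed length equal to the number of experts, so accumulating it across iterations is a small element-wise add regardless of how many tokens were processed.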
wangxiyuan	048bfd5553	[Release] Add release note for v0.10.2rc1 (#2921 ) Add release note for v0.10.2rc1 - vLLM version: v0.10.2 - vLLM main: `b834b4cbf1` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-16 01:20:05 +08:00
CaranLic	168ad600b5	[main] add pd transfer for ascend scheduler (#2753 ) ### What this PR does / why we need it? For offline scenarios, adjust the scheduling process to prioritize the prefill phase of all requests, then process the decode phase of all requests. ### How was this patch tested? ``` max_num_seqs=24, additional_config={ "ascend_scheduler_config":{ "enabled": True, "enable_pd_transfer": True, "decode_max_num_seqs": 24, "enable_chunked_prefill": False } }, ``` \| input \| output \| num prompts \| max_num_seqs \| dp \| tp \| scheduler \| tps \| \| ------ \| ------ \| ---------- \| ---------------- \| ---- \| ---- \| ---------------- \| --------------- \| \| dapo-math-17K \| 2K \| 384 \| 24 \| 2 \| 1 \| v1 \| 234.06 \| \| dapo-math-17K \| 2K \| 384 \| 24 \| 2 \| 1 \| pd transfer \| 239.59(+2.4%) \| \| dapo-math-17K\| 2K \| 384 \| 24 \| 4 \| 1 \| v1 \| 222.85 \| \| dapo-math-17K\| 2K \| 384 \| 24 \| 4 \| 1 \| pd transfer \| 225.81(+1.3%) \| - vLLM version: v0.10.1.1 - vLLM main: `6fb2788163` --------- Signed-off-by: CaranLic <740821011@qq.com>	2025-09-10 08:46:39 +08:00
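The test configuration in the commit above is a nested `additional_config` dictionary. As a small sketch, the same values can be built in Python and serialized to JSON; passing the result on the command line as `--additional_config='...'` follows the pattern shown in other commits of this log and is assumed, not confirmed, to apply to this option as well.

```python
import json

# Values copied from the test configuration in the commit message.
additional_config = {
    "ascend_scheduler_config": {
        "enabled": True,
        "enable_pd_transfer": True,
        "decode_max_num_seqs": 24,
        "enable_chunked_prefill": False,
    }
}

# Serialize to JSON: Python True/False become JSON true/false.
cli_value = json.dumps(additional_config)
print(f"--additional_config='{cli_value}'")
```

Serializing with `json.dumps` avoids hand-writing the JSON and mixing up Python's `True`/`False` with JSON's `true`/`false`.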
lidenghui1110	5a7181569c	[feat]: oproj tensor parallelism in pure DP and graph-mode scenarios. (#2167 ) ### What this PR does / why we need it? This PR introduces Oproj matrix tensor model parallelism to reduce memory consumption. It only supports graph mode in the pure DP scenario. In a deepseek r1 w8a8 PD disaggregated Decode instance, using pure DP, with oproj_tensor_parallel_size = 8, TPOT increases by 1 ms while 5.8 GB of NPU memory is saved per rank. We got the best performance with oproj_tensor_parallel_size=4, with no TPOT increase. performance data: <img width="1442" height="442" alt="image" src="https://github.com/user-attachments/assets/83270fc5-868a-4387-b0a9-fac29b4a376d" /> ### Does this PR introduce _any_ user-facing change? This PR introduces one new config in `additional_config`. \| Name \| Effect \| Required \| Type \| Constraints \| \| :---------------------------- \| :--------------------------------------- \| :------- \| :--- \| :----------------- \| \| oproj_tensor_parallel_size \| Split the o_proj matrix along the row dimension (head num * head dim) into oproj_tensor_parallel_size pieces. \| No \| int \| default value is None, once this value is set, the feature will be enabled, head num * head dim must be divisible by this value. \| example `--additional_config={"oproj_tensor_parallel_size": 8}` ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `eddaafc1c7` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: zzh <zzh_201018@outlook.com>	2025-09-07 10:31:32 +08:00
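The constraint in the table above (head num * head dim must be divisible by `oproj_tensor_parallel_size`, with `None` leaving the feature disabled) can be sketched as a small check. The function name `rows_per_rank` and the example dimensions (128 heads of size 128) are assumptions for illustration and are not taken from the PR.

```python
def rows_per_rank(head_num, head_dim, tp_size):
    """Return the o_proj row count each rank would hold, or None if disabled.

    Hypothetical helper: mirrors the documented constraint only, not the
    actual validation code in vllm-ascend.
    """
    if tp_size is None:
        return None  # default: feature stays disabled
    total_rows = head_num * head_dim
    if tp_size <= 0 or total_rows % tp_size != 0:
        raise ValueError(
            f"head num * head dim ({total_rows}) must be divisible by "
            f"oproj_tensor_parallel_size ({tp_size})"
        )
    return total_rows // tp_size

# Assumed example dimensions: 128 heads x 128 head dim = 16384 rows.
print(rows_per_rank(128, 128, 8))     # 2048
print(rows_per_rank(128, 128, None))  # None
```

The same shape of check would apply to `lmhead_tensor_parallel_size` from commit #2309, with `vocab_size` in place of head num * head dim.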
Mengqing Cao	7e16b4a7cd	[ReleaseNote] Add Release Note for v0.10.1rc1 (#2635 ) Add Release Note for v0.10.1rc1 - vLLM version: v0.10.1.1 - vLLM main: `b5ee1e3261` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-04 11:26:47 +08:00
wangxiyuan	41b028aa5f	[Doc] add v0.9.1 release note (#2646 ) Add release note for 0.9.1 - vLLM version: v0.10.1.1 - vLLM main: `8bd5844989` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-03 18:04:27 +08:00
panchao-hub	ea53f9076e	support torchair mode (#2641 ) ### What this PR does / why we need it? support torchair mode ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `5438967fbc` Signed-off-by: zhangdepeng <zhangdepeng2@huawei.com> Signed-off-by: p00465316 <panchao13@huawei.com> Co-authored-by: zhangdepeng <zhangdepeng2@huawei.com>	2025-09-01 15:49:07 +08:00
lidenghui1110	600b08f754	[Feat]: Add custom lmhead tensor model parallel (#2309 ) ### What this PR does / why we need it? This PR introduces LMhead tensor model parallelism to reduce memory consumption and improve TPOT performance. It supports both eager mode and graph mode. In a deepseek r1 w8a8 PD disaggregated Decode instance, using pure DP, with lmhead_tensor_parallel_size = 8, TPOT improves by 1 ms and 1.48 GB of NPU memory is saved per rank. performance data: <img width="1444" height="438" alt="image" src="https://github.com/user-attachments/assets/3c5ef0d3-a7c7-46fd-9797-4de728eb0cb0" /> ### Does this PR introduce _any_ user-facing change? This PR introduces one new config in `additional_config`. \| Name \| Effect \| Required \| Type \| Constraints \| \| :---------------------------- \| :--------------------------------------- \| :------- \| :--- \| :----------------- \| \| lmhead_tensor_parallel_size \| Split the lm_head matrix along the column dimension (vocab_size) into lmhead_tensor_parallel_size pieces \| No \| int \| default value is None, once this value is set, the feature will be enabled, vocab_size must be divisible by this value. \| example `--additional_config={"lmhead_tensor_parallel_size": 8}` ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `de533ab2a1` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: zhangzihang <zzh_201018@outlook.com>	2025-08-29 11:41:21 +08:00
LeeWenquan	c8d1df3a3f	[Refactor][WIP] Refactor mla_v1 by moving all MLA preprocessing ops into mla_v1 attention impl (#2465 ) ### What this PR does / why we need it? To support fused kernels, multi-stream, communication optimization, etc., it is better to aggregate all operations in the Attention layer together. This PR tries to refactor mla_v1 by moving all MLA preprocessing ops into the mla_v1 attention impl. Note that the new mla_v1 doesn't take torchair into consideration, so this PR can only be merged after the torchair-related mla_v1 is isolated into a new file. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? ### Features Test <img width="506" height="141" alt="image" src="https://github.com/user-attachments/assets/f1ab2906-a1ac-4450-8433-94811cd89466" /> ### Performance After Refactor <img width="648" height="486" alt="image" src="https://github.com/user-attachments/assets/e33e038c-c5d9-4ba7-a8e9-1ac22f9833eb" /> ### Performance Before Refactor <img width="618" height="494" alt="image" src="https://github.com/user-attachments/assets/83861dc2-dc51-4af3-9310-90ab10c43bb1" /> - vLLM version: v0.10.1.1 - vLLM main: `e03940762b` --------- Signed-off-by: lwq <liwenquan5@huawei.com> Signed-off-by: whx-sjtu <2952154980@qq.com> Signed-off-by: SunnyLee219 <3294305115@qq.com> Co-authored-by: lwq <liwenquan5@huawei.com> Co-authored-by: whx-sjtu <2952154980@qq.com>	2025-08-28 10:35:57 +08:00
Li Wang	042605f4b2	[Doc] Add stable modelslim branch (#2545 ) ### What this PR does / why we need it? The branch `br_release_MindStudio_8.1.RC2_TR5_20260624` is commercial delivery version of modelslim in Q3, and has been verified available ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `7d67a9d9f9` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-27 09:05:46 +08:00
Shanshan Shen	f0be3eed84	[Doc] Add release note for `v0.9.1rc3` (#2488 ) ### What this PR does / why we need it? Add release note for `v0.9.1rc3`. - vLLM version: v0.10.0 - vLLM main: `53415653ff` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-08-22 16:06:29 +08:00
yupeng	973a7cfdf0	[DOC] update doc: LoRA with ACLGraph (#2430 ) ### What this PR does / why we need it? Update DOC. Guide users to run LoRA with ACLGraph. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. - vLLM version: v0.10.0 - vLLM main: `de7b67a023` --------- Signed-off-by: paulyu12 <507435917@qq.com>	2025-08-21 08:55:55 +08:00
Wang Kunpeng	1de16ead8e	[main][bugfix] Modify the default value of enable_shared_expert_dp to false (#2457 ) ### What this PR does / why we need it? enable_shared_expert_dp is currently on by default. This optimization is currently only valid for deepseek series models, and enabling it by default affects the accuracy of the qwen series models. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? use parameter --additional_config='{"enable_shared_expert_dp": true}' - vLLM version: v0.10.0 - vLLM main: `d983769c41` Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-08-20 20:25:53 +08:00
Li Wang	2ad7e1251e	[Doc] Fix quant documentation to make it reproducible (#2277 ) ### What this PR does / why we need it? Fixed the expression of msit for code clone - vLLM version: v0.10.0 - vLLM main: `afa5b7ca0b` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-14 17:19:47 +08:00
Wang Kunpeng	dc585f148a	[main][prefill optimization] Optimize parallel strategies to reduce communication overhead (#2198 ) ### What this PR does / why we need it? 1.Shared Expert Sharding Strategy Update: Switched from TP-aligned to pure DP for shared experts, enabling more efficient execution. 2.O_Proj AllReduce → ReduceScatter: Reduced communication overhead by using ReduceScatter, made possible by pure DP sharding. 3.AllGather Postponed: Delayed to after QKV down projection to reduce synchronization impact during prefill. ### How was this patch tested? Adding ut case in `tests/ut/attention/test_mla_v1.py` #### How to run use parameter `--additional_config='{"enable_shared_expert_dp": true}'` ##### a.How to run eager mode eg: python -m vllm.entrypoints.openai.api_server --model=/model_path --trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002 --max-model-len 5120 --max-num-batched-tokens 16384 --enforce-eager --disable-log-requests --additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp": true,"chunked_prefill_for_mla":true}' ##### b.How to run graph mode eg: python -m vllm.entrypoints.openai.api_server --model=/model_path --trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002 --max-model-len 5120 --max-num-batched-tokens 16384 --disable-log-requests --additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp": true,"chunked_prefill_for_mla":true,"torchair_graph_config":{"enabled":true}}' - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com> Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Co-authored-by: SlightwindSec <slightwindsec@gmail.com>	2025-08-12 14:12:12 +08:00
Mengqing Cao	4604882a3e	[ReleaseNote] Release note of v0.10.0rc1 (#2225 ) ### What this PR does / why we need it? Release note of v0.10.0rc1 - vLLM version: v0.10.0 - vLLM main: `8e8e0b6af1` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-07 14:46:49 +08:00
zhangxinyuehfad	92eebc0c9b	[Doc] Update user guide for supported models (#2263 ) ### What this PR does / why we need it? Update user guide for supported models - vLLM version: v0.10.0 - vLLM main: `4be02a3776` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-08-07 14:39:51 +08:00
Li Wang	bf84f2dbfa	[Doc] Support kimi-k2-w8a8 (#2162 ) ### What this PR does / why we need it? In fact, the kimi-k2 model is similar to the deepseek model, and we only need to make a few changes to support it. what does this pr do: 1. Add kimi-k2-w8a8 deployment doc 2. Update quantization doc 3. Upgrade torchair support list ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-06 19:28:47 +08:00
Yikun Jiang	54ace9e12b	Add release note for v0.9.1rc2 (#2188 ) ### What this PR does / why we need it? Add release note for v0.9.1rc2 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: v0.10.0 - vLLM main: `c494f96fbc` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-08-06 09:04:46 +08:00
Wang Kunpeng	e3a2443c3a	[main][Doc] add mla pertoken quantization FAQ (#2018 ) ### What this PR does / why we need it? When using deepseek series models generated by the --dynamic parameter, if torchair graph mode is enabled, we should modify the configuration file in the CANN package to prevent incorrect inference results. - vLLM version: v0.10.0 - vLLM main: `7728dd77bb` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-07-27 08:47:51 +08:00
Li Wang	bdfb065b5d	[1/2/N] Enable pymarkdown and python __init__ for lint system (#2011 ) ### What this PR does / why we need it? 1. Enable pymarkdown check 2. Enable python `__init__.py` check for vllm and vllm-ascend 3. Make clean code ### How was this patch tested? - vLLM version: v0.9.2 - vLLM main: `29c6fbe58c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-25 22:16:10 +08:00
wangxiyuan	326dcf2576	[Doc] Update support feature (#1828 ) The feature support matrix is out of date. This PR refresh the content. - vLLM version: v0.9.2 - vLLM main: `107111a859` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-23 15:19:15 +08:00
Mengqing Cao	8cfd257992	[Dist][EP] Remove ETP/EP maintained in vllm-ascend (#1681 ) ### What this PR does / why we need it? Remove ETP/EP maintained in branch main. We drop this because there are no relevant scenarios that use ETP now, and we may subsequently advocate implementing expert tensor parallelism in vLLM to support scenarios where experts need to be sliced. This is a part of #1422 backport. Fixes https://github.com/vllm-project/vllm-ascend/issues/1396 https://github.com/vllm-project/vllm-ascend/issues/1154 ### Does this PR introduce _any_ user-facing change? We'll not maintain etp/ep in vllm-ascend anymore, and use the tp/ep in vllm instead. ### How was this patch tested? CI passed with newly added and existing tests. - vLLM version: v0.9.2 - vLLM main: `fe8a2c544a` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-21 09:08:04 +08:00
wangxiyuan	eb921d2b6f	[Doc] Fix 404 error (#1797 ) Fix url 404 error in doc - vLLM version: v0.9.2 - vLLM main: `9ad0a4588b` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-15 11:52:38 +08:00
wangxiyuan	3c404de1b1	[Release]Update release note (#1753 ) There are still issues with pp in some cases, such as aclgraph and ray. Remove the related doc from the release note Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-11 17:58:26 +08:00
wangxiyuan	b5b7e0ecc7	[Doc] Add qwen3 embedding 8b guide (#1734 ) 1. Add the tutorials for qwen3-embedding-8b 2. Remove VLLM_USE_V1=1 in docs; it is no longer needed from 0.9.2 - vLLM version: v0.9.2 - vLLM main: `5923ab9524` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-11 17:40:17 +08:00
wangxiyuan	9c560b009a	[Release] Add 0.9.2rc1 release note (#1725 ) Add release note for 0.9.2rc1, we'll release soon - vLLM version: v0.9.2 - vLLM main: `7bd4c37ae7` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-11 17:36:05 +08:00
wangxiyuan	3d1e6a5929	[Doc] Update user doc index (#1581 ) Add user doc index to make the user guide more clear - vLLM version: v0.9.1 - vLLM main: `49e8c7ea25` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-10 14:26:59 +08:00
Li Wang	c7446438a9	[1/N][CI] Move linting system to pre-commits hooks (#1256 ) ### What this PR does / why we need it? Follow vllm-project/vllm lint way: https://github.com/vllm-project/vllm/blob/main/.pre-commit-config.yaml Enable pre-commit to avoid some low level error AMAP. This pr is one step of #1241, The purpose is make linting system more clear and convenient, on this step, Mainly did the following things: yapf, actionlint, ruff, typos, isort, mypy, png-lint, signoff-commit, enforce-import-regex-instead-of-re. TODO: - clang-format(check for csrc with google style) need clean code, disable for now - pymarkdown need clean code, disable for now - shellcheck need clean code, disable for now ### Does this PR introduce _any_ user-facing change? Only developer UX change: https://vllm-ascend--1256.org.readthedocs.build/en/1256/developer_guide/contributing.html#run-lint-locally ``` pip install -r requirements-lint.txt && pre-commit install bash format.sh ``` ### How was this patch tested? CI passed with new added/existing test. Co-authored-by: Yikun [yikunkero@gmail.com](mailto:yikunkero@gmail.com) Co-authored-by: wangli [wangli858794774@gmail.com](mailto:wangli858794774@gmail.com) - vLLM version: v0.9.1 - vLLM main: `5358cce5ff` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-10 14:17:15 +08:00
Yikun Jiang	e4e9ea02ab	Upgrade vLLM version to v0.9.2 (#1652 ) ### What this PR does / why we need it? This patch upgrades the vLLM version to v0.9.2. It does not remove the v0.9.1-compatible code, to ease review. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.9.1 - vLLM main: `14601f5fba` - Accuracy test with 0.9.2: https://github.com/vllm-project/vllm-ascend/actions/runs/16121612087 Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-08 14:18:17 +08:00
Angazenn	a5f33590d3	[CORE]initial support for torchair with non-mla backend (#1506 ) ### What this PR does / why we need it? This PR supports torchair graph mode with non-mla backend on both 800IA2 and 300I Duo platforms. The main change is to add `attention_v1_torchair.py` to support specific attention related operations that are required by torchair. ### Does this PR introduce _any_ user-facing change? Before this PR, vLLM-Ascend only allowed deepseek to use torchair. Now we can also use it with pangu. Besides, we add a supported model list to control which types of models can use torchair. ### How was this patch tested? We have tested it with PanguProMoE on both 800IA2 and 300I Duo platforms, and the model generates answers normally. --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Signed-off-by: tianyitang <tangtianyi4@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com> Co-authored-by: tianyitang <tangtianyi4@huawei.com>	2025-07-03 22:21:42 +08:00
yupeng	d96da1f00c	[DOC] Fix word spelling (#1595 ) ### What this PR does / why we need it? Fix word spelling in DOC. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. Signed-off-by: paulyu12 <507435917@qq.com>	2025-07-02 21:42:39 +08:00
yupeng	c3c8c9317c	[DOC] add LoRA user guide (#1265 ) ### What this PR does / why we need it? Add LoRA user guide to DOC. The content refers to [LoRA Adapters](https://docs.vllm.ai/en/latest/features/lora.html). ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No --------- Signed-off-by: paulyu12 <507435917@qq.com>	2025-07-02 14:41:31 +08:00
Shanshan Shen	ba577dfc52	[Doc] Add Structured Output guide (#1499 ) ### What this PR does / why we need it? Add Structured Output guide. Signed-off-by: shen-shanshan <467638484@qq.com>	2025-06-30 17:21:44 +08:00
Zhu Yi Lin	b308a7a258	support pangumoe w8a8c8 and docs (#1477 ) ### What this PR does / why we need it? support pangu moe w8a8c8 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new added test. Signed-off-by: zhuyilin <809721801@qq.com>	2025-06-28 18:51:07 +08:00
wangxiyuan	205cb85a1e	[Doc] Fix doc typo (#1424 ) 1. Fix the typo 2. Fix 404 url 3. update graph mode and additional config user guide Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-25 19:28:26 +08:00
Li Wang	15df8be937	[Doc] Add sleep mode doc (#1295 ) ### What this PR does / why we need it? Add sleep related doc and example --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-06-25 14:07:14 +08:00
Yikun Jiang	2e5f312530	Cleanup unused doc (#1352 ) ### What this PR does / why we need it? Clean up the unused doc for the MoGE model; we will add it back when the MoGE model is ready. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-22 15:05:30 +08:00
Yikun Jiang	c30ddb8331	Bump v0.9.1rc1 release (#1349 ) ### What this PR does / why we need it? Bump v0.9.1rc1 release Closes: https://github.com/vllm-project/vllm-ascend/pull/1341 Closes: https://github.com/vllm-project/vllm-ascend/pull/1334 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed --------- Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: leo-pony <nengjunma@outlook.com> Co-authored-by: shen-shanshan <467638484@qq.com>	2025-06-22 13:15:36 +08:00
wangxiyuan	45be1aac0c	[CI] Add codespell check for doc (#1314 ) Add codespell check test for doc only PR Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-20 16:48:14 +08:00
22dimensions	761bd3d9d7	Add user guide for quantization (#1206 ) ### What this PR does / why we need it? Add user guide for quantization ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Preview Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-06-20 15:53:25 +08:00
22dimensions	0d2074a1ec	[Doc] fix VLLM_USE_V1 value in graph mode docs (#1226 ) os.environ["VLLM_USE_V1"] must be assigned with str, not other type. ![image](https://github.com/user-attachments/assets/9d337ae5-00e5-4179-832e-c6c917dd5798) Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-06-15 15:41:11 +08:00
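The point of the commit above is that `os.environ` only accepts string values. A short sketch of the failure and the fix using only the standard library (the variable `error` is introduced here for illustration):

```python
import os

error = None
try:
    # Wrong: os.environ values must be str, so an int is rejected.
    os.environ["VLLM_USE_V1"] = 1
except TypeError as exc:
    error = exc

print(type(error).__name__)  # TypeError

# Right: assign the value as a string.
os.environ["VLLM_USE_V1"] = "1"
print(os.environ["VLLM_USE_V1"])  # 1
```

Note that a later commit in this log (#1734) removes `VLLM_USE_V1=1` from the docs for 0.9.2 and newer, so this applies to the older graph mode docs.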
fems14	ab5d110fcc	vllm-ascend support chunked prefill (#1172 ) ### What this PR does / why we need it? vllm-ascend support chunked prefill for MLA --------- Signed-off-by: fems14 <1804143737@qq.com>	2025-06-14 22:31:16 +08:00
Mengqing Cao	a3b5af8307	[CI/UT][Graph] Add ut for torchair graph mode (#1103 ) ### What this PR does / why we need it? Add ut for torchair graph mode on DeepSeekV3 ### How was this patch tested? CI passed with new added test. --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Mengqing Cao <cmq0113@163.com>	2025-06-14 16:59:00 +08:00
sdmyzlp	e72f94e38f	Support multistream of MLA vector operations (#1135 ) ### What this PR does / why we need it? Move all vector operations to a secondary stream, with the expected overlapping being: ``` \| q_rmsnorm \| \| kv_norm_rope_cache \| \| q_rope \| \| matmul W_DQ \| matmul W_DKV \| index \| index \| matmul W_UQ \| split \| matmul W_KV_T \| ``` Currently, the `IndexByTensor` operators introduced by the computation of `cos` and `sin` can't be offloaded to the secondary stream due to a known bug in the graph fusion optimization pass. So we instead keep them in the main stream, requiring only that they be computed before `matmul W_UQ` to avoid hindering later overlapping. The problem may be solved by a later optimization (#993), which hoists the computation of `cos` and `sin` up to the first layer. ### Does this PR introduce _any_ user-facing change? Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted to False. ### How was this patch tested? Tested on 1x16 910 node, with tailored 2 layer DSKv2. Signed-off-by: sdmyzlp <lrwei2@petalmail.com>	2025-06-12 21:42:09 +08:00
chenwaner	e46dc142bf	Enable kvcache_nz for the decode process in torchair graph mode (#1098 ) What this PR does / why we need it? Enable kvcache_nz for the decode process in torchair graph mode, which reduces the time consumed by FA in long sequences. Does this PR introduce any user-facing change? To enable kvcache_nz, set additional_config.torchair_graph_config.enable_kv_nz=True How was this patch tested? 1. Tested in deepseek model: with batchsize 64 and seq_len 1k+3k, 61 layers FA total time improves 20.80ms -> 19.76ms 2. operator precision test: [aclnnFusedInferAttentionScoreV3_result.csv](https://github.com/user-attachments/files/20664138/aclnnFusedInferAttentionScoreV3_result.csv) 3. tpot test from @ttanzhiqiang, and curl one result is normal https://github.com/vllm-project/vllm-ascend/pull/1098#issuecomment-2948542159 https://github.com/vllm-project/vllm-ascend/pull/1098#issuecomment-2954496588 --------- Signed-off-by: chenwaner <861645847@qq.com>	2025-06-11 14:09:28 +08:00
yz	4153a5091b	[Doc] Fix the config parameter name "enable" in graph_mode.md. (#1159 ) Fix the doc typo in graph_mode.md Signed-off-by: yzim <43207690+yzim@users.noreply.github.com>	2025-06-11 11:03:37 +08:00
sdmyzlp	7bdc606677	Support multistream of shared experts in FusedMoE (#997 ) Contains on #1111 for completeness. ### What this PR does / why we need it? Implement multi-stream parallelism for MoE layers with shared experts, where computation of shared experts will be overlapped with expert token dispatch and combine. Also, when multi-stream is enabled, weights of shared experts will be forced to replicate across all cards, regardless of any tensor parallelism configurations, to avoid AllReduce operations. With the expected overlapping being: ``` \| shared gate_up \| shared act \| \| shared down \| \| dispatch \| routed gate_up, act, down \| combine \| ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested on 1x16 910 node, with tailored 2 layer DSKv2. --------- Signed-off-by: sdmyzlp <lrwei2@petalmail.com>	2025-06-11 09:18:38 +08:00
wangxiyuan	b75cb788dd	[Bugfix] add compilation/__init__.py to fix import error (#1152 ) 1. Add `__init__.py` for vllm_ascend/compilation to make sure it's a python module 2. Fix model runner bug to keep the same with vllm 3. Add release note for 0.9.0rc2 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-10 17:14:25 +08:00
wangxiyuan	571f88f85e	[Doc] Update 0.9.0rc1 release date (#1139 ) 1. Update 0.9.0rc1 release date 2. Update feature and model support list 3. Add DP known issue to release note Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-09 22:51:02 +08:00


80 Commits