xc-llm-ascend

Author	SHA1	Message	Date
TaoYu Chen	5fe883fa43	fix the title of modelrunner's prepare inputs docs (#3457 ) ### What this PR does / why we need it? Fix the wrong title of the modelrunner_prepare_inputs docs ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? pass CI - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com>	2025-10-14 20:35:58 +08:00
yuzhup	78777237a9	[2/N][Feat] Attention and MoE weight prefetch in Qwen3MoE models (#3203 ) ### What this PR does / why we need it? - Refactor and integrate a unified `WeightPrefetchMethod` - Integrate `gate_up_proj.weight` in quantized Attention modules - Prefetching these weights ahead of matmul-like operators improves performance by reducing L2 cache transfer latency ### Does this PR introduce _any_ user-facing change? Add a new config in `--additional-config` for configuration: ```json { "weight_prefetch_config": { "enabled": true, "prefetch_ratio": { "moe": { "gate_up": 0.8 } } } } ``` This feature is enabled by default, and can be disabled through this configuration ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: yuzhup <15705211260@163.com>	2025-10-14 20:16:33 +08:00
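A minimal sketch of how the `weight_prefetch_config` block from the commit above could be serialized into a JSON string for `--additional-config`. It assumes the option accepts a JSON string; the helper name `build_weight_prefetch_config` is hypothetical and not part of vllm-ascend, and only the keys shown in the commit message are used.

```python
import json


def build_weight_prefetch_config(enabled: bool = True, gate_up_ratio: float = 0.8) -> str:
    """Hypothetical helper: build the JSON string for --additional-config."""
    config = {
        "weight_prefetch_config": {
            "enabled": enabled,
            "prefetch_ratio": {"moe": {"gate_up": gate_up_ratio}},
        }
    }
    # json.dumps emits lowercase true/false and no trailing commas,
    # so the result is valid JSON.
    return json.dumps(config)


print(build_weight_prefetch_config())
print(build_weight_prefetch_config(enabled=False))
```

Generating the string with `json.dumps` rather than typing it by hand avoids the two easy mistakes in hand-written JSON: Python-style `True` and trailing commas.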
wangxiaoteng888	19b85ef1bc	[Bugfix] multi_node_pd_disaggregation_mooncake.md update (#3400 ) ### What this PR does / why we need it? multi_node_pd_disaggregation_mooncake.md update. Fix issues encountered during service startup. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: wangxiaoteng@huawei.com <wangxiaoteng@huawei.com>	2025-10-14 09:29:35 +08:00
wangxiyuan	49b850270f	[Community] Nominate new maintainers: @yiz-liu @paulyu12 @weijinqian0 @nalinaly (#3406 ) I'd like to nominate 4 new maintainers for vllm-ascend: ---- Yizhou Liu [@yiz-liu](https://github.com/yiz-liu) ---- Review Quality: He has completed [40+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3Ayiz-liu) and provided solutions or guides for [10+ issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20commenter%3Ayiz-liu), including many high-quality reviews such as [#issue-3428408401](https://github.com/vllm-project/vllm-ascend/issues/3002#issue-3428408401), [#discussion_r2224572309](https://github.com/vllm-project/vllm-ascend/pull/1803#discussion_r2224572309), [#issuecomment-2982470226](https://github.com/vllm-project/vllm-ascend/pull/1261#issuecomment-2982470226), [#issuecomment-2903621197](https://github.com/vllm-project/vllm-ascend/pull/836#issuecomment-2903621197), [#issuecomment-2857678691](https://github.com/vllm-project/vllm-ascend/issues/778#issuecomment-2857678691). Sustained and High-Quality Contributions: He has contributed [30+ commits](https://github.com/vllm-project/vllm-ascend/commits?author=yiz-liu) since Mar. 2025. His aclgraph, DP, and EP related contributions in particular are the main reason why I nominated him. As the owner of aclgraph support, he continuously improves aclgraph stability and performance and fixes key bugs. He laid the groundwork for EP-related functionality and delivered multiple foundational improvements. Community involvement: He has a very good habit of logging issues: https://github.com/vllm-project/vllm-ascend/issues/1649 and is also very active and involved in [many issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aopen%20commenter%3Ayiz-liu%20-author%3Ayiz-liu) to help users resolve issues. 
---- Peng Yu [@paulyu12](https://github.com/paulyu12) --- The main reasons for his nomination are his LoRA expertise and his sustained and major contributions (initial support/doc/bugfix) around LoRA. Sustained and Major Contributions: @paulyu12 started contributing with [LoRA and Multi-LoRA support](`697908f5cd`) in Apr 2025 and has contributed about [10+ commits and bugfixes](`697908f5cd`) on vllm-ascend. Review Quality and Community Involvement: He also helped 10+ users address [LoRA related issues](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3Apaulyu12+-author%3Apaulyu12+is%3Aclosed). I believe his addition will further improve vLLM Ascend LoRA support. ---- Jinqian Wei [@weijinqian0](https://github.com/weijinqian0) --- The main reasons for his nomination are his key contributions to the RL scene and the high quality of his code reviews. Review Quality: He has completed [60+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3Aweijinqian0+is%3Aopen+-author%3Aweijinqian0) since June 2025, including high-quality reviews such as [#comment-3284055430](https://github.com/vllm-project/vllm-ascend/pull/2791#issuecomment-3284055430), [discussion_r2332166704](https://github.com/vllm-project/vllm-ascend/pull/2817#discussion_r2332166704), [discussion_r2343289692](https://github.com/vllm-project/vllm-ascend/pull/2846#discussion_r2343289692). Sustained and Quality Contributions: He has a deep understanding of the vLLM and vLLM Ascend codebases and solid contributions in the RL scene (about [10+ PRs merged](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Aweijinqian0+is%3Amerged+) and 10+ PRs merged as co-author). 
- Code Refactor: As a co-author, he participated in the refactoring of the MOE module https://github.com/vllm-project/vllm-ascend/pull/2150 https://github.com/vllm-project/vllm-ascend/pull/2706 https://github.com/vllm-project/vllm-ascend/pull/2867 - Performance Enhancement for RL: Participated as a co-author in the design and development of the solution, contributing to the planning of core capabilities. https://github.com/vllm-project/vllm-ascend/pull/1547 https://github.com/vllm-project/vllm-ascend/pull/2120 and so on. So I think he's a great addition to the vLLM Ascend Maintainer team. ---- Chuanyu Qin [@nalinaly](https://github.com/nalinaly) --- The main reason I nominated Qinchuanyu is that he is the initial designer of aclgraph and torch-npu, two key components of vllm-ascend. Considering aclgraph will eventually become the main path for vllm-ascend's graph model, I propose to nominate him. Sustained and Major Contributions: chuanyu has actively helped the users/developers of vllm-ascend since Mar 2025 ([vllm-discuss#162](https://discuss.vllm.ai/t/can-ascend-officially-draft-a-documentation-on-the-vllm-ascend-adaptation-for-graph-mode/162/5)), and also helped early users of vllm-ascend understand aclgraph. He provided lots of help in the process of integrating aclgraph with vllm-ascend. Community Involvement: As a speaker, he also gave the talk [《The design philosophy of torch_npu and the high performance principle of aclGraph》](https://github.com/PyTorch-China/pytorch-meetup/blob/main/beijing-2025/%E3%80%905%E3%80%91torch_npu%20%E7%9A%84%E8%AE%BE%E8%AE%A1%E5%93%B2%E5%AD%A6%E4%B8%8E%20aclGraph%20%E9%AB%98%E6%80%A7%E8%83%BD%E5%8E%9F%E7%90%86-%E7%A7%A6%E4%BC%A0%E7%91%9C-0920.pdf) to help users understand aclgraph and torch_npu. ---- They have contributed actively to vllm-ascend or have rich experience with Ascend AI. Welcome! 
- vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-10-14 08:51:58 +08:00
wangxiaoteng888	ca05f7d632	[Bugfix] TP size larger than KV cache head causes accuracy issues (#3366 ) ### What this PR does / why we need it? Resolve the issue where, in the case of unequal TP (Tensor Parallelism), the TP size is larger than the number of model attention kvcache heads, causing the KV cache to generate duplicates, which leads to transmission errors in the original code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com>	2025-10-11 11:22:23 +08:00
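To illustrate the situation described in the commit above: when the TP size exceeds the number of KV heads, each KV head has to be held by more than one rank, so several ranks end up with identical KV cache content. This is a minimal sketch, assuming heads are assigned to ranks in contiguous blocks and the sizes divide evenly. The function name `kv_heads_for_rank` is hypothetical, and this is not the vllm-ascend transfer code.

```python
def kv_heads_for_rank(rank: int, tp_size: int, num_kv_heads: int) -> list[int]:
    """Return the KV head indices held by one TP rank (illustrative layout)."""
    if tp_size <= num_kv_heads:
        if num_kv_heads % tp_size != 0:
            raise ValueError("num_kv_heads must be divisible by tp_size")
        per_rank = num_kv_heads // tp_size
        return list(range(rank * per_rank, (rank + 1) * per_rank))
    if tp_size % num_kv_heads != 0:
        raise ValueError("tp_size must be divisible by num_kv_heads")
    # More ranks than heads: each head is replicated on `replicas` ranks.
    replicas = tp_size // num_kv_heads
    return [rank // replicas]


# tp_size=8 with 4 KV heads: every head appears on two ranks.
layout = [kv_heads_for_rank(r, tp_size=8, num_kv_heads=4) for r in range(8)]
print(layout)  # [[0], [0], [1], [1], [2], [2], [3], [3]]
```

With this layout, ranks 0 and 1 hold the same head, so a transfer that treats every rank's KV cache as distinct would move the same data twice.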
wangxiyuan	ba19dd3183	Revert PTA upgrade PR (#3352 ) We noticed that torch-npu 0919 doesn't work. This PR reverts the related changes that rely on the 0919 version. Revert PR: #3295 #3205 #3102 Related: #3353 - vLLM version: v0.11.0	2025-10-10 14:09:53 +08:00
Li Wang	60b7c936c5	[Doc] Update deepseek-v3.2 doc (#3319 ) ### What this PR does / why we need it? Upgrade deepseek-v3.2 doc for A2 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-10-10 08:55:39 +08:00
Ruri	ff37575936	[1/N][Feat] Add weight prefetch feature for Attention layers (#3146 ) ### What this PR does / why we need it? - Refactor and integrate a unified `WeightPrefetchMethod` - Integrate `qkv_proj.weight` and `o_proj.weight` in quantized Attention modules - Prefetching these weights ahead of matmul-like operators improves performance by reducing L2 cache transfer latency ### Does this PR introduce _any_ user-facing change? Add a new config in `--additional-config` for configuration: ```json { "weight_prefetch_config": { "enabled": false, "prefetch_ratio": { "attn": { "qkv": 1.0, "o": 1.0 } } } } ``` This feature is enabled by default, and can be disabled through this configuration ### How was this patch tested? - vLLM version: v0.11.0 --------- Signed-off-by: yuzhup <15705211260@163.com> Signed-off-by: zhoux77899 <zhouxiang100@huawei.com> Co-authored-by: yuzhup <15705211260@163.com>	2025-10-09 20:38:39 +08:00
Yikun Jiang	2dde1268c7	Fix doc for A2 series and cleanup note (#3307 ) ### What this PR does / why we need it? Fix doc for A2 series and cleanup note ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-10-01 14:39:48 +08:00
wangxiyuan	b8c58d68e1	[Doc] Add deepseek v3.2 tutorial (#3275 ) Add deepseek v3.2 tutorial - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: wangli <wangli858794774@gmail.com> Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-30 17:54:31 +08:00
wangxiyuan	4abdcdba4e	upgrade pta to 0919 (#3295 ) ### What this PR does / why we need it? Upgrade torch-npu to the newest POC version ### Does this PR introduce _any_ user-facing change? Yes, users need to upgrade the PTA version as well. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-30 17:14:23 +08:00
wangxiyuan	00ba071022	[Doc] Release note for v0.11.0rc0 (#3224 ) ### What this PR does / why we need it? Add release note for v0.11.0rc0 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-30 03:26:18 +08:00
Yikun Jiang	5503a3142f	Bump version to v0.11.0rc3 (#3213 ) ### What this PR does / why we need it? Bump version to v0.11.0rc2 and prepare vLLM Ascend v0.11.0rc0 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-29 21:48:06 +08:00
weiguihua2	065486820b	[Doc] add faqs:install vllm-ascend will overwrite existing torch-npu (#3245 ) ### What this PR does / why we need it? add faqs:install vllm-ascend will overwrite existing torch-npu ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-09-29 12:02:23 +08:00
Peipei	cf445c41f9	[Doc]Add qwen3_vl series guide (#3227 ) ### What this PR does / why we need it? This PR provides user guide documents for Qwen3-VL 4B and Qwen3-VL-235B-A22B. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 --------- Signed-off-by: booker123456 <945658361@qq.com>	2025-09-28 21:35:52 +08:00
lilinsiman	1705501ae2	[BugFix] Fix ACLgraph bug in Qwen3_32b_int8 case (#3204 ) ### What this PR does / why we need it? 1. Solved the issue where sizes capture failed for the Qwen3-32b-int8 model when aclgraph, dp1, and tp4 were enabled. 2. Added the exception thrown when sizes capture fails and provided a solution 3. Add this common problem to the FAQ doc ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-09-28 17:44:04 +08:00
Wang Kunpeng	859e861d92	[main][quantization] Support deepseek w4a8 per-channel quantization (#3011 ) ### What this PR does / why we need it? 1.Support deepseek w4a8 per-channel quantization 2.The eager mode supports converting weights to the NZ format ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? #### How to get weights using Modelslim ##### Installation steps git clone https://gitcode.com/Ascend/msit.git cd msit/msmodelslim bash install.sh ##### Generate w4a8 per-channel weights cd /example/DeepSeek Command reference: msmodelslim/example/DeepSeek/README.md - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-09-27 21:01:16 +08:00
offline893	5d13bbe796	[BugFix]Modify eplb feature guide. (#3183 ) ### What this PR does / why we need it? Revise the EPLB feature guide content. Add EPLB params to the ascend config. ### Does this PR introduce any user-facing change? ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `52d0cb8458` Co-authored-by: offline0806 <3337230449@qq.com>	2025-09-25 17:01:51 +08:00
Csrayz	80524f5711	[CORE] concurrent partial prefills (#2372 ) ### What this PR does / why we need it? When processing a mix of large and small requests, the TTFT of responses is significantly reduced. Please refer to https://github.com/vllm-project/vllm/pull/10235, which achieves the same effect by simply limiting the number of prompt fills for long requests. This solution can be applied to both AscendScheduler (V0) and vLLM Scheduler (V1). Tests show that TTFT can be significantly improved when handling such mixed requests. However, this capability is currently missing when Ascend Scheduler is enabled. This benchmark used the Qwen3-8B model, with a context length of 128K, running on a single card. Regarding dataset selection, the sharegpt_clean dataset is used, with its content concatenated and cropped. Small requests with token=50 and medium requests with token=10240 were constructed (there were also large requests with token=102400, but these were ignored because when using the Prefill First scheduling strategy, max_num_batched_tokens will not be set to such a large value). When loading vLLM, set max_num_batched_tokens=22000. This length can accommodate two medium-sized requests and some short requests, reflecting an extreme scenario where the budget is almost entirely occupied by longer requests. Next, we mix 990 small requests and 100 medium requests into one type of load scenario (hereinafter referred to as 10%), and similarly generate load scenarios with 5% and 1% medium requests. Performance tests were conducted separately for enabling vLLMScheduler, AscendScheduler, and AscendScheduler (long prompt concurrency set to 1). - vLLM version: v0.10.2 - vLLM main: `1dfea5f4a9` --------- Signed-off-by: Csrayz <jover@cmbchina.com>	2025-09-24 17:12:55 +08:00
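The token-budget arithmetic in the benchmark description above can be checked with a small sketch. It is not the AscendScheduler implementation: `schedule_prefills`, `long_prompt_threshold`, and `max_long_partial_prefills` are hypothetical names, and admitting whole prompts without chunking is a simplifying assumption. It shows how capping the number of long prompts admitted per step leaves budget for short ones.

```python
def schedule_prefills(prompt_lens, budget, long_prompt_threshold, max_long_partial_prefills):
    """Greedily admit waiting prompts in order; return the scheduled indices."""
    scheduled = []
    remaining = budget
    long_count = 0
    for i, n in enumerate(prompt_lens):
        is_long = n >= long_prompt_threshold
        if is_long and long_count >= max_long_partial_prefills:
            continue  # cap reached: this long prompt waits for a later step
        if n > remaining:
            continue  # does not fit in what is left of the token budget
        scheduled.append(i)
        remaining -= n
        if is_long:
            long_count += 1
    return scheduled


# Two medium (10240-token) prompts arrive ahead of 40 small (50-token) ones.
waiting = [10240, 10240] + [50] * 40

# Up to two long prompts per step: 20480 tokens go to them, 1520 remain.
print(len(schedule_prefills(waiting, 22000, 10240, 2)))  # 32 (2 medium + 30 small)

# Long prompt concurrency set to 1: all 40 small prompts fit in the same step.
print(len(schedule_prefills(waiting, 22000, 10240, 1)))  # 41 (1 medium + 40 small)
```

With the budget of 22000 tokens, two medium prompts consume 20480 tokens, which matches the "almost entirely occupied" scenario in the description.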
Jianwei Mao	d586255678	fix wrong --num-gpus parameter requirements, and avoid ambiguity (#3116 ) fix the problem of https://github.com/vllm-project/vllm-ascend/issues/3114 - vLLM version: v0.10.2 - vLLM main: `5aeb925452` Signed-off-by: Jianwei Mao <maojianwei2012@126.com>	2025-09-23 11:58:44 +08:00
Li Wang	02f89d166f	[CI] Update vllm version to 20250922(5aeb925) (#3091 ) ### What this PR does / why we need it? This PR bumps the vllm commit hash to `5aeb925452` and fixes the following issues: 1. https://github.com/vllm-project/vllm/pull/25345 has removed v0 metadata 2. https://github.com/vllm-project/vllm/pull/25332 3. https://github.com/vllm-project/vllm/pull/25334 4. https://github.com/vllm-project/vllm/pull/23558, note that this vllm commit updates the model registration logic, which now checks that every registered model has the `vllm.model_executor.models` path. This breaks our custom registration of the deepseek_v3 model (it doesn't exist in the vllm model path), so I moved the deepseek_v3 model registry to deepseek_v2 as a temporary solution ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `9607d5eb44` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-09-22 22:18:13 +08:00
whx	0a526768f5	[Feature] Support moe multi-stream for aclgraph. (#2946 ) This PR puts the calculation of shared experts into a separate stream, overlapping with routing experts. - vLLM version: v0.10.2 - vLLM main: `fbd6523ac0` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-09-19 11:06:45 +08:00
LeeWenquan	f4e3d22432	Remove chunked_prefill_for_mla and fix ring_mla bug (#2781 ) ### What this PR does / why we need it? Remove chunked prefill for mla branch in mla , and change dtype of prefill_mask to avoid accuracy problem ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `ef7eefe17a` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2025-09-18 19:43:26 +08:00
Li Wang	4267f5d55f	[Doc] Add multi-node ray backend tutorial (#2376 ) ### What this PR does / why we need it? Add multi-node ray backend tutorial for Qwen235B-A3B ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `f4cd80f944` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-09-18 15:30:18 +08:00
1Fire4	1f6465c399	Add an option of enable frozen parameter (#2869 ) ### What this PR does / why we need it? Add an option of enable frozen parameter ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `68dbde5dbb` Signed-off-by: 1Fire4 <wangdingyi2@huawei.com>	2025-09-17 12:00:44 +08:00
offline893	76844eec78	Dynamic Expert Load Balance with Zero-like-overhead (#2956 ) ### Motivation Currently, dynamic expert load balancing stops the world. Asynchronous expert load balancing avoids the following problems: Host-bound latency: EPLB involves many CPU operations, such as running the EPLB algorithm, creating p2p ops, and log2phy expert conversion, which take a long CPU time (~1s). Communication latency: Transfers are expensive without NVLink. Because the weights of one expert may be transferred to multiple new positions, one expert can require N send/recv operations, resulting in long latency. We measured that batch_isend_irecv costs more than 100ms to transfer the weights of 16 experts on an Ascend A2 server. SwiftBalancer no longer stops the world; in our test on NPU it costs 1~2ms per layer while improving decode latency by 5ms-8ms with ep_size = 64. The following updates have been made: 1. Expert distribution recording at lower cost. 2. Asynchronous CPU computation for the EPLB algorithm and other Python operators. 3. A new EPLB algorithm that rebalances fewer experts with almost the same effect. ### Proposed Change We will gradually migrate the EPLB logic to the VLLM community and implement a generalized design. Relevant RFC: https://github.com/vllm-project/vllm/issues/22246 The overall workflow involves: <img width="801" height="302" alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c" src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed" /> 1. Record the expert distribution during forward. We use expert_token_num after dispatch instead of topk_ids, which gives a much smaller tensor shape and reduces the cost of HBM recording and the add operator. 2. Do an all-gather of the expert distribution. All-gather is used instead of all-reduce because it has less traffic volume. 3. Wake up the EPLB worker process with the expert distribution when num_iterations is reached. Run the EPLB algorithm in the EPLB worker. 4. Generate p2p send/recv ops and run other operators such as log2phy, which take a long CPU time. 5. Launch ibatch_send_recv in async_stream before forward. 6. After forward, wait for ibatch_send_recv to finish, then update the expert map and expert weights. ### Co-author Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn Co-authored-by: qmkakaxi wjh1594260677@qq.com Co-authored-by: Skywalker-EP 173723846@qq.com - vLLM version: v0.10.2 - vLLM main: `567939953b` --------- Signed-off-by: offline0806 <z00858301@china.huawei.com> Co-authored-by: offline0806 <z00858301@china.huawei.com>	2025-09-17 10:36:43 +08:00
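Step 1 of the workflow above records a per-expert token count rather than the raw routing table. The sketch below only illustrates why that record is smaller. It derives the counts from `topk_ids` in plain Python for clarity, whereas the commit says the real counts come from the dispatch step; the function name `expert_token_num` is borrowed from the commit text and its signature here is an assumption.

```python
def expert_token_num(topk_ids, num_experts):
    """Count how many token slots were routed to each expert."""
    counts = [0] * num_experts
    for token_experts in topk_ids:
        for expert_id in token_experts:
            counts[expert_id] += 1
    return counts


# 3 tokens, top-2 routing, 4 experts.
topk_ids = [[0, 2], [1, 2], [2, 3]]
counts = expert_token_num(topk_ids, num_experts=4)
print(counts)  # [1, 1, 3, 1]

# topk_ids grows with num_tokens * top_k; the count vector stays at num_experts.
print(sum(len(row) for row in topk_ids), len(counts))  # 6 4
```

The size of the count vector depends only on the number of experts, so the recording cost does not grow with batch size.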
Yikun Jiang	0aba644633	Update max_tokens and prompt in qwen3 online doc (#2945 ) ### What this PR does / why we need it? Update max_tokens and prompt in qwen3 online doc Before: ``` "'max_tokens' or 'max_completion_tokens' is too large: 4096. This model's maximum context length is 4096 tokens and your request has 18 input tokens (4096 > 4096 - 18). None" ``` After: ``` curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "/root/.cache/modelscope/hub/models/Qwen-SGlang/Qwen3-Next-80B-A3B-Instruct", "messages": [ {"role": "user", "content": "Who are you?"} ], "temperature": 0.6, "top_p": 0.95, "top_k": 20, "max_tokens": 32 }' .{"id":"chatcmpl-8ddbd65c9ddc405397219a6792feb9a0","object":"chat.completion","created":1757985049,"model":"/root/.cache/modelscope/hub/models/Qwen-SGlang/Qwen3-Next-80B-A3B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am designed to assist you in generating various","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":12,"total_tokens":44,"completion_tokens":32,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Manually test on my local env - CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-16 09:27:50 +08:00
wangxiyuan	048bfd5553	[Release] Add release note for v0.10.2rc1 (#2921 ) Add release note for v0.10.2rc1 - vLLM version: v0.10.2 - vLLM main: `b834b4cbf1` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-16 01:20:05 +08:00
Yikun Jiang	b5ccef6115	[Doc] Add doc for Qwen3 Next (#2916 ) ### What this PR does / why we need it? Add doc for Qwen3 Next ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Doc CI passed Related: https://github.com/vllm-project/vllm-ascend/issues/2884 - vLLM version: v0.10.2 - vLLM main: `01413e0cf5` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-16 01:16:06 +08:00
Yikun Jiang	0747a6e68c	Bump vLLM version to v0.10.2 (#2914 ) ### What this PR does / why we need it? Bump vLLM version to v0.10.2 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: v0.10.2rc3 - vLLM main: `15b8fef453` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-14 06:57:59 +08:00
Yikun Jiang	f97a64ba7f	Bump vLLM version to v0.10.2rc3 (#2911 ) ### What this PR does / why we need it? Bump vLLM version to v0.10.2rc3 https://github.com/vllm-project/vllm/compare/v0.10.2rc2...v0.10.2rc3 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: v0.10.2rc2 - vLLM main: `15b8fef453` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-13 19:15:48 +08:00
Yikun Jiang	8ece6956e7	Revert "Upgrade CANN version to 8.3.rc1.alpha001 (#2903 )" (#2909 ) ### What this PR does / why we need it? This reverts commit `339fceb89c`. ### Does this PR introduce _any_ user-facing change? Yes, use 8.2rc1 image by default ### How was this patch tested? CI passed - vLLM version: v0.10.2rc2 - vLLM main: `cfa3234a5b` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-13 16:21:54 +08:00
Yikun Jiang	339fceb89c	Upgrade CANN version to 8.3.rc1.alpha001 (#2903 ) ### What this PR does / why we need it? Upgrade CANN version to 8.3.rc1.alpha001 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.2rc2 - vLLM main: `89e08d6d18` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-13 12:10:21 +08:00
Yikun Jiang	138e932630	Bump vLLM version to v0.10.2rc2 (#2902 ) ### What this PR does / why we need it? Upgrade vLLM version to 0.10.2rc2 ### Does this PR introduce _any_ user-facing change? Yes, image will use 0.10.2rc2 vLLM ### How was this patch tested? - vLLM version: main - vLLM main: `f17c075884` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-13 11:39:48 +08:00
CaranLic	168ad600b5	[main] add pd transfer for ascend scheduler (#2753 ) ### What this PR does / why we need it? For offline scenarios, adjust the scheduling process to prioritize the prefill phase of all requests, then process the decode phase of all requests. ### How was this patch tested? ``` max_num_seqs=24, additional_config={ "ascend_scheduler_config":{ "enabled": True, "enable_pd_transfer": True, "decode_max_num_seqs": 24, "enable_chunked_prefill": False } }, ``` \| input \| output \| num prompts \| max_num_seqs \| dp \| tp \| scheduler \| tps \| \| ------ \| ------ \| ---------- \| ---------------- \| ---- \| ---- \| ---------------- \| --------------- \| \| dapo-math-17K \| 2K \| 384 \| 24 \| 2 \| 1 \| v1 \| 234.06 \| \| dapo-math-17K \| 2K \| 384 \| 24 \| 2 \| 1 \| pd transfer \| 239.59(+2.4%) \| \| dapo-math-17K\| 2K \| 384 \| 24 \| 4 \| 1 \| v1 \| 222.85 \| \| dapo-math-17K\| 2K \| 384 \| 24 \| 4 \| 1 \| pd transfer \| 225.81(+1.3%) \| - vLLM version: v0.10.1.1 - vLLM main: `6fb2788163` --------- Signed-off-by: CaranLic <740821011@qq.com>	2025-09-10 08:46:39 +08:00
Mengqing Cao	edf1f600ad	[CI] Remove compatibility maintenance for vllm v0.10.1 and v0.10.1.1 (#2840 ) ### What this PR does / why we need it? Remove compatibility maintenance for vllm v0.10.1 and v0.10.1.1 ### Does this PR introduce _any_ user-facing change? branch main of vllm-ascend will not be compatible with vllm v0.10.1 and v0.10.1.1 ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.1.1 - vLLM main: `6fb2788163` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-10 08:43:10 +08:00
yupeng	a746f8274f	[DOC] Qwen3 PD disaggregation user guide (#2751 ) ### What this PR does / why we need it? This PR adds the prefiller & decoder disaggregation deployment guide. The scenario of the guide is: - Use 3 nodes in total, with 2 NPUs on each node - Qwen3-30B-A3B - 1P2D - Expert Parallel The deployment can be used to verify the PD Disaggregation / Expert Parallel features with slightly fewer resources. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. - vLLM version: v0.10.1.1 - vLLM main: `e599e2c65e` --------- Signed-off-by: paulyu12 <507435917@qq.com>	2025-09-07 10:35:37 +08:00
Yikun Jiang	752e272a55	Add note for Ascend HDK version (#2765 ) ### What this PR does / why we need it? Add note for Ascend HDK version ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: v0.10.1.1 - vLLM main: `e599e2c65e` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-07 10:33:41 +08:00
lidenghui1110	5a7181569c	[feat]: oproj tensor parallelism in pure DP and graph-mode scenarios. (#2167 ) ### What this PR does / why we need it? This PR introduces Oproj matrix tensor model parallelism to reduce memory consumption. It only supports graph mode in the pure DP scenario. In a deepseek r1 w8a8 PD disaggregated Decode instance using pure DP with oproj_tensor_parallel_size = 8, TPOT increases by 1 ms and 5.8 GB of NPU memory is saved per rank. We got the best performance with oproj_tensor_parallel_size=4, with no TPOT increase. performance data: <img width="1442" height="442" alt="image" src="https://github.com/user-attachments/assets/83270fc5-868a-4387-b0a9-fac29b4a376d" /> ### Does this PR introduce _any_ user-facing change? This PR introduces one new config in `additional_config`. \| Name \| Effect \| Required \| Type \| Constraints \| \| :---------------------------- \| :--------------------------------------- \| :------- \| :--- \| :----------------- \| \| oproj_tensor_parallel_size \| Split the o_proj matrix along the row dimension (head num * head dim) into oproj_tensor_parallel_size pieces. \| No \| int \| Default value is None. Once this value is set, the feature is enabled; head num * head dim must be divisible by this value. \| example `--additional_config={"oproj_tensor_parallel_size": 8}` ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `eddaafc1c7` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: zzh <zzh_201018@outlook.com>	2025-09-07 10:31:32 +08:00
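The constraint in the table above can be expressed as a quick check. This is a minimal sketch: `oproj_shard_rows` is a hypothetical helper, not a vllm-ascend function, and the head count and head dimension in the usage lines are illustrative values rather than figures from this PR.

```python
def oproj_shard_rows(num_heads, head_dim, oproj_tensor_parallel_size=None):
    """Return the number of o_proj rows each shard would hold."""
    rows = num_heads * head_dim
    if oproj_tensor_parallel_size is None:
        return rows  # default: feature disabled, matrix is not split
    if rows % oproj_tensor_parallel_size != 0:
        raise ValueError(
            f"head num * head dim ({rows}) must be divisible by "
            f"oproj_tensor_parallel_size ({oproj_tensor_parallel_size})"
        )
    return rows // oproj_tensor_parallel_size


print(oproj_shard_rows(128, 128, 8))  # 2048
print(oproj_shard_rows(128, 128))     # 16384
```

Checking divisibility before launch surfaces a bad `oproj_tensor_parallel_size` as a clear error instead of a shape mismatch at load time.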
vllm-ascend-ci	3a2a7d88db	[Doc] Update accuracy reports for v0.10.1rc1 (#2755 ) The accuracy results running on NPU Atlas A2 have changed, updating reports for: All models (Qwen3-30B-A3B, Qwen2.5-VL-7B-Instruct, Qwen3-8B-Base, DeepSeek-V2-Lite) - [Workflow run][1] [1]: https://github.com/vllm-project/vllm-ascend/actions/runs/17459225764 - vLLM version: v0.10.1.1 - vLLM main: `2b30afa442` Signed-off-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com> Co-authored-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>	2025-09-04 22:17:17 +08:00
Mengqing Cao	7e16b4a7cd	[ReleaseNote] Add Release Note for v0.10.1rc1 (#2635 ) Add Release Note for v0.10.1rc1 - vLLM version: v0.10.1.1 - vLLM main: `b5ee1e3261` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-04 11:26:47 +08:00
wangxiyuan	41b028aa5f	[Doc] add v0.9.1 release note (#2646 ) Add release note for 0.9.1 - vLLM version: v0.10.1.1 - vLLM main: `8bd5844989` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-03 18:04:27 +08:00
panchao-hub	ea53f9076e	support torchair mode (#2641 ) ### What this PR does / why we need it? support torchair mode ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `5438967fbc` Signed-off-by: zhangdepeng <zhangdepeng2@huawei.com> Signed-off-by: p00465316 <panchao13@huawei.com> Co-authored-by: zhangdepeng <zhangdepeng2@huawei.com>	2025-09-01 15:49:07 +08:00
lidenghui1110	600b08f754	[Feat]: Add custom lmhead tensor model parallel (#2309 ) ### What this PR does / why we need it? This PR introduces LMhead tensor model parallelism to reduce memory consumption and improve TPOT performance. It supports both eager mode and graph mode. In a deepseek r1 w8a8 PD disaggregated Decode instance using pure DP with lmhead_tensor_parallel_size = 8, TPOT improves by 1 ms and 1.48 GB of NPU memory is saved per rank. performance data: <img width="1444" height="438" alt="image" src="https://github.com/user-attachments/assets/3c5ef0d3-a7c7-46fd-9797-4de728eb0cb0" /> ### Does this PR introduce _any_ user-facing change? This PR introduces one new config in `additional_config`. \| Name \| Effect \| Required \| Type \| Constraints \| \| :---------------------------- \| :--------------------------------------- \| :------- \| :--- \| :----------------- \| \| lmhead_tensor_parallel_size \| Split the lm_head matrix along the column dimension (vocab_size) into lmhead_tensor_parallel_size pieces \| No \| int \| Default value is None. Once this value is set, the feature is enabled; vocab_size must be divisible by this value. \| example `--additional_config={"lmhead_tensor_parallel_size": 8}` ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `de533ab2a1` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: zhangzihang <zzh_201018@outlook.com>	2025-08-29 11:41:21 +08:00
LeeWenquan	c8d1df3a3f	[Refactor][WIP] Refactor mla_v1 by moving all MLA preprocessing ops into mla_v1 attention impl (#2465 ) ### What this PR does / why we need it? To support fused kernels, multi-stream execution, communication optimization, etc., it is better to aggregate all operations in the Attention layer together. This PR refactors mla_v1 by moving all MLA preprocessing ops into the mla_v1 attention impl. Note that the new mla_v1 doesn't take torchair into consideration, so this PR can only be merged after the torchair-related mla_v1 is isolated into a new file. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ### Features Test <img width="506" height="141" alt="image" src="https://github.com/user-attachments/assets/f1ab2906-a1ac-4450-8433-94811cd89466" /> ### Performance After Refactor <img width="648" height="486" alt="image" src="https://github.com/user-attachments/assets/e33e038c-c5d9-4ba7-a8e9-1ac22f9833eb" /> ### Performance Before Refactor <img width="618" height="494" alt="image" src="https://github.com/user-attachments/assets/83861dc2-dc51-4af3-9310-90ab10c43bb1" /> - vLLM version: v0.10.1.1 - vLLM main: `e03940762b` --------- Signed-off-by: lwq <liwenquan5@huawei.com> Signed-off-by: whx-sjtu <2952154980@qq.com> Signed-off-by: SunnyLee219 <3294305115@qq.com> Co-authored-by: lwq <liwenquan5@huawei.com> Co-authored-by: whx-sjtu <2952154980@qq.com>	2025-08-28 10:35:57 +08:00
Li Wang	516e14ae6a	[Doc] Upgrade to multi-node tutorial model to deepseek-v3.1-w8a8 (#2553 ) ### What this PR does / why we need it? Upgrade the multi-node tutorial model to deepseek-v3.1-w8a8. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `de02b07db4` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-27 14:16:44 +08:00
Li Wang	042605f4b2	[Doc] Add stable modelslim branch (#2545 ) ### What this PR does / why we need it? The branch `br_release_MindStudio_8.1.RC2_TR5_20260624` is the commercial delivery version of modelslim for Q3 and has been verified as available. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `7d67a9d9f9` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-27 09:05:46 +08:00
Shanshan Shen	334c44613a	[Doc] Update release version info (#2518 ) ### What this PR does / why we need it? Update release version info. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `712d0f88d8` Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>	2025-08-25 15:39:10 +08:00
Shanshan Shen	98c68220c1	[Doc] Update `v0.9.1rc3` doc (#2512 ) ### What this PR does / why we need it? Update the `v0.9.1rc3` doc as a supplement to https://github.com/vllm-project/vllm-ascend/pull/2488. - vLLM version: v0.10.0 - vLLM main: `170e8ea9ea` Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>	2025-08-25 11:39:29 +08:00
Mengqing Cao	4c4ffeebe5	[Doc] update vllm version in ci (#2513 ) ### What this PR does / why we need it? update vllm version in ci - vLLM version: v0.10.0 - vLLM main: `170e8ea9ea` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-25 11:35:37 +08:00


238 Commits