xc-llm-ascend

Author	SHA1	Message	Date
Csrayz	80524f5711	[CORE] concurrent partial prefills (#2372 ) # What this PR does / why we need it? When processing a mix of large and small requests, the TTFT of responses is significantly reduced. Please refer to https://github.com/vllm-project/vllm/pull/10235, which achieves the same effect by simply limiting the number of prompt fills for long requests. This solution can be applied to both AscendScheduler (V0) and vLLM Scheduler (V1). Tests show that TTFT can be significantly improved when handling such mixed requests. However, this capability is currently missing when Ascend Scheduler is enabled. This benchmark used the Qwen3-8B model, with a context length of 128K, running on a single card. Regarding dataset selection, the sharegpt_clean dataset is used, with its content concatenated and cropped. Small requests with token=50 and medium requests with token=10240 were constructed (there were also large requests with token=102400, but these were ignored because when using the Prefill First scheduling strategy, max_num_batched_tokens will not be set to such a large value). When loading vLLM, set max_num_batched_tokens=22000. This length can accommodate two medium-sized requests and some short requests, reflecting an extreme scenario where the budget is almost entirely occupied by longer requests. Next, we mix 990 small requests and 100 medium requests into one type of load scenario (hereinafter referred to as 10%), and similarly generate load scenarios with 5% and 1% medium requests. Performance tests were conducted separately with vLLMScheduler, AscendScheduler, and AscendScheduler (long prompt concurrency set to 1) enabled. - vLLM version: v0.10.2 - vLLM main: `1dfea5f4a9` --------- Signed-off-by: Csrayz <jover@cmbchina.com>	2025-09-24 17:12:55 +08:00
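The long-prompt concurrency limit described in #2372 can be illustrated with a minimal sketch. This is not the AscendScheduler implementation: `Request`, `select_prefills`, `long_prompt_threshold` and `max_long_prefills` are hypothetical names, and the 4096-token threshold is an assumed value. The request sizes (50 and 10240 tokens) and the 22000-token budget are taken from the benchmark description in the commit message.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    num_prompt_tokens: int


def select_prefills(waiting, token_budget, long_prompt_threshold, max_long_prefills):
    """Pick requests for one prefill step under a token budget.

    At most `max_long_prefills` requests whose prompt is longer than
    `long_prompt_threshold` are admitted per step; skipped requests stay queued.
    """
    scheduled, remaining = [], []
    long_count = 0
    for req in waiting:
        is_long = req.num_prompt_tokens > long_prompt_threshold
        fits = req.num_prompt_tokens <= token_budget
        if fits and not (is_long and long_count >= max_long_prefills):
            scheduled.append(req)
            token_budget -= req.num_prompt_tokens
            long_count += int(is_long)
        else:
            remaining.append(req)
    return scheduled, remaining


# Two medium requests followed by four small ones (sizes from the benchmark).
waiting = [Request("m0", 10240), Request("m1", 10240)] + [
    Request(f"s{i}", 50) for i in range(4)
]
scheduled, remaining = select_prefills(
    waiting,
    token_budget=22000,
    long_prompt_threshold=4096,  # assumed value, not from the PR
    max_long_prefills=1,         # "long prompt concurrency set to 1"
)
print([r.request_id for r in scheduled])  # ['m0', 's0', 's1', 's2', 's3']
print([r.request_id for r in remaining])  # ['m1']
```

With the limit set to 1, the second medium request waits for the next step, so the small requests are not queued behind two long prefills.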
Jianwei Mao	d586255678	fix wrong --num-gpus parameter requirements, and avoid ambiguity (#3116 ) fix the problem of https://github.com/vllm-project/vllm-ascend/issues/3114 - vLLM version: v0.10.2 - vLLM main: `5aeb925452` Signed-off-by: Jianwei Mao <maojianwei2012@126.com>	2025-09-23 11:58:44 +08:00
Li Wang	02f89d166f	[CI] Update vllm version to 20250922(5aeb925) (#3091 ) ### What this PR does / why we need it? This PR bumps the vllm commit hash to `5aeb925452` and fixes the following issues: 1. https://github.com/vllm-project/vllm/pull/25345 has removed v0 metadata 2. https://github.com/vllm-project/vllm/pull/25332 3. https://github.com/vllm-project/vllm/pull/25334 4. https://github.com/vllm-project/vllm/pull/23558. Note that this vllm commit updates the model registration logic to check that every registered model has the `vllm.model_executor.models` path, which breaks our custom registration of the deepseek_v3 model (it doesn't exist in the vllm model path), so I moved the deepseek_v3 model registration to deepseek_v2 as a temporary solution ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `9607d5eb44` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-09-22 22:18:13 +08:00
whx	0a526768f5	[Feature] Support moe multi-stream for aclgraph. (#2946 ) This PR puts the calculation of shared experts into a separate stream, overlapping with routing experts. - vLLM version: v0.10.2 - vLLM main: `fbd6523ac0` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-09-19 11:06:45 +08:00
LeeWenquan	f4e3d22432	Remove chunked_prefill_for_mla and fix ring_mla bug (#2781 ) ### What this PR does / why we need it? Remove the chunked prefill for mla branch in mla, and change the dtype of prefill_mask to avoid an accuracy problem ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `ef7eefe17a` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2025-09-18 19:43:26 +08:00
Li Wang	4267f5d55f	[Doc] Add multi-node ray backend tutorial (#2376 ) ### What this PR does / why we need it? Add multi-node ray backend tutorial for Qwen235B-A3B ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `f4cd80f944` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-09-18 15:30:18 +08:00
1Fire4	1f6465c399	Add an option of enable frozen parameter (#2869 ) ### What this PR does / why we need it? Add an option of enable frozen parameter ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `68dbde5dbb` Signed-off-by: 1Fire4 <wangdingyi2@huawei.com>	2025-09-17 12:00:44 +08:00
offline893	76844eec78	Dynamic Expert Load Balance with Zero-like-overhead (#2956 ) ### Motivation Currently, dynamic expert load balancing stops the world. Asynchronous expert load balancing would avoid the following problems: Host-bound latency: EPLB involves many CPU operations, such as running the EPLB algorithm, creating p2p ops, and converting log2phy expert maps, which together take a long CPU time (about 1 s). Communication latency: transfers are costly without NVLink. Because the weights of one expert may be transferred to multiple new positions, one expert can require N send/recv operations, resulting in long latency. We measured that batch_isend_irecv costs more than 100 ms to transmit the weights of 16 experts on an Ascend A2 server. SwiftBalancer no longer stops the world: in our test on NPU it costs 1~2 ms per layer while improving decode latency by 5 ms-8 ms with ep_size = 64. The following updates have been made: 1. Expert distribution recording with lower cost. 2. Asynchronous CPU computation for the EPLB algorithm and other Python operators. 3. A new EPLB algorithm that rebalances fewer experts with almost the same effect. ### Proposed Change We will gradually migrate the EPLB logic to the vLLM community and implement a generalized design. Relevant RFC: https://github.com/vllm-project/vllm/issues/22246 The overall workflow involves: <img width="801" height="302" alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c" src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed" /> 1. Record the expert distribution during forward. We use expert_token_num after dispatch instead of topk_ids, which gives a much smaller tensor shape and reduces the cost of HBM recording and the add operator. 2. Do an all-gather for the expert distribution. All-gather is used instead of all-reduce because it has less traffic volume. 3. Wake up the EPLB worker process with the expert distribution when num_iterations is reached. Run the EPLB algorithm in the EPLB worker. 4. Generate p2p send/recv ops and other operators such as log2phy, which cost a long CPU time. 5. Launch ibatch_send_recv in async_stream before forward. 6. After forward, wait for ibatch_send_recv to finish, then update the expert map and expert weights. ### Co-author Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn Co-authored-by: qmkakaxi wjh1594260677@qq.com Co-authored-by: Skywalker-EP 173723846@qq.com - vLLM version: v0.10.2 - vLLM main: `567939953b` --------- Signed-off-by: offline0806 <z00858301@china.huawei.com> Co-authored-by: offline0806 <z00858301@china.huawei.com>	2025-09-17 10:36:43 +08:00
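Steps 1 and 2 of the workflow in #2956 can be sketched in plain Python. This only illustrates why a per-expert token count is smaller than the raw top-k ids; it is not the code from the PR. `expert_token_counts` and `combine_ranks` are hypothetical helpers, the sample routing data is made up, and a list of per-rank lists stands in for the result of an all-gather.

```python
def expert_token_counts(topk_ids, num_experts):
    """Reduce a (num_tokens, top_k) list of expert ids to one count per expert."""
    counts = [0] * num_experts
    for token_experts in topk_ids:
        for expert_id in token_experts:
            counts[expert_id] += 1
    return counts


def combine_ranks(per_rank_counts):
    """Sum the gathered per-rank counts into a global expert load.

    `per_rank_counts` stands in for what every rank would hold after an
    all-gather of its local counts.
    """
    return [sum(column) for column in zip(*per_rank_counts)]


# Made-up routing results for two ranks, 4 experts, top_k = 2.
rank0_topk = [[0, 1], [1, 2], [1, 3]]
rank1_topk = [[0, 3], [3, 2]]

rank0_counts = expert_token_counts(rank0_topk, num_experts=4)
rank1_counts = expert_token_counts(rank1_topk, num_experts=4)
global_load = combine_ranks([rank0_counts, rank1_counts])
print(rank0_counts)  # [1, 3, 1, 1]
print(global_load)   # [2, 3, 2, 3]
```

The recorded tensor has one entry per expert regardless of how many tokens were routed, which is the size reduction the commit message refers to.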
Yikun Jiang	0aba644633	Update max_tokens and prompt in qwen3 online doc (#2945 ) ### What this PR does / why we need it? Update max_tokens and prompt in qwen3 online doc Before: ``` "'max_tokens' or 'max_completion_tokens' is too large: 4096. This model's maximum context length is 4096 tokens and your request has 18 input tokens (4096 > 4096 - 18). None" ``` After: ``` curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "/root/.cache/modelscope/hub/models/Qwen-SGlang/Qwen3-Next-80B-A3B-Instruct", "messages": [ {"role": "user", "content": "Who are you?"} ], "temperature": 0.6, "top_p": 0.95, "top_k": 20, "max_tokens": 32 }' .{"id":"chatcmpl-8ddbd65c9ddc405397219a6792feb9a0","object":"chat.completion","created":1757985049,"model":"/root/.cache/modelscope/hub/models/Qwen-SGlang/Qwen3-Next-80B-A3B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am designed to assist you in generating various","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":12,"total_tokens":44,"completion_tokens":32,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Manually test on my local env - CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-16 09:27:50 +08:00
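The "Before" error in #2945 follows from simple token arithmetic: the requested completion length must fit in what remains of the context window after the prompt. A minimal sketch of that check, assuming a hypothetical helper `fits_context` (this is not vLLM's validation code), with the numbers taken from the error message and the "After" response:

```python
def fits_context(max_tokens, max_model_len, num_prompt_tokens):
    """Return True if the completion budget fits in the remaining context."""
    return max_tokens <= max_model_len - num_prompt_tokens


# Before: max_tokens=4096 with an 18-token prompt and a 4096-token context.
print(fits_context(4096, 4096, 18))  # False, since 4096 > 4096 - 18
# After: max_tokens=32 with the 12-token prompt reported in the response usage.
print(fits_context(32, 4096, 12))    # True
```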
wangxiyuan	048bfd5553	[Release] Add release note for v0.10.2rc1 (#2921 ) Add release note for v0.10.2rc1 - vLLM version: v0.10.2 - vLLM main: `b834b4cbf1` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-16 01:20:05 +08:00
Yikun Jiang	b5ccef6115	[Doc] Add doc for Qwen3 Next (#2916 ) ### What this PR does / why we need it? Add doc for Qwen3 Next ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Doc CI passed Related: https://github.com/vllm-project/vllm-ascend/issues/2884 - vLLM version: v0.10.2 - vLLM main: `01413e0cf5` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-16 01:16:06 +08:00
Yikun Jiang	0747a6e68c	Bump vLLM version to v0.10.2 (#2914 ) ### What this PR does / why we need it? Bump vLLM version to v0.10.2 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: v0.10.2rc3 - vLLM main: `15b8fef453` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-14 06:57:59 +08:00
Yikun Jiang	f97a64ba7f	Bump vLLM version to v0.10.2rc3 (#2911 ) ### What this PR does / why we need it? Bump vLLM version to v0.10.2rc3 https://github.com/vllm-project/vllm/compare/v0.10.2rc2...v0.10.2rc3 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: v0.10.2rc2 - vLLM main: `15b8fef453` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-13 19:15:48 +08:00
Yikun Jiang	8ece6956e7	Revert "Upgrade CANN version to 8.3.rc1.alpha001 (#2903 )" (#2909 ) ### What this PR does / why we need it? This reverts commit `339fceb89c`. ### Does this PR introduce _any_ user-facing change? Yes, use 8.2rc1 image by default ### How was this patch tested? CI passed - vLLM version: v0.10.2rc2 - vLLM main: `cfa3234a5b` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-13 16:21:54 +08:00
Yikun Jiang	339fceb89c	Upgrade CANN version to 8.3.rc1.alpha001 (#2903 ) ### What this PR does / why we need it? Upgrade CANN version to 8.3.rc1.alpha001 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.2rc2 - vLLM main: `89e08d6d18` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-13 12:10:21 +08:00
Yikun Jiang	138e932630	Bump vLLM version to v0.10.2rc2 (#2902 ) ### What this PR does / why we need it? Upgrade vLLM version to 0.10.2rc2 ### Does this PR introduce _any_ user-facing change? Yes, image will use 0.10.2rc2 vLLM ### How was this patch tested? - vLLM version: main - vLLM main: `f17c075884` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-13 11:39:48 +08:00
CaranLic	168ad600b5	[main] add pd transfer for ascend scheduler (#2753 ) ### What this PR does / why we need it? For offline scenarios, adjust the scheduling process to prioritize the prefill phase of all requests, then process the decode phase of all requests. ### How was this patch tested? ``` max_num_seqs=24, additional_config={ "ascend_scheduler_config":{ "enabled": True, "enable_pd_transfer": True, "decode_max_num_seqs": 24, "enable_chunked_prefill": False } }, ``` \| input \| output \| num prompts \| max_num_seqs \| dp \| tp \| scheduler \| tps \| \| ------ \| ------ \| ---------- \| ---------------- \| ---- \| ---- \| ---------------- \| --------------- \| \| dapo-math-17K \| 2K \| 384 \| 24 \| 2 \| 1 \| v1 \| 234.06 \| \| dapo-math-17K \| 2K \| 384 \| 24 \| 2 \| 1 \| pd transfer \| 239.59(+2.4%) \| \| dapo-math-17K\| 2K \| 384 \| 24 \| 4 \| 1 \| v1 \| 222.85 \| \| dapo-math-17K\| 2K \| 384 \| 24 \| 4 \| 1 \| pd transfer \| 225.81(+1.3%) \| - vLLM version: v0.10.1.1 - vLLM main: `6fb2788163` --------- Signed-off-by: CaranLic <740821011@qq.com>	2025-09-10 08:46:39 +08:00
Mengqing Cao	edf1f600ad	[CI] Remove compatibility maintenance for vllm v0.10.1 and v0.10.1.1 (#2840 ) ### What this PR does / why we need it? Remove compatibility maintenance for vllm v0.10.1 and v0.10.1.1 ### Does this PR introduce _any_ user-facing change? branch main of vllm-ascend will not be compatible with vllm v0.10.1 and v0.10.1.1 ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.1.1 - vLLM main: `6fb2788163` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-10 08:43:10 +08:00
yupeng	a746f8274f	[DOC] Qwen3 PD disaggregation user guide (#2751 ) ### What this PR does / why we need it? This PR adds the deployment guide for prefiller and decoder disaggregation. The scenario of the guide is: - Use 3 nodes in total, with 2 NPUs on each node - Qwen3-30B-A3B - 1P2D - Expert Parallel The deployment can be used to verify the PD Disaggregation / Expert Parallel features with slightly fewer resources. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. - vLLM version: v0.10.1.1 - vLLM main: `e599e2c65e` --------- Signed-off-by: paulyu12 <507435917@qq.com>	2025-09-07 10:35:37 +08:00
Yikun Jiang	752e272a55	Add note for Ascend HDK version (#2765 ) ### What this PR does / why we need it? Add note for Ascend HDK version ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: v0.10.1.1 - vLLM main: `e599e2c65e` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-07 10:33:41 +08:00
lidenghui1110	5a7181569c	[feat]: oproj tensor parallelism in pure DP and graph-mode scenarios. (#2167 ) ### What this PR does / why we need it? This PR introduces Oproj matrix tensor model parallelism to reduce memory consumption. It only supports graph mode in the pure DP scenario. In a deepseek r1 w8a8 PD disaggregated Decode instance using pure DP, with oproj_tensor_parallel_size = 8, TPOT increases by 1 ms while 5.8 GB of NPU memory is saved per rank. The best performance was obtained with oproj_tensor_parallel_size=4, with no TPOT increase. performance data: <img width="1442" height="442" alt="image" src="https://github.com/user-attachments/assets/83270fc5-868a-4387-b0a9-fac29b4a376d" /> ### Does this PR introduce _any_ user-facing change? This PR introduces one new config in `additional_config`. \| Name \| Effect \| Required \| Type \| Constraints \| \| :---------------------------- \| :--------------------------------------- \| :------- \| :--- \| :----------------- \| \| oproj_tensor_parallel_size \| Split the o_proj matrix along the row dimension (head num * head dim) into oproj_tensor_parallel_size pieces. \| No \| int \| Default is None. Once this value is set, the feature is enabled; head num * head dim must be divisible by this value. \| example `--additional_config={"oproj_tensor_parallel_size": 8}` ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `eddaafc1c7` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: zzh <zzh_201018@outlook.com>	2025-09-07 10:31:32 +08:00
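The divisibility constraint on `oproj_tensor_parallel_size` in #2167 can be checked with a short sketch. `oproj_rows_per_rank` is a hypothetical helper and the head count and head dimension below are assumed example values, not figures from the PR; only the rule that head num * head dim must be divisible by the setting comes from the commit message.

```python
def oproj_rows_per_rank(num_heads, head_dim, oproj_tensor_parallel_size):
    """Return how many o_proj rows each rank holds after the split.

    Raises ValueError when head num * head dim is not divisible by the
    requested parallel size, mirroring the documented constraint.
    """
    total_rows = num_heads * head_dim
    if total_rows % oproj_tensor_parallel_size != 0:
        raise ValueError(
            f"{total_rows} rows cannot be split evenly into "
            f"{oproj_tensor_parallel_size} pieces"
        )
    return total_rows // oproj_tensor_parallel_size


# Assumed example shape: 128 heads with head_dim 128.
print(oproj_rows_per_rank(128, 128, 8))  # 2048
print(oproj_rows_per_rank(128, 128, 4))  # 4096
```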
vllm-ascend-ci	3a2a7d88db	[Doc] Update accuracy reports for v0.10.1rc1 (#2755 ) The accuracy results running on NPU Atlas A2 have changed, updating reports for: All models (Qwen3-30B-A3B, Qwen2.5-VL-7B-Instruct, Qwen3-8B-Base, DeepSeek-V2-Lite) - [Workflow run][1] [1]: https://github.com/vllm-project/vllm-ascend/actions/runs/17459225764 - vLLM version: v0.10.1.1 - vLLM main: `2b30afa442` Signed-off-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com> Co-authored-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>	2025-09-04 22:17:17 +08:00
Mengqing Cao	7e16b4a7cd	[ReleaseNote] Add Release Note for v0.10.1rc1 (#2635 ) Add Release Note for v0.10.1rc1 - vLLM version: v0.10.1.1 - vLLM main: `b5ee1e3261` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-04 11:26:47 +08:00
wangxiyuan	41b028aa5f	[Doc] add v0.9.1 release note (#2646 ) Add release note for 0.9.1 - vLLM version: v0.10.1.1 - vLLM main: `8bd5844989` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-03 18:04:27 +08:00
panchao-hub	ea53f9076e	support torchair mode (#2641 ) ### What this PR does / why we need it? support torchair mode ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `5438967fbc` Signed-off-by: zhangdepeng <zhangdepeng2@huawei.com> Signed-off-by: p00465316 <panchao13@huawei.com> Co-authored-by: zhangdepeng <zhangdepeng2@huawei.com>	2025-09-01 15:49:07 +08:00
lidenghui1110	600b08f754	[Feat]: Add custom lmhead tensor model parallel (#2309 ) ### What this PR does / why we need it? This PR introduces LMhead tensor model parallelism to reduce memory consumption and improve TPOT. It supports both eager mode and graph mode. In a deepseek r1 w8a8 PD disaggregated Decode instance using pure DP, with lmhead_tensor_parallel_size = 8, TPOT improves by 1 ms and 1.48 GB of NPU memory is saved per rank. performance data: <img width="1444" height="438" alt="image" src="https://github.com/user-attachments/assets/3c5ef0d3-a7c7-46fd-9797-4de728eb0cb0" /> ### Does this PR introduce _any_ user-facing change? This PR introduces one new config in `additional_config`. \| Name \| Effect \| Required \| Type \| Constraints \| \| :---------------------------- \| :--------------------------------------- \| :------- \| :--- \| :----------------- \| \| lmhead_tensor_parallel_size \| Split the lm_head matrix along the column dimension (vocab_size) into lmhead_tensor_parallel_size pieces \| No \| int \| Default is None. Once this value is set, the feature is enabled; vocab_size must be divisible by this value. \| example `--additional_config={"lmhead_tensor_parallel_size": 8}` ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `de533ab2a1` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: zhangzihang <zzh_201018@outlook.com>	2025-08-29 11:41:21 +08:00
LeeWenquan	c8d1df3a3f	[Refactor][WIP] Refactor mla_v1 by moving all MLA preprocessing ops into mla_v1 attention impl (#2465 ) ### What this PR does / why we need it? In order to support fused kernels, multi-stream, communication optimization, etc., it's better to aggregate all operations in the Attention layer together. This PR tries to refactor mla_v1 by moving all MLA preprocessing ops into the mla_v1 attention impl. Note that the new mla_v1 doesn't take torchair into consideration, so this PR can only be merged after the torchair-related mla_v1 is isolated into a new file. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? ### Features Test <img width="506" height="141" alt="image" src="https://github.com/user-attachments/assets/f1ab2906-a1ac-4450-8433-94811cd89466" /> ### Performance After Refactor <img width="648" height="486" alt="image" src="https://github.com/user-attachments/assets/e33e038c-c5d9-4ba7-a8e9-1ac22f9833eb" /> ### Performance Before Refactor <img width="618" height="494" alt="image" src="https://github.com/user-attachments/assets/83861dc2-dc51-4af3-9310-90ab10c43bb1" /> - vLLM version: v0.10.1.1 - vLLM main: `e03940762b` --------- Signed-off-by: lwq <liwenquan5@huawei.com> Signed-off-by: whx-sjtu <2952154980@qq.com> Signed-off-by: SunnyLee219 <3294305115@qq.com> Co-authored-by: lwq <liwenquan5@huawei.com> Co-authored-by: whx-sjtu <2952154980@qq.com>	2025-08-28 10:35:57 +08:00
Li Wang	516e14ae6a	[Doc] Upgrade to multi-node tutorial model to deepseek-v3.1-w8a8 (#2553 ) ### What this PR does / why we need it? Upgrade to multi-node tutorial model to deepseek-v3.1-w8a8 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `de02b07db4` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-27 14:16:44 +08:00
Li Wang	042605f4b2	[Doc] Add stable modelslim branch (#2545 ) ### What this PR does / why we need it? The branch `br_release_MindStudio_8.1.RC2_TR5_20260624` is the commercial delivery version of modelslim for Q3 and has been verified to be available ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `7d67a9d9f9` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-27 09:05:46 +08:00
Shanshan Shen	334c44613a	[Doc] Update release version info (#2518 ) ### What this PR does / why we need it? Update release version info. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `712d0f88d8` Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>	2025-08-25 15:39:10 +08:00
Shanshan Shen	98c68220c1	[Doc] Update `v0.9.1rc3` doc (#2512 ) ### What this PR does / why we need it? Update `v0.9.1rc3` doc, which are supplements to https://github.com/vllm-project/vllm-ascend/pull/2488. - vLLM version: v0.10.0 - vLLM main: `170e8ea9ea` Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>	2025-08-25 11:39:29 +08:00
Mengqing Cao	4c4ffeebe5	[Doc] update vllm version in ci (#2513 ) ### What this PR does / why we need it? update vllm version in ci - vLLM version: v0.10.0 - vLLM main: `170e8ea9ea` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-25 11:35:37 +08:00
Shanshan Shen	f0be3eed84	[Doc] Add release note for `v0.9.1rc3` (#2488 ) ### What this PR does / why we need it? Add release note for `v0.9.1rc3`. - vLLM version: v0.10.0 - vLLM main: `53415653ff` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-08-22 16:06:29 +08:00
LookAround0301	e9fb895b10	[Doc] Add feature branch long_seq_optimization (#2477 ) ### What this PR does / why we need it? Add cp/sp feature branch - vLLM version: v0.10.0 - vLLM main: `0c6e40bbaa` Signed-off-by: LookAround <lixushi@huawei.com>	2025-08-22 08:53:12 +08:00
Yikun Jiang	67a222c383	[Doc] Add feature branch policy (#2432 ) ### What this PR does / why we need it? This patch adds the feature branch policy. After this patch, maintainers are allowed to create a feature branch. Feature branches are used for collaboration and must include an RFC link, merge plan and mentor info. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: v0.10.0 - vLLM main: `7be5d113d8` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-08-21 10:37:21 +08:00
yupeng	973a7cfdf0	[DOC] update doc: LoRA with ACLGraph (#2430 ) ### What this PR does / why we need it? Update DOC. Guide users to run LoRA with ACLGraph. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. - vLLM version: v0.10.0 - vLLM main: `de7b67a023` --------- Signed-off-by: paulyu12 <507435917@qq.com>	2025-08-21 08:55:55 +08:00
Wang Kunpeng	1de16ead8e	[main][bugfix] Modify the default value of enable_shared_expert_dp to false (#2457 ) ### What this PR does / why we need it? enable_shared_expert_dp is currently on by default. This optimization is currently only valid for deepseek series models, and enabling it by default affects the accuracy of the qwen series models. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? use parameter --additional_config='{"enable_shared_expert_dp": true}' - vLLM version: v0.10.0 - vLLM main: `d983769c41` Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-08-20 20:25:53 +08:00
Jade Zheng	955411611c	Nominate Mengqing Cao as vllm-ascend maintainer (#2433 ) I would like to nominate Mengqing Cao (@MengqingCao https://github.com/MengqingCao) as a maintainer, starting with my +1. ## Reason Review Quality‌: She has completed [120+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Aclosed+commenter%3Amengqingcao+-author%3Amengqingcao) since Feb. 2025, include [#review-3077842852](https://github.com/vllm-project/vllm-ascend/pull/2088#pullrequestreview-3077842852), [comment-2990074116](https://github.com/vllm-project/vllm-ascend/pull/1032#issuecomment-2990074116), [comment-2921063723](https://github.com/vllm-project/vllm-ascend/pull/1013#issuecomment-2921063723) high quality review. Sustained and Quality Contributions: She has Deep understanding of ‌vLLM‌ and ‌vLLM Ascend‌ codebases and solid contributions include The vLLM contributions and help vLLM Ascend release is the main reason I nominated her: - vLLM: Things worth mentioning that she completed [28+ PR contributions](https://github.com/vllm-project/vllm/pulls?q=is%3Apr+author%3AMengqingCao+is%3Amerged+) in vllm-project/vllm, especially for vLLM platform module to improve vLLM mult hardware support. She is one of the important co-authors of [vllm#8054](https://github.com/vllm-project/vllm/pull/8054) and hardware plugin RFC, this makes vllm-ascend plugin possible. Community Involvement: She is also very active and involved in [60+ issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aclosed%20-author%3AMengqingCao%20commenter%3AMengqingCao). So I think she's a great addition to the vLLM Ascend Maintainer team. - ✅Review Quality‌: She has completed 120+ reviews since Feb. 2025. 
https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Aclosed+commenter%3Amengqingcao+-author%3Amengqingcao, include https://github.com/vllm-project/vllm-ascend/pull/2088#pullrequestreview-3077842852, https://github.com/vllm-project/vllm-ascend/pull/1446#issuecomment-3015166908, https://github.com/vllm-project/vllm-ascend/pull/1032#issuecomment-2990074116, https://github.com/vllm-project/vllm-ascend/pull/1013#issuecomment-2921063723 quality review. - ✅Sustained Contributions: 99+ PR merged in vllm-project/vllm-ascend https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3AMengqingCao+is%3Amerged - ✅Quality Contribution‌: She is one of the important co-authors of https://github.com/vllm-project/vllm/pull/8054 , this makes vllm-ascend plugin possible. Things worth mentioning that she complete 28+ PR contributions in vllm-project/vllm, especially for vLLM platform module to improve vLLM mult hardware support: https://github.com/vllm-project/vllm/pulls?q=is%3Apr+author%3AMengqingCao+is%3Amerged+. At 2025 Q2, She also lead the [[RFC]: E2E CI test for key features](https://github.com/vllm-project/vllm-ascend/issues/413) and [[RFC]: Unit test coverage improvement](https://github.com/vllm-project/vllm-ascend/issues/1298) to help vllm ascend improve the coverage. Her main contributions focus on the adaptation of parallel strategies and communicator, such as https://github.com/vllm-project/vllm-ascend/pull/1800, https://github.com/vllm-project/vllm-ascend/pull/1856. These contributions are sufficient to prove she has “Deep understanding of ‌vLLM‌ and ‌vLLM Ascend‌ codebases” - ✅Community Involvement‌: Involved in 63+ issue reviewer https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aclosed%20-author%3AMengqingCao%20commenter%3AMengqingCao She led the v0.10.1 release as release manager - vLLM version: v0.10.0 - vLLM main: `78dba404ad` Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-08-19 14:13:54 +08:00
wangxiyuan	6335fe39ea	Nominate ApsarasX as vllm-ascend maintainer (#2419 ) I would like to nominate Wengang Chen (@ApsarasX https://github.com/ApsarasX) as a maintainer, starting with my +1. ## Reason Review Quality‌: He focuses on the vLLM Ascend Core module review with 100+ high quality review, such as [#2326 (comment)](https://github.com/vllm-project/vllm-ascend/pull/2326#discussion_r2268509365), [#768 (comment)](https://github.com/vllm-project/vllm-ascend/pull/768#discussion_r2075278516), [#2312 (comment)](https://github.com/vllm-project/vllm-ascend/pull/2312#issuecomment-3174677159), [#2268 (comment)](https://github.com/vllm-project/vllm-ascend/pull/2268#discussion_r2260920578), [#2192 (comment)](https://github.com/vllm-project/vllm-ascend/pull/2192#issuecomment-3149414586), [#2156 (comment)](https://github.com/vllm-project/vllm-ascend/pull/2156#discussion_r2249096673). This helped vLLM Ascend v0.9.x and v0.10.x to be released with high quality. Sustained and Quality Contributions: He has a very good habit of sharing his design ideas, development process, performance test results, such as [#966](https://github.com/vllm-project/vllm-ascend/pull/966), he contributed [many PRs](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3AApsarasX+is%3Amerged+), valuable bugfixes and also perf improvements. Community Involvement: Active involved in community discussion, he is collaborative and helps the users solve problems, involved in [120+ PR and issues](https://github.com/vllm-project/vllm-ascend/issues?q=commenter%3AApsarasX). He is also the speaker of [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/7n8OYNrCC_I9SJaybHA_-Q). So I think he's a great addition to the vLLM Ascend Maintainer team. 
- ✅Review Quality‌: 108+ PR with valuable review https://github.com/vllm-project/vllm-ascend/pulls?q=commenter%3AApsarasX with many valuable review, like https://github.com/vllm-project/vllm-ascend/pull/2326#discussion_r2268509365 https://github.com/vllm-project/vllm-ascend/pull/768#discussion_r2075278516 https://github.com/vllm-project/vllm-ascend/pull/2312#issuecomment-3174677159 https://github.com/vllm-project/vllm-ascend/pull/2268#discussion_r2260920578 https://github.com/vllm-project/vllm-ascend/pull/2192#issuecomment-3149414586 https://github.com/vllm-project/vllm-ascend/pull/2156#discussion_r2249096673 - ✅ Sustained and Major Contributions https://github.com/vllm-project/vllm-ascend/pulls/ApsarasX - ✅ Quality Contribution‌: https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3AApsarasX+is%3Aclosed Good quality with well documents [Perf] Refactor tensor disposal logic to reduce memory usage https://github.com/vllm-project/vllm-ascend/pull/966 - ✅Community Involvement‌: 7 issue: https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aclosed%20author%3AApsarasX - 120+ PR and issue: https://github.com/vllm-project/vllm-ascend/issues?q=commenter%3AApsarasX Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-19 10:44:35 +08:00
TaoYu Chen	9e7c168d99	Add ModelRunner_prepare_inputs doc (#1493 ) ### What this PR does / why we need it? To help more developers quickly get started with vLLM, we need to write clear and easy-to-understand code documentation and technical interpretations. This will effectively lower the learning curve, attract more excellent contributors, and collectively build a better developer community. Add ModelRunner_prepare_inputs doc ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Pass CI - vLLM version: v0.10.0 - vLLM main: `4be02a3776` --------- Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com>	2025-08-18 15:41:24 +08:00
Li Wang	2ad7e1251e	[Doc] Fix quant documentation to make it reproducible (#2277 ) ### What this PR does / why we need it? Fixed the expression of msit for code clone - vLLM version: v0.10.0 - vLLM main: `afa5b7ca0b` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-14 17:19:47 +08:00
jack	8bfd16a145	[Doc] Add container image save/load FAQ for offline environments (#2347 ) ### What this PR does / why we need it? Add Docker export/import guide for air-gapped environments ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? NA - vLLM version: v0.10.0 - vLLM main: `d16aa3dae4` Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>	2025-08-13 16:00:43 +08:00
Mengqing Cao	49ec6c98b7	[Doc] Update faq (#2334 ) ### What this PR does / why we need it? - update deterministic calculation - update supported devices ### Does this PR introduce _any_ user-facing change? - Users should update ray and protobuf when using ray as distributed backend - Users should change to use `export HCCL_DETERMINISTIC=true` when enabling deterministic calculation ### How was this patch tested? N/A - vLLM version: v0.10.0 - vLLM main: `ea1292ad3e` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-12 14:12:53 +08:00
Wang Kunpeng	dc585f148a	[main][prefill optimization] Optimize parallel strategies to reduce communication overhead (#2198 ) ### What this PR does / why we need it? 1. Shared Expert Sharding Strategy Update: Switched from TP-aligned to pure DP for shared experts, enabling more efficient execution. 2. O_Proj AllReduce → ReduceScatter: Reduced communication overhead by using ReduceScatter, made possible by pure DP sharding. 3. AllGather Postponed: Delayed to after QKV down projection to reduce synchronization impact during prefill. ### How was this patch tested? Added a UT case in `tests/ut/attention/test_mla_v1.py` #### How to run Use the parameter `--additional_config='{"enable_shared_expert_dp": true}'` ##### a. How to run eager mode e.g.: python -m vllm.entrypoints.openai.api_server --model=/model_path --trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002 --max-model-len 5120 --max-num-batched-tokens 16384 --enforce-eager --disable-log-requests --additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp": true,"chunked_prefill_for_mla":true}' ##### b. How to run graph mode e.g.: python -m vllm.entrypoints.openai.api_server --model=/model_path --trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002 --max-model-len 5120 --max-num-batched-tokens 16384 --disable-log-requests --additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp": true,"chunked_prefill_for_mla":true,"torchair_graph_config":{"enabled":true}}' - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com> Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Co-authored-by: SlightwindSec <slightwindsec@gmail.com>	2025-08-12 14:12:12 +08:00
Mengqing Cao	4604882a3e	[ReleaseNote] Release note of v0.10.0rc1 (#2225 ) ### What this PR does / why we need it? Release note of v0.10.0rc1 - vLLM version: v0.10.0 - vLLM main: `8e8e0b6af1` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-07 14:46:49 +08:00
zhangxinyuehfad	92eebc0c9b	[Doc] Update user guide for supported models (#2263 ) ### What this PR does / why we need it? Update user guide for supported models - vLLM version: v0.10.0 - vLLM main: `4be02a3776` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-08-07 14:39:51 +08:00
22dimensions	440d28a138	[Tutorial] Add qwen3 8b w4a8 tutorial (#2249 ) ### What this PR does / why we need it? Add a new single-NPU quantization tutorial that uses the latest qwen3 model. - vLLM version: v0.10.0 - vLLM main: `8e8e0b6af1` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-08-07 14:39:38 +08:00
zhangxinyuehfad	bcd0b532f5	[Doc] Update user guide for using lm-eval (#1325 ) ### What this PR does / why we need it? Update user guide for using lm-eval 1. Add instructions for using lm-eval with an online server 2. Add instructions for using offline datasets - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-08-07 14:15:49 +08:00
zhangxinyuehfad	dbba3cabb0	[Doc] Update tutorials for single_npu_audio and single_npu_multimodal (#2252 ) ### What this PR does / why we need it? Update tutorials for single_npu_audio and single_npu_multimodal - vLLM version: v0.10.0 - vLLM main: `6b47ef24de` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-08-07 14:08:14 +08:00
Li Wang	bf84f2dbfa	[Doc] Support kimi-k2-w8a8 (#2162 ) ### What this PR does / why we need it? The kimi-k2 model is similar to the deepseek model, so only a few changes are needed to support it. What this PR does: 1. Add kimi-k2-w8a8 deployment doc 2. Update quantization doc 3. Upgrade torchair support list ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-06 19:28:47 +08:00

227 Commits