Commit Graph

1869 Commits

Author SHA1 Message Date
ZT-AIA
24328aaf00 update vllm pin to 12.27 (#5412)
### What this PR does / why we need it?
update vllm pin to 12.27
1. Fix Qwen2-MoE shared_expert_gate: https://github.com/vllm-project/vllm/pull/31339
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
vLLM version: release/v0.13.0
vLLM main:
5326c89803
Co-authored-by: leo-pony <nengjunma@outlook.com>

---------

Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
2025-12-28 00:19:36 +08:00
Mengqing Cao
1b5d5abf86 [ReleaseNote] Add release note for v0.13.0rc1 (#5334)
### What this PR does / why we need it?
Add release note for v0.13.0rc1

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
2025-12-27 18:46:57 +08:00
Li Wang
58adf7c8ac [Bugfix] Correctly handle the output shape in multimodal attention (#5443)
### What this PR does / why we need it?
Fix https://github.com/vllm-project/vllm-ascend/issues/5297: for the
`AscendMMEncoderAttention` forward, we should keep the output shape
consistent with the input.

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-27 18:42:46 +08:00
Li Wang
1d81bfaed1 Fix nightly (#5413)
### What this PR does / why we need it?
This patch mainly does the following:
1. Bugfix for multi_node_tests logs: log names must be unique when
uploading logs.
2. Optimize the `get_cluster_ips` logic and increase the max retry count
for robustness.
3. Temporarily abandon the existing gh-proxy until it is stable
enough.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-27 18:16:46 +08:00
jiangkuaixue123
e91e11d3b0 [bugfix] fix typo of _skip_all_reduce_across_dp_group (#5435)
### What this PR does / why we need it?
Fix typo of `_skip_all_reduce_across_dp_group`.
### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

Signed-off-by: jiangkuaixue123 <jiangxiaozhou111@163.com>
2025-12-27 17:50:04 +08:00
weiguihua2
c30c3dc831 [Doc]modify pcp tutorial doc (#5440)
### What this PR does / why we need it?
Modify the PCP tutorial doc.

Because some optimization points have been submitted as PRs and haven't
been merged yet, I'll update the performance data now and refresh it
again after the PRs are merged.

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-12-27 17:47:09 +08:00
Mengqing Cao
77cd960524 [Misc] fast fail for exiting if tools/install_flash_infer_attention_score_ops_a2.sh (#5422)
### What this PR does / why we need it?
Use `set -euo pipefail` so that
tools/install_flash_infer_attention_score_ops_a2.sh exits if any line fails.

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-27 17:30:34 +08:00
MengLong Chen
b8b5521f5b [Doc] Update DeepSeek V3.1/R1 2P1D doc (#5387)
### What this PR does / why we need it?
The PR updates the documentation for DeepSeek-V3.1 and DeepSeek-R1 in
the scenario of prefill-decode disaggregation.

Updated some PD separation-related setting parameters and optimal
configurations. This script has been verified.

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
2025-12-27 17:28:43 +08:00
cookieyyds
843751768e [DOC]Fix model weight download links (#5436)
Updated download links for DeepSeek-V3.2 model weights.

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
2025-12-27 17:14:31 +08:00
Zhu Yi Lin
04104031d0 [Doc] Modify DeepSeek-R1/V3.1 documentation (#5426)
### What this PR does / why we need it?
Modify DeepSeek-R1/V3.1 documentation. Mainly update the MTP size and some other configs.

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-12-27 17:13:58 +08:00
realliujiaxu
09f71c14a6 Revert "[feat] enable hierarchical mc2 ops on A2 by default (#5300)" (#5434)
We'll release 0.13.0 soon. The main branch is frozen. Let's revert the
newest change and redo it once 0.13.0 is released.

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-12-27 17:06:58 +08:00
realliujiaxu
2add3dc3e0 [Bugfix] fix greedy temperature detection (#5417)
### What this PR does / why we need it?
fix greedy temperature detection from
https://github.com/vllm-project/vllm/pull/27077
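As a rough illustration (this is not the actual vLLM code), the detection boils down to treating only an exact zero temperature as greedy:

```python
# Hypothetical sketch of the check; the real logic lives in vLLM's sampler.
def is_greedy(temperature: float | None) -> bool:
    # Only an exact 0.0 temperature is treated as greedy sampling;
    # near-zero temperatures still take the random-sampling path.
    return temperature is not None and temperature == 0.0
```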

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-12-27 17:04:10 +08:00
Angazenn
eab306b09c [doc] Update Qwen3-235B doc for reproducing latest performance (#5323)
### What this PR does / why we need it?
This PR updates the Qwen3-235B doc to give a simple recipe for reproducing
our latest performance on Atlas A3 servers.

- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
---------
Signed-off-by: Angazenn <supperccell@163.com>
2025-12-27 15:55:58 +08:00
hwhaokun
12da9f9460 [feat] enable hierarchical mc2 ops on A2 by default (#5300)
### What this PR does / why we need it?
Previously, it was necessary to set the environment variables
HCCL_INTRA_PCIE_ENABLE=1 and HCCL_INTRA_ROCE_ENABLE=0. This PR enables
hierarchical MC2 operations on A2 by default.
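For reference, this is roughly the manual setup that used to be required (variable names are taken from this description; the PR now applies the equivalent behavior by default):

```python
import os

# Previously required manual environment setup on A2 before launching vLLM.
os.environ["HCCL_INTRA_PCIE_ENABLE"] = "1"
os.environ["HCCL_INTRA_ROCE_ENABLE"] = "0"
```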

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: hwhaokun <haokun0405@163.com>
Co-authored-by: realliujiaxu <realliujiaxu@163.com>
2025-12-27 15:45:25 +08:00
Zhu Yi Lin
be2a947521 [Doc] delete environment variable HCCL_OP_EXPANSION_MODE in DeepSeekV3.1/R1 (#5419)
### What this PR does / why we need it?
Currently, HCCL_OP_EXPANSION_MODE="AIV" is causing some freezing issues
on A2, so we have temporarily removed it from the documentation.

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-12-27 12:44:50 +08:00
LookAround0301
ca31d6823e [Doc] add long_sequence feature user guide (#5343)
### What this PR does / why we need it?
add long_sequence feature user guide

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: LookAround <lixushi@huawei.com>
2025-12-27 10:44:43 +08:00
hwhaokun
cb2fbf7df2 [bugfix] solve dp scenario Host-Device sync (#5298)
### What this PR does / why we need it?
In the speculative decoding scenario, the original code performs
Host-Device synchronization, which slows down the main model's execution
speed.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: hwhaokun <haokun0405@163.com>
Co-authored-by: realliujiaxu <realliujiaxu@163.com>
2025-12-27 10:36:59 +08:00
weiguihua2
69f96950e1 [Doc] modify pcp tutorials (#5411)
### What this PR does / why we need it?
Modify the PCP tutorials.

Modify the PCP perf statistics and add a note: the context parallel feature
is currently only supported on the Atlas A3 device, and will be supported on
Atlas A2 in the future.

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-12-27 10:36:10 +08:00
whx
3f33ad23fe [BugFix] Fix npu-cpu offloading interface change bug. (#5290)
### What this PR does / why we need it?
Last month the interface of `OffloadingSpec`
changed (https://github.com/vllm-project/vllm/pull/27743). This PR fixes
the resulting bug and adds an e2e test for CPU offloading.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
CI passed with the newly added test.


- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-12-27 10:21:20 +08:00
fems14
2ef4d1979e [bugfix][main]KV Pool for KV Transfer in PD Disaggregation Scenarios (#5398)
### What this PR does / why we need it?
1. Fix the KV pool error for KV transfer in PD disaggregation scenarios.
2. Update the KV pool documentation.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867

---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-12-27 09:53:57 +08:00
weiguihua2
ce52e17bf3 [Doc]add long sequence tutorials (#5364)
### What this PR does / why we need it?
Provide sample guidance for running long-sequence DeepSeek across
multiple nodes

To guide users on using the context parallel feature, a practical
example is provided.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-12-27 09:52:11 +08:00
wangxiyuan
d1f0df7b4b Revert "MLA prefill preformance optimization (#5275)" (#5410)
We'll release 0.13.0 soon. The main branch is frozen. Let's revert the
newest change and redo it once 0.13.0 is released.
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
2025-12-27 09:48:56 +08:00
pichangping
711f1861e4 MLA prefill preformance optimization (#5275)
### What this PR does / why we need it?
Since the _npu_ring_mla operator deteriorates in long-sequence scenarios,
the long sequence is split into shorter sequences for input to improve
performance.
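A minimal sketch of the splitting idea (the chunk length is an arbitrary example, not the value used by the operator):

```python
# Split one long prefill sequence into fixed-size chunks that are fed
# to the attention operator one at a time.
def split_prefill(seq_len: int, chunk_len: int = 4096) -> list[tuple[int, int]]:
    return [(start, min(start + chunk_len, seq_len))
            for start in range(0, seq_len, chunk_len)]

# split_prefill(10000) -> [(0, 4096), (4096, 8192), (8192, 10000)]
```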
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: pichangping <1337510399@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-27 09:19:45 +08:00
jiangyunfan1
1486e0d06c [TEST]Add vllm bench (#5306)
### What this PR does / why we need it?
This PR adds a common `vllm bench` method; we need it to add some test cases
later.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

---------

Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
2025-12-27 09:16:08 +08:00
Zetong Li
16ef2474bf [Test] Add acceptance test for eagle/eagle3 (#5366)
### What this PR does / why we need it?
This PR adds acceptance tests for eagle/eagle3 via llama/qwen. We
obtained golden baselines by running several times (based on a healthy
main), which is feasible and convincing.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
by ci

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: Zetong Li <slippersss@126.com>
2025-12-27 08:50:01 +08:00
Mengqing Cao
8ed6f98a5a [Build] Add installation script of fused_infer_attention_score kernel with flash decoding (#5402)
### What this PR does / why we need it?
Add installation script of `fused_infer_attention_score` kernel with
flash decoding

### User-facing changes
Users can install the `fused_infer_attention_score` kernel with the flash
decoding feature via `bash
tools/install_flash_infer_attention_score_ops_a2.sh` or `bash
tools/install_flash_infer_attention_score_ops_a3.sh`.

- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
2025-12-27 02:01:06 +08:00
Nengjun Ma
f5af6bbd1e [CI] Add qwen-235b-a22b a2 multi-node test (#5393)
### What this PR does / why we need it?
Qwen3-235B-A22B is one of the TopN models, but there are currently no
test cases for the Qwen3-235B-A22B model on Atlas A2, and
most of the machines currently owned by users in the community are A2.
When users encounter problems, we currently have no way of knowing
whether the model runs normally on the corresponding version of the
code, so we added it. In addition, we currently cover TopN models such as
qwen-dense, qwen3-30b-a3b, Qwen3-Next, and Qwen2.5-Omni, but Qwen3-235B-A22B
is missing.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
Tested with multi-node; results as follows:
1. Accuracy test (time to execute this test case: 25 minutes)
The test runs successfully, with accuracy as follows:
```
dataset    version    metric    mode      vllm-api-general-chat
---------  ---------  --------  ------  -----------------------
gsm8k      7cd45e     accuracy  gen                       95.68
```
2. Perf test (time to execute this test case: 1 hour 15 minutes)
The test runs successfully, with throughput as follows (this is on Atlas A3;
for A2 the result is roughly the A3 value divided by 1.3):
```
╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤══════╕
│ Performance Parameters   │ Stage   │ Average        │ Min            │ Max            │ Median         │ P75            │ P90            │ P99            │  N   │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪══════╡
│ E2EL                     │ total   │ 384086.3958 ms │ 214767.0486 ms │ 528014.771 ms  │ 387621.5746 ms │ 388776.7492 ms │ 390164.3559 ms │ 488105.8512 ms │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ TTFT                     │ total   │ 159409.9868 ms │ 1849.4588 ms   │ 302439.6965 ms │ 162183.7007 ms │ 162965.477 ms  │ 164274.1936 ms │ 262578.6041 ms │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ TPOT                     │ total   │ 149.8842 ms    │ 130.2175 ms    │ 151.2625 ms    │ 150.473 ms     │ 150.6978 ms    │ 150.9102 ms    │ 151.2131 ms    │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ ITL                      │ total   │ 149.6789 ms    │ 0.0099 ms      │ 283.0242 ms    │ 150.3276 ms    │ 156.8649 ms    │ 168.1372 ms    │ 199.378 ms     │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ InputTokens              │ total   │ 3654.3079      │ 3108.0         │ 4280.0         │ 3629.0         │ 3728.0         │ 3842.1         │ 4079.0         │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ OutputTokens             │ total   │ 1500.0         │ 1500.0         │ 1500.0         │ 1500.0         │ 1500.0         │ 1500.0         │ 1500.0         │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ OutputTokenThroughput    │ total   │ 3.935 token/s  │ 2.8408 token/s │ 6.9843 token/s │ 3.8698 token/s │ 3.8799 token/s │ 3.9916 token/s │ 6.2137 token/s │ 2800 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧══════╛
╒══════════════════════════╤═════════╤═══════════════════╕
│ Common Metric            │ Stage   │ Value             │
╞══════════════════════════╪═════════╪═══════════════════╡
│ Benchmark Duration       │ total   │ 4391524.3389 ms   │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Requests           │ total   │ 2800              │
├──────────────────────────┼─────────┼───────────────────┤
│ Failed Requests          │ total   │ 0                 │
├──────────────────────────┼─────────┼───────────────────┤
│ Success Requests         │ total   │ 2800              │
├──────────────────────────┼─────────┼───────────────────┤
│ Concurrency              │ total   │ 244.8903          │
├──────────────────────────┼─────────┼───────────────────┤
│ Max Concurrency          │ total   │ 256               │
├──────────────────────────┼─────────┼───────────────────┤
│ Request Throughput       │ total   │ 0.6376 req/s      │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens       │ total   │ 10232062          │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total   │ 22.924 token/s    │
├──────────────────────────┼─────────┼───────────────────┤
│ Total generated tokens   │ total   │ 4200000           │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput   │ total   │ 2329.9568 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput  │ total   │ 956.3877 token/s  │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput   │ total   │ 3286.3445 token/s │
╘══════════════════════════╧═════════╧═══════════════════╛
```
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-12-26 23:46:09 +08:00
ZT-AIA
1d8aa892bf Update vllm pin to 12.26 (#5378)
### What this PR does / why we need it?
Update vllm pin to 12.26
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

---------

Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-26 23:44:48 +08:00
Jade Zheng
8b9ca86827 [Feature] Remove the transpose step after attention and switch to transpose_batchmatmul (#5390)
1. The `npu_fused_infer_attention_score` kernel supports specifying the
output layout. By selecting the appropriate layout, we can avoid the
transpose operation typically required after attention.
2. The `transpose_batchmatmul` function allows us to control whether the
output tensor is transposed. If we configure `perm_y`, an additional
transpose after executing `v_up` becomes unnecessary (see the sketch below).
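A plain-torch sketch of what the fused path avoids (this is not the torch_npu API, just an illustration of the extra transpose being removed):

```python
import torch

a = torch.randn(8, 128, 64)
b = torch.randn(8, 64, 256)

# Unfused: batch matmul followed by an explicit transpose (extra copy).
out = torch.bmm(a, b).transpose(1, 2).contiguous()  # shape (8, 256, 128)

# A kernel that accepts an output-layout / perm_y option can emit the
# (8, 256, 128) tensor directly, so the separate transpose is unnecessary.
```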

- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-26 22:03:46 +08:00
Wang Kunpeng
bc5b7a5fb5 [bugfix] Fix MHA model runtime error in aclgraph mode (#5397)
### What this PR does / why we need it?
Currently, MHA models (e.g. minicpm-2b, Baichuan-7b) will encounter
errors when running in piecewise graph mode, with error messages similar
to:
```
(E89999):  When layout is TND and PA not enabled, keyT(8) and valueT(8) must be equal to the last element of actualSeqenceLengthKV(5)[FUNC:CheckInputShapeWhenLayoutIsTND][FILE:prompt_flash_attention_tiling.cpp][LINE:3618]
```
The error occurs because the qkv in the prefill stage is also padded,
causing the shape to be inconsistent with `actual_seq_lengths`. This PR
adds unpadding logic for kv.
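An illustrative sketch of the unpadding (hypothetical helper, not the actual code):

```python
import torch

def unpad(x: torch.Tensor, actual_seq_lengths: torch.Tensor) -> torch.Tensor:
    # Trim the graph-padded tensor back to the real token count so its
    # first dimension matches the last element of actual_seq_lengths.
    num_actual_tokens = int(actual_seq_lengths[-1])
    return x[:num_actual_tokens]
```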

- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-12-26 21:37:28 +08:00
LeeWenquan
7685d0c239 rollback causal_conv1d_fn to torch ops & update qwen3Next doc (#5391)
### What this PR does / why we need it?
Roll back the causal_conv1d_fn op from the Triton to the torch version to fix
hanging issues; meanwhile, update the Qwen3Next doc.

- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
2025-12-26 19:57:38 +08:00
jiangyunfan1
48854aef5c [TEST]Add sending request with and without chat (#5286)
### What this PR does / why we need it?
This PR adds a method for sending chat and non-chat requests; we need
it for many upcoming test cases.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
by running the test

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
2025-12-26 18:04:17 +08:00
Jade Zheng
0dfdfa9526 [Feature] Enhance all-reduce skipping logic for MoE models in NPUModelRunner (#5329)
Besides enabling `recompute_scheduler_enable`, we can skip all_reduce
when max_num_batched_tokens is below mc2's requirement.
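A hypothetical sketch of the skip condition (the real check lives in NPUModelRunner; the threshold name here is illustrative):

```python
def can_skip_all_reduce(recompute_scheduler_enable: bool,
                        max_num_batched_tokens: int,
                        mc2_max_tokens: int) -> bool:
    # Skip when the recompute scheduler is enabled, or when every batch
    # is guaranteed to stay within the MC2 token limit.
    return recompute_scheduler_enable or max_num_batched_tokens <= mc2_max_tokens
```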

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-26 17:39:44 +08:00
Zhu Yi Lin
06732dbf5b [Doc] update R1/V3.1 doc (#5383)
### What this PR does / why we need it?
This PR updates the DeepSeek-R1/V3.1 doc to give a simple recipe for
reproducing our latest performance on Atlas A3/A2 servers.
### Does this PR introduce any user-facing change?
No.

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-12-26 17:09:22 +08:00
zhangsicheng5
8ed87dfa84 [doc] Add context parallel user guide (#5358)
1. Add a context parallel user guide.
2. Add context-parallel-related information to supported features/models.
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
2025-12-26 17:03:47 +08:00
Zetong Li
09390eaf32 [Bugfix] Fix unsuitable moe_comm_type under ep=1 scenario (#5388)
### What this PR does / why we need it?
This PR aims to fix the unsuitable `moe_comm_type` under the `ep=1` scenario.
The related issue #5375 reported that `ep=1` can cause errors in a
local environment, but those cases work well on CI. The point is the
difference between machines: `moe_comm_type` may not be chosen
correctly.
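A hypothetical sketch of the idea (the names below are illustrative, not the project's actual enum values):

```python
def choose_moe_comm_type(ep_size: int) -> str:
    # With ep=1 there is no expert parallelism, so an MC2-style
    # communication type should not be selected.
    return "allgather" if ep_size == 1 else "mc2"
```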

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
by ci

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: Zetong Li <slippersss@126.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-26 16:45:45 +08:00
Qiu
da0b113cf5 [doc]<PCP&DCP> add developer guide for PCP&DCP (#5372)
### What this PR does / why we need it?
add developer guide for PCP&DCP
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2025-12-26 16:17:38 +08:00
Zhu Yi Lin
18302c8467 Revert "Add MagicMTP(block verify) and Triton optimization (#4443)" (#5380)
### What this PR does / why we need it?
#4443 introduces a precision issue in scenarios with MTP >= 3 on DeepSeek V3.1, and this PR reverts it.

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-12-26 15:06:13 +08:00
zhangyiming
45c5bcd962 [E2E] Optimize the E2E test time. (#5294)
### What this PR does / why we need it?
Add cudagraph_capture_sizes for E2E CI test.

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: menogrey <1299267905@qq.com>
2025-12-26 14:17:50 +08:00
wangxiyuan
29d2fe653d cleanup ascend config (#5296)
1. Refresh the additional config doc.
2. Move the kv config logic to the platform.
3. Improve the `dump_config` init logic and rename it to `dump_config_path`.
This change is user-impacting: `dump_config` changes from a dict to a
string.
4. Correct the `enable_async_exponential` type.
5. Remove the useless `chunked_prefill_for_mla`.

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-26 14:07:37 +08:00
ZT-AIA
adaa89a7a5 Update vllm pin to 12.25 (#5342)
### What this PR does / why we need it?
- Fix vLLM breakage from the PR: [Drop v0.14 deprecations](https://github.com/vllm-project/vllm/pull/31285)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: ZT-AIA <1028681969@qq.com>
2025-12-26 14:05:40 +08:00
Li Wang
c2f776b846 [Nightly] Initial logging for nightly multi-node testing (#5362)
### What this PR does / why we need it?
Currently, our multi-node logs only show the master node's logs (via the
Kubernetes API), which is insufficient for effective problem
localization if other nodes experience issues. Therefore, this pull
request adds the ability to upload logs for other nodes.

Next plan: Output structured directory logs, including logs from each
node and the polog.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-26 11:39:07 +08:00
XiaoxinWang
320877d488 move contiguous in fused_sigmoid_gating_delta_rule_update to model_runner_v1 (#5274)
### What this PR does / why we need it?
The contiguous() operation temporarily increases memory usage, leading
to higher peak GPU memory, which necessitates reducing
gpu_memory_utilization. However, making tensors contiguous in
model_runner_v1 significantly enhances operator performance, resulting in
greater end-to-end model benefits despite the memory overhead.
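A minimal illustration of the trade-off (hypothetical tensor shapes):

```python
import torch

x = torch.randn(4, 8, 16).transpose(0, 1)  # non-contiguous view
# Making it contiguous in the model runner costs a temporary copy
# (higher peak memory) but gives the downstream op a dense layout.
x = x.contiguous()
```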

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-12-26 09:19:47 +08:00
Icey
9b2a7d8866 [BugFix][Fusion] Patch compile backend to make fusion available (#5308)
Currently, the vllm pr: https://github.com/vllm-project/vllm/pull/24252
is causing operator fusion to fail, which can be mitigated by patching
the backend. Once the problem is completely resolved, I will submit a
new pull request to remove the patch.

- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
2025-12-26 09:18:16 +08:00
Qi Mao
7372225bcb [FIX] Update _causal_conv1d_update_kernel for Efficient Conv State Handling on NPU (#5322)
Description:

This PR updates the implementation of the Triton operator for deployment
on NPU devices, focusing on optimizing grid size and memory handling
based on NPU limitations.

Design Plan:

Grid Calculation: The grid size is now dynamically calculated by batch
and dim to ensure that the number of programs executed does not exceed
the NPU's vector core capacity. This ensures optimal parallelism without
overloading the hardware.

Data Block Handling: Due to the limited on-chip memory (UB) on Ascend
NPUs, this implementation splits large data into smaller chunks of 32k
or less per block. The kernel performs a for-loop to process the data in
these smaller chunks, minimizing memory usage and avoiding potential
overflows.

Changes Compared to GPU Implementation:

Grid and Block Sizing:

For GPU, the grid and block size were determined based on available
thread counts and memory size. In contrast, the NPU version dynamically
adjusts these parameters using B_TILE and BLOCK_N to optimize for NPU’s
architecture.

Memory Chunking:

The original GPU implementation did not require chunking due to the
higher available memory and processing capacity. For the NPU, data is
divided into smaller chunks (32k or smaller) to comply with memory
constraints on the device. The kernel has been modified to handle this
chunking mechanism inside a loop.

Optimized Thread Usage:

The NPU implementation takes into account the hardware-specific thread
limit (24 threads per vector core), ensuring that the number of active
programs is aligned with the NPU's vector core count, avoiding
over-subscription that would lead to serial processing.

This PR ensures that the operator functions efficiently on Ascend NPU,
considering hardware limitations while maintaining the same
functionality and input parameters as the GPU implementation.
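An illustrative Python sketch of the sizing logic described above (the real kernel is a Triton function; total_elems would be derived from batch and dim, and the constants and formula here are examples only):

```python
import math

NUM_VECTOR_CORES = 24        # cap on concurrently running programs
MAX_CHUNK_ELEMS = 32 * 1024  # keep each processed block within ~32k elements (UB)

def plan_launch(total_elems: int) -> tuple[int, int]:
    # Cap the program count at the vector-core budget so programs are
    # not over-subscribed into serial execution.
    num_programs = min(total_elems, NUM_VECTOR_CORES)
    # Each program then walks its share in <=32k-element chunks with a
    # for-loop, bounding on-chip (UB) memory usage.
    elems_per_program = math.ceil(total_elems / num_programs)
    num_chunks = math.ceil(elems_per_program / MAX_CHUNK_ELEMS)
    return num_programs, num_chunks
```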


- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

Signed-off-by: maoxx241 <maomaoyu870@gmail.com>
2025-12-26 09:12:30 +08:00
Mengqing Cao
4ce32c1a8d [CI] Skip failed test cases to recover CI (#5368)
### What this PR does / why we need it?
Skip `test_minicpm_2b` to recover CI. Not sure why this CI failed, but
we'll skip it quickly to recover CI.

test_minicpm_2b related failed PRs:

https://github.com/vllm-project/vllm-ascend/actions/runs/20502414919/job/58911802576?pr=5274

https://github.com/vllm-project/vllm-ascend/actions/runs/20502596934/job/58912315736?pr=5322

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-26 08:18:23 +08:00
Feng Liu
1858f3d36e [Bugfix] Fix Qwen P/D Disaggregation accuracy issue (#5340)
### What this PR does / why we need it?
Fix Qwen P/D Disaggregation accuracy issue

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: F.Liu <liufeng248@huawei.com>
Co-authored-by: F.Liu <liufeng248@huawei.com>
2025-12-25 22:46:08 +08:00
cookieyyds
2da8038dd2 [doc] update using command (#5373)
### What this PR does / why we need it?
Update the configuration for optimal performance of DeepSeek V3.2 in the usage tutorial.

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-25 22:28:35 +08:00
Magnus
59f11dd1cb [Bugfix] fix xlite decode-only e2e test (#5354)
### What this PR does / why we need it?
Fix the xlite decode-only e2e test: xlite decode-only mode utilizes
aclgraph's prefill and is affected by aclgraph, so the test
length is shortened.

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: changdawei1 <changdawei3@huawei.com>
Co-authored-by: changdawei1 <changdawei3@huawei.com>
2025-12-25 16:30:17 +08:00
weiguihua2
d752c030e9 [Bugfix] fix pcp 128K break (#5266)
### What this PR does / why we need it?
[Bugfix] Fix the issue where a 128K context does not work in long-sequence
scenarios.

This issue is caused by not splitting num_token according to pcp_size
during profile_run.
During `profile_run`, a warm-up is performed based on
`self.max_num_tokens`. When PCP is enabled, each PCP group will only
schedule up to `self.max_num_tokens / pcp_size`. After `profile_run` is
completed, the original scheduling size needs to be restored.

This is a temporary workaround; once
https://github.com/vllm-project/vllm/pull/28988/files is implemented,
this part can be removed.
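A hypothetical sketch of the workaround (names are illustrative, not the actual runner code):

```python
class RunnerSketch:
    def __init__(self, max_num_tokens: int, pcp_size: int):
        self.max_num_tokens = max_num_tokens
        self.pcp_size = pcp_size

    def profile_run(self) -> None:
        original = self.max_num_tokens
        # Each PCP group only schedules up to max_num_tokens / pcp_size
        # during the warm-up.
        self.max_num_tokens = original // max(self.pcp_size, 1)
        try:
            self._dummy_run(self.max_num_tokens)
        finally:
            # Restore the original scheduling size afterwards.
            self.max_num_tokens = original

    def _dummy_run(self, num_tokens: int) -> None:
        pass  # placeholder for the actual warm-up pass
```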

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-12-25 11:58:52 +08:00