xc-llm-ascend

Author	SHA1	Message	Date
zzhxxx	64d29875f9	[Refactor] Replace the implementations of o_proj, q_b_proj, and kv_b_proj with custom_op for sharded CP (#5698 ) ### What this PR does / why we need it? Based on the Sharded-CP feature PR:https://github.com/vllm-project/vllm-ascend/pull/4702; RFC:https://github.com/vllm-project/vllm/issues/30055 This PR officially integrates Deepseek V3.2's DSA-CP support on the basis of https://github.com/vllm-project/vllm-ascend/pull/4702, improving inference efficiency and scalability under mixed prefill-decode workloads. The main improvements include: - Replace the implementations of o_proj, q_b_proj, and kv_b_proj with custom_op for TP=1. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Signed-off-by: chenxiao <Jaychou1620@Gmail.com> Signed-off-by: Kurumi5210 <jaychou1620@gmail.com> Co-authored-by: clrs97 <524936896@qq.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com>	2026-01-09 15:58:40 +08:00
ZT-AIA	e11ff8e535	[BufFix]Fix the error when using Ascend custom operators with rank=128 (#5394 ) ### What this PR does / why we need it? The custom Ascend operators sgmv_expand and sgmv_shrink apply only when rank is 8, 16, 32, or 64. When rank >= 128, the operator is out of range, causing the model to report an error. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Depends on this commit https://github.com/vllm-project/vllm/pull/31408 - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` --------- Signed-off-by: ZT-AIA <1028681969@qq.com> Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>	2026-01-09 15:57:43 +08:00
wangxiyuan	d36ca88cf4	[CI] Avoid lint and ut for PR push (#5762 ) 1. Don't run lint and ut again once the PR is merged, to save CI resources 2. Update codecov every 4 hours 3. Rename `model_downloader` to a suitable name 4. Update the schedule job to a better time. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-09 15:57:06 +08:00
lhchg	dc99cfdc15	[CustomOp] support TensorList for dispatchFFNCombine (#5665 ) ### What this PR does / why we need it? To support tensorList for dispatch_ffn_combine, to adjust eplb ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Single Operator Testing - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: lhchg <lhao_cheng@163.com> Co-authored-by: lihaocheng <lihaosheng1@h-partners.com>	2026-01-09 15:56:29 +08:00
Wang Xiaoran	3ce5a34468	[BugFix] Xlite: Bypass the padding of the graph mode in non-MTP cases to obtain the correct decode num. (#5711 ) ### What this PR does / why we need it? This PR fixes a bug in the Xlite backend (https://atomgit.com/openeuler/GVirt/issues/1). The direct cause of the problem is that the XModel::PrepareAttn function obtained an illegal number of tokens to be inferred, -540. This illegal value is due to the padding feature of inference in graph mode and the residual state across steps. This issue is triggered when a prefill request is newly added in a step and a decode ends simultaneously. It is first fixed using num_decode_tokens instead of attn_metadata.num_decodes. 1. In graph mode, vllm_ascend has padding characteristics. In the _prepare_inputs function, if the number of tokens to be inferred is less than the set threshold (8 in this case), the attn_metadata.num_decode array will be expanded to 8. 2. Meanwhile, vllm_ascend uses the class variable self.query_start_loc of NPUModelRunner to record the tokens to be inferred. Due to poor coordination with the graph mode padding mechanism when crossing steps, in some cases (such as when a decode request is completed in a certain step and a new prefill request is added at the same time), negative values may be calculated for attn_metadata.query_lens. 3. After type conversion, the negative values in query_lens cause an overflow. Xlite detects that the number of tokens to be inferred for the decode request is too large and triggers a "decode len too long" alert. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Same with https://atomgit.com/openeuler/GVirt/issues/1 - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wwwumr <1127858301@qq.com>	2026-01-09 15:55:30 +08:00
InSec	2d713fee93	[CI] Accuracy issue of qwen3-next-w8a8 nightly test fix. (#5746 ) ### What this PR does / why we need it? Close the Full Graph mode to temporarily avoid accuracy issue for Qwen3-Next-80B-A3B-Instruct-W8A8. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: InSec <1790766300@qq.com>	2026-01-09 15:55:13 +08:00
Rui Kang	be941cab71	[BugFix] NetLoader: No backend type associated with device type npu (#5700 ) What this PR does / why we need it? This PR fixes a bug in NetLoader [PR#2888](https://github.com/vllm-project/vllm-ascend/pull/2888). The bug was caused by [PR#3612](https://github.com/vllm-project/vllm-ascend/pull/3612) ([1/N][Refactor] Refactor code to adapt with vllm main), which removed the `stateless_init_device_torch_dist_pg` function from platform.py, leading to a failure in the call. This PR adds a way to create a stateless process group that does not depend on external code. Does this PR introduce any user-facing change? No How was this patch tested? Same with [PR#2888](https://github.com/vllm-project/vllm-ascend/pull/2888) - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: destinysky <kangrui10@126.com>	2026-01-09 15:54:54 +08:00
Li Wang	64904ab5b6	[CI] lint and ut use self_hosted runner (#5652 ) ### What this PR does / why we need it? lint and ut use self_hosted runner - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-09 14:26:14 +08:00
zzhxxx	36d74aba58	[Doc][fix] Fix the title of the document for the layer_sharding feature (#5759 ) ### What this PR does / why we need it? Fix the title of the document for the layer_sharding feature - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com>	2026-01-09 14:15:22 +08:00
whx	ee2ed573f1	[BugFix][DS 3.2] Fix ds indexer accuracy problem caused by rope. (#4641 ) ### What this PR does / why we need it? The rotary algorithm in the deepseek indexer should be neox-style instead of gptj-style. PR #4413 fixed this accuracy bug with a new triton kernel. This PR fixes the original pytorch version. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? CI passed with existing test. - vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24 - vLLM main: `86e178f7c4` Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-01-09 14:11:44 +08:00
zyz111222	98c788a65a	[Doc] add PaddleOCR-VL tutorials guide (#5556 ) ### What this PR does / why we need it? 1. add PaddleOCR-VL.md in the `docs/source/tutorials/` 2. add PaddleOCR-VL index in `docs/source/tutorials/index.md` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by CI - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: zouyizhou <zouyizhou@huawei.com>	2026-01-09 11:01:25 +08:00
LeeWenquan	a3a74d6984	[CI] Add qwen3 next ci (#5395 ) ### What this PR does / why we need it? Add Qwen3Next CI ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2026-01-09 10:29:09 +08:00
Chenxi Qian	40eb3e1836	[OP] Enable custom op aclnnMoeInitRoutingCustom (#5332 ) ### What this PR does / why we need it? This PR enables custom op `aclnnMoeInitRoutingCustom` introduced in PR #5251 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: QianChenxi <chenxi.qian.cq@outlook.com> Signed-off-by: zzzzwwjj <1183291235@qq.com> Co-authored-by: zzzzwwjj <1183291235@qq.com>	2026-01-09 09:35:18 +08:00
Li Wang	595d3484c4	[Nightly] Move ops to the correct path (#5642 ) ### What this PR does / why we need it? Move ops to the correct path where they belong - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-09 09:23:36 +08:00
wangxiyuan	1ff1c96d13	[CI] Remove workflow_dispatch way for image build (#5742 ) There are some problems with the workflow_dispatch way for image build. Let's remove it first to make CI happy. I'll add it back once it's well tested. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-09 09:20:30 +08:00
zhenwenqi2024	97f6be8108	[feature]dcp&pcp support mlapo (#5672 ) ### What this PR does / why we need it? mlapo in deepseek is a huge performance improvement in decode; this PR adds support for pcp & dcp with mlapo. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>	2026-01-08 23:49:23 +08:00
meihanc	6315a31399	[CI] Add triton ascend in nightly CI (#5716 ) ### What this PR does / why we need it? Add triton ascend in nightly ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-08 21:17:32 +08:00
Yizhou	f4605c2b3c	[Fix] Fixes speculative decode indexing and unpad condition for attention metadata (#5626 ) ### What this PR does / why we need it? This addresses the issue brought up by #5356 and #4963, and we believe the unnecessary conditions are the root cause. Change the unpad trigger to be driven by actual size mismatches (num_reqs vs base_num_reqs or scheduled vs input token counts) rather than specific speculative-method flags. Then remove brittle workarounds that forced request counts and sliced query start locations. This prevents incorrect indexing and length mismatches during speculative decoding and makes metadata unpadding more robust across scheduling modes. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Tested by existing cases. - vLLM version: v0.13.0 - vLLM main: `8be6432bda` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2026-01-08 19:41:08 +08:00
meihanc	503822c56c	[Doc] Add Qwen3-Omni-30B-A3B-Thinking Tutorials (#3991 ) ### What this PR does / why we need it? Add Qwen3-Omni-30B-A3B-Thinking Tutorials ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-08 16:57:20 +08:00
cookieyyds	8b3a7a9e87	[bugfix] Support dsv3.2 enable both mtp and full_decode_only (#5679 ) ### What this PR does / why we need it? PR #5230 introduced a problem: when both mtp and full_decode_only are enabled for the DSV32 model, the operators cannot be compiled into the graph. This PR fixes that issue. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>	2026-01-08 15:47:31 +08:00
drslark	ccbc5e2ba1	[Feat][Bugfix][main] Adapted SP to eagle3 (#5562 ) ### What this PR does / why we need it? Adapted sp to eagle3. There may still be some problems, e.g., accuracy in some scenes, `sp`+`dp`... We will fix them later. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? We tested it mainly in a new `e2e`. ```shell pytest -s tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_llama_qwen_eagle_acceptance ``` ```text . =============================== warnings summary =============================== <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ============= 3 passed, 1 skipped, 2 warnings in 142.05s (0:02:22) ============= ``` It passed. - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: drslark <slarksblood@qq.com>	2026-01-08 15:33:52 +08:00
wangxiyuan	d03cc9c456	[CI] Fix image build workflow_dispatch error (#5717 ) type `raw` must contain `value` section. This PR fixes the image build error. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-08 15:07:33 +08:00
Li Wang	920bbe932f	[CI] Drop outdated cases (#5709 ) ### What this PR does / why we need it? Correcting some outdated use cases: `tests/e2e/singlecard/test_aclgraph_accuracy.py::test_models_output` -> `tests/e2e/singlecard/test_aclgraph_accuracy.py::test_piecewise_res_consistency` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-08 11:23:44 +08:00
LI SHENGYONG	b69db4ce55	[EPLB][CI] EPLB add aclgraph and redundant expert ci (#5625 ) ### What this PR does / why we need it? EPLB currently does not have CI related to aclgraph and redundancy experts; this PR adds them. release on #5529 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Tested the use cases to be added in this PR. PASSED ====================================================== warnings summary ========================================================== <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ====================================================== 1 passed, 2 warnings in 272.24s (0:04:32) ===================================================== - vLLM version: v0.13.0 - vLLM main: `8be6432bda` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-08 09:51:48 +08:00
wangxiyuan	264cc254cc	[CI] fix image build tag (#5703 ) ref doesn't work with workflow_dispatch, so let's change it to the raw way. This PR also merges the pr_create job into one runner to save resources. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-08 09:27:45 +08:00
Nengjun Ma	48811bc0b8	Optimize the print info format when deprecated code is used in vllm-ascend (#5696 ) ### What this PR does / why we need it? Optimize the format of the warning printed when deprecated code is detected in use in vllm-ascend. ### Does this PR introduce _any_ user-facing change? NA - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-01-08 09:26:49 +08:00
Aoxuan Chen	8763953f56	[Feature] add the magicmtp speculative decoding acceleration algorithm (#5542 ) ### What this PR does / why we need it? 1. MagicMTP (paper: "Block Verification Accelerates Speculative Decoding") was introduced to consider the influence among multiple draft tokens, improving the acceptance rate without compromising accuracy. 2. Added Triton and PyTorch implementations, and added E2E test cases. ### Does this PR introduce _any_ user-facing change? MagicMTP will automatically take effect when the parameter "num_speculative_tokens" >= 3. - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: chenaoxuan <cax1165@163.com>	2026-01-08 09:15:55 +08:00
lidenghui1110	481138e1d2	[bugfix] adapt to new implemented get_kv_cache_spec in cpuoffload connector (#4311 ) ### What this PR does / why we need it? The function `get_kv_cache_spec` in model_runner changed a lot, which caused an error in the cpuoffloading connector, whose implementation is copied from model_runner. This PR adapts the connector to the newly implemented `get_kv_cache_spec` to fix it. ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2026-01-08 09:15:09 +08:00
zzhxxx	f7db812ed7	[refactor] Refactor the interface for shard weight and remove the flashcomm2 o_shared interface. (#5181 ) ### What this PR does / why we need it? - Delete the environment variable `VLLM_ASCEND_ENABLE_FLASHCOMM2_OSHARED` - Introduce layer_sharding as a configurable feature in additional_config - Revise the term "shared weight" to "shard weight." Configuration: The feature is opt-in via the additional_config argument: ``` --additional-config '{ "layer_sharding": ["o_proj", "q_b_proj"] }' ``` This is orthogonal to standard tensor parallelism and weight replication strategies. It is treated as a separate, explicit feature. It can be used in any scenario, combined with the flashcomm2 (https://github.com/vllm-project/vllm-ascend/pull/3232) feature or the ShardedCP #4702 feature, to achieve significant performance gains. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn> Signed-off-by: chenxiao <Jaychou1620@Gmail.com> Co-authored-by: clrs97 <524936896@qq.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com>	2026-01-08 09:05:02 +08:00
zxr2333	20a8cf061b	[BugFix][P/D] Fix pre-create link parameter error (#5694 ) ### What this PR does / why we need it? Fix pre-create link parameter error, `batch_transfer_sync_write` requires list. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2026-01-08 08:41:10 +08:00
ZCG12345	3be8e33fe9	[Kernel] Add moe_gating_top_k operator support for Ascend NPU (#5579 ) ### What this PR does / why we need it? 1. Replace moe_gating_top_k from torch_npu with a custom op 2. Enable the renorm function of moe_gating_top_k in the softmax scenario ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No testing needed - vLLM version: v0.13.0 - vLLM main: `7157596103` --------- Signed-off-by: ZCG12345 <2097562023@qq.com>	2026-01-07 21:42:31 +08:00
Li Wang	1165b2c863	[1/N][CI] Refactor accuracy test (#5400 ) ### What this PR does / why we need it? 1. Accuracy testing no longer compares eager and graph modes; instead, it directly extracts the golden result under the graph mode configuration (the implicit purpose of this case is to verify whether modifications affect existing results) 2. Next step: finer-grained supervision of logits/sampler results ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-07 20:58:15 +08:00
Icey	b94fc13d3f	[BugFix][Fusion] Fix graph fusion failure problem (#5676 ) Currently, the vllm pull request (https://github.com/vllm-project/vllm/pull/24252) is causing operator fusion to fail. This issue was previously fixed by patching the backend. The root cause has been identified, and the problem can be resolved with this pull request. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2026-01-07 18:42:55 +08:00
Icey	137f28341d	[Tests] Add qwen3-8b nightly test (#5597 ) ### What this PR does / why we need it? Add qwen3-8b nightly test - vLLM version: v0.13.0 - vLLM main: `7157596103` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2026-01-07 18:42:05 +08:00
Mengqing Cao	3f4f2b4ae6	[Refactor] Import global var form vllm instead of overwirte it (#5469 ) ### What this PR does / why we need it? Import the global var from vllm instead of overwriting it, so that we can use the correct global variable value - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2026-01-07 18:41:45 +08:00
LICO67373	380f089fbf	[Refactor] Fix AttentionMaskBuilder singleton and remove redundant pcp_prefill_mask (#4870 ) ## What this PR does / why we need it? This PR fixes the `AttentionMaskBuilder` singleton initialization issue introduced in PR #4779 and removes the unused `pcp_prefill_mask` field. ### Background After PR #4779 made `AttentionMaskBuilder` a singleton with `@singleton` decorator, the class constructor now requires a `device` parameter. However, two initialization sites were still using the old parameterless constructor, causing failures. ### Changes 1. Fix singleton initialization - Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)` in `AscendMLAMetadataBuilder.__init__()` - Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)` in `AscendAttentionMetadataBuilder.__init__()` 2. Remove unused field - Removed `pcp_prefill_mask` field from `AscendPrefillContextParallelMetadata` (never used in codebase) - Updated related test assertions ### Related - Issue #5463 - PR #4779 (Unify all mask generation methods) - PR #5389 (Make AttentionMaskBuilder singleton) ## Does this PR introduce _any_ user-facing change? No. This is an internal refactoring. ## How was this patch tested? - ✅ Local testing: No linter errors - ✅ Unit tests for attention modules verified - ⏳ CI pipeline Signed-off-by: lico67373 <918688502@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2026-01-07 17:09:52 +08:00
wangxiyuan	91790fd85a	[CI] move image and wheel job to schedule way (#5685 ) move image and wheel job to schedule way to save CI resource - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-07 16:40:19 +08:00
无脸男	1140789e83	[Bugfix] Fix the graph capture failure issue in the eagle3+full scenario. (#5553 ) ### What this PR does / why we need it? When launching the service in the scenario where the cudagraph_mode is set to FULL and Eagle3 acceleration is enabled for inference, an error in fia will cause graph capture to fail. This PR fixes the issue. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: WithHades <244036962@qq.com>	2026-01-07 15:57:16 +08:00
weiguihua2	2b8a9ce8bd	[Bugfix] fix resource are insufficient when pcp and piecewise (#5377 ) ### What this PR does / why we need it? Resolving the issue of insufficient resources during service operation when PCP is enabled in a piecewise scenario. When enabling PCP and executing in piecewise mode, the curl request fails due to insufficient resources, resulting in the error message "The resources are insufficient." Through profiling analysis, it was found that the PCP communication domain also occupies streams and consumes resources. Therefore, when updating aclgraph sizes, the PCP communication domain needs to be taken into account. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-01-07 15:39:52 +08:00
Paco Xu	4f9808002b	[CI] Add workflow to cancel running workflows on PR close (#5646 ) Example: https://github.com/vllm-project/vllm-ascend/actions/runs/20735955959/job/59533181655 is still running after https://github.com/vllm-project/vllm-ascend/pull/5612 is closed. The action will keep running for more than 2 hours, so it needs to be cleaned up. It seems that GitHub Actions will not cancel it automatically, so I add this to cancel those PR-related actions once the PR is closed. Tested in https://github.com/pacoxu/pacoxu/actions/runs/20743173119. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: Paco Xu <roollingstone@gmail.com>	2026-01-07 15:38:10 +08:00
Li Wang	d314ea8d3d	[CI] Bump lm-eval version to v0.4.9.2 (#5655 ) ### What this PR does / why we need it? fix https://github.com/vllm-project/vllm-ascend/issues/2865, lm-eval [got an official update last month](https://github.com/EleutherAI/lm-evaluation-harness/releases/tag/v0.4.9.2), so let's bump the version. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-07 14:15:53 +08:00
wangxiyuan	6f7a81cd9f	[CI] cleanup single/multi-card test (#5623 ) 1. speed up e2e light test. 2. create `2-cards` and `4-cards` folder in multicard 3. move ops to nightly 4. run test in Alphabetical Order - vLLM version: v0.13.0 - vLLM main: `8be6432bda` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-07 14:13:34 +08:00
SILONG ZENG	1afbc01ed4	[misc]Add Kimi-K2 series to CI model list (#5656 ) ### What this PR does / why we need it? Add the model to CI for subsequent addition of nightly test cases: - moonshotai/Kimi-K2-Thinking - vllm-ascend/Kimi-K2-Instruct-W8A8 ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: wangli <wangli858794774@gmail.com> Co-authored-by: wangli <wangli858794774@gmail.com>	2026-01-07 11:32:48 +08:00
UnifiedCacheManager	d6bb17f10e	[Bugfix]Add register_kv_cache in ucm_connector (#5657 ) ### What this PR does / why we need it? To adapt to different shapes of the KV cache, UCM optimized the initialization of the store by moving it into `register_kv_caches`. Therefore, this update adds the `register_kv_caches` interface to UCMConnectorV1. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>	2026-01-07 11:30:33 +08:00
LI SHENGYONG	cd59323e40	[Bugfix] Revert pr4214 multi-stream collect expert hotpot (#5529 ) ### What this PR does / why we need it? PR4214 was intended to collect expert heat by processing multiple streams, which could lead to memory overwriting and accuracy issues. After communicating with the PR submitter, this PR has been reverted. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? qwen3-moe dynamic eplb Before revert \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 43.33 \| After revert \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| baseline (without eplb) \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-07 11:26:47 +08:00
wangyibo1005	25baf6df09	[Feature]EPLB:Adapt DispatchGmmCombineDecode operator to eplb tensor list and expert token numbers (#5552 ) #### What this PR does / why we need it? This PR adapts the DispatchGmmCombineDecode operator to the eplb tensor list and expert token numbers. This operator supports gmm1, gmm2, gmm1Scale and gmm2Scale in list format. It also supports counting how many tokens each local expert receives via expertTokensNum. - vLLM version: v0.13.0 - vLLM main: `7157596103` For more info about this operator, please refer to the RFC: issue https://github.com/vllm-project/vllm-ascend/issues/5476	2026-01-07 11:23:42 +08:00
starmountain1997	086c093347	[CI] Add DeepSeek-V3.2-W8A8 nightly ci test (#5371 ) # What this PR does / why we need it? Add DeepSeek-V3.2-W8A8 dual-node nightly CI test and update A3 nightly test configuration: 1. Add DeepSeek-V3.2-W8A8 dual-node test: tests/e2e/nightly/multi_node/config/DeepSeek-V3_2-W8A8-A3-dual-nodes.yaml - 2 nodes, 16 NPUs per node (32 NPUs total) - Configuration: 2P+1D (data-parallel-size=4, tensor-parallel-size=8, data-parallel-size-local=2) - Includes performance and accuracy benchmarks with GSM8K dataset 2. Update A3 nightly workflow: .github/workflows/nightly_test_a3.yaml - Added DeepSeek-V3.2-W8A8 dual-node test to the A3 nightly test matrix - Test name: multi-node-dpsk3.2-2node 3. Improve test scripts: Updated .github/workflows/_e2e_nightly_multi_node.yaml and related scripts for better multi-node testing support test on A3 instances - Performance baseline: 1 (threshold: 0.97) - Accuracy baseline: 95% (threshold: 5%) - Test dataset: GSM8K with 512 prompts for performance, gsm8k-lite for accuracy --------- Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-01-07 10:02:02 +08:00
Feng Liu	cbc987db0b	[bugfix (pcp)] fix chunked prefill accurancy issue (#5647 ) ### What this PR does / why we need it? Purpose: initialize the padded slot mapping buffer to prevent garbage values. In PCP mode, the `pcp_padded_slot_mapping` buffer is reused across invocations. Without explicit initialization, this buffer retains stale values from previous runs, which can lead to incorrect results. This change ensures the buffer is filled with -1. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: F.Liu <liufeng248@huawei.com> Co-authored-by: F.Liu <liufeng248@huawei.com>	2026-01-07 10:01:27 +08:00
wangxiyuan	1112208052	[Refactor] Cleanup platform (#5566 ) ### What this PR does / why we need it? 1. add `COMPILATION_PASS_KEY` constant 2. clean up useless platform interfaces `empty_cache`, `synchronize`, `mem_get_info`, `clear_npu_memory` 3. rename `CUSTOM_OP_REGISTERED` to `_CUSTOM_OP_REGISTERED` 4. remove useless env `VLLM_ENABLE_CUDAGRAPH_GC` NPUPlatform is the interface called by vLLM. Do not call it inside vllm-ascend. ### Does this PR introduce _any_ user-facing change? This PR is just a cleanup. All CI should pass. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-07 09:25:55 +08:00
Ronald	6ea2afe5fa	[Feature] implement basic framework for batch invariant (#5517 ) ### What this PR does / why we need it? This PR implements the basic framework for batch invariant; please see https://github.com/vllm-project/vllm-ascend/issues/5487. ### Does this PR introduce _any_ user-facing change? We reuse the function `vllm_is_batch_invariant` in vllm to judge whether batch invariant is enabled. - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com> Signed-off-by: Lord_of_Ironhill <suiweiyi@huawei.com> Signed-off-by: zjchenn <zjchenn@gmail.com> Signed-off-by: wangx700 <wangxin700@huawei.com> Co-authored-by: Lord_of_Ironhill <suiweiyi@huawei.com> Co-authored-by: zjchenn <zjchenn@gmail.com> Co-authored-by: wangx700 <wangxin700@huawei.com>	2026-01-07 09:11:26 +08:00

2016 Commits