xc-llm-ascend

Author	SHA1	Message	Date
liziyu	330e25ab1d	[P/D] Performance enhancement of Layerwise connector in TP asymmetric scenarios (#5540 ) ### What this PR does / why we need it? [P/D] Performance enhancement of Layerwise connector in TP asymmetric scenarios 1. Session fusion: For transmission tasks at each layer, aggregate transmission tasks with the same destination and merge them into a single task for assignment. 2. Alltoall aggregation: For TP asymmetric scenarios, perform all alltoall operations at once according to the block granularity for all requests. [RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache Layerwise Push Support https://github.com/vllm-project/vllm-ascend/issues/4842 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-06 20:25:36 +08:00
wangxiyuan	cd1162e25a	[Misc] Remove useless weight loader patch (#5619 ) The patch for weight loader is useless now. Let's remove it - vLLM version: v0.13.0 - vLLM main: `8be6432bda` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-06 20:17:32 +08:00
InSec	089ca2ddcc	[Nightly][Test] Add Qwen3-Next-80B-A3B-Instruct-W8A8 nightly test (#5616 ) ### What this PR does / why we need it? There was an accuracy issue with Qwen3-Next-80B-A3B-Instruct-W8A8 model in the old version of Triton-Ascend, so, we are now adding one nightly test to maintain it. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: IncSec <1790766300@qq.com>	2026-01-06 17:36:00 +08:00
yeyifan	cc0110abb4	[Bugfix] Remove swa parameter of fia (#5602 ) ### What this PR does / why we need it? When using the swa parameter in fia, headDim does not currently support 256, and when gemma3's headDim is equal to 256, an error will occur. Therefore, code rollback is required, and it will be incorporated after cann supports it. ### Does this PR introduce _any_ user-facing change? Remove swa parameter of fia. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` --------- Signed-off-by: nsdie <yeyifan@huawei.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2026-01-06 17:24:43 +08:00
Mercykid-bash	29e2f9a43e	Bugfix: Align expert map shapes with redundant experts in EPLB adjustment (#5285 ) #### Overview This PR fixes a shape mismatch bug between `expert_placement_map` and `log2phy_expert_map` when redundant experts are enabled in the vLLM-Ascend platform. The issue occurred during the initialization of expert maps and their updates via EPLB (Expert Load Balancer) adjustment, leading to potential tensor shape errors and incorrect expert routing in distributed MoE deployments. #### Key Changes 1. Unify expert map shape calculation logic - Ensure the shape of `expert_placement_map` and `log2phy_expert_map` strictly aligns with the total number of experts (including redundant experts) during initialization. - Update the shape adjustment logic in EPLB dynamic update process to match the initial expert map dimensions. 2. Add shape consistency checks - Add assertion statements to verify the shape consistency of the two maps after initialization and EPLB adjustment, preventing silent shape mismatches in subsequent operations. #### Impact - Resolves tensor shape errors when using redundant experts with EPLB on Ascend platform. - Ensures correct expert routing and load balancing for MoE models with redundant expert configurations. - No breaking changes to existing functionality; compatible with non-redundant expert deployments. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Che Ruan <cr623@ic.ac.uk> Signed-off-by: shenchuxiaofugui <1311027364@qq.com> Co-authored-by: Che Ruan <cr623@ic.ac.uk> Co-authored-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-06 17:22:36 +08:00
Zetong Li	fe3f2c7702	[Refactor][EAGLE] 3/N delete redundant methods in mtp_proposer (#5420 ) ### What this PR does / why we need it? This PR aims to delete redundant methods in mtp_proposer. All the deleted methods now can be found in eagle_proposer. We also remove some methods in eagle_proposer since they are identical to those in vllm-eagle. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: Zetong Li <slippersss@126.com>	2026-01-06 16:47:39 +08:00
Shanshan Shen	b94d589769	[MM][Bugfix] Update `hf_config` to `hf_text_config` (#5319 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm-ascend/pull/5205, update `hf_config` to `hf_text_config`. Find more details at https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3675417534 and https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3677920872. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-06 16:41:39 +08:00
Magnus	293b2275df	[CI] Specify the version of xlite (#5612 ) ### What this PR does / why we need it? Pin the xlite version to avoid CI failures during its upgrade. - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: changdawei1 <changdawei3@huawei.com>	2026-01-06 16:02:16 +08:00
wjunLu	b8f245792e	[Main2Main] Upgrade vllm commit to 0106 (#5617 ) ### What this PR does / why we need it? Upgrade vllm commit to 0106 - vLLM version: v0.13.0 - vLLM main: `8be6432bda` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-06 15:50:40 +08:00
meihanc	c1dcddce3f	[CI]update bisheng version (#5621 ) ### What this PR does / why we need it? update bisheng version in 20260105 - vLLM version: v0.13.0 - vLLM main: `8be6432bda` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-06 15:22:22 +08:00
Qiu	e07938047e	[UT][PCP&DCP] UT for block_table.py (#5032 ) ## Purpose This PR adds unit tests for the `compute_slot_mapping` function in `block_table.py` with various values of `pcp_size`, `dcp_size`, and `cp_kv_cache_interleave_size`. ## Test Plan `pytest tests/ut/worker/test_block_table.py` ## Test Result `==== 3 passed, 2 warnings in 0.20s ====` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-06 11:19:25 +08:00
wjunLu	3cf059a72b	[Main2Main] Upgrade vllm commit to 0105 (#5595 ) ### What this PR does / why we need it? Upgrade vllm commit to 0105 (8be6432bdaf6275664d857b1e5e9bf8ed1ce299e) 1. Remove the `maybe_padded_num_tokens` arg in `model_runner_v1.py`, since https://github.com/vllm-project/vllm/pull/31517 deleted the unused arg 2. Remove dense `Qwen/Qwen3-0.6B` from `tests/e2e/multicard/test_aclgraph_capture_replay.py` and `tests/e2e/multicard/test_data_parallel.py` due to https://github.com/vllm-project/vllm/pull/30739, after which offline data parallel mode will not be supported/useful for dense models 3. Adapt `vllm_ascend/worker/worker.py` due to https://github.com/vllm-project/vllm/pull/31584 4. Adapt `self.block_size` calling due to https://github.com/vllm-project/vllm/pull/31540 5. Modify `test_mla_v1.py` due to https://github.com/vllm-project/vllm/pull/28454 , which refactored `get_head_size()` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-06 08:44:29 +08:00
Li Wang	c5e2f48510	[CI] mv ops to correct path (#5615 ) ### What this PR does / why we need it? Move ops to the correct path: `tests/e2e/nightly/single_node/ops/singlecard_ops/triton` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-05 23:17:07 +08:00
dsxsteven	129ba9fe1b	[BugFix] Fix Smoke Testing Bug for DSR1 longseq (#5613 ) ### What this PR does / why we need it? Fix the smoke testing bug for DSR1 longseq. The daily smoke test case throws an error: "max_tokens or max_completion_tokens is too large: 32768.This model's maximum context length is 32768 tokens and your request has 128 input tokens". The error occurs because max-out-len equals max-model-len, which leaves no room for the input tokens. It is fixed by increasing the max-model-len argument in the script. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: daishixun <dsxsteven@sina.com>	2026-01-05 22:40:28 +08:00
ZixuanWang	8eae949d11	Revert "[Feat] enable hierarchical mc2 ops on A2 by default (#5545 )" (#5611 ) This reverts commit `fb9fdcdbe4`. ### What this PR does / why we need it? The reverted PR breaks the smoke test with the following error: aclnnNeScalar:Kernel Run failed. opType: 25, NotEqual launch failed for NotEqual, errno:361001 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: zxwang <1476209578@qq.com>	2026-01-05 22:39:05 +08:00
Angazenn	11e75494b1	[TRITON][TEST]Add nightly test for triton split_qkv_rmsnorm_rope (#5267 ) ### What this PR does / why we need it? Add nightly test for triton split_rmsnorm_rope ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-01-05 21:35:37 +08:00
Chen Chen	a2daacbd71	[perf] Fix MLAPO weight disposal for KV-consumer MLA in PD-mix deploy... (#5192 ) ### What this PR does / why we need it? - Problem: In MLA+MLAPO, KV-consumer deployments keep fused_qkv_a_proj/q_proj weights and quant params even though MLAPO uses the prepacked buffers, increasing memory footprint on decode nodes. - Fix: Conditionally drop those tensors only when `kv_transfer_config.is_kv_consumer` to reclaim memory (consistent with the SFA behavior #4774 ). ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: Chen Chen <0109chenchen@gmail.com>	2026-01-05 21:29:45 +08:00
Qiu	b10ef9b9f3	[docs] Correct image about prefill phase of PCP (#5598 ) ### What this PR does / why we need it? Remove the incorrectly depicted DCP all_gather operation in the prefill stage PCP for GQA diagram. Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-05 20:21:59 +08:00
meihanc	a034941d06	[CI] update triton-ascend version (#5584 ) ### What this PR does / why we need it? update triton-ascend version to 20260105 - vLLM version: v0.13.0 - vLLM main: `7157596103` --------- Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-05 20:20:11 +08:00
Chao Lei	473431e7e2	[P/D]Remove mooncake kvpool unused parameter `local_hostname` (#5574 ) ### What this PR does / why we need it? In mooncake kvpool, `local_hostname` is not used. Instead, the local IP is obtained directly via `get_ip()`. Therefore, remove this parameter to avoid confusion. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: LCAIZJ <leichao139636@163.com>	2026-01-05 20:18:59 +08:00
Debonet	d86021f7b4	[Bugfix] record cos and sin cache in AscendRotaryEmbedding (#5516 ) ### What this PR does / why we need it? When models like [Moonlight](https://modelscope.cn/models/moonshotai/Moonlight-16B-A3B-Instruct) (using MLA but without `rope_scaling` in config.json) invoke `AscendRotaryEmbedding`, `_cos_cache` and `_sin_cache` are not recorded correctly. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: Debonex <719893090@qq.com>	2026-01-05 20:12:41 +08:00
meihanc	16b1bee804	[bugfix] fix test_camem failed with triton-ascend (#5492 ) ### What this PR does / why we need it? This fixes a bug that occurred when running `test_camem.py` in the triton-ascend environment `NPU function error: aclrtGetMemInfo(ACL_HBM_MEM, &device_free, &device_total)` - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-05 20:10:54 +08:00
ZT-AIA	58e8d19c35	[UT]add triton ops ut : test_fused_qkvzba_split_reshape_cat (#5474 ) ### What this PR does / why we need it? [UT]add triton ops ut : test_fused_qkvzba_split_reshape_cat ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? pytest -sv tests/ut/ops/test_fused_qkvzba_split_reshape_cat.py - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: ZT-AIA <1028681969@qq.com>	2026-01-05 20:05:07 +08:00
Li Wang	1e6228d8cd	[CI] Download models from ms (#5405 ) ### What this PR does / why we need it? Add a new workflow to allow developers submit a pull_request downloading new models for CI cache - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-05 19:59:13 +08:00
huqi	2d22700d69	Docs: Add A3 Docker image guidance for Atlas A3 machines (#5256 ) Fixes #3386 - Update Qwen3-30B-A3B.md to use A3-specific image tag - Update Qwen3-Dense.md to provide both A2 and A3 image options - Update Qwen3-Next.md to use A3-specific image for Atlas A3 environments Previously, documentation only mentioned A2 images (vllm-ascend:version) but Atlas A3 machines require A3-specific images (vllm-ascend:version-a3). This change ensures users select the correct image for their hardware. 🤖 Generated with [Claude Code](https://claude.com/claude-code) - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: hu-qi <huqi1024@gmail.com> Co-authored-by: Claude <noreply@anthropic.com>	2026-01-05 19:42:42 +08:00
huqi	9d8b4c8d9d	[Doc] Add NNAL installation guide and requirements (#5235 ) Fixes #2727 - Add NNAL to the software requirements table with version information - Add note explaining that prebuilt Docker images include NNAL - Add warning message for manual installation when encountering libatb.so errors - Improve visibility of NNAL installation instructions to prevent runtime errors This addresses the issue where users encounter 'libatb.so not found' errors due to missing NNAL installation in their environment. ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: menogrey <1299267905@qq.com> Signed-off-by: hu-qi <huqi1024@gmail.com> Co-authored-by: zhangyiming <34808445+menogrey@users.noreply.github.com>	2026-01-05 19:40:26 +08:00
frankie	ec3563334b	Add the requirement of arctic-inference which speculative decoding with suffix_decode (#5045 ) ### Does this PR introduce _any_ user-facing change? The suffix speculative decoding method relies on the `arctic-inference` library. This PR adds it to the requirements so that the feature works by default. ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: frankie-ys <yongshengwang@cmbchina.com> Signed-off-by: frankie <wangyongsheng686@gmail.com>	2026-01-05 19:15:49 +08:00
Icey	e7b623b363	[BugFix][Fusion] Fix graph fusion failure problem (#5253 ) Currently, the vllm pull request (https://github.com/vllm-project/vllm/pull/24252) is causing operator fusion to fail. This issue was previously fixed by patching the backend. The root cause has been identified, and the problem can be resolved with this pull request. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2026-01-05 17:49:09 +08:00
wujinyuan1	4a3663327b	[Refactor]7/N Extract common code to common_cp (#5490 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason: Eliminate duplicate code in two files (mla_cp.py, attention_cp.py) by extracting it to common_cp.py. vLLM version: 0.13.0rc3 vLLM main: `ad32e3e19c` vLLM version: release/v0.13.0 vLLM main: `5fbfa8d9ef` - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: wujinyuan1 <wjy9595@qq.com> Signed-off-by: wujinyuan1 <wujinyuan1@huawei.com> Co-authored-by: wujinyuan1 <wjy9595@qq.com>	2026-01-05 17:41:12 +08:00
Yizhou	755caeb06e	[Feat][Spec] Optimize token index calculation in spec decode with Triton kernel (#5356 ) ### What this PR does / why we need it? Replace multiple PyTorch operations with a fused Triton kernel to determine token indices for sampling during speculative decoding. This reduces kernel launch overhead and memory traffic, improving overall performance on Ascend hardware. --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2026-01-05 16:51:29 +08:00
daniel	8ffe3f5d78	feat: implement high-performance Triton kernels for rejection sampling: optimization for rejection_random_sample_kernel (#5259 ) ### What this PR does / why we need it? This PR introduces an optimized Triton implementation of rejection_random_sample_kernel that outperforms the existing Triton implementation. The new Triton kernel maintains full functional accuracy while delivering significant performance improvements across various batch sizes and MTP configurations. ### Does this PR introduce _any_ user-facing change? Yes, this PR modifies rejection_sampler.py to use optimized Triton kernels: rejection_random_sample_kernel is modified and optimized ### How was this patch tested? Performance benchmark results: Batch Size \| MTP \| original implementation (us) \| optimized version (us) -- \| -- \| -- \| -- 1 \| 1 \| 2.934 \| 3.64 8 \| 1 \| 4.467 \| 4 32 \| 1 \| 6.98 \| 4.54 64 \| 1 \| 11.087 \| 6.42 128 \| 1 \| 13.414 \| 7.84 256 \| 1 \| 19.66 \| 8.487 512 \| 1 \| 39.908 \| 11.62 1024 \| 1 \| 81.781 \| 18.16 2048 \| 1 \| 137.923 \| 32.934 1 \| 2 \| 3.4 \| 4.02 8 \| 2 \| 3.74 \| 4.24 32 \| 2 \| 6.373 \| 7.394 64 \| 2 \| 9.747 \| 6.46 128 \| 2 \| 12.98 \| 7.76 256 \| 2 \| 20.834 \| 9.787 512 \| 2 \| 39.314 \| 13.56 1024 \| 2 \| 83.135 \| 22.387 2048 \| 2 \| 157.563 \| 40.607 - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: 1024daniel <xxltju324@gmail.com>	2026-01-05 16:03:02 +08:00
Trunrain	91bf524364	[BugFix][kernel] fix matmul_allreduce_add_rmsnorm_kernel (#5335 ) ### What this PR does / why we need it? fix matmul_allreduce_add_rmsnorm_kernel, add hccl Init, SetCcTiling interface test case use multicard-4 ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? pytest -sv tests/e2e/nightly/ops/test_matmul_allreduce_add_rmsnorm.py multicard-4 pass https://github.com/vllm-project/vllm-ascend/actions/runs/20502630658/job/58914474652?pr=5335 - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` Signed-off-by: tongrunze <t00574058@china.huawei.com> Co-authored-by: tongrunze <t00574058@china.huawei.com>	2026-01-05 15:19:54 +08:00
zhangmuzhi_yuwan	6c1a685b30	[Doc] add new doc for mooncake: PD-Colocated cross-node multi-instance validation of Mooncake's KV Cache reuse and performance. (#5415 ) ### What this PR does / why we need it? This documentation provides a comprehensive technical guide for deploying vLLM-Ascend using a Prefill-Decode (PD) colocated architecture integrated with Mooncake, a high-performance distributed KV Cache transfer engine. As Large Language Model (LLM) serving scales, managing KV Cache efficiently across distributed nodes is essential for reducing latency and optimizing hardware utilization. The tutorial focuses on a multi-instance setup using Huawei Atlas 800T A2 nodes. By leveraging Mooncake’s distributed memory pooling, vLLM instances can achieve seamless cross-node KV Cache reuse. This capability allows an instance to retrieve precomputed cache from a remote node's DRAM via high-speed RoCE networks, effectively bypassing redundant prefill computations. ### Does this PR introduce _any_ user-facing change? No - vLLM version: release/v0.13.0 - vLLM main: `0bfd7484fd` --------- Signed-off-by: zhangmuzhibangde <1037640609@qq.com> Signed-off-by: zhangmuzhi_yuwan <1037640609@qq.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2026-01-05 14:19:57 +08:00
weiguihua2	549be94397	[Bugfix] fix pcp + eplb error (#5561 ) ### What this PR does / why we need it? Fix bugs in the PCP overlay feature: 1. Fix the bug related to PCP and EPLB overlap by including the PCP size in the world_size calculation. 2. In the PCP pooling scenario, add a prompt for setting cp_kv_cache_interleave_size. - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-01-05 14:08:11 +08:00
lilinsiman	52863c4165	[Refactor][EAGLE] 2/N: load model and generate token (#5437 ) ### What this PR does / why we need it? 1. Refactor eagle and mtp function: load_model and generate_token_ids 2. Remove redundant code in mtp and eagle file 3. Refactor the UT of file 2/N of Refactor and merge mtp and eagle Relational RFC: https://github.com/vllm-project/vllm-ascend/issues/5467 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut and tests - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2026-01-05 14:07:54 +08:00
pichangping	50e7934415	MLA prefill performance optimization (#5456 ) ### What this PR does / why we need it? Since the performance of the _npu_ring_mla operator deteriorates in long-sequence scenarios, the long sequence is split into shorter sequences for input to improve performance. - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: pichangping <1337510399@qq.com>	2026-01-05 11:41:59 +08:00
L4	c23cf30709	[Doc] eval-type not support service but server (#2920 ) ### What this PR does / why we need it? fix wrong eval-type in accuracy doc - vLLM version: v0.10.2 - vLLM main: `fec347dee1` Signed-off-by: root <root@liaolile-laptop.localdomain> Co-authored-by: root <root@liaolile-laptop.localdomain>	2026-01-05 11:17:39 +08:00
Magnus	2b5536362a	[CI] skip xlite-decode-only e2e test (#5407 ) ### What this PR does / why we need it? skip xlite-decode-only e2e test, since it's unstable - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` Signed-off-by: changdawei1 <changdawei3@huawei.com>	2026-01-05 11:05:26 +08:00
zhangxinyuehfad	a099b994b3	[Doc] update supported models (#5379 ) ### What this PR does / why we need it? 1. update supported models: Llama2 & Kimi-K2-Thinking & ERNIE-4.5 & Qwen3-Omni 2. update Supported Hardware - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2026-01-05 09:21:52 +08:00
panchao-hub	42774df744	[Bugfix] Fix weight transpose in RL scenarios (#5567 ) ### What this PR does / why we need it? In the training-inference switching scenario, there is no need to resume the model weights during KV cache resumption, as this would lead to format mismatch. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: p00465316 <panchao13@huawei.com> Co-authored-by: p00465316 <panchao13@huawei.com>	2026-01-05 09:17:26 +08:00
LookAround0301	d25a2c20c5	[Bugfix] Fix chunk prefill bug for long_sequence feature (#5444 ) ### What this PR does / why we need it? Fix chunk prefill bug for long_sequence feature When there are two requests with chunk prefill enabled in the long-sequence scenario, if one request has only 1 token during scheduling, it will be identified as a decode request and trigger an error. This PR fixes the issue. Closes: https://github.com/vllm-project/vllm-ascend/issues/5445 - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: LookAround <lixushi@huawei.com>	2026-01-05 09:16:36 +08:00
meihanc	fbb93ad8f2	[bugfix]update bishengir source envs (#5582 ) ### What this PR does / why we need it? Due to the update of the Bisheng version's installation path, the corresponding source path in the environment variables needs to be updated. - vLLM version: v0.13.0 - vLLM main: `7157596103` --------- Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-05 09:13:40 +08:00
InSec	7cf65d0581	[Doc]modify the quantization user guide and add a quantization adaptation developer guide (#5554 ) ### What this PR does / why we need it? This PR makes the following modifications: 1. Delete `user_guide/feature_guide/quantization-llm-compressor.md` and merge it into `user_guide/feature_guide/quantization.md`. 2. Update the content of `user_guide/feature_guide/quantization.md`. 3. Add the guide `developer_guide/feature_guide/quantization.md` on the adaptation of quantization algorithms and quantized models. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` --------- Signed-off-by: IncSec <1790766300@qq.com> Signed-off-by: InSec <1790766300@qq.com>	2026-01-05 09:12:11 +08:00
Qiu	96775a27a8	[refactor](UT,PCP,DCP) refactor pcp&dcp patches in UTs (#5505 ) ### What this PR does / why we need it? Refactor PCP & DCP patches in UTs: Merge and reuse communication groups and communication function patches to reduce code duplication. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-05 09:05:45 +08:00
baxingpiaochong	46c2fc6a3c	[KVPOOL]decode save kvcache (#5168 ) ### What this PR does / why we need it? kvpool decode save kvcache now only support mla ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: baxingpiaochong <771405853@qq.com> Co-authored-by: Chao Lei <leichao139636@163.com>	2026-01-04 22:22:01 +08:00
wangqiankun13	350b95efcf	[BugFix]Disable dispatch_gmm_combine_decode operator when mtp drafter model uses non-w8a8 while main model uses w8a8, or drafter model is eagle series (#5293 ) ### What this PR does / why we need it? Disable the dispatch_gmm_combine_decode operator when the mtp drafter model uses non-w8a8 while the main model uses w8a8, or when the drafter model is from the eagle series. For more information about this operator, refer to the RFC: https://github.com/vllm-project/vllm-ascend/issues/5476 - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangqiankun <wangqiankun13@huawei.com>	2026-01-04 17:51:28 +08:00
Qiu	f15dc3fa02	[bugfix](pcp) expand max_num_tokens for pcp pad (#5478 ) ### What this PR does / why we need it? Since the [PR](https://github.com/vllm-project/vllm/pull/28988) for PCP modifications to `GPUModelRunner` has not yet been merged into vLLM, this PR temporarily requires adjustments to certain buffer sizes. These changes can be reverted once the original [PR](https://github.com/vllm-project/vllm/pull/28988) is merged. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.13.0 - vLLM main: `5326c89803` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-04 17:25:40 +08:00
Cao Yi	749c4a3deb	[Doc] Fix typo in ASCEND_RT_VISIBLE_DEVICES (#5581 ) Fixed a typo in the environment variable name. `ASCEBD_RT_VISIBLE_DEVICES` -> `ASCEND_RT_VISIBLE_DEVICES` Fixes #5580 Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2026-01-04 17:01:02 +08:00
lidenghui1110	d462577504	[Recover] [Bugfix] support mtp kv transfer and pp partition by hand in kv transfer (#4892 ) (revert in #4981 ) (#5511 ) PR #4892 was reverted in #4981; this PR recovers it. The potential bug that breaks deepseek3.2 in the PD case will be tracked down and fixed. - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` --------- Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2026-01-04 16:49:33 +08:00
Qiu	7c210225a2	[Perf][PCP][DCP] add multi-stream for GQA to enable computation-communication overlap (#5382 ) ### What this PR does / why we need it? This PR adds multi-stream for GQA to enable computation-communication overlap. For chunked prefill, we reduce TTFT by approximately 4%. ### Does this PR introduce _any_ user-facing change? No - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-04 16:33:18 +08:00
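
The context-length error quoted in commit 129ba9fe1b above comes down to simple arithmetic: the prompt tokens plus the requested completion tokens must fit within the model's maximum context length. The sketch below only illustrates that arithmetic with the numbers from the quoted error message. The function `fits_context` is hypothetical and is not vLLM's actual validation code, and the exact comparison the server performs is an assumption.

```python
def fits_context(prompt_tokens: int, max_tokens: int, max_model_len: int) -> bool:
    """Hypothetical check: does prompt + requested completion fit in the context window?"""
    return prompt_tokens + max_tokens <= max_model_len


# Numbers taken from the error message quoted in commit 129ba9fe1b.
prompt_tokens = 128
max_tokens = 32768  # max-out-len in the smoke test script

# With max-model-len equal to max-out-len there is no room left for the prompt.
print(fits_context(prompt_tokens, max_tokens, max_model_len=32768))

# Raising max-model-len by at least the prompt length makes the request fit.
print(fits_context(prompt_tokens, max_tokens, max_model_len=32768 + prompt_tokens))
```

Under this assumption, any max-model-len of at least 32896 (32768 + 128) would accept the request from the failing test case.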


2013 Commits