xc-llm-ascend

Author	SHA1	Message	Date
Mercykid-bash	29e2f9a43e	Bugfix: Align expert map shapes with redundant experts in EPLB adjustment (#5285 ) #### Overview This PR fixes a shape mismatch bug between `expert_placement_map` and `log2phy_expert_map` when redundant experts are enabled in the vLLM-Ascend platform. The issue occurred during the initialization of expert maps and their updates via EPLB (Expert Load Balancer) adjustment, leading to potential tensor shape errors and incorrect expert routing in distributed MoE deployments. #### Key Changes 1. Unify expert map shape calculation logic - Ensure the shape of `expert_placement_map` and `log2phy_expert_map` strictly aligns with the total number of experts (including redundant experts) during initialization. - Update the shape adjustment logic in EPLB dynamic update process to match the initial expert map dimensions. 2. Add shape consistency checks - Add assertion statements to verify the shape consistency of the two maps after initialization and EPLB adjustment, preventing silent shape mismatches in subsequent operations. #### Impact - Resolves tensor shape errors when using redundant experts with EPLB on Ascend platform. - Ensures correct expert routing and load balancing for MoE models with redundant expert configurations. - No breaking changes to existing functionality; compatible with non-redundant expert deployments. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Che Ruan <cr623@ic.ac.uk> Signed-off-by: shenchuxiaofugui <1311027364@qq.com> Co-authored-by: Che Ruan <cr623@ic.ac.uk> Co-authored-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-06 17:22:36 +08:00
Shanshan Shen	b94d589769	[MM][Bugfix] Update `hf_config` to `hf_text_config` (#5319 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm-ascend/pull/5205, update `hf_config` to `hf_text_config`. Find more details at https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3675417534 and https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3677920872. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-06 16:41:39 +08:00
ZixuanWang	8eae949d11	Revert "[Feat] enable hierarchical mc2 ops on A2 by default (#5545 )" (#5611 ) This reverts commit `fb9fdcdbe4`. ### What this PR does / why we need it? The reverted PR breaks the smoke test: it leads to the error `aclnnNeScalar:Kernel Run failed. opType: 25, NotEqual launch failed for NotEqual, errno:361001`. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: zxwang <1476209578@qq.com>	2026-01-05 22:39:05 +08:00
Angazenn	11e75494b1	[TRITON][TEST]Add nightly test for triton split_qkv_rmsnorm_rope (#5267 ) ### What this PR does / why we need it? Add nightly test for triton split_rmsnorm_rope ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-01-05 21:35:37 +08:00
Debonet	d86021f7b4	[Bugfix] record cos and sin cache in AscendRotaryEmbedding (#5516 ) ### What this PR does / why we need it? When models like [Moonlight](https://modelscope.cn/models/moonshotai/Moonlight-16B-A3B-Instruct) (which use MLA but have no `rope_scaling` in config.json) invoke `AscendRotaryEmbedding`, `_cos_cache` and `_sin_cache` are not recorded correctly. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: Debonex <719893090@qq.com>	2026-01-05 20:12:41 +08:00
meihanc	16b1bee804	[bugfix] fix test_camem failed with triton-ascend (#5492 ) ### What this PR does / why we need it? This fixes a bug that occurred when running `test_camem.py` in the triton-ascend environment `NPU function error: aclrtGetMemInfo(ACL_HBM_MEM, &device_free, &device_total)` - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-05 20:10:54 +08:00
Yizhou	755caeb06e	[Feat][Spec] Optimize token index calculation in spec decode with Triton kernel (#5356 ) ### What this PR does / why we need it? Replace multiple PyTorch operations with a fused Triton kernel to determine token indices for sampling during speculative decoding. This reduces kernel launch overhead and memory traffic, improving overall performance on Ascend hardware. --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2026-01-05 16:51:29 +08:00
daniel	8ffe3f5d78	feat: implement high-performance Triton kernels for rejection sampling: optimization for rejection_random_sample_kernel (#5259 ) ### What this PR does / why we need it? This PR introduces an optimized Triton implementation of `rejection_random_sample_kernel` intended to outperform the existing Triton implementation. The new kernel maintains full functional accuracy while delivering significant performance improvements across various batch sizes and MTP configurations. ### Does this PR introduce _any_ user-facing change? Yes, this PR modifies rejection_sampler.py to use the optimized Triton kernel: `rejection_random_sample_kernel` is modified and optimized. ### How was this patch tested? Performance benchmark results:

| Batch size | MTP | Original implementation (us) | Optimized version (us) |
| -- | -- | -- | -- |
| 1 | 1 | 2.934 | 3.64 |
| 8 | 1 | 4.467 | 4 |
| 32 | 1 | 6.98 | 4.54 |
| 64 | 1 | 11.087 | 6.42 |
| 128 | 1 | 13.414 | 7.84 |
| 256 | 1 | 19.66 | 8.487 |
| 512 | 1 | 39.908 | 11.62 |
| 1024 | 1 | 81.781 | 18.16 |
| 2048 | 1 | 137.923 | 32.934 |
| 1 | 2 | 3.4 | 4.02 |
| 8 | 2 | 3.74 | 4.24 |
| 32 | 2 | 6.373 | 7.394 |
| 64 | 2 | 9.747 | 6.46 |
| 128 | 2 | 12.98 | 7.76 |
| 256 | 2 | 20.834 | 9.787 |
| 512 | 2 | 39.314 | 13.56 |
| 1024 | 2 | 83.135 | 22.387 |
| 2048 | 2 | 157.563 | 40.607 |

- vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: 1024daniel <xxltju324@gmail.com>	2026-01-05 16:03:02 +08:00
hwhaokun	fb9fdcdbe4	[Feat] enable hierarchical mc2 ops on A2 by default (#5545 ) ### What this PR does / why we need it? Previously, it was necessary to set the environment variables HCCL_INTRA_PCIE_ENABLE=1 and HCCL_INTRA_ROCE_ENABLE=0. This PR enables hierarchical MC2 operations on A2 by default. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: hwhaokun <haokun0405@163.com>	2026-01-04 14:44:20 +08:00
Chu Yuelin	d07d8a4535	[Model] Add LongCat-Flash (#3833 ) ### What this PR does / why we need it? Add LongCat-Flash support. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed - vLLM version: v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: chuyuelin <923822139@qq.com> Co-authored-by: chuyuelin <chuyuelin1@huawei.com>	2025-12-31 17:06:55 +08:00
Jade Zheng	7d5242faca	[Refactor] Formatting output types related to FuseMoE (#5481 ) Currently in the Fused MoE module, functions of classes like MoECommMethod and MoETokenDispatcher output data in dictionary or tuple format, which hampers code maintainability, readability, and extensibility. This PR introduces dataclasses for these key output types to address these issues. - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-31 14:24:37 +08:00
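The refactor described in 7d5242faca can be illustrated with a minimal sketch. The class name `DispatchResult`, its fields, and both helper functions are hypothetical and chosen only for illustration; they are not taken from the vllm-ascend code.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class DispatchResult:
    """Hypothetical typed result of a token-dispatch step (illustrative names)."""
    hidden_states: Any
    group_list: List[int]
    dynamic_scale: Optional[Any] = None


def dispatch_as_dict(tokens: List[int]) -> Dict[str, Any]:
    # Dictionary output: callers must remember string keys, and a
    # misspelled key only fails at runtime.
    return {"hidden_states": tokens, "group_list": [len(tokens)]}


def dispatch_as_dataclass(tokens: List[int]) -> DispatchResult:
    # Dataclass output: fields are named, typed, and visible to tooling.
    return DispatchResult(hidden_states=tokens, group_list=[len(tokens)])
```

With the dataclass form, a caller reads `result.group_list` instead of `result["group_list"]`, so type checkers and editors can catch a wrong field name before the code runs.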
LI SHENGYONG	bdc721d35a	[smoke][bugfix] moe_init_routing_v2 active_expert_range use int type (#5521 ) ### What this PR does / why we need it? The float kernel of MOE_init_routing_v2 in the dispatch allgather operation does not accept a tensor for active_expert_range; it only supports int. PR #5311 unified the variables `local_num_experts` and `self.local_num_experts` by using `self.local_num_experts` consistently, which caused the integer parameter to be passed as a tensor. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? gsm8k \| exact_match,strict-match: ground_truth=0.89 \| measured=0.8939 \| success=✅ gsm8k \| exact_match,flexible-extract: ground_truth=0.85 \| measured=0.856 \| success=✅ ceval-valid \| acc,none: ground_truth=0.84 \| measured=0.8373 \| success=✅ Model Parameters: {'pretrained': 'Qwen/Qwen3-30B-A3B', 'tensor_parallel_size': 2, 'dtype': 'auto', 'trust_remote_code': False, 'max_model_len': 4096, 'gpu_memory_utilization': 0.6, 'enable_expert_parallel': True} - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2025-12-31 09:19:04 +08:00
zzzzwwjj	71f729a661	Revert "moe_gating_top_k" (#5512 ) Reverts vllm-project/vllm-ascend#5271 It breaks e2e test - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1`	2025-12-30 15:05:47 +08:00
ZCG12345	45c3c279e2	moe_gating_top_k (#5271 ) 1. What this PR does / why we need it? This PR supports the moe_gating_top_k operator, which enables post-positioned renormalization (renorm) on the basis of softmax. 2. Does this PR introduce any user-facing change? No user-facing changes are required. 3. How was this patch tested? This patch was tested with the test_npu_moe_gating_top_k test case. vLLM version: release/v0.13.0 vLLM main: `ad32e3e19c` --------- Signed-off-by: ZCG12345 <2097562023@qq.com> Signed-off-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com> Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>	2025-12-30 09:28:01 +08:00
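The "softmax followed by renormalization" gating mentioned in 45c3c279e2 can be sketched in plain Python. This is a reference-style illustration of the general technique (softmax, top-k selection, then renormalizing the selected weights), not the NPU operator itself; the function name and signature are assumptions.

```python
import math
from typing import List, Tuple


def gating_top_k(logits: List[float], k: int,
                 renorm: bool = True) -> Tuple[List[int], List[float]]:
    """Pick the top-k experts from router logits (illustrative helper)."""
    # Numerically stable softmax over all experts.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Select the k experts with the highest probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    # Post-positioned renorm: rescale the selected weights to sum to 1.
    if renorm:
        s = sum(weights)
        weights = [w / s for w in weights]
    return top, weights
```

Without the renorm step the selected weights sum to less than 1 whenever any unselected expert has non-zero probability; with it, the routing weights of the chosen experts always sum to 1.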
whx	28b7614322	[Refactor][Triton] Move reject sample triton kernels into ops/triton (#5324 ) ### What this PR does / why we need it? This PR moves reject sample related triton kernels into `ops/triton`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with existing test. - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-12-29 16:15:41 +08:00
Ronald	e7e1a7dc05	[Feature] support eager mode in model runner v2 (#5210 ) ### What this PR does / why we need it? #5051 only implemented a basic framework for model runner v2, and some bugs still affect end-to-end functionality. This PR aims to enable basic functionality. Model runner v2 plans: https://github.com/vllm-project/vllm-ascend/issues/5208 - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-12-29 15:28:34 +08:00
LI SHENGYONG	f81cf694b2	[EPLB][refactor] Modification of the initialization logic for expert_map and log2phy (depends on PR #5285) (#5311 ) ### What this PR does / why we need it? Unify the loading logic for expert_map and log2phy. 1. The map generated when redundant experts are enabled is incorrect. The community map-generation function only accepts the number of global experts. When we pass in the number of logical experts plus redundant experts, the local expert IDs of the last card index expert IDs that do not exist. Now we ensure that every index points to an existing expert ID and that each expert can be accessed. Moreover, when redundant experts are not enabled, the output of our function remains consistent with the community's function. 2. The map we generate is based on the number of physical experts, but in reality we only need the number of logical experts. Later on, we will need to pad it accordingly, so we can simply generate a map with the length of the logical experts. 3. Unify the initialization logic across different scenarios and simplify the code for fused_moe. Before refactoring - map path is not None: expert map: get_rank_placement_map from _'expert_load_balancer.py'_, maintains the map for all ranks and all layers. log2phy: get_rank_log2phy_map from _'expert_load_balancer.py'_, maintains the map for all ranks and all layers. - map path is None: expert map: determine_expert_map from '_vllm.laye_r'. The function does not support the redundant experts of vllm-ascend. log2phy: determine_default_log2phy_map from _'eplb_utils.py'_. The function does not support the redundant experts of vllm-ascend. Refactoring eplb_utils.py     init_eplb_config          generate placement          generate expert map          generate log2phy ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? 
Expert Mapping Test Generation: ep size: 16, num of experts: 256, num of redundant experts: 16 +++++++++++++++++++++++++++++++++++++++++ Expert Mapping (Non-1 indicates the expert responsible for this rank) for Rank 15: vllm map： [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16] +++++++++++++++++++++++++++++++++++++++++ Improved map： [16 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15] Expert Mapping Test Generation: ep size: 16, num of experts: 256, num of redundant experts: 0 +++++++++++++++++++++++++++++++++++++++++ Expert Mapping (Non-1 indicates the expert responsible for this rank) for Rank 15: vllm map： [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15] +++++++++++++++++++++++++++++++++++++++ Improved map： [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15] dsr1 baselie： \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| gsm8k-lite \| 7cd45e \| accuracy \| gen \| 100.00 \| dsr1 eplb： \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| gsm8k-lite \| 7cd45e \| accuracy \| gen \| 100.00 \| - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: shenchuxiaofugui <1311027364@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-29 09:26:14 +08:00
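The "improved map" in the test output of f81cf694b2 can be reproduced with a small sketch. The map has the length of the logical experts, and each rank's redundant slot points at a real logical expert. The function name and the wrap-around rule used to pick the redundant expert are assumptions chosen to match the rank 15 output shown above; the actual `init_eplb_config` logic may place redundant experts differently.

```python
from typing import List


def build_expert_map(rank: int, ep_size: int,
                     num_logical: int, num_redundant: int) -> List[int]:
    """Map logical expert id -> local slot on `rank`, or -1 if not hosted.

    Hypothetical helper; assumes experts divide evenly across ranks.
    """
    assert num_logical % ep_size == 0 and num_redundant % ep_size == 0
    base = num_logical // ep_size     # logical experts owned per rank
    extra = num_redundant // ep_size  # redundant copies hosted per rank
    expert_map = [-1] * num_logical   # length of the LOGICAL experts only
    start = rank * base
    for slot in range(base):
        expert_map[start + slot] = slot
    # ASSUMPTION: redundant slots replicate the experts that follow this
    # rank's block, wrapping around so the id always exists.
    for j in range(extra):
        logical_id = (start + base + j) % num_logical
        expert_map[logical_id] = base + j
    return expert_map
```

For `ep_size=16`, `num_logical=256`, `num_redundant=16`, rank 15 owns experts 240 to 255 as local slots 0 to 15 and hosts a copy of expert 0 in local slot 16, which matches the shape of the improved map above. With `num_redundant=0` the result is the plain contiguous map.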
weijinqian0	dbe4c338f2	[Refactor] cache cos/sin in mla & remove parameter model in builder. (#5277 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 1. Cache cos/sin in mla 2. AttentionBuilder inherits from the original class of vllm. version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-12-28 10:35:07 +08:00
Li Wang	58adf7c8ac	[Bugfix] Correctly handle the output shape in multimodal attention (#5443 ) ### What this PR does / why we need it? Fix https://github.com/vllm-project/vllm-ascend/issues/5297. In the `AscendMMEncoderAttention` forward pass, the output shape should stay consistent with the input shape. - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-27 18:42:46 +08:00
realliujiaxu	09f71c14a6	Revert "[feat] enable hierarchical mc2 ops on A2 by default (#5300 )" (#5434 ) We'll release 0.13.0 soon. The main branch is frozen. Let's revert the newest change and redo it once 0.13.0 is released. - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-12-27 17:06:58 +08:00
hwhaokun	12da9f9460	[feat] enable hierarchical mc2 ops on A2 by default (#5300 ) ### What this PR does / why we need it? Previously, it was necessary to set the environment variables HCCL_INTRA_PCIE_ENABLE=1 and HCCL_INTRA_ROCE_ENABLE=0. This PR enables hierarchical MC2 operations on A2 by default. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: hwhaokun <haokun0405@163.com> Co-authored-by: realliujiaxu <realliujiaxu@163.com>	2025-12-27 15:45:25 +08:00
LeeWenquan	7685d0c239	rollback causal_conv1d_fn to torch ops & update qwen3Next doc (#5391 ) ### What this PR does / why we need it? Roll back the causal_conv1d_fn op from the Triton version to the torch version to fix hanging issues, and update the Qwen3Next doc. - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2025-12-26 19:57:38 +08:00
XiaoxinWang	320877d488	move contiguous in fused_sigmoid_gating_delta_rule_update to model_runner_v1 (#5274 ) ### What this PR does / why we need it? The contiguous() operation temporarily increases memory usage, leading to higher peak GPU memory, which necessitates reducing gpu_memory_utilization. However, making tensors contiguous in modelrunnerv1 significantly enhances operator performance, resulting in greater end-to-end model benefits despite the memory overhead. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-12-26 09:19:47 +08:00
Qi Mao	7372225bcb	[FIX] Update _causal_conv1d_update_kernel for Efficient Conv State Handling on NPU (#5322 ) Description: This PR updates the implementation of the Triton operator for deployment on NPU devices, focusing on optimizing grid size and memory handling based on NPU limitations. Design Plan: Grid Calculation: The grid size is now dynamically calculated by batch and dim to ensure that the number of programs executed does not exceed the NPU's vector core capacity. This ensures optimal parallelism without overloading the hardware. Data Block Handling: Due to the limited on-chip memory (UB) on Ascend NPUs, this implementation splits large data into smaller chunks of 32k or less per block. The kernel performs a for-loop to process the data in these smaller chunks, minimizing memory usage and avoiding potential overflows. Changes Compared to GPU Implementation: Grid and Block Sizing: For GPU, the grid and block size were determined based on available thread counts and memory size. In contrast, the NPU version dynamically adjusts these parameters using B_TILE and BLOCK_N to optimize for NPU’s architecture. Memory Chunking: The original GPU implementation did not require chunking due to the higher available memory and processing capacity. For the NPU, data is divided into smaller chunks (32k or smaller) to comply with memory constraints on the device. The kernel has been modified to handle this chunking mechanism inside a loop. Optimized Thread Usage: The NPU implementation takes into account the hardware-specific thread limit (24 threads per vector core), ensuring that the number of active programs is aligned with the NPU's vector core count, avoiding over-subscription that would lead to serial processing. This PR ensures that the operator functions efficiently on Ascend NPU, considering hardware limitations while maintaining the same functionality and input parameters as the GPU implementation. 
- vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: maoxx241 <maomaoyu870@gmail.com>	2025-12-26 09:12:30 +08:00
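The grid capping and chunking described in 7372225bcb can be sketched on the host side. This is a rough illustration only: the function names, the meaning of `block_n`, and treating "32k" as an element count are all assumptions, and the real kernel computes its launch parameters with `B_TILE` and `BLOCK_N` in ways not shown in the commit message.

```python
from typing import List, Tuple

# ASSUMPTION: the "32k or less" budget is treated as an element count here.
MAX_CHUNK = 32 * 1024


def plan_grid(batch: int, dim: int, block_n: int,
              num_vector_cores: int) -> Tuple[int, int]:
    """Return (programs, tiles_per_program), capping programs at the core count."""
    assert batch > 0 and dim > 0 and block_n > 0 and num_vector_cores > 0
    tiles = batch * -(-dim // block_n)           # ceil(dim / block_n) per batch row
    programs = min(tiles, num_vector_cores)      # never launch more than the cores
    tiles_per_program = -(-tiles // programs)    # each program loops over its share
    return programs, tiles_per_program


def split_chunks(n: int, max_chunk: int = MAX_CHUNK) -> List[Tuple[int, int]]:
    """Split [0, n) into half-open ranges of at most max_chunk elements."""
    return [(s, min(s + max_chunk, n)) for s in range(0, n, max_chunk)]
```

Capping the grid avoids over-subscribing the vector cores, and the chunk list corresponds to the for-loop the kernel runs so that each iteration fits in on-chip memory.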
wangxiyuan	2ae0bad96d	Remove VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE (#5272 ) `VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE` is only used together with `VLLM_ASCEND_ENABLE_PREFETCH_MLP`, which makes it redundant. This PR removes it. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-25 11:09:56 +08:00
Wang Kunpeng	13cd6362c6	[bugfix] fix Error 'ValueError: Duplicate layer name' (#5280 ) ### What this PR does / why we need it? When matmul_and_reduce is enabled, the prefix attribute is required. However, in some models, the prefix is not passed correctly, causing errors when starting the service. The issue of incorrect prefix passing will be fixed in vLLM in the future. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-12-25 10:43:24 +08:00
Ascendyh	a90482803d	[Kernel] add l2norm triton kernel (#4595 ) ### What this PR does / why we need it? This pull request introduces an L2 normalization kernel implemented in Triton, specifically optimized for Ascend NPUs. ### Does this PR introduce _any_ user-facing change? No, this PR does not introduce any user-facing changes. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: Ascendyh <hw7osiris@outlook.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-25 06:06:18 +08:00
Nengjun Ma	42c989a437	Update vllm pin to 12.24 (#5307 ) ### What this PR does / why we need it? Fix vllm break in the pr: 1. [Add MiMo-V2-Flash support] (https://github.com/vllm-project/vllm/pull/30836) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Co-authored-by: zxwang [1476209578@qq.com](mailto:1476209578@qq.com) - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` --------- Signed-off-by: leo-pony <nengjunma@outlook.com> Signed-off-by: zxwang <1476209578@qq.com> Co-authored-by: zxwang <1476209578@qq.com>	2025-12-24 17:24:31 +08:00
Nengjun Ma	3b59f20a28	update to vllm 12-19 (#5223 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? Fix vLLM breakage: 1. [Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4% TTFT improvement] (https://github.com/vllm-project/vllm/pull/29558) Solution: Add the now-necessary `all2all_backend` parameter. The only impact of this parameter on the original `set_splitting_ops_for_v1` implementation is that graph mode is disabled in `vllm` if `deepep_high_throughput` is enabled; it has no effect on the `vllm-ascend` logic. 2. [Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface ] (https://github.com/vllm-project/vllm/pull/30684) Solution: The GPU does not need to convert qkv to 3D because the GPU's flash_attention operator is compatible with both 3D and 4D (b s h d and s b ( h d)), whereas the NPU's flash_attention_unpad operator only supports 3D (s b ( h d)). Therefore, we need to introduce the reshape_qkv_to_3d operation. 4. Skip the Tencent-Hunyuan/HunyuanOCR test case, as it hits the following issue with the upgraded vLLM code: https://github.com/vllm-project/vllm-ascend/issues/5297 ### How was this patch tested? Co-authored-by: zxwang <1476209578@qq.com> - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: leo-pony <nengjunma@outlook.com> Signed-off-by: zxwang <1476209578@qq.com> Co-authored-by: zxwang <1476209578@qq.com>	2025-12-23 23:52:11 +08:00
weichen	ffe51eedd6	[Refactor][MoE] Reuse vLLM's all_reduce logic (#5189 ) ### What this PR does / why we need it? Move all_reduce logic to AscendFusedMoE.forward, reuse vLLM's logic. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: weichen <calvin_zhu0210@outlook.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-23 18:53:48 +08:00
Shanshan Shen	6c478531f8	[CustomOp] Register AscendApplyRotaryEmb CustomOp and remove related patch (#4667 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm/pull/29873, register `AscendApplyRotaryEmb` CustomOp and remove related patch. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? #### ✅ Test Qwen2.5-VL Run: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-b02c1ff3415d2462","object":"chat.completion","created":1766129265,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The text appears to be part of a logo or branding design.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":129,"completion_tokens":51,"prompt_tokens_d ``` #### ✅ Test Qwen3-VL Run: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-a3a7de5a900a9321","object":"chat.completion","created":1766129586,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is “TONGYI Qwen”.\n\n### How it looks:\n- “TONGYI” is written in uppercase letters in a bold, modern sans-serif font, colored blue.\n- “Qwen” is written in lowercase letters in a slightly thinner, elegant sans-serif font, colored dark gray.\n- The two lines of text are stacked vertically, with 
“TONG","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":112,"total_tokens":212,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-12-23 10:04:37 +08:00
Wang Kunpeng	c3a8d13ca7	[refactor] Remove unnecessary attributes from set_ascend_forward_context (#5204 ) ### What this PR does / why we need it? Remove unnecessary attributes from set_ascend_forward_context 1.prefetch_stream 2.weight_prefetch_method ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-12-23 08:49:52 +08:00
Zhu Yi Lin	12d581605b	[Triton]support swiglu_quant triton in w4a8 (#5161 ) ### What this PR does / why we need it? support swiglu_quant triton in w4a8 ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: GDzhu01 <809721801@qq.com>	2025-12-22 16:01:58 +08:00
Shanshan Shen	b84ad8c5d8	[CustomOp] Register AscendMMEncoderAttention CustomOp and remove related patch (#4750 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm/pull/30125, register `AscendMMEncoderAttention` CustomOp and remove related patch. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? ✅ Run Qwen2.5-VL: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-b4e3053f30ab2442","object":"chat.completion","created":1764922950,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the image is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The font appears to be modern and clean, with \"TONGYI\" being slightly larger than \"Qwen.\" The design includes a geometric, abstract shape on the left side of the logo, which complements the text.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":162,"completion_tokens":84,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` ✅ Run Qwen3-VL: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-97571fbda8267bd1","object":"chat.completion","created":1764923306,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is “TONGYI Qwen”.\n\n### How it looks:\n- “TONGYI” is written in uppercase letters in a bold, modern sans-serif font, colored blue.\n- “Qwen” is 
written in lowercase letters in a slightly thinner, elegant sans-serif font, colored dark gray.\n- The two lines of text are stacked vertically, with “TONG","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":112,"total_tokens":212,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: shen-shanshan <467638484@qq.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-12-22 14:32:53 +08:00
Ascendyh	b2c121637f	[task] Add fused gdn gating triton kernel (#4304 ) ### What this PR does / why we need it? This commit introduces a Triton-based fused GDN gating kernel for Ascend NPU, aimed at improving performance in the Gated Delta Net workflow. ### Does this PR introduce _any_ user-facing change? It only adds and refactors internal Triton kernels and wrappers for Ascend. These are backend implementation details. There are no new APIs, flags, CLI options, or behavior changes visible to end users. ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Ascendyh <hw7osiris@outlook.com>	2025-12-22 14:09:19 +08:00
wangqiankun13	904c18f929	[Feature]Use DispatchGmmCombineDecode operator to replace MC2(Optional) (#5040 ) ### What this PR does / why we need it? This PR adds model-side integration for the previously introduced experimental AscendC fused operator DispatchGmmCombineDecode, used in MoE decoding. The operator implementation itself was added in a prior PR[#4139 ](https://github.com/vllm-project/vllm-ascend/pull/4139). This change only adapts the model execution path to optionally use the fused operator. When the environment variable VLLM_ASCEND_ENABLE_FUSED_MC2=2 is set, the original MC2 path composed of multiple operators (A8W8 dispatch → GMM → SwiGLU → GMM → combine) might be replaced by the single fused operator DispatchGmmCombineDecode. By default, the existing multi-operator MC2 implementation is preserved. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangqiankun <wangqiankun13@huawei.com>	2025-12-21 15:23:59 +08:00
Angazenn	67a0325cf2	[BugFix]Fix wrong _cos, _sin instantiation (#5154 ) ### What this PR does / why we need it? This PR adds an additional check when creating the global `_cos` and `_sin`, so they are not created when using `mrope` or an encoder-decoder model. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: Angazenn <supperccell@163.com>	2025-12-20 22:52:50 +08:00
wangxiyuan	758d81dcb1	Drop 0.12.0 support (#5146 ) We decided to release v0.13.0 soon. So no need to support 0.12.0 now. Let's drop it. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-20 09:38:53 +08:00
XiaoxinWang	0cc3fc357f	[pref] qwen3_next add triton ops : fused_sigmoid_gating_delta_rule_update (#4818 ) ### What this PR does / why we need it? qwen3_next add fused_sigmoid_gating_delta_rule_update op which fused fused_gdn_gating+fused_recurrent_gated_delta_rule - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-12-19 16:34:11 +08:00
zzzzwwjj	cc23067f1e	[refactor] refactor weight trans nz and transpose (#4878 ) ### What this PR does / why we need it? Now `VLLM_ASCEND_ENABLE_NZ` will have three options: 0: disable nz; 1: only quant case enable nz; 2: enable nz as long as possible; And `VLLM_ASCEND_ENABLE_NZ`=1 by default. All cases are shown in the table below: \| \| W4A4 \| W4A8 \| W8A8 \| fp16/bf16 \| fp32 \| \|---\|---\|---\|---\|---\|---\| \| trans nz \| can't support nz \| trans nz by default \| trans nz by default \| trans nz when VLLM_ASCEND_ENABLE_NZ is 2 \| can't support nz \| \| transpose \| only support not transpose case \| only support transpose case \| only support transpose case \| linear: only support not transpose case<br>gmm: only support transpose case \| same to fp16/bf16 \| Some exceptional cases: 1. MLAPO op need to do some additional processing on the weights, including trans nz. If use MLAPO op, some weight will be transformed to nz forcely; 2. MLA/SFA's weight `W_UV` will be used by op `torch.ops._C_ascend.batch_matmul_transpose`, and this op can't support nz currently; ### Does this PR introduce _any_ user-facing change? Now fp16/bf16 weight will not trans nz by default. ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-12-19 14:27:24 +08:00
Chen Chen	1b47fca0e8	[bugfix] Use FUSED_MC2 MoE comm path for the op `dispatch_ffn_combine` (#5156 ) ### What this PR does / why we need it? - Renames the MoE comm enum value `MoECommType.FUSED_ALLTOALL` to `MoECommType.FUSED_MC2` and updates all call sites. - Updates `select_moe_comm_method` to optionally select `FUSED_MC2` on Ascend A3 when: - `enable_expert_parallel=True` - quantization is `w8a8_dynamic` - `EP <= 16` - `dynamic_eplb` is disabled - `is_mtp_model = False` - Replaces the old “fused all-to-all” comm implementation with `FusedMC2CommImpl`, using `TokenDispatcherWithMC2` / `PrepareAndFinalizeWithMC2` and `dispatch_ffn_combine`. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Chen Chen <0109chenchen@gmail.com>	2025-12-18 23:34:31 +08:00
Angazenn	acc3578f58	[Graph][Fusion]Add new pattern for AddRmsnormQuant with SP. (#5077 ) ### What this PR does / why we need it? 1. In addition to [#4168](https://github.com/vllm-project/vllm-ascend/pull/4168) and [#5011](https://github.com/vllm-project/vllm-ascend/pull/5011), this PR adds two more patterns for AddRmsnormQuant with SP enabled. The key difference is the insertion of an additional `maybe_all_gather_and_maybe_unpad` between `addrmsnorm` and `quantize`. 2. This PR also introduces another API, `torch.ops.vllm.quantize`, so that we pass `input_scale` and `input_scale_reciprocal` at the same time. This is because `npu_add_rms_norm_quant` and `npu_quantize` require different `div_mode`. To avoid an additional reciprocal calculation at runtime, we have to pass both of them to the quantize API. 3. Removes redundant `AscendQuantRmsnorm`. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Angazenn <supperccell@163.com>	2025-12-18 20:25:44 +08:00
zzhxxx	a74a1196c5	[Feat] Support MLP_TP feature, exclude MOE layer (#4999 ) #4257 This PR implements the dense_ffn TP of the first three layers of the deepseek model. I have refactored this PR so that very little code is needed to support this feature. This PR adds a function `is_moe_layer` to mlp_tp, which supports MLP TP in models with both MLP and MoE layers, such as deepseek or chat GLM. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: 子潜 <ziqian@U-DMKXH32D-2015.local> Co-authored-by: chenxiao <Jaychou1620@Gmail.com> Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-18 20:06:53 +08:00
AlvisGong	ef8157a5f2	fixed fused alltoall execute all reduce (#5109 ) ### What this PR does / why we need it? fixed fused alltoall execute all reduce, when moe_comm_type is MoECommType.FUSED_ALLTOALL if moe_comm_type in {MoECommType.ALLTOALL, MoECommType.MC2, MoECommType.FUSED_ALLTOALL} \ and not shared_expert_dp_enabled(): shared_out = tensor_model_parallel_all_reduce(shared_out) - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: AlvisGong <gwly0401@163.com> Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-18 15:07:40 +08:00
ZT-AIA	39fb9e7c83	qwen3_next add triton ops : fused_qkvzba_split_reshape (#4788 ) ### What this PR does / why we need it? add triton ops fused_qkvzba_split_reshape_cat for qwen3_next GatedDeltaNet ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: ZT-AIA <1028681969@qq.com> Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>	2025-12-18 11:31:04 +08:00
weichen	f0060fc822	[Pangu][MoE] Remove PanguProMoEV1 related code (#5088 ) ### What this PR does / why we need it? PanguProMoEV1 is no longer supported in vllm-ascend, remove related code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: weichen <calvin_zhu0210@outlook.com>	2025-12-17 16:14:42 +08:00
zzzzwwjj	06b82e7503	[main] rename device type (#5099 ) ### What this PR does / why we need it? Rename `_910B` to `A2`; Rename `_910_93` to `A3`; Rename `_910_95` to `A5`; - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-12-17 14:08:19 +08:00
Icey	cadfa5ddc1	[Fusion] [Graph] Add qknorm rope fusion operator (#4711 ) ### What this PR does / why we need it? This PR add `qkv_rmsnorm_rope` operator and introduces a graph fusion pass for `qknorm_rope` operations. The implementation includes a new configuration flag, a pattern matching pass using `torch._inductor.pattern_matcher`, and a custom Triton kernel for the fused operation. Co-authored-by: Angazenn [supperccell@163.com](mailto:supperccell@163.com) ### Does this PR introduce _any_ user-facing change? Yes, add new additional_config - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2025-12-17 08:53:44 +08:00
Canlin Guo	bb3a826e08	[Refactor] Remove the process patches of Qwen2.5-VL and Qwen2.5-Omni (#5035 ) ### What this PR does / why we need it? Related to #4084. Before we add the patches temporarily for making `set_forward_context` patched by `set_ascend_forward_context` in the function `_process_image_input` and `_process_video_input` of `Qwen2.5-VL` and `Qwen2.5-Omni` models. After removing these patches, I met the `AttributeError` for `ForwardContext` missing `prefetch_mlp_enabled`. So we need to add the defensive check for `prefetch_mlp_enabled`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? ``` vllm serve Qwen/Qwen2.5-VL-7B-Instruct \ --max-model-len 30000 \ --max-num-batched-tokens 50000 \ --max-num-seqs 30 \ --no-enable-prefix-caching \ --trust-remote-code \ --dtype bfloat16 ``` ``` {"id":"chatcmpl-b66d8acb76905c49","object":"chat.completion","created":1765796863,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration reads \"TONGYI Qwen.\"","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":73,"total_tokens":88,"completion_tokens":15,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-16 11:43:52 +08:00
Clorist33	d43cabc2b1	[Bugfix] Fix precision issues in moe_mlp (vllm-ascend main) (#5025 ) ### What this PR does / why we need it? Use group_list[0] instead of group_diff[0] in the function "cumsum_group_list" (moe_mlp.py), so that the conversion from cumsum to count uses the correct logic. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: tanqingshan (A) <50050625@china.huawei.com> Co-authored-by: tanqingshan (A) <50050625@china.huawei.com>	2025-12-16 08:39:54 +08:00
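
The cumsum-to-count conversion mentioned in the last entry (#5025) can be illustrated with a minimal sketch. This is not the code from moe_mlp.py: the function name `cumsum_to_count` is hypothetical, and plain Python lists stand in for the tensors the real implementation presumably uses. The point it shows is that the first count must be the first cumulative value itself, and every later count is the difference between neighbouring cumulative values.

```python
def cumsum_to_count(group_list):
    """Convert cumulative per-group totals into per-group counts.

    Hypothetical helper for illustration only; it is not the
    vllm-ascend implementation.
    """
    if not group_list:
        return []
    # The first group has no predecessor, so its count is the first
    # cumulative value itself (group_list[0]), not a difference.
    counts = [group_list[0]]
    # Every later count is the difference between adjacent cumulative values.
    for prev, cur in zip(group_list, group_list[1:]):
        counts.append(cur - prev)
    return counts


cumsum = [3, 3, 7, 12]
counts = cumsum_to_count(cumsum)
print(counts)
```

Seeding the result with a difference instead of `group_list[0]` would give the first group the wrong count, and the counts would no longer sum to the final cumulative value.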

421 Commits