xc-llm-ascend

Author	SHA1	Message	Date
Jade Zheng	38570cfeb6	[Feature] Support kv nz feature for DeepSeek decode node in disagg-prefill scenario (#3072 ) By converting the KV cache from ND to NZ format when the decode node receives it, this PR ensures that the KV NZ feature works correctly during the decoding phase in disagg-prefill scenario. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com> Co-authored-by: ghphotoframe <854746559@qq.com> Co-authored-by: alex101-ops <alex1015718386@gmail.com>	2025-12-31 14:24:04 +08:00
wangxiaochao6	a539ae753a	[feature] mooncake support pcp/dcp in common conditions (#5224 ) ### What this PR does / why we need it? 1. This PR supports complex pcp/dcp parallelism combinations on the Prefill and Decode nodes in Mooncake, such as Prefill: TP8/PCP2/DCP8 and Decode: TP8/DCP4/DP2, which were not supported before. It establishes the link mappings used to transfer KVCache between prefill and decode nodes. The main logic is in the function `_get_kv_split_metadata` in Mooncake_connector.py 2. After a decode rank pulls KVCache from a prefill rank, the decode rank sends `DONE_RECVING_MSG` to the prefill rank, and the prefill rank frees its KVCache blocks. With complex pcp/dcp parallelism, several decode ranks can pull KVCache from the same prefill rank, which would make the prefill rank free its KVCache blocks several times and could cause memory issues. This PR solves the issue by counting how many times a prefill rank will be pulled and freeing its KVCache blocks only after the last pull. The related code is in the function `run_busy_loop` in Mooncake_connector.py 3. If no decode rank pulls KVCache from a prefill rank, the first rank in the decode node sends `DONE_RECVING_MSG` so that the prefill rank frees its blocks. The related code is in the function `_send_done_signal_to_free_remote_port` in Mooncake_connector.py ### How was this patch tested? This PR was tested with many pcp/dcp parallelism combinations, and accuracy was correct in all of them. 
MLA model: Prefill node: TP8/DP2, Decode node: TP8/DP2 Prefill node: TP8/PCP2/DCP8, Decode node: TP8/DP2 Prefill node: TP8/PCP2/DCP8, Decode node: TP8/DCP4/DP2 Prefill node: TP8/PCP2/DCP4, Decode node: TP4/DCP2/DP4 Prefill node: TP8/PCP2/DCP2, Decode node: TP4/DCP4/DP4 Prefill node: TP8/PCP2, Decode node: TP4/DCP2 GQA model: Prefill node: TP8/DP2, Decode node: TP8/DP2 Prefill node: TP8/PCP2/DCP2, Decode node: TP8/DP2 Prefill node: TP8/PCP2/DCP2, Decode node: TP8/DCP2/DP2 Prefill node: TP8/PCP2/DCP2, Decode node: TP4/DP4 Prefill node: TP16/DCP2/PCP1, Decode node: TP8/DCP2/DP2 - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` - Co-author by: Daishixun dsxtsteven@sina.com --------- Signed-off-by: wangxiaochao <w00642655@china.huawei.com> Co-authored-by: wangxiaochao <w00642655@china.huawei.com> Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-31 09:53:03 +08:00
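The free-on-last-pull behaviour described in item 2 of #5224 can be illustrated with a small counter. This is only a sketch under assumptions: `PrefillBlockReleaser`, `register`, `on_done_recving`, and the `free_fn` callback are hypothetical names for illustration and are not taken from Mooncake_connector.py.

```python
from collections import defaultdict


class PrefillBlockReleaser:
    """Hypothetical helper: free a request's KV blocks only after the last expected pull."""

    def __init__(self, free_fn):
        self._free_fn = free_fn        # callback that actually frees the blocks (assumed)
        self._expected = {}            # request_id -> number of decode ranks expected to pull
        self._done = defaultdict(int)  # request_id -> done messages received so far

    def register(self, request_id, expected_pulls):
        """Record how many decode ranks will pull KV cache for this request."""
        if expected_pulls < 1:
            raise ValueError("expected_pulls must be at least 1")
        self._expected[request_id] = expected_pulls

    def on_done_recving(self, request_id):
        """Handle one done message; return True if this call freed the blocks."""
        if request_id not in self._expected:
            return False  # unknown or already freed: never free twice
        self._done[request_id] += 1
        if self._done[request_id] < self._expected[request_id]:
            return False  # other decode ranks still have to pull
        del self._expected[request_id]
        del self._done[request_id]
        self._free_fn(request_id)
        return True
```

With three expected pulls, the first two done messages leave the blocks in place and only the third triggers the free; a stray fourth message is ignored.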
Li Wang	a5ae07a5d2	[Bugfix] Fix mm_merge (#5249 ) ### What this PR does / why we need it? We should transfer the mm_embed to the dtype of input_embed before performing the in-place assignment - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-31 09:49:55 +08:00
zhenwenqi2024	5d9fde9819	[Feature] Refactor PCP &DCP related code (#5214 ) ### What this PR does / why we need it? Refactor PCP- and DCP-related code. A pcp_manager class now manages PCP and DCP in one place. As a result, much code can be deleted from model_runner, and other development work is less likely to break PCP and DCP. RFC: https://github.com/vllm-project/vllm-ascend/issues/5449 ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>	2025-12-31 09:29:57 +08:00
LI SHENGYONG	bdc721d35a	[smoke][bugfix] moe_init_routing_v2 active_expert_range use int type (#5521 ) ### What this PR does / why we need it? The float kernel of MOE_init_routing_v2 in the dispatch allgather operation does not accept a tensor for active_expert_range; it only supports int. In #5311, `self.local_num_experts` was used consistently to unify the variables `local_num_experts` and `self.local_num_experts`, which turned the subsequent integer parameter into a tensor. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? gsm8k \| exact_match,strict-match: ground_truth=0.89 \| measured=0.8939 \| success=✅ gsm8k \| exact_match,flexible-extract: ground_truth=0.85 \| measured=0.856 \| success=✅ ceval-valid \| acc,none: ground_truth=0.84 \| measured=0.8373 \| success=✅ Model Parameters: {'pretrained': 'Qwen/Qwen3-30B-A3B', 'tensor_parallel_size': 2, 'dtype': 'auto', 'trust_remote_code': False, 'max_model_len': 4096, 'gpu_memory_utilization': 0.6, 'enable_expert_parallel': True} - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2025-12-31 09:19:04 +08:00
zzzzwwjj	71f729a661	Revert "moe_gating_top_k" (#5512 ) Reverts vllm-project/vllm-ascend#5271 It breaks e2e test - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1`	2025-12-30 15:05:47 +08:00
ZCG12345	45c3c279e2	moe_gating_top_k (#5271 ) 1. What this PR does / why we need it? This PR supports the moe_gating_top_k operator, which enables post-positioned renormalization (renorm) on the basis of softmax. 2. Does this PR introduce any user-facing change? No user-facing changes are required. 3. How was this patch tested? This patch was tested with the test_npu_moe_gating_top_k test case. vLLM version: release/v0.13.0 vLLM main: `ad32e3e19c` --------- Signed-off-by: ZCG12345 <2097562023@qq.com> Signed-off-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com> Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>	2025-12-30 09:28:01 +08:00
weiguihua2	15d73f248e	[refactor] refactor model runner capture model (#5230 ) ### What this PR does / why we need it? Refactor the `capture_model` method in model_runner to directly reuse the method from vLLM. Currently, most of the logic in the capture_model method is similar to that in the vllm code. Directly using the vllm method can reduce the maintenance cost of the vllm-ascend code. Changes: 1. Refactor the capture_model function to inherit the community method directly. 2. Refactor the initialize_aclgraph_capture function by moving it into initialize_attn_backend. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: weiguihua2 <weiguihua2@huawei.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-30 08:32:14 +08:00
Nengjun Ma	5e96f94d2a	Update corresponding vllm commit ID to 12 29 (#5475 ) ### What this PR does / why we need it? - Fixes vllm break: 1. [[BugFix] register quant scale tensors as buffer #31395] (https://github.com/vllm-project/vllm/pull/31395) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-12-29 22:48:05 +08:00
Zetong Li	92353c0643	[Refactor][EAGLE] 1/N delete __init__ in mtp_proposer (#5176 ) ### What this PR does / why we need it? This PR aims to refactor eagle-related modules in vllm-ascend. This is the starting PR of eagle refactoring. Given vllm-eagle, ascend-eagle and ascend-mtp, we first let ascend-mtp inherit from ascend-eagle and let ascend-eagle inherit from vllm-eagle. As a first step, we just delete `__init__` in mtp_proposer and simplify the corresponding logic in eagle_proposer. Based on "vllm-eagle <----- ascend-eagle <----- ascend-mtp", our target is to gradually delete ascend-mtp and enable ascend-eagle to converge to vllm-eagle. So the main workspace is eagle_proposer. In this way, we hope that contributors can refactor eagle concurrently. Incoming changes: 1. delete common methods in vllm-eagle & ascend-eagle & ascend-mtp 2. delete `load_model` in mtp_proposer 3. delete `dummy_run` and `propose` in mtp_proposer 4. ...... RFC: #5467 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Zetong Li <slippersss@126.com>	2025-12-29 16:25:52 +08:00
whx	28b7614322	[Refactor][Triton] Move reject sample triton kernels into ops/triton (#5324 ) ### What this PR does / why we need it? This PR moves reject sample related triton kernels into `ops/triton`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with existing test. - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-12-29 16:15:41 +08:00
Ronald	e7e1a7dc05	[Feature] support eager mode in model runner v2 (#5210 ) ### What this PR does / why we need it? #5051 only implemented a basic framework for model runner v2, and some bugs still affect e2e functionality. This PR aims to enable basic functionality. Model runner v2 plans: https://github.com/vllm-project/vllm-ascend/issues/5208 - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-12-29 15:28:34 +08:00
yeyifan	4da46da9bf	[feature] fia support sliding windows (#5239 ) Enable fia to support sliding window function and adapt to the Gemma3 model. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: nsdie <yeyifan@huawei.com>	2025-12-29 14:56:25 +08:00
ZongYuan Zhan	d8e15dae6c	Optimize some rejectsampler functions to make npu op launch non-blocking (#4587 ) ### What this PR does / why we need it? - Vectorize the loops (without changing the output) in some rejectsampler functions, including: `expand_pytorch`, `sample_recovered_tokens_pytorch`, `rejection_random_sample_pytorch`, `sample_recovered_tokens`. - Remove synchronously launched torchnpu operators from them to accelerate sampling + MTP postprocess. ### Does this PR introduce _any_ user-facing change? - No ### How was this patch tested? - We tested this change with the serve&bench command: ``` ===== serve ===== vllm serve $LOCAL_CKPT_DIR \ --host 0.0.0.0 \ --port 8000 \ --data-parallel-size 4 \ --data-parallel-size-local 2 \ --data-parallel-address $MASTER_NODE_IP \ --data-parallel-start-rank $((2*VC_TASK_INDEX)) \ --data-parallel-rpc-port 13387 \ --tensor-parallel-size 8 \ --seed 1024 \ --enable-expert-parallel \ --served-model-name $NAME \ --max-model-len 4096 \ --max-num-seqs 16 \ --trust-remote-code \ --gpu-memory-utilization 0.90 \ $headless \ --speculative_config '{"method": "deepseek_mtp", "num_speculative_tokens": 1}' \ --additional-config '{"ascend_scheduler_config":{"enabled":false, "enable_chunked_prefill":true, "chunked_prefill_enabled":true}}' ==== bench ===== vllm bench serve --model $LOCAL_CKPT_DIR --served-model-name DeepseekV3ForCausalLM \ --dataset-name spec_bench --spec-bench-output-len 2048 \ --dataset-path question.jsonl \ --top-p 1.0 --temperature 0.8 \ --ignore-eos \ --num-prompts 64 --trust-remote-code --base-url "http://0.0.0.0:8000" --request-rate 64 ``` - In this case, the rejection sampler optimization reduces TPOT from 84.94ms to 64.61ms, about 23% gain. 
## before <img width="1068" height="830" alt="image" src="https://github.com/user-attachments/assets/278ac878-b49d-4588-b87c-316ca4d537f5" /> ## after <img width="781" height="756" alt="image" src="https://github.com/user-attachments/assets/0c6d37ad-ed77-40b3-a1be-4933c468365c" /> - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: ZongYuan Zhan <zhanzy178@gmail.com> Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>	2025-12-29 14:10:39 +08:00
anon189Ty	3e67e8276c	[Feature] Support to use fullgraph with eagle (#5118 ) ### What this PR does / why we need it? This PR supports full graph mode with eagle. Change list: 1. Distinguish between processing graph_params and draft_graph_params in attention_v1. 2. Adapt eagle_proposer to full-graph mode, including: 1). When full graph is used, wrap the model in a Fullgraph Wrapper at load time. 2). Build new metadata, set the running mode to FULL, and mark the attention update in dummy_run when in Fullgraph mode. 3). Fix and fill attn_metadata fields, such as attn_metadata.slot_mapping. 4). Add a descriptor. 5). Set the running mode and trigger the metadata update. 3. Rename is_mtp_model to is_draft_model, and add the workspace update. NOTE: When async_scheduling=True is set, the draft model is forced to run in eager mode. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: anon189Ty <Stari_Falcon@outlook.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com> Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>	2025-12-29 09:54:51 +08:00
LI SHENGYONG	f81cf694b2	[EPLB][refactor] Modification of the initialization logic for expert_map and log2phy (depends on PR 5285) (#5311 ) ### What this PR does / why we need it? Unify the loading logic for expert_map and log2phy. 1. The map generated when redundant experts are enabled is incorrect. The community map-generation function only accepts the number of global experts. When we pass in the number of logical experts plus redundant experts, the local expert ID of the last card indexes an expert ID that does not exist. Now we ensure that the index points to an existing expert ID, and each expert can be accessed. Moreover, when redundant experts are not enabled, the output of our function remains consistent with the community's function. 2. The map we generated was based on the length of the physical experts, but in reality we only need the length of the logical experts. It would need to be padded accordingly later on, so we simply generate a map with the length of the logical experts. 3. Unify the initialization logic across different scenarios and simplify the code for fused_moe. Before refactoring - map path is not None: expert map: get_rank_placement_map from 'expert_load_balancer.py', maintains the map for all ranks and all layers. log2phy: get_rank_log2phy_map from 'expert_load_balancer.py', maintains the map for all ranks and all layers. - map path is None: expert map: determine_expert_map from 'vllm.layer'. The function does not support the redundant experts of vllm-ascend. log2phy: determine_default_log2phy_map from 'eplb_utils.py'. The function does not support the redundant experts of vllm-ascend. Refactoring eplb_utils.py     init_eplb_config          generate placement          generate expert map          generate log2phy ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? 
Expert Mapping Test Generation: ep size: 16, num of experts: 256, num of redundant experts: 16 +++++++++++++++++++++++++++++++++++++++++ Expert Mapping (Non-1 indicates the expert responsible for this rank) for Rank 15: vllm map： [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16] +++++++++++++++++++++++++++++++++++++++++ Improved map： [16 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15] Expert Mapping Test Generation: ep size: 16, num of experts: 256, num of redundant experts: 0 +++++++++++++++++++++++++++++++++++++++++ Expert Mapping (Non-1 indicates the expert responsible for this rank) for Rank 15: vllm map： [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15] +++++++++++++++++++++++++++++++++++++++ Improved map： [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15] dsr1 baselie： \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| gsm8k-lite \| 7cd45e \| accuracy \| gen \| 100.00 \| dsr1 eplb： \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| gsm8k-lite \| 7cd45e \| accuracy \| gen \| 100.00 \| - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: shenchuxiaofugui <1311027364@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-29 09:26:14 +08:00
wujinyuan1	23169021d9	[Refactor]6/N Extract common code of class AscendMLAImpl (#5314 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason: Eliminate duplicate code between the IMPL classes in two files (mla_v1.py and mla_cp.py). vLLM version: 0.13.0rc3 vLLM main: `ad32e3e19c` - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` --------- Signed-off-by: wujinyuan1 <wjy9595@qq.com> Co-authored-by: wujinyuan1 <wjy9595@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-28 10:40:45 +08:00
weijinqian0	dbe4c338f2	[Refactor] cache cos/sin in mla & remove parameter model in builder. (#5277 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 1. Cache cos/sin in mla 2. AttentionBuilder inherits from the original class of vllm. version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-12-28 10:35:07 +08:00
Li Wang	58adf7c8ac	[Bugfix] Correctly handle the output shape in multimodal attention (#5443 ) ### What this PR does / why we need it? Fix https://github.com/vllm-project/vllm-ascend/issues/5297: in the `AscendMMEncoderAttention` forward pass, the output shape should stay consistent with the input - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-27 18:42:46 +08:00
jiangkuaixue123	e91e11d3b0	[bugfix] fix typo of _skip_all_reduce_across_dp_group (#5435 ) ### What this PR does / why we need it? fix typo of _skip_all_reduce_across_dp_group ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` Signed-off-by: jiangkuaixue123 <jiangxiaozhou111@163.com>	2025-12-27 17:50:04 +08:00
realliujiaxu	09f71c14a6	Revert "[feat] enable hierarchical mc2 ops on A2 by default (#5300 )" (#5434 ) We'll release 0.13.0 soon. The main branch is frozen. Let's revert the newest change and redo it once 0.13.0 is released. - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-12-27 17:06:58 +08:00
realliujiaxu	2add3dc3e0	[Bugfix] fix greedy temperature detection (#5417 ) ### What this PR does / why we need it? fix greedy temperature detection from https://github.com/vllm-project/vllm/pull/27077 - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-12-27 17:04:10 +08:00
hwhaokun	12da9f9460	[feat] enable hierarchical mc2 ops on A2 by default (#5300 ) ### What this PR does / why we need it? Previously, it was necessary to set the environment variables HCCL_INTRA_PCIE_ENABLE=1 and HCCL_INTRA_ROCE_ENABLE=0. This PR enables hierarchical MC2 operations on A2 by default. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: hwhaokun <haokun0405@163.com> Co-authored-by: realliujiaxu <realliujiaxu@163.com>	2025-12-27 15:45:25 +08:00
hwhaokun	cb2fbf7df2	[bugfix] solve dp scenario Host-Device sync (#5298 ) ### What this PR does / why we need it? In the speculative decoding scenario, the original code performs Host-Device synchronization, which slows down the main model's execution speed. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: hwhaokun <haokun0405@163.com> Co-authored-by: realliujiaxu <realliujiaxu@163.com>	2025-12-27 10:36:59 +08:00
whx	3f33ad23fe	[BugFix] Fix npu-cpu offloading interface change bug. (#5290 ) ### What this PR does / why we need it? Last month the interface of `OffloadingSpec` changed (https://github.com/vllm-project/vllm/pull/27743). This PR adapts to the new interface and adds an e2e test for CPU offloading. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? CI passed with the newly added test. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-12-27 10:21:20 +08:00
fems14	2ef4d1979e	[bugfix][main]KV Pool for KV Transfer in PD Disaggregation Scenarios (#5398 ) ### What this PR does / why we need it? 1. Fix errors in the KV Pool for KV transfer in PD disaggregation scenarios. 2. Update the KV Pool documentation. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` --------- Signed-off-by: fems14 <1804143737@qq.com>	2025-12-27 09:53:57 +08:00
wangxiyuan	d1f0df7b4b	Revert "MLA prefill performance optimization (#5275 )" (#5410 ) We'll release 0.13.0 soon. The main branch is frozen. Let's revert the newest change and redo it once 0.13.0 is released - vLLM version: release/v0.13.0 - vLLM main: `81786c8774`	2025-12-27 09:48:56 +08:00
pichangping	711f1861e4	MLA prefill performance optimization (#5275 ) ### What this PR does / why we need it? Because the performance of the _npu_ring_mla operator degrades in long-sequence scenarios, long sequences are split into shorter ones before being passed in, which improves performance. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: pichangping <1337510399@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-27 09:19:45 +08:00
Jade Zheng	8b9ca86827	[Feature] Remove the transpose step after attention and switch to transpose_batchmatmul (#5390 ) 1. The `npu_fused_infer_attention_score` kernel supports specifying the output layout. By selecting the appropriate layout, we can avoid the transpose operation typically required after the attention. 2. The `transpose_batchmatmul` function allows us to control whether the output tensor is transposed. If we configure `perm_y`, an additional transpose after executing `v_up` becomes unnecessary. - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-26 22:03:46 +08:00
Wang Kunpeng	bc5b7a5fb5	[bugfix] Fix MHA model runtime error in aclgraph mode (#5397 ) ### What this PR does / why we need it? Currently, MHA models (eg: minicpm-2b, Baichuan-7b) will encounter errors when running in piecewise graph mode, with error messages similar to: ``` (E89999): When layout is TND and PA not enabled, keyT(8) and valueT(8) must be equal to the last element of actualSeqenceLengthKV(5)[FUNC:CheckInputShapeWhenLayoutIsTND][FILE:prompt_flash_attention_tiling.cpp][LINE:3618] ``` The error occurs because the qkv in the Prefill stage is also padded, causing the shape to be inconsistent with actual_seq_lengths. Add unpadding logic for kv. - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-12-26 21:37:28 +08:00
LeeWenquan	7685d0c239	rollback causal_conv1d_fn to torch ops & update qwen3Next doc (#5391 ) ### What this PR does / why we need it? Roll back the causal_conv1d_fn ops from the Triton version to the torch version to fix hanging issues, and update the Qwen3Next doc - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2025-12-26 19:57:38 +08:00
Jade Zheng	0dfdfa9526	[Feature] Enhance all-reduce skipping logic for MoE models in NPUModelRunner (#5329 ) Besides enabling `recompute_scheduler_enable`, we can skip all_reduce when max_num_batched_tokens is below mc2's requirement. - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-26 17:39:44 +08:00
Zetong Li	09390eaf32	[Bugfix] Fix unsuitable moe_comm_type under ep=1 scenario (#5388 ) ### What this PR does / why we need it? This PR aims to fix unsuitable `moe_comm_type` under `ep=1` scenario. The related issue #5375 reported that `ep=1` can cause errors in local environments, while the same cases work well on CI. The cause is that machines differ, so `moe_comm_type` may not be chosen correctly. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` Signed-off-by: Zetong Li <slippersss@126.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-26 16:45:45 +08:00
Zhu Yi Lin	18302c8467	Revert "Add MagicMTP(block verify) and Triton optimization (#4443 )" (#5380 ) ### What this PR does / why we need it? #4443 introduces a precision issue in scenarios where MTP >= 3 + deepseek v3.1, and this pr reverts it - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` Signed-off-by: GDzhu01 <809721801@qq.com>	2025-12-26 15:06:13 +08:00
wangxiyuan	29d2fe653d	cleanup ascend config (#5296 ) 1. refresh additional config doc 2. move kv config logic to platform. 3. improve `dump_config` init logic and rename it to `dump_config_path`. This change affects users: dump_config changes from a dict to a string. 4. correct `enable_async_exponential` type 5. remove useless `chunked_prefill_for_mla` - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-26 14:07:37 +08:00
XiaoxinWang	320877d488	move contiguous in fused_sigmoid_gating_delta_rule_update to model_runner_v1 (#5274 ) ### What this PR does / why we need it? The contiguous() operation temporarily increases memory usage, leading to higher peak GPU memory, which necessitates reducing gpu_memory_utilization. However, making tensors contiguous in modelrunnerv1 significantly enhances operator performance, resulting in greater end-to-end model benefits despite the memory overhead. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-12-26 09:19:47 +08:00
Icey	9b2a7d8866	[BugFix][Fusion] Patch compile backend to make fusion available (#5308 ) Currently, the vllm pr: https://github.com/vllm-project/vllm/pull/24252 is causing operator fusion to fail, which can be mitigated by patching the backend. Once the problem is completely resolved, I will submit a new pull request to remove the patch. - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2025-12-26 09:18:16 +08:00
Qi Mao	7372225bcb	[FIX] Update _causal_conv1d_update_kernel for Efficient Conv State Handling on NPU (#5322 ) Description: This PR updates the implementation of the Triton operator for deployment on NPU devices, focusing on optimizing grid size and memory handling based on NPU limitations. Design Plan: Grid Calculation: The grid size is now dynamically calculated by batch and dim to ensure that the number of programs executed does not exceed the NPU's vector core capacity. This ensures optimal parallelism without overloading the hardware. Data Block Handling: Due to the limited on-chip memory (UB) on Ascend NPUs, this implementation splits large data into smaller chunks of 32k or less per block. The kernel performs a for-loop to process the data in these smaller chunks, minimizing memory usage and avoiding potential overflows. Changes Compared to GPU Implementation: Grid and Block Sizing: For GPU, the grid and block size were determined based on available thread counts and memory size. In contrast, the NPU version dynamically adjusts these parameters using B_TILE and BLOCK_N to optimize for NPU’s architecture. Memory Chunking: The original GPU implementation did not require chunking due to the higher available memory and processing capacity. For the NPU, data is divided into smaller chunks (32k or smaller) to comply with memory constraints on the device. The kernel has been modified to handle this chunking mechanism inside a loop. Optimized Thread Usage: The NPU implementation takes into account the hardware-specific thread limit (24 threads per vector core), ensuring that the number of active programs is aligned with the NPU's vector core count, avoiding over-subscription that would lead to serial processing. This PR ensures that the operator functions efficiently on Ascend NPU, considering hardware limitations while maintaining the same functionality and input parameters as the GPU implementation. 
- vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: maoxx241 <maomaoyu870@gmail.com>	2025-12-26 09:12:30 +08:00
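The two sizing ideas in #5322 (capping the grid at the vector core count and looping over chunks of at most 32k) can be sketched in plain Python. This is not the Triton kernel: the function names, the tile-product formula, the core count in the usage note, and treating "32k" as 32 * 1024 elements are all assumptions made for illustration.

```python
def capped_grid(batch, dim, b_tile, block_n, num_vector_cores):
    """Number of programs to launch: one per (batch tile, dim tile), capped at the core count."""
    batch_tiles = -(-batch // b_tile)  # ceiling division
    dim_tiles = -(-dim // block_n)
    return min(batch_tiles * dim_tiles, num_vector_cores)


def chunk_ranges(total_elems, max_chunk=32 * 1024):
    """(start, length) pairs covering total_elems with no chunk longer than max_chunk."""
    return [
        (start, min(max_chunk, total_elems - start))
        for start in range(0, total_elems, max_chunk)
    ]
```

For instance, with a hypothetical 48 vector cores, `capped_grid(64, 4096, 8, 256, 48)` yields 48 programs instead of 128, and each program would then iterate over the ranges returned by `chunk_ranges` rather than loading its whole slice at once.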
Feng Liu	1858f3d36e	[Bugfix] Fix Qwen P/D Disaggregation accuracy issue (#5340 ) ### What this PR does / why we need it? Fix Qwen P/D Disaggregation accuracy issue - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` Signed-off-by: F.Liu <liufeng248@huawei.com> Co-authored-by: F.Liu <liufeng248@huawei.com>	2025-12-25 22:46:08 +08:00
weiguihua2	d752c030e9	[Bugfix] fix pcp 128K break (#5266 ) ### What this PR does / why we need it? [Bugfix] Fixing the issue where 128K context does not work in long sequence scenarios. This issue is caused by not splitting num_token according to pcp_size during profile_run. During `profile_run`, a warm-up is performed based on `self.max_num_tokens`. When PCP is enabled, each PCP group will only schedule up to `self.max_num_tokens / pcp_size`. After `profile_run` is completed, the original scheduling size needs to be restored. This is a temporary workaround; once https://github.com/vllm-project/vllm/pull/28988/files is implemented, this part can be removed. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-12-25 11:58:52 +08:00
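The temporary scaling described in #5266 (shrink the warm-up size by `pcp_size` during `profile_run`, then restore it) fits a context-manager pattern. The sketch below is an assumption-laden illustration: `Runner` and `pcp_scaled_tokens` are hypothetical stand-ins, and integer division is assumed for the per-group size.

```python
from contextlib import contextmanager


class Runner:
    """Stand-in for a model runner; only the field used here is modelled."""

    def __init__(self, max_num_tokens):
        self.max_num_tokens = max_num_tokens


@contextmanager
def pcp_scaled_tokens(runner, pcp_size):
    """Temporarily reduce max_num_tokens to the per-PCP-group share, then restore it."""
    original = runner.max_num_tokens
    runner.max_num_tokens = original // pcp_size  # assumed integer split per PCP group
    try:
        yield runner
    finally:
        # Restore the original scheduling size even if the profile run fails.
        runner.max_num_tokens = original
```

Wrapping the warm-up in `with pcp_scaled_tokens(runner, pcp_size):` keeps the restore step in one place, so an exception during profiling cannot leave the reduced value behind.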
wangxiyuan	2ae0bad96d	Remove VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE (#5272 ) `VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE` is only used together with `VLLM_ASCEND_ENABLE_PREFETCH_MLP`, so it serves no purpose on its own. This PR removes it. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-25 11:09:56 +08:00
Wang Kunpeng	13cd6362c6	[bugfix] fix Error 'ValueError: Duplicate layer name' (#5280 ) ### What this PR does / why we need it? When matmul_and_reduce is enabled, the prefix attribute is required. However, in some models, the prefix is not passed correctly, causing errors when starting the service. The issue of incorrect prefix passing will be fixed in vLLM in the future. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-12-25 10:43:24 +08:00
dsxsteven	30778f371b	[BugFix] Fix num_pcp_pads Assignment Issues (#5273 ) ### What this PR does / why we need it? The variable `self.num_pcp_pads` was incorrectly truncated during assignment, causing errors in certain scenarios such as PD disaggregated. This issue has now been resolved. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Co-author by: QiuChunshuo <qiuchunshuo@huawei.com> - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: daishixun <dsxsteven@sina.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-25 10:38:09 +08:00
Magnus	a9fccbeb30	[CI] add xlite e2e test (#5305 ) ### What this PR does / why we need it? add xlite e2e test - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: DaweiChang <405739598@qq.com>	2025-12-25 09:17:06 +08:00
Aoxuan Chen	6d25372baa	Add MagicMTP(block verify) and Triton optimization (#4443 ) ### What this PR does / why we need it? 1. MagicMTP (paper: "Block Verification Accelerates Speculative Decoding") was introduced to consider the influence among multiple draft tokens, improving the acceptance rate without compromising accuracy. 2. The rejection sampling logic in rejection_sampler.py was restructured using Triton-Ascend, enabling it to operate under high concurrency, thus resolving CPU and NPU operator bottlenecks and enhancing throughput. ### Does this PR introduce _any_ user-facing change? MagicMTP will automatically take effect when the parameter "num_speculative_tokens" >= 3. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: chenaoxuan <cax1165@163.com>	2025-12-25 09:00:25 +08:00
Ascendyh	a90482803d	[Kernel] add l2norm triton kernel (#4595 ) ### What this PR does / why we need it? This pull request introduces an L2 normalization kernel implemented in Triton, specifically optimized for Ascend NPUs. ### Does this PR introduce _any_ user-facing change? No, this PR does not introduce any user-facing changes. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: Ascendyh <hw7osiris@outlook.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-25 06:06:18 +08:00
Mengqing Cao	e54630e01c	Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138 ) (#5317 ) ### What this PR does / why we need it? Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138) as it causes DeepSeek V3.2 to hang. - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-12-24 22:24:17 +08:00
wangxiyuan	fb3d6ca08c	Cleanup useless env (#5270 ) `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` is not used anywhere, so this PR removes it. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-24 22:07:59 +08:00
TmacAaron	5018f2d8fd	[quantization] Add w8a16 quantization support (#4541 ) ### What this PR does / why we need it? related to https://github.com/vllm-project/vllm-ascend/issues/4267 ### Does this PR introduce _any_ user-facing change? support w8a16 quantization now ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` ### Test tested using [aisbench](https://gitee.com/aisbench/benchmark/) with tp2 #### Precision \| ceval \| mmlu \| gsm8k -- \| -- \| -- \| -- bf16 \| 90.46 \| 89.17 \| 96.21 w8a16 \| 89.51 \| 89.29 \| 95.98 #### Performance \| input_len \| output_len \| concurrency \| TTFT (ms) \| TPOT (ms) \| TPS (Total) (tokens/s) -- \| -- \| -- \| -- \| -- \| -- \| -- bf16 \| 2048 \| 2048 \| 10 \| 1911.7136 \| 77.988 \| 253.9866 w8a16 \| 2048 \| 2048 \| 10 \| 2128.6334 \| 67.1633 \| 293.9117 bf16 \| 3500 \| 1024 \| 10 \| 3076.2509 \| 84.3525 \| 506.949 w8a16 \| 3500 \| 1024 \| 10 \| 2685.2031 \| 73.015 \| 585.4717 --------- Signed-off-by: yyt <yangyit139@gmail.com> Signed-off-by: TmacAaron <yangyit139@gmail.com> Co-authored-by: realliujiaxu <realliujiaxu@163.com>	2025-12-24 19:49:32 +08:00
linfeng-yuan	515267de22	[perf][bugfix] improve performance of rejection sampler and eliminate HD synchronize in TopKTopPSampler (#4154 ) ### What this PR does / why we need it? 1. Use the optimized apply_top_k_top_p for the NPU platform in the rejection sampler (this avoids scatter elements, which can reduce TPOT by ~26 ms with bs=24 per DP). 2. <del>Avoid D2H Synchronization before calling npu_top_k_top_p introduced by parameter validation which improves inference speed with `async_scheduling` enabled;</del> To eliminate the D2H synchronization introduced by parameter validation before calling `npu_top_k_top_p`, we drop this fused operator entirely, since its performance improvement is not significant compared to async_scheduling and it may introduce accuracy problems. 3. Refactor the implementation of AscendTopKTopPSampler to align with that of vLLM. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E serving test with combinations of `k=500` and `p=0.95` with async_scheduling in single-node and wide-EP scenarios. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: realliujiaxu <realliujiaxu@163.com>	2025-12-24 19:10:33 +08:00
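
Note on d752c030e9 (#5266) above: the temporary workaround it describes (warm up with `max_num_tokens / pcp_size` during `profile_run`, then restore the original scheduling size) could be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the PR: the `Runner` class and the `per_pcp_token_budget` helper are hypothetical names, `profile_run` here is a stub, and floor division is an assumed rounding choice.

```python
from contextlib import contextmanager


class Runner:
    """Hypothetical stand-in for a model runner; not the vllm-ascend class."""

    def __init__(self, max_num_tokens: int, pcp_size: int) -> None:
        self.max_num_tokens = max_num_tokens
        self.pcp_size = pcp_size

    def profile_run(self) -> int:
        # Stub warm-up: a real runner would execute a dummy forward pass
        # sized by self.max_num_tokens. Here we only report the size used.
        return self.max_num_tokens


@contextmanager
def per_pcp_token_budget(runner: Runner):
    """Temporarily shrink the token budget to the per-PCP-group share."""
    original = runner.max_num_tokens
    # Assumption: floor division; the commit message only says
    # "max_num_tokens / pcp_size".
    runner.max_num_tokens = original // runner.pcp_size
    try:
        yield runner
    finally:
        # Restore the original scheduling size after profiling,
        # even if the warm-up raised an exception.
        runner.max_num_tokens = original


runner = Runner(max_num_tokens=131072, pcp_size=2)
with per_pcp_token_budget(runner):
    warmup_tokens = runner.profile_run()
```

Restoring the value in a `finally` clause keeps the scheduler budget intact if the warm-up fails part-way.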


1665 Commits