xc-llm-ascend

Author	SHA1	Message	Date
zxr2333	78b554dda9	[P/D] layerwise connector supports DeepSeek-V3.2 sparse attention && Distribute transfer tasks to redundant kv_head cards (#5722 ) ### What this PR does / why we need it? Add new function to mooncake layerwise connector, including: 1. supports sparse attention, for DeepSeek-V3.2 2. Distribute transfer tasks to redundant kv_head cards This PR is related to [[RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache Layerwise Push Support](https://github.com/vllm-project/vllm-ascend/issues/4842) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2026-01-10 23:04:16 +08:00
Levi	ecd4232698	[Feat] flashcomm2+oshard Generalized (#4723 ) ### What this PR does / why we need it? [FlashComm2](https://gitcode.com/ascend-tribe/ascend-inference-cluster/blob/main/FlashComm/FlashComm2%E5%A4%A7%E6%A8%A1%E5%9E%8B%E6%8E%A8%E7%90%86%E4%B8%AD%E4%BB%A5%E5%AD%98%E6%8D%A2%E4%BC%A0%E7%9A%84%E9%80%9A%E4%BF%A1%E4%BC%98%E5%8C%96%E6%8A%80%E6%9C%AF.pdf) introduces redundant storage of the o_proj matrix, which imposes pressure on GPU memory. We propose the FlashComm2+Oshard approach by integrating the shared linear layer feature (#2931). This approach distributes weights layer-by-layer to each GPU and accesses the o_proj of each layer via asynchronous broadcast operations, thereby alleviating memory pressure while achieving nearly lossless performance compared to the original FlashComm2. This PR implements a generalized FlashComm2+Oshard solution. Using following env to support flashcomm2 with oshard ```shell export VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1 --additional-config '{ "layer_sharding": ["o_proj"] }' ``` ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com>	2026-01-10 22:57:57 +08:00
zzhxxx	64d29875f9	[Refactor] Replace the implementations of o_proj, q_b_proj, and kv_b_proj with custom_op for sharded CP (#5698 ) ### What this PR does / why we need it? Based on the Sharded-CP feature PR:https://github.com/vllm-project/vllm-ascend/pull/4702; RFC:https://github.com/vllm-project/vllm/issues/30055 This PR officially integrates Deepseek V3.2's DSA-CP support on the basis of https://github.com/vllm-project/vllm-ascend/pull/4702, improving inference efficiency and scalability under mixed prefill-decode workloads. The main improvements include: - Replace the implementations of o_proj, q_b_proj, and kv_b_proj with custom_op for TP=1. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Signed-off-by: chenxiao <Jaychou1620@Gmail.com> Signed-off-by: Kurumi5210 <jaychou1620@gmail.com> Co-authored-by: clrs97 <524936896@qq.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com>	2026-01-09 15:58:40 +08:00
whx	ee2ed573f1	[BugFix][DS 3.2] Fix ds indexer accuracy problem caused by rope. (#4641 ) ### What this PR does / why we need it? The rotary algorithm in the DeepSeek indexer should be neox-style instead of gptj-style. PR #4413 fixed this accuracy bug with a new Triton kernel. This PR fixes the original PyTorch version. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? CI passed with existing tests. - vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24 - vLLM main: `86e178f7c4` Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-01-09 14:11:44 +08:00
zhenwenqi2024	97f6be8108	[feature]dcp&pcp support mlapo (#5672 ) ### What this PR does / why we need it? MLAPO gives DeepSeek a large performance improvement in decode. This PR adds PCP and DCP support with MLAPO. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>	2026-01-08 23:49:23 +08:00
Yizhou	f4605c2b3c	[Fix] Fixes speculative decode indexing and unpad condition for attention metadata (#5626 ) ### What this PR does / why we need it? This addresses the issue brought up by #5356 and #4963, and we believe the unnecessary conditions are the root cause. Change the unpad trigger to be driven by actual size mismatches (num_reqs vs base_num_reqs or scheduled vs input token counts) rather than specific speculative-method flags. Then remove brittle workarounds that forced request counts and sliced query start locations. This prevents incorrect indexing and length mismatches during speculative decoding and makes metadata unpadding more robust across scheduling modes. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Tested by existing cases. - vLLM version: v0.13.0 - vLLM main: `8be6432bda` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2026-01-08 19:41:08 +08:00
cookieyyds	8b3a7a9e87	[bugfix] Support dsv3.2 enable both mtp and full_decode_only (#5679 ) ### What this PR does / why we need it? PR #5230 introduced a problem: when both mtp and full_decode_only are enabled for the DSV32 model, the operators cannot be compiled into the graph. This PR fixes that issue. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>	2026-01-08 15:47:31 +08:00
zzhxxx	f7db812ed7	[refactor] Refactor the interface for shard weight and remove the flashcomm2 o_shared interface. (#5181 ) ### What this PR does / why we need it? - Delete the environment variable `VLLM_ASCEND_ENABLE_FLASHCOMM2_OSHARED` - Introduce layer_sharding as a configurable feature in additional_config - Revise the term "shared weight" to "shard weight." Configuration: The feature is opt-in via the additional_config argument: ``` --additional-config '{ "layer_sharding": ["o_proj", "q_b_proj"] }' ``` This is orthogonal to standard tensor parallelism and weight replication strategies. It is treated as a separate, explicit feature. It can be used in any scenario, combined with the flashcomm2 (https://github.com/vllm-project/vllm-ascend/pull/3232) feature or the ShardedCP (#4702) feature, to achieve significant performance gains. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn> Signed-off-by: chenxiao <Jaychou1620@Gmail.com> Co-authored-by: clrs97 <524936896@qq.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com>	2026-01-08 09:05:02 +08:00
LICO67373	380f089fbf	[Refactor] Fix AttentionMaskBuilder singleton and remove redundant pcp_prefill_mask (#4870 ) ## What this PR does / why we need it? This PR fixes the `AttentionMaskBuilder` singleton initialization issue introduced in PR #4779 and removes the unused `pcp_prefill_mask` field. ### Background After PR #4779 made `AttentionMaskBuilder` a singleton with `@singleton` decorator, the class constructor now requires a `device` parameter. However, two initialization sites were still using the old parameterless constructor, causing failures. ### Changes 1. Fix singleton initialization - Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)` in `AscendMLAMetadataBuilder.__init__()` - Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)` in `AscendAttentionMetadataBuilder.__init__()` 2. Remove unused field - Removed `pcp_prefill_mask` field from `AscendPrefillContextParallelMetadata` (never used in codebase) - Updated related test assertions ### Related - Issue #5463 - PR #4779 (Unify all mask generation methods) - PR #5389 (Make AttentionMaskBuilder singleton) ## Does this PR introduce _any_ user-facing change? No. This is an internal refactoring. ## How was this patch tested? - ✅ Local testing: No linter errors - ✅ Unit tests for attention modules verified - ⏳ CI pipeline Signed-off-by: lico67373 <918688502@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2026-01-07 17:09:52 +08:00
yeyifan	cc0110abb4	[Bugfix] Remove swa parameter of fia (#5602 ) ### What this PR does / why we need it? When using the swa parameter in fia, headDim does not currently support 256, and when gemma3's headDim is equal to 256, an error will occur. Therefore, code rollback is required, and it will be incorporated after cann supports it. ### Does this PR introduce _any_ user-facing change? Remove swa parameter of fia. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` --------- Signed-off-by: nsdie <yeyifan@huawei.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2026-01-06 17:24:43 +08:00
Shanshan Shen	b94d589769	[MM][Bugfix] Update `hf_config` to `hf_text_config` (#5319 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm-ascend/pull/5205, update `hf_config` to `hf_text_config`. Find more details at https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3675417534 and https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3677920872. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-06 16:41:39 +08:00
wjunLu	3cf059a72b	[Main2Main] Upgrade vllm commit to 0105 (#5595 ) ### What this PR does / why we need it? Upgrade vllm commit to 0105 (8be6432bdaf6275664d857b1e5e9bf8ed1ce299e) 1. Remove `maybe_padded_num_tokens` arg in `model_runner_v1.py` since https://github.com/vllm-project/vllm/pull/31517 deleted unused arg 2. Remove dense `Qwen/Qwen3-0.6B` in `tests/e2e/multicard/test_aclgraph_capture_replay.py` and `tests/e2e/multicard/test_data_parallel.py` due to https://github.com/vllm-project/vllm/pull/30739 where offline data parallel mode will not be supported/useful for dense models 3. Adapt `vllm_ascend/worker/worker.py` due to https://github.com/vllm-project/vllm/pull/31584 4. Adapt `self.block_size` calling due to https://github.com/vllm-project/vllm/pull/31540 5. Modify `test_mla_v1.py` due to https://github.com/vllm-project/vllm/pull/28454 , which refactored `get_head_size()` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-06 08:44:29 +08:00
Chen Chen	a2daacbd71	[perf] Fix MLAPO weight disposal for KV-consumer MLA in PD-mix deploy... (#5192 ) ### What this PR does / why we need it? - Problem: In MLA+MLAPO, KV-consumer deployments keep fused_qkv_a_proj/q_proj weights and quant params even though MLAPO uses the prepacked buffers, increasing memory footprint on decode nodes. - Fix: Conditionally drop those tensors only when `kv_transfer_config.is_kv_consumer` to reclaim memory (consistent with the SFA behavior #4774 ). ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: Chen Chen <0109chenchen@gmail.com>	2026-01-05 21:29:45 +08:00
wujinyuan1	4a3663327b	[Refactor]7/N Extract common code to common_cp (#5490 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason: Eliminate duplicate code in two files (mla_cp.py, attention_cp.py) by moving it to common_cp.py. vLLM version: 0.13.0rc3 vLLM main: `ad32e3e19c` vLLM version: release/v0.13.0 vLLM main: `5fbfa8d9ef` - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: wujinyuan1 <wjy9595@qq.com> Signed-off-by: wujinyuan1 <wujinyuan1@huawei.com> Co-authored-by: wujinyuan1 <wjy9595@qq.com>	2026-01-05 17:41:12 +08:00
pichangping	50e7934415	MLA prefill preformance optimization (#5456 ) ### What this PR does / why we need it? Since the performance of the _npu_ring_mla operator deteriorates in long-sequence scenarios, long sequences are split into shorter sequences for input to improve performance. - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: pichangping <1337510399@qq.com>	2026-01-05 11:41:59 +08:00
Qiu	7c210225a2	[Perf][PCP][DCP] add multi-stream for GQA to enable computation-communication overlap (#5382 ) ### What this PR does / why we need it? This PR adds multi-stream for GQA to enable computation-communication overlap. For chunked prefill, we reduce TTFT by approximately 4%. ### Does this PR introduce _any_ user-facing change? No - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-04 16:33:18 +08:00
drslark	363ac1b80f	[Feat][main] Supported to use full-graph with Qwen3-Next-MTP (#5477 ) ### What this PR does / why we need it? Support using full-graph with Qwen3-Next-MTP. In detail, we adapted `AscendAttentionState.ChunkedPrefill` in both the main model and the MTP model. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? We changed the test of Qwen3-Next-MTP in `tests/e2e/multicard/test_qwen3_next.py` to make it a test of `FULL_DECODE_ONLY`. Then we ran `pytest -s tests/e2e/multicard/test_qwen3_next.py::test_qwen3_next_distributed_mp_eager_mtp_similarity_tp4`, and the test passed. ```text . ================================================================================================================================= warnings summary ================================================================================================================================= <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ==================================================================================================================== 1 passed, 2 warnings in 271.89s (0:04:31) ===================================================================================================================== ``` - vLLM version: v0.13.0 - vLLM main: `5326c89803` Signed-off-by: drslark <slarksblood@qq.com>	2026-01-04 12:03:21 +08:00
无脸男	03679cf1d3	[Bugfix] fix the precision issues that may raise from the inter-layer reuse of the workspace in certain scenarios (#5522 ) ### What this PR does / why we need it? In the current process of implementing attention updates, the FIA operator shares a single workspace among different layers within the same computation graph. To enable memory reuse, we adopt the weak_ref_tensor mechanism. However, this approach may lead to precision anomalies in certain scenarios. To address this issue, different layers in the same computation graph are assigned independent workspaces. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: WithHades <244036962@qq.com>	2025-12-31 16:54:04 +08:00
zxr2333	46a1614387	[P/D] Improve the performance of Layerwise Connector (#5303 ) ### What this PR does / why we need it? Improve the performance of Layerwise Connector, mainly includes the following points: 1. Use event synchronize to replace stream synchronize. 2. Access metaserver when scheduling. 3. Transfer kvcache each Chunk prefill segmentation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2025-12-31 15:09:01 +08:00
Jade Zheng	38570cfeb6	[Feature] Support kv nz feature for DeepSeek decode node in disagg-prefill scenario (#3072 ) By converting the KV cache from ND to NZ format when the decode node receives it, this PR ensures that the KV NZ feature works correctly during the decoding phase in disagg-prefill scenario. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com> Co-authored-by: ghphotoframe <854746559@qq.com> Co-authored-by: alex101-ops <alex1015718386@gmail.com>	2025-12-31 14:24:04 +08:00
weiguihua2	15d73f248e	[refactor] refactor model runner capture model (#5230 ) ### What this PR does / why we need it? Refactor the `capture_model` method in model_runner to directly reuse the method from vLLM. Currently, most of the logic in the capture_model method is similar to that in the vLLM code. Directly using the vLLM method can reduce the maintenance cost of the vllm-ascend code. Changes: 1. Refactor the capture_model function to directly inherit the community method. 2. Refactor the initialize_aclgraph_capture function and move it to initialize_attn_backend. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: weiguihua2 <weiguihua2@huawei.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-30 08:32:14 +08:00
Ronald	e7e1a7dc05	[Feature] support eager mode in model runner v2 (#5210 ) ### What this PR does / why we need it? #5051 only implement a basic framework for model runner v2, but there are still some bugs for e2e functionality, this PR aim to enable basic functionality. model runner v2 plans: https://github.com/vllm-project/vllm-ascend/issues/5208 - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-12-29 15:28:34 +08:00
yeyifan	4da46da9bf	[feature] fia support sliding windows (#5239 ) Enable fia to support sliding window function and adapt to the Gemma3 model. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: nsdie <yeyifan@huawei.com>	2025-12-29 14:56:25 +08:00
anon189Ty	3e67e8276c	[Feature] Support to use fullgraph with eagle (#5118 ) ### What this PR does / why we need it? This PR supports using full graph with eagle. Change list: 1. Distinguish between processing graph_params and draft_graph_params in attention_v1. 2. Adapt the full-graph mode in eagle_proposer, including: 1) When full graph is used, create the Fullgraph Wrapper when loading the model. 2) Build a new metadata, set the running mode to FULL, and mark the attention update in dummy_run when in Fullgraph mode. 3) Fix and fill attn_metadata fields, such as attn_metadata.slot_mapping. 4) Add a descriptor. 5) Set the running mode and trigger the metadata update. 3. Change is_mtp_model to is_draft_model, and add the workspace update. NOTE: When async_scheduling=True is set, the draft model is forced to execute in eager mode. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: anon189Ty <Stari_Falcon@outlook.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com> Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>	2025-12-29 09:54:51 +08:00
wujinyuan1	23169021d9	[Refactor]6/N Extract common code of class AscendMLAImpl (#5314 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason: Eliminate duplicate code in the IMPL classes of two files (mla_v1.py, mla_cp.py). vLLM version: 0.13.0rc3 vLLM main: `ad32e3e19c` - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` --------- Signed-off-by: wujinyuan1 <wjy9595@qq.com> Co-authored-by: wujinyuan1 <wjy9595@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-28 10:40:45 +08:00
weijinqian0	dbe4c338f2	[Refactor] cache cos/sin in mla & remove parameter model in builder. (#5277 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 1. Cache cos/sin in mla 2. AttentionBuilder inherits from the original class of vllm. version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-12-28 10:35:07 +08:00
wangxiyuan	d1f0df7b4b	Revert "MLA prefill preformance optimization (#5275 )" (#5410 ) We'll release 0.13.0 soon. The main branch is frozen. Let's revert the newest change and redo it once 0.13.0 is released. - vLLM version: release/v0.13.0 - vLLM main: `81786c8774`	2025-12-27 09:48:56 +08:00
pichangping	711f1861e4	MLA prefill preformance optimization (#5275 ) ### What this PR does / why we need it? Since the performance of the _npu_ring_mla operator deteriorates in long-sequence scenarios, long sequences are split into shorter sequences for input to improve performance. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: pichangping <1337510399@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-27 09:19:45 +08:00
Jade Zheng	8b9ca86827	[Feature] Remove the transpose step after attention and switch to transpose_batchmatmul (#5390 ) 1. The `npu_fused_infer_attention_score` kernel supports specifying the output layout. By selecting the appropriate layout, we can avoid the transpose operation typically required after the attention. 2. The `transpose_batchmatmul` function allows us to control whether the output tensor is transposed. If we configure `perm_y`, an additional transpose after executing `v_up` becomes unnecessary. - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-26 22:03:46 +08:00
Wang Kunpeng	bc5b7a5fb5	[bugfix] Fix MHA model runtime error in aclgraph mode (#5397 ) ### What this PR does / why we need it? Currently, MHA models (eg: minicpm-2b, Baichuan-7b) will encounter errors when running in piecewise graph mode, with error messages similar to: ``` (E89999): When layout is TND and PA not enabled, keyT(8) and valueT(8) must be equal to the last element of actualSeqenceLengthKV(5)[FUNC:CheckInputShapeWhenLayoutIsTND][FILE:prompt_flash_attention_tiling.cpp][LINE:3618] ``` The error occurs because the qkv in the Prefill stage is also padded, causing the shape to be inconsistent with actual_seq_lengths. Add unpadding logic for kv. - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-12-26 21:37:28 +08:00
Feng Liu	1858f3d36e	[Bugfix] Fix Qwen P/D Disaggregation accuracy issue (#5340 ) ### What this PR does / why we need it? Fix Qwen P/D Disaggregation accuracy issue - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` Signed-off-by: F.Liu <liufeng248@huawei.com> Co-authored-by: F.Liu <liufeng248@huawei.com>	2025-12-25 22:46:08 +08:00
Mengqing Cao	e54630e01c	Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138 ) (#5317 ) ### What this PR does / why we need it? Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138) as it causes deepseek v3.2 hang error - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-12-24 22:24:17 +08:00
wujinyuan1	7ff1db4b84	[Refactor]5/N Extract common code of mla_v1.py & extract mla_cp (#5097 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason: The CP-related functions differ significantly from those of normal MLA attention, yet the two are tightly coupled. Steps: 1) Extract the common code of AscendMLAMetadataBuilder.build into 4 functions: build_prefill_metadata, build_decode_metadata, build_cp_metadata, build_chunked_metadata. Todo: 1) refactor function _compute_prefill_context; 2) refactor functions _mla_preprocess and _mla_decode_preprocess; 3) extract public data and processing functions from the attention_cp.py and mla_cp.py files to the common_cp file. - vLLM version: 0.13.0rc3 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wujinyuan1 <wjy9595@qq.com> Signed-off-by: wujinyuan1 <wujinyuan1@huawei.com> Co-authored-by: wujinyuan1 <wjy9595@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-24 10:25:19 +08:00
Slightwind	22138e2727	[main][Refactor] Remove `with_prefill` parameter from `set_ascend_forward_context` (#5094 ) Removes the redundant `with_prefill` parameter from `set_ascend_forward_context` to align the interface with vLLM's `set_forward_context` for future refactoring. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Signed-off-by: Slightwind <slightwindsec@gmail.com> Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>	2025-12-23 14:30:50 +08:00
Mengqing Cao	449f8f65a7	[KV-Sharing] Support KV-Sharing feature in CLA models (#4138 ) ### What this PR does / why we need it? Support the KV-Sharing feature in CLA (cross layer attention) models, which share the KV cache across some layers. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-12-23 10:48:31 +08:00
weijinqian0	95e8a52156	[Refactor] move the metadata from attention_v1 to util(ready for extract common_cp) & realize Ascendmetadata inherit from the parent class. (#5203 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 1. Remove the pcp-related code from attention_v1. 2. Establish the inheritance relationship of CommonAttentionMetadata. TODO 1. extract common_cp 2. move cp metadata to common_cp. 3. remove commonAttentionMetadata for aclgraph. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-12-23 00:10:52 +08:00
zhangsicheng5	78aa7f2693	[feature] support pcp + mtp in full graph (#4572 ) 1. support pcp + mtp in full graph 2. pcp/dcp related mtp bugfix 3. support pcp + mtpx - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>	2025-12-22 16:13:39 +08:00
Qiu	64669c4243	[misc][FlashComm1][ACLGraph] Incompatibility between Flashcomm1 and FULL_DECODE_ONLY. (#5200 ) ### What this PR does / why we need it? Currently, Flashcomm1 and FULL_DECODE_ONLY are incompatible. When both features are enabled, graph capture errors occur without clear error messages. After discussion, it has been determined that enabling FULL_DECODE_ONLY with Flashcomm1 in mixed deployment scenarios provides almost no TPOT benefit. Additionally, a reconstruction of the decode phase for flashcomm1 is currently underway. Therefore, related adaptation work is temporarily postponed and will be addressed after the decode phase reconstruction plan is finalized. For now, an assert will be added to provide clear error messages and correct deployment recommendations. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? NO - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2025-12-22 14:33:32 +08:00
Feng Liu	e117b3d693	[Perf] vectorize PCP/DCP loops in mla_v1.py (#5003 ) ### What this PR does / why we need it? - Replace nested PCP/DCP Python loops with fully vectorized tensor operations - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: F.Liu <liufeng248@huawei.com> Co-authored-by: F.Liu <liufeng248@huawei.com>	2025-12-22 11:06:30 +08:00
Feng Liu	49838d4bec	[Perf] vectorize PCP/DCP loops in attention_cp.py (#4944 ) ### What this PR does / why we need it? - Add explicit .contiguous() after permute/view to ensure mem-friendly layout - Replace nested PCP/DCP Python loops with fully vectorized tensor operations - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: F.Liu <liufeng248@huawei.com> Co-authored-by: F.Liu <liufeng248@huawei.com>	2025-12-22 11:06:19 +08:00
weijinqian0	35ad11b637	[Refactor] remove some metadata variables in attention_v1. (#5160 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason: The metadata data class contains an excessive number of variables. We will inherit the metadata of the community and simultaneously remove some variables that are no longer needed at present. Todo: 1. remove attn_state partly. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-12-19 14:57:09 +08:00
zzzzwwjj	cc23067f1e	[refactor] refactor weight trans nz and transpose (#4878 ) ### What this PR does / why we need it? Now `VLLM_ASCEND_ENABLE_NZ` will have three options: 0: disable nz; 1: only quant case enable nz; 2: enable nz as long as possible; And `VLLM_ASCEND_ENABLE_NZ`=1 by default. All cases are shown in the table below: \| \| W4A4 \| W4A8 \| W8A8 \| fp16/bf16 \| fp32 \| \|---\|---\|---\|---\|---\|---\| \| trans nz \| can't support nz \| trans nz by default \| trans nz by default \| trans nz when VLLM_ASCEND_ENABLE_NZ is 2 \| can't support nz \| \| transpose \| only support not transpose case \| only support transpose case \| only support transpose case \| linear: only support not transpose case<br>gmm: only support transpose case \| same to fp16/bf16 \| Some exceptional cases: 1. MLAPO op need to do some additional processing on the weights, including trans nz. If use MLAPO op, some weight will be transformed to nz forcely; 2. MLA/SFA's weight `W_UV` will be used by op `torch.ops._C_ascend.batch_matmul_transpose`, and this op can't support nz currently; ### Does this PR introduce _any_ user-facing change? Now fp16/bf16 weight will not trans nz by default. ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-12-19 14:27:24 +08:00
weichen	ca6f631cba	[2/N][Pangu][MoE] Remove Pangu Related Code (#5130 ) ### What this PR does / why we need it? Remove Pangu Related Code ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weichen <calvin_zhu0210@outlook.com>	2025-12-19 09:00:07 +08:00
Angazenn	632eab28b7	[BugFix]Fix incorrect get_current_vllm_config (#5121 ) ### What this PR does / why we need it? This PR fixes some incorrect `get_current_vllm_config` calling, which creates empty vllm_config instead. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Angazenn <supperccell@163.com>	2025-12-18 22:21:36 +08:00
LICO67373	9fcaf66646	fix: use batch_matmul_transpose operator in MLA _v_up_proj for better performance (#5142 ) ### What this PR does / why we need it? This PR fixes a bug in the `AscendMLAImpl._v_up_proj` method where the optimized `batch_matmul_transpose` operator was not being utilized. Changes: - Modified `_v_up_proj` method to use `torch.ops._C_ascend.batch_matmul_transpose` operator for FP16/BF16 dtypes when available - Added fallback path using the original `torch.bmm` implementation for other cases - This avoids unnecessary transpose operations and improves performance Why needed: - The previous implementation only used `torch.bmm` with multiple transpose operations, which is less efficient - The Ascend backend provides an optimized `batch_matmul_transpose` operator that can handle the computation more efficiently - This fix improves inference performance for MLA (Multi-head Latent Attention) models on Ascend NPU ### Does this PR introduce _any_ user-facing change? No. This is a performance optimization that maintains the same functionality and output. Users will experience faster inference for MLA-based models, but no API or interface changes are introduced. The changes maintain backward compatibility with the fallback path, ensuring correct behavior when the operator is not available or for unsupported dtypes. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: lico67373 <918688502@qq.com> Co-authored-by: hwhaokun <haokun0405@163.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-18 16:48:55 +08:00
weijinqian0	98e6e57622	[Refactor] 4/N Distinguish the branches based on the applicable scenarios of PA and FIA Ops. (#5081 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason: We distinguish the branches based on the applicable scenarios of pagedAttention and fusedInferAttention, making the code more clear. At the same time, it is convenient for the subsequent iterations of sliding_window and sinks and removePA ops after FIA is ready. Todo: remove PA ops after FIA is ready add slidingwindow and ops for gpt_oss replace FIA with FIA_v2 - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-12-17 23:14:02 +08:00
zzzzwwjj	06b82e7503	[main] rename device type (#5099 ) ### What this PR does / why we need it? Rename `_910B` to `A2`; Rename `_910_93` to `A3`; Rename `_910_95` to `A5`; - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-12-17 14:08:19 +08:00
Icey	cadfa5ddc1	[Fusion] [Graph] Add qknorm rope fusion operator (#4711 ) ### What this PR does / why we need it? This PR adds the `qkv_rmsnorm_rope` operator and introduces a graph fusion pass for `qknorm_rope` operations. The implementation includes a new configuration flag, a pattern matching pass using `torch._inductor.pattern_matcher`, and a custom Triton kernel for the fused operation. Co-authored-by: Angazenn [supperccell@163.com](mailto:supperccell@163.com) ### Does this PR introduce _any_ user-facing change? Yes, it adds a new additional_config option. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2025-12-17 08:53:44 +08:00
anon189Ty	5b1da4e914	[Feat] Support async_scheduler and disable_padded_drafter_batch in eagle (#4893 ) ### What this PR does / why we need it? We refactored eagle_proposer.py to follow the framework of eagle.py in vllm-v0.12.0, in order to support the logic of padded drafter batch and async-scheduler. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: anon189Ty <Stari_Falcon@outlook.com> Co-authored-by: drslark <slarksblood@qq.com>	2025-12-16 22:06:40 +08:00
whx	a9625851ef	[Attention] Temporarily add back pa for small batch sizes. (#4765 ) ### What this PR does / why we need it? This PR adds back pa in scenarios of small batch sizes due to performance consideration. Will remove pa once fia performs better than pa in all scenarios. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with existing test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: whx-sjtu <2952154980@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-15 20:35:50 +08:00

308 Commits