xc-llm-ascend

Author	SHA1	Message	Date
LICO67373	380f089fbf	[Refactor] Fix AttentionMaskBuilder singleton and remove redundant pcp_prefill_mask (#4870 ) ## What this PR does / why we need it? This PR fixes the `AttentionMaskBuilder` singleton initialization issue introduced in PR #4779 and removes the unused `pcp_prefill_mask` field. ### Background After PR #4779 made `AttentionMaskBuilder` a singleton with `@singleton` decorator, the class constructor now requires a `device` parameter. However, two initialization sites were still using the old parameterless constructor, causing failures. ### Changes 1. Fix singleton initialization - Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)` in `AscendMLAMetadataBuilder.__init__()` - Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)` in `AscendAttentionMetadataBuilder.__init__()` 2. Remove unused field - Removed `pcp_prefill_mask` field from `AscendPrefillContextParallelMetadata` (never used in codebase) - Updated related test assertions ### Related - Issue #5463 - PR #4779 (Unify all mask generation methods) - PR #5389 (Make AttentionMaskBuilder singleton) ## Does this PR introduce _any_ user-facing change? No. This is an internal refactoring. ## How was this patch tested? - ✅ Local testing: No linter errors - ✅ Unit tests for attention modules verified - ⏳ CI pipeline Signed-off-by: lico67373 <918688502@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2026-01-07 17:09:52 +08:00
wangxiyuan	1112208052	[Refactor] Cleanup platform (#5566 ) ### What this PR does / why we need it? 1. add `COMPILATION_PASS_KEY` constant 2. clean up useless platform interface `empty_cache`, `synchronize`, `mem_get_info`, `clear_npu_memory` 3. rename `CUSTOM_OP_REGISTERED` to `_CUSTOM_OP_REGISTERED` 4. remove uesless env `VLLM_ENABLE_CUDAGRAPH_GC` NPUPlatform is the interface called by vLLM. Do not call it inner vllm-ascend. ### Does this PR introduce _any_ user-facing change? This PR is just a cleanup. All CI should pass. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-07 09:25:55 +08:00
wujinyuan1	4a3663327b	[Refactor]7/N Extract common code to common_cp (#5490 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason： Eliminate duplicate code for two file(mla_cp.py attention_cp.py) to common_cp.py. vLLM version: 0.13.0rc3 vLLM main: `ad32e3e19c` vLLM version: release/v0.13.0 vLLM main: `5fbfa8d9ef` - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: wujinyuan1 <wjy9595@qq.com> Signed-off-by: wujinyuan1 <wujinyuan1@huawei.com> Co-authored-by: wujinyuan1 <wjy9595@qq.com>	2026-01-05 17:41:12 +08:00
pichangping	50e7934415	MLA prefill preformance optimization (#5456 ) ### What this PR does / why we need it? Since the _npu_ring_mla operator deteriorates in long-sequencescenarios, the long sequence is split into shorter sequences for input to improve performance. - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: pichangping <1337510399@qq.com>	2026-01-05 11:41:59 +08:00
Qiu	96775a27a8	[refactor](UT,PCP,DCP) refactor pcp&dcp patches in UTs (#5505 ) ### What this PR does / why we need it? Refactor PCP & DCP patches in UTs: Merge and reuse communication groups and communication function patches to reduce code duplication. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-05 09:05:45 +08:00
wujinyuan1	23169021d9	[Refactor]6/N Extract common code of class AscendMLAImpl (#5314 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason： Eliminate duplicate code for two file(mla_v1.py mla_cp.py) of IMPL classes. vLLM version: 0.13.0rc3 vLLM main: `ad32e3e19c` - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` --------- Signed-off-by: wujinyuan1 <wjy9595@qq.com> Co-authored-by: wujinyuan1 <wjy9595@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-28 10:40:45 +08:00
wangxiyuan	d1f0df7b4b	Revert "MLA prefill preformance optimization (#5275 )" (#5410 ) We'll release 0.13.0 soon. The main branch is freeze. Let's revert the newest change and redo it once 0.13.0 is released - vLLM version: release/v0.13.0 - vLLM main: `81786c8774`	2025-12-27 09:48:56 +08:00
pichangping	711f1861e4	MLA prefill preformance optimization (#5275 ) ### What this PR does / why we need it? Since the _npu_ring_mla operator deteriorates in long-sequencescenarios, the long sequence is split into shorter sequences for input to improve performance. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: pichangping <1337510399@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-27 09:19:45 +08:00
wujinyuan1	7ff1db4b84	[Refactor]5/N Extract common code of mla_v1.py & extract mla_cp (#5097 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason： The functions related to Cp differ significantly from those of normal MLA-Attention, but the coupling is quite severe. Steps： 1)Extract common code AscendMLAMetadataBuilder.build to 4 functions: build_prefill_metadata, build_decode_metadata,build_cp_metadata, build_chunked_metadata todo： 1)refactor function _compute_prefill_context; 2)refactor function _mla_preprocess,_mla_decode_preprocess 3）Extract public data and processing functions from the attention_cp.py and mla_cp.py files to the common_cp file. vLLM version: 0.13.0rc3 vLLM main: `ad32e3e19c` - vLLM version: 0.13.0rc3 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wujinyuan1 <wjy9595@qq.com> Signed-off-by: wujinyuan1 <wujinyuan1@huawei.com> Co-authored-by: wujinyuan1 <wjy9595@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-24 10:25:19 +08:00
Feng Liu	e117b3d693	[Perf] vectorize PCP/DCP loops in mla_v1.py (#5003 ) ### What this PR does / why we need it? - Replace nested PCP/DCP Python loops with fully vectorized tensor operations - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: F.Liu <liufeng248@huawei.com> Co-authored-by: F.Liu <liufeng248@huawei.com>	2025-12-22 11:06:30 +08:00
Feng Liu	eda3cabf5b	[UT] add pcp&dcp UT for mla_cp (#4953 ) ### What this PR does / why we need it? Add UT of mla_cp, which include: - test_compute_prefill_context_with_dcp_pcp - test_reorg_kvcache_with_dcp_pcp - test_out_lse_reshape - test_npu_attention_update_with_dcp_pcp - test_attention_with_mask_and_nomask_with_dcp_pcp - test_process_attn_out_lse_with_dcp_pcp - test_forward_prefill_cp_with_dcp_pcp - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: F.Liu <liufeng248@huawei.com> Co-authored-by: F.Liu <liufeng248@huawei.com>	2025-12-17 16:19:27 +08:00
zengzengran	6029bea480	[UT]add pcp dcp ut (#4949 ) ### What this PR does / why we need it? Adding UT for DCP/PCP -vLLM version: v0.12.0 -vLLM main: `ad32e3e19c` Signed-off-by: zengran <zengran2@huawei.com>	2025-12-15 18:41:38 +08:00

12 Commits