xc-llm-ascend

Author	SHA1	Message	Date
dsxsteven	8378bc28b0	[Misc] Remove CP Redundant Variables after FIA operator enables for CANN 8.5 (#6013 ) ### What this PR does / why we need it? PCP/DCP splits the kv-cache across different cards. With the parameter cp-kv-cache-interleave-size, the first `size` tokens are cached on Card 0, and so on. However, if there are too few tokens, some cards store no key-value pairs, which results in zero values, corrupted values, and precision issues. Currently, additional operations are introduced to avoid this precision problem. Now that the FIA operator is integrated in mla_cp._forward_decode and CANN has been updated to 8.5.0, these additional operations can be removed. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Passed all CI with CANN 8.5.0 - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: dsxsteven <dsxsteven@sina.com> Signed-off-by: dsxsteven <36877507+dsxsteven@users.noreply.github.com>	2026-01-23 14:13:12 +08:00
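
The empty-card situation described in the commit above can be illustrated with a minimal sketch. It assumes that consecutive chunks of `interleave_size` tokens are assigned to cards round-robin; the commit message only says "the first size tokens will be cached at Card 0, and so on", so the round-robin rule and the helper name `tokens_per_rank` are assumptions for illustration, not the repository's code.

```python
def tokens_per_rank(num_tokens: int, interleave_size: int, cp_size: int) -> list[int]:
    """Count how many tokens each card would hold (hypothetical helper).

    Assumption: chunk i of `interleave_size` tokens goes to card i % cp_size.
    """
    counts = [0] * cp_size
    for start in range(0, num_tokens, interleave_size):
        rank = (start // interleave_size) % cp_size
        # The last chunk may be shorter than interleave_size.
        counts[rank] += min(interleave_size, num_tokens - start)
    return counts


# With few tokens, trailing cards receive nothing at all.
short_request = tokens_per_rank(num_tokens=10, interleave_size=4, cp_size=4)
```

Under these assumptions, a 10-token request with an interleave size of 4 on 4 cards leaves the last card with no key-value pairs, which is the case the removed workaround was guarding against.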
zhangxinyuehfad	819a4459ce	Drop vLLM 0.13.0 support (#6069 ) ### What this PR does / why we need it? Drop vLLM 0.13.0 support, upgrade to 0.14.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-23 09:45:08 +08:00
Bai Yongbin	7f91ac2649	[CP&SP] Integrate FIA operator in mla_cp._forward_decode (#5641 ) ### What this PR does / why we need it? Replace the npu_multi_head_latent_attention with FIA operator in mla_cp.py _forward_decode. Adjust mla_attn_dpc_pcp in acl_graph.py ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: 白永斌 <baiyongbin3@h-partners.com> Signed-off-by: Bai Yongbin <845473182@qq.com> Signed-off-by: tongyuzhou <t00886357@china.huawei.com> Co-authored-by: 白永斌 <baiyongbin3@h-partners.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: tongyuzhou <t00886357@china.huawei.com>	2026-01-22 20:02:30 +08:00
SILONG ZENG	329961b375	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #2 ) (#5977 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| `vllm_ascend/attention/attention_mask.py` \| \| `vllm_ascend/attention/attention_v1.py` \| \| `vllm_ascend/attention/context_parallel/attention_cp.py` \| \| `vllm_ascend/attention/context_parallel/common_cp.py` \| \| `vllm_ascend/attention/context_parallel/mla_cp.py` \| \| `vllm_ascend/attention/utils.py` \| \| `vllm_ascend/batch_invariant.py` \| \| `vllm_ascend/device/device_op.py` \| \| `vllm_ascend/device_allocator/camem.py` \| \| `vllm_ascend/envs.py` \| - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-19 08:59:46 +08:00
wjunLu	c11a05c4e1	[Main2Main] Upgrade vllm commit to 0113 (#5839 ) ### What this PR does / why we need it? Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9) - Modify import paths due to the refactors https://github.com/vllm-project/vllm/pull/31916 https://github.com/vllm-project/vllm/pull/32054 - Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional arguments but 3 were given` due to https://github.com/vllm-project/vllm/pull/24498 - Skip the async-scheduling tests in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are never verified https://github.com/vllm-project/vllm/pull/31998 - Skip some pooling tests broken by https://github.com/vllm-project/vllm/pull/32148, which also fail in vLLM itself https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4 We will re-enable those tests when main2main reaches https://github.com/vllm-project/vllm/pull/32243 - Skip some cases in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are broken by https://github.com/vllm-project/vllm/pull/32118 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-01-15 09:48:53 +08:00
Qiu	5f4b13ab3d	[bugfix](cp) align max_context_chunk to cp_virtual_block_size (#5767 ) ### What this PR does / why we need it? In the chunked prefill scenario, CP needs to align the `max_context_chunk` to the `cp_virtual_block_size`, but the current implementation only aligns it to the `block_size`. For PD-disaggregation, `cp_kv_cache_interleave_size` is typically set equal to `block_size`, in which case `cp_virtual_block_size=block_size * dcp_size * pcp_size`. Under specific conditions, this can lead to misalignment of certain chunks, subsequently triggering assertion check errors. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-12 20:11:46 +08:00
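
The misalignment described in the commit above can be shown with a small arithmetic sketch. It assumes that "aligning" means rounding `max_context_chunk` down to a multiple of the alignment, and that `cp_virtual_block_size` is `cp_kv_cache_interleave_size * dcp_size * pcp_size` in general (the commit only states the case where the interleave size equals `block_size`). The helper names and the sample values are hypothetical.

```python
def cp_virtual_block_size(interleave_size: int, dcp_size: int, pcp_size: int) -> int:
    # Assumed general form; the commit gives block_size * dcp_size * pcp_size
    # for the case interleave_size == block_size.
    return interleave_size * dcp_size * pcp_size


def align_down(value: int, alignment: int) -> int:
    # Round value down to the nearest multiple of alignment.
    return (value // alignment) * alignment


block_size = 128
virtual = cp_virtual_block_size(interleave_size=block_size, dcp_size=2, pcp_size=2)

max_context_chunk = 1280
aligned_to_block = align_down(max_context_chunk, block_size)  # old behaviour
aligned_to_virtual = align_down(max_context_chunk, virtual)   # behaviour after the fix
```

With these sample values the chunk is already a multiple of `block_size` but not of the virtual block size, so only the second alignment yields a chunk boundary that every CP rank agrees on.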
zhenwenqi2024	97f6be8108	[feature]dcp&pcp support mlapo (#5672 ) ### What this PR does / why we need it? mlapo in DeepSeek is a huge performance improvement in decode; this PR adds pcp & dcp support with mlapo. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>	2026-01-08 23:49:23 +08:00
LICO67373	380f089fbf	[Refactor] Fix AttentionMaskBuilder singleton and remove redundant pcp_prefill_mask (#4870 ) ## What this PR does / why we need it? This PR fixes the `AttentionMaskBuilder` singleton initialization issue introduced in PR #4779 and removes the unused `pcp_prefill_mask` field. ### Background After PR #4779 made `AttentionMaskBuilder` a singleton with `@singleton` decorator, the class constructor now requires a `device` parameter. However, two initialization sites were still using the old parameterless constructor, causing failures. ### Changes 1. Fix singleton initialization - Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)` in `AscendMLAMetadataBuilder.__init__()` - Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)` in `AscendAttentionMetadataBuilder.__init__()` 2. Remove unused field - Removed `pcp_prefill_mask` field from `AscendPrefillContextParallelMetadata` (never used in codebase) - Updated related test assertions ### Related - Issue #5463 - PR #4779 (Unify all mask generation methods) - PR #5389 (Make AttentionMaskBuilder singleton) ## Does this PR introduce _any_ user-facing change? No. This is an internal refactoring. ## How was this patch tested? - ✅ Local testing: No linter errors - ✅ Unit tests for attention modules verified - ⏳ CI pipeline Signed-off-by: lico67373 <918688502@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2026-01-07 17:09:52 +08:00
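
The constructor issue in the commit above can be sketched with a generic singleton decorator. This is not the repository's `@singleton` implementation, which is not shown in the log; the decorator body, the class `MaskBuilder`, and the device string are all hypothetical stand-ins.

```python
def singleton(cls):
    """Hypothetical class decorator that caches the first instance it creates."""
    instances = {}

    def get_instance(*args, **kwargs):
        if cls not in instances:
            # The first call must supply every required constructor argument.
            instances[cls] = cls(*args, **kwargs)
        return instances[cls]

    return get_instance


@singleton
class MaskBuilder:  # stand-in for a builder whose constructor needs a device
    def __init__(self, device):
        self.device = device


try:
    MaskBuilder()  # old parameterless call site, reached before any other
    first_call_failed = False
except TypeError:
    first_call_failed = True

builder_a = MaskBuilder("npu:0")
builder_b = MaskBuilder("npu:0")
```

In this sketch a parameterless call fails only when it is the first one to run, because later calls return the cached instance; passing the device at every call site, as the commit does, removes that dependence on call order.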
wujinyuan1	4a3663327b	[Refactor]7/N Extract common code to common_cp (#5490 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason: Eliminate code duplicated between two files (mla_cp.py and attention_cp.py) by moving it to common_cp.py. vLLM version: 0.13.0rc3 vLLM main: `ad32e3e19c` vLLM version: release/v0.13.0 vLLM main: `5fbfa8d9ef` - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: wujinyuan1 <wjy9595@qq.com> Signed-off-by: wujinyuan1 <wujinyuan1@huawei.com> Co-authored-by: wujinyuan1 <wjy9595@qq.com>	2026-01-05 17:41:12 +08:00

9 Commits