xc-llm-ascend

Author	SHA1	Message	Date
weiguihua2	bc8e87f3db	[v0.18.0][Bugfix] fix ds3.2 dcp mtp (#7681 ) ### What this PR does / why we need it? Fixed the issue where the DCP overlaps the MTP scenario in the ds3.2 scenario. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? cherry-pick from: https://github.com/vllm-project/vllm-ascend/pull/7617 Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-03-27 14:24:53 +08:00
Mengqing Cao	29308ac3a9	[v0.18.0][Bugfix] Fixed wrong class attribute assignment (#7586 ) (#7655 ) ### What this PR does / why we need it? Fixed incorrect class attribute assignment and corrected it to instance attribute assignment. Ensured reorder_batch_threshold only applies to the current instance to avoid global pollution and multi-instance conflicts. Backport of #7586 Signed-off-by: LookAround <lixushi@huawei.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: LookAround0301 <lixushi@huawei.com>	2026-03-27 11:20:59 +08:00
Ronald	c980e68d40	[Feature] support aclgraph for model runner v2 (#7110 ) ### What this PR does / why we need it? This PR aims to support aclgraph for model runner v2, please see RFC #5208. The PR contains these modifications: - adapt to newest commit of vllm main branch. - supply a unified interface of extra forward context for both model runner v1 and model runner v2. - implement graph mode for main model. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-03-13 09:11:46 +08:00
lilinsiman	a5ea699e29	[eagle][cp] fix eagle_cp enable bug2 (#7079 ) ### What this PR does / why we need it? Fix acceptance and high-concurrency bug in eagle3 and cp enabled ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? tests and ut - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2026-03-10 16:32:49 +08:00
Qiu	13adcbe44b	feat(attention_cp): support chunked prefill for Qwen3Next with PCP&DCP (#6900 ) ### What this PR does / why we need it? Support chunked prefill for Qwen3Next with PCP&DCP - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-03-09 17:55:09 +08:00
dsxsteven	91c39ebae6	[BugFix] [dcp] Fix GQA Model Error when Enable both DP and DCP (#7012 ) ### What this PR does / why we need it? For GQA model, when we enable both dp and dcp (disable pcp), the key-value pairs were not being captured correctly; we have now fixed it. Signed-off-by: dsxsteven <dsxsteven@sina.com>	2026-03-05 16:51:08 +08:00
Bai Yongbin	9d09488b4a	[Feat] support basic pcp&dcp for qwen3next (#6091 ) ### What this PR does / why we need it? This PR implements Context Parallelism (CP) support for the Qwen3-Next model, including PCP (Prefill Context Parallelism) and DCP (Decode Context Parallelism). - vLLM version: v0.15.0 - vLLM main: `f176443446` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com> Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com> Signed-off-by: 白永斌 <baiyongbin3@h-partners.com> Signed-off-by: Bai Yongbin <845473182@qq.com> Co-authored-by: SunnyLee219 <3294305115@qq.com> Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com> Co-authored-by: 白永斌 <baiyongbin3@h-partners.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2026-02-28 21:44:08 +08:00
lilinsiman	c13d90b766	[Refactor][EAGLE] 7/N Merged PCP and disable_padded interface (#6811 ) ### What this PR does / why we need it? [Refactor][EAGLE] 7/N Merged PCP and disable_padded interface into eagle_proposer.py This pull request refactors the speculative decoding mechanism by merging the Prefill Context Parallelism (PCP) and Multi-Token Prediction (MTP) functionality directly into eagle_proposer.py. The changes aim to improve the efficiency and correctness of distributed speculative decoding, in particular by enabling the Eagle feature to work with the disable_padded interface. This involves adjustments to attention metadata, input/output processing, and state management to ensure proper operation in parallel environments. 1. The PCP and MTP features are migrated to eagle_proposer.py 2. The Eagle and PCP features are integrated 3. The Eagle feature can now use the disable_padded interface ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tests and UT - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2026-02-27 16:06:56 +08:00
wangxiaoteng888	b881fab416	[P/D][PCP] mooncake layerwise support pcp function (#6627 ) ### What this PR does / why we need it? mooncake layerwise support pcp function PCP (Prefill Context Parallelism) Support: Introduced explicit support for Prefill Context Parallelism (PCP) and Decode Context Parallelism (DCP) in the Mooncake layerwise KV cache transfer mechanism, allowing for more granular control and awareness of parallel configurations during data transfer. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2026-02-12 11:02:25 +08:00
Qiu	feab047084	[bugfix](pcp,gqa) set kv_inverse_idx_for_chunk and cp_kv_recover_idx_for_chunk to None when dcp only (#6317 ) ### What this PR does / why we need it? We only do restore and recover for pcp, so we should set `kv_inverse_idx_for_chunk` and `cp_kv_recover_idx_for_chunk` to `None` when only using dcp. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-29 19:35:52 +08:00
LICO67373	379ce599d0	[Bugfix] Add missing draft_attn_metadatas parameter to fix MTP test (#6232 ) ### What this PR does / why we need it? Fix the MTP test failure caused by accessing non-existent attribute `forward_context.draft_attn_metadatas`. Root cause: In `AscendAttentionBackendImpl.update_graph_params`, the code incorrectly accessed `forward_context.draft_attn_metadatas`, but `ForwardContext` class doesn't have this attribute. The original code passed this value via function parameter. Fix: Add `draft_attn_metadatas` parameter to the entire call chain: - `update_full_graph_params` function in `acl_graph.py` - All `update_graph_params` methods in attention backends - Pass the parameter correctly in `eagle_proposer.py` Also applied Gemini's suggestion to make `vllm_config=None` in `AscendAttentionCPImpl.update_graph_params` for API consistency. Related to item 9 in #5463 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This fixes the CI test failure: `test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]` Signed-off-by: lico67373 <918688502@qq.com>	2026-01-28 14:41:18 +08:00
wangxiyuan	4e3919e965	Reapply "[Refactor] Unify full-graph parameter update logic (#6041 )" (#6227 ) (#6231 ) This reverts commit `95649344aa`. The CI failure is not related to this change. Let's reapply it. - vLLM version: v0.14.0 - vLLM main: `d68209402d`	2026-01-26 09:04:54 +08:00
wangxiyuan	95649344aa	Revert "[Refactor] Unify full-graph parameter update logic (#6041 )" (#6227 ) This reverts commit `8966a99710`. It breaks the test `tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py::test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]` - vLLM version: v0.14.0 - vLLM main: `d68209402d`	2026-01-25 15:25:38 +08:00
LICO67373	8966a99710	[Refactor] Unify full-graph parameter update logic (#6041 ) ### What this PR does / why we need it? Refactor: Unify full-graph parameter update logic This PR consolidates the scattered full-graph parameter update logic into a unified approach, improving code architecture and eliminating duplication. Key improvements: 1. Unified interface - Create `update_full_graph_params` as the single entry point for all full-graph updates - Replace multiple scattered update calls with one unified function - Remove ~50 lines of duplicated if-else logic across `model_runner_v1.py` and `eagle_proposer.py` 2. Better architecture - Move update logic to respective Backend classes (`AscendAttentionBackend`, `AscendMLABackend`) - Each Backend manages its own parameter update logic internally - Simplify caller code to just dispatch to the appropriate Backend 3. Cleaner parameter handling - Remove unnecessary `pcp_size` and `dcp_size` parameter passing - Get parallel configuration directly from distributed groups - Consistent with how other parts of the codebase obtain these values Why we need it: - Maintainability: Future changes only need to be made in one place per Backend - Code quality: Follows DRY principle and Single Responsibility Principle - Readability: Cleaner, more intuitive code structure ### Does this PR introduce _any_ user-facing change? No. This is a pure refactoring with no functional changes - same behavior, cleaner code. ### How was this patch tested? - All existing unit tests pass with updated mocks - No new tests needed (pure refactoring, no behavior changes) - CI validates correctness --- - vLLM version: v0.13.0 Signed-off-by: lico67373 <918688502@qq.com> Co-authored-by: drslark <slarksblood@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2026-01-24 20:12:57 +08:00
dsxsteven	8378bc28b0	[Misc] Remove CP Redundant Variables after FIA operator enables for CANN 8.5 (#6013 ) ### What this PR does / why we need it? PCP/DCP splits the KV cache across different cards. With the parameter cp-kv-cache-interleave-size, the first `size` tokens are cached on card 0, and so on. However, if there are too few tokens, some cards store no key-value pairs, which results in zero values, corrupted values, and precision issues. Until now, additional operations were needed to avoid this precision problem. Now that the FIA operator is integrated in mla_cp._forward_decode and CANN has been updated to 8.5.0, these additional operations can be removed. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Passed all CI with CANN 8.5.0 - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: dsxsteven <dsxsteven@sina.com> Signed-off-by: dsxsteven <36877507+dsxsteven@users.noreply.github.com>	2026-01-23 14:13:12 +08:00
zhangxinyuehfad	819a4459ce	Drop vLLM 0.13.0 support (#6069 ) ### What this PR does / why we need it? Drop vLLM 0.13.0 support, upgrade to 0.14.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-23 09:45:08 +08:00
Qiu	58ff465821	[bugfix] fix the complex and potentially problematic generate_kv_idx. (#5957 ) ### What this PR does / why we need it? In long-sequence scenarios, the chunked-prefill component may encounter dimension misalignment issues, which previously occurred during precision testing on the code_generate_lite dataset. This PR removes redundant computations and instead derives the value using existing results and straightforward calculations. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-21 14:21:02 +08:00
SILONG ZENG	329961b375	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #2 ) (#5977 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| `vllm_ascend/attention/attention_mask.py` \| \| `vllm_ascend/attention/attention_v1.py` \| \| `vllm_ascend/attention/context_parallel/attention_cp.py` \| \| `vllm_ascend/attention/context_parallel/common_cp.py` \| \| `vllm_ascend/attention/context_parallel/mla_cp.py` \| \| `vllm_ascend/attention/utils.py` \| \| `vllm_ascend/batch_invariant.py` \| \| `vllm_ascend/device/device_op.py` \| \| `vllm_ascend/device_allocator/camem.py` \| \| `vllm_ascend/envs.py` \| - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-19 08:59:46 +08:00
wjunLu	c11a05c4e1	[Main2Main] Upgrade vllm commit to 0113 (#5839 ) ### What this PR does / why we need it? Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9) - Modify import paths due to the refactors https://github.com/vllm-project/vllm/pull/31916 https://github.com/vllm-project/vllm/pull/32054 - Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional arguments but 3 were given` due to https://github.com/vllm-project/vllm/pull/24498 - Skip the async-scheduling tests in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which have never been verified https://github.com/vllm-project/vllm/pull/31998 - Skip some pooling tests, which fail because of https://github.com/vllm-project/vllm/pull/32148; vLLM's own CI also fails there: https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4 We will re-enable those tests when main2main reaches https://github.com/vllm-project/vllm/pull/32243 - Skip some cases in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are broken by https://github.com/vllm-project/vllm/pull/32118 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-01-15 09:48:53 +08:00
Qiu	a88937f5cb	[bugfix](cp) replace None with zeros/inf tensor to avoid TypeError (#5837 ) ### What this PR does / why we need it? When some devices hold no KV cache, the `_compute_prefill_context` function returns `None`, which is unexpected. This PR replaces `None` with all-zeros/-inf tensors to avoid the TypeError. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ```bash pytest tests/e2e/multicard/4-cards/long_sequence/test_chunked_prefill.py -k test_models_chunked_prefill_with_empty_kvcache ``` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-14 20:57:48 +08:00
LICO67373	c8a324ab73	[Refactor] Add comments for Metadata classes in attention module (#5789 ) ### What this PR does / why we need it? Add docstrings for Metadata and MetadataBuilder classes in the attention module to improve code readability. Related to #5463 (Item 11: Add some comments for CommonMetadata and others) Modified files: - `vllm_ascend/attention/context_parallel/common_cp.py`: Added comments for `AscendPCPMetadata`, `CPChunkedContextMetadata`, `AscendMetadataForPrefill`, `AscendMetadataForDecode` - `vllm_ascend/attention/utils.py`: Added comments for `AscendPrefillContextParallelMetadata` - `vllm_ascend/attention/mla_v1.py`: Added comments for `ChunkedContextMetadata`, `AscendMLADecodeMetadata` - `vllm_ascend/attention/attention_v1.py`: Added comments for `AscendMetadata`, `AscendAttentionMetadataBuilder` - `vllm_ascend/attention/context_parallel/attention_cp.py`: Added comments for `AscendAttentionCPMetadataBuilder` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Documentation only, no functional changes. Signed-off-by: lico67373 <918688502@qq.com>	2026-01-13 08:46:50 +08:00
LICO67373	380f089fbf	[Refactor] Fix AttentionMaskBuilder singleton and remove redundant pcp_prefill_mask (#4870 ) ## What this PR does / why we need it? This PR fixes the `AttentionMaskBuilder` singleton initialization issue introduced in PR #4779 and removes the unused `pcp_prefill_mask` field. ### Background After PR #4779 made `AttentionMaskBuilder` a singleton with `@singleton` decorator, the class constructor now requires a `device` parameter. However, two initialization sites were still using the old parameterless constructor, causing failures. ### Changes 1. Fix singleton initialization - Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)` in `AscendMLAMetadataBuilder.__init__()` - Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)` in `AscendAttentionMetadataBuilder.__init__()` 2. Remove unused field - Removed `pcp_prefill_mask` field from `AscendPrefillContextParallelMetadata` (never used in codebase) - Updated related test assertions ### Related - Issue #5463 - PR #4779 (Unify all mask generation methods) - PR #5389 (Make AttentionMaskBuilder singleton) ## Does this PR introduce _any_ user-facing change? No. This is an internal refactoring. ## How was this patch tested? - ✅ Local testing: No linter errors - ✅ Unit tests for attention modules verified - ⏳ CI pipeline Signed-off-by: lico67373 <918688502@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2026-01-07 17:09:52 +08:00
wujinyuan1	4a3663327b	[Refactor]7/N Extract common code to common_cp (#5490 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason: Eliminate code duplicated between two files (mla_cp.py, attention_cp.py) by extracting it to common_cp.py. vLLM version: 0.13.0rc3 vLLM main: `ad32e3e19c` vLLM version: release/v0.13.0 vLLM main: `5fbfa8d9ef` - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: wujinyuan1 <wjy9595@qq.com> Signed-off-by: wujinyuan1 <wujinyuan1@huawei.com> Co-authored-by: wujinyuan1 <wjy9595@qq.com>	2026-01-05 17:41:12 +08:00
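
Note on commit 8378bc28b0: the message says that with cp-kv-cache-interleave-size the first block of tokens is cached on card 0 "and so on", and that short sequences leave some cards without any key-value pairs. The sketch below illustrates that effect. It assumes a simple round-robin placement of blocks across cards, which the log does not spell out, and the function name `tokens_per_card` is hypothetical, not part of vllm-ascend.

```python
def tokens_per_card(num_tokens, num_cards, interleave_size):
    """Count how many tokens each card would cache.

    Assumption (not confirmed by the commit log): blocks of
    `interleave_size` consecutive tokens are assigned to cards
    in round-robin order, starting at card 0.
    """
    counts = [0] * num_cards
    for token_idx in range(num_tokens):
        # Block index of this token, wrapped around the number of cards.
        card = (token_idx // interleave_size) % num_cards
        counts[card] += 1
    return counts


# A short sequence: only card 0 receives tokens, the others stay empty.
short = tokens_per_card(num_tokens=3, num_cards=4, interleave_size=4)
# A longer sequence: every card holds at least one block.
longer = tokens_per_card(num_tokens=20, num_cards=4, interleave_size=4)
```

Under this assumption, any card whose count is 0 has nothing to attend over, which is the situation the removed workaround was guarding against.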

23 Commits
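
Note on commit 29308ac3a9: the fix replaces a class attribute assignment with an instance attribute assignment so that reorder_batch_threshold applies only to the current instance. The following is a minimal sketch of the general Python behavior behind that fix. The class `BuilderSketch` and the helper `set_on_class` are hypothetical names for illustration and are not the actual vllm-ascend code.

```python
class BuilderSketch:
    # Class attribute: a single default shared by every instance.
    reorder_batch_threshold = 1

    def __init__(self, threshold=None):
        if threshold is not None:
            # Instance attribute: shadows the class default for this
            # object only, leaving other instances untouched.
            self.reorder_batch_threshold = threshold


def set_on_class(threshold):
    """The problematic pattern: assigning on the class itself changes the
    value seen by every instance that has no override of its own."""
    BuilderSketch.reorder_batch_threshold = threshold


default_builder = BuilderSketch()
custom_builder = BuilderSketch(threshold=4)

# Assigning on the instance did not leak into the class or other instances.
before = (default_builder.reorder_batch_threshold,
          custom_builder.reorder_batch_threshold)

# Assigning on the class leaks into `default_builder`.
set_on_class(8)
after = (default_builder.reorder_batch_threshold,
         custom_builder.reorder_batch_threshold)
```

The second assignment is the kind of global side effect the commit message calls "global pollution": an instance that never asked for a new threshold silently picks one up.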