xc-llm-ascend

Author	SHA1	Message	Date
Yang Yuxi	e776d5c0f1	[Bugfix]v0.18.0 support FlashComm1 & DCP for Qwen (#7726 ) ### What this PR does / why we need it? This PR backports the changes from #7673 ([Bugfix] support FlashComm1 & DCP for Qwen) to the releases/v0.18.0 branch. -------- Signed-off-by: Yang Yuxi <907276627@qq.com>	2026-03-29 15:59:19 +08:00
lilinsiman	8f278fc101	[eagle3][pcp] fix bug for eagle3 and cp enable (#7309 ) ### What this PR does / why we need it? This PR fixes a bug, introduced by the parallel speculative inference PR, that occurs when eagle3 and CP are both enabled. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? tests and ut - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2026-03-17 16:14:45 +08:00
Qiu	13adcbe44b	feat(attention_cp): support chunked prefill for Qwen3Next with PCP&DCP (#6900 ) ### What this PR does / why we need it? Support chunked prefill for Qwen3Next with PCP&DCP - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-03-09 17:55:09 +08:00
xiaocongtou6	bc0fd7ca72	[Feat]Adapt the graph mode (piecewise and full_decode_only) of PCP and DCP for DeepSeek v3.2. (#6940 ) ### What this PR does / why we need it? Adapt the graph mode (piecewise and full_decode_only) of PCP and DCP for DeepSeek v3.2. ### How was this patch tested? Test output: {"object":"text_completion","model":"deepeek_v3","choices":[{"index":0,"text":" the head of state and head of government of the United States, indirectly elected to a four-year term by the American people through the Electoral College. The officeholder leads the executive branch of the federal government and is the commander-in-chief of the United States","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":1,"text":" Paris. This is the largest city in France and its main political, cultural and commercial center. The modern location of the city is the north of the central part of the country, on the banks of the Seine River Seine River Seine in 3\n\n","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":2,"text":" now\n\n# AI future is now\n\nThe world is changing at a rapid pace, and artificial intelligence (AI) is at the forefront of this transformation. From self-driving cars to virtual assistants, AI is already making a significant impact on our daily lives","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":3,"text":" a 3rd year student at the University of Lincoln studying Media Production. 
This blog is about my work throughout my final year on the course.\n\n## Tuesday 3 May 2016\n### Final Major Project - Evaluation\n\nFor my final project I","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":27,"total_tokens":227,"completion_tokens":200,"prompt_tokens_details":null},"kv_transfer_params":null} - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: xiaocongtou6 <2066962956@qq.com> Signed-off-by: xiaocongtou6 <105542647+xiaocongtou6@users.noreply.github.com>	2026-03-06 16:10:24 +08:00
Bai Yongbin	9d09488b4a	[Feat] support basic pcp&dcp for qwen3next (#6091 ) ### What this PR does / why we need it? This PR implements Context Parallelism (CP) support for the Qwen3-Next model, including PCP (Prefill Context Parallelism) and DCP (Decode Context Parallelism). - vLLM version: v0.15.0 - vLLM main: `f176443446` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com> Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com> Signed-off-by: 白永斌 <baiyongbin3@h-partners.com> Signed-off-by: Bai Yongbin <845473182@qq.com> Co-authored-by: SunnyLee219 <3294305115@qq.com> Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com> Co-authored-by: 白永斌 <baiyongbin3@h-partners.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2026-02-28 21:44:08 +08:00
weiguihua2	db51a1b9b6	[Feat]ds3.2 support pcp (#6733 ) ### What this PR does / why we need it? This PR adapts the ds3.2 model to support the PCP feature. The solution is as follows: when saving the KV cache, first perform an all-gather on the KVs, and then each node saves its own copy. When the attention or the indexer performs its calculations, each first all-gathers the KV cache and then computes. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? 02/12 23:05:10 - AISBench - INFO - Running 1-th replica of evaluation 02/12 23:05:10 - AISBench - INFO - Task [vllm-api-general-chat/gsm8k]: {'accuracy': 96.35416666666667, 'type': 'GEN'} 02/12 23:05:10 - AISBench - INFO - time elapsed: 2.87s 02/12 23:05:12 - AISBench - INFO - Evaluation tasks completed. 02/12 23:05:12 - AISBench - INFO - Summarizing evaluation results... dataset version metric mode vllm-api-general-chat gsm8kdataset - accuracy gen 96.35 - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: weiguihua2 <weiguihua2@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-25 09:46:57 +08:00
Feng Liu	03a18ad6fd	[E2E] add E2E for Prefix Caching cp & Chunked Prefill cp (#5149 ) ### What this PR does / why we need it? Add E2E for Prefix Caching cp & Chunked Prefill cp ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: F.Liu <liufeng248@huawei.com> Signed-off-by: Feng Liu <46866849+ader47@users.noreply.github.com> Co-authored-by: F.Liu <liufeng248@huawei.com>	2026-02-03 15:04:14 +08:00
Qiu	638cae824d	[bugfix](CP) Fix and unify the PD request discrimination logic. (#5939 ) ### What this PR does / why we need it? Since the PR (https://github.com/vllm-project/vllm/pull/32118) has modified the criteria for judging Prefill and Decode requests in vLLM, PCPManager needs to synchronize with this standard. As PCPManager involves multiple calculations of PD request counts, this PR attempts to consolidate the related logic and update the PD request count once per batch. ### How was this patch tested? ```bash pytest tests/e2e/multicard/4-cards/long_sequence/test_mtp.py ``` - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-31 10:26:02 +08:00
dsxsteven	8378bc28b0	[Misc] Remove CP Redundant Variables after FIA operator enables for CANN 8.5 (#6013 ) ### What this PR does / why we need it? PCP/DCP splits the KV cache across cards. With the parameter cp-kv-cache-interleave-size, the first cp-kv-cache-interleave-size tokens are cached on card 0, the next group on card 1, and so on. However, if there are too few tokens, some cards store no key-value pairs, which results in zero or corrupted values and precision issues. Additional operations were previously introduced to avoid this precision problem. Now that the FIA operator is integrated in mla_cp._forward_decode and CANN has been updated to 8.5.0, these additional operations can be removed. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? passed all CI by CANN 8.5.0 - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: dsxsteven <dsxsteven@sina.com> Signed-off-by: dsxsteven <36877507+dsxsteven@users.noreply.github.com>	2026-01-23 14:13:12 +08:00
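The interleaved placement described in #6013 can be sketched as follows. This is a minimal illustration that assumes a plain round-robin layout in blocks of `cp-kv-cache-interleave-size` tokens; `kv_owner_rank` and `tokens_per_rank` are hypothetical helper names, not functions from the repository.

```python
def kv_owner_rank(token_idx: int, interleave_size: int, cp_size: int) -> int:
    """Card (rank) assumed to store the KV entry of token `token_idx`.

    Assumption: tokens are dealt out round-robin in blocks of
    `interleave_size`, so block 0 goes to card 0, block 1 to card 1, etc.
    """
    return (token_idx // interleave_size) % cp_size


def tokens_per_rank(num_tokens: int, interleave_size: int, cp_size: int) -> list[int]:
    """Count how many tokens each card ends up caching."""
    counts = [0] * cp_size
    for token_idx in range(num_tokens):
        counts[kv_owner_rank(token_idx, interleave_size, cp_size)] += 1
    return counts


# 10 tokens, interleave size 4, 4 cards: the last card caches nothing.
print(tokens_per_rank(10, 4, 4))  # [4, 4, 2, 0]
```

Under this assumed layout, a short sequence leaves the trailing cards with an empty shard, which is the situation the commit message says needed extra handling before CANN 8.5.0.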
zhangxinyuehfad	819a4459ce	Drop vLLM 0.13.0 support (#6069 ) ### What this PR does / why we need it? Drop vLLM 0.13.0 support, upgrade to 0.14.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-23 09:45:08 +08:00
Qiu	58ff465821	[bugfix] fix the complex and potentially problematic generate_kv_idx. (#5957 ) ### What this PR does / why we need it? In long-sequence scenarios, the chunked-prefill component may encounter dimension misalignment issues, which previously occurred during precision testing on the code_generate_lite dataset. This PR removes redundant computations and instead derives the value using existing results and straightforward calculations. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-21 14:21:02 +08:00
Li Wang	8cf1e8d8a7	[CI] Add wait logic for each individual case (#6036 ) ### What this PR does / why we need it? Wait until the NPU memory is clean ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: leo-pony <nengjunma@outlook.com>	2026-01-20 21:05:44 +08:00
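A wait of this kind is commonly written as a polling loop with a timeout. The sketch below is an assumption about the general shape, not the logic added in #6036; `wait_until_memory_clean` and the `get_used_bytes` callback are hypothetical names, and a real test would query the device instead of a fake probe.

```python
import time
from typing import Callable


def wait_until_memory_clean(
    get_used_bytes: Callable[[], int],
    threshold_bytes: int,
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `get_used_bytes` until it reports at most `threshold_bytes`.

    Returns True once memory is considered clean, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_used_bytes() <= threshold_bytes:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)


# Usage with a fake probe that reports shrinking memory usage.
readings = iter([900, 400, 50])
print(wait_until_memory_clean(lambda: next(readings), threshold_bytes=100,
                              sleep=lambda _: None))  # True
```

Passing `sleep` as a parameter keeps the helper testable without real delays.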
weiguihua2	5892455f43	[Bugfix] fix bug of pcp+mtp+async scheduler (#5994 ) ### What this PR does / why we need it? Fixes an issue where the service could not run with PCP and MTP when asynchronous scheduling was enabled. With PCP, MTP, and asynchronous scheduling all enabled, the service hangs because of a shape mismatch after a curl request is sent. This PR resolves this issue. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-01-20 15:24:05 +08:00
Qiu	72fee47cba	[CI](cp) skip bad UT test_models_chunked_prefill_with_empty_kvcache temporarily (#5919 ) Temporarily skip the bad UT test_models_chunked_prefill_with_empty_kvcache, which is incompatible with main2main 20260114. - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-15 15:50:06 +08:00
wjunLu	c11a05c4e1	[Main2Main] Upgrade vllm commit to 0113 (#5839 ) ### What this PR does / why we need it? Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9) - Modify import paths due to the refactors https://github.com/vllm-project/vllm/pull/31916 https://github.com/vllm-project/vllm/pull/32054 - Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional arguments but 3 were given` due to https://github.com/vllm-project/vllm/pull/24498 - Skip the async-scheduling tests in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are never verified https://github.com/vllm-project/vllm/pull/31998 - Skip some pooling tests, which are broken by https://github.com/vllm-project/vllm/pull/32148 where vLLM itself also fails https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4 We will re-enable those tests when main2main reaches https://github.com/vllm-project/vllm/pull/32243 - Skip some cases in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are broken by https://github.com/vllm-project/vllm/pull/32118 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-01-15 09:48:53 +08:00
Qiu	a88937f5cb	[bugfix](cp) replace None with zeros/inf tensor to avoid TypeError (#5837 ) ### What this PR does / why we need it? When some devices hold no KV cache, the `_compute_prefill_context` function returns `None`, which is unexpected. This PR replaces `None` with all-zeros/-inf tensors to avoid the TypeError. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ```bash pytest tests/e2e/multicard/4-cards/long_sequence/test_chunked_prefill.py -k test_models_chunked_prefill_with_empty_kvcache ``` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-14 20:57:48 +08:00
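One way to see why zeros and -inf are safe placeholders for an empty shard: if partial attention results are combined by log-sum-exp (LSE) weighting, a shard with LSE of -inf gets weight 0 and drops out of the merge. The sketch below assumes that merge scheme and uses plain Python lists instead of tensors; `merge_partials` is a hypothetical name, and the source does not confirm that #5837 merges results this way.

```python
import math


def merge_partials(partials):
    """Merge per-device (output, lse) pairs for a single query.

    output: list of floats, attention output over that device's local KV
    lse:    log-sum-exp of that device's local attention scores
    A device with no KV contributes ([0.0, ...], -inf).
    """
    dim = len(partials[0][0])
    max_lse = max(lse for _, lse in partials)
    if max_lse == -math.inf:
        # No device holds any KV: keep the neutral placeholder.
        return [0.0] * dim, -math.inf
    # exp(-inf) is 0.0, so empty shards get zero weight.
    weights = [math.exp(lse - max_lse) for _, lse in partials]
    total = sum(weights)
    merged = [
        sum(w * out[d] for w, (out, _) in zip(weights, partials)) / total
        for d in range(dim)
    ]
    return merged, max_lse + math.log(total)


local = ([1.0, 2.0], 0.5)
empty = ([0.0, 0.0], -math.inf)  # placeholder instead of None
print(merge_partials([local, empty]))  # ([1.0, 2.0], 0.5)
```

With `None` in place of the placeholder pair, the same merge would raise a TypeError when unpacking or doing arithmetic, which matches the failure mode named in the commit title.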
wangxiyuan	6f7a81cd9f	[CI] cleanup single/multi-card test (#5623 ) 1. speed up e2e light test. 2. create `2-cards` and `4-cards` folder in multicard 3. move ops to nightly 4. run test in Alphabetical Order - vLLM version: v0.13.0 - vLLM main: `8be6432bda` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-07 14:13:34 +08:00

17 Commits