xc-llm-ascend

Author	SHA1	Message	Date
drslark	48ec97821a	[Bugfix] Fixed an accuracy problem of sp with eagle3 (#5816 ) ### What this PR does / why we need it? Fixed an accuracy problem when using eagle3 with sp. The problem is described in https://github.com/vllm-project/vllm-ascend/issues/5825. It also adds a much more precise way to determine whether drafter should use `sp` or not. Also, it changes the `eager` of drafter to be a real `eager` in frontend to avoid a `fx-graph` problem. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? For simpilicity, we test it as in https://github.com/vllm-project/vllm-ascend/issues/5825. And we get the same result of `eagle3` with `sp` disabled. ```text -------------------------------------------------- total_num_output_tokens: 1000 num_drafts: 437 num_draft_tokens: 1311 num_accepted_tokens: 564 mean acceptance length: 2.29 -------------------------------------------------- acceptance at token 0: 0.62 acceptance at token 1: 0.40 acceptance at token 2: 0.27 acceptance at token 3: 0.00 acceptance at token 4: 0.00 acceptance at token 5: 0.00 ``` * vLLM version: v0.13.0 * vLLM main: `2f4e6548ef` Signed-off-by: drslark <slarksblood@qq.com>	2026-01-14 09:00:37 +08:00
Levi	ecd4232698	[Feat] flashcomm2+oshard Generalized (#4723 ) ### What this PR does / why we need it? [FlashComm2](https://gitcode.com/ascend-tribe/ascend-inference-cluster/blob/main/FlashComm/FlashComm2%E5%A4%A7%E6%A8%A1%E5%9E%8B%E6%8E%A8%E7%90%86%E4%B8%AD%E4%BB%A5%E5%AD%98%E6%8D%A2%E4%BC%A0%E7%9A%84%E9%80%9A%E4%BF%A1%E4%BC%98%E5%8C%96%E6%8A%80%E6%9C%AF.pdf) introduces redundant storage of the o_proj matrix, which imposes pressure on GPU memory. We propose the FlashComm2+Oshard approach by integrating the shared linear layer feature (#2931). This approach distributes weights layer-by-layer to each GPU and accesses the o_proj of each layer via asynchronous broadcast operations, thereby alleviating memory pressure while achieving nearly lossless performance compared to the original FlashComm2. This PR implements a generalized FlashComm2+Oshard solution. Using following env to support flashcomm2 with oshard ```shell export VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1 --additional-config '{ "layer_sharding": ["o_proj"] }' ``` ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com>	2026-01-10 22:57:57 +08:00
ZT-AIA	e11ff8e535	[BufFix]Fix the error when using Ascend custom operators with rank=128 (#5394 ) ### What this PR does / why we need it? The customized ascend operator sgmv_expand and sgmv_shrink applies only to the scenario where rank is 8,16,32,64. When rank >= 128, the operator is out of range, causing the model to report an error. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Depends on this commit https://github.com/vllm-project/vllm/pull/31408 - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` --------- Signed-off-by: ZT-AIA <1028681969@qq.com> Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>	2026-01-09 15:57:43 +08:00
zhenwenqi2024	97f6be8108	[feature]dcp&pcp support mlapo (#5672 ) ### What this PR does / why we need it? mlapo in deepseek is a huge performance improvement in decode, this pr support pcp & dcp with mlapo ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>	2026-01-08 23:49:23 +08:00
LI SHENGYONG	b69db4ce55	[EPLB][CI] EPLB add aclgraph and redundant expert ci (#5625 ) ### What this PR does / why we need it? EPLB currently does not have CI related to aclgraph and redundancy experts; this PR adds them. release on #5529 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Tested the use cases to be added in this PR. PASSED ====================================================== warnings summary ========================================================== <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ====================================================== 1 passed, 2 warnings in 272.24s (0:04:32) ===================================================== - vLLM version: v0.13.0 - vLLM main: `8be6432bda` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-08 09:51:48 +08:00
Li Wang	1165b2c863	[1/N][CI] Refactor accuracy test (#5400 ) ### What this PR does / why we need it? 1. Accuracy testing no longer compares eager and graph modes; instead, it directly extracts the golden result under the graph mode configuration (the implicit purpose of this case is to verify whether modifications affect existing results) 2. Next step: finer-grained supervision of logits/sampler results ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-07 20:58:15 +08:00
wangxiyuan	6f7a81cd9f	[CI] cleanup single/multi-card test (#5623 ) 1. speed up e2e light test. 2. create `2-cards` and `4-cards` folder in multicard 3. move ops to nightly 4. run test in Alphabetical Order - vLLM version: v0.13.0 - vLLM main: `8be6432bda` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-07 14:13:34 +08:00
zhenwenqi2024	ad9b711f89	[Bugfix] fix dcp_only bug and add e2e accuracy test for dcp only and pcp only (#5565 ) ### What this PR does / why we need it? [Bugfix] fix dcp_only bug and add e2e accuracy test for dcp only and pcp only this pr fix the bug of accuracy test when decode_parallel_size>1 and prefill_context_parallel_size=1. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>	2026-01-06 22:48:21 +08:00
Shanshan Shen	b94d589769	[MM][Bugfix] Update `hf_config` to `hf_text_config` (#5319 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm-ascend/pull/5205, update `hf_config` to `hf_text_config`. Find more details at https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3675417534 and https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3677920872. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-06 16:41:39 +08:00
wjunLu	3cf059a72b	[Main2Main] Upgrade vllm commit to 0105 (#5595 ) ### What this PR does / why we need it? Upgrade vllm commit to 0105 (8be6432bdaf6275664d857b1e5e9bf8ed1ce299e) 1. Remove `maybe_padded_num_tokens` arg in `model_runner_v1.py` since https://github.com/vllm-project/vllm/pull/31517 deleted unused arg 2. Remove dense `Qwen/Qwen3-0.6B` in `tests/e2e/multicard/test_aclgraph_capture_replay.py` and `tests/e2e/multicard/test_data_parallel.py` due to https://github.com/vllm-project/vllm/pull/30739 where offline data parallel mode will not be supported/useful for dense models 3. Adapt `vllm_ascend/worker/worker.py` due to https://github.com/vllm-project/vllm/pull/31584 4. Adapt `self.block_size` calling due to https://github.com/vllm-project/vllm/pull/31540 5. Modify `test_mla_v1.py` due to https://github.com/vllm-project/vllm/pull/28454 , which refactorred `get_head_size()` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-06 08:44:29 +08:00
weiguihua2	549be94397	[Bugfix] fix pcp + eplb error (#5561 ) ### What this PR does / why we need it? Fix the bug in the PCP overlay feature 1、Fix the bug related to PCP and EPLB overlap by including PCP size in the word_size calculation. 2、In the PCP pooling scenario, a prompt has been added for setting the cp_kv_cache_interleave_size. - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-01-05 14:08:11 +08:00
LookAround0301	d25a2c20c5	[Bugfix] Fix chunk prefill bug for long_sequence feature (#5444 ) ### What this PR does / why we need it? Fix chunk prefill bug for long_sequence feature When there are two requests with chunk prefill enabled in the long-sequence scenario, if one request has only 1 token during scheduling, it will be identified as a decode request and trigger an error. This PR fixes the issue. Closes: https://github.com/vllm-project/vllm-ascend/issues/5445 - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: LookAround <lixushi@huawei.com>	2026-01-05 09:16:36 +08:00
drslark	363ac1b80f	[Feat][main] Supported to use full-graph with Qwen3-Next-MTP (#5477 ) ### What this PR does / why we need it? Supported to use full-graph with Qwen3-Next-MTP. In detail, we adatpted `AscendAttentionState.ChunkedPrefill` in main model, and also adapted `AscendAttentionState.ChunkedPrefill` in mtp model. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? We changed the test of Qwen3-Next-MTP in `tests/e2e/multicard/test_qwen3_next.py` to make it a test of `FULL_DECODE_ONLY`. Then run `pytest -s tests/e2e/multicard/test_qwen3_next.py::test_qwen3_next_distributed_mp_eager_mtp_similarity_tp4`. And this test passed. ```text . ================================================================================================================================= warnings summary ================================================================================================================================= <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ==================================================================================================================== 1 passed, 2 warnings in 271.89s (0:04:31) ===================================================================================================================== ``` - vLLM version: v0.13.0 - vLLM main: `5326c89803` Signed-off-by: drslark <slarksblood@qq.com>	2026-01-04 12:03:21 +08:00
wjunLu	3c2d3e52e5	[Main2Main] Upgrade vllm commit to 1230 (#5495 ) ### What this PR does / why we need it? Upgrade vllm commit to 1230 Affected by https://github.com/vllm-project/vllm/pull/27614 (and the core PR https://github.com/vllm-project/vllm/pull/26866), we have to make the following changes: 1. Modify `tests/e2e/multicard/test_aclgraph_capture_replay.py` to keep compatible with both vllm version of `v0.13.0` and latest main commitID, while vllm enables async scheduling by default 2. Skip `test_guided_decoding.py` due to xgrammar errors (https://github.com/vllm-project/vllm-ascend/issues/5524) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` --------- Signed-off-by: wjunLu <wjunlu217@gmail.com>	2025-12-31 09:44:35 +08:00
lilinsiman	46862ce1af	[main][test] Refactor the mtp and eagle test case (#5326 ) ### What this PR does / why we need it? 1. Refactor the current test with mtp and eagle cases 2. Add new necessary cases with mtp and eagle ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-12-31 09:22:58 +08:00
ZT-AIA	24328aaf00	update vllm pin to 12.27 (#5412 ) ### What this PR does / why we need it? update vllm pin to 12.27 1、Fix Qwen2-MoE shared_expert_gate ：https://github.com/vllm-project/vllm/pull/31339 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? vLLM version: release/v0.13.0 vLLM main: `5326c89803` Co-authored-by: leo-pony [nengjunma@outlook.com](nengjunma@outlook.com) --------- Signed-off-by: ZT-AIA <1028681969@qq.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: leo-pony <nengjunma@outlook.com>	2025-12-28 00:19:36 +08:00
zhangyiming	45c5bcd962	[E2E] Optimize the E2E test time. (#5294 ) ### What this PR does / why we need it? Add cudagraph_capture_sizes for E2E CI test. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: menogrey <1299267905@qq.com>	2025-12-26 14:17:50 +08:00
ZT-AIA	adaa89a7a5	Update vllm pin to 12.25 (#5342 ) ### What this PR does / why we need it? - Fix vllm break in the pr: 1.[Drop v0.14 deprecations ]https://github.com/vllm-project/vllm/pull/31285 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: ZT-AIA <1028681969@qq.com>	2025-12-26 14:05:40 +08:00
wangxiyuan	2ae0bad96d	Remove VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE (#5272 ) `VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE` is only used together with `VLLM_ASCEND_ENABLE_PREFETCH_MLP` which is useless totally. This PR remove it. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-25 11:09:56 +08:00
dsxsteven	30778f371b	[BugFix] Fix num_pcp_pads Assignment Issues (#5273 ) ### What this PR does / why we need it? The variable `self.num_pcp_pads` was incorrectly truncated during assignment, causing errors in certain scenarios such as PD disaggregated. This issue has now been resolved. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Co-author by: QiuChunshuo <qiuchunshuo@huawei.com> - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: daishixun <dsxsteven@sina.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-25 10:38:09 +08:00
wjunLu	fca2f948c1	[E2E Refactor] Enable skipped e2e case (#5287 ) ### What this PR does / why we need it? The test case `tests/e2e/multicard/test_data_parallel.py` was skipped due to the errors encountered during migration from Ascend A2 to A3, the details are as follows ``` (EngineCore_DP0 pid=17833) RuntimeError: npu_moe_distribute_dispatch_v2:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:161 NPU function error: call aclnnMoeDistributeDispatchV3 failed, error code is 561002 (EngineCore_DP0 pid=17833) [ERROR] 2025-12-23-07:36:19 (PID:17833, Device:0, RankID:-1) ERR00100 PTA call acl api failed. (EngineCore_DP0 pid=17833) EZ9999: Inner Error! (EngineCore_DP0 pid=17833) EZ9999[PID: 17833] 2025-12-23-07:36:19.237.396 (EZ9999): HCCL_BUFFSIZE is too SMALL, maxBs = 512, h = 2048, epWorldSize = 2, localMoeExpertNum = 64, sharedExpertNum = 0, tokenNeedSizeDispatch = 4608, tokenNeedSizeCombine = 4096, k = 8, NEEDED_HCCL_BUFFSIZE(((maxBs * tokenNeedSizeDispatch * ep_worldsize * localMoeExpertNum) + (maxBs * tokenNeedSizeCombine * (k + sharedExpertNum))) * 2) = 609MB, HCCL_BUFFSIZE=200MB.[FUNC:MoeDistributeDispatchA3TilingFuncImpl][FILE:moe_distribute_dispatch_v2_tiling.cc][LINE:941] (EngineCore_DP0 pid=17833) TraceBack (most recent call last): (EngineCore_DP0 pid=17833) MoeDistributeDispatchV2 do tiling failed, ret is -1. (EngineCore_DP0 pid=17833) Check NnopbaseExecutorDoTiling(executor) failed (EngineCore_DP0 pid=17833) Check NnopbaseExecutorTilingAndUpdateBinInfo(executor) failed (EngineCore_DP0 pid=17833) Check NnopbaseExecutorMatchCache(executor) failed (EngineCore_DP0 pid=17833) Check NnopbaseRunForWorkspace(*executor, workspaceSize) failed ``` ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? After fixed, I ran `pytest -sv --durations=0 tests/e2e/multicard/test_data_parallel.py`, and the result looks good ``` ========================================================================================= warnings summary ========================================================================================= <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ======================================================================================== slowest durations ========================================================================================= 112.69s call tests/e2e/multicard/test_data_parallel.py::test_qwen_inference_dp2[32-vllm-ascend/Qwen3-30B-A3B-W8A8] 88.11s call tests/e2e/multicard/test_data_parallel.py::test_qwen_inference_dp2[32-Qwen/Qwen3-30B-A3B] 70.06s call tests/e2e/multicard/test_data_parallel.py::test_qwen_inference_dp2[32-Qwen/Qwen3-0.6B] (6 durations < 0.005s hidden. Use -vv to show these durations.) ============================================================================ 3 passed, 2 warnings in 270.88s (0:04:30) ============================================================================ ``` - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2025-12-25 09:18:05 +08:00
wangxiyuan	fb3d6ca08c	Cleanup uesless env (#5270 ) `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` is not used anywhere, let's remove it. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-24 22:07:59 +08:00
zhangxinyuehfad	8ae7fca947	[CI] refect e2e ci test (#5246 ) ### What this PR does / why we need it? efect e2e ci test： 1. tests/e2e/singlecard/pooling/test_embedding.py: remove the eager parameter and rename test case 2. tests/e2e/singlecard/pooling/test_scoring.py: Rename test cases 3. tests/e2e/singlecard/pooling/test_classification.py: Rename test case 4. tests/e2e/singlecard/test_quantization.py: remove the eager parameter and chage model to vllm-ascend/Qwen2.5-0.6B-W8A8 and Rename test case 5. tests/e2e/multicard/test_shared_expert_dp.py: Rename test cases 6. tests/e2e/singlecard/test_sampler.py: Rename test cases 7. tests/e2e/singlecard/test_aclgraph_accuracy.py: Rename test cases 8. tests/e2e/multicard/test_offline_inference_distributed.py: Rename test cases and remove the eager parameter 9. tests/e2e/multicard/long_sequence/test_accuracy.py: Rename test cases and remove the eager parameter 10. tests/e2e/multicard/long_sequence/test_basic.py: Rename test cases and remove the eager parameter 11.tests/e2e/multicard/test_expert_parallel.py:remove the eager parameter 12.tests/e2e/multicard/test_full_graph_mode.py:remove the eager parameter 13.tests/e2e/multicard/test_ilama_lora_tp2.py:remove the eager parameter 14.tests/e2e/singlecard/spec_decode_v1/test_v1_mtp_correctness.py:remove the eager parameter 15.tests/e2e/singlecard/spec_decode_v1/test_v1_spec_decode.py:remove the eager parameter 16.tests/e2e/singlecard/test_aclgraph_accuracy.py:remove the eager parameter 17.tests/e2e/singlecard/test_camem.py:remove the eager parameter 18.tests/e2e/singlecard/test_ilama_lora.py:remove the eager parameter 19.tests/e2e/singlecard/test_multistream_overlap_shared_expert.py:remove the eager parameter 20.tests/e2e/singlecard/test_vlm.py:remove the eager parameter 21.tests/e2e/singlecard/test_xli:remove the eager parameter ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-12-23 18:42:35 +08:00
SILONG ZENG	29a93daa82	[CI]refactor: standardize test case naming convention (#5243 ) ### What this PR does / why we need it? - Standardize test case naming in `vllm-ascend/tests/e2e/multicard/` to follow the `<model>_<feature>_<distributed>` convention. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	2025-12-23 14:13:42 +08:00
LI SHENGYONG	2e010e12dd	[EPLB][CI] Add dynamic EPLB CI for qwen3-moe (#5179 ) ### What this PR does / why we need it? Add dynamic EPLB CI for qwen3-moe-30B-W8A8 - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2025-12-23 11:31:00 +08:00
zhangsicheng5	78aa7f2693	[feature] support pcp + mtp in full graph (#4572 ) 1. support pcp + mtp in full graph 2. pcp/dcp related mtp bugfix 3. support pcp + mtpx - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>	2025-12-22 16:13:39 +08:00
weiguihua2	21745221a3	[lint]clean code (#5218 ) ### What this PR does / why we need it? Fix lint error inreoduced by https://github.com/vllm-project/vllm-ascend/pull/5141 - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-12-20 18:24:04 +08:00
weiguihua2	74aa968a9f	[e2e] add pcp e2e (#5141 ) ### What this PR does / why we need it? add pcp accuracy e2e test case - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-12-20 16:56:46 +08:00
LookAround0301	76e58d66be	support basic long_seq feature st (#5140 ) ### What this PR does / why we need it? support basic long_seq feature st - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: LookAround <lixushi@huawei.com>	2025-12-19 10:50:01 +08:00
ck-hw-1018	71e544e259	[test] add w4a8 accuracy case (#5110 ) ### What this PR does / why we need it? This PR add w4a8 accuracy testcase for e2e test ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the test - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: cuikai (C) <c00827167@china.huawei.com> Co-authored-by: cuikai (C) <c00827167@china.huawei.com>	2025-12-18 14:10:14 +08:00
zzzzwwjj	06b82e7503	[main] rename device type (#5099 ) ### What this PR does / why we need it? Rename `_910B` to `A2`; Rename `_910_93` to `A3`; Rename `_910_95` to `A5`; - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-12-17 14:08:19 +08:00
ZixuanWang	b1a853b0f6	Upgrade vllm commit hash to 1216 (#5053 ) ### What this PR does / why we need it? Upstream vLLM PR #30212 https://github.com/vllm-project/vllm/pull/30212 refactored the attention backend selection interface, This PR adapts vllm-ascend's get_attn_backend_cls to align with the new upstream standard, ensuring compatibility and reducing maintenance overhead. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? co-author:[leo-pony][nengjunma@outlook.com](mailto:nengjunma@outlook.com) - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zxwang <1476209578@qq.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: leo-pony <nengjunma@outlook.com>	2025-12-17 08:48:36 +08:00
drslark	8fb0ef5ffa	[main][BugFix] Fixed an accuracy bug of Qwen3-next-MTP when batched inferring (#4932 ) ### What this PR does / why we need it? Fixes an accuracy bug of Qwen3-next-MTP when batched inferring. It is descibed in https://github.com/vllm-project/vllm-ascend/issues/4930. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: drslark <slarksblood@qq.com>	2025-12-15 13:22:30 +08:00
zhangxinyuehfad	bfafe30953	[CI] refect e2e test (#4799 ) ### What this PR does / why we need it? This PR updates the CI configuration and adjusts a set of end-to-end (e2e) tests under tests/e2e/multicard, in order to refactor the test suite and ensure compatibility with current codebase and CI workflows. 1. tests/e2e/multicard/test_prefix_caching.py: change model to Qwen3-8B and rename the test case 2. tests/e2e/multicard/test_quantization.py: rename the test case 3. tests/e2e/multicard/test_qwen3_moe.py: remove duplicate test and rename test cases 4. tests/e2e/multicard/test_qwen3_next.py: rename test cases and change the W8A8 pruning model to the W8A8 model and remove the eager parameter 5. tests/e2e/multicard/test_shared_expert_dp.py: rename test case and remove the eager parameter 6. tests/e2e/multicard/test_single_request_aclgraph.py: rename test case and change Qwen3-30B to Qwen3-0.6B 7. tests/e2e/multicard/test_torchair_graph_mode.py: delete test cases about torchair - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-12-12 08:42:08 +08:00
SILONG ZENG	e56dba9b0d	[CI]cleanup e2e test (#4800 ) ### What this PR does / why we need it? This PR refactors the E2E multicard test suite to improve test case identification and maintainability. Specifically, it renames various test functions to be more descriptive (explicitly indicating model families like Qwen/DeepSeek and parallelism strategies like DP/TP/PP/EP) and cleans up outdated or redundant test configurations in the offline distributed inference tests. Key Changes: 1. Test Function Renaming (Standardization): Renamed multiple test functions across `tests/e2e/multicard/` to include clear suffixes/prefixes regarding the model and parallel strategy. This helps differentiate test cases in CI logs and prevents naming collisions. `test_aclgraph_capture_replay.py`: - `test_aclgraph_capture_replay_dp2` -> `test_aclgraph_capture_replay_metrics_dp2` `test_data_parallel.py`: - `test_data_parallel_inference` -> `test_qwen_inference_dp2` `test_data_parallel_tp2.py`: - `test_data_parallel_inference` -> `test_qwen_inference_dp2_tp2` `test_expert_parallel.py`: - `test_e2e_ep_correctness` -> `test_deepseek_correctness_ep` `test_external_launcher.py`: - `test_external_launcher` -> `test_qwen_external_launcher` - `test_moe_external_launcher` -> `test_qwen_moe_external_launcher_ep` - `test_external_launcher_and_sleepmode` -> `test_qwen_external_launcher_with_sleepmode` - `test_external_launcher_and_sleepmode_level2` -> `test_qwen_external_launcher_with_sleepmode_level2` - `test_mm_allreduce` -> `test_qwen_external_launcher_with_matmul_allreduce` `test_full_graph_mode.py`: - `test_models_distributed_Qwen3_MOE_TP2_WITH_FULL_DECODE_ONLY` -> `test_qwen_moe_with_full_decode_only` - `test_models_distributed_Qwen3_MOE_TP2_WITH_FULL` -> `test_qwen_moe_with_full` `test_fused_moe_allgather_ep.py`: - `test_generate_with_allgather `-> `test_deepseek_moe_fused_allgather_ep` - `test_generate_with_alltoall` -> `test_deepseek_moe_fused_alltoall_ep` `test_offline_weight_load.py`: - `test_offline_weight_load_and_sleepmode` -> `test_qwen_offline_weight_load_and_sleepmode` `test_pipeline_parallel.py`: - `test_models` -> `test_models_pp2` 2. Distributed Inference Cleanup (`test_offline_inference_distributed.py`): model list changes: ``` QWEN_DENSE_MODELS = [ - "vllm-ascend/Qwen3-8B-W8A8", "vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8" + "vllm-ascend/Qwen3-8B-W8A8", ] ``` ``` - QWEN_W4A8_OLD_VERSION_MODELS = [ - "vllm-ascend/Qwen3-8B-W4A8", - ] - QWEN_W4A8_NEW_VERSION_MODELS = [ - "vllm-ascend/DeepSeek-V3-W4A8-Pruing", - "vllm-ascend/DeepSeek-V3.1-W4A8-puring", - ] + DEEPSEEK_W4A8_MODELS = [ + "vllm-ascend/DeepSeek-V3.1-W4A8-puring", + ] ``` Test Function Changes: - removed `test_models_distributed_QwQ` - removed `test_models_distributed_Qwen3_W8A8` - removed `test_models_distributed_Qwen3_W4A8DYNAMIC_old_version` - `test_models_distributed_Qwen3_W4A8DYNAMIC_new_version` -> `test_models_distributed_Qwen3_W4A8DYNAMIC` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2025-12-11 20:35:32 +08:00
zhangyiming	66b0781840	[E2E] Refactor the e2e testcases. (#4789 ) ### What this PR does / why we need it? Refactor the e2e testcases. - tests/e2e/multicard/test_weight_loader.py: Remove the unused code. - tests/e2e/singlecard/multi-modal/test_internvl.py: Move to accuracy test. - tests/e2e/singlecard/test_aclgraph.py: Rename the file. - tests/e2e/singlecard/test_embedding_aclgraph.py : Combine with tests/e2e/singlecard/test_bge_model.py - tests/e2e/singlecard/test_completion_with_prompt_embeds.py: Delete eager mode and modify model to Qwen3-0.6B - tests/e2e/singlecard/test_quantization.py: Modify model to Qwen3-0.6B-W8A8 - tests/e2e/singlecard/test_vlm.py: Modify model to Qwen3-VL-8B - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: menogrey <1299267905@qq.com>	2025-12-11 10:15:00 +08:00
wangxiyuan	f917d5edcf	Remove useless env (#4858 ) cleanup useless env. These envs are not used anymore `VLLM_ASCEND_TRACE_RECOMPILES`, `VLLM_ASCEND_KV_CACHE_MEGABYTES_FLOATING_TOLERANCE`, `VLLM_ASCEND_MLA_PA`, `PHYSICAL_DEVICES` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-11 06:51:07 +08:00
drslark	0fb1dc43a1	[BugFix][main] Adapted Qwen3-Next-MTP to chunked prefill (#4770 ) ### What this PR does / why we need it? The pad `-1` modification is from https://github.com/vllm-project/vllm/pull/25743. It still has bugs for batched chunked prefill. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: drslark <slarksblood@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-10 22:54:24 +08:00
Ruri	ce5872705e	[Feat] Support native Kimi-K2-Thinking native W4A16 quantized experts weights (#4516 ) ### What this PR does / why we need it? Adds W4A16 quantization method for the Kimi-K2-Thinking model and updates relevant modules to support the new quantization method. - Implements complete W4A16 quantization method including weight packing/unpacking, per-group quantization parameter generation, post-processing logic and MoE method application. - Adds parameters `use_int4_w4a16`, `w1_offset` and `w2_offset`, adjusts `with_quant` conditional logic to support W4A16 matrix multiplication. - Adds `packed_modules_model_mapping` for Kimi-K2-Thinking model and processing logic for `weight_packed` field. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zhoux77899 <zhouxiang100@huawei.com> Signed-off-by: Ruri <33858552+zhoux77899@users.noreply.github.com> Signed-off-by: Ruri <zhouxiang100@huawei.com>	2025-12-10 15:58:52 +08:00
Yizhou	134e011896	[Test] Temporarily skips Qwen3-30B-A3B-W8A8 data parallel test case (#4857 ) ### What this PR does / why we need it? This case is breaking CI, please see https://github.com/vllm-project/vllm-ascend/actions/runs/20084930558/job/57620266368?pr=4854 ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-12-10 11:05:32 +08:00
wangxiyuan	835b4c8f1d	Drop torchair (#4814 ) aclgraph is stable and fast now. Let's drop torchair graph mode now. TODO: some logic to adapt torchair should be cleaned up as well. We'll do it in the following PR. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-10 09:20:40 +08:00
Yizhou	cd1c69ee0b	[Fix] Add extra warmup run count for MC2 on specific SoC version (#4843 ) ### What this PR does / why we need it? We didn’t account for this earlier because we didn’t have A3 in CI, but now that we do, this test case needs a few extra tweaks — please take a look at `profile_run`. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-09 21:37:38 +08:00
lhp-deep	b230e7e987	[MOE]move weight transpose to wakeup for RL secnarios (#4626 ) ### What this PR does / why we need it? In reinforcement learning scenarios, the current inference applies a transpose operation to the weights. For a cleaner architecture, the weight transpose module was moved to wakeup. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: lhp-deep <liuhaopeng1@huawei.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-08 20:34:52 +08:00
LeeWenquan	38bd95229f	[Model] Add qwen3Next support in Main (#4596 ) ### What this PR does / why we need it? Add Qwen3Next support in main ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2025-12-03 14:17:37 +08:00
wangxiyuan	981a14f8d5	[CI]enable chunked prefill by default (#4569 ) set `enable_chunked_prefill` to True for e2e test by default to keep the same behavior with vLLM - vLLM version: v0.11.2 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-02 08:54:34 +08:00
Wang Kunpeng	a9c4b8604a	[main][bugfix] bugfix for qwen3 moe quantization (#4599 ) ### What this PR does / why we need it? Fix the issue where the qwen3 moe service cannot be started due to upgrading the vllm version Error info: AttributeError: 'AscendFusedMoE' object has no attribute 'use dp chunking' ### Does this PR introduce _any_ user-facing change? no - vLLM version: v0.11.2 --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-12-01 23:48:57 +08:00
MengLong Chen	143e1f46d0	[Feat] shared expert dp for deepseek_mtp (#3811 ) ### What this PR does / why we need it? Support shared expert DP for deepseek_mtp feature. `shared_expert_dp` requires `SP==True`, with corresponding parameter restrictions. Previously, due to the coupling between `shared_expert_dp` and torchair, and the removal of `deepseek_mtp` in vllm_ascend, shared expert dp of deepseek_mtp was temporarily removed. Currently, by performing the `reduce_scatter` on the input of deepssek_mtp in `mtp_proposer.py`, we ensure that it matches the dimensions of `input_embedding`, and then perform the `all_gather` on the output of mtp. ### How was this patch tested? baseline: <img width="1184" height="692" alt="image" src="https://github.com/user-attachments/assets/9680d53a-7b1d-481a-accc-b8f3dae2b9e3" /> enable shared_expert_dp and multistream_overlap_shared_expert: <img width="1167" height="687" alt="image" src="https://github.com/user-attachments/assets/2531d06b-dfda-4e24-8628-6f4b0f677ddc" /> TPOT: 48ms -> 45.4ms Average TPS per rank: 117.6 -> 126.1 - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: chenmenglong <chenmenglong1@huawei.com> Signed-off-by: zengran <zengran2@huawei.com> Co-authored-by: zengran <zengran2@huawei.com>	2025-12-01 20:44:11 +08:00
wangxiyuan	27b09ca9b9	[CI] drop ascend scheduler test (#4582 ) let' drop ascend scheduler test first to ensure all function works without it. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-01 20:33:50 +08:00
wangxiyuan	0d14f635b4	upgrade torch npu version (#4433 ) vLLM graph feature now rely on torch >=2.8. To make graph mode work, we need upgrade torch version as well. For long term support, upgrade torch to a newer one is good to go as well. Related vLLM change: https://github.com/vllm-project/vllm/pull/25110 - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2	2025-12-01 19:01:55 +08:00
Mengqing Cao	517fd9272d	Revert "drop ascend scheduler" (#4580 ) Reverts vllm-project/vllm-ascend#4498 - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2	2025-11-29 22:20:48 +08:00

1 2 3

141 Commits