xc-llm-ascend

Author	SHA1	Message	Date
dsxsteven	30778f371b	[BugFix] Fix num_pcp_pads Assignment Issues (#5273 ) ### What this PR does / why we need it? The variable `self.num_pcp_pads` was incorrectly truncated during assignment, causing errors in certain scenarios such as PD disaggregated. This issue has now been resolved. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Co-author by: QiuChunshuo <qiuchunshuo@huawei.com> - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: daishixun <dsxsteven@sina.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-25 10:38:09 +08:00
wjunLu	fca2f948c1	[E2E Refactor] Enable skipped e2e case (#5287 ) ### What this PR does / why we need it? The test case `tests/e2e/multicard/test_data_parallel.py` was skipped due to the errors encountered during migration from Ascend A2 to A3, the details are as follows ``` (EngineCore_DP0 pid=17833) RuntimeError: npu_moe_distribute_dispatch_v2:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:161 NPU function error: call aclnnMoeDistributeDispatchV3 failed, error code is 561002 (EngineCore_DP0 pid=17833) [ERROR] 2025-12-23-07:36:19 (PID:17833, Device:0, RankID:-1) ERR00100 PTA call acl api failed. (EngineCore_DP0 pid=17833) EZ9999: Inner Error! (EngineCore_DP0 pid=17833) EZ9999[PID: 17833] 2025-12-23-07:36:19.237.396 (EZ9999): HCCL_BUFFSIZE is too SMALL, maxBs = 512, h = 2048, epWorldSize = 2, localMoeExpertNum = 64, sharedExpertNum = 0, tokenNeedSizeDispatch = 4608, tokenNeedSizeCombine = 4096, k = 8, NEEDED_HCCL_BUFFSIZE(((maxBs * tokenNeedSizeDispatch * ep_worldsize * localMoeExpertNum) + (maxBs * tokenNeedSizeCombine * (k + sharedExpertNum))) * 2) = 609MB, HCCL_BUFFSIZE=200MB.[FUNC:MoeDistributeDispatchA3TilingFuncImpl][FILE:moe_distribute_dispatch_v2_tiling.cc][LINE:941] (EngineCore_DP0 pid=17833) TraceBack (most recent call last): (EngineCore_DP0 pid=17833) MoeDistributeDispatchV2 do tiling failed, ret is -1. (EngineCore_DP0 pid=17833) Check NnopbaseExecutorDoTiling(executor) failed (EngineCore_DP0 pid=17833) Check NnopbaseExecutorTilingAndUpdateBinInfo(executor) failed (EngineCore_DP0 pid=17833) Check NnopbaseExecutorMatchCache(executor) failed (EngineCore_DP0 pid=17833) Check NnopbaseRunForWorkspace(*executor, workspaceSize) failed ``` ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? 
After fixed, I ran `pytest -sv --durations=0 tests/e2e/multicard/test_data_parallel.py`, and the result looks good ``` ========================================================================================= warnings summary ========================================================================================= <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ======================================================================================== slowest durations ========================================================================================= 112.69s call tests/e2e/multicard/test_data_parallel.py::test_qwen_inference_dp2[32-vllm-ascend/Qwen3-30B-A3B-W8A8] 88.11s call tests/e2e/multicard/test_data_parallel.py::test_qwen_inference_dp2[32-Qwen/Qwen3-30B-A3B] 70.06s call tests/e2e/multicard/test_data_parallel.py::test_qwen_inference_dp2[32-Qwen/Qwen3-0.6B] (6 durations < 0.005s hidden. Use -vv to show these durations.) ============================================================================ 3 passed, 2 warnings in 270.88s (0:04:30) ============================================================================ ``` - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2025-12-25 09:18:05 +08:00
Magnus	a9fccbeb30	[CI] add xlite e2e test (#5305 ) ### What this PR does / why we need it? add xlite e2e test - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: DaweiChang <405739598@qq.com>	2025-12-25 09:17:06 +08:00
Aoxuan Chen	6d25372baa	Add MagicMTP(block verify) and Triton optimization (#4443 ) ### What this PR does / why we need it? 1. MagicMTP (paper: "Block Verification Accelerates Speculative Decoding") was introduced to consider the influence among multiple draft tokens, improving the acceptance rate without compromising accuracy. 2. The rejection sampling logic in rejection_sampler.py was restructured using Triton-Ascend, enabling it to operate under high concurrency, thus resolving CPU and NPU operator bottlenecks and enhancing throughput. ### Does this PR introduce _any_ user-facing change? MagicMTP will automatically take effect when the parameter "num_speculative_tokens" >= 3. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: chenaoxuan <cax1165@163.com>	2025-12-25 09:00:25 +08:00
Ascendyh	a90482803d	[Kernel] add l2norm triton kernel (#4595 ) ### What this PR does / why we need it? This pull request introduces an L2 normalization kernel implemented in Triton, specifically optimized for Ascend NPUs. ### Does this PR introduce _any_ user-facing change? No, this PR does not introduce any user-facing changes. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: Ascendyh <hw7osiris@outlook.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-25 06:06:18 +08:00
Mengqing Cao	e54630e01c	Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138 ) (#5317 ) ### What this PR does / why we need it? Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138) as it causes deepseek v3.2 hang error - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-12-24 22:24:17 +08:00
wangxiyuan	fb3d6ca08c	Cleanup useless env (#5270 ) `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` is not used anywhere, let's remove it. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-24 22:07:59 +08:00
TmacAaron	5018f2d8fd	[quantization] Add w8a16 quantization support (#4541 ) ### What this PR does / why we need it? related to https://github.com/vllm-project/vllm-ascend/issues/4267 ### Does this PR introduce _any_ user-facing change? support w8a16 quantization now ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` ### Test tested using [aisbench](https://gitee.com/aisbench/benchmark/) with tp2 #### Precision \| ceval \| mmlu \| gsm8k -- \| -- \| -- \| -- bf16 \| 90.46 \| 89.17 \| 96.21 w8a16 \| 89.51 \| 89.29 \| 95.98 #### Performance \| input_len \| output_len \| concurrency \| TTFT (ms) \| TPOT (ms) \| TPS (Total) (tokens/s) -- \| -- \| -- \| -- \| -- \| -- \| -- bf16 \| 2048 \| 2048 \| 10 \| 1911.7136 \| 77.988 \| 253.9866 w8a16 \| 2048 \| 2048 \| 10 \| 2128.6334 \| 67.1633 \| 293.9117 bf16 \| 3500 \| 1024 \| 10 \| 3076.2509 \| 84.3525 \| 506.949 w8a16 \| 3500 \| 1024 \| 10 \| 2685.2031 \| 73.015 \| 585.4717 --------- Signed-off-by: yyt <yangyit139@gmail.com> Signed-off-by: TmacAaron <yangyit139@gmail.com> Co-authored-by: realliujiaxu <realliujiaxu@163.com>	2025-12-24 19:49:32 +08:00
zhangyiming	74a1de50a9	[E2E] Optimize e2e test. (#5091 ) ### What this PR does / why we need it? [E2E] Optimize e2e test. - Remove the test_basic_camem testcase. - Change Qwen2.5-0.5B-Instruct-W8A8 to Qwen3-0.6B-W8A8 - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: menogrey <1299267905@qq.com>	2025-12-24 10:41:55 +08:00
zhangyiming	bd4fb871c6	[CI] Add skipped testcases. (#5254 ) ### What this PR does / why we need it? Some E2E testcases are not in our CI workflow, this PR add them back. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: menogrey <1299267905@qq.com>	2025-12-24 10:41:32 +08:00
Nengjun Ma	3b59f20a28	update to vllm 12-19 (#5223 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? Fix vllm break: 1. [Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4% TTFT improvement] (https://github.com/vllm-project/vllm/pull/29558) Fix Solution: Add the now-necessary `all2all_backend` parameter. The impact of this parameter on the original `set_splitting_ops_for_v1` implementation is only that graph mode is disabled in `vllm` if `deepep_high_throughput` is enabled; it has no effect on the `vllm-ascend` logic. 2.[Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface ] (https://github.com/vllm-project/vllm/pull/30684) Fix Solution: The reason why the GPU does not need to convert qkv to 3D is that the GPU's flash_attention operator is compatible with 3D and 4D (b s h d and s b ( h d)), but the NPU's flash_attention_unpad operator only supports 3D (s b ( h d)). Therefore, we need to introduce the reshape_qkv_to_3d operation. 4.Skip Tencent-Hunyuan/HunyuanOCR test case, as it has following issue in upgrade vllm code: https://github.com/vllm-project/vllm-ascend/issues/5297 ### How was this patch tested? Co-authored-by: zxwang <1476209578@qq.com> - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: leo-pony <nengjunma@outlook.com> Signed-off-by: zxwang <1476209578@qq.com> Co-authored-by: zxwang <1476209578@qq.com>	2025-12-23 23:52:11 +08:00
zhangxinyuehfad	8ae7fca947	[CI] Refactor e2e CI tests (#5246 ) ### What this PR does / why we need it? Refactor e2e CI tests: 1. tests/e2e/singlecard/pooling/test_embedding.py: remove the eager parameter and rename test case 2. tests/e2e/singlecard/pooling/test_scoring.py: rename test cases 3. tests/e2e/singlecard/pooling/test_classification.py: rename test case 4. tests/e2e/singlecard/test_quantization.py: remove the eager parameter, change model to vllm-ascend/Qwen2.5-0.6B-W8A8, and rename test case 5. tests/e2e/multicard/test_shared_expert_dp.py: rename test cases 6. tests/e2e/singlecard/test_sampler.py: rename test cases 7. tests/e2e/singlecard/test_aclgraph_accuracy.py: rename test cases 8. tests/e2e/multicard/test_offline_inference_distributed.py: rename test cases and remove the eager parameter 9. tests/e2e/multicard/long_sequence/test_accuracy.py: rename test cases and remove the eager parameter 10. tests/e2e/multicard/long_sequence/test_basic.py: rename test cases and remove the eager parameter 11. tests/e2e/multicard/test_expert_parallel.py: remove the eager parameter 12. tests/e2e/multicard/test_full_graph_mode.py: remove the eager parameter 13. tests/e2e/multicard/test_ilama_lora_tp2.py: remove the eager parameter 14. tests/e2e/singlecard/spec_decode_v1/test_v1_mtp_correctness.py: remove the eager parameter 15. tests/e2e/singlecard/spec_decode_v1/test_v1_spec_decode.py: remove the eager parameter 16. tests/e2e/singlecard/test_aclgraph_accuracy.py: remove the eager parameter 17. tests/e2e/singlecard/test_camem.py: remove the eager parameter 18. tests/e2e/singlecard/test_ilama_lora.py: remove the eager parameter 19. tests/e2e/singlecard/test_multistream_overlap_shared_expert.py: remove the eager parameter 20. tests/e2e/singlecard/test_vlm.py: remove the eager parameter 21. tests/e2e/singlecard/test_xli: remove the eager parameter ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? 
- vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-12-23 18:42:35 +08:00
Li Wang	5d1f6daef6	[CI] Mock spawn for vlm tests (#5279 ) ### What this PR does / why we need it? Using `spawn` in continuous testing scenarios ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-23 18:35:06 +08:00
SILONG ZENG	fa0c212bfa	[test]Corrected the Qwen3-Omni-30B-A3B-Instruct accuracy test configuration in nightly tests. (#5195 ) ### What this PR does / why we need it? Corrected the Qwen3-Omni-30B-A3B-Instruct accuracy test configuration in nightly tests. link: https://github.com/vllm-project/vllm-ascend/pull/4911 - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	2025-12-23 14:17:27 +08:00
SILONG ZENG	29a93daa82	[CI]refactor: standardize test case naming convention (#5243 ) ### What this PR does / why we need it? - Standardize test case naming in `vllm-ascend/tests/e2e/multicard/` to follow the `<model>_<feature>_<distributed>` convention. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	2025-12-23 14:13:42 +08:00
meihanc	592cfb6a6f	[CI] Add Triton Ascend in CI (#4921 ) Add triton-ascend in UT and e2e - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2025-12-23 12:47:35 +08:00
LI SHENGYONG	2e010e12dd	[EPLB][CI] Add dynamic EPLB CI for qwen3-moe (#5179 ) ### What this PR does / why we need it? Add dynamic EPLB CI for qwen3-moe-30B-W8A8 - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2025-12-23 11:31:00 +08:00
Mengqing Cao	449f8f65a7	[KV-Sharing] Support KV-Sharing feature in CLA models (#4138 ) ### What this PR does / why we need it? Support the KV-Sharing feature in CLA (cross-layer attention) models, which share the KV cache across some layers. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-12-23 10:48:31 +08:00
Li Wang	9a79cbaecb	[ModelRunner] Add hunyuan-vl basic support (#5151 ) ### What this PR does / why we need it? This patch adds handling of `XDRotaryEmbedding` in the model runner to support `hunyuan-vl`. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? CI passed with added/existing tests. Closes: https://github.com/vllm-project/vllm-ascend/issues/4992 - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-23 10:46:54 +08:00
Wang Kunpeng	c3a8d13ca7	[refactor] Remove unnecessary attributes from set_ascend_forward_context (#5204 ) ### What this PR does / why we need it? Remove unnecessary attributes from set_ascend_forward_context 1.prefetch_stream 2.weight_prefetch_method ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-12-23 08:49:52 +08:00
jiangyunfan1	3ba920a65b	[TEST]Update mm param --mm-processor-cache-gb (#5242 ) ### What this PR does / why we need it? This PR updates the mm param --mm-processor-cache-gb, we need it to run the case ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running the test - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>	2025-12-22 18:54:03 +08:00
zhangsicheng5	78aa7f2693	[feature] support pcp + mtp in full graph (#4572 ) 1. support pcp + mtp in full graph 2. pcp/dcp related mtp bugfix 3. support pcp + mtpx - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>	2025-12-22 16:13:39 +08:00
YuhanBai	5d02eed16f	[Performance] Add async exponential while model executing (#4501 ) ### What this PR does / why we need it? Add a switch that lets the exponential distribution operator overlap with model execution (off by default, because this feature might not perform well on MoE models, e.g. Qwen3-30B). Enabling async exponential overlapping provides a performance improvement. Overlapping the exponential operator with model execution can also hide the performance drop introduced by the AI-CPU version of the exponential operator. UPDATE (12/12): The overlap now uses the same stream that was introduced in #4908. `do_async_exponential` has moved from `model_runner_v1.py` to `sampler.py`. Async exponential is now enabled through `additional_config`: add `"enable_async_exponential": 1` to `additional_config`. Only the default exponential/AI-CPU exponential is supported; the old `"enable_async_exponential": 2` option has been removed for consistency. ### Does this PR introduce _any_ user-facing change? YES, added a new `additional_config` option: `"enable_async_exponential": 1`. When `enable_async_exponential` is set to 1, async exponential is enabled and overlaps with the model runner. When `enable_async_exponential` is set to 0 (the default), async exponential is disabled, but the exponential operator still runs on a different stream, using the stream introduced in #4908. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: YuhanBai <yuhan.bai0830@gmail.com> Signed-off-by: YuhanBai yuhan.bai0830@gmail.com	2025-12-20 21:23:21 +08:00
weiguihua2	21745221a3	[lint]clean code (#5218 ) ### What this PR does / why we need it? Fix lint error introduced by https://github.com/vllm-project/vllm-ascend/pull/5141 - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-12-20 18:24:04 +08:00
weiguihua2	74aa968a9f	[e2e] add pcp e2e (#5141 ) ### What this PR does / why we need it? add pcp accuracy e2e test case - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-12-20 16:56:46 +08:00
Mengqing Cao	5d59bf8ca0	[CI] unblock CI on suffix spec decoding (#4813 ) ### What this PR does / why we need it? unblock CI on suffix spec decoding ### How was this patch tested? CI passed with existing test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-12-20 14:54:49 +08:00
Li Wang	243ab7d720	[CI] Use offline mode for nightly test (#5187 ) ### What this PR does / why we need it? For single node test, the lack of a retry mechanism for accessing ModelScope resulted in an HTTP 400 error sometimes. I recommend using a local offline cache instead. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-19 21:21:42 +08:00
XiaoxinWang	0cc3fc357f	[pref] qwen3_next add triton ops : fused_sigmoid_gating_delta_rule_update (#4818 ) ### What this PR does / why we need it? qwen3_next add fused_sigmoid_gating_delta_rule_update op which fused fused_gdn_gating+fused_recurrent_gated_delta_rule - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-12-19 16:34:11 +08:00
wangqiankun13	118b0ed346	[Feature] Add token mask for DispatchGmmCombineDecode operator (#5171 ) ### What this PR does / why we need it? In this PR, DispatchGmmCombineDecode gains an optional input `x_active_mask`; when it is provided, only tokens whose mask value is True are dispatched and handled. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangqiankun <wangqiankun13@huawei.com>	2025-12-19 16:31:48 +08:00
LookAround0301	76e58d66be	support basic long_seq feature st (#5140 ) ### What this PR does / why we need it? support basic long_seq feature st - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: LookAround <lixushi@huawei.com>	2025-12-19 10:50:01 +08:00
zhaomingyu13	73e4b4f496	[BugFix] Fix top_p,top_k issue with EAGLE and add top_p,top_k in EAGLE e2e (#5131 ) ### What this PR does / why we need it? Add top_p,top_k in EAGLE e2e - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>	2025-12-18 23:07:14 +08:00
zxr2333	073a3a6e6c	[Doc][P/D] Fix MooncakeConnector's name (#5172 ) ### What this PR does / why we need it? vLLM community has integrated their MooncakeConnector. The original scripts will now find this MooncakeConnector instead of the one from vLLM-Ascend. All scripts that involve using the MooncakeConnector need to be modified to another name. ### Does this PR introduce _any_ user-facing change? Yes, users need to use a new name to load vLLM-Ascend MooncakeConnector. ### How was this patch tested? By CI. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2025-12-18 22:29:19 +08:00
ZT-AIA	6cb76ecd02	[Nightly] Avoid max_model_len being smaller than the decoder prompt to prevent single-node-accuray-tests from failing (#5174 ) ### What this PR does / why we need it? [Nightly] Avoid max_model_len being smaller than the decoder prompt to prevent single-node-accuray-tests from failing ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: ZT-AIA <1028681969@qq.com> Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>	2025-12-18 22:25:45 +08:00
Angazenn	acc3578f58	[Graph][Fusion]Add new pattern for AddRmsnormQuant with SP. (#5077 ) ### What this PR does / why we need it? 1. In addition to [#4168](https://github.com/vllm-project/vllm-ascend/pull/4168), [#5011](https://github.com/vllm-project/vllm-ascend/pull/5011)， this PR adds two more pattern for AddRmsnormQuant with SP enabled. The key difference is to insert an additional `maybe_all_gather_and_maybe_unpad` between `addrmsnorm` and `quantize`. 2. This PR also introduce another api `torch.ops.vllm.quantize`, so that we pass `input_scale` and `input_scale_reciprocal` at the same time. This is because `npu_add_rms_norm_quant` and `npu_quantize` requires different `div_mode`. To avoid introducing additional reciprocal calculation in runtime, we have to pass both of them to quantize api. 3. Removes redundant `AscendQuantRmsnorm`. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Angazenn <supperccell@163.com>	2025-12-18 20:25:44 +08:00
ck-hw-1018	71e544e259	[test] add w4a8 accuracy case (#5110 ) ### What this PR does / why we need it? This PR add w4a8 accuracy testcase for e2e test ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the test - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: cuikai (C) <c00827167@china.huawei.com> Co-authored-by: cuikai (C) <c00827167@china.huawei.com>	2025-12-18 14:10:14 +08:00
panchao-hub	8069442b41	enable npugraph_ex (#5120 ) ### What this PR does / why we need it? We will expose the enabling switch for npugraph_ex to better facilitate subsequent optimization. ### Does this PR introduce _any_ user-facing change? Previously, the enable_npugraph_ex switch would trigger an error; now we have removed the error reporting mechanism to better facilitate subsequent optimization efforts. Basic functionalities are available in CANN and torch_npu for Q3, while advanced optimizations will depend on the Q4 release. ### How was this patch tested? `llm = LLM(model=model, enforce_eager=False, additional_config={"enable_npugraph_ex": True}, compilation_config={"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [16]})` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: p00465316 <panchao13@huawei.com> Co-authored-by: p00465316 <panchao13@huawei.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-18 09:08:40 +08:00
shaopeng-666	39bdd4cfaa	fix profile run for vl model (#5136 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>	2025-12-17 23:51:31 +08:00
zzzzwwjj	06b82e7503	[main] rename device type (#5099 ) ### What this PR does / why we need it? Rename `_910B` to `A2`; Rename `_910_93` to `A3`; Rename `_910_95` to `A5`; - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-12-17 14:08:19 +08:00
Icey	cadfa5ddc1	[Fusion] [Graph] Add qknorm rope fusion operator (#4711 ) ### What this PR does / why we need it? This PR add `qkv_rmsnorm_rope` operator and introduces a graph fusion pass for `qknorm_rope` operations. The implementation includes a new configuration flag, a pattern matching pass using `torch._inductor.pattern_matcher`, and a custom Triton kernel for the fused operation. Co-authored-by: Angazenn [supperccell@163.com](mailto:supperccell@163.com) ### Does this PR introduce _any_ user-facing change? Yes, add new additional_config - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2025-12-17 08:53:44 +08:00
ZixuanWang	b1a853b0f6	Upgrade vllm commit hash to 1216 (#5053 ) ### What this PR does / why we need it? Upstream vLLM PR #30212 https://github.com/vllm-project/vllm/pull/30212 refactored the attention backend selection interface, This PR adapts vllm-ascend's get_attn_backend_cls to align with the new upstream standard, ensuring compatibility and reducing maintenance overhead. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? co-author:[leo-pony][nengjunma@outlook.com](mailto:nengjunma@outlook.com) - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zxwang <1476209578@qq.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: leo-pony <nengjunma@outlook.com>	2025-12-17 08:48:36 +08:00
anon189Ty	5b1da4e914	[Feat] Support async_scheduler and disable_padded_drafter_batch in eagle (#4893 ) ### What this PR does / why we need it? We refactored the eagle_proposer.py to adapt the framework of eagle.py in vllm-v0.12.0, to support the logit of padded drafter batch and async-scheduler. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: anon189Ty <Stari_Falcon@outlook.com> Co-authored-by: drslark <slarksblood@qq.com>	2025-12-16 22:06:40 +08:00
realliujiaxu	9e24bdd44c	[Feat] Refactor rejection sampler (#4975 ) ### What this PR does / why we need it? Currently, spec decoding uses `AscendRejectionSampler`, which extends `RejectionSampler`. `AscendRejectionSampler` overrides `forward` of `RejectionSampler` only to replace the `rejection_sample` function. As a result, much of the `RejectionSampler` code cannot be reused, for example: - https://github.com/vllm-project/vllm/pull/19482 - https://github.com/vllm-project/vllm/pull/26060 - https://github.com/vllm-project/vllm/pull/29223 #### Proposed Change: - Delete `AscendRejectionSampler` and use `RejectionSampler` directly in the model runner. - Patch `RejectionSampler.expand_batch_to_tokens` and `RejectionSampler.rejection_sample`; a better way may be to make them custom ops. - Modify `NPUModelRunner` following https://github.com/vllm-project/vllm/pull/26060 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - [x] test logits processor for spec decoding - [x] test logprobs for spec decoding - [x] test logprobs for spec decoding + async scheduling (test with https://github.com/vllm-project/vllm-ascend/pull/4893/) - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-12-16 11:32:26 +08:00
SILONG ZENG	237fad635c	[Fix]Revert temporary skip on mtp1/mtp2 correctness tests (aclgraph fix) (#5039 ) ### What this PR does / why we need it? This Pull Request removes the @pytest.mark.skip decorators from test_mtp1_correctness_piecewise_graph and test_mtp2_correctness_piecewise_graph. These tests were temporarily skipped because of an issue with the MTP ACL Graph (as per the original TODO comment). Since the relevant bug/issue has been resolved, these tests are now re-enabled to ensure full correctness coverage for MTP functionality. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: MrZ20 <2609716663@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-16 10:40:00 +08:00
Icey	5fae65f3a8	[Graph][Fusion] Add AddRMSNorm(with bias) and Quant Fusion Pattern (#5011 ) ### What this PR does / why we need it? AddRMSNorm(with bias) and Quant Fusion Pattern ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2025-12-15 18:37:56 +08:00
SILONG ZENG	70606e0bb9	[Test]update accuracy test of models (#4911 ) ### What this PR does / why we need it? Delete accuracy tests for models that are no longer retained： - Meta-Llama-3.1-8B-Instruct - llava-1.5-7b-hf - InternVL2-8B.yaml - InternVL2_5-8B.yaml - InternVL3-8B.yaml Add accuracy tests for the new models： - Llama-3.2-3B-Instruct - llava-onevision-qwen2-0.5b-ov-hf - Qwen3-VL-30B-A3B-Instruct - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2025-12-15 15:04:20 +08:00
drslark	8fb0ef5ffa	[main][BugFix] Fixed an accuracy bug of Qwen3-next-MTP when batched inferring (#4932 ) ### What this PR does / why we need it? Fixes an accuracy bug in Qwen3-next-MTP during batched inference. It is described in https://github.com/vllm-project/vllm-ascend/issues/4930. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: drslark <slarksblood@qq.com>	2025-12-15 13:22:30 +08:00
Li Wang	0f92d34a70	[CI] Pull latest vllm-ascend src before tests (#4988 ) ### What this PR does / why we need it? Currently, our image build suffers from errors during cross-compilation, which causing the image to fail to build sometimes(see https://github.com/vllm-project/vllm-ascend/actions/runs/20152861650/job/57849208186). This results in the nightly test code not being the latest version. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-13 19:04:14 +08:00
zhenwenqi2024	4721e4f53f	[bugfix] asyncscheduler bug fix (#4968 ) ### What this PR does / why we need it? vllm-ascend now uses AsyncGPUModelRunnerOutput. The previous AsyncNPUModelRunnerOutput is outdated, so it needs to be fixed. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>	2025-12-13 17:04:54 +08:00
Li Wang	5b12c068f9	[Nightly] Remove gen_ranktable logic (#4941 ) ### What this PR does / why we need it? Since the `llmdatadist` has sunset, the logic gen_ranktable should also be removed - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-12 17:20:18 +08:00
zhangxinyuehfad	bfafe30953	[CI] Refactor e2e test (#4799 ) ### What this PR does / why we need it? This PR updates the CI configuration and adjusts a set of end-to-end (e2e) tests under tests/e2e/multicard, in order to refactor the test suite and ensure compatibility with the current codebase and CI workflows. 1. tests/e2e/multicard/test_prefix_caching.py: change model to Qwen3-8B and rename the test case 2. tests/e2e/multicard/test_quantization.py: rename the test case 3. tests/e2e/multicard/test_qwen3_moe.py: remove duplicate test and rename test cases 4. tests/e2e/multicard/test_qwen3_next.py: rename test cases, change the W8A8 pruning model to the W8A8 model, and remove the eager parameter 5. tests/e2e/multicard/test_shared_expert_dp.py: rename test case and remove the eager parameter 6. tests/e2e/multicard/test_single_request_aclgraph.py: rename test case and change Qwen3-30B to Qwen3-0.6B 7. tests/e2e/multicard/test_torchair_graph_mode.py: delete test cases about torchair - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-12-12 08:42:08 +08:00


650 Commits
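
The `HCCL_BUFFSIZE is too SMALL` error quoted in commit fca2f948c1 prints the formula it uses for `NEEDED_HCCL_BUFFSIZE`. The sketch below replays that arithmetic with the values from the log. It assumes the formula yields bytes and that the reported figure is in units of 1024 * 1024 bytes; the function name and parameter names are illustrative and are not part of any vLLM or CANN API.

```python
# Replay of the NEEDED_HCCL_BUFFSIZE formula printed in the error log.
# Assumption: the formula's result is in bytes (the log does not say so).

MIB = 1024 * 1024


def needed_hccl_buffsize_bytes(
    max_bs: int,
    token_need_size_dispatch: int,
    ep_world_size: int,
    local_moe_expert_num: int,
    token_need_size_combine: int,
    k: int,
    shared_expert_num: int,
) -> int:
    """Hypothetical helper mirroring the formula text from the log."""
    dispatch = (
        max_bs * token_need_size_dispatch * ep_world_size * local_moe_expert_num
    )
    combine = max_bs * token_need_size_combine * (k + shared_expert_num)
    return (dispatch + combine) * 2


# Values copied from the log line in commit fca2f948c1.
needed_bytes = needed_hccl_buffsize_bytes(
    max_bs=512,
    token_need_size_dispatch=4608,
    ep_world_size=2,
    local_moe_expert_num=64,
    token_need_size_combine=4096,
    k=8,
    shared_expert_num=0,
)
needed_mib = needed_bytes / MIB
configured_mib = 200  # HCCL_BUFFSIZE=200MB in the log

print(f"needed: {needed_mib:.0f} MiB, configured: {configured_mib} MiB")
print("buffer too small" if needed_mib > configured_mib else "buffer sufficient")
```

Under these assumptions the formula gives 608 MiB, one less than the 609MB the log reports, so the operator presumably rounds up or adds a small margin that is not visible in the message. Either way, the requirement is roughly three times the configured 200MB, which matches the failure described in the commit.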