xc-llm-ascend

Author	SHA1	Message	Date
wjunLu	fca2f948c1	[E2E Refactor] Enable skipped e2e case (#5287 ) ### What this PR does / why we need it? The test case `tests/e2e/multicard/test_data_parallel.py` was skipped due to the errors encountered during migration from Ascend A2 to A3, the details are as follows ``` (EngineCore_DP0 pid=17833) RuntimeError: npu_moe_distribute_dispatch_v2:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:161 NPU function error: call aclnnMoeDistributeDispatchV3 failed, error code is 561002 (EngineCore_DP0 pid=17833) [ERROR] 2025-12-23-07:36:19 (PID:17833, Device:0, RankID:-1) ERR00100 PTA call acl api failed. (EngineCore_DP0 pid=17833) EZ9999: Inner Error! (EngineCore_DP0 pid=17833) EZ9999[PID: 17833] 2025-12-23-07:36:19.237.396 (EZ9999): HCCL_BUFFSIZE is too SMALL, maxBs = 512, h = 2048, epWorldSize = 2, localMoeExpertNum = 64, sharedExpertNum = 0, tokenNeedSizeDispatch = 4608, tokenNeedSizeCombine = 4096, k = 8, NEEDED_HCCL_BUFFSIZE(((maxBs * tokenNeedSizeDispatch * ep_worldsize * localMoeExpertNum) + (maxBs * tokenNeedSizeCombine * (k + sharedExpertNum))) * 2) = 609MB, HCCL_BUFFSIZE=200MB.[FUNC:MoeDistributeDispatchA3TilingFuncImpl][FILE:moe_distribute_dispatch_v2_tiling.cc][LINE:941] (EngineCore_DP0 pid=17833) TraceBack (most recent call last): (EngineCore_DP0 pid=17833) MoeDistributeDispatchV2 do tiling failed, ret is -1. (EngineCore_DP0 pid=17833) Check NnopbaseExecutorDoTiling(executor) failed (EngineCore_DP0 pid=17833) Check NnopbaseExecutorTilingAndUpdateBinInfo(executor) failed (EngineCore_DP0 pid=17833) Check NnopbaseExecutorMatchCache(executor) failed (EngineCore_DP0 pid=17833) Check NnopbaseRunForWorkspace(*executor, workspaceSize) failed ``` ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? After fixed, I ran `pytest -sv --durations=0 tests/e2e/multicard/test_data_parallel.py`, and the result looks good ``` ========================================================================================= warnings summary ========================================================================================= <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ======================================================================================== slowest durations ========================================================================================= 112.69s call tests/e2e/multicard/test_data_parallel.py::test_qwen_inference_dp2[32-vllm-ascend/Qwen3-30B-A3B-W8A8] 88.11s call tests/e2e/multicard/test_data_parallel.py::test_qwen_inference_dp2[32-Qwen/Qwen3-30B-A3B] 70.06s call tests/e2e/multicard/test_data_parallel.py::test_qwen_inference_dp2[32-Qwen/Qwen3-0.6B] (6 durations < 0.005s hidden. Use -vv to show these durations.) ============================================================================ 3 passed, 2 warnings in 270.88s (0:04:30) ============================================================================ ``` - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2025-12-25 09:18:05 +08:00
Magnus	a9fccbeb30	[CI] add xlite e2e test (#5305 ) ### What this PR does / why we need it? add xlite e2e test - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: DaweiChang <405739598@qq.com>	2025-12-25 09:17:06 +08:00
Aoxuan Chen	6d25372baa	Add MagicMTP(block verify) and Triton optimization (#4443 ) ### What this PR does / why we need it? 1. MagicMTP (paper: "Block Verification Accelerates Speculative Decoding") was introduced to consider the influence among multiple draft tokens, improving the acceptance rate without compromising accuracy. 2. The rejection sampling logic in rejection_sampler.py was restructured using Triton-Ascend, enabling it to operate under high concurrency, thus resolving CPU and NPU operator bottlenecks and enhancing throughput. ### Does this PR introduce _any_ user-facing change? MagicMTP will automatically take effect when the parameter "num_speculative_tokens" >= 3. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: chenaoxuan <cax1165@163.com>	2025-12-25 09:00:25 +08:00
Ascendyh	a90482803d	[Kernel] add l2norm triton kernel (#4595 ) ### What this PR does / why we need it? This pull request introduces an L2 normalization kernel implemented in Triton, specifically optimized for Ascend NPUs. ### Does this PR introduce _any_ user-facing change? No, this PR does not introduce any user-facing changes. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: Ascendyh <hw7osiris@outlook.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-25 06:06:18 +08:00
Mengqing Cao	e54630e01c	Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138 ) (#5317 ) ### What this PR does / why we need it? Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138) as it causes deepseek v3.2 hang error - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-12-24 22:24:17 +08:00
wangxiyuan	fb3d6ca08c	Cleanup uesless env (#5270 ) `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` is not used anywhere, let's remove it. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-24 22:07:59 +08:00
TmacAaron	5018f2d8fd	[quantization] Add w8a16 quantization support (#4541 ) ### What this PR does / why we need it? related to https://github.com/vllm-project/vllm-ascend/issues/4267 ### Does this PR introduce _any_ user-facing change? support w8a16 quantization now ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` ### Test tested using [aisbench](https://gitee.com/aisbench/benchmark/) with tp2 #### Precision \| ceval \| mmlu \| gsm8k -- \| -- \| -- \| -- bf16 \| 90.46 \| 89.17 \| 96.21 w8a16 \| 89.51 \| 89.29 \| 95.98 #### Performance \| input_len \| output_len \| concurrency \| TTFT (ms) \| TPOT (ms) \| TPS (Total) (tokens/s) -- \| -- \| -- \| -- \| -- \| -- \| -- bf16 \| 2048 \| 2048 \| 10 \| 1911.7136 \| 77.988 \| 253.9866 w8a16 \| 2048 \| 2048 \| 10 \| 2128.6334 \| 67.1633 \| 293.9117 bf16 \| 3500 \| 1024 \| 10 \| 3076.2509 \| 84.3525 \| 506.949 w8a16 \| 3500 \| 1024 \| 10 \| 2685.2031 \| 73.015 \| 585.4717 --------- Signed-off-by: yyt <yangyit139@gmail.com> Signed-off-by: TmacAaron <yangyit139@gmail.com> Co-authored-by: realliujiaxu <realliujiaxu@163.com>	2025-12-24 19:49:32 +08:00
linfeng-yuan	515267de22	[perf][bugfix] improve performance of rejection sampler and eliminate HD synchronize in TopKTopPSampler (#4154 ) ### What this PR does / why we need it? 1. Use optimized apply_top_k_top_p for NPU platfrom in rejection sampler; (avoid scatter elements which can reduce ~26ms TPOT with bs=24 per DP) 2. <del>Avoid D2H Synchronization before calling npu_top_k_top_p introduced by parameter validation which improves inference speed with `async_scheduling` enabled;</del> In order to elminate the D2H synchronization introduced by parameter validation before calling `npu_top_k_top_p`, we directly drop this fused operator since the performance improvement is not significant compared to async_scheduling and may bring potential accuracy problem. 3. Refactor the implementation of AscendTopKTopPSampler to align that of vLLM. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E serving test with combinations of `k=500` and `p=0.95` with async_scheduling in single node and wide-EP scenarios. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: realliujiaxu <realliujiaxu@163.com>	2025-12-24 19:10:33 +08:00
Li Wang	2f03a2f4a4	[CI] Skip some failed ops tests (#5309 ) ### What this PR does / why we need it? Skip some failed ops tests - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-24 18:29:34 +08:00
Nengjun Ma	42c989a437	Update vllm pin to 12.24 (#5307 ) ### What this PR does / why we need it? Fix vllm break in the pr: 1. [Add MiMo-V2-Flash support] (https://github.com/vllm-project/vllm/pull/30836) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Co-authored-by: zxwang [1476209578@qq.com](mailto:1476209578@qq.com) - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` --------- Signed-off-by: leo-pony <nengjunma@outlook.com> Signed-off-by: zxwang <1476209578@qq.com> Co-authored-by: zxwang <1476209578@qq.com>	2025-12-24 17:24:31 +08:00
ZYang6263	a3f65b938f	[Doc] Add pa_shape_list description to qwen dense tutorial (#5225 ) ### What this PR does / why we need it? Add pa_shape_list description to qwen dense tutorial. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: ZYang6263 <zy626375@gmail.com> Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>	2025-12-24 14:40:20 +08:00
Chen Chen	9227e6af73	[bugfix] remove the EP buffer allocation introduced by fused-op dispatch_ffn_c… (#5284 ) ### What this PR does / why we need it? - This PR removes the Expert Parallel (EP) HCCL buffer allocation that was previously introduced by the fused-op `dispatch_ffn_combine` (#3532 ), since the fused-op has switch to MC2 HCCL buffer (#5156 ). ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: Chen Chen <0109chenchen@gmail.com>	2025-12-24 11:26:19 +08:00
zhangyiming	74a1de50a9	[E2E] Optimize e2e test. (#5091 ) ### What this PR does / why we need it? [E2E] Optimize e2e test. - Remove the test_basic_camem testcase. - Change Qwen2.5-0.5B-Instruct-W8A8 to Qwen3-0.6B-W8A8 - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: menogrey <1299267905@qq.com>	2025-12-24 10:41:55 +08:00
zhangyiming	bd4fb871c6	[CI] Add skipped testcases. (#5254 ) ### What this PR does / why we need it? Some E2E testcases are not in our CI workflow, this PR add them back. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: menogrey <1299267905@qq.com>	2025-12-24 10:41:32 +08:00
wujinyuan1	7ff1db4b84	[Refactor]5/N Extract common code of mla_v1.py & extract mla_cp (#5097 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason： The functions related to Cp differ significantly from those of normal MLA-Attention, but the coupling is quite severe. Steps： 1)Extract common code AscendMLAMetadataBuilder.build to 4 functions: build_prefill_metadata, build_decode_metadata,build_cp_metadata, build_chunked_metadata todo： 1)refactor function _compute_prefill_context; 2)refactor function _mla_preprocess,_mla_decode_preprocess 3）Extract public data and processing functions from the attention_cp.py and mla_cp.py files to the common_cp file. vLLM version: 0.13.0rc3 vLLM main: `ad32e3e19c` - vLLM version: 0.13.0rc3 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wujinyuan1 <wjy9595@qq.com> Signed-off-by: wujinyuan1 <wujinyuan1@huawei.com> Co-authored-by: wujinyuan1 <wjy9595@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-24 10:25:19 +08:00
shaopeng-666	2a2d527e96	fix transformer version to 4.57.3 (#5250 ) ### What this PR does / why we need it? In certain scenarios (such as smoke testing), the source code is used to update the vllm-ascend version for running updated models (such as Qwen3-VL). However, vllm and vllm-ascend themselves have no restrictions on the transformer version, and the transformer will not be updated, resulting in errors when launching the model. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>	2025-12-23 23:55:40 +08:00
Nengjun Ma	3b59f20a28	update to vllm 12-19 (#5223 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? Fix vllm break: 1. [Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4% TTFT improvement] (https://github.com/vllm-project/vllm/pull/29558) Fix Solution: Add the now-necessary `all2all_backend` parameter. The impact of this parameter on the original `set_splitting_ops_for_v1` implementation is only that graph mode is disabled in `vllm` if `deepep_high_throughput` is enabled; it has no effect on the `vllm-ascend` logic. 2.[Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface ] (https://github.com/vllm-project/vllm/pull/30684) Fix Solution: The reason why the GPU does not need to convert qkv to 3D is that the GPU's flash_attention operator is compatible with 3D and 4D (b s h d and s b ( h d)), but the NPU's flash_attention_unpad operator only supports 3D (s b ( h d)). Therefore, we need to introduce the reshape_qkv_to_3d operation. 4.Skip Tencent-Hunyuan/HunyuanOCR test case, as it has following issue in upgrade vllm code: https://github.com/vllm-project/vllm-ascend/issues/5297 ### How was this patch tested? Co-authored-by: zxwang <1476209578@qq.com> - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: leo-pony <nengjunma@outlook.com> Signed-off-by: zxwang <1476209578@qq.com> Co-authored-by: zxwang <1476209578@qq.com>	2025-12-23 23:52:11 +08:00
Zhu Yi Lin	e14514e2fd	[Bugfix] quick fix balance scheduling patch (#5281 ) ### What this PR does / why we need it? quick fix balance scheduling patch - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: GDzhu01 <809721801@qq.com>	2025-12-23 21:23:05 +08:00
weichen	ffe51eedd6	[Refactor][MoE] Reuse vLLM's all_reduce logic (#5189 ) ### What this PR does / why we need it? Move all_reduce logic to AscendFusedMoE.forward, reuse vLLM's logic. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: weichen <calvin_zhu0210@outlook.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-23 18:53:48 +08:00
zhangxinyuehfad	8ae7fca947	[CI] refect e2e ci test (#5246 ) ### What this PR does / why we need it? efect e2e ci test： 1. tests/e2e/singlecard/pooling/test_embedding.py: remove the eager parameter and rename test case 2. tests/e2e/singlecard/pooling/test_scoring.py: Rename test cases 3. tests/e2e/singlecard/pooling/test_classification.py: Rename test case 4. tests/e2e/singlecard/test_quantization.py: remove the eager parameter and chage model to vllm-ascend/Qwen2.5-0.6B-W8A8 and Rename test case 5. tests/e2e/multicard/test_shared_expert_dp.py: Rename test cases 6. tests/e2e/singlecard/test_sampler.py: Rename test cases 7. tests/e2e/singlecard/test_aclgraph_accuracy.py: Rename test cases 8. tests/e2e/multicard/test_offline_inference_distributed.py: Rename test cases and remove the eager parameter 9. tests/e2e/multicard/long_sequence/test_accuracy.py: Rename test cases and remove the eager parameter 10. tests/e2e/multicard/long_sequence/test_basic.py: Rename test cases and remove the eager parameter 11.tests/e2e/multicard/test_expert_parallel.py:remove the eager parameter 12.tests/e2e/multicard/test_full_graph_mode.py:remove the eager parameter 13.tests/e2e/multicard/test_ilama_lora_tp2.py:remove the eager parameter 14.tests/e2e/singlecard/spec_decode_v1/test_v1_mtp_correctness.py:remove the eager parameter 15.tests/e2e/singlecard/spec_decode_v1/test_v1_spec_decode.py:remove the eager parameter 16.tests/e2e/singlecard/test_aclgraph_accuracy.py:remove the eager parameter 17.tests/e2e/singlecard/test_camem.py:remove the eager parameter 18.tests/e2e/singlecard/test_ilama_lora.py:remove the eager parameter 19.tests/e2e/singlecard/test_multistream_overlap_shared_expert.py:remove the eager parameter 20.tests/e2e/singlecard/test_vlm.py:remove the eager parameter 21.tests/e2e/singlecard/test_xli:remove the eager parameter ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-12-23 18:42:35 +08:00
Li Wang	5d1f6daef6	[CI] Mock spawn for vlm tests (#5279 ) ### What this PR does / why we need it? Using `spawn` in continuous testing scenarios ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-23 18:35:06 +08:00
Tiger Xu / Zhonghu Xu	cb963c53a5	[Doc] Added deploying on k8s with kthena (#4674 ) ### What this PR does / why we need it? [Kthena](https://github.com/volcano-sh/kthena) is a Kubernetes-native LLM inference platform that transforms how organizations deploy and manage Large Language Models in production. Built with declarative model lifecycle management and intelligent request routing, it provides high performance and enterprise-grade scalability for LLM inference workloads. The platform extends Kubernetes with purpose-built Custom Resource Definitions (CRDs) for managing LLM workloads, supporting multiple inference engines (vLLM, SGLang, Triton) and advanced serving patterns like prefill-decode disaggregation. This pr added a example on deloying llm on Ascend Kubernetes clusters. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: Zhonghu Xu <xuzhonghu@huawei.com>	2025-12-23 17:46:04 +08:00
Slightwind	22138e2727	[main][Refactor] Remove `with_prefill` parameter from `set_ascend_forward_context` (#5094 ) Removes the redundant `with_prefill` parameter from `set_ascend_forward_context` to align the interface with vLLM's `set_forward_context` for future refactoring. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Signed-off-by: Slightwind <slightwindsec@gmail.com> Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>	2025-12-23 14:30:50 +08:00
SILONG ZENG	fa0c212bfa	[test]Corrected the Qwen3-Omni-30B-A3B-Instruct accuracy test configuration in nightly tests. (#5195 ) ### What this PR does / why we need it? Corrected the Qwen3-Omni-30B-A3B-Instruct accuracy test configuration in nightly tests. link: https://github.com/vllm-project/vllm-ascend/pull/4911 - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	2025-12-23 14:17:27 +08:00
SILONG ZENG	29a93daa82	[CI]refactor: standardize test case naming convention (#5243 ) ### What this PR does / why we need it? - Standardize test case naming in `vllm-ascend/tests/e2e/multicard/` to follow the `<model>_<feature>_<distributed>` convention. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	2025-12-23 14:13:42 +08:00
meihanc	592cfb6a6f	[CI] Add Triton Ascend in CI (#4921 ) Add triton-ascend in UT and e2e - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2025-12-23 12:47:35 +08:00
LI SHENGYONG	2e010e12dd	[EPLB][CI] Add dynamic EPLB CI for qwen3-moe (#5179 ) ### What this PR does / why we need it? Add dynamic EPLB CI for qwen3-moe-30B-W8A8 - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2025-12-23 11:31:00 +08:00
Mengqing Cao	449f8f65a7	[KV-Sharing] Support KV-Sharing feature in CLA models (#4138 ) ### What this PR does / why we need it? Support KV-Sharing feature in CLA (cross layer attention) models, which sharing kv cache in some layers. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-12-23 10:48:31 +08:00
Li Wang	9a79cbaecb	[ModelRunner] Add hunyuan-vl basic support (#5151 ) ### What this PR does / why we need it? This patch add handling of `XDRotaryEmbedding` in modelrunner to support for `hunyuan-vl` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? CI passed with added/exist tests Closes: https://github.com/vllm-project/vllm-ascend/issues/4992 - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-23 10:46:54 +08:00
rongfu.leng	c9b5881bcd	[Doc] fix docs set rope_theta value is 10e6 in qwen3-235b model (#5258 ) ### What this PR does / why we need it? Fixes https://github.com/vllm-project/vllm-ascend/issues/5201 ### Does this PR introduce _any_ user-facing change? No, doc only ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: rongfu.leng <lenronfu@gmail.com>	2025-12-23 10:21:46 +08:00
Shanshan Shen	6c478531f8	[CustomOp] Register AscendApplyRotaryEmb CustomOp and remove related patch (#4667 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm/pull/29873, register `AscendApplyRotaryEmb` CustomOp and remove related patch. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? #### ✅ Test Qwen2.5-VL Run: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-b02c1ff3415d2462","object":"chat.completion","created":1766129265,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-In struct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is \"TONGYI Qwen.\" The word \"TONGYI\" is writ ten in blue, and \"Qwen\" is written in gray. The text appears to be part of a logo or branding design.","refusal":null,"annotations":null,"audio": null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"tok en_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":129,"completion_tokens":51,"prompt_tokens_d ``` #### ✅ Test Qwen3-VL Run: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-a3a7de5a900a9321","object":"chat.completion","created":1766129586,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is “TONGYI Qwen”.\n\n### How it looks:\n- “TONGYI” is written in uppercase letters in a bold, modern sans-serif font, colored blue.\n- “Qwen” is written in lowercase letters in a slightly thinner, elegant sans-serif font, colored dark gray.\n- The two lines of text are stacked vertically, with “TONG","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":112,"total_tokens":212,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-12-23 10:04:37 +08:00
zhangyiming	35dbdbb398	[Doc] Add new contributors and relative scripts. (#5070 ) ### What this PR does / why we need it? [Doc] Add new contributors and relative scripts. Usage of scripts: - `export GITHUB_TOKEN=<your github token>` - `bash tools/collect_user_first_contribution.sh vllm-project/vllm-ascend <base_sha> <head_sha>` and save the result to one temporary file such as `contributors.txt` - `python tools/format_contributors.py contributors.txt --start <start index now>` - Use the output to update the `contributors.md` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: menogrey <1299267905@qq.com>	2025-12-23 10:01:45 +08:00
Zhu Yi Lin	3d04ae8e7d	[Main] [Patch] support balance scheduling patch (#5212 ) ### Motivation. Limitations of the current vLLM v1 scheduling strategy vLLM v1 scheduling currently enables chunkedprefill by default, which processes prefill and decode requests simultaneously in a single scheduling session. This can impact the overall system throughput and performance in some scenarios. Balance scheduling addresses this issue by synchronizing the number of running queues across all schedulers to delay the scheduling of new requests, thereby improving the overall system's steady-state decoding time. This achieves: ✅Adding `balance_gather` to the scheduler synchronizes the number of requests in the running queues between DPs. ✅Balance scheduling improves the decode steady-state time, thereby increasing the overall output throughput of the inference system. ### Proposed Change. 1.Feature Overview In the vLLM scheduler, running requests (i.e., requests that are already undergoing pre-filled computation) have the highest priority, followed by waiting requests (i.e., requests that have not yet been computed). As shown in the diagram above, when the entire inference system exits from a steady state, the scheduler will schedule a batch of new requests for prefill operations and then synchronize them among the dynamic programming (DP) models. This can cause some DP models that are entirely decoded to synchronize with the number of prefilled tokens. Frequent prefill scheduling by certain DP models can lead to a deterioration in the overall system output throughput. Balance scheduling synchronizes the number of running queue requests across different DPs, and only schedules new requests for prefilling when at least every scheduler has fewer than max_nun_requst. 2.Implementation Design 3.Experiment Results - Fixed-length input scenario: In the performance test scenario with 3.5K fixed-length input and 1.5K fixed-length output, the throughput performance was improved by approximately 18% after adding balance scheduling. \| Method \| Model \| Input Len \| Request Count \| Output Len \| BatchSize \| Average TTFT \| Average TPOT \| e2e duration \| Input Token Throughput \| Output Token Throughput \| Request Throughput \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| \| Baseline \| DeepSeekV3.1 \| 3500 \| 512 \| 1500 \| 128 \| 6600 \| 86.85 \| 591.9s \| 3030.5 \| 1297.3 \| 0.86 \| \| Balance scheduling \| DeepSeekV3.1 \| 3500 \| 512 \| 1500 \| 128 \| 7012 \| 70.63 \| 501.7s \| 3575.7 \| 1530.7 \| 1.02 \| 4.Demo PR [#29721 ](https://github.com/vllm-project/vllm/pull/29721) --------- Signed-off-by: GDzhu01 <809721801@qq.com>	2025-12-23 09:04:38 +08:00
zhangyiming	f883a2edb9	[Doc] Update the weight download URL. (#5238 ) ### What this PR does / why we need it? Update the weight download URL. Because the model was renamed. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: menogrey <1299267905@qq.com>	2025-12-23 08:53:30 +08:00
Wang Kunpeng	c3a8d13ca7	[refactor] Remove unnecessary attributes from set_ascend_forward_context (#5204 ) ### What this PR does / why we need it? Remove unnecessary attributes from set_ascend_forward_context 1.prefetch_stream 2.weight_prefetch_method ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-12-23 08:49:52 +08:00
weijinqian0	95e8a52156	[Refactor] move the metadata from attention_v1 to util(ready for extract common_cp) & realize Ascendmetadata inherit from the parent class. (#5203 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 1. Remove the pcp-related code from attention_v1. 2. Establish the inheritance relationship of CommonAttentionMetadata. TODO 1. extract common_cp 2. move cp metadata to common_cp. 3. remove commonAttentionMetadata for aclgraph. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-12-23 00:10:52 +08:00
ApsarasX	3d9954eff0	[Bugfix] Use hf_text_config instead of hf_config to support multimodal PD-Disaggregated (#5205 ) ### What this PR does / why we need it? In code files such as`mooncake_connector.py`, `vllm_config.model_config.hf_config` is used to get the LLM configs. This approach works for LLMs, but not for multi-modal models. For multi-modal models, `vllm_config.model_config.hf_text_config` must be used instead to get the LLM configs. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UT - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-12-22 20:21:45 +08:00
jiangyunfan1	3ba920a65b	[TEST]Update mm param --mm-processor-cache-gb (#5242 ) ### What this PR does / why we need it? This PR updates the mm param --mm-processor-cache-gb, we need it to run the case ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running the test - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>	2025-12-22 18:54:03 +08:00
zhangxinyuehfad	61efaffcaf	[Bugfix] Implement multimodal_cpu_fields in model runner (#5196 ) ### What this PR does / why we need it? Related to https://github.com/vllm-project/vllm-ascend/issues/4084 Implement multimodal_cpu_fields in model runner - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-12-22 18:39:45 +08:00
zzzzwwjj	052e472453	[bugfix] fix w8a8dynamic fused_moe trans nz (#5199 ) ### What this PR does / why we need it? Currently, `torch_npu.npu_grouped_matmul_swiglu_quant` can only support weight nz, so we need to trans w13_weight, w2_weight to nz forcely. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-12-22 17:45:34 +08:00
lvjunqi	55beac9c91	[Feat]Xlite Qwen3-vl Support (#5228 ) ### What this PR does / why we need it? This patch adds support for the Qwen3-VL model in Xlite. For more details about Xlite, please refer to the following link:https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md. The latest performance comparison data between xlite and the default aclgraph mode is as follows: ### Does this PR introduce _any_ user-facing change? XLite graph mode supports the Qwen3-VL model. ### How was this patch tested? vLLM version: v0.12.0 - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: lvjunqi <lvjunqi1@huawei.com> Co-authored-by: lvjunqi <lvjunqi1@huawei.com>	2025-12-22 16:30:52 +08:00
zhangsicheng5	78aa7f2693	[feature] support pcp + mtp in full graph (#4572 ) 1. support pcp + mtp in full graph 2. pcp/dcp related mtp bugfix 3. support pcp + mtpx - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>	2025-12-22 16:13:39 +08:00
Zhu Yi Lin	12d581605b	[Triton]support swiglu_quant triton in w4a8 (#5161 ) ### What this PR does / why we need it? support swiglu_quant triton in w4a8 ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: GDzhu01 <809721801@qq.com>	2025-12-22 16:01:58 +08:00
Yizhou	60d9398f6d	[1/N][Eagle3] Aligns auxiliary hidden state usage for eagle3 models (#5162 ) ### What this PR does / why we need it? This is to prepare for the migration to vLLM's `EagleProposer`, it does not have `name` attribution. Also it's a breakdown of #5100 . Introduces logic to determine whether eagle3 heads require auxiliary hidden states based on configuration, ensuring consistent handling across related components. Prevents incorrect assumptions for eagle3 variants that do not use auxiliary outputs, improving compatibility and correctness. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-12-22 15:24:54 +08:00
wangxiyuan	b62b2ebd9b	[Doc] Update readme (#5226 ) Add 0.11.0 news in Readme and correct main branch maintain rule - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-22 15:21:16 +08:00
dependabot[bot]	4861484b68	Bump actions/checkout from 4 to 6 (#5234 ) Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-12-22 15:16:43 +08:00
dependabot[bot]	11a25497ce	Bump actions/upload-artifact from 4 to 6 (#5233 ) Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 6. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-12-22 15:15:45 +08:00
Qiu	64669c4243	[misc][FlashComm1][ACLGraph] Incompatibility between Flashcomm1 and FULL_DECODE_ONLY. (#5200 ) ### What this PR does / why we need it? Currently, Flashcomm1 and FULL_DECODE_ONLY are incompatible. When both features are enabled, graph capture errors occur without clear error messages. After discussion, it has been determined that enabling FULL_DECODE_ONLY with Flashcomm1 in mixed deployment scenarios provides almost no TPOT benefit. Additionally, a reconstruction of the decode phase for flashcomm1 is currently underway. Therefore, related adaptation work is temporarily postponed and will be addressed after the decode phase reconstruction plan is finalized. For now, an assert will be added to provide clear error messages and correct deployment recommendations. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? NO - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2025-12-22 14:33:32 +08:00
Shanshan Shen	b84ad8c5d8	[CustomOp] Register AscendMMEncoderAttention CustomOp and remove related patch (#4750 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm/pull/30125, register `AscendMMEncoderAttention` CustomOp and remove related patch. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? ✅ Run Qwen2.5-VL: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-b4e3053f30ab2442","object":"chat.completion","created":1764922950,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the image is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The font appears to be modern and clean, with \"TONGYI\" being slightly larger than \"Qwen.\" The design includes a geometric, abstract shape on the left side of the logo, which complements the text.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":162,"completion_tokens":84,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` ✅ Run Qwen3-VL: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-97571fbda8267bd1","object":"chat.completion","created":1764923306,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is “TONGYI Qwen”.\n\n### How it looks:\n- “TONGYI” is written in uppercase letters in a bold, modern sans-serif font, colored blue.\n- “Qwen” is written in lowercase letters in a slightly thinner, elegant sans-serif font, colored dark gray.\n- The two lines of text are stacked vertically, with “TONG","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":112,"total_tokens":212,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: shen-shanshan <467638484@qq.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-12-22 14:32:53 +08:00
Ascendyh	b2c121637f	[task] Add fused gdn gating triton kernel (#4304 ) ### What this PR does / why we need it? This commit introduces a Triton-based fused GDN gating kernel for Ascend NPU, aimed at improving performance in the Gated Delta Net workflow. ### Does this PR introduce _any_ user-facing change? It only adds and refactors internal Triton kernels and wrappers for Ascend. These are backend implementation details. There are no new APIs, flags, CLI options, or behavior changes visible to end users. ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Ascendyh <hw7osiris@outlook.com>	2025-12-22 14:09:19 +08:00

1 2 3 4 5 ...

1815 Commits