xc-llm-ascend

Author	SHA1	Message	Date
Chen Chen	6b290acfe1	remove redundant params in mla_preprocess kernel (#3530 ) ### What this PR does / why we need it? This pull request removes the redundant parameters `gamma1` and `beta1` (also named `gamma0`/`beta0` in some places) from the `mla_preprocess` kernel and its calling hierarchy. The changes are consistent across C++ kernel code, bindings, and Python call sites. The parameters were unused in the lower-level functions, so their removal is a good cleanup. ### Does this PR introduce _any_ user-facing change? The python interface of the kernel is affected, and the params of `gamma0` and `beta0` are not needed. ### How was this patch tested? The unit-test of the kernel is adapted accordingly. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: mojave2 <chenchen145@huawei.com>	2025-10-21 19:20:13 +08:00
Yizhou	ec1d2b5c04	[Test] Temporarily skip flaky ACL graph test (#3577 ) ### What this PR does / why we need it? Disables `FULL_DECODE_ONLY` end-to-end test that fails intermittently. This prevents CI blockages while the root cause of the flakiness is investigated. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None needed. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-21 17:16:15 +08:00
lilinsiman	70bef33f13	add new accuracy test case for aclgraph (#3390 ) ### What this PR does / why we need it? Add new accuracy test case Deepseek-V2-Lite-W8A8 for aclgraph ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-10-20 20:04:04 +08:00
anon189Ty	248ee7fa11	[Feat]Make full graph mode compalible with MTP (#3276 ) ### What this PR does / why we need it? Make the Full Graph mode can run with MTP. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>	2025-10-17 20:19:56 +08:00
lilinsiman	1b424fb7f1	ACLgraph enable: Test cases revisions for all features (#3388 ) ### What this PR does / why we need it? This PR revise the test cases of various features on the warehouse which add the enablement of aclgraph to the test cases. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-10-17 17:15:19 +08:00
weichen	cec1fab509	Revert "[MoE] [Refactor] Remove manual memory cleanup (#3365 )" (#3483 ) This reverts commit `4f937f561d`. ### What this PR does / why we need it? This reverts commit `4f937f561d`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-10-15 22:25:46 +08:00
weichen	4f937f561d	[MoE] [Refactor] Remove manual memory cleanup (#3365 ) ### What this PR does / why we need it? 1. Replace manual memory cleanup with passing parameter. 2. FusedMoEPrepareAndFinalizeWithMC2 inherits All2All avoid duplicated code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-10-15 12:36:24 +08:00
CaranLic	15b2e5c995	Remove unused row_idx in token_dispatcher (#3442 ) ### What this PR does / why we need it? The `row_idx` parameter is no longer used since PR[#2689](https://github.com/vllm-project/vllm-ascend/pull/2689), so remove it across multiple files to remove unnecessary calculations and parameter passing. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? accuracy test passed for Qwen3 235B and DeepSeek V3 671B after this PR. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: CaranLic <740821011@qq.com>	2025-10-15 09:08:31 +08:00
zouyida2052	3642b64afc	bugfix for mtp with multistream_moe (#3419 ) ### What this PR does / why we need it? when infer deepseek mtp layer with multistream_moe, we should pass a boolean to evaluate this feature and fix bugs when we are in mtp layer - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2025-10-15 08:59:58 +08:00
xuyexiong	02c26dcfc7	[Feat] Supports Aclgraph for bge-m3 (#3171 ) ### What this PR does / why we need it? [Feat] Supports Aclgraph for bge-m3 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ``` pytest -s tests/e2e/singlecard/test_embedding.py pytest -s tests/e2e/singlecard/test_embedding_aclgraph.py ``` to start an online server with bs 10, each batch's seq length=8192, we set --max-num-batched-tokens=8192*10 to ensure encoder is not chunked: ``` vllm serve /home/data/bge-m3 --max_model_len 1024 --served-model-name "bge-m3" --task embed --host 0.0.0.0 --port 9095 --max-num-batched-tokens 81920 --compilation-config '{"cudagraph_capture_sizes":[8192, 10240, 20480, 40960, 81920]}' ``` For bs10, each batch's seq length=8192, QPS is improved from 85 to 104, which is a 22% improvement, lots of host bound is reduced. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: xuyexiong <xuyexiong@huawei.com> Co-authored-by: wangyongjun <1104133197@qq.com>	2025-10-14 23:07:45 +08:00
yuzhup	78777237a9	[2/N][Feat] Attention and MoE weight prefetch in Qwen3MoE models (#3203 ) ### What this PR does / why we need it? - Refacotr and integrate a unified `WeightPrefetchMethod` - Integrate `gate_up_proj.weight` in quantized Attention modules - Prefetching these weights ahead of matmul-like operators imporves performance by reducing L2 cache transfer latency ### Does this PR introduce _any_ user-facing change? Add a new config in `--additional-config` for configuration: ```json { "weight_prefetch_config": { "enabled": True, "prefetch_ratio": { "moe": { "gate_up": 0.8 }, }, }, } ``` This feature is enabled by default, and can be disabled through this configuration ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: yuzhup <15705211260@163.com>	2025-10-14 20:16:33 +08:00
Chen Chen	bcc313e8f2	add mla_preprocess kernel (#3226 ) ### What this PR does / why we need it? - Adds the `mla_preprocess` custom kernel to provide an optimized pre-processing operator for Multi-head Latent Attention (MLA) on Ascend NPUs. - Wires the new kernel into the C++ extension pipeline so vLLM can invoke it directly, cutting Python-side tensor shuffling and memory copies that previously bottlenecked MLA compilation paths. ### Does this PR introduce any user-facing change? - No. The change only introduces a low-level kernel; public APIs and inference behavior remain unchanged. ### How was this patch tested? - Dedicated Ascend kernels are not covered by our CI yet, so no extra automated tests were added. Future MLA-focused regression runs will cover this path. - vLLM version: v0.11.0 Signed-off-by: Chen Chen <0109chenchen@gmail.com>	2025-10-12 07:39:45 +08:00
panchao-hub	1756efa5fd	[Feat][Graph]Support FULL_DECEDE_ONLY mode for MLA models (#3125 ) ### What this PR does / why we need it? Adds support for capturing the Multi-Layer Attention (MLA) decode operation into an ACL graph. This improves performance by compiling the attention kernel for single-token decoding. Key changes include: - Implementing the graph capture logic for the MLA kernel, including workspace management and parameter updates. - Modifying the rotary embedding (RoPE) handling to use pre-allocated tensors, which is a requirement for graph capture. - Adding a `build_for_graph_capture` method to the MLA metadata builder to create dummy metadata during the graph compilation phase. Known issues: - Currently, MTP is not supported in FULL_DECEDE_ONLY mode -- we're working on a fix - We are preparing to remove update_mla_attn_params with auto_dispatch_capture ### Does this PR introduce _any_ user-facing change? compilation_config={ "cudagraph_mode": "FULL_DECODE_ONLY", }, ### How was this patch tested? - vLLM version: v0.11.0 --------- Signed-off-by: panchao-hub <315134829@qq.com> Signed-off-by: p00465316 <panchao13@huawei.com> Co-authored-by: p00465316 <panchao13@huawei.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-10 16:31:20 +08:00
wangxiyuan	1c5b302f0d	[Misc] Clean up useless patch (#3320 ) ### What this PR does / why we need it? 1. clean up v0.10.2 support in ut and e2e test 2. remove v0.11.0 period job, we're at v0.11.0 now. 3. remove uesless patch for deepseek v3.2. They have been done in vLLM already. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-10-09 14:07:26 +08:00
wangxiyuan	c73dd8fecb	[CI] Fix CI by addressing max_split_size_mb config (#3258 ) ### What this PR does / why we need it? Fix CI by addressing max_split_size_mb config ### Does this PR introduce _any_ user-facing change? No, test onyl ### How was this patch tested? Full CI passed, espcially eagle one - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-29 14:05:12 +08:00
wangxiyuan	15b8aff582	[CI] Add max_split_size_mb for e2e test to avoid oom (#3252 ) ### What this PR does / why we need it? we add a patch for model weight loader to avoid using vLLM weight loader v2, since v2 will lead unknown issue for torchair. While this patch make some unknown memory usage problem. To quick fix the problem, let's expend the `max_split_size_mb` to a larger value to avoid weight load oom issue. Further solution is to remove the patch and address weight loader v2 from vLLM. Closes: https://github.com/vllm-project/vllm-ascend/issues/3251 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-29 09:13:08 +08:00
Icey	2a9d02e080	[Bugfix] eagle and eagle3 spec decode failures and enable e2e test (#2979 ) ### What this PR does / why we need it? - Fix the bug https://github.com/vllm-project/vllm-ascend/issues/2978 - Enable e2e test, - Adapt to scenarios where Speculative tokens are greater than 2, - Fix the bug that causes Eagle3 inference failures under high concurrency and improve the acceptance rate of draft models, by https://github.com/vllm-project/vllm-ascend/pull/2794 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? CI passed with new added/existing test. Co-authored-by: hukongyi [hukongyi@cmbchina.com](mailto:hukongyi@cmbchina.com) Co-authored-by: guanyuzhu [zhuguanyu@huawei.com](mailto:zhuguanyu@huawei.com) Co-authored-by: liumail680 [liumail680@163.com](mailto:liumail680@163.com) - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-09-25 14:39:12 +08:00
Li Wang	12bcbd02bb	[CI] Upgrade vLLM to 20250919 (6d8246aa) and fix some broken issue (#2907 ) ### What this PR does / why we need it? 1. This pr bump vllm commit to `6d8246aaff` 2. fix upstream changes https://github.com/vllm-project/vllm/pull/24548 abort multi-modal kwargs, make vllm main and `v0.10.2` both adaptable 3. fix metadata_builder changes introduced by https://github.com/vllm-project/vllm/pull/23693 4. fix `structured_outputs_config` changes introduced by https://github.com/vllm-project/vllm/pull/22772 5. fix `moe_config` changes introduced by https://github.com/vllm-project/vllm/pull/22537 Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com> - vLLM version: v0.10.2 - vLLM main: `c60e6137f0` --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-09-20 17:37:57 +08:00
whx	0a526768f5	[Feature] Support moe multi-stream for aclgraph. (#2946 ) This PR puts the calculation of shared experts into a separate stream, overlaping with routing experts. - vLLM version: v0.10.2 - vLLM main: `fbd6523ac0` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-09-19 11:06:45 +08:00
xuyexiong	6681dde902	[Feat][Graph] Support MTP for ACL Graph (#2932 ) ### What this PR does / why we need it? This PR depends on the merge of #2707 and has adapted the aclgraph functionality to support MTP. ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `2b85697031` --------- Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-09-18 14:05:33 +08:00
wangxiyuan	382c29f3e1	[BugFix] Fix world size bug in model_runner (#2915 ) - Fix world size bug in model_runner to make sure ep>16 runs with MC2 - enable e2e test for vl Co-Authored-By: whx-sjtu <2952154980@qq.com> Co-Authored-By: Icey <1790571317@qq.com> - vLLM version: v0.10.2 - vLLM main: `3e903b6cb4` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-14 12:20:25 +08:00
Jiawei Li	e57cca971c	Fix the bugs about operator registration by PyTorch Dispatcher (#2786 ) Background: There are two principles about operator registration in PyTorch - The same namespace can be only registered once by `TORCH_LIBRARY` - The operator signatures can be only registered once by `def` Considering that all custom operators defined in the current repo are only used by Ascend, instead of defining a common operator schema by vLLM, all accelerators then follow this operator schema and complete the implementation based on their respective hardware, which is conducive to functional abstraction. Therefore, we can rename the operator registration namespace to an Ascend-specific namespace(_C_ascend). Related ISSUE: https://github.com/vllm-project/vllm-ascend/issues/2742 - vLLM version: main - vLLM main: `f592b3174b` Signed-off-by: FFFrog <ljw1101.vip@gmail.com>	2025-09-13 11:58:52 +08:00
无脸男	c3c2221503	[Feat]support dynamic quantization in allgather (#2841 ) ### What this PR does / why we need it? [Feat]support dynamic quantization in allgather ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: main - vLLM main: `5931b7e5d9` Signed-off-by: withHades <244036962@qq.com> Signed-off-by: WithHades <244036962@qq.com>	2025-09-11 18:47:20 +08:00
jiangpeng	2b9269b581	[Perf][V1] Fully overlap model execution (#2783 ) This PR is based on top of [#23569](https://github.com/vllm-project/vllm/pull/23569) and [#24219](https://github.com/vllm-project/vllm/pull/24219). ### What this PR does / why we need it? This PR allows the model runner to function asynchronously when using async scheduling. This allows full overlap of the cpu operations (including prepare_inputs) and the model forward pass. This diff is functional and does not support speculative decoding, PP, or guided decoding. Expected speedup is 5-10% over the current async scheduling. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? server ``` python -m vllm.entrypoints.openai.api_server --model=Qwen3-32B\ --trust-remote-code --enforce-eager \ --distributed-executor-backend=mp \ -tp=4 \ --port 8006 \ --max-model-len 32000 \ --block-size 128 \ --gpu-memory-utilization 0.99 ``` client ``` python $TEST_PY --backend vllm --trust-remote-code --model Qwen3-32B \ --dataset-name random --random-input-len 2048 --random-output-len 2048 \ --ignore-eos\ --num-prompts 48 --max-concurrency 48 --request-rate inf --temperature 0 \ --metric-percentiles 90 --base-url http://localhost:8006 --save-result \ --result-dir $PROFILER_DIR ``` benchmark test based on Qwen3-32B TPOT result: \|\|forward async\| scheduler async \|sync\| \|-\|-\|-\|-\| \|avg\|41.73\|41.86\|44.20\| \|improve0\|0.3%\|0\|0\| \|improve1\|5.58%\|0\|0\| benchmark test based on Qwen2___5-VL-7B-Instruct TPOT result: \|\|forward async\|sync\| \|-\|-\|-\| \|avg\|23.22\|29.16\| \|improve\|20.3%\|0\| - vLLM version: main - vLLM main: `e93f4cc9e3` Signed-off-by: jiangpeng36 <jiangpeng36@huawei.com> Signed-off-by: Ronald1995 <ronaldautomobile@163.com> Co-authored-by: jiangpeng36 <jiangpeng36@huawei.com> Co-authored-by: Ronald1995 <ronaldautomobile@163.com>	2025-09-11 16:35:36 +08:00
weichen	a041d4f328	[main] [refactor] refactor common_fused_moe.py (#2706 ) ### What this PR does / why we need it? 1. Move prepare/finalize operation from moe_comm_method to /ops/moe/fused_moe_prepare_and_finalize 2. Adapt to token_dispatcher in moe_comm_method 3. Move moe_comm_method/experts_selector/token_dispatcher/fused_moe_prepare_and_finalize to /ops/moe ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? e2e & ut - vLLM version: v0.10.1.1 - vLLM main: `f4962a6d55` Signed-off-by: weichen <calvin_zhu0210@outlook.com> Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com> Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>	2025-09-08 20:09:50 +08:00
1092626063	5b3646ab21	[FEATURE][MTP] Support MTP > 1 (#2708 ) ### What this PR does / why we need it? [RFC：Support MTP > 1 for DeepSeek](https://github.com/vllm-project/vllm-ascend/issues/2745) - [x] dp1 tp16 - [x] dp4 tp4 - [x] dp2 tp 8 - [x] torchair graph - vLLM version: v0.10.1.1 - vLLM main: `c9f7081f9c` Signed-off-by: 1092626063 <1092626063@qq.com>	2025-09-05 09:11:22 +08:00
sherie	f86596a66c	allgather use fusedop. (#2689 ) ### What this PR does / why we need it? Use 'npu_moe_init_routing_v2' &'npu_moe_token_unpermute' repalce 'npu_moe_init_routing' &‘npu_moe_compute_expert_tokens’& 'npu_moe_finalize_routing' to optimize performance ### Does this PR introduce _any_ user-facing change? \| branch\| tps\| TTFT \|TPOT \| \| --- \| --- \| --- \|--- \| \|main \|733.98 \| 280.05 \|34.30 \| \|main+fusedop \| 740.33 \| 273.34 \|33.99 \| ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `6997a25ac6` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-09-04 11:56:29 +08:00
wangxiyuan	24d4dad7b2	[CI] Enable MTP torchair e2e test (#2705 ) enable MTP torchair e2e test - vLLM version: v0.10.1.1 - vLLM main: `ce30dca5c4` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-03 08:57:43 +08:00
wangxiyuan	0829b4873f	[CI] recover e2e test (#2688 ) 1. recover the skipped test. 2. remove pangu eager mode test, it's tested by torchair mode already. 3. skip pangu test util the bug is fixed. - vLLM version: v0.10.1.1 - vLLM main: `56d04089ef` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-02 18:49:17 +08:00
xuyexiong	214b32a346	[V1][BUGFIX][0.10.1] FIX mtp on main branch (#2632 ) ### What this PR does / why we need it? Fix MTP torchair bug caused by torchair refactor and moe refactor Depends on PRs: fused moe fix: https://github.com/vllm-project/vllm-ascend/pull/2627 torchair multi DP fix: https://github.com/vllm-project/vllm-ascend/pull/2626 ### Does this PR introduce _any_ user-facing change? when dp is enabled, to run mtp online server, need to disable server log due to the current metrics does not support multi dp `--disable-log-stats` ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `7c8271cd1e` Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-09-02 11:12:41 +08:00
wangxiyuan	fef18b60bc	Refactor e2e CI (#2276 ) Refactor E2E CI to make it clear and faster 1. remove some uesless e2e test 2. remove some uesless function 3. Make sure all test runs with VLLMRunner to avoid oom error 4. Make sure all ops test end with torch.empty_cache to avoid oom error 5. run the test one by one to avoid resource limit error - vLLM version: v0.10.1.1 - vLLM main: `a344a5aa0a` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-02 09:02:22 +08:00
weichen	320edde2df	[main] [refactor] refactor fused_moe.py to enable token_dispatchers (#2570 ) ### What this PR does / why we need it? Enable token_dispatcher to replace fused_experts_with_xxx in eager mode ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? e2e & ut - vLLM version: v0.10.1.1 - vLLM main: `704432af3c` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com> Co-authored-by: sherie <963372609@qq.com> Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com> Co-authored-by: shiyuan680 <72335504+shiyuan680@users.noreply.github.com>	2025-08-28 10:13:35 +08:00
wangxiyuan	f22077daa6	[Embedding] Recover embedding function (#2483 ) Fix broken embedding function. It's broken by http://github.com/vllm-project/vllm/pull/23162 - vLLM version: v0.10.1.1 - vLLM main: `efc88cf64a` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-27 09:22:01 +08:00
s30076806	6a4ec186e7	[Qwen-moe] Remove the minor operation arange (#2373 ) ### What this PR does / why we need it? Integrate the arange operator to reduce the time spent and improve performance ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `56dcf4e7e9` --------- Signed-off-by: s30076806 <songjiayang2@h-partners.com>	2025-08-27 09:13:31 +08:00
Shanshan Shen	0767d51dd5	[Structured Output][CI] Add test for `outlines` backend for structured output in CI (#2283 ) ### What this PR does / why we need it? Add test for `outlines` backend for structured output in CI. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests have all passed with: ```bash pytest -sv tests/e2e/singlecard/test_guided_decoding.py ``` - vLLM version: v0.10.0 - vLLM main: `53415653ff` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-08-25 09:59:13 +08:00
linfeng-yuan	4af5b80606	[Scheduler] validate max_num_batched_tokens and max_model_len in AscendSchedulerConfig (#2434 ) ### What this PR does / why we need it? Add configuration check logic for ascend scheduler: if chunked_prefill is disabled, max_num_batched_tokens couldn't be less than max_model_len, following vLLM; ### Does this PR introduce _any_ user-facing change? users cannot set max_num_batched_tokens smaller than max_model_len with ascend scheduler ### How was this patch tested? CI and vllm serving passed - vLLM version: v0.10.0 - vLLM main: `f77a0802b7` Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-08-23 19:39:44 +08:00
ZhaoJiangJiang	3629bc4431	feat: add mtp ut and fix some bugs (#2453 ) ### What this PR does / why we need it? Fix mtp mode ut ### Does this PR introduce _any_ user-facing change? Nothing ### How was this patch tested? This can be tested in the same way as a unit test. - vLLM version: v0.10.0 - vLLM main: `53415653ff` Signed-off-by: 赵江江 <zhaojiangjiang1@h-partners.com> Co-authored-by: 赵江江 <zhaojiangjiang1@h-partners.com>	2025-08-22 17:09:08 +08:00
Mengqing Cao	60ac4fb576	[QuickFix] Skip failed ut to recover CI quickly (#2484 ) ### What this PR does / why we need it? Skip failed ut to recover CI quickly related ut: - `test_embed_models_correctness`: revert me when pooler is adapted with the latest vllm main - `test_check_and_update_config_enforce_eager_mode`: revert me when the occasional failed is fixed - vLLM version: v0.10.0 - vLLM main: `8896eb72eb` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-22 14:14:51 +08:00
Mengqing Cao	1327f9be1c	Fix some ci issue and refactor modelrunner (#2445 ) ### What this PR does / why we need it? Fix some ci issue and refactor modelrunner ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.0 - vLLM main: `4d9c61993a` --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: weiguihua2 <weiguihua2@huawei.com> Co-authored-by: wangli <wangli858794774@gmail.com> Co-authored-by: weiguihua2 <weiguihua2@huawei.com>	2025-08-20 09:01:04 +08:00
xleoken	2a763b8326	[Bug] Fix bug in test_chunked.py (#1992 ) ### What this PR does / why we need it? 1. Remove the return statement, it will always skip following logic. 2. Update `deepseek` to `Qwen2.5-Instruct` for OOM in github e2e test env. 3. Fix the comparison logic ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? Local Test. - vLLM version: v0.10.0 - vLLM main: `0933f9d518` Signed-off-by: xleoken <xleoken@163.com>	2025-08-19 10:23:47 +08:00
shiyuan680	e14f2ef669	refactor select_experts of moe module (#2150 ) ### What this PR does / why we need it? this pr refactor select_experts of moe module i merge implementations of quantitative and non-quantitative method in a new class use such as vllm like ExpertsSelector.select_experts ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? test in qwen3-moe and all ut. - vLLM version: v0.10.0 - vLLM main: `e18859298d` Signed-off-by: yangcheng <yangcheng104@huawei.com> Co-authored-by: yangcheng (AJ) <y00806874@china.huawei.com>	2025-08-14 11:50:53 +08:00
whx	29aaba5f84	[Perf][MTP] Optimize reject sampler in greedy situation. (#2137 ) This PR port optimization in PR #2002 to main and makes it cleaner. - vLLM version: v0.10.0 - vLLM main: `afa5b7ca0b` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-08-11 17:37:49 +08:00
Pleaplusone	c0f0b70813	[core] Support capture custom ops into aclgraph (#2113 ) ### What this PR does / why we need it? Thanks to the PR https://github.com/vllm-project/vllm-ascend/pull/426 make vllm-ascend support the aclgraph inference to reduce the host overhead. However, the capability of aclgraph strongly relies on the functionality provided by `torch.compile`, which is the key feature supported in torch 2.x . Therefore, capture custom op into aclgraph is only possible when it can be recognize and captured by `torch.compile`. In this PR, we register the meta implementation of current custom ops to enable the fx graph capture. And by doing that, insert those custom ops into aclgraph become a natural thing to the ascend runtime. ### Does this PR introduce _any_ user-facing change? No user face change. ### How was this patch tested? Tested in unittest, we will integrate the `rotary_embedding` op into a small custom model and use `torch.compile` and aclgraph to capture and replay it to verify its functionality. - vLLM version: v0.10.0 - vLLM main: `1b99028069` --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-08-11 15:59:42 +08:00
wangxiyuan	9260910c8d	[CI] Fix broken CI (#2302 ) 1. disable test_eagle_ccorrectness test, we'll reopen it once oom error fixed. 2. drop transformers version limit for main, since vLLM rely on >=4.55.0, see: `65552b476b` 3. fix kv_connector_output bug, see: `796bae07c5` - vLLM version: v0.10.0 - vLLM main: `d1af8b7be9` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-11 11:22:32 +08:00
Icey	0bd5ff5299	Fix accuracy test config and add DeepSeek-V2-Lite test (#2261 ) ### What this PR does / why we need it? This PR fix accuracy test related to https://github.com/vllm-project/vllm-ascend/pull/2073, users can now perform accuracy tests on multiple models simultaneously and generate different report files by running: ```bash cd ~/vllm-ascend pytest -sv ./tests/e2e/models/test_lm_eval_correctness.py \ --config-list-file ./tests/e2e/models/configs/accuracy.txt ``` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? <img width="1648" height="511" alt="image" src="https://github.com/user-attachments/assets/1757e3b8-a6b7-44e5-b701-80940dc756cd" /> - vLLM version: v0.10.0 - vLLM main: `766bc8162c` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-08-08 11:09:16 +08:00
leo-pony	807f0895b2	Bump torch version to 2.7.1 (#1562 ) ### What this PR does / why we need it? Bump torch version to 2.7.1, and cleanup infer schema patch https://github.com/vllm-project/vllm-ascend/commit/857f489 (https://github.com/vllm-project/vllm-ascend/pull/837), this patch depends on also: https://github.com/vllm-project/vllm-ascend/pull/1974 ### Does this PR introduce any user-facing change? No #### How was this patch tested? CI passed torch-npu 2.7.1rc1 install guide: https://gitee.com/ascend/pytorch/tree/v2.7.1/ install depending: ``` pip3 install pyyaml pip3 install setuptools ``` install torch-npu: Closes: https://github.com/vllm-project/vllm-ascend/issues/1866 Closes: https://github.com/vllm-project/vllm-ascend/issues/1390 - vLLM version: v0.10.0 - vLLM main: `9af654cc38` --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-08-05 08:43:24 +08:00
leo-pony	e467fe1b77	Add qwen-vl model and sampling feature UT for 310I series (#2168 ) ### What this PR does / why we need it? Add qwen-vl model and sampling feature UT for 310I series - vLLM version: v0.10.0 - vLLM main: `e0f63e4a35` Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-08-02 11:26:12 +08:00
22dimensions	9e65da990e	[Misc] Add warning for incompatible Ray backend with ACL Graph mode (#2132 ) ### What this PR does / why we need it? cherry-pick #1501 from 0.9.1-dev to main Currently, Ray is not compatible with ACL Graph, so we need to fall back to eager mode when using the Ray backend. co-authored: Yizhou Liu <liu_yizhou@outlook.com> - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-08-01 09:06:09 +08:00
Icey	86bdde1ca8	Enable pytest and yaml style accuracy test (#2073 ) ### What this PR does / why we need it? This PR enabled pytest and yaml style accuracy test, users now can enable accuracy test by running: ```bash cd ~/vllm-ascend pytest -sv ./tests/e2e/singlecard/models/test_lm_eval_correctness.py \ --config ./tests/e2e/singlecard/models/configs/Qwen3-8B-Base.yaml \ --report_output ./benchmarks/accuracy/Qwen3-8B-Base.md pytest -sv ./tests/e2e/singlecard/models/test_lm_eval_correctness.py \ --config-list-file ./tests/e2e/singlecard/models/configs/accuracy.txt ``` Closes: https://github.com/vllm-project/vllm-ascend/issues/1970 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-07-31 21:39:13 +08:00
huangxialu	9c9a7cd90b	[main] adapt usage of npu_moe_gating_top_k_softmax and remove envs.SELECT_GATING_TOPK_SOTFMAX_EXPERTS (#2112 ) backport of v0.9.1-dev: https://github.com/vllm-project/vllm-ascend/pull/1902 origin main npu_moe_gating_top_k_softmax: https://github.com/vllm-project/vllm-ascend/pull/1355 - vLLM version: v0.10.0 - vLLM main: `055bd3978e` Signed-off-by: huangxialu <huangxialu1@huawei.com>	2025-07-31 21:05:56 +08:00

1 2

81 Commits