xc-llm-ascend

Author	SHA1	Message	Date
Icey	137f28341d	[Tests] Add qwen3-8b nightly test (#5597 ) ### What this PR does / why we need it? Add qwen3-8b nightly test - vLLM version: v0.13.0 - vLLM main: `7157596103` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2026-01-07 18:42:05 +08:00
wangxiyuan	6f7a81cd9f	[CI] cleanup single/multi-card test (#5623 ) 1. speed up e2e light test. 2. create `2-cards` and `4-cards` folder in multicard 3. move ops to nightly 4. run test in Alphabetical Order - vLLM version: v0.13.0 - vLLM main: `8be6432bda` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-07 14:13:34 +08:00
wangyibo1005	25baf6df09	[Feature]EPLB:Adapt DispatchGmmCombineDecode operator to eplb tensor list and expert token numbers (#5552 ) #### What this PR does / why we need it? This PR adapts the DispatchGmmCombineDecode operator to the eplb tensor list and expert token numbers. The operator supports gmm1, gmm2, gmm1Scale and gmm2Scale in list format. The operator supports counting how many tokens each local expert receives via expertTokensNum. - vLLM version: v0.13.0 - vLLM main: `7157596103` For more information about this operator, please refer to the RFC issue https://github.com/vllm-project/vllm-ascend/issues/5476	2026-01-07 11:23:42 +08:00
starmountain1997	086c093347	[CI] Add DeepSeek-V3.2-W8A8 nightly ci test (#5371 ) # What this PR does / why we need it? Add DeepSeek-V3.2-W8A8 dual-node nightly CI test and update A3 nightly test configuration: 1. Add DeepSeek-V3.2-W8A8 dual-node test: tests/e2e/nightly/multi_node/config/DeepSeek-V3_2-W8A8-A3-dual-nodes.yaml - 2 nodes, 16 NPUs per node (32 NPUs total) - Configuration: 2P+1D (data-parallel-size=4, tensor-parallel-size=8, data-parallel-size-local=2) - Includes performance and accuracy benchmarks with GSM8K dataset 2. Update A3 nightly workflow: .github/workflows/nightly_test_a3.yaml - Added DeepSeek-V3.2-W8A8 dual-node test to the A3 nightly test matrix - Test name: multi-node-dpsk3.2-2node 3. Improve test scripts: Updated .github/workflows/_e2e_nightly_multi_node.yaml and related scripts for better multi-node testing support test on A3 instances - Performance baseline: 1 (threshold: 0.97) - Accuracy baseline: 95% (threshold: 5%) - Test dataset: GSM8K with 512 prompts for performance, gsm8k-lite for accuracy --------- Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-01-07 10:02:02 +08:00
InSec	089ca2ddcc	[Nightly][Test] Add Qwen3-Next-80B-A3B-Instruct-W8A8 nightly test (#5616 ) ### What this PR does / why we need it? There was an accuracy issue with Qwen3-Next-80B-A3B-Instruct-W8A8 model in the old version of Triton-Ascend, so, we are now adding one nightly test to maintain it. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: IncSec <1790766300@qq.com>	2026-01-06 17:36:00 +08:00
Li Wang	c5e2f48510	[CI] mv ops to correct path (#5615 ) ### What this PR does / why we need it? mv ops to correct path :`tests/e2e/nightly/single_node/ops/singlecard_ops/triton` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-05 23:17:07 +08:00
dsxsteven	129ba9fe1b	[BugFix] Fix Smoke Testing Bug for DSR1 longseq (#5613 ) ### What this PR does / why we need it? Fix Smoke Testing Bug for DSR1 longseq. We need to make this change because the daily smoke test case is throwing an error: "max_tokens or max_completion_tokens is too large: 32768.This model's maximum context length is 32768 tokens and your request has 128 input tokens". The error occurs because max-out-len equals max-model-len, so the input tokens plus the requested output tokens exceed the context length. We can fix this error by increasing the max-model-len argument in the script. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: daishixun <dsxsteven@sina.com>	2026-01-05 22:40:28 +08:00
Angazenn	11e75494b1	[TRITON][TEST]Add nightly test for triton split_qkv_rmsnorm_rope (#5267 ) ### What this PR does / why we need it? Add nightly test for triton split_rmsnorm_rope ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-01-05 21:35:37 +08:00
ZT-AIA	58e8d19c35	[UT]add triton ops ut : test_fused_qkvzba_split_reshape_cat (#5474 ) ### What this PR does / why we need it? [UT]add triton ops ut : test_fused_qkvzba_split_reshape_cat ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? pytest -sv tests/ut/ops/test_fused_qkvzba_split_reshape_cat.py - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: ZT-AIA <1028681969@qq.com>	2026-01-05 20:05:07 +08:00
Yizhou	755caeb06e	[Feat][Spec] Optimize token index calculation in spec decode with Triton kernel (#5356 ) ### What this PR does / why we need it? Replace multiple PyTorch operations with a fused Triton kernel to determine token indices for sampling during speculative decoding. This reduces kernel launch overhead and memory traffic, improving overall performance on Ascend hardware. --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2026-01-05 16:51:29 +08:00
daniel	8ffe3f5d78	feat: implement high-performance Triton kernels for rejection sampling: optimization for rejection_random_sample_kernel (#5259 ) ### What this PR does / why we need it? This PR introduces optimized Triton implementations for the rejection_random_sample_kernel, delivering superior performance compared to the existing Triton implementations. The new Triton kernels maintain full functional accuracy while delivering significant performance improvements across various batch sizes and MTP configurations. ### Does this PR introduce _any_ user-facing change? Yes, this PR modifies rejection_sampler.py to use optimized Triton kernels: rejection_random_sample_kernel is modified and optimized ### How was this patch tested? performance benchmark results (Batch Size | MTP | original implementation (us) | optimized version (us)): 1 | 1 | 2.934 | 3.64; 8 | 1 | 4.467 | 4; 32 | 1 | 6.98 | 4.54; 64 | 1 | 11.087 | 6.42; 128 | 1 | 13.414 | 7.84; 256 | 1 | 19.66 | 8.487; 512 | 1 | 39.908 | 11.62; 1024 | 1 | 81.781 | 18.16; 2048 | 1 | 137.923 | 32.934; 1 | 2 | 3.4 | 4.02; 8 | 2 | 3.74 | 4.24; 32 | 2 | 6.373 | 7.394; 64 | 2 | 9.747 | 6.46; 128 | 2 | 12.98 | 7.76; 256 | 2 | 20.834 | 9.787; 512 | 2 | 39.314 | 13.56; 1024 | 2 | 83.135 | 22.387; 2048 | 2 | 157.563 | 40.607 - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: 1024daniel <xxltju324@gmail.com>	2026-01-05 16:03:02 +08:00
Trunrain	91bf524364	[BugFix][kernel] fix matmul_allreduce_add_rmsnorm_kernel (#5335 ) ### What this PR does / why we need it? fix matmul_allreduce_add_rmsnorm_kernel, add hccl Init, SetCcTiling interface test case use multicard-4 ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? pytest -sv tests/e2e/nightly/ops/test_matmul_allreduce_add_rmsnorm.py multicard-4 pass https://github.com/vllm-project/vllm-ascend/actions/runs/20502630658/job/58914474652?pr=5335 - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` Signed-off-by: tongrunze <t00574058@china.huawei.com> Co-authored-by: tongrunze <t00574058@china.huawei.com>	2026-01-05 15:19:54 +08:00
weiguihua2	549be94397	[Bugfix] fix pcp + eplb error (#5561 ) ### What this PR does / why we need it? Fix bugs in the PCP overlay feature: 1. Fix the bug related to PCP and EPLB overlap by including PCP size in the word_size calculation. 2. In the PCP pooling scenario, a prompt has been added for setting the cp_kv_cache_interleave_size. - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-01-05 14:08:11 +08:00
dsxsteven	37fd48bee5	[CI] Move longseq Nightly CI (#5577 ) ### What this PR does / why we need it? move longseq nightly CI to correct path due to #5479 [1/N] Refactor nightly test structure Signed-off-by: daishixun <dsxsteven@sina.com>	2026-01-04 15:42:43 +08:00
dsxsteven	3c7e6c6817	[CI] Add multi-nodes longseq configs of DeepSeek-R1-W8A8 & Qwen3-235B-W8A8 (#5381 ) ### What this PR does / why we need it? add DeepSeek-R1-W8A8 and Qwen3-235B-W8A8 configs in multi-nodes and longseq (PCP&DCP) scenario - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: daishixun <dsxsteven@sina.com>	2026-01-04 10:38:40 +08:00
Jade Zheng	7d5242faca	[Refactor] Formatting output types related to FuseMoE (#5481 ) Currently in the Fused MoE module, functions of classes like MoECommMethod and MoETokenDispatcher output data in dictionary or tuple format, which hampers code maintainability, readability, and extensibility. This PR introduces dataclasses for these key output types to address these issues. - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-31 14:24:37 +08:00
Jade Zheng	38570cfeb6	[Feature] Support kv nz feature for DeepSeek decode node in disagg-prefill scenario (#3072 ) By converting the KV cache from ND to NZ format when the decode node receives it, this PR ensures that the KV NZ feature works correctly during the decoding phase in disagg-prefill scenario. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com> Co-authored-by: ghphotoframe <854746559@qq.com> Co-authored-by: alex101-ops <alex1015718386@gmail.com>	2025-12-31 14:24:04 +08:00
Li Wang	2ee17e50a1	[2/N] Upgrade nightly doc (#5534 ) ### What this PR does / why we need it? Follow up https://github.com/vllm-project/vllm-ascend/pull/5479, upgrade the corresponding doc for developers - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-31 09:11:42 +08:00
Li Wang	e760aae1df	[1/N] Refactor nightly test structure (#5479 ) ### What this PR does / why we need it? This patch is a series of refactoring actions, including clarifying the directory structure of nightly tests, refactoring the config retrieval logic, and optimizing the workflow, etc. This is the first step: refactoring the directory structure of nightly to make it more readable and logical. - vLLM version: v0.13.0 - vLLM main: `5326c89803` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-30 19:03:02 +08:00
zzzzwwjj	71f729a661	Revert "moe_gating_top_k" (#5512 ) Reverts vllm-project/vllm-ascend#5271 It breaks e2e test - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1`	2025-12-30 15:05:47 +08:00
ZCG12345	45c3c279e2	moe_gating_top_k (#5271 ) 1. What this PR does / why we need it? This PR supports the moe_gating_top_k operator, which enables post-positioned renormalization (renorm) on the basis of softmax. 2. Does this PR introduce any user-facing change? No user-facing changes are required. 3. How was this patch tested? This patch was tested with the test_npu_moe_gating_top_k test case. vLLM version: release/v0.13.0 vLLM main: `ad32e3e19c` --------- Signed-off-by: ZCG12345 <2097562023@qq.com> Signed-off-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com> Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>	2025-12-30 09:28:01 +08:00
jiazhengyi	d5f72835e6	[OP] add custom op aclnnMoeInitRoutingCustom (#5251 ) ### What this PR does / why we need it? This pull request introduces a new custom operator `aclnnMoeInitRoutingCustom` for Mixture-of-Experts models. It can be replaced by `aclnnMoeInitRoutingV3` once CANN 8.5 becomes available. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? --------- Signed-off-by: jiazhengyi <jiazhengyi@huawei.com> Signed-off-by: Chenxi Qian <chenxi.qian.cq@outlook.com> Co-authored-by: jiazhengyi <jiazhengyi@huawei.com> Co-authored-by: Chenxi Qian <chenxi.qian.cq@outlook.com>	2025-12-29 19:29:40 +08:00
Li Wang	1d81bfaed1	Fix nightly (#5413 ) ### What this PR does / why we need it? This patch mainly does the following: 1. Bugfix for multi_node_tests log: log names must be unique when uploading logs. 2. Optimize `get_cluster_ips` logic and increase the max retry count for robustness. 3. Temporarily abandon the existing gh-proxy until it is stable enough. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-27 18:16:46 +08:00
Nengjun Ma	f5af6bbd1e	[CI] Add qwen-235b-a22b a2 multi-node test (#5393 ) ### What this PR does / why we need it? Qwen3-235B-A22B is one of the TopN models, but test coverage for the Qwen3-235B-A22B model on Atlas A2 is currently lacking, and most of the machines owned by community users are A2. When users encounter problems, we currently have no way of knowing whether the model runs normally on the corresponding version of the code, so we added this test. In addition, we currently cover TopN models such as qwen-dense, qwen3-30b-a3b, Qwen3-Next and Qwen2.5-Omni, but Qwen3-235B-A22B is missing. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? Tested with multi-node; results are as follows: 1. Accuracy test (execution time: 25 minutes): the test ran successfully, with the following accuracy: ``` dataset version metric mode vllm-api-general-chat --------- --------- -------- ------ ----------------------- gsm8k 7cd45e accuracy gen 95.68 ``` 2. 
Perf test (Time for executing this test case: 1h15 minutes) test running successfully, throughput as following(This is the atlas A3, for A2 the result about A3/1.3): ``` ╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤══════╕ │ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │ ╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪══════╡ │ E2EL │ total │ 384086.3958 ms │ 214767.0486 ms │ 528014.771 ms │ 387621.5746 ms │ 388776.7492 ms │ 390164.3559 ms │ 488105.8512 ms │ 2800 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤ │ TTFT │ total │ 159409.9868 ms │ 1849.4588 ms │ 302439.6965 ms │ 162183.7007 ms │ 162965.477 ms │ 164274.1936 ms │ 262578.6041 ms │ 2800 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤ │ TPOT │ total │ 149.8842 ms │ 130.2175 ms │ 151.2625 ms │ 150.473 ms │ 150.6978 ms │ 150.9102 ms │ 151.2131 ms │ 2800 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤ │ ITL │ total │ 149.6789 ms │ 0.0099 ms │ 283.0242 ms │ 150.3276 ms │ 156.8649 ms │ 168.1372 ms │ 199.378 ms │ 2800 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤ │ InputTokens │ total │ 3654.3079 │ 3108.0 │ 4280.0 │ 3629.0 │ 3728.0 │ 3842.1 │ 4079.0 │ 2800 │ 
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤ │ OutputTokens │ total │ 1500.0 │ 1500.0 │ 1500.0 │ 1500.0 │ 1500.0 │ 1500.0 │ 1500.0 │ 2800 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤ │ OutputTokenThroughput │ total │ 3.935 token/s │ 2.8408 token/s │ 6.9843 token/s │ 3.8698 token/s │ 3.8799 token/s │ 3.9916 token/s │ 6.2137 token/s │ 2800 │ ╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧══════╛ ╒══════════════════════════╤═════════╤═══════════════════╕ │ Common Metric │ Stage │ Value │ ╞══════════════════════════╪═════════╪═══════════════════╡ │ Benchmark Duration │ total │ 4391524.3389 ms │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Requests │ total │ 2800 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Failed Requests │ total │ 0 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Success Requests │ total │ 2800 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Concurrency │ total │ 244.8903 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Max Concurrency │ total │ 256 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Request Throughput │ total │ 0.6376 req/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Input Tokens │ total │ 10232062 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Prefill Token Throughput │ total │ 22.924 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total generated tokens │ total │ 4200000 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Input Token Throughput │ total │ 2329.9568 token/s │ 
├──────────────────────────┼─────────┼───────────────────┤ │ Output Token Throughput │ total │ 956.3877 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Token Throughput │ total │ 3286.3445 token/s │ ╘══════════════════════════╧═════════╧═══════════════════╛ ``` - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-12-26 23:46:09 +08:00
jiangyunfan1	48854aef5c	[TEST]Add sending request with and without chat (#5286 ) ### What this PR does / why we need it? This PR adds methods for sending chat and non-chat requests, which we need in order to test many follow-up cases. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running the test - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>	2025-12-26 18:04:17 +08:00
Zhu Yi Lin	18302c8467	Revert "Add MagicMTP(block verify) and Triton optimization (#4443 )" (#5380 ) ### What this PR does / why we need it? #4443 introduces a precision issue in scenarios where MTP >= 3 + deepseek v3.1, and this pr reverts it - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` Signed-off-by: GDzhu01 <809721801@qq.com>	2025-12-26 15:06:13 +08:00
wangxiyuan	29d2fe653d	cleanup ascend config (#5296 ) 1. refresh additional config doc 2. move kv config logic to platform. 3. improve `dump_config` init logic and rename it to `dump_config_path` this change is user impacted. dump_config is changed from dict to string. 4. correct `enable_async_exponential` type 5. remove useless `chunked_prefill_for_mla` - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-26 14:07:37 +08:00
ZT-AIA	adaa89a7a5	Update vllm pin to 12.25 (#5342 ) ### What this PR does / why we need it? - Fix vllm break in the pr: 1.[Drop v0.14 deprecations ]https://github.com/vllm-project/vllm/pull/31285 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: ZT-AIA <1028681969@qq.com>	2025-12-26 14:05:40 +08:00
Li Wang	c2f776b846	[Nightly] Initial logging for nightly multi-node testing (#5362 ) ### What this PR does / why we need it? Currently, our multi-node logs only show the master node's logs (via the Kubernetes API), which is insufficient for effective problem localization if other nodes experience issues. Therefore, this pull request adds the ability to upload logs for other nodes. Next plan: Output structured directory logs, including logs from each node and the polog. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-26 11:39:07 +08:00
Qi Mao	7372225bcb	[FIX] Update _causal_conv1d_update_kernel for Efficient Conv State Handling on NPU (#5322 ) Description: This PR updates the implementation of the Triton operator for deployment on NPU devices, focusing on optimizing grid size and memory handling based on NPU limitations. Design Plan: Grid Calculation: The grid size is now dynamically calculated by batch and dim to ensure that the number of programs executed does not exceed the NPU's vector core capacity. This ensures optimal parallelism without overloading the hardware. Data Block Handling: Due to the limited on-chip memory (UB) on Ascend NPUs, this implementation splits large data into smaller chunks of 32k or less per block. The kernel performs a for-loop to process the data in these smaller chunks, minimizing memory usage and avoiding potential overflows. Changes Compared to GPU Implementation: Grid and Block Sizing: For GPU, the grid and block size were determined based on available thread counts and memory size. In contrast, the NPU version dynamically adjusts these parameters using B_TILE and BLOCK_N to optimize for NPU’s architecture. Memory Chunking: The original GPU implementation did not require chunking due to the higher available memory and processing capacity. For the NPU, data is divided into smaller chunks (32k or smaller) to comply with memory constraints on the device. The kernel has been modified to handle this chunking mechanism inside a loop. Optimized Thread Usage: The NPU implementation takes into account the hardware-specific thread limit (24 threads per vector core), ensuring that the number of active programs is aligned with the NPU's vector core count, avoiding over-subscription that would lead to serial processing. This PR ensures that the operator functions efficiently on Ascend NPU, considering hardware limitations while maintaining the same functionality and input parameters as the GPU implementation. 
- vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: maoxx241 <maomaoyu870@gmail.com>	2025-12-26 09:12:30 +08:00
Aoxuan Chen	8caad0510d	fix e2e rejection-sampler error (#5341 ) ### What this PR does / why we need it? Fixed the error in the CI process for vllm-ascend/tests/e2e/nightly/ops/triton/test_rejection_sampler.py Error: test_rejection_sampler_block_verify_triton_kernel: duplicate parametrization of 'vocab_size'. - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` Signed-off-by: chenaoxuan <cax1165@163.com>	2025-12-25 11:39:38 +08:00
wangxiyuan	2ae0bad96d	Remove VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE (#5272 ) `VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE` is only used together with `VLLM_ASCEND_ENABLE_PREFETCH_MLP`, which is completely useless. This PR removes it. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-25 11:09:56 +08:00
Aoxuan Chen	6d25372baa	Add MagicMTP(block verify) and Triton optimization (#4443 ) ### What this PR does / why we need it? 1. MagicMTP (paper: "Block Verification Accelerates Speculative Decoding") was introduced to consider the influence among multiple draft tokens, improving the acceptance rate without compromising accuracy. 2. The rejection sampling logic in rejection_sampler.py was restructured using Triton-Ascend, enabling it to operate under high concurrency, thus resolving CPU and NPU operator bottlenecks and enhancing throughput. ### Does this PR introduce _any_ user-facing change? MagicMTP will automatically take effect when the parameter "num_speculative_tokens" >= 3. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: chenaoxuan <cax1165@163.com>	2025-12-25 09:00:25 +08:00
Ascendyh	a90482803d	[Kernel] add l2norm triton kernel (#4595 ) ### What this PR does / why we need it? This pull request introduces an L2 normalization kernel implemented in Triton, specifically optimized for Ascend NPUs. ### Does this PR introduce _any_ user-facing change? No, this PR does not introduce any user-facing changes. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: Ascendyh <hw7osiris@outlook.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-25 06:06:18 +08:00
Wang Kunpeng	c3a8d13ca7	[refactor] Remove unnecessary attributes from set_ascend_forward_context (#5204 ) ### What this PR does / why we need it? Remove unnecessary attributes from set_ascend_forward_context 1.prefetch_stream 2.weight_prefetch_method ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-12-23 08:49:52 +08:00
jiangyunfan1	3ba920a65b	[TEST]Update mm param --mm-processor-cache-gb (#5242 ) ### What this PR does / why we need it? This PR updates the mm param --mm-processor-cache-gb, we need it to run the case ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running the test - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>	2025-12-22 18:54:03 +08:00
wangqiankun13	118b0ed346	[Feature] Add token mask for DispatchGmmCombineDecode operator (#5171 ) ### What this PR does / why we need it? In this PR, DispatchGmmCombineDecode adds an optional input x_active_mask; only tokens whose mask value is True are dispatched and handled. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangqiankun <wangqiankun13@huawei.com>	2025-12-19 16:31:48 +08:00
zxr2333	073a3a6e6c	[Doc][P/D] Fix MooncakeConnector's name (#5172 ) ### What this PR does / why we need it? vLLM community has integrated their MooncakeConnector. The original scripts will now find this MooncakeConnector instead of the one from vLLM-Ascend. All scripts that involve using the MooncakeConnector need to be modified to another name. ### Does this PR introduce _any_ user-facing change? Yes, users need to use a new name to load vLLM-Ascend MooncakeConnector. ### How was this patch tested? By CI. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2025-12-18 22:29:19 +08:00
Li Wang	0f92d34a70	[CI] Pull latest vllm-ascend src before tests (#4988 ) ### What this PR does / why we need it? Currently, our image build suffers from errors during cross-compilation, which sometimes causes the image build to fail (see https://github.com/vllm-project/vllm-ascend/actions/runs/20152861650/job/57849208186). As a result, the nightly test code is not always the latest version. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-13 19:04:14 +08:00
Li Wang	5b12c068f9	[Nightly] Remove gen_ranktable logic (#4941 ) ### What this PR does / why we need it? Since the `llmdatadist` has sunset, the logic gen_ranktable should also be removed - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-12 17:20:18 +08:00
QilaiZhang	78bf211539	[OPS] support triton causal_conv1d_fn ops (#4119 ) ### What this PR does / why we need it? Support triton causal_conv1d_fn ops. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: QilaiZhang <245706640@qq.com>	2025-12-11 15:52:39 +08:00
chenjunyi	c12eb22cbe	[feat] mlapo add bf16 no_quant support (#4852 ) ### What this PR does / why we need it? This PR adds mlapo operation support for bf16 no_quant mode. ### Does this PR introduce _any_ user-facing change? This PR makes quant related parameters optional. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: chenjunyi <isjunyi.chen@gmail.com>	2025-12-11 11:06:56 +08:00
zhangyiming	c95c271538	[E2E] Optimize nightly testcase. (#4886 ) ### What this PR does / why we need it? Optimize nightly testcase. Changes: - tests/e2e/nightly/multi_node/config/models/Qwen3-235B-A3B.yaml: Add accuracy and performance benchmark - tests/e2e/models/configs/Qwen3-8B-Base.yaml: Delete - tests/e2e/models/configs/internlm-7b.yaml: Change to internlm3-8b-instruct - tests/e2e/nightly/models/test_deepseek_r1_w8a8_eplb.py: Change to DeepSeek-R1-0528-W8A8 model - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: menogrey <1299267905@qq.com>	2025-12-11 10:15:39 +08:00
Li Wang	89733111fa	[Nightly] Optimize nightly online test logger info (#4798 ) ### What this PR does / why we need it? This patch makes some small optimizations to the nightly CI: 1. Adjust the polling frequency with which the service's startup logs are printed, in order to obtain useful information more quickly. 2. Shorten the timeout for waiting on the server. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-10 09:24:19 +08:00
wangxiyuan	835b4c8f1d	Drop torchair (#4814 ) aclgraph is stable and fast now. Let's drop torchair graph mode now. TODO: some logic to adapt torchair should be cleaned up as well. We'll do it in the following PR. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-10 09:20:40 +08:00
Trunrain	ba9cda9dfd	[Kernel] add custom op MatmulAllreduceAddRmsnorm (#4606 ) What this PR does / why we need it? Optimization of the fused operator for Qwen3 32B: Matmul, AllReduce, Add, and RMSNorm Does this PR introduce _any_ user-facing change? No How was this patch tested? vLLM version: v0.11.2 vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: tongrunze <t00574058@china.huawei.com> Co-authored-by: tongrunze <t00574058@china.huawei.com>	2025-12-10 09:05:33 +08:00
Nengjun Ma	863a5a5a17	Add gsm8k accuracy test for multi-node Qwen3-235B-A22B (#4802 ) ### What this PR does / why we need it? There is currently no accuracy test for the qwen3-235B-A22B model. Test result: dataset version metric mode vllm-api-general-chat --------- --------- -------- ------ ----------------------- gsm8k 7cd45e accuracy gen 96.29 Test case running time: 30 minutes - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-09 23:05:41 +08:00
wangxiaoteng888	a77045f355	[P/D][main]Offline the llmdatadist connector related parts of the code and files. (#4780 ) ### What this PR does / why we need it? As support for the mooncake connector is now available, the llmdatadist connector is no longer being maintained, so the llmdatadist-related files need to be retired. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2025-12-09 22:36:43 +08:00
wangqiankun13	9567e5dd8c	[kernel] Adapt DispatchGmmCombineDecode operator to parameters of small operators (#4790 ) ### What this PR does / why we need it? This PR adapts the DispatchGmmCombineDecode operator to the parameters of small operators. 1. This operator no longer requires permuting the weights and scales of GMM1. 2. This operator no longer requires transposing the weights of GMM2. Therefore, this operator and the small operators can use the same parameters (weights and scales), which is beneficial for model adaptation. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangqiankun <wangqiankun13@huawei.com>	2025-12-09 16:17:06 +08:00
wangxiyuan	0b65ac6c4b	remove useless patch (#4699 ) patch_config is useless now. Let's remove it - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-08 11:02:42 +08:00

208 Commits