xc-llm-ascend

Author	SHA1	Message	Date
ZT-AIA	96b90ad625	[CI] repair bug custom op ci conftest.py (#8786 ) ### What this PR does / why we need it? Fix a bug in the custom op CI `conftest.py`: some test cases fail or are skipped, which led to incorrect time retrieval. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? nightly Signed-off-by: ZT-AIA <1028681969@qq.com>	2026-04-29 10:17:08 +08:00
SILONG ZENG	2e2aaa2fae	[Doc][v0.18.0] Fix documentation formatting and improve code examples (#8701 ) ### What this PR does / why we need it? This PR fixes various documentation issues and improves code examples throughout the project. Signed-off-by: MrZ20 <2609716663@qq.com>	2026-04-28 09:01:25 +08:00
ZT-AIA	0cc76860d5	[CI ][Misc] Add timeout check for custom op CI and optimize test parameters (#8755 ) ### What this PR does / why we need it? This PR introduces a mechanism to track test duration in `conftest.py` and skip subsequent tests in a file if a certain number of tests exceed a timeout threshold. This is intended to prevent CI hangs or long-running nightly tests. Additionally, it reduces the parameter space for `test_fused_qkvzba_split_reshape_cat.py` to further optimize CI runtime. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? nightly Signed-off-by: ZT-AIA <1028681969@qq.com>	2026-04-27 21:48:54 +08:00
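The skip-after-timeout mechanism described in #8755 can be sketched roughly as below. This is a minimal illustration, not the code in `conftest.py`: the class name `SlowTestTracker`, the 300-second threshold, and the limit of two slow tests are all assumptions made for the example.

```python
from collections import defaultdict


class SlowTestTracker:
    """Counts slow tests per file and reports when a file should be skipped.

    Hypothetical helper: the name and the default values are illustrative only.
    """

    def __init__(self, timeout_s: float = 300.0, max_slow_tests: int = 2):
        self.timeout_s = timeout_s
        self.max_slow_tests = max_slow_tests
        self._slow_counts = defaultdict(int)

    def record(self, test_file: str, duration_s: float) -> None:
        """Record one finished test; count it if it exceeded the timeout."""
        if duration_s > self.timeout_s:
            self._slow_counts[test_file] += 1

    def should_skip(self, test_file: str) -> bool:
        """True once enough tests in this file have exceeded the timeout."""
        return self._slow_counts[test_file] >= self.max_slow_tests


tracker = SlowTestTracker(timeout_s=300.0, max_slow_tests=2)
tracker.record("test_a.py", 310.0)   # slow
tracker.record("test_a.py", 12.0)    # fast
print(tracker.should_skip("test_a.py"))  # only one slow test so far
tracker.record("test_a.py", 450.0)   # slow
print(tracker.should_skip("test_a.py"))  # limit of two slow tests reached
```

In a pytest `conftest.py`, something like `record` could be called from a `pytest_runtest_makereport` hook and `should_skip` checked in `pytest_runtest_setup` before calling `pytest.skip(...)`; whether the PR wires it up this way is not stated in the commit message.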
hucong	8a671a109c	[CI][Cherry-pick] Relax TTFT benefits threshold from 0.4 to 0.5 to account for DP load imbalance (#8684 )

Cherry-pick https://github.com/vllm-project/vllm-ascend/pull/8683

### What this PR does / why we need it?
This PR relaxes the TTFT threshold from `0.4` to `0.5` to improve robustness under Data Parallel (DP) load imbalance.

#### Background
The current assertion enforces: prefix75 < prefix0 * 0.4

#### ❌ Nightly Failure Cases (Observed)
| prefix0 | threshold (0.4x) | prefix75 | delta |
|---------|------------------|----------|---------|
| 4696.24 | 1878.50 | 1883.99 | +5.49 |
| 4696.20 | 1878.48 | 1896.01 | +17.53 |
| 4636.73 | 1854.69 | 1902.48 | +47.79 |
| 4655.17 | 1862.07 | 1913.54 | +51.47 |
| 4685.35 | 1874.14 | 1919.36 | +45.22 |
| 4660.33 | 1864.13 | 1915.41 | +51.28 |
| 4648.30 | 1859.32 | 1950.50 | +91.18 |
| 4655.30 | 1862.12 | 1962.32 | +100.20 |

---

#### ✅ Nightly Passing Cases (Observed)
| prefix0 | threshold (0.4x) | prefix75 | margin |
|---------|------------------|----------|---------|
| 4685.64 | 1874.26 | 1864.46 | -9.80 |
| 5520.28 | 2208.11 | 1928.97 | -279.14 |
| 4639.23 | 1855.69 | 1846.86 | -8.83 |
| 4651.64 | 1860.66 | 1854.30 | -6.36 |
| 4640.39 | 1856.15 | 1840.32 | -15.83 |
| 4677.20 | 1870.88 | 1848.35 | -22.53 |

---

#### Key Observations
- Failures exceed the threshold by only ~5 ms to ~100 ms (~0.3%–5%)
- Passing cases often have very tight margins (~5–10 ms)
- There is clear overlap between pass and fail boundaries
- Many failures are borderline violations, not real regressions

---

#### Root Cause
The instability is caused by Data Parallel (DP) load imbalance, which introduces systematic variance:
- Uneven request distribution across workers
- Queueing delays
- Increased TTFT variance (especially for `prefix75`)

---

#### Conclusion
- The current threshold (`0.4x`) is too strict
- Observed natural fluctuation:
  - Absolute: up to ~100 ms
  - Relative: up to ~5% over threshold
- Pass/fail boundary is currently too sensitive to runtime jitter

---

#### Change
We relax the threshold: 0.4 → 0.5

This adjustment:
- Accounts for expected runtime variance
- Reduces false negatives
- Maintains a meaningful performance constraint

Even with `0.5`, the requirement remains strict (`prefix75 < 50% of prefix0`) and does not mask real regressions.

---

### Does this PR introduce _any_ user-facing change?
No. This change only affects internal test assertions and does not impact user-facing behavior or model performance.

---

### How was this patch tested?
- Verified against existing TTFT test cases:
  - Previously failing cases (due to small variance) now pass
  - No regressions observed in other scenarios
- Confirmed that failures were due to DP load imbalance rather than actual performance degradation
- Ensured the updated threshold still enforces a meaningful constraint on TTFT

Signed-off-by: underfituu <hzhucong@163.com>	2026-04-27 16:38:07 +08:00
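To make the effect of the relaxed threshold concrete, the sketch below replays the observed failure cases listed above against both ratios. The function name `ttft_check` is an assumption for illustration; only the inequality `prefix75 < prefix0 * ratio` and the numbers come from the PR description.

```python
def ttft_check(prefix0_ms: float, prefix75_ms: float, ratio: float) -> bool:
    """Return True when the prefix75 TTFT is below ratio * prefix0 TTFT."""
    return prefix75_ms < prefix0_ms * ratio


# (prefix0, prefix75) pairs copied from the observed nightly failure cases.
observed_failures = [
    (4696.24, 1883.99),
    (4696.20, 1896.01),
    (4636.73, 1902.48),
    (4655.17, 1913.54),
    (4685.35, 1919.36),
    (4660.33, 1915.41),
    (4648.30, 1950.50),
    (4655.30, 1962.32),
]

for prefix0, prefix75 in observed_failures:
    old = ttft_check(prefix0, prefix75, 0.4)  # previous threshold
    new = ttft_check(prefix0, prefix75, 0.5)  # relaxed threshold
    print(
        f"prefix0={prefix0:.2f} prefix75={prefix75:.2f} "
        f"ratio={prefix75 / prefix0:.3f} pass@0.4={old} pass@0.5={new}"
    )
```

Every listed case has a measured ratio slightly above 0.4 and well below 0.5, which is why these runs fail the old assertion and pass the new one.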
ZT-AIA	2cee0c32e5	[CI] Repair custom op nightly (#8707 ) ### What this PR does / why we need it? #### Fixed: 1. The function name in test_moe_init_routing_custom.py was incorrect; it was not named as a test case function starting with 'test'. 2. In the nightly ops singlecard_ops run, print timestamps for test cases, making it easier to locate issues quickly after a timeout occurs. #### To be repaired: 1. The test_penality.py test case partially fails and takes one hour. The owner has been notified to fix the case after the 5.1 holiday. (Yang Cheng) 3. The csrc/copy_and_expand_eagle_inputs operator invoked by test_copy_and_expand_eagle_inputs.py supports only 910b. (HF001) 4. The test_causal_conv1d.py test case is incorrect. The triton operator `causal_conv1d_fn` invoked by the test case uses `get_forward_context`, but the operator test case does not call `set_forward_context` (which is normally set when running in the model). (Zeng Tian) 5. The test_causal_conv1d.py test case is incorrect. In this scenario, uboverflow occurs when the triton operator is invoked. (Zeng Tian) ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? nightly Signed-off-by: ZT-AIA <1028681969@qq.com>	2026-04-25 19:05:33 +08:00
ZT-AIA	81d0a37bf5	[CI] repair ci custom op (#8571 ) ### What this PR does / why we need it? Skip test_copy_and_expand_eagle_inputs for now. It can be restored once the owner completes the subsequent repairs. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? nightly Signed-off-by: ZT-AIA <1028681969@qq.com>	2026-04-24 17:06:25 +08:00
linfeng-yuan	5c048a9b71	[Doc][releases/v0.18.0] fix documentation error or non-standard description (#8626 ) ### What this PR does / why we need it? fix documentation error or non-standard description in releases/v0.18.0 branch ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Documentation check. --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-04-23 18:55:44 +08:00
herizhen	ff76c6780e	[releases/v0.18.0][Doc][Misc] Modifying Configuration Parameters (#8618 ) ### What this PR does / why we need it? This PR renames the environment variable VLLM_NIXL_ABORT_REQUEST_TIMEOUT to VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT to align with the Mooncake connector naming convention. It also updates the documentation and test configurations to reflect this change and adjusts the suggested timeout value in the documentation to 480 seconds for consistency. ### Does this PR introduce _any_ user-facing change? Yes. The environment variable for configuring the abort request timeout has been renamed. Users should update their environment settings from VLLM_NIXL_ABORT_REQUEST_TIMEOUT to VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT. ### How was this patch tested? The changes were verified by updating the corresponding test configuration files and ensuring consistency across the documentation. --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>	2026-04-23 16:23:31 +08:00
Frank Chen	a4ba82e138	[BugFix] Require kv producer for layer sharding (#8563 ) ### What this PR does / why we need it? This PR introduces stricter Ascend `additional_config.layer_sharding` validation to the 0.18 release branch so that it is only accepted on PD-disaggregated P nodes with `kv_role="kv_producer"`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E test --------- Signed-off-by: chenchuw886 <chenchuw@huawei.com> Co-authored-by: chenchuw886 <chenchuw@huawei.com>	2026-04-23 16:06:53 +08:00
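The validation rule in #8563 can be pictured with a small sketch like the one below. It assumes the two values have already been extracted from the configuration; the function name `validate_layer_sharding`, the argument types, and the error text are hypothetical and are not taken from the vllm-ascend source.

```python
from typing import Optional, Sequence


def validate_layer_sharding(
    layer_sharding: Optional[Sequence[str]],
    kv_role: Optional[str],
) -> None:
    """Reject layer_sharding unless the node is a KV producer (a P node).

    Hypothetical helper: only the rule itself comes from the PR description.
    """
    if not layer_sharding:
        return  # option not set, nothing to validate
    if kv_role != "kv_producer":
        raise ValueError(
            "additional_config.layer_sharding is only accepted when "
            f'kv_role="kv_producer", got kv_role={kv_role!r}'
        )


# "o_proj" is just an illustrative value for the option.
validate_layer_sharding(["o_proj"], "kv_producer")  # accepted
try:
    validate_layer_sharding(["o_proj"], "kv_consumer")
except ValueError as err:
    print(f"rejected: {err}")
```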
starmountain1997	2cb9f76a0f	[CI] Ds32 ep aime2025 (#8496 ) Backport of #7882 to releases/v0.18.0. Adds aime2025 benchmark test for DeepSeek-V3.2-W8A8 EP with disaggregated prefill on A3 (4-node, 16 NPUs per node, accuracy benchmark baseline 66.67%). Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-04-21 19:49:06 +08:00
yangjiuhua	b717dc17a3	[v0.18.0][Test][Misc] Update CI for GLM-5 configuration on vllm-ascend/releases/v0.18.0 branch (#8322 ) ### What this PR does / why we need it? Update CI for GLM-5 configuration on vllm-ascend/releases/v0.18.0 branch. Test glm5-w4a8 on version 0.18.0. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? --------- Signed-off-by: yangjiuhua <y00845194@china.huawei.com> Co-authored-by: yangjiuhua <y00845194@china.huawei.com>	2026-04-21 14:10:11 +08:00
Li Wang	36a0470de1	[Doc] Upgrade env `VLLM_ASCEND_ENABLE_FUSED_MC2` used in nightly test and tutorials (#8441 ) ### What this PR does / why we need it? The env `VLLM_ASCEND_ENABLE_FUSED_MC2` should only be enabled on the decoder node in Prefill-Decode Disaggregation scenarios. --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-04-20 22:39:23 +08:00
ZT-AIA	3db5048d74	[CI]repair ci for custom op (#8455 ) ### What this PR does / why we need it? repair ci for custom op nightly Signed-off-by: ZT-AIA <1028681969@qq.com>	2026-04-20 17:51:37 +08:00
zhangxinyuehfad	7a706fb197	[v0.18.0][CI] fix report_template.md (#8429 ) ### What this PR does / why we need it? Fix the error in report_template.md caused by https://github.com/vllm-project/vllm-ascend/pull/8340 Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-04-20 11:13:14 +08:00
sunshine202600	1dd1de8153	[Doc][Misc] Improve readability and fix typos in documentation (#8340 ) ### What this PR does / why we need it? This PR improves the readability of the documentation by fixing typos, correcting command extensions, and fixing broken links in the Chinese README. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Documentation changes only. --------- Signed-off-by: sunshine202600 <sunshine202600@163.com>	2026-04-17 08:54:38 +08:00
zhangxinyuehfad	808d00406f	[v0.18.0][CI]Add rank0 process count check for DeepSeek-R1-W8A8-HBM test (#8072 ) ### What this PR does / why we need it? Adds a `check_rank0_process_count` validation step to the DeepSeek-R1-W8A8-HBM nightly single-node test. The check verifies that after the server starts, there is exactly 1 `vllm serve` process running on rank0. This guards against the regression fixed in #8041 (extra NPU context leaking on device 0), ensuring it does not silently reappear in future releases. #### Changes - `tests/e2e/nightly/single_node/models/scripts/test_single_node.py`: Add `run_check_rank0_process_count` async handler. It calls `npu-smi info` for diagnostics, then uses `psutil` to assert exactly one `vllm serve` process exists on rank0. - `tests/e2e/nightly/single_node/models/configs/DeepSeek-R1-W8A8-HBM.yaml`: Register `check_rank0_process_count` in the `test_content` list for the HBM test case. Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-04-15 17:16:27 +08:00
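The process-count check from #8072 could look roughly like the sketch below. The PR says the handler uses `psutil`; how it ties processes to rank0 is not described, so this sketch only counts command lines that look like `vllm serve`. The helper names `is_vllm_serve` and `running_cmdlines` are assumptions.

```python
import os
from typing import List, Sequence


def is_vllm_serve(cmdline: Sequence[str]) -> bool:
    """True if the command line contains '<...>/vllm' directly followed by 'serve'."""
    for exe, arg in zip(cmdline, cmdline[1:]):
        if os.path.basename(exe) == "vllm" and arg == "serve":
            return True
    return False


def running_cmdlines() -> List[List[str]]:
    """Collect command lines of live processes (needs the third-party psutil package)."""
    import psutil  # imported lazily so the matching logic works without it

    return [p.info["cmdline"] or [] for p in psutil.process_iter(attrs=["cmdline"])]


# Illustrative command lines; on a real host these would come from running_cmdlines().
sample = [
    ["/usr/bin/python3", "/usr/local/bin/vllm", "serve", "some-model"],
    ["/usr/bin/python3", "-m", "pytest", "tests/"],
    ["npu-smi", "info"],
]
count = sum(is_vllm_serve(c) for c in sample)
print(f"vllm serve processes: {count}")
```

A test in the spirit of the PR would then assert that the count is exactly 1 after the server has started.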
Nengjun Ma	99cea6c1b5	[CI] Fix the nightly pip binary install doc test fail. (#8129 ) ### What this PR does / why we need it? Fix the failure of the nightly pip binary install doc test. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? Nightly doc test Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-04-10 17:34:18 +08:00
ZYang6263	34386c8896	[v0.18.0][CI] Fix and simplify the CI for Qwen3 32B (#8093 ) ### What this PR does / why we need it? This PR fixes and simplifies the CI configuration for Qwen3 32B. The main changes are: - Remove the redundant `Qwen3-32B-Int8-A3-Feature-Stack3.yaml` config and consolidate the CI setup into `Qwen3-32B-Int8.yaml`. - Improve runtime stability by adding `PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` and setting `--max-num-seqs 80`. - Update the accuracy benchmark from `aime2024` to `gsm8k-lite`, and adjust the related dataset config, output length, baseline, and threshold accordingly. These changes make the Qwen3 32B CI easier to maintain and more stable in nightly validation. --------- Signed-off-by: ZYang6263 <zy626375@gmail.com>	2026-04-10 14:22:24 +08:00
hucong	4a628f1042	[UT][v0.18.0] Fix APC nightly UT and TTFT ratio (cherry-pick #7468 ) (#8053 ) ### What this PR does / why we need it? Cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/7468 - Fix TTFT ratio threshold from 0.8 to 0.4 for prefix cache benchmarks - Fix max_out_len values for warm_up and benchmark configs - Applied to both DeepSeek-R1-0528-W8A8 and Qwen3-32B-Int8 configs ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: underfituu <hzhucong@163.com>	2026-04-08 21:08:26 +08:00
cvSoldier	6c19270498	[BugFix] fix qwen3-next compilation error (#7977 ) ### What this PR does / why we need it? fix qwen3-next compilation error - vLLM version: v0.18.0 - vLLM release0.18.0: `445dc7196f` --------- Signed-off-by: cvSoldier <610496306@qq.com>	2026-04-03 20:03:39 +08:00
guxin108	81c6f51a45	【CI】add nightly cases: MiniMax-M2.5-W8A8 Qwen3.5-27B-w8a8 Qwen3.5-397B-A1… (#7968 ) ### What this PR does / why we need it? This PR adds three accuracy/performance cases on A3: Qwen3.5-27B, MiniMax-M2.5-w8a8, and Qwen3.5-397B-w8a8-mtp. They need to be tested daily. - vLLM version: v0.18.0 - vLLM main: `35141a7eed` Signed-off-by: guxin108 <1252896542@qq.com>	2026-04-03 17:50:59 +08:00
jiangmengyu18	3f462d251e	[v0.18.0][CI] fix acc baseline of qwen3vl 235b (#7981 ) ### What this PR does / why we need it? fix acc baseline of qwen3vl 235b --------- Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>	2026-04-03 17:38:17 +08:00
LeeWenquan	0d773efd70	[CI]Fix qwen3Next Nightly CI config (#7903 ) ### What this PR does / why we need it? Fix qwen3Next Nightly CI config in 0.18.0. backport: #7679 Signed-off-by: Your Name <you@example.com> Co-authored-by: Your Name <you@example.com>	2026-04-03 16:46:25 +08:00
jiangmengyu18	902d1312d9	[v0.18.0][CI] add nightly ci test for qwen3vl (#7913 ) ### What this PR does / why we need it? Add nightly ci test for qwen3vl ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: betta18 <jiangmengyu1@huawei.com> Co-authored-by: betta18 <jiangmengyu1@huawei.com>	2026-04-03 11:39:28 +08:00
hucong	d3de7333dc	[BugFix][v0.18.0][cherry-pick] Fix embedding prefix caching for APC (#7894 )

## What this PR does / why we need it?
pick-from: https://github.com/vllm-project/vllm-ascend/pull/7452

### Problem
Embedding models produce inconsistent outputs when prefix caching is enabled vs disabled.

### Root Cause
The attention router condition was too broad:
- All `model_runner_type == "pooling"` → `_forward_encoder_attention()` → uses `npu_fusion_attention`
- But `npu_fusion_attention` does NOT support prefix caching
- Result: Numerical mismatch when KV cache is managed by prefix caching

### Solution
Refine the router condition to check causality:

Before:
```
if attn_metadata.model_runner_type == "pooling":
    → npu_fusion_attention (no prefix caching support)
```

After:
```
if attn_metadata.model_runner_type == "pooling" and not attn_metadata.causal:
    → npu_fusion_attention (for true encoders)
else:
    → npu_fused_infer_attention_score (prefix caching support)
```

### Changes Made
1. Fixed router condition (`vllm_ascend/attention/attention_v1.py` L968)
   - Added `and not attn_metadata.causal` check
   - Effect: Non-causal embeddings now use correct operator
2. Simplified encoder attention (`vllm_ascend/attention/attention_v1.py` L864-877)
   - Removed redundant causal branch (encoders never use causal mask)
   - Reduced from 34 lines to 14 lines
3. Added test (`tests/e2e/singlecard/pooling/test_embedding.py`)
   - Validates embedding outputs with/without prefix caching are consistent

## Does this PR introduce _any_ user-facing change?

### Functional Changes
✅ Yes - Bug fix: Embedding models now produce consistent outputs with prefix caching

### API Changes
❌ No - All public APIs unchanged

### Configuration Changes
❌ No - No new configuration required

### Backward Compatibility
✅ Fully compatible - Only fixes incorrect behavior

## How was this patch tested?

### New Test Added
`test_embed_models_using_prefix_caching_correctness()`:
- Tests: `Qwen3-Embedding-0.6B`
- Validates numerical consistency between runs with/without prefix caching
- Uses long sequences to activate prefix caching
- Tolerance: 1e-2

- vLLM version: v0.18.0

Signed-off-by: underfituu <hzhucong@163.com>	2026-04-01 16:57:33 +08:00
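The refined router condition from #7894 can be expressed as a small runnable sketch. The `AttnMetadata` dataclass and the function `select_attention_op` are stand-ins invented for illustration; only the two field names, the condition, and the operator names come from the PR description.

```python
from dataclasses import dataclass


@dataclass
class AttnMetadata:
    """Stand-in for the real attention metadata; only the two fields used here."""

    model_runner_type: str
    causal: bool


def select_attention_op(attn_metadata: AttnMetadata) -> str:
    """Return the name of the operator the fixed router condition would pick."""
    if attn_metadata.model_runner_type == "pooling" and not attn_metadata.causal:
        # True encoders: no prefix caching support needed.
        return "npu_fusion_attention"
    # Everything else, including causal pooling models, keeps prefix caching support.
    return "npu_fused_infer_attention_score"


for meta in (
    AttnMetadata(model_runner_type="pooling", causal=False),
    AttnMetadata(model_runner_type="pooling", causal=True),
    AttnMetadata(model_runner_type="generate", causal=True),
):
    print(meta, "->", select_attention_op(meta))
```

The value `"generate"` is just a placeholder for any non-pooling runner type.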
Nagisa125	2cb9195ff0	[Releases/v0.18.0][CI] Updated the parameters for the single-node test to fix the OOM issue for DeepSeek-V3.2 (#7862 ) ### What this PR does / why we need it? Fix the OOM (Out-of-Memory) error in the single-node-deepseek-v3-2-w8a8 nightly test of vllm-ascend: - Reduced the value of HCCL_BUFFSIZE - Lowered the gpu-memory-utilization Optimize service-side performance: Updated service-oriented configuration parameters (e.g., max-num-seqs, cudagraph_capture_sizes, batch_size) to improve inference performance, so that it is closer to the optimal performance of the current mainline. Align performance baseline with main branch: Updated the performance baseline according to the latest performance data. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The test has passed. https://github.com/vllm-project/vllm-ascend/actions/runs/23734079080/job/69134387320?pr=7793 --------- Signed-off-by: wyh145 <1987244901@qq.com>	2026-04-01 10:28:46 +08:00
weiguihua2	59a7526339	[CI][Misc] modify ds3.2+dcp ci (#7841 ) ### What this PR does / why we need it? The current dcp solution allgathers the KV cache, so performance deteriorates significantly and the CI may get stuck. This PR temporarily removes the performance and accuracy benchmarks for DeepSeek-V3.2-W8A8-cp to prevent CI hangs until optimization is complete. pick-from: https://github.com/vllm-project/vllm-ascend/pull/7842 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Verified that the configuration file remains valid and that the CI no longer attempts to run the problematic benchmarks. --------- Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-04-01 08:58:21 +08:00
linfeng-yuan	ed4ef1f4e7	[releases/v0.18.0][Triton][Sampler] Add penalty-related Triton kernel for better performance of penalties (#7794 ) ### What this PR does / why we need it? Implement get_token_bin_counts_and_mask and apply_penalties with Triton-Ascend kernels. This significantly reduces latency of the sampling process when repetition/frequency/presence penalties are enabled. Cherry-pick from main PR #7569 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed. Signed-off-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: realliujiaxu <realliujiaxu@163.com>	2026-03-31 19:01:51 +08:00
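For readers unfamiliar with what the two kernels in #7794 compute, the sketch below is a plain-Python reference for a single sequence. It follows the penalty semantics commonly used in vLLM as I understand them (repetition penalty on tokens seen in prompt or output, frequency and presence penalties on output tokens only); the signatures are simplified assumptions and this is not the Triton-Ascend implementation.

```python
from typing import List, Tuple


def get_token_bin_counts_and_mask(
    tokens: List[int], vocab_size: int
) -> Tuple[List[int], List[bool]]:
    """Count how often each token id occurs and mark the ids that occur at all."""
    counts = [0] * vocab_size
    for token in tokens:
        counts[token] += 1
    mask = [c > 0 for c in counts]
    return counts, mask


def apply_penalties(
    logits: List[float],
    prompt_tokens: List[int],
    output_tokens: List[int],
    presence_penalty: float,
    frequency_penalty: float,
    repetition_penalty: float,
) -> List[float]:
    """Return penalized logits for one sequence (simplified reference version)."""
    vocab_size = len(logits)
    _, prompt_mask = get_token_bin_counts_and_mask(prompt_tokens, vocab_size)
    out_counts, out_mask = get_token_bin_counts_and_mask(output_tokens, vocab_size)

    penalized = []
    for i, logit in enumerate(logits):
        # Repetition penalty: shrink positive logits, push negative ones further down.
        if prompt_mask[i] or out_mask[i]:
            logit = logit / repetition_penalty if logit > 0 else logit * repetition_penalty
        # Frequency penalty scales with how often the token was generated.
        logit -= frequency_penalty * out_counts[i]
        # Presence penalty applies once if the token was generated at all.
        logit -= presence_penalty * (1.0 if out_mask[i] else 0.0)
        penalized.append(logit)
    return penalized


result = apply_penalties(
    logits=[2.0, -1.0, 0.5, 1.0],
    prompt_tokens=[0],
    output_tokens=[1, 1],
    presence_penalty=0.5,
    frequency_penalty=0.25,
    repetition_penalty=2.0,
)
print(result)
```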
ZT-AIA	66db070423	[cherry-pick][Test]repair for test_compute_slot_mapping (#7836 ) ### What this PR does / why we need it? repair for test_compute_slot_mapping Signed-off-by: ZT-AIA <1028681969@qq.com>	2026-03-31 16:52:58 +08:00
Yang Yuxi	e776d5c0f1	[Bugfix]v0.18.0 support FlashComm1 & DCP for Qwen (#7726 ) ### What this PR does / why we need it? This PR backports the changes from #7673 ([Bugfix] support FlashComm1 & DCP for Qwen) to the releases/v0.18.0 branch. -------- Signed-off-by: Yang Yuxi <907276627@qq.com>	2026-03-29 15:59:19 +08:00
wangbj127	9cc41c9457	[v0.18.0][Bugfix][EAGLE] Fix FIA pad bug under max concurrency (#7754 ) cherry picked from https://github.com/vllm-project/vllm-ascend/pull/7740 Fixes padding problems of FIA op under max concurrency. - vLLM version: v0.18.0 - vLLM main: `35141a7eed` Signed-off-by: Wangbingjie <wangbj1207@126.com>	2026-03-29 12:23:44 +08:00
weiguihua2	bc8e87f3db	[v0.18.0][Bugfix] fix ds3.2 dcp mtp (#7681 ) ### What this PR does / why we need it? Fixed an issue that occurs when DCP is combined with MTP in the ds3.2 scenario. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? cherry-pick from: https://github.com/vllm-project/vllm-ascend/pull/7617 Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-03-27 14:24:53 +08:00
Li Wang	2c2d8bb015	[cherry-pick][CI] Enforce torchaudio and torchvison compatible with pta (#7688 ) ### What this PR does / why we need it? This patch is cherry-picked from #7648 Signed-off-by: wangli <wangli858794774@gmail.com>	2026-03-27 11:06:13 +08:00
wangbj127	2ad0ca52a6	Qwen3.5 MoE supports flashcomm v1 (#7644 ) cherry pick from https://github.com/vllm-project/vllm-ascend/pull/7486 ### What this PR does / why we need it? Multimodal models like Qwen3.5 MoE do embedding in model_runner, so when flash comm is enabled, the first AllGather operation should be skipped. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: `8b6325758c` --------- Signed-off-by: Wangbingjie <wangbj1207@126.com> Signed-off-by: wangbj127 <256472688+wangbj127@users.noreply.github.com>	2026-03-25 23:09:33 +08:00
Shaoxu Cheng	3f4087a8f0	[310P]fused recurrent gated delta rule pytorch core and ut (#7398 ) ### What this PR does / why we need it? RFC https://github.com/vllm-project/vllm-ascend/issues/7394 Add a PyTorch implementation of the fused recurrent gated delta rule on 310P. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT - vLLM version: v0.17.0 - vLLM main: `4497431df6` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-25 08:53:14 +08:00
SILONG ZENG	1e3c1e76bf	[Lint]Add lint hooks for clang-format, shellcheck, forbidden imports, and boolean context manager checks (#7511 )

### What this PR does / why we need it?
This PR introduces several upstream `vllm`-aligned lint hooks into `vllm-ascend` and makes them part of the actual `pre-commit` flow.

Main changes in this PR:
- add `check-boolean-context-manager` to catch boolean expressions in `with` statements
- add `check-forbidden-imports` to forbid direct `re` imports and disallowed direct `triton` imports
- enable shell script linting through `tools/shellcheck.sh`
- add root `.clang-format` aligned with upstream `vllm`, enable `clang-format` in `pre-commit`, temporarily exclude all `csrc/` from `clang-format` to avoid bringing a large native code reformat into this PR

This PR focuses on landing the smaller and immediately useful lint alignment first, without mixing in the larger requirements-management migration.

### Does this PR introduce _any_ user-facing change?
No. This PR only updates repository lint configuration, static checks, and internal import/style enforcement. It does not change runtime behavior or public interfaces.

### How was this patch tested?
Tested locally in the project virtual environment.

Commands used:
```bash
bash format.sh
```

Verified checks passed:
``` bash
ruff check...............................................................Passed
ruff format..............................................................Passed
codespell................................................................Passed
typos....................................................................Passed
clang-format.............................................................Passed
Lint GitHub Actions workflow files.......................................Passed
Lint shell scripts.......................................................Passed
Lint PNG exports from excalidraw.........................................Passed
Check for spaces in all filenames........................................Passed
Enforce __init__.py in Python packages...................................Passed
Check for forbidden imports..............................................Passed
Check for boolean ops in with-statements.................................Passed
Suggestion...............................................................Passed
- hook id: suggestion
- duration: 0s

To bypass pre-commit hooks, add --no-verify to git commit.
```

note: clang-format is enabled but currently excludes all csrc/

- vLLM version: v0.17.0
- vLLM main: `8b6325758c`

---------

Signed-off-by: MrZ20 <2609716663@qq.com>	2026-03-24 20:03:01 +08:00
lhp-deep	0e3186f07c	[model_runner_v2]:optimize the performance of the _compute_slot_mappings_kernel (#7575 ) ### What this PR does / why we need it? This PR optimizes the `_compute_slot_mappings_kernel` for Ascend NPUs to improve performance. The key changes include: - A new Triton kernel implementation (`_compute_slot_mappings_kernel`) with NPU-specific optimizations, such as using `tl.gather` to handle non-contiguous memory access and replacing modulo operations. - A new method `compute_slot_mappings` in `AscendBlockTables` to use this new kernel. - An end-to-end test to verify the correctness of the new kernel against the reference GPU implementation. The optimization is needed to avoid performance degradation from scalar computation on Ascend devices. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: `ed359c497a` --------- Signed-off-by: lhp-deep <liuhaopeng1@huawei.com>	2026-03-24 17:29:14 +08:00
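As background for #7575, a slot mapping in a paged KV cache is usually computed from a block table as sketched below. This is the generic formula written in plain Python under my own assumptions (names, argument layout, a single request); it is not the Triton kernel from the PR, and the commit message does not spell out the kernel's exact arithmetic.

```python
from typing import List


def compute_slot_mapping(
    block_table: List[int], positions: List[int], block_size: int
) -> List[int]:
    """Map token positions of one request to physical KV-cache slot indices."""
    slots = []
    for pos in positions:
        block_index = pos // block_size
        # Offset inside the block, written without a modulo operation.
        block_offset = pos - block_index * block_size
        physical_block = block_table[block_index]
        slots.append(physical_block * block_size + block_offset)
    return slots


# Logical blocks 0 and 1 of this request live in physical blocks 7 and 3.
print(compute_slot_mapping(block_table=[7, 3], positions=[0, 3, 4, 5], block_size=4))
```

Computing the offset by subtraction gives the same result as `pos % block_size` for non-negative positions, which is one simple way to avoid modulo operations.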
realliujiaxu	5d12446573	[Feat][SP] Suport SP for VL MoE models (#7044 )

### What this PR does / why we need it?
2nd PR for https://github.com/vllm-project/vllm-ascend/issues/5712, extend SP to VL MoE models.

### Does this PR introduce _any_ user-facing change?
remove `sp_threshold` in additional config and reuse `sp_min_token_num` from vLLM.

### How was this patch tested?
- Model: Qwen3-VL-30B-A3B
- TP4 DP2
- 100 reqs
- max concurrency 1

| Seq length | Mean TTFT (ms) main | Mean TTFT (ms) this PR |
|------------|---------------------|------------------------|
| 4k | 429.40 | 323.3 |
| 16k | 1297.01 | 911.74 |

- vLLM version: v0.16.0
- vLLM main: `4034c3d32e`

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2026-03-24 17:16:00 +08:00
LeeWenquan	9615bc33fd	Fix Qwen3Next CI Config (#7561 ) ### What this PR does / why we need it? This pr modifies qwen3Next nightly CI config. (1) Add a nightly CI . (2) Set a more precise accuracy standard - vLLM version: v0.18.0 - vLLM main: `6a9cceb219` Signed-off-by: Your Name <you@example.com> Co-authored-by: Your Name <you@example.com>	2026-03-24 17:08:17 +08:00
jiaojiao	1de805ce0a	[Ops][Misc] Refactor and optimize CausalConv1d for Ascend (#7495 )

### What this PR does / why we need it?
During the prefill phase of Qwen3-Next and Qwen3.5, the `torch.ops._C_ascend.causal_conv1d_fn` operator exhibits significant performance bottlenecks. To address this, we have re-implemented the optimization using `torch.ops._C_ascend.npu_causal_conv1d_custom`.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
1. accuracy test
```
[2026-03-20 16:44:22,961] [ais_bench] [INFO] Start launch task state board ...
+-----------------------------+-----------+------------+-------------+----------+-------------------------------------------+---------------------+
| Task Name                   | Process   | Progress   | Time Cost   | Status   | Log Path                                  | Extend Parameters   |
+=============================+===========+============+=============+==========+===========================================+=====================+
| vllm-api-general-chat/gsm8k | 2918978   | NA         | 0:00:01     | finish   | logs/eval/vllm-api-general-chat/gsm8k.out | None                |
+-----------------------------+-----------+------------+-------------+----------+-------------------------------------------+---------------------+
[2026-03-20 16:44:34,284] [ais_bench] [INFO] Evaluation tasks completed.
[2026-03-20 16:44:34,287] [ais_bench] [INFO] Summarizing evaluation results...
dataset    version    metric    mode      vllm-api-general-chat
---------  ---------  --------  ------  -----------------------
gsm8k      271d0b     accuracy  gen                       96.21
```
2. ut modify test
`pytest -sv /home/c30006096/vllm-ascend/tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_causal_conv1d.py::test_ascend_causal_conv1d`

- vLLM version: v0.17.0
- vLLM main: `8b6325758c`

Signed-off-by: wenba0 <3054239545@qq.com>
Signed-off-by: jiaojiao <56385650+wenba0@users.noreply.github.com>	2026-03-24 00:07:12 +08:00
Nengjun Ma	8e0789bb36	[CI] Recover pd disaggregated encoder test case that been incorrectly skipped (#7505 ) ### What this PR does / why we need it? [CI] Recover the PD disaggregated encoder test case that had been incorrectly skipped in PR: https://github.com/vllm-project/vllm-ascend/pull/7412 ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? NA - vLLM version: v0.17.0 - vLLM main: `8b6325758c` Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-03-23 21:41:28 +08:00
weijinqian0	bdd90c0088	[model_runner_v2]optimize the performance of the post_update. (#7496 ) ### What this PR does / why we need it? - This PR improves operator performance in the `post_update` phase of `model_runner_v2` on NPUs, which matters for scenarios that require high-performance inference. - When bs = 256, the time cost drops from 26 us to 11 us. ### Does this PR introduce _any_ user-facing change? No. There are no changes to the API, interface, or other high-level behavior that would affect user code beyond the performance improvement. ### How was this patch tested? CI passed with newly added and existing tests. In addition to the regular CI tests, specific benchmark tests were run on NPU hardware to measure the performance improvement of the `post_update` operators. --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2026-03-23 20:29:55 +08:00
Shaoxu Cheng	13397e9cb7	[310p] Add a PyTorch implementation of the GDN gating operator on 310P (#7430 ) ### What this PR does / why we need it? RFC #7394 Add a PyTorch implementation of the GDN gating operator on 310P. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT - vLLM version: v0.17.0 - vLLM main: `4497431df6` Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-03-23 20:26:39 +08:00
zhangxinyuehfad	886756aea0	[Bugfix][CI] Fix aisbench installation to avoid Gitee authentication (#7536 ) ### What this PR does / why we need it? - Pass GITEE_USERNAME (var) and GITEE_TOKEN (secret) as Docker build args in nightly image build so Dockerfile can authenticate to Gitee - In Dockerfile.nightly.a2/a3, embed credentials into clone URL to avoid auth failure during `git clone` - In single-node and multi-node PR test workflows, backup the pre-installed benchmark from the nightly image before wiping vllm-ascend, then restore it instead of re-cloning from Gitee, which is inaccessible from fork PR contexts ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: `8b6325758c` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-03-23 20:16:51 +08:00
liuhy1213-cell	fb283b5820	[CI] Add nightly CI test cases for the GLM-5 (#7429 ) ### What this PR does / why we need it? Add nightly CI test cases for the GLM-5 Add model download for the GLM-5 https://github.com/vllm-project/vllm-ascend/actions/runs/23286178651/job/67710409642#logs - vLLM version: v0.17.0 - vLLM main: `b31e9326a7` --------- Signed-off-by: liuhaiyang27 <liuhaiyang27@huawei.com> Signed-off-by: liuhy1213-cell <liuhy1213@gmail.com> Co-authored-by: liuhaiyang27 <liuhaiyang27@huawei.com>	2026-03-23 19:14:19 +08:00
Qiu	71df17f4e6	bugfix(MC2): refactor the comm group of MC2 to be compatible with PP (#7291 ) ### What this PR does / why we need it? This PR refactors the communication group of MC2 to keep it consistent with vllm's EP group, making it compatible with PP. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-03-23 15:44:21 +08:00
Shanshan Shen	5c0d02f689	[Bugfix] Fix multi-instance serving OOM on single card (#7427 ) ### What this PR does / why we need it? Fix https://github.com/vllm-project/vllm-ascend/issues/7308. Subtract `init_non_torch_memory` (possibly already used by the first instance) from the total `non_torch_memory` when calculating `available_kv_cache_memory`. Directly use `non_torch_memory_increase` (contained in `non_kv_cache_memory`) to calculate `available_kv_cache_memory`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Launch two vllm-ascend instances sequentially on a single card. ```bash # Launch first instance vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B \ --port 8100 \ --host 0.0.0.0 \ --additional-config='{"enable_cpu_binding":true}' \ --gpu-memory-utilization 0.3 \ --max-num-seqs 1 \ --max-model-len 2048 \ --max-num-batched-tokens 2048 \ --no-enable-prefix-caching \ --enforce-eager # Launch second instance vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B \ --port 8101 \ --host 0.0.0.0 \ --additional-config='{"enable_cpu_binding":true}' \ --gpu-memory-utilization 0.3 \ --max-num-seqs 1 \ --max-model-len 2048 \ --max-num-batched-tokens 2048 \ --no-enable-prefix-caching \ --enforce-eager ``` Before this PR: ```bash # First instance: ------------------------------------------------------------------ requested_memory: 18.287109375 GiB non_kv_cache_memory: 1.2340388298034668 GiB init_non_torch_memory: 0.3616676330566406 GiB non_torch_memory_before_empty_cache: 0.3896217346191406 GiB non_torch_memory_increase: 0.0279541015625 GiB non_torch_memory_cleared_by_empty_cache: 0.3616676330566406 GiB ------------------------------------------------------------------ # Second instance: ------------------------------------------------------------------ requested_memory: 18.287109375 GiB non_kv_cache_memory: 1.2336344718933105 GiB init_non_torch_memory: 18.37220001220703 GiB non_torch_memory_before_empty_cache: 18.399906158447266 GiB non_torch_memory_increase: 0.02754974365234375 GiB non_torch_memory_cleared_by_empty_cache: 18.372356414794922 GiB ------------------------------------------------------------------ # available_kv_cache_memory = requested_memory - non_kv_cache_memory - non_torch_memory_cleared_by_empty_cache Available KV cache memory: -1.32 GiB ``` After this PR: ```bash # First instance: ------------------------------------------------------------------ requested_memory: 18.287109375 GiB non_kv_cache_memory: 1.2340540885925293 GiB init_non_torch_memory: 0.36182403564453125 GiB non_torch_memory_before_empty_cache: 0.38979339599609375 GiB non_torch_memory_increase: 0.0279693603515625 GiB non_torch_memory_cleared_by_empty_cache: 0.0 GiB ------------------------------------------------------------------ # Second instance: ------------------------------------------------------------------ requested_memory: 18.287109375 GiB non_kv_cache_memory: 1.233344554901123 GiB init_non_torch_memory: 18.74309539794922 GiB non_torch_memory_before_empty_cache: 18.770355224609375 GiB non_torch_memory_increase: 0.02725982666015625 GiB non_torch_memory_cleared_by_empty_cache: 0.0 GiB ------------------------------------------------------------------ # available_kv_cache_memory = requested_memory - non_kv_cache_memory - non_torch_memory_cleared_by_empty_cache Available KV cache memory: 17.05 GiB ``` - vLLM version: v0.17.0 - vLLM main: `4497431df6` --------- Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>	2026-03-23 14:22:59 +08:00
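The second-instance figures in the commit above can be checked by plugging the logged values into the formula quoted in the log comment. The sketch below is only a worked arithmetic example; the function name `available_kv_cache_gib` is hypothetical and is not taken from the vllm-ascend code base.

```python
def available_kv_cache_gib(requested: float,
                           non_kv_cache: float,
                           cleared_by_empty_cache: float) -> float:
    """Formula quoted in the commit log (all values in GiB):
    requested_memory - non_kv_cache_memory - non_torch_memory_cleared_by_empty_cache
    """
    return requested - non_kv_cache - cleared_by_empty_cache


# Second instance, before the fix: memory held by the first instance is
# counted as "cleared by empty_cache", so the result goes negative.
before = available_kv_cache_gib(18.287109375,
                                1.2336344718933105,
                                18.372356414794922)

# Second instance, after the fix: the cleared term is 0.0.
after = available_kv_cache_gib(18.287109375, 1.233344554901123, 0.0)

print(f"before: {before:.2f} GiB")  # before: -1.32 GiB
print(f"after:  {after:.2f} GiB")   # after:  17.05 GiB
```

Both results agree with the `Available KV cache memory` lines logged in the commit message.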
Li Wang	75fae619d5	[Misc] Refactor aclgraph accuracy test to use logprob-based comparison (#7455 ) ### What this PR does / why we need it? Replace text-match assertions with a two-tier logprob accuracy check: - Prefill (token 0): assert token ID is identical between eager baseline and compiled mode, then verify logprob matches within `atol`. - Decode (tokens 1-2): if chosen tokens match, compare logprobs directly; if they differ, cross-lookup the baseline token in the compiled model's top-20 distribution and assert the assigned logprob is within `decode_atol` (defaults to 2x atol). This tolerates minor argmax drift caused by floating-point differences while still catching distribution divergence. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `8a680463fa` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-03-23 09:08:21 +08:00
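The two-tier check described in the commit above can be sketched roughly as follows. This is an illustration under stated assumptions, not the test's actual code: the `Step` record, its field names, and `step_matches` are hypothetical, and applying `decode_atol` to the matching-token decode case is an assumption, since the commit message only says the logprobs are compared directly.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Step:
    """Hypothetical record for one generated token."""
    token_id: int
    logprob: float
    # token_id -> logprob for the top-k candidates (top-20 in the commit text)
    top_logprobs: Dict[int, float] = field(default_factory=dict)


def step_matches(baseline: Step, compiled: Step, *, is_prefill: bool,
                 atol: float, decode_atol: Optional[float] = None) -> bool:
    if decode_atol is None:
        decode_atol = 2 * atol  # default stated in the commit message

    if is_prefill:
        # Tier 1: token 0 must be identical, and its logprob within atol.
        if baseline.token_id != compiled.token_id:
            return False
        return abs(baseline.logprob - compiled.logprob) <= atol

    if baseline.token_id == compiled.token_id:
        # Tier 2a: same token chosen; compare logprobs directly.
        # Assumption: decode_atol is used as the tolerance here.
        return abs(baseline.logprob - compiled.logprob) <= decode_atol

    # Tier 2b: argmax drift. Look up the baseline token in the compiled
    # model's top-k distribution and compare the logprob assigned to it.
    cross = compiled.top_logprobs.get(baseline.token_id)
    if cross is None:
        return False
    return abs(baseline.logprob - cross) <= decode_atol


eager = Step(token_id=7, logprob=-0.50, top_logprobs={7: -0.50, 9: -1.00})
drift = Step(token_id=9, logprob=-0.75, top_logprobs={9: -0.75, 7: -0.625})
print(step_matches(eager, drift, is_prefill=False, atol=0.125))  # True
print(step_matches(eager, drift, is_prefill=True, atol=0.125))   # False
```

In the usage lines, the compiled run picks a different token during decode but still assigns the baseline token a logprob within `decode_atol`, so the decode check passes while the stricter prefill check does not.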
meihanc	bff4fbfca5	upgrade to 0.18.0 (#7502 ) ### What this PR does / why we need it? 1. upgrade to 0.18.0 2. ensure kernel_block_sizes is int for Eagle drafter ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `8b6325758c` --------- Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-03-21 16:05:38 +08:00
linfeng-yuan	88d03a783f	[refactor] replace scattered business kwargs with typed request objects and explicit stage boundaries (#7024 ) ### What this PR does / why we need it? Refactor `vllm_ascend/ops/fused_moe` to replace scattered MoE business `**kwargs` with typed request objects and explicit stage boundaries. - Prepare, dispatch, MLP, and quant stages now have clearer ownership. - Main MoE path no longer depends on business `kwargs.get(...)` lookups. - Comm and dispatcher interfaces are request-only on the main path. - UTs can assert stage-level fields directly instead of inferring behavior indirectly. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed. --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-03-20 23:23:57 +08:00
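The motivation for replacing `kwargs.get(...)` lookups with typed request objects, as in the commit above, can be illustrated with a small sketch. All names here (`DispatchRequest`, `dispatch_with_kwargs`, `dispatch`) are hypothetical and do not come from `vllm_ascend/ops/fused_moe`; the point is only that a misspelled field fails loudly with a typed object instead of silently falling back to a default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DispatchRequest:
    """Hypothetical typed request for one MoE stage."""
    num_experts: int
    top_k: int
    quantized: bool = False


def dispatch_with_kwargs(**kwargs) -> int:
    # Old style: business parameters are looked up by string key, so a
    # misspelled key silently falls back to the default value.
    return kwargs.get("num_experts", 1) * kwargs.get("top_k", 1)


def dispatch(request: DispatchRequest) -> int:
    # New style: the stage receives an explicit, typed request object.
    return request.num_experts * request.top_k


# The typo "topk" goes unnoticed with kwargs and yields 8 * 1.
silently_wrong = dispatch_with_kwargs(num_experts=8, topk=2)

# With the typed request the same typo raises TypeError at construction.
try:
    DispatchRequest(num_experts=8, topk=2)  # type: ignore[call-arg]
    typo_detected = False
except TypeError:
    typo_detected = True

correct = dispatch(DispatchRequest(num_experts=8, top_k=2))
print(silently_wrong, typo_detected, correct)  # 8 True 16
```

A typed request also lets unit tests assert on named fields directly, which matches the commit's note that UTs can check stage-level fields instead of inferring behavior indirectly.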

648 Commits