xc-llm-ascend

Author	SHA1	Message	Date
Nagisa125	600bf80c6d	[CI]Fix the error caused by layer_sharding in dsv32 (#8719 ) ### What this PR does / why we need it? This PR fixes the error in DSV32 mixed deployment caused by enabling layer_sharding. - Currently, mixed deployment no longer supports the enabling of layer_sharding. Therefore, it has been removed from the service-oriented configuration. - The error "RPC call to sample_tokens timed out" occurred because the dshm size limit was set too small. Therefore, it was increased to 512 Gi. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? The nightly test has passed. Signed-off-by: wyh145 <1987244901@qq.com>	2026-04-30 10:35:48 +08:00
hucong	8a671a109c	[CI][Cherry-pick] Relax TTFT benefits threshold from 0.4 to 0.5 to account for DP load imbalance (#8684 ) Cherry-pick https://github.com/vllm-project/vllm-ascend/pull/8683 ### What this PR does / why we need it? This PR relaxes the TTFT threshold from `0.4` to `0.5` to improve robustness under Data Parallel (DP) load imbalance. #### Background The current assertion enforces: prefix75 < prefix0 * 0.4 #### ❌ Nightly Failure Cases (Observed) \| prefix0 \| threshold (0.4x) \| prefix75 \| delta \| \|--------\|------------------\|----------\|--------\| \| 4696.24 \| 1878.50 \| 1883.99 \| +5.49 \| \| 4696.20 \| 1878.48 \| 1896.01 \| +17.53 \| \| 4636.73 \| 1854.69 \| 1902.48 \| +47.79 \| \| 4655.17 \| 1862.07 \| 1913.54 \| +51.47 \| \| 4685.35 \| 1874.14 \| 1919.36 \| +45.22 \| \| 4660.33 \| 1864.13 \| 1915.41 \| +51.28 \| \| 4648.30 \| 1859.32 \| 1950.50 \| +91.18 \| \| 4655.30 \| 1862.12 \| 1962.32 \| +100.20 \| --- #### ✅ Nightly Passing Cases (Observed) \| prefix0 \| threshold (0.4x) \| prefix75 \| margin \| \|--------\|------------------\|----------\|---------\| \| 4685.64 \| 1874.26 \| 1864.46 \| -9.80 \| \| 5520.28 \| 2208.11 \| 1928.97 \| -279.14 \| \| 4639.23 \| 1855.69 \| 1846.86 \| -8.83 \| \| 4651.64 \| 1860.66 \| 1854.30 \| -6.36 \| \| 4640.39 \| 1856.15 \| 1840.32 \| -15.83 \| \| 4677.20 \| 1870.88 \| 1848.35 \| -22.53 \| --- #### Key Observations - Failures exceed the threshold by only ~5 ms to ~100 ms (~0.3%–5%) - Passing cases often have very tight margins (~5–10 ms) - There is clear overlap between pass and fail boundaries - Many failures are borderline violations, not real regressions --- #### Root Cause The instability is caused by Data Parallel (DP) load imbalance, which introduces systematic variance: - Uneven request distribution across workers - Queueing delays - Increased TTFT variance (especially for `prefix75`) --- #### Conclusion - The current threshold (`0.4x`) is too strict - Observed natural fluctuation: - Absolute: up to ~100 ms - Relative: up to ~5% over threshold - Pass/fail boundary is currently too sensitive to runtime jitter --- #### Change We relax the threshold: 0.4 → 0.5 This adjustment: - Accounts for expected runtime variance - Reduces false negatives - Maintains a meaningful performance constraint Even with `0.5`, the requirement remains strict (`prefix75 < 50% of prefix0`) and does not mask real regressions. --- ### Does this PR introduce _any_ user-facing change? No. This change only affects internal test assertions and does not impact user-facing behavior or model performance. --- ### How was this patch tested? - Verified against existing TTFT test cases: - Previously failing cases (due to small variance) now pass - No regressions observed in other scenarios - Confirmed that failures were due to DP load imbalance rather than actual performance degradation - Ensured the updated threshold still enforces a meaningful constraint on TTFT Signed-off-by: underfituu <hzhucong@163.com>	2026-04-27 16:38:07 +08:00
yangjiuhua	b717dc17a3	[v0.18.0][Test][Misc] Update CI for GLM-5 configuration on vllm-ascend/releases/v0.18.0 branch (#8322 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? Update CI for GLM-5 configuration on vllm-ascend/releases/v0.18.0 branch 在0.18.0版本上对glm5-w4a8做测试 ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> --------- Signed-off-by: yangjiuhua <y00845194@china.huawei.com> Co-authored-by: yangjiuhua <y00845194@china.huawei.com>	2026-04-21 14:10:11 +08:00
Li Wang	36a0470de1	[Doc] Upgrade env `VLLM_ASCEND_ENABLE_FUSED_MC2` used in nightly test and tutorials (#8441 ) ### What this PR does / why we need it? The env `VLLM_ASCEND_ENABLE_FUSED_MC2` should only enabled in the decoder node during Prefill-Decode Disaggregation scenario --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-04-20 22:39:23 +08:00
zhangxinyuehfad	808d00406f	[v0.18.0][CI]Add rank0 process count check for DeepSeek-R1-W8A8-HBM test (#8072 ) ### What this PR does / why we need it? Adds a `check_rank0_process_count` validation step to the DeepSeek-R1-W8A8-HBM nightly single-node test. The check verifies that after the server starts, there is exactly 1 `vllm serve` process running on rank0. This guards against the regression fixed in #8041 (extra NPU context leaking on device 0), ensuring it does not silently reappear in future releases. #### Changes - `tests/e2e/nightly/single_node/models/scripts/test_single_node.py`: Add `run_check_rank0_process_count` async handler. It calls `npu-smi info` for diagnostics, then uses `psutil` to assert exactly one `vllm serve` process exists on rank0. - `tests/e2e/nightly/single_node/models/configs/DeepSeek-R1-W8A8-HBM.yaml`: Register `check_rank0_process_count` in the `test_content` list for the HBM test case. Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-04-15 17:16:27 +08:00
ZYang6263	34386c8896	[v0.18.0][CI] Fix and simplify the CI for Qwen3 32B (#8093 ) ### What this PR does / why we need it? This PR fixes and simplifies the CI configuration for Qwen3 32B. The main changes are: - Remove the redundant `Qwen3-32B-Int8-A3-Feature-Stack3.yaml` config and consolidate the CI setup into `Qwen3-32B-Int8.yaml`. - Improve runtime stability by adding `PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` and setting `--max-num-seqs 80`. - Update the accuracy benchmark from `aime2024` to `gsm8k-lite`, and adjust the related dataset config, output length, baseline, and threshold accordingly. These changes make the Qwen3 32B CI easier to maintain and more stable in nightly validation. --------- Signed-off-by: ZYang6263 <zy626375@gmail.com>	2026-04-10 14:22:24 +08:00
hucong	4a628f1042	[UT][v0.18.0] Fix APC nightly UT and TTFT ratio (cherry-pick #7468 ) (#8053 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> Cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/7468 - Fix TTFT ratio threshold from 0.8 to 0.4 for prefix cache benchmarks - Fix max_out_len values for warm_up and benchmark configs - Applied to both DeepSeek-R1-0528-W8A8 and Qwen3-32B-Int8 configs ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> Signed-off-by: underfituu <hzhucong@163.com>	2026-04-08 21:08:26 +08:00
guxin108	81c6f51a45	【CI】add nightly cases: MiniMax-M2.5-W8A8 Qwen3.5-27B-w8a8 Qwen3.5-397B-A1… (#7968 ) ### What this PR does / why we need it? This PR Qwen3.5-27B ;MiniMax-M2.5-w8a8 ;Qwen3.5-397B-w8a8-mtp acc/perf 3 cases on A3, we need test them daily. - vLLM version: v0.18.0 - vLLM main: `35141a7eed` Signed-off-by: guxin108 <1252896542@qq.com>	2026-04-03 17:50:59 +08:00
jiangmengyu18	3f462d251e	[v0.18.0][CI] fix acc baseline of qwen3vl 235b (#7981 ) ### What this PR does / why we need it? fix acc baseline of qwen3vl 235b --------- Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>	2026-04-03 17:38:17 +08:00
LeeWenquan	0d773efd70	[CI]Fix qwen3Next Nightly CI config (#7903 ) ### What this PR does / why we need it? Fix qwen3Next Nightly CI config in 0.18.0. backport: #7679 Signed-off-by: Your Name <you@example.com> Co-authored-by: Your Name <you@example.com>	2026-04-03 16:46:25 +08:00
jiangmengyu18	902d1312d9	[v0.18.0][CI] add nightly ci test for qwen3vl (#7913 ) ### What this PR does / why we need it? Add nightly ci test for qwen3vl ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: betta18 <jiangmengyu1@huawei.com> Co-authored-by: betta18 <jiangmengyu1@huawei.com>	2026-04-03 11:39:28 +08:00
Nagisa125	2cb9195ff0	[Releases/v0.18.0][CI] Updated the parameters for the single-node test to fix the OOM issue for DeepSeek-V3.2 (#7862 ) ### What this PR does / why we need it? Fix the OOM (Out-of-Memory) error in the single-node-deepseek-v3-2-w8a8 nightly test of vllm-ascend: - Reduced the value of HCCL_BUFFSIZE - Lowered the gpu-memory-utilization Optimize service-side performance: Updated service-oriented configuration parameters (e.g., max-num-seqs, cudagraph_capture_sizes, batch_size) to improve the inference performance,so that the performance is closer to the optimal performance of the current mainline. Align performance baseline with main branch: Updated the performance baseline according to the latest performance data ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The test has passed. https://github.com/vllm-project/vllm-ascend/actions/runs/23734079080/job/69134387320?pr=7793 --------- Signed-off-by: wyh145 <1987244901@qq.com>	2026-04-01 10:28:46 +08:00
LeeWenquan	9615bc33fd	Fix Qwen3Next CI Config (#7561 ) ### What this PR does / why we need it? This pr modifies qwen3Next nightly CI config. (1) Add a nightly CI . (2) Set a more precise accuracy standard - vLLM version: v0.18.0 - vLLM main: `6a9cceb219` Signed-off-by: Your Name <you@example.com> Co-authored-by: Your Name <you@example.com>	2026-03-24 17:08:17 +08:00
liuhy1213-cell	fb283b5820	[CI] Add nightly CI test cases for the GLM-5 (#7429 ) ### What this PR does / why we need it? Add nightly CI test cases for the GLM-5 Add model download for the GLM-5 https://github.com/vllm-project/vllm-ascend/actions/runs/23286178651/job/67710409642#logs - vLLM version: v0.17.0 - vLLM main: `b31e9326a7` --------- Signed-off-by: liuhaiyang27 <liuhaiyang27@huawei.com> Signed-off-by: liuhy1213-cell <liuhy1213@gmail.com> Co-authored-by: liuhaiyang27 <liuhaiyang27@huawei.com>	2026-03-23 19:14:19 +08:00
aipaes	87d6424b2e	[CI] Add nightly CI test cases for the GLM-4.7 model. (#7391 ) ### What this PR does / why we need it? Add acc nightly CI test cases for the GLM-4.7 model. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? through CI - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: zjks98 <zhangjiakang4@huawei.com> Co-authored-by: zjks98 <zhangjiakang4@huawei.com>	2026-03-19 16:43:29 +08:00
LoganJane	270c5cb8cd	[CI] Add nightly CI test cases for the Kimi-K2.5 (#7416 ) ### What this PR does / why we need it? Add nightly CI test cases for the Kimi-K2.5. - vLLM version: v0.17.0 - vLLM main: `4497431df6` --------- Signed-off-by: LoganJane <loganJane73@hotmail.com> Signed-off-by: LoganJane <42287016+LoganJane@users.noreply.github.com>	2026-03-19 11:02:29 +08:00
SparrowMu	fb8e22ec00	[DOC] MiniMax-M2.5 model intro (#7296 ) ### What this PR does / why we need it? 1. Add nightly test on MiniMax-M2.5 with deployment method on A3 2. Add MiniMax-M2.5 deployment introduction to vllm-ascend docs - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: limuyuan <limuyuan3@huawei.com> Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com> Co-authored-by: limuyuan <limuyuan3@huawei.com>	2026-03-18 20:14:36 +08:00
liuhy1213-cell	58725b8b24	[doc] add Prefill-Decode Disaggregation doc for GLM5.md (#7300 ) ### What this PR does / why we need it? add Prefill-Decode Disaggregation doc for GLM5.md w8a8 65k-1.5k Concurrency: 80 prefixcache: 90% tps: 2054 - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: liuhaiyang27 <liuhaiyang27@huawei.com> Co-authored-by: liuhaiyang27 <liuhaiyang27@huawei.com>	2026-03-18 17:00:31 +08:00
LeeWenquan	65eae6de7b	Add Ascend Ops recurrent_gated_delta_rule (#6725 ) ### What this PR does / why we need it? Change recurrent_gated_delta_rule ops from triton to ascend C version for better performance. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2026-03-09 14:14:14 +08:00
SILONG ZENG	859f2c25b9	[Nightly][Refactor]Migrate nightly single-node model tests from `.py` to `.yaml` (#6503 ) ### What this PR does / why we need it? This PR refactors the nightly single-node model test by migrating test configurations from Python scripts to a more maintainable `YAML-based` format. \| Original PR \| Python (`.py`) \| YAML (`.yaml`) \| \| :--- \| :--- \| :--- \| \| [#3568](https://github.com/vllm-project/vllm-ascend/pull/3568) \| `test_deepseek_r1_0528_w8a8_eplb.py` \| `DeepSeek-R1-0528-W8A8.yaml` \| \| [#3631](https://github.com/vllm-project/vllm-ascend/pull/3631) \| `test_deepseek_r1_0528_w8a8.py` \| `DeepSeek-R1-0528-W8A8.yaml` \| \| [#5874](https://github.com/vllm-project/vllm-ascend/pull/5874) \| `test_deepseek_r1_w8a8_hbm.py` \| `DeepSeek-R1-W8A8-HBM.yaml` \| \| [#3908](https://github.com/vllm-project/vllm-ascend/pull/3908) \| `test_deepseek_v3_2_w8a8.py` \| `DeepSeek-V3.2-W8A8.yaml` \| \| [#5682](https://github.com/vllm-project/vllm-ascend/pull/5682) \| `test_kimi_k2_thinking.py` \| `Kimi-K2-Thinking.yaml` \| \| [#4111](https://github.com/vllm-project/vllm-ascend/pull/4111) \| `test_mtpx_deepseek_r1_0528_w8a8.py` \| `MTPX-DeepSeek-R1-0528-W8A8.yaml` \| \| [#3733](https://github.com/vllm-project/vllm-ascend/pull/3733) \| `test_prefix_cache_deepseek_r1_0528_w8a8.py` \| `Prefix-Cache-DeepSeek-R1-0528-W8A8.yaml` \| \| [#6543](https://github.com/vllm-project/vllm-ascend/pull/6543) \| `test_qwen3_235b_w8a8.py` \| `Qwen3-235B-A22B-W8A8.yaml` \| \| [#6543](https://github.com/vllm-project/vllm-ascend/pull/6543) \| `test_qwen3_235b_a22b_w8a8_eplb.py` \| `Qwen3-235B-A22B-W8A8.yaml` \| \| [#3973](https://github.com/vllm-project/vllm-ascend/pull/3973) \| `test_qwen3_30b_w8a8.py` \| `Qwen3-30B-A3B-W8A8.yaml` \| \| [#3541](https://github.com/vllm-project/vllm-ascend/pull/3541) \| `test_qwen3_32b_int8.py` \| `Qwen3-32B-Int8.yaml` \| \| [#3757](https://github.com/vllm-project/vllm-ascend/pull/3757) \| `test_qwq_32b.py` \| `QwQ-32B.yaml` \| \| [#5616](https://github.com/vllm-project/vllm-ascend/pull/5616) \| `test_qwen3_next_w8a8.py` \| `Qwen3-Next-80B-A3B-Instruct-W8A8.yaml` \| \| [#3541](https://github.com/vllm-project/vllm-ascend/pull/3541) \| `test_qwen2_5_vl_7b.py` \| `Qwen2.5-VL-7B-Instruct.yaml` \| \| [#5301](https://github.com/vllm-project/vllm-ascend/pull/5301) \| `test_qwen2_5_vl_7b_epd.py` \| `Qwen2.5-VL-7B-Instruct-EPD.yaml` \| \| [#3707](https://github.com/vllm-project/vllm-ascend/pull/3707) \| `test_qwen2_5_vl_32b.py` \| `Qwen2.5-VL-32B-Instruct.yaml` \| \| [#3676](https://github.com/vllm-project/vllm-ascend/pull/3676) \| `test_qwen3_32b_int8_a3_feature_stack3.py` \| `Qwen3-32B-Int8-A3-Feature-Stack3.yaml` \| \| [#3709](https://github.com/vllm-project/vllm-ascend/pull/3709) \| `test_prefix_cache_qwen3_32b_int8.py` \| `Prefix-Cache-Qwen3-32B-Int8.yaml` \| \| [#5395](https://github.com/vllm-project/vllm-ascend/pull/5395) \| `test_qwen3_next.py` \| `Qwen3-Next-80B-A3B-Instruct-A2.yaml` \| \| [#3474](https://github.com/vllm-project/vllm-ascend/pull/3474) \| `test_qwen3_32b.py` \| `Qwen3-32B.yaml` \| \| [#3541](https://github.com/vllm-project/vllm-ascend/pull/3541) \| `test_qwen3_32b_int8.py` \| `Qwen3-32B-Int8-A2.yaml` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-03-03 20:13:43 +08:00

20 Commits