### What this PR does / why we need it?
The env `VLLM_ASCEND_ENABLE_FUSED_MC2` should only enabled in the
decoder node during Prefill-Decode Disaggregation scenario
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Adds a `check_rank0_process_count` validation step to the
DeepSeek-R1-W8A8-HBM nightly single-node test.
The check verifies that after the server starts, there is **exactly 1**
`vllm serve` process running on rank0. This guards against the
regression fixed in #8041 (extra NPU context leaking on device 0),
ensuring it does not silently reappear in future releases.
#### Changes
-
**`tests/e2e/nightly/single_node/models/scripts/test_single_node.py`**:
Add `run_check_rank0_process_count` async handler. It calls `npu-smi
info` for diagnostics, then uses `psutil` to assert exactly one `vllm
serve` process exists on rank0.
-
**`tests/e2e/nightly/single_node/models/configs/DeepSeek-R1-W8A8-HBM.yaml`**:
Register `check_rank0_process_count` in the `test_content` list for the
HBM test case.
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
This PR fixes and simplifies the CI configuration for Qwen3 32B.
The main changes are:
- Remove the redundant `Qwen3-32B-Int8-A3-Feature-Stack3.yaml` config
and consolidate the CI setup into `Qwen3-32B-Int8.yaml`.
- Improve runtime stability by adding
`PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` and setting
`--max-num-seqs 80`.
- Update the accuracy benchmark from `aime2024` to `gsm8k-lite`, and
adjust the related dataset config, output length, baseline, and
threshold accordingly.
These changes make the Qwen3 32B CI easier to maintain and more stable
in nightly validation.
---------
Signed-off-by: ZYang6263 <zy626375@gmail.com>
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.
- Please clarify why the changes are needed. For instance, the use case
and bug description.
- Fixes #
-->
Cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/7468
- Fix TTFT ratio threshold from 0.8 to 0.4 for prefix cache benchmarks
- Fix max_out_len values for warm_up and benchmark configs
- Applied to both DeepSeek-R1-0528-W8A8 and Qwen3-32B-Int8 configs
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
Signed-off-by: underfituu <hzhucong@163.com>
### What this PR does / why we need it?
This PR Qwen3.5-27B ;MiniMax-M2.5-w8a8 ;Qwen3.5-397B-w8a8-mtp acc/perf 3
cases on A3, we need test them daily.
- vLLM version: v0.18.0
- vLLM main:
35141a7eed
Signed-off-by: guxin108 <1252896542@qq.com>
### What this PR does / why we need it?
Fix qwen3Next Nightly CI config in 0.18.0.
backport: #7679
Signed-off-by: Your Name <you@example.com>
Co-authored-by: Your Name <you@example.com>
### What this PR does / why we need it?
Add nightly ci test for qwen3vl
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
### What this PR does / why we need it?
Fix the OOM (Out-of-Memory) error in the single-node-deepseek-v3-2-w8a8
nightly test of vllm-ascend:
- Reduced the value of HCCL_BUFFSIZE
- Lowered the gpu-memory-utilization
Optimize service-side performance:
Updated service-oriented configuration parameters (e.g., max-num-seqs,
cudagraph_capture_sizes, batch_size) to improve the inference
performance,so that the performance is closer to the optimal performance
of the current mainline.
Align performance baseline with main branch:
Updated the performance baseline according to the latest performance
data
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The test has passed.
https://github.com/vllm-project/vllm-ascend/actions/runs/23734079080/job/69134387320?pr=7793
---------
Signed-off-by: wyh145 <1987244901@qq.com>
### What this PR does / why we need it?
This pr modifies qwen3Next nightly CI config.
(1) Add a nightly CI .
(2) Set a more precise accuracy standard
- vLLM version: v0.18.0
- vLLM main:
6a9cceb219
Signed-off-by: Your Name <you@example.com>
Co-authored-by: Your Name <you@example.com>
### What this PR does / why we need it?
Add acc nightly CI test cases for the GLM-4.7 model.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
through CI
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
### What this PR does / why we need it?
1. Add nightly test on MiniMax-M2.5 with deployment method on A3
2. Add MiniMax-M2.5 deployment introduction to vllm-ascend docs
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: limuyuan <limuyuan3@huawei.com>
Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
### What this PR does / why we need it?
Change recurrent_gated_delta_rule ops from triton to ascend C version
for better performance.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
9562912cea
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>