34 Commits

Author SHA1 Message Date
Nagisa125
2cb9195ff0 [Releases/v0.18.0][CI] Updated the parameters for the single-node test to fix the OOM issue for DeepSeek-V3.2 (#7862)
### What this PR does / why we need it?
Fix the OOM (Out-of-Memory) error in the single-node-deepseek-v3-2-w8a8
nightly test of vllm-ascend:

- Reduced the value of HCCL_BUFFSIZE

- Lowered the gpu-memory-utilization

Optimize service-side performance:
Updated serving configuration parameters (e.g., max-num-seqs,
cudagraph_capture_sizes, batch_size) to improve inference performance,
bringing it closer to the optimal performance of the current mainline.
Align performance baseline with main branch:
Updated the performance baseline according to the latest performance
data
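For illustration, the kinds of knobs described above are typically adjusted like this when launching the test server; the values below are placeholders, not the exact settings used in this PR:

```bash
# Placeholder values for illustration only; the PR's actual settings may differ.
# A smaller HCCL buffer leaves more NPU memory for model weights and KV cache.
export HCCL_BUFFSIZE=512

# Lower --gpu-memory-utilization and tune the batching-related knobs together
# with the cudagraph capture sizes to stay within memory.
vllm serve /path/to/DeepSeek-V3.2-W8A8 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 16 \
  --compilation-config '{"cudagraph_capture_sizes": [8, 16]}'
```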

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
The test has passed.

https://github.com/vllm-project/vllm-ascend/actions/runs/23734079080/job/69134387320?pr=7793

---------

Signed-off-by: wyh145 <1987244901@qq.com>
2026-04-01 10:28:46 +08:00
SILONG ZENG
1e3c1e76bf [Lint]Add lint hooks for clang-format, shellcheck, forbidden imports, and boolean context manager checks (#7511)
### What this PR does / why we need it?
This PR introduces several upstream `vllm`-aligned lint hooks into
`vllm-ascend` and makes them part of the actual `pre-commit` flow.

Main changes in this PR:
- add `check-boolean-context-manager` to catch boolean expressions in
`with` statements
- add `check-forbidden-imports` to forbid direct `re` imports and
disallowed direct `triton` imports
- enable shell script linting through `tools/shellcheck.sh`
- add root `.clang-format` aligned with upstream `vllm`, enable
`clang-format` in `pre-commit`, temporarily **exclude all `csrc/**`**
from `clang-format` to avoid bringing a large native code reformat into
this PR

This PR focuses on landing the smaller and immediately useful lint
alignment first, without mixing in the larger requirements-management
migration.
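For local use, the new hooks can be invoked individually through `pre-commit`; the hook ids below are assumed to match the names listed above:

```bash
# Run individual hooks across the whole repository.
# Hook ids are assumed to match the hook names mentioned above.
pre-commit run check-forbidden-imports --all-files
pre-commit run check-boolean-context-manager --all-files
pre-commit run clang-format --all-files

# Or run the full suite, as was done for the verification output below.
bash format.sh
```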

### Does this PR introduce _any_ user-facing change?
No.

This PR only updates repository lint configuration, static checks, and
internal import/style enforcement. It does not change runtime behavior
or public interfaces.

### How was this patch tested?
Tested locally in the project virtual environment.

Commands used:
```bash
bash format.sh
```
Verified checks passed:
``` bash
ruff check...............................................................Passed
ruff format..............................................................Passed
codespell................................................................Passed
typos....................................................................Passed
clang-format.............................................................Passed
Lint GitHub Actions workflow files.......................................Passed
Lint shell scripts.......................................................Passed
Lint PNG exports from excalidraw.........................................Passed
Check for spaces in all filenames........................................Passed
Enforce __init__.py in Python packages...................................Passed
Check for forbidden imports..............................................Passed
Check for boolean ops in with-statements.................................Passed
Suggestion...............................................................Passed
- hook id: suggestion
- duration: 0s

To bypass pre-commit hooks, add --no-verify to git commit.
```
**Note:** clang-format is enabled but currently excludes all `csrc/**`.


- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-03-24 20:03:01 +08:00
LeeWenquan
9615bc33fd Fix Qwen3Next CI Config (#7561)
### What this PR does / why we need it?
This PR modifies the Qwen3Next nightly CI config:
(1) Add a nightly CI job.
(2) Set a more precise accuracy standard.

- vLLM version: v0.18.0
- vLLM main:
6a9cceb219

Signed-off-by: Your Name <you@example.com>
Co-authored-by: Your Name <you@example.com>
2026-03-24 17:08:17 +08:00
liuhy1213-cell
fb283b5820 [CI] Add nightly CI test cases for the GLM-5 (#7429)
### What this PR does / why we need it?
Add nightly CI test cases for GLM-5.
Add model download for GLM-5.

https://github.com/vllm-project/vllm-ascend/actions/runs/23286178651/job/67710409642#logs
- vLLM version: v0.17.0
- vLLM main:
b31e9326a7
---------
Signed-off-by: liuhaiyang27 <liuhaiyang27@huawei.com>
Signed-off-by: liuhy1213-cell <liuhy1213@gmail.com>
Co-authored-by: liuhaiyang27 <liuhaiyang27@huawei.com>
2026-03-23 19:14:19 +08:00
aipaes
87d6424b2e [CI] Add nightly CI test cases for the GLM-4.7 model. (#7391)
### What this PR does / why we need it?
Add accuracy nightly CI test cases for the GLM-4.7 model.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
through CI

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
2026-03-19 16:43:29 +08:00
LoganJane
270c5cb8cd [CI] Add nightly CI test cases for the Kimi-K2.5 (#7416)
### What this PR does / why we need it?
Add nightly CI test cases for the Kimi-K2.5.

- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: LoganJane <loganJane73@hotmail.com>
Signed-off-by: LoganJane <42287016+LoganJane@users.noreply.github.com>
2026-03-19 11:02:29 +08:00
SparrowMu
fb8e22ec00 [DOC] MiniMax-M2.5 model intro (#7296)
### What this PR does / why we need it?
1. Add a nightly test for MiniMax-M2.5 with its deployment method on A3
2. Add a MiniMax-M2.5 deployment introduction to the vllm-ascend docs

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: limuyuan <limuyuan3@huawei.com>
Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
2026-03-18 20:14:36 +08:00
liuhy1213-cell
58725b8b24 [doc] add Prefill-Decode Disaggregation doc for GLM5.md (#7300)
### What this PR does / why we need it?
Add a Prefill-Decode Disaggregation doc for GLM5.
Benchmark settings: W8A8, 65k input / 1.5k output
Concurrency: 80
Prefix cache: 90%
TPS: 2054

- vLLM version: v0.17.0

- vLLM main:
4034c3d32e
---------
Signed-off-by: liuhaiyang27 <liuhaiyang27@huawei.com>
Co-authored-by: liuhaiyang27 <liuhaiyang27@huawei.com>
2026-03-18 17:00:31 +08:00
LeeWenquan
65eae6de7b Add Ascend Ops recurrent_gated_delta_rule (#6725)
### What this PR does / why we need it?
Change the recurrent_gated_delta_rule op from the Triton implementation to an
Ascend C version for better performance.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2026-03-09 14:14:14 +08:00
SILONG ZENG
859f2c25b9 [Nightly][Refactor]Migrate nightly single-node model tests from .py to .yaml (#6503)
### What this PR does / why we need it?
This PR refactors the nightly single-node model test by migrating test
configurations from Python scripts to a more maintainable `YAML-based`
format.

| Original PR | Python (`.py`) | YAML (`.yaml`) |
| :--- | :--- | :--- |
| [#3568](https://github.com/vllm-project/vllm-ascend/pull/3568) | `test_deepseek_r1_0528_w8a8_eplb.py` | `DeepSeek-R1-0528-W8A8.yaml` |
| [#3631](https://github.com/vllm-project/vllm-ascend/pull/3631) | `test_deepseek_r1_0528_w8a8.py` | `DeepSeek-R1-0528-W8A8.yaml` |
| [#5874](https://github.com/vllm-project/vllm-ascend/pull/5874) | `test_deepseek_r1_w8a8_hbm.py` | `DeepSeek-R1-W8A8-HBM.yaml` |
| [#3908](https://github.com/vllm-project/vllm-ascend/pull/3908) | `test_deepseek_v3_2_w8a8.py` | `DeepSeek-V3.2-W8A8.yaml` |
| [#5682](https://github.com/vllm-project/vllm-ascend/pull/5682) | `test_kimi_k2_thinking.py` | `Kimi-K2-Thinking.yaml` |
| [#4111](https://github.com/vllm-project/vllm-ascend/pull/4111) | `test_mtpx_deepseek_r1_0528_w8a8.py` | `MTPX-DeepSeek-R1-0528-W8A8.yaml` |
| [#3733](https://github.com/vllm-project/vllm-ascend/pull/3733) | `test_prefix_cache_deepseek_r1_0528_w8a8.py` | `Prefix-Cache-DeepSeek-R1-0528-W8A8.yaml` |
| [#6543](https://github.com/vllm-project/vllm-ascend/pull/6543) | `test_qwen3_235b_w8a8.py` | `Qwen3-235B-A22B-W8A8.yaml` |
| [#6543](https://github.com/vllm-project/vllm-ascend/pull/6543) | `test_qwen3_235b_a22b_w8a8_eplb.py` | `Qwen3-235B-A22B-W8A8.yaml` |
| [#3973](https://github.com/vllm-project/vllm-ascend/pull/3973) | `test_qwen3_30b_w8a8.py` | `Qwen3-30B-A3B-W8A8.yaml` |
| [#3541](https://github.com/vllm-project/vllm-ascend/pull/3541) | `test_qwen3_32b_int8.py` | `Qwen3-32B-Int8.yaml` |
| [#3757](https://github.com/vllm-project/vllm-ascend/pull/3757) | `test_qwq_32b.py` | `QwQ-32B.yaml` |
| [#5616](https://github.com/vllm-project/vllm-ascend/pull/5616) | `test_qwen3_next_w8a8.py` | `Qwen3-Next-80B-A3B-Instruct-W8A8.yaml` |
| [#3541](https://github.com/vllm-project/vllm-ascend/pull/3541) | `test_qwen2_5_vl_7b.py` | `Qwen2.5-VL-7B-Instruct.yaml` |
| [#5301](https://github.com/vllm-project/vllm-ascend/pull/5301) | `test_qwen2_5_vl_7b_epd.py` | `Qwen2.5-VL-7B-Instruct-EPD.yaml` |
| [#3707](https://github.com/vllm-project/vllm-ascend/pull/3707) | `test_qwen2_5_vl_32b.py` | `Qwen2.5-VL-32B-Instruct.yaml` |
| [#3676](https://github.com/vllm-project/vllm-ascend/pull/3676) | `test_qwen3_32b_int8_a3_feature_stack3.py` | `Qwen3-32B-Int8-A3-Feature-Stack3.yaml` |
| [#3709](https://github.com/vllm-project/vllm-ascend/pull/3709) | `test_prefix_cache_qwen3_32b_int8.py` | `Prefix-Cache-Qwen3-32B-Int8.yaml` |
| [#5395](https://github.com/vllm-project/vllm-ascend/pull/5395) | `test_qwen3_next.py` | `Qwen3-Next-80B-A3B-Instruct-A2.yaml` |
| [#3474](https://github.com/vllm-project/vllm-ascend/pull/3474) | `test_qwen3_32b.py` | `Qwen3-32B.yaml` |
| [#3541](https://github.com/vllm-project/vllm-ascend/pull/3541) | `test_qwen3_32b_int8.py` | `Qwen3-32B-Int8-A2.yaml` |
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-03-03 20:13:43 +08:00
starmountain1997
248d07566f [CI] nightly test timeout (#6912)
### What this PR does / why we need it?

The nightly test is currently failing due to a
[timeout](https://github.com/vllm-project/vllm-ascend/actions/runs/22547280169/job/65326335134).

As noted in #6778, this issue can be resolved by applying this fix.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

run nightly test.

Co-authored-by: guozr <guozr1997@hotmail.com>
2026-03-03 09:31:46 +08:00
starmountain1997
bc1622338c [CI] Add long and short prompt tests for DeepSeek-V3.2 (#6536)
### What this PR does / why we need it?

This version has no divisibility constraint between tp and mtp+1.
However, each entry in cudagraph_capture_sizes must be a common multiple of
tp and mtp+1, with a maximum of tp * (mtp+1). Therefore, we fixed the
cudagraph_capture_sizes accordingly.
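For example (illustrative numbers, not necessarily the values used in CI): with tp = 8 and mtp = 2, we have mtp + 1 = 3, so every capture size must be a common multiple of 8 and 3, and the largest allowed value is tp * (mtp + 1) = 24; the only admissible size in that case is 24.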

We added a long-sequence test (64k input, 3k output) for the two-node
mixed deployment scenario. Due to the excessive time required for
performance benchmarking, we are only verifying functionality. The
single-node scenario is skipped because VRAM limitations prevent
launching the model with a max-model-len of 68,000.

and we also add aime2025 test for dual-node deepseek 3.2 nightly test.

### How was this patch tested?

test at nightly environment.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-26 10:58:50 +08:00
JIACHENG XU
64aea60f2e [EPLB][Nightly] Refactor UT (#6543)
### What this PR does / why we need it?
The basic configs are extracted and reused for the EPLB UT, so that if the
basic configs change later, the EPLB UT does not need to be modified
repeatedly.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: bigsir007 <xujiacheng12@huawei.com>
Co-authored-by: bigsir007 <xujiacheng12@huawei.com>
2026-02-14 10:56:29 +08:00
wangyu
c63b7a1188 [Test] Add initial multi modal cases of Qwen2.5-VL-7B-Instruct for disaggregated encoder (#5301)
### What this PR does / why we need it?
This PR adds disaggregated encoder  tests for Qwen2.5-VL-7B-Instruct 
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test locally and in CI.

- vLLM version: release/v0.12.0

---------

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
2026-02-06 17:30:17 +08:00
zhangxinyuehfad
81f3c09d6d [CI] Change A2 runner (#6557)
### What this PR does / why we need it?

This PR updates the CI runner from `linux-aarch64-a2-*` to
`linux-aarch64-a2b3-*` in various test configuration files. This change
is necessary to adapt to updates in the CI infrastructure.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The changes are configuration updates for CI tests. The correctness will
be verified by the CI pipeline.

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-02-05 23:43:57 +08:00
Nengjun Ma
78fad4e348 [Refactor] MLP weight prefetch to consistency with MoE Model's prefetching in terms of code and usage (#6442)
### What this PR does / why we need it?
Refactor MLP weight prefetch to be consistent with the MoE model's prefetching
in terms of code and usage.
The environment variables VLLM_ASCEND_ENABLE_PREFETCH_MLP,
VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE and
VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE are removed; usage is now as follows:

--additional-config '{"weight_prefetch_config": { "enabled": true,
"prefetch_ratio": {"mlp": { "gate_up": 1.0, "down": 1.0} }}}'

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-02-04 09:08:18 +08:00
starmountain1997
b6256e8bc9 Revert "[CI] fix DS3.2 single node cudagraph_sizes config (#6241)" (#6497)
# What this PR does / why we need it?
This PR reverts commit 8134146ab6, which
modified the DeepSeek V3.2 (W8A8) single-node nightly test
configuration, as there is no constraint between tp_size and MTP.
# Does this PR introduce any user-facing change?
No. This PR only affects CI/CD test configurations and does not
introduce any user-facing changes.
# How was this patch tested?
N/A for a revert PR. The changes restore the previously known working
configuration.
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-03 08:42:58 +08:00
starmountain1997
8134146ab6 [CI] fix DS3.2 single node cudagraph_sizes config (#6241)
# What this PR does / why we need it?
This PR fixes the single-node nightly test for DeepSeek V3.2 (W8A8)
model to ensure CI stability. The changes include:
1. Simplified nightly test matrix (nightly_test_a3.yaml):
- Temporarily reduced to only run deepseek3_2-w8a8 test case for
debugging
- Changed trigger from schedule/workflow_dispatch to support
push/pull_request for faster iteration
2. Updated DeepSeek V3.2 test configuration
(test_deepseek_v3_2_w8a8.py):
- Adjusted cudagraph_capture_sizes from [3, 6, 9, 12] to [8, 16, 24, 32]
for better performance
- Increased max-num-seqs from 4 to 8
- Increased gpu-memory-utilization from 0.92 to 0.98
- Increased num_speculative_tokens from 2 to 3
3. Added PR checkout step (_e2e_nightly_single_node.yaml):
- Added ability to checkout a specific PR (#6241) for testing
# Does this PR introduce any user-facing change?
No. This PR only affects CI/CD test configurations and does not
introduce any user-facing changes.
# How was this patch tested?
Mock nightly test has passed, see
[here](https://github.com/vllm-project/vllm-ascend/actions/runs/21574655952/job/62159656622?pr=6241).

<img width="1053" height="714" alt="a2f2ee359febb13e1f6330b1bd3c116b"
src="https://github.com/user-attachments/assets/3262ad0f-adec-4c71-871f-d9cf2db06fbc"
/>


- vLLM version: v0.14.1
- vLLM main:
d68209402d

---------

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-02 11:47:32 +08:00
InSec
86b6ecac4c [CI][BugFix] Import error fix. (#6293)
### What this PR does / why we need it?
Fix the **import error** of qwen3-next nightly test.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: InSec <1790766300@qq.com>
2026-01-28 22:07:47 +08:00
InSec
595b57c4d4 [CI][BugFix] Qwen3-Next nightly test fix. (#6247)
### What this PR does / why we need it?
Qwen3-Next nightly test fix. Temporarily avoid the accuracy issue in the
**full graph** mode.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
d68209402d

Signed-off-by: InSec <1790766300@qq.com>
2026-01-26 19:53:53 +08:00
starmountain1997
6c73b88dd6 [CI] Enable FLASHCOMM1 with layer_sharding and FULL_DECODE_ONLY in ds32 testing (#6115)
### What this PR does / why we need it?

This PR enables the FLASHCOMM1 communication optimization with layer
sharding for DeepSeek-V3.2 W8A8 model testing, to validate PR #5702. The
changes include (see the sketch after this list):

1. Enable FLASHCOMM1: set VLLM_ASCEND_ENABLE_FLASHCOMM1=1, which
improves performance for distributed inference.
2. Add layer sharding: configure layer_sharding: ["q_b_proj", "o_proj"].
3. Update baselines: adjust the performance baselines to reflect the
improvements from FLASHCOMM1 and layer sharding.
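A rough sketch of how these settings could be wired into a test run; the environment variable and the layer_sharding value come from the description above, while the surrounding command and the placement of layer_sharding inside --additional-config are assumptions:

```bash
# Enable the FLASHCOMM1 communication optimization for the run.
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

# Assumption: layer_sharding is passed via --additional-config here only for
# illustration; the nightly case may configure it differently.
vllm serve /path/to/DeepSeek-V3.2-W8A8 \
  --additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}'
```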

### Does this PR introduce _any_ user-facing change?

No. This is a CI/test-only change that enables new communication
optimization features for testing purposes.

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-01-23 19:48:37 +08:00
Nengjun Ma
ab676413e6 Default enable MLAPO (#5952)
### What this PR does / why we need it?
1) Enable MLAPO by default for DeepSeek MLA Attention W8A8 models on the
PD-disaggregation D instance, for example DeepSeekV3-W8A8 and
DeepSeek-R1-W8A8.
2) Enable MLAPO by default for DeepSeek SFA Attention W8A8 models,
currently DeepSeek-V3.2-W8A8.

### Does this PR introduce _any_ user-facing change?
Users no longer need to manually set VLLM_ASCEND_ENABLE_MLAPO=1 to enable
the MLAPO feature for DeepSeek W8A8 models.

The effect of enabling MLAPO for the SFA model deployed on a single A3 node.
Tested with tests/e2e/nightly/single_node/models/test_deepseek_v3_2_exp_w8a8.py,
dataset gsm8k-lite, without MTP, FULL GRAPH; output token throughput improves by about 19%:

| Metric | MLAPO not enabled by default | MLAPO enabled by default |
| --- | --- | --- |
| TTFT | 14055.8836 ms | 3753.1547 ms |
| ITL | 66.8171 ms | 61.4236 ms |
| Output Token Throughput | 104.9105 token/s | 125.2075 token/s |

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-01-22 09:26:39 +08:00
Li Wang
839e03cbc9 [Nightly] Use Qwen repo for qwen3-next (#6064)
### What this PR does / why we need it?
Use Qwen repo for qwen3-next to make nightly test happy. see
https://github.com/vllm-project/vllm-ascend/actions/runs/21179025996/job/60915871441
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-21 10:39:12 +08:00
zhangxinyuehfad
750c06c78a [CI] Add DeepSeek-V3.2-W8A8 nightly ci test (#4633)
### What this PR does / why we need it?
Add DeepSeek-V3.2-W8A8 nightly ci test:

DeepSeek-V3.2-W8A8, 1 node, DP2+TP8:
tests/e2e/nightly/models/test_deepseek_v3_2_w8a8.py

### Does this PR introduce _any_ user-facing change?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-20 21:05:15 +08:00
Icey
402872050a [Tests] move qwen3 performance test from nightly to e2e (#5980)
### What this PR does / why we need it?
Move the qwen3 performance test from nightly to e2e to intercept
performance degradation.

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-01-20 17:08:43 +08:00
zhangxinyuehfad
372f979aa5 [CI] Add DeepSeek R1 W8A8 HMB nightly ci (#5874)
### What this PR does / why we need it?

Add DeepSeek R1 W8A8 HBM nightly CI.

- vLLM version: v0.13.0
- vLLM main:
bde38c11df

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-15 20:48:20 +08:00
LI SHENGYONG
da958ee386 [EPLB]Eplb Config Renaming (#5533)
### What this PR does / why we need it?
1. Rename num_iterations_eplb_update to expert_heat_collection_interval.
2. Rename num_wait_worker_iterations to algorithm_execution_interval.
3. Rename init_redundancy_expert to num_redundant_experts because the
variable with the same meaning in vLLM is named this way.
4. Delete gate_eplb because we don't need this feature.
5. Move eplb config into a dict in additional config.
6. Depends on PR #5817.

### Does this PR introduce _any_ user-facing change?

before this pr:
`--additional-config '{"dynamic_eplb":true,
"num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150,
"init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'`

after this pr: 
`--additional-config
'{"eplb_config":{"dynamic_eplb":true,"expert_heat_collection_interval":4000,
"algorithm_execution_interval":150,"num_redundant_experts": 16,
"expert_map_path": "xxx.json"}}'`

### How was this patch tested?

#### test qwen3-235b eplb num_redundant_experts=16

without pr5817
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 83.33 |

with pr5817
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-15 10:26:44 +08:00
SILONG ZENG
7a6fde80b1 [CI]Add Kimi k2 nightly test (#5682)
### What this PR does / why we need it?
This PR adds performance and accuracy tests for the **Kimi-K2-Instruct-W8A8**
and **Kimi-K2-Thinking** models to the nightly test suite.

#### Test Configuration
**Kimi-K2-Instruct-W8A8**
- model: vllm-ascend/Kimi-K2-Instruct-W8A8
- Hardware: A3, 2 Nodes (32 NPUs total, 16 NPUs per node)
- Architecture: Unified Distributed Inference
- Parallelism: **DP4 + TP8 + EP** (Data Parallel 4, Tensor Parallel 8,
Expert Parallel enabled).
  - Optimization: **torchair graph**, **no-prefix-caching**.
  - Node 0: DP Rank 0-1, Local DP 2, Tensor Parallel 8.
  - Node 1: DP Rank 2-3, Local DP 2, Tensor Parallel 8.
- Benchmarks:
  - Performance: vllm-ascend/GSM8K-in3500-bs2800.
  - Accuracy: vllm-ascend/gsm8k-lite.

**Kimi-K2-Thinking**
- Model: moonshotai/Kimi-K2-Thinking
- Hardware: A3, 1 Node (16 NPUs total)
- Architecture: Single Node Distributed Inference
- Parallelism: TP16 + EP (Tensor Parallel 16, Expert Parallel enabled).
  - Optimization: **no-prefix-caching**
- Benchmarks:
  - Performance: vllm-ascend/GSM8K-in3500-bs400.
  - Accuracy: vllm-ascend/gsm8k-lite.
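As a rough sketch, the Kimi-K2-Thinking single-node configuration above maps onto standard vLLM flags roughly as follows; this is not the exact nightly command, which is not shown in this PR:

```bash
# Illustrative mapping of the configuration above to vLLM flags; not the exact nightly command.
vllm serve moonshotai/Kimi-K2-Thinking \
  --tensor-parallel-size 16 \
  --enable-expert-parallel \
  --no-enable-prefix-caching \
  --trust-remote-code
```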


### Does this PR introduce _any_ user-facing change?
**Yes.** This PR enhances the `AisbenchRunner` to support dynamic
configuration of the `trust_remote_code` flag. This allows the
AISBench client to successfully load tokenizers for models that require
custom code execution (e.g., **Kimi-K2-Thinking** and
**Kimi-K2-Instruct-W8A8**).

**Changes:**
1. `AisbenchRunner.__init__`: added the ability to capture the
`trust_remote_code` parameter from the case configuration.
``` python
         self.batch_size = aisbench_config["batch_size"]
         self.request_rate = aisbench_config.get("request_rate", 0)
+        self.trust_remote_code = aisbench_config.get("trust_remote_code", False)
         self.temperature = aisbench_config.get("temperature")
         self.top_k = aisbench_config.get("top_k")
```
2. `AisbenchRunner._init_request_conf`: added a regex substitution to
inject the parameter into the generated dynamic configuration file.
``` python
         content = re.sub(r'batch_size.*', f'batch_size = {self.batch_size},',
                          content)
+        content = re.sub(r'trust_remote_code=.*',
+                         f'trust_remote_code={self.trust_remote_code},',
+                         content)
         content = content.replace("top_k", "#top_k")
         content = content.replace("seed", "#seed")
```

**Details:**
- New config key: users can add `"trust_remote_code": True` to any
dictionary within the `aisbench_cases` list.
- Default value: defaults to `False` to maintain existing security
protocols for standard models.
- Impact: resolves the `ValueError` seen when benchmarking reasoning models
or models with custom tokenizers that previously failed during the
AISBench local initialization phase.

**User Example:**
Users can now enable custom code execution for specific models (like
Kimi-K2-Thinking) directly in their test suite:
```
# Now supported in test scripts:
aisbench_cases = [{
    "case_type": "performance",
    "request_conf": "vllm_api_stream_chat",
    "trust_remote_code": True,  # New user-facing parameter
    ...
}]
```
### How was this patch tested?
Actions:
- https://github.com/vllm-project/vllm-ascend/actions/runs/20849768433

Result as following:

- **Kimi-K2-Instruct-W8A8**(25m25s)
1. Accuracy test
```
dataset    version    metric    mode      vllm-api-general-chat
---------  ---------  --------  ------  -----------------------
gsm8k      7cd45e     accuracy  gen                       96.88
```
2. Perf test
```
╒══════════════════════════╤═════════╤════════════════╤════════════════╤═══════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕
│ Performance Parameters   │ Stage   │ Average        │ Min            │ Max           │ Median         │ P75            │ P90            │ P99            │  N  │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪═══════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡
│ E2EL                     │ total   │ 34571.489 ms   │ 28657.8054 ms  │ 36294.1788 ms │ 34714.7329 ms  │ 35247.2724 ms  │ 35526.6758 ms  │ 36146.4314 ms  │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TTFT                     │ total   │ 2043.9136 ms   │ 627.4718 ms    │ 3532.3978 ms  │ 1906.0194 ms   │ 2307.7979 ms   │ 2883.8528 ms   │ 3283.7012 ms   │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TPOT                     │ total   │ 127.5591 ms    │ 106.4937 ms    │ 137.107 ms    │ 128.3135 ms    │ 129.5704 ms    │ 131.1332 ms    │ 134.1087 ms    │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ ITL                      │ total   │ 126.5571 ms    │ 0.0095 ms      │ 1340.783 ms   │ 104.1398 ms    │ 110.1272 ms    │ 119.6124 ms    │ 950.2924 ms    │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ InputTokens              │ total   │ 3516.6055      │ 3014.0         │ 3985.0        │ 3525.0         │ 3525.0         │ 3586.8         │ 3800.67        │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokens             │ total   │ 256.0          │ 256.0          │ 256.0         │ 256.0          │ 256.0          │ 256.0          │ 256.0          │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokenThroughput    │ total   │ 7.4143 token/s │ 7.0535 token/s │ 8.933 token/s │ 7.3744 token/s │ 7.4118 token/s │ 7.5608 token/s │ 8.7051 token/s │ 512 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧═══════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛
╒══════════════════════════╤═════════╤═══════════════════╕
│ Common Metric            │ Stage   │ Value             │
╞══════════════════════════╪═════════╪═══════════════════╡
│ Benchmark Duration       │ total   │ 279430.9375 ms    │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Requests           │ total   │ 512               │
├──────────────────────────┼─────────┼───────────────────┤
│ Failed Requests          │ total   │ 0                 │
├──────────────────────────┼─────────┼───────────────────┤
│ Success Requests         │ total   │ 512               │
├──────────────────────────┼─────────┼───────────────────┤
│ Concurrency              │ total   │ 63.3452           │
├──────────────────────────┼─────────┼───────────────────┤
│ Max Concurrency          │ total   │ 64                │
├──────────────────────────┼─────────┼───────────────────┤
│ Request Throughput       │ total   │ 1.8323 req/s      │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens       │ total   │ 1800502           │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total   │ 1720.5255 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total generated tokens   │ total   │ 131072            │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput   │ total   │ 6443.4598 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput  │ total   │ 469.0676 token/s  │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput   │ total   │ 6912.5274 token/s │
╘══════════════════════════╧═════════╧═══════════════════╛
```

- **Kimi-K2-Thinking**(43m51s)
1. Accuracy test
```
dataset    version    metric    mode      vllm-api-general-chat
---------  ---------  --------  ------  -----------------------
gsm8k      7cd45e     accuracy  gen                      100.00
```
2. Perf test
```
╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕
│ Performance Parameters   │ Stage   │ Average        │ Min            │ Max            │ Median         │ P75            │ P90            │ P99            │  N  │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡
│ E2EL                     │ total   │ 172384.3573 ms │ 34456.5517 ms  │ 205922.9407 ms │ 174844.2216 ms │ 202656.092 ms  │ 204428.9502 ms │ 205468.6776 ms │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TTFT                     │ total   │ 138740.3228 ms │ 655.1066 ms    │ 171777.3003 ms │ 141088.0561 ms │ 169237.5599 ms │ 170716.4954 ms │ 171393.1278 ms │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TPOT                     │ total   │ 131.9374 ms    │ 90.6331 ms     │ 135.4144 ms    │ 132.405 ms     │ 132.948 ms     │ 133.7549 ms    │ 135.2543 ms    │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ ITL                      │ total   │ 130.9028 ms    │ 0.0099 ms      │ 960.3683 ms    │ 116.9623 ms    │ 122.3127 ms    │ 132.0522 ms    │ 886.4662 ms    │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ InputTokens              │ total   │ 3514.575       │ 3014.0         │ 3843.0         │ 3525.0         │ 3525.0         │ 3588.0         │ 3801.08        │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokens             │ total   │ 256.0          │ 256.0          │ 256.0          │ 256.0          │ 256.0          │ 256.0          │ 256.0          │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokenThroughput    │ total   │ 1.6799 token/s │ 1.2432 token/s │ 7.4296 token/s │ 1.4642 token/s │ 1.4737 token/s │ 1.8754 token/s │ 7.125 token/s  │ 400 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛
╒══════════════════════════╤═════════╤═══════════════════╕
│ Common Metric            │ Stage   │ Value             │
╞══════════════════════════╪═════════╪═══════════════════╡
│ Benchmark Duration       │ total   │ 1166795.568 ms    │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Requests           │ total   │ 400               │
├──────────────────────────┼─────────┼───────────────────┤
│ Failed Requests          │ total   │ 0                 │
├──────────────────────────┼─────────┼───────────────────┤
│ Success Requests         │ total   │ 400               │
├──────────────────────────┼─────────┼───────────────────┤
│ Concurrency              │ total   │ 59.0967           │
├──────────────────────────┼─────────┼───────────────────┤
│ Max Concurrency          │ total   │ 64                │
├──────────────────────────┼─────────┼───────────────────┤
│ Request Throughput       │ total   │ 0.3428 req/s      │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens       │ total   │ 1405830           │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total   │ 25.332 token/s    │
├──────────────────────────┼─────────┼───────────────────┤
│ Total generated tokens   │ total   │ 102400            │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput   │ total   │ 1204.864 token/s  │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput  │ total   │ 87.7617 token/s   │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput   │ total   │ 1292.6258 token/s │
╘══════════════════════════╧═════════╧═══════════════════╛
```

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
2026-01-12 15:56:07 +08:00
1092626063
f63c1341d9 [Feature] GLM4.6 support mtp with fullgraph (#5460)
### What this PR does / why we need it?
GLM4.6 supports MTP with full graph to improve performance.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```bash
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE=AIV

vllm serve /weight/glm4.6_w8a8_with_float_mtp \
  --data-parallel-size 1 \
  --tensor-parallel-size 16 \
  --seed 1024 \
  --served-model-name glm \
  --max-model-len 35000 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 16 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
--speculative-config '{"num_speculative_tokens": 1,
"model":"/weight/glm4.6_w8a8_with_float_mtp", "method":"mtp"}' \
--compilation-config '{"cudagraph_capture_sizes": [1,2,4,8,16,32],
"cudagraph_mode": "FULL_DECODE_ONLY"}' \
  --async-scheduling \
```

test case:
```bash
vllm bench serve \
  --backend vllm \
  --dataset-name prefix_repetition \
  --prefix-repetition-prefix-len 22400 \
  --prefix-repetition-suffix-len 9600 \
  --prefix-repetition-output-len 1024 \
  --num-prompts 1 \
  --prefix-repetition-num-prefixes 1 \
  --ignore-eos \
  --model glm \
  --tokenizer /weight/glm4.6_w8a8_with_float_mtp \
  --seed 1000 \
  --host 0.0.0.0 \
  --port 8000 \
  --endpoint /v1/completions \
  --max-concurrency 1 \
  --request-rate 1

```
- vLLM version: v0.13.0
- vLLM main:
5326c89803

Signed-off-by: 1092626063 <1092626063@qq.com>
2026-01-09 16:07:42 +08:00
InSec
2d713fee93 [CI] Accuracy issue of qwen3-next-w8a8 nightly test fix. (#5746)
### What this PR does / why we need it?
Disable the **Full Graph** mode to temporarily avoid the accuracy issue for
**Qwen3-Next-80B-A3B-Instruct-W8A8**.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: InSec <1790766300@qq.com>
2026-01-09 15:55:13 +08:00
LeeWenquan
a3a74d6984 [CI] Add qwen3 next ci (#5395)
### What this PR does / why we need it?
Add Qwen3Next CI 

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2026-01-09 10:29:09 +08:00
Icey
137f28341d [Tests] Add qwen3-8b nightly test (#5597)
### What this PR does / why we need it?
Add qwen3-8b nightly test 

- vLLM version: v0.13.0
- vLLM main:
7157596103
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
2026-01-07 18:42:05 +08:00
InSec
089ca2ddcc [Nightly][Test] Add Qwen3-Next-80B-A3B-Instruct-W8A8 nightly test (#5616)
### What this PR does / why we need it?
There was an accuracy issue with the **Qwen3-Next-80B-A3B-Instruct-W8A8**
model in the old version of **Triton-Ascend**, so we are now adding a
nightly test to monitor it.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
7157596103

Signed-off-by: IncSec <1790766300@qq.com>
2026-01-06 17:36:00 +08:00
Li Wang
e760aae1df [1/N] Refactor nightly test structure (#5479)
### What this PR does / why we need it?
This patch is part of a series of refactoring actions, including clarifying
the directory structure of the nightly tests, refactoring the config
retrieval logic, and optimizing the workflow. This is the first step:
refactoring the directory structure of the nightly tests to make it more
readable and logical.

- vLLM version: v0.13.0
- vLLM main:
5326c89803

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-30 19:03:02 +08:00