628 Commits

Author SHA1 Message Date
meihanc
fbb93ad8f2 [bugfix]update bishengir source envs (#5582)
### What this PR does / why we need it?
Due to the update of the Bisheng version's installation path, the
corresponding source path in the environment variables needs to be
updated.

- vLLM version: v0.13.0
- vLLM main:
7157596103
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-01-05 09:13:40 +08:00
dsxsteven
3c7e6c6817 [CI] Add multi-nodes longseq configs of DeepSeek-R1-W8A8 & Qwen3-235B-W8A8 (#5381)
### What this PR does / why we need it?
add DeepSeek-R1-W8A8 and Qwen3-235B-W8A8 configs in multi-nodes and
longseq (PCP&DCP) scenario

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: daishixun <dsxsteven@sina.com>
2026-01-04 10:38:40 +08:00
dependabot[bot]
799b41a9f4 Bump actions/download-artifact from 4 to 7 (#5465)
Bumps
[actions/download-artifact](https://github.com/actions/download-artifact) from 4 to 7.

- vLLM version: v0.13.0
- vLLM main:
5326c89803

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-04 08:54:06 +08:00
dependabot[bot]
ad40494b84 Bump actions/upload-artifact from 4 to 6 (#5466)
Bumps
[actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 6.

- vLLM version: v0.13.0
- vLLM main:
5326c89803

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-04 08:53:52 +08:00
Li Wang
32a56496cc [Nightly] Trigger image build for nightly (#5547)
### What this PR does / why we need it?
We should also trigger image build when nightly test related files are
changed to ensure the image is valid for nightly tests. Please note that
this only applies to image with the tag `main*`(which means build
triggered by PR).
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
7157596103

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-04 08:50:57 +08:00
wjunLu
3c2d3e52e5 [Main2Main] Upgrade vllm commit to 1230 (#5495)
### What this PR does / why we need it?

Upgrade vllm commit to 1230

Affected by https://github.com/vllm-project/vllm/pull/27614 (and the
core PR https://github.com/vllm-project/vllm/pull/26866), we have to
make the following changes:

1. Modify `tests/e2e/multicard/test_aclgraph_capture_replay.py` to keep
compatible with both vllm version of `v0.13.0` and latest main commitID,
while vllm enables async scheduling by default
2. Skip `test_guided_decoding.py` due to xgrammar errors
(https://github.com/vllm-project/vllm-ascend/issues/5524)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

---------

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2025-12-31 09:44:35 +08:00
lilinsiman
46862ce1af [main][test] Refactor the mtp and eagle test case (#5326)
### What this PR does / why we need it?
1. Refactor the current test with mtp and eagle cases
2. Add new necessary cases with mtp and eagle

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-12-31 09:22:58 +08:00
Li Wang
073097a9a1 [3/N][Nightly] Move ops tests to nightly (#5538)
### What this PR does / why we need it?
Follow https://github.com/vllm-project/vllm-ascend/pull/5479, since the
cases `tests/e2e/nightly/ops/triton/` have to been verified stable
enough, we should move it to nightly in principle

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-30 20:50:44 +08:00
Li Wang
e760aae1df [1/N] Refactor nightly test structure (#5479)
### What this PR does / why we need it?
This patch is a series of refactoring actions, including clarifying the
directory structure of nightly tests, refactoring the config retrieval
logic, and optimizing the workflow, etc. This is the first step:
refactoring the directory structure of nightly to make it more readable
and logical.

- vLLM version: v0.13.0
- vLLM main:
5326c89803

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-30 19:03:02 +08:00
meihanc
8c4e9bb76b [CI]update triton ascend version (#5392)
### What this PR does / why we need it?
update triton-ascend version to 1229 and bisheng version in 1225;

- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2025-12-30 09:51:45 +08:00
Nengjun Ma
5e96f94d2a Update corresponding vllm commit ID to 12 29 (#5475)
### What this PR does / why we need it?
- Fixes vllm break:
1. [[BugFix] register quant scale tensors as buffer #31395]
(https://github.com/vllm-project/vllm/pull/31395)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
5326c89803

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-12-29 22:48:05 +08:00
Ronald
e7e1a7dc05 [Feature] support eager mode in model runner v2 (#5210)
### What this PR does / why we need it?
#5051 only implement a basic framework for model runner v2, but there
are still some bugs for e2e functionality, this PR aim to enable basic
functionality.
model runner v2 plans:
https://github.com/vllm-project/vllm-ascend/issues/5208

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-12-29 15:28:34 +08:00
ZT-AIA
24328aaf00 update vllm pin to 12.27 (#5412)
### What this PR does / why we need it?
update vllm pin to 12.27
1、Fix Qwen2-MoE shared_expert_gate
:https://github.com/vllm-project/vllm/pull/31339
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
vLLM version: release/v0.13.0
vLLM main:
5326c89803
Co-authored-by: leo-pony [nengjunma@outlook.com](nengjunma@outlook.com)

---------

Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
2025-12-28 00:19:36 +08:00
Li Wang
1d81bfaed1 Fix nightly (#5413)
### What this PR does / why we need it?
This pacth mainly do the following things:
1. Bugfix for multi_node_tests log, log names must be unique when
uploading logs.
2. Optimize `get_cluster_ips` logic, increase the max retry times for
robustness
3. Abandoned the existing gh-proxy temporarily until it is stable
enough.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-27 18:16:46 +08:00
whx
3f33ad23fe [BugFix] Fix npu-cpu offloading interface change bug. (#5290)
### What this PR does / why we need it?
Last month the interface of `OffloadingSpec` has
changed(https://github.com/vllm-project/vllm/pull/27743). This PR fixes
this bug and adds e2e test for cpu offloading.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
CI passed with new added test.


- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-12-27 10:21:20 +08:00
Nengjun Ma
f5af6bbd1e [CI] Add qwen-235b-a22b a2 multi-node test (#5393)
### What this PR does / why we need it?
Qwen3-235B-A22B belongs to the TopN model, but there is currently a lack
of care for the test cases of the wen3-235B-A22B model on Atlas A2, and
most of the machines currently owned by users in the community are A2.
When users encounter problems, we currently have no way of knowing
whether the model runs normally on the corresponding version of the
code, so we added it. In addition, we currently see TopN models such as:
qwen-dense, qwen3-30b-a3b, Qwen3-Next, Qwen2.5-Omni, but Qwen3-235B-A22B
is missing.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
Test with multi-node, result as following:
1. Accuracy test (Time for executing this test case: 25 minutes)
test running successfully, accuracy as following:
```
dataset    version    metric    mode      vllm-api-general-chat
---------  ---------  --------  ------  -----------------------
gsm8k      7cd45e     accuracy  gen                       95.68
```
2. Perf test  (Time for executing this test case: 1h15 minutes)
test running successfully, throughput as following(This is the atlas A3,
for A2 the result about A3/1.3):
```
╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤══════╕
│ Performance Parameters   │ Stage   │ Average        │ Min            │ Max            │ Median         │ P75            │ P90            │ P99            │  N   │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪══════╡
│ E2EL                     │ total   │ 384086.3958 ms │ 214767.0486 ms │ 528014.771 ms  │ 387621.5746 ms │ 388776.7492 ms │ 390164.3559 ms │ 488105.8512 ms │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ TTFT                     │ total   │ 159409.9868 ms │ 1849.4588 ms   │ 302439.6965 ms │ 162183.7007 ms │ 162965.477 ms  │ 164274.1936 ms │ 262578.6041 ms │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ TPOT                     │ total   │ 149.8842 ms    │ 130.2175 ms    │ 151.2625 ms    │ 150.473 ms     │ 150.6978 ms    │ 150.9102 ms    │ 151.2131 ms    │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ ITL                      │ total   │ 149.6789 ms    │ 0.0099 ms      │ 283.0242 ms    │ 150.3276 ms    │ 156.8649 ms    │ 168.1372 ms    │ 199.378 ms     │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ InputTokens              │ total   │ 3654.3079      │ 3108.0         │ 4280.0         │ 3629.0         │ 3728.0         │ 3842.1         │ 4079.0         │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ OutputTokens             │ total   │ 1500.0         │ 1500.0         │ 1500.0         │ 1500.0         │ 1500.0         │ 1500.0         │ 1500.0         │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ OutputTokenThroughput    │ total   │ 3.935 token/s  │ 2.8408 token/s │ 6.9843 token/s │ 3.8698 token/s │ 3.8799 token/s │ 3.9916 token/s │ 6.2137 token/s │ 2800 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧══════╛
╒══════════════════════════╤═════════╤═══════════════════╕
│ Common Metric            │ Stage   │ Value             │
╞══════════════════════════╪═════════╪═══════════════════╡
│ Benchmark Duration       │ total   │ 4391524.3389 ms   │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Requests           │ total   │ 2800              │
├──────────────────────────┼─────────┼───────────────────┤
│ Failed Requests          │ total   │ 0                 │
├──────────────────────────┼─────────┼───────────────────┤
│ Success Requests         │ total   │ 2800              │
├──────────────────────────┼─────────┼───────────────────┤
│ Concurrency              │ total   │ 244.8903          │
├──────────────────────────┼─────────┼───────────────────┤
│ Max Concurrency          │ total   │ 256               │
├──────────────────────────┼─────────┼───────────────────┤
│ Request Throughput       │ total   │ 0.6376 req/s      │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens       │ total   │ 10232062          │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total   │ 22.924 token/s    │
├──────────────────────────┼─────────┼───────────────────┤
│ Total generated tokens   │ total   │ 4200000           │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput   │ total   │ 2329.9568 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput  │ total   │ 956.3877 token/s  │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput   │ total   │ 3286.3445 token/s │
╘══════════════════════════╧═════════╧═══════════════════╛
```
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-12-26 23:46:09 +08:00
ZT-AIA
1d8aa892bf Update vllm pin to 12.26 (#5378)
### What this PR does / why we need it?
Update vllm pin to 12.26
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

---------

Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-26 23:44:48 +08:00
Wang Kunpeng
bc5b7a5fb5 [bugfix] Fix MHA model runtime error in aclgraph mode (#5397)
### What this PR does / why we need it?
Currently, MHA models (eg: minicpm-2b, Baichuan-7b) will encounter
errors when running in piecewise graph mode, with error messages similar
to:
```
(E89999):  When layout is TND and PA not enabled, keyT(8) and valueT(8) must be equal to the last element of actualSeqenceLengthKV(5)[FUNC:CheckInputShapeWhenLayoutIsTND][FILE:prompt_flash_attention_tiling.cpp][LINE:3618]
```
The error occurs because the qkv in the Prefill stage is also padded,
causing the shape to be inconsistent with actual_seq_lengths.
Add unpadding logic for kv.

- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-12-26 21:37:28 +08:00
ZT-AIA
adaa89a7a5 Update vllm pin to 12.25 (#5342)
### What this PR does / why we need it?
- Fix vllm break in the pr:
1.[Drop v0.14 deprecations
]https://github.com/vllm-project/vllm/pull/31285
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: ZT-AIA <1028681969@qq.com>
2025-12-26 14:05:40 +08:00
Li Wang
c2f776b846 [Nightly] Initial logging for nightly multi-node testing (#5362)
### What this PR does / why we need it?
Currently, our multi-node logs only show the master node's logs (via the
Kubernetes API), which is insufficient for effective problem
localization if other nodes experience issues. Therefore, this pull
request adds the ability to upload logs for other nodes.

Next plan: Output structured directory logs, including logs from each
node and the polog.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-26 11:39:07 +08:00
Mengqing Cao
4ce32c1a8d [CI] Skip failed test cases to recover CI (#5368)
### What this PR does / why we need it?
Skip `test_minicpm_2b` to recover CI. Not sure why this ci failed, but
we'd skip it quickly to recover CI.

test_minicpm_2b related failed PRs:

https://github.com/vllm-project/vllm-ascend/actions/runs/20502414919/job/58911802576?pr=5274

https://github.com/vllm-project/vllm-ascend/actions/runs/20502596934/job/58912315736?pr=5322

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-26 08:18:23 +08:00
Wang Kunpeng
13cd6362c6 [bugfix] fix Error 'ValueError: Duplicate layer name' (#5280)
### What this PR does / why we need it?
When matmul_and_reduce is enabled, the prefix attribute is required.
However, in some models, the prefix is not passed correctly, causing
errors when starting the service.
The issue of incorrect prefix passing will be fixed in vLLM in the
future.

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-12-25 10:43:24 +08:00
Ascendyh
a90482803d [Kernel] add l2norm triton kernel (#4595)
### What this PR does / why we need it?
This pull request introduces an L2 normalization kernel implemented in
Triton, specifically optimized for Ascend NPUs.
### Does this PR introduce _any_ user-facing change?
No, this PR does not introduce any user-facing changes.
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: Ascendyh <hw7osiris@outlook.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-25 06:06:18 +08:00
Mengqing Cao
e54630e01c Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138) (#5317)
### What this PR does / why we need it?
Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138) as
it causes deepseek v3.2 hang error


- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-12-24 22:24:17 +08:00
wangxiyuan
fb3d6ca08c Cleanup uesless env (#5270)
`VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` is not used anywhere, let's
remove it.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-24 22:07:59 +08:00
Li Wang
2f03a2f4a4 [CI] Skip some failed ops tests (#5309)
### What this PR does / why we need it?
Skip some failed ops tests

- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-24 18:29:34 +08:00
Nengjun Ma
42c989a437 Update vllm pin to 12.24 (#5307)
### What this PR does / why we need it?
Fix vllm break in the pr:
1. [Add MiMo-V2-Flash support]
(https://github.com/vllm-project/vllm/pull/30836)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Co-authored-by: zxwang [1476209578@qq.com](mailto:1476209578@qq.com)

- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: zxwang <1476209578@qq.com>
Co-authored-by: zxwang <1476209578@qq.com>
2025-12-24 17:24:31 +08:00
zhangyiming
bd4fb871c6 [CI] Add skipped testcases. (#5254)
### What this PR does / why we need it?
Some E2E testcases are not in our CI workflow, this PR add them back.

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: menogrey <1299267905@qq.com>
2025-12-24 10:41:32 +08:00
Nengjun Ma
3b59f20a28 update to vllm 12-19 (#5223)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?
Fix vllm break:
1. [Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4%
TTFT improvement] (https://github.com/vllm-project/vllm/pull/29558)
Fix Solution: Add the now-necessary `all2all_backend` parameter. The
impact of this parameter on the original `set_splitting_ops_for_v1`
implementation is only that graph mode is disabled in `vllm` if
`deepep_high_throughput` is enabled; it has no effect on the
`vllm-ascend` logic.

2.[Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention
interface ] (https://github.com/vllm-project/vllm/pull/30684)
Fix Solution: The reason why the GPU does not need to convert qkv to 3D
is that the GPU's flash_attention operator is compatible with 3D and 4D
(b s h d and s b ( h d)), but the NPU's flash_attention_unpad operator
only supports 3D (s b ( h d)). Therefore, we need to introduce the
reshape_qkv_to_3d operation.

4.Skip Tencent-Hunyuan/HunyuanOCR test case, as it has following issue
in upgrade vllm code:
https://github.com/vllm-project/vllm-ascend/issues/5297

### How was this patch tested?


Co-authored-by: zxwang <1476209578@qq.com>

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: zxwang <1476209578@qq.com>
Co-authored-by: zxwang <1476209578@qq.com>
2025-12-23 23:52:11 +08:00
zhangxinyuehfad
8ae7fca947 [CI] refect e2e ci test (#5246)
### What this PR does / why we need it?
efect e2e ci test:
1. tests/e2e/singlecard/pooling/test_embedding.py: remove the eager
parameter and rename test case
2. tests/e2e/singlecard/pooling/test_scoring.py: Rename test cases
3. tests/e2e/singlecard/pooling/test_classification.py: Rename test case
4. tests/e2e/singlecard/test_quantization.py: remove the eager parameter
and chage model to vllm-ascend/Qwen2.5-0.6B-W8A8 and Rename test case
5. tests/e2e/multicard/test_shared_expert_dp.py: Rename test cases
6. tests/e2e/singlecard/test_sampler.py: Rename test cases
7. tests/e2e/singlecard/test_aclgraph_accuracy.py: Rename test cases
8. tests/e2e/multicard/test_offline_inference_distributed.py: Rename
test cases and remove the eager parameter
9. tests/e2e/multicard/long_sequence/test_accuracy.py: Rename test cases
and remove the eager parameter
10. tests/e2e/multicard/long_sequence/test_basic.py: Rename test cases
and remove the eager parameter
11.tests/e2e/multicard/test_expert_parallel.py:remove the eager
parameter
12.tests/e2e/multicard/test_full_graph_mode.py:remove the eager
parameter
13.tests/e2e/multicard/test_ilama_lora_tp2.py:remove the eager parameter

14.tests/e2e/singlecard/spec_decode_v1/test_v1_mtp_correctness.py:remove
the eager parameter
15.tests/e2e/singlecard/spec_decode_v1/test_v1_spec_decode.py:remove the
eager parameter
16.tests/e2e/singlecard/test_aclgraph_accuracy.py:remove the eager
parameter
17.tests/e2e/singlecard/test_camem.py:remove the eager parameter
18.tests/e2e/singlecard/test_ilama_lora.py:remove the eager parameter

19.tests/e2e/singlecard/test_multistream_overlap_shared_expert.py:remove
the eager parameter
20.tests/e2e/singlecard/test_vlm.py:remove the eager parameter
21.tests/e2e/singlecard/test_xli:remove the eager parameter

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-12-23 18:42:35 +08:00
Li Wang
5d1f6daef6 [CI] Mock spawn for vlm tests (#5279)
### What this PR does / why we need it?
Using `spawn` in continuous testing scenarios
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-23 18:35:06 +08:00
SILONG ZENG
fa0c212bfa [test]Corrected the Qwen3-Omni-30B-A3B-Instruct accuracy test configuration in nightly tests. (#5195)
### What this PR does / why we need it?
Corrected the Qwen3-Omni-30B-A3B-Instruct accuracy test configuration in
nightly tests.
link: https://github.com/vllm-project/vllm-ascend/pull/4911

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
2025-12-23 14:17:27 +08:00
SILONG ZENG
29a93daa82 [CI]refactor: standardize test case naming convention (#5243)
### What this PR does / why we need it?
- Standardize test case naming in `vllm-ascend/tests/e2e/multicard/` to
follow the `<model>_<feature>_<distributed>` convention.

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
2025-12-23 14:13:42 +08:00
meihanc
592cfb6a6f [CI] Add Triton Ascend in CI (#4921)
Add triton-ascend in UT and e2e

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2025-12-23 12:47:35 +08:00
Mengqing Cao
449f8f65a7 [KV-Sharing] Support KV-Sharing feature in CLA models (#4138)
### What this PR does / why we need it?
Support KV-Sharing feature in CLA (cross layer attention) models, which
sharing kv cache in some layers.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
2025-12-23 10:48:31 +08:00
zhangsicheng5
78aa7f2693 [feature] support pcp + mtp in full graph (#4572)
1. support pcp + mtp in full graph
2. pcp/dcp related mtp bugfix
3. support pcp + mtpx

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
2025-12-22 16:13:39 +08:00
dependabot[bot]
4861484b68 Bump actions/checkout from 4 to 6 (#5234)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-12-22 15:16:43 +08:00
dependabot[bot]
11a25497ce Bump actions/upload-artifact from 4 to 6 (#5233)
Bumps
[actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 6.

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-12-22 15:15:45 +08:00
weiguihua2
74aa968a9f [e2e] add pcp e2e (#5141)
### What this PR does / why we need it?
add pcp accuracy e2e test case

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-12-20 16:56:46 +08:00
wangxiyuan
758d81dcb1 Drop 0.12.0 support (#5146)
We decided to release v0.13.0 soon. So no need to support 0.12.0 now.
Let's drop it.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-20 09:38:53 +08:00
Li Wang
243ab7d720 [CI] Use offline mode for nightly test (#5187)
### What this PR does / why we need it?
For single node test, the lack of a retry mechanism for accessing
ModelScope resulted in an HTTP 400 error sometimes. I recommend using a
local offline cache instead.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-19 21:21:42 +08:00
Li Wang
14931d2a86 [CI] Fix image merge bug (#5197)
### What this PR does / why we need it?
Some tiny bugfix for
https://github.com/vllm-project/vllm-ascend/pull/5175

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-19 17:30:48 +08:00
wangxiyuan
636265be6d [CI] Improve CI (#5078)
Raname workflow to be clear.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-19 15:34:35 +08:00
Li Wang
a6eaf816f1 [Image] Refactor image build (#5175)
### What this PR does / why we need it?

In the past time, we used a hybrid architecture cross-compilation
approach for image building. This method had a problem:
cross-compilation performance was very poor, leading to extremely long
build times(abort 4h) and even a probability of failure(see
https://github.com/vllm-project/vllm-ascend/actions/runs/20152861650/job/57849208186).
Therefore, I recommend using a separate architecture build followed by
manifest merging, which significantly reduces image build time(20min).

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-19 14:35:51 +08:00
LookAround0301
76e58d66be support basic long_seq feature st (#5140)
### What this PR does / why we need it?
support basic long_seq feature st 

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: LookAround <lixushi@huawei.com>
2025-12-19 10:50:01 +08:00
zhangxinyuehfad
cee9b715b5 [Bugfix] install trition for test_custom_op (#5112)
### What this PR does / why we need it?
1. install trition for test_custom_op
2. tests/e2e/nightly/ops test timeout, set timeout-minutes let it test
over:

https://github.com/vllm-project/vllm-ascend/actions/runs/20326482497/job/58392757707?pr=5112
3. ignore test_dispatch_ffn_combine until it is fixed @kiscad 

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-12-19 10:40:46 +08:00
ck-hw-1018
71e544e259 [test] add w4a8 accuracy case (#5110)
### What this PR does / why we need it?

This PR add w4a8  accuracy testcase for e2e test

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

By running the test

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: cuikai (C) <c00827167@china.huawei.com>
Co-authored-by: cuikai (C) <c00827167@china.huawei.com>
2025-12-18 14:10:14 +08:00
ZixuanWang
b1a853b0f6 Upgrade vllm commit hash to 1216 (#5053)
### What this PR does / why we need it?
Upstream vLLM PR #30212 https://github.com/vllm-project/vllm/pull/30212
refactored the attention backend selection interface, This PR adapts
vllm-ascend's get_attn_backend_cls to align with the new upstream
standard, ensuring compatibility and reducing maintenance overhead.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

co-author:[leo-pony][nengjunma@outlook.com](mailto:nengjunma@outlook.com)
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zxwang <1476209578@qq.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
2025-12-17 08:48:36 +08:00
whx
cee521bad5 [Nightly][BugFix] Install triton for nightly e2e op test. (#5096)
### What this PR does / why we need it?
This PR adds triton-ascend installation to nightly e2e single card
environment.

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-12-16 21:31:53 +08:00
Li Wang
c6f60e8dd8 [Nightly] Upgrade single node test to latest main (#5101)
### What this PR does / why we need it?
Sync source code from vllm-ascend on nightly tests

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-16 21:28:45 +08:00