608 Commits

Author SHA1 Message Date
Li Wang
99e1ea0fe6 [v0.18.0][Misc] Upgrade torch_npu to pre-release built version (#7918)
### What this PR does / why we need it?
This PR upgrades the `torch_npu` (PTA) version in multiple Dockerfiles
to a pre-release build. It introduces logic to dynamically select the
correct wheel based on the Python version and system architecture.
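A minimal sketch of what such selection logic can look like (the wheel-name pattern and version string here are illustrative assumptions, not the actual Dockerfile contents):

```python
# Hedged sketch: derive a torch_npu pre-release wheel name from the current
# Python version and CPU architecture. The naming pattern is illustrative.
import platform
import sys

def select_torch_npu_wheel(version: str) -> str:
    py_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"  # e.g. cp311
    arch = platform.machine()  # e.g. x86_64 or aarch64
    return f"torch_npu-{version}-{py_tag}-{py_tag}-manylinux_2_17_{arch}.whl"

print(select_torch_npu_wheel("2.9.1.dev20260401"))  # hypothetical version
```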

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with existing tests. The author should verify that the Docker
images build successfully for all supported architectures and Python
versions.

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-04-01 22:41:09 +08:00
weiguihua2
59a7526339 [CI][Misc] modify ds3.2+dcp ci (#7841)
### What this PR does / why we need it?

Due to the current dcp solution of allgathering the KV cache, the
performance deteriorates significantly, and the CI may get stuck. This
PR temporarily removes the performance and accuracy benchmarks for
DeepSeek-V3.2-W8A8-cp to prevent CI hangs until optimization is
complete.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Verified that the configuration file remains valid and that the CI no
longer attempts to run the problematic benchmarks.

pick-from: https://github.com/vllm-project/vllm-ascend/pull/7842

---------

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2026-04-01 08:58:21 +08:00
zhangxinyuehfad
af4278be35 [v0.18.0][CI] Close build image by pr (#7776)
### What this PR does / why we need it?
Disable the per-PR image build.

This PR is related to
https://github.com/vllm-project/vllm-ascend/pull/7775, please merge them
together

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-31 16:38:43 +08:00
zhangxinyuehfad
c1cefd26de [v0.18.0][CI] Add nightly- prefix to branch/PR image tags (#7765)
### What this PR does / why we need it?
Add nightly- prefix to branch/PR image tags 

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-28 11:31:16 +08:00
zhangxinyuehfad
2c175f5ed8 [v0.18.0][Bugfix] Fix pr triggers on branches for nightly test workflows (#7695)
### What this PR does / why we need it?
1. Allow PR triggers on `*-dev` and `releases/v*` branches for nightly
test workflows.
2. Fix the image tag in the docs.

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-27 15:17:06 +08:00
zhangxinyuehfad
d781902ce9 [v0.18.0][CI] Fix releases/v0.18.0 ci test only support vllm v0.18.0 (#7686)
### What this PR does / why we need it?
Make the releases/v0.18.0 CI tests support only vLLM v0.18.0.

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-26 18:36:04 +08:00
zhangxinyuehfad
124bb00158 [CI][v0.18.0] Build nightly image for releases/v0.18.0 per pr (#7662)
### What this PR does / why we need it?
This patch adds a per-PR image build for the `releases/v0.18.0` branch.
Due to the limitations of the Quay naming convention, the image tag
cannot match the branch name, so we use the tag `releases-v0.18.0` for
the daily build.

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-26 16:48:51 +08:00
meihanc
114ec75a06 [bugfix][CI] fix '_OpNamespace' 'vllm' object has no attribute 'qkv_rmsnorm_rope' (#7620)
### What this PR does / why we need it?
Fix the `'_OpNamespace' 'vllm' object has no attribute 'qkv_rmsnorm_rope'`
error by uninstalling triton.

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-03-25 11:05:34 +08:00
Li Wang
8e3f8bab57 [Nightly] Nightly pre-build image (#7388)
### What this PR does / why we need it?
This pull request refactors the nightly image build and simplifies the
logic of multiple workflows.
1. The nightly image build becomes a prerequisite when the tests are
triggered by `schedule` or `workflow_dispatch`.
2. Simplifies the pull-request select-case logic.
3. Next step: implement replaceable nightly tests. Specifically, if
nightly tests are manually triggered, they can accept any optional
Docker image to meet the needs of different commits (i.e., the image
is customizable).
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-25 09:24:01 +08:00
drizzlezyk
54879467c4 [CI] refine issue triage rules, wan regex and update stale setting (#7531)

### What this PR does / why we need it?
- Update the issue labeler regex for wan to match numeric suffixes only
(see the sketch after this list), in both:
  - the standalone wan label rule
  - the multi-modality-generate aggregate rule
- Add title-based gate conditions in the issue triage workflow so
auto-labeling runs only for the expected templates:
`[Bug]:`, `[Installation]:`, `[Usage]:`, `[Doc]:`
- Adjust the scheduled stale workflow configuration for the
awaiting-feedback processing block.
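As a rough illustration of the tightened matching (the exact pattern in `.github/issue-labeler.yml` may differ), a numeric-suffix-only rule behaves like this:

```python
# Illustrative pattern only: "wan" must be followed by a numeric version
# suffix, so wan2 and wan2.1 match while alphabetic forms like wana do not.
import re

WAN_PATTERN = re.compile(r"\bwan\d+(?:\.\d+)*\b", re.IGNORECASE)

for title in ("[Bug]: wan2 crashes", "[Bug]: wan2.1 OOM", "[Bug]: wana issue"):
    print(title, "->", bool(WAN_PATTERN.search(title)))
# -> True, True, False
```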

### Does this PR introduce _any_ user-facing change?
- No runtime/API user-facing change.
- This PR only updates repository automation behavior in GitHub
workflows and issue labeling rules.

### How was this patch tested?
- Performed config-level validation by reviewing diffs and final YAML
content for:
- .github/issue-labeler.yml
- .github/workflows/bot_issue_manage.yaml
- .github/workflows/schedule_stale_manage.yaml
- Verified wan regex now requires numeric suffix (e.g., wan2 , wan2.1 )
and no longer matches alphabetic suffix forms (e.g., wana ).
- Verified triage workflow includes title-based if conditions for
expected issue templates.
- Verified stale workflow’s awaiting-feedback block reflects the
intended configuration adjustment.
- No unit/e2e tests were added because this PR changes GitHub Actions
and labeling configuration only.

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

---------

Signed-off-by: drizzlezyk <drizzlezyk@163.com>
2026-03-24 20:11:31 +08:00
SILONG ZENG
1e3c1e76bf [Lint]Add lint hooks for clang-format, shellcheck, forbidden imports, and boolean context manager checks (#7511)
### What this PR does / why we need it?
This PR introduces several upstream `vllm`-aligned lint hooks into
`vllm-ascend` and makes them part of the actual `pre-commit` flow.

Main changes in this PR:
- add `check-boolean-context-manager` to catch boolean expressions in
`with` statements (see the sketch below)
- add `check-forbidden-imports` to forbid direct `re` imports and
disallowed direct `triton` imports
- enable shell script linting through `tools/shellcheck.sh`
- add root `.clang-format` aligned with upstream `vllm`, enable
`clang-format` in `pre-commit`, temporarily **exclude all `csrc/**`**
from `clang-format` to avoid bringing a large native code reformat into
this PR

This PR focuses on landing the smaller and immediately useful lint
alignment first, without mixing in the larger requirements-management
migration.
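For context, this is the class of bug the boolean-context-manager hook guards against (a minimal sketch, not code from the repository):

```python
# A boolean expression in a `with` statement enters only ONE context
# manager -- whatever `and`/`or` evaluates to -- which is almost never intended.
from contextlib import nullcontext

a, b = nullcontext("a"), nullcontext("b")

with a and b:  # enters only `b`, because `a and b` evaluates to `b` -- flagged
    pass

with a, b:     # correct: enters both context managers
    pass
```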

### Does this PR introduce _any_ user-facing change?
No.

This PR only updates repository lint configuration, static checks, and
internal import/style enforcement. It does not change runtime behavior
or public interfaces.

### How was this patch tested?
Tested locally in the project virtual environment.

Commands used:
```bash
bash format.sh
```
Verified checks passed:
``` bash
ruff check...............................................................Passed
ruff format..............................................................Passed
codespell................................................................Passed
typos....................................................................Passed
clang-format.............................................................Passed
Lint GitHub Actions workflow files.......................................Passed
Lint shell scripts.......................................................Passed
Lint PNG exports from excalidraw.........................................Passed
Check for spaces in all filenames........................................Passed
Enforce __init__.py in Python packages...................................Passed
Check for forbidden imports..............................................Passed
Check for boolean ops in with-statements.................................Passed
Suggestion...............................................................Passed
- hook id: suggestion
- duration: 0s

To bypass pre-commit hooks, add --no-verify to git commit.
```
**note:**
clang-format is enabled but currently excludes all csrc/**


- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-03-24 20:03:01 +08:00
realliujiaxu
5d12446573 [Feat][SP] Support SP for VL MoE models (#7044)
### What this PR does / why we need it?

2nd PR for https://github.com/vllm-project/vllm-ascend/issues/5712,
extend SP to VL MoE models.


### Does this PR introduce _any_ user-facing change?
Removes `sp_threshold` from `additional_config` and reuses
`sp_min_token_num` from vLLM.


### How was this patch tested?
- Model: Qwen3-VL-30B-A3B
- TP4 DP2
- 100 reqs
- max concurrency 1

| Seq length | Mean TTFT (ms) main | Mean TTFT (ms) this PR |
|------------|---------------------|------------------------|
| 4k         | 429.40               | 323.3                  |
| 16k        | 1297.01              | 911.74                |

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2026-03-24 17:16:00 +08:00
LeeWenquan
9615bc33fd Fix Qwen3Next CI Config (#7561)
### What this PR does / why we need it?
This PR modifies the Qwen3Next nightly CI config:
(1) Add a nightly CI.
(2) Set a more precise accuracy standard.

- vLLM version: v0.18.0
- vLLM main:
6a9cceb219

Signed-off-by: Your Name <you@example.com>
Co-authored-by: Your Name <you@example.com>
2026-03-24 17:08:17 +08:00
Nengjun Ma
fcba91a392 Main2main Upgrade vllm commit to 0320 17:00 (#7510)
### What this PR does / why we need it?
Main2main Upgrade vllm commit to 0320 17:00

1. Fix: vLLM refactored `_moe_forward` to call
`runner.forward_impl_chunked()` when `runner.use_dp_chunking` is True
(see the sketch below). vLLM PR: "[MoE Refactor] DefaultMoERunner
simplification" [#33049](https://github.com/vllm-project/vllm/pull/33049)

2. Fix: vLLM moved the call to `self._set_compile_ranges()` in
`VllmConfig.__post_init__` from **before** `check_and_update_config()`
to **after** it (to allow platforms to lower `max_num_batched_tokens`
first). vLLM PR: "fix(xpu): Re-compute compile ranges after
platform-specific config updates"
[#37523](https://github.com/vllm-project/vllm/pull/37523)
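A self-contained sketch of the dispatch shape described in item 1 (the method and flag names come from the PR text; everything else is an assumption for illustration):

```python
# Sketch only -- mirrors the upstream dispatch, not actual vLLM source.
class _Runner:
    use_dp_chunking = True
    def forward_impl(self, x):
        return f"dense({x})"
    def forward_impl_chunked(self, x):
        return f"chunked({x})"

def moe_forward(runner, hidden_states):
    # After vllm#33049: route to the chunked path under DP chunking.
    if runner.use_dp_chunking:
        return runner.forward_impl_chunked(hidden_states)
    return runner.forward_impl(hidden_states)

print(moe_forward(_Runner(), "h"))  # -> chunked(h)
```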


### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
2026-03-23 21:37:41 +08:00
meihanc
e344a53127 [bugfix][CI]Skip e2e log summary when the log file is missing or empty (#7552)
### What this PR does / why we need it?
Avoid failing `ci_log_summary.py` when the e2e log file is missing or
empty.
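A hedged sketch of such a guard (the actual logic lives in `.github/workflows/scripts/ci_log_summary.py`; this is an illustration, not its code):

```python
# Skip gracefully instead of failing the job when the log is absent or empty.
import os
import sys

def summarize(log_file: str) -> None:
    if not os.path.isfile(log_file) or os.path.getsize(log_file) == 0:
        print(f"log file {log_file!r} missing or empty, skipping summary")
        sys.exit(0)
    ...  # parse and publish the failure summary as before

summarize("/tmp/e2e.log")
```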

Tested in CI: https://github.com/vllm-project/vllm-ascend/actions/runs/23428406256/job/68149271871
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

---------

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-03-23 20:25:59 +08:00
zhangxinyuehfad
886756aea0 [Bugfix][CI] Fix aisbench installation to avoid Gitee authentication (#7536)
### What this PR does / why we need it?
- Pass GITEE_USERNAME (var) and GITEE_TOKEN (secret) as Docker build
  args in nightly image build so Dockerfile can authenticate to Gitee
- In Dockerfile.nightly.a2/a3, embed credentials into the clone URL to
  avoid auth failure during `git clone` (illustrated below)
- In single-node and multi-node PR test workflows, backup the
  pre-installed benchmark from the nightly image before wiping
  vllm-ascend, then restore it instead of re-cloning from Gitee,
  which is inaccessible from fork PR contexts
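For the clone-URL change, the idea is standard HTTPS-credential embedding (a sketch assuming `GITEE_USERNAME`/`GITEE_TOKEN` arrive as build args; the repo path is hypothetical):

```python
# Illustration only: build an authenticated clone URL so `git clone`
# needs no interactive login. Never log or bake the resulting URL.
import os

user = os.environ.get("GITEE_USERNAME", "user")
token = os.environ.get("GITEE_TOKEN", "token")
clone_url = f"https://{user}:{token}@gitee.com/some-org/aisbench.git"
```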

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-23 20:16:51 +08:00
liuhy1213-cell
fb283b5820 [CI] Add nightly CI test cases for the GLM-5 (#7429)
### What this PR does / why we need it?
Add nightly CI test cases for GLM-5.
Add model download for GLM-5.

https://github.com/vllm-project/vllm-ascend/actions/runs/23286178651/job/67710409642#logs
- vLLM version: v0.17.0
- vLLM main:
b31e9326a7
---------
Signed-off-by: liuhaiyang27 <liuhaiyang27@huawei.com>
Signed-off-by: liuhy1213-cell <liuhy1213@gmail.com>
Co-authored-by: liuhaiyang27 <liuhaiyang27@huawei.com>
2026-03-23 19:14:19 +08:00
Nengjun Ma
8e2c59e1ee Main2main upgrade vllm commit to 03 19 17:00 (#7478)
### What this PR does / why we need it?
Upgrade the vLLM commit to 2026.03.19.

1. Fix: `socket` removed from `StatelessProcessGroup`. Upstream vLLM PR
[#36330](https://github.com/vllm-project/vllm/pull/36330) ("elastic_ep:
Fix stateless group port races") refactored `StatelessProcessGroup` and
removed the `socket: socket.socket | None` field. Socket ownership was
moved to a new `create_tcp_store()` helper instead of being stored as a
field on the dataclass.

2. Fix: the `virtual_engine` parameter was removed from
`set_forward_context()`. Upstream PR "[V0 Deprecation] Deprecate
virtual engine" [#37195](https://github.com/vllm-project/vllm/pull/37195)

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-03-23 16:25:57 +08:00
dependabot[bot]
da866cc168 [CI] Bump docker/build-push-action from 6 to 7 (#7541)
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6 to 7.

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-23 15:46:12 +08:00
Qiu
71df17f4e6 bugfix(MC2): refactor the comm group of MC2 to be compatible with PP (#7291)
### What this PR does / why we need it?
This PR refactors the communication group of MC2 to keep it consistent
with vllm's EP group, making it compatible with PP.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-03-23 15:44:21 +08:00
dependabot[bot]
8527b49764 [CI] Bump docker/setup-buildx-action from 3 to 4 (#7542)
Bumps [docker/setup-buildx-action](https://github.com/docker/setup-buildx-action) from 3 to 4.

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-23 15:44:14 +08:00
Shanshan Shen
5c0d02f689 [Bugfix] Fix multi-instance serving OOM on single card (#7427)
### What this PR does / why we need it?
Fix https://github.com/vllm-project/vllm-ascend/issues/7308.

Stop subtracting `init_non_torch_memory` (which may have been consumed
by the first instance) from the total `non_torch_memory` when
calculating `available_kv_cache_memory`; instead, directly use
`non_torch_memory_increase` (already contained in
`non_kv_cache_memory`) to calculate `available_kv_cache_memory`.
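The effect reduces to simple arithmetic; a worked check against the logged numbers below (GiB):

```python
# available_kv_cache_memory = requested - non_kv_cache - cleared_by_empty_cache
requested, non_kv_cache = 18.287, 1.234

# Before: the first instance's 18+ GiB shows up in the subtraction -> OOM.
print(requested - non_kv_cache - 18.372)  # ~= -1.32 GiB

# After: only this instance's own increase counts (already inside
# non_kv_cache_memory), so nothing extra is subtracted.
print(requested - non_kv_cache - 0.0)     # ~= 17.05 GiB
```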

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Launch two vllm-ascend instances sequentially on a single card.

```bash
# Launch first instance
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B \
--port 8100 \
--host 0.0.0.0 \
--additional-config='{"enable_cpu_binding":true}'  \
--gpu-memory-utilization 0.3 \
--max-num-seqs 1 \
--max-model-len 2048 \
--max-num-batched-tokens 2048 \
--no-enable-prefix-caching \
--enforce-eager

# Launch second instance
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B \
--port 8101 \
--host 0.0.0.0 \
--additional-config='{"enable_cpu_binding":true}'  \
--gpu-memory-utilization 0.3 \
--max-num-seqs 1 \
--max-model-len 2048 \
--max-num-batched-tokens 2048 \
--no-enable-prefix-caching \
--enforce-eager
```

**Before this PR:**

```bash
# First instance:
------------------------------------------------------------------
requested_memory: 18.287109375 GiB
non_kv_cache_memory: 1.2340388298034668 GiB
init_non_torch_memory: 0.3616676330566406 GiB
non_torch_memory_before_empty_cache: 0.3896217346191406 GiB
non_torch_memory_increase: 0.0279541015625 GiB
non_torch_memory_cleared_by_empty_cache: 0.3616676330566406 GiB
------------------------------------------------------------------

# Second instance:
------------------------------------------------------------------
requested_memory: 18.287109375 GiB
non_kv_cache_memory: 1.2336344718933105 GiB
init_non_torch_memory: 18.37220001220703 GiB
non_torch_memory_before_empty_cache: 18.399906158447266 GiB
non_torch_memory_increase: 0.02754974365234375 GiB
non_torch_memory_cleared_by_empty_cache: 18.372356414794922 GiB
------------------------------------------------------------------
# available_kv_cache_memory = requested_memory - non_kv_cache_memory - non_torch_memory_cleared_by_empty_cache
Available KV cache memory: -1.32 GiB
```

**After this PR:**

```bash
# First instance:
------------------------------------------------------------------
requested_memory: 18.287109375 GiB
non_kv_cache_memory: 1.2340540885925293 GiB
init_non_torch_memory: 0.36182403564453125 GiB
non_torch_memory_before_empty_cache: 0.38979339599609375 GiB
non_torch_memory_increase: 0.0279693603515625 GiB
non_torch_memory_cleared_by_empty_cache: 0.0 GiB
------------------------------------------------------------------

# Second instance:
------------------------------------------------------------------
requested_memory: 18.287109375 GiB
non_kv_cache_memory: 1.233344554901123 GiB
init_non_torch_memory: 18.74309539794922 GiB
non_torch_memory_before_empty_cache: 18.770355224609375 GiB
non_torch_memory_increase: 0.02725982666015625 GiB
non_torch_memory_cleared_by_empty_cache: 0.0 GiB
------------------------------------------------------------------
# available_kv_cache_memory = requested_memory - non_kv_cache_memory - non_torch_memory_cleared_by_empty_cache
Available KV cache memory: 17.05 GiB
```

- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
2026-03-23 14:22:59 +08:00
Li Wang
75fae619d5 [Misc] Refactor aclgraph accuracy test to use logprob-based comparison (#7455)
### What this PR does / why we need it?

Replace text-match assertions with a two-tier logprob accuracy check (sketched below):

- Prefill (token 0): assert token ID is identical between eager baseline
and compiled mode, then verify logprob matches within `atol`.
- Decode (tokens 1-2): if chosen tokens match, compare logprobs
directly; if they differ, cross-lookup the baseline token in the
compiled model's top-20 distribution and assert the assigned logprob is
within `decode_atol` (defaults to 2x atol). This tolerates minor argmax
drift caused by floating-point differences while still catching
distribution divergence.
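A hedged sketch of that two-tier check (the data shapes and the `atol` default are assumptions; the real test lives in the aclgraph accuracy suite):

```python
import math
from collections import namedtuple

# token_id: chosen token; logprob: its logprob; top20: {token_id: logprob}
Step = namedtuple("Step", "token_id logprob top20")

def check_logprobs(baseline, compiled, atol=0.25, decode_atol=None):
    decode_atol = 2 * atol if decode_atol is None else decode_atol
    # Tier 1 -- prefill (token 0): token id must match exactly.
    assert baseline[0].token_id == compiled[0].token_id
    assert math.isclose(baseline[0].logprob, compiled[0].logprob, abs_tol=atol)
    # Tier 2 -- decode (tokens 1..): tolerate minor argmax drift.
    for b, c in zip(baseline[1:], compiled[1:]):
        if b.token_id == c.token_id:
            assert math.isclose(b.logprob, c.logprob, abs_tol=atol)
        else:
            # Cross-lookup the baseline token in the compiled top-20.
            assert abs(c.top20[b.token_id] - b.logprob) <= decode_atol
```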

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
8a680463fa

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-23 09:08:21 +08:00
meihanc
bff4fbfca5 upgrade to 0.18.0 (#7502)
### What this PR does / why we need it?
1. Upgrade to 0.18.0.
2. Ensure `kernel_block_sizes` is an int for the Eagle drafter.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2026-03-21 16:05:38 +08:00
Li Wang
6ad74e8c80 [CI] Add git safe repo (#7501)
### What this PR does / why we need it?
Mark the repo as a git safe directory to avoid the "dubious ownership" error.

- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-20 16:40:24 +08:00
vllm-ascend-ci
95e1dc11d8 [CI]: Auto-update estimated test times in config.yaml (#7413)
## Summary

This PR was auto-generated by the **Update estimated test times**
[workflow](https://github.com/vllm-project/vllm-ascend/actions/runs/23226502411).

It updates the `estimated_time` values in
`.github/workflows/scripts/config.yaml` based on actual elapsed times
collected from CI workflow runs.

### Methodology

- Each e2e test job uploads its elapsed time as a `timing-data-*`
artifact upon completion.
- The workflow aggregates all collected timing artifacts across jobs.
- For each test, the **median** elapsed time is computed to reduce
outlier impact.
- A **10% safety buffer** is applied and the result is rounded to the
nearest 10 seconds (see the sketch below).
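The aggregation rule reduces to one line of arithmetic; a sketch (the workflow's actual script may differ):

```python
from statistics import median

def estimate_time(elapsed_seconds: list[float]) -> int:
    # median to dampen outliers, +10% buffer, rounded to the nearest 10 s
    return int(round(median(elapsed_seconds) * 1.10 / 10) * 10)

print(estimate_time([298, 305, 312, 340, 1200]))  # -> 340 (1200 outlier ignored)
```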

### Review Checklist

- [ ] Verify that updated `estimated_time` values are within a
reasonable range.
- [ ] Confirm no test entries are missing or unexpectedly removed.

> If the new values look reasonable, feel free to merge. Otherwise,
leave a comment describing the anomaly.
- vLLM version: v0.17.0
- vLLM main:
4497431df6

Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-03-19 19:01:16 +08:00
Nengjun Ma
ee804ce23e Main2main upgrade vllm to 0318 commit (#7412)
### What this PR does / why we need it?
Upgrade the vLLM commit to 0318.

Main content: adds a pre-operation that cleans up and waits (default
max 50 s, sketched below) for NPU memory to be released, covering test
cases that previously failed because the preceding test did not free
NPU memory in time.
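An illustrative sketch of such a wait-for-release pre-operation (the real helper and its API are assumptions; `torch.npu` is provided by torch_npu on Ascend):

```python
import gc
import time

import torch  # with torch_npu installed, torch.npu mirrors the CUDA memory API

def wait_for_npu_memory_release(threshold_bytes: int, timeout_s: float = 50.0) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        gc.collect()
        torch.npu.empty_cache()
        if torch.npu.memory_allocated() < threshold_bytes:
            return True  # memory freed in time
        time.sleep(1.0)
    return False  # give up after the (default 50 s) timeout
```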

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-03-19 17:17:36 +08:00
aipaes
87d6424b2e [CI] Add nightly CI test cases for the GLM-4.7 model. (#7391)
### What this PR does / why we need it?
Add accuracy nightly CI test cases for the GLM-4.7 model.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
through CI

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
2026-03-19 16:43:29 +08:00
aipaes
0261d1b1c6 [CI] add glm4.7 weights download (#7395)
### What this PR does / why we need it?
Download GLM4.7 w8a8 weights for CI
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
through CI
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Signed-off-by: aipaes <82140963+aipaes@users.noreply.github.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
2026-03-19 16:43:15 +08:00
zhangxinyuehfad
ce239db4fb [CI] Add multi-hardware wheel build and release workflow (#7312)
### What this PR does / why we need it?
Adds a scheduled CI workflow (schedule_release_code_and_wheel.yml) to
automatically build and release vllm-ascend source packages and binary
wheels for multiple Ascend hardware targets.

Key features (item 2's build matrix is sketched after this list):

1. Source release: builds a tar.gz sdist and uploads it to PyPI on
version-tag push
2. Multi-hardware wheel builds, supporting three hardware targets in
parallel:
   - A2 (Ascend 910B): x86_64 + ARM64, Python 3.10 / 3.11
   - A3 (Ascend 910C): x86_64 + ARM64, Python 3.10 / 3.11
   - 310P: x86_64 + ARM64, Python 3.10 / 3.11
3. Wheel repair: uses auditwheel to produce manylinux-compatible wheels,
excluding Ascend NPU runtime libs (libascend*.so, libtorch*.so, etc.)
that must be provided by the runtime environment
4. Variant wheels: generates hardware-variant wheels via variantlib for
hardware-specific distribution
5. OBS upload: aggregates all variant wheels and a combined index JSON,
then uploads to Huawei OBS for hosting
### Does this PR introduce _any_ user-facing change?
Yes. Users will be able to install hardware-specific vllm-ascend wheels
from PyPI or the OBS variant index, eliminating the need to build from
source.

### How was this patch tested?
1. CI verification only — workflow syntax and job dependency logic
reviewed manually
2. Wheel build steps validated against existing Dockerfiles
(Dockerfile.buildwheel.a2/a3/310p)
3. auditwheel exclusion list verified against known Ascend runtime
shared libraries


- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: YanZhicong <mryanzhicong@163.com>
Co-authored-by: YanZhicong <mryanzhicong@163.com>
2026-03-19 11:06:17 +08:00
LoganJane
270c5cb8cd [CI] Add nightly CI test cases for the Kimi-K2.5 (#7416)
### What this PR does / why we need it?
Add nightly CI test cases for the Kimi-K2.5.

- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: LoganJane <loganJane73@hotmail.com>
Signed-off-by: LoganJane <42287016+LoganJane@users.noreply.github.com>
2026-03-19 11:02:29 +08:00
meihanc
ab9cd2e305 [CI]Add CI summary log (#7202)
### What this PR does / why we need it?
This PR adds a new CI log summarizer, `ci_log_summary.py`, and wires it
into unit-test and e2e workflows so failed jobs publish a structured
failure summary to the GitHub step summary.
Examples:
- `python3 .github/workflows/scripts/ci_log_summary.py --log-file
/tmp/unit-test.log --mode ut --step-name "Unit test"`
- `python3 .github/workflows/scripts/ci_log_summary.py --run-id
23127187822 --format json`

A maintenance note is added to `ci_utils.py` to clarify that the `START`
/ `PASSED` / `FAILED (exit code X)` log lines are parsed by
`ci_log_summary.py`, so any future format changes must be coordinated
with the corresponding summarizer regexes.
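A hedged sketch of that coupling (the exact line formats and regexes are illustrative, not copied from `ci_utils.py` or `ci_log_summary.py`):

```python
import re

# One pattern per status line the runner emits; keep in sync with ci_utils.py.
PATTERNS = (
    re.compile(r"^FAILED \(exit code (?P<code>\d+)\)\s+(?P<name>\S+)"),
    re.compile(r"^PASSED\s+(?P<name>\S+)"),
    re.compile(r"^START\s+(?P<name>\S+)"),
)

for line in ("START test_a", "PASSED test_a", "FAILED (exit code 1) test_b"):
    for pattern in PATTERNS:
        if (m := pattern.match(line)):
            print(m.groupdict())
            break
```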

🤖 Generated with Codex <noreply@openai.com>
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: meihanc <jcccx.cmh@gmail.com>
Co-authored-by: Codex <noreply@openai.com>
2026-03-19 09:32:06 +08:00
Nengjun Ma
8b79d4de52 Main2main upgrade to vllm 0317 afternoon (#7409)
### What this PR does / why we need it?

1.fix "TypeError: get_attn_backend() remove variable": [Refactor
`check_and_update_config`](https://github.com/vllm-project/vllm/pull/35122)

2.fix [Rename `compile_ranges_split_points` to
`compile_ranges_endpoints`](https://github.com/vllm-project/vllm/pull/36027)

3.fix "RuntimeError: device_allocator not a DeviceAllocator":[Replace
memory related torch.cuda
APIs"](https://github.com/vllm-project/vllm/pull/37031)

4.fix [Support multiple KV groups in OffloadingSpec
](https://github.com/vllm-project/vllm/pull/36610) removed
self.offloaded_block_size and changed self.gpu_block_size from a scalar
to a tuple of per-group block sizes, adding block_size_factor.

5.fix [Consolidate
SupportsEagle](https://github.com/vllm-project/vllm/pull/36063) renamed
get_eagle3_aux_hidden_state_layers() to
get_eagle3_default_aux_hidden_state_layers() and added a
supports_eagle3() guard before calling it.

### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
E2E


- vLLM version: v0.17.0
- vLLM main:
8a680463fa

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
2026-03-18 23:24:27 +08:00
jiangmengyu18
305820f1a9 [Bugfix] fix bug about model type of qwen3_vl_8b_instruct_w8a8 (#7383)
### What this PR does / why we need it?
Adapt to the model type of Qwen3-VL-8B-Instruct-W8A8

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
2026-03-18 20:30:03 +08:00
SparrowMu
fb8e22ec00 [DOC] MiniMax-M2.5 model intro (#7296)
### What this PR does / why we need it?
1. Add nightly test on MiniMax-M2.5 with deployment method on A3
2. Add MiniMax-M2.5 deployment introduction to vllm-ascend docs

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: limuyuan <limuyuan3@huawei.com>
Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
2026-03-18 20:14:36 +08:00
LoganJane
2916601e6c [CI] add Kimi-K2.5 weights download (#7406)
### What this PR does / why we need it?
Add Kimi-K2.5 weights download.

- vLLM version: v0.17.0
- vLLM main:
4497431df6

Signed-off-by: LoganJane <loganJane73@hotmail.com>
2026-03-18 18:29:37 +08:00
dependabot[bot]
1ff9e3f25f [CI] Bump docker/login-action from 3 to 4 (#7299)
Bumps [docker/login-action](https://github.com/docker/login-action) from 3 to 4.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-18 17:06:48 +08:00
dependabot[bot]
b3206cd6f6 [CI] Bump actions/setup-python from 5 to 6 (#7298)
Bumps [actions/setup-python](https://github.com/actions/setup-python) from 5 to 6.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-18 17:06:28 +08:00
Li Wang
5894a27bfd [CI] Add PAT_TOKEN when checkout (#7400)
### What this PR does / why we need it?
When we check out a fork repo and want to push back to it, the PAT
token is needed.

- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-18 10:31:32 +08:00
zhangyiming
1c954ff264 [main2main] upgrade vllm to 0308 (#7213)
### What this PR does / why we need it?
Update main2main to vLLM 0308.
Breaking changes:

* https://github.com/vllm-project/vllm/pull/30681
* https://github.com/vllm-project/vllm/pull/35552 remove
self.cudagraph_batch_sizes
* https://github.com/vllm-project/vllm/pull/35158 clear_metadata ->
defer_finalize
* https://github.com/vllm-project/vllm/pull/36006 remove
CacheConfig.cpu_offload_gb
* https://github.com/vllm-project/vllm/pull/35472
* https://github.com/vllm-project/vllm/pull/34552 attn_metadata_builder
* https://github.com/vllm-project/vllm/pull/30515 profile_seq_lens
* https://github.com/vllm-project/vllm/pull/28053 

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Co-authored-by: MrZ20 <2609716663@qq.com>
2026-03-18 09:24:43 +08:00
drizzlezyk
79ef41a53d [CI] add scheduled stale issue management (#7354)
### What this PR does / why we need it?

1. issue with "resolved", 7 days stale, 14 days closed after stale with
`stale` and `resolved` label.
2. issue with "awaiting-feedback", 7 days stale, 14 days closed after
stale with `stale` and `awaiting-feedback` label.

Change items:
- Add a scheduled stale-management workflow to process resolved and
awaiting-feedback issues independently.
- Automatically mark inactive issues as stale , post tailored reminder
messages, and close issues after a grace period.
- Remove source labels when issues become active again, and disable PR
stale handling so the automation remains issue-scoped.

### Does this PR introduce _any_ user-facing change?
- No API or runtime behavior changes.
- This PR only updates GitHub issue automation (labeling and stale
management workflow).

### How was this patch tested?
- Test locally

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: drizzlezyk <drizzlezyk@163.com>
2026-03-17 23:28:29 +08:00
rjg-lyh
4d443b9228 [bugfix] restore pr-7029 and fix patch error (#7294)
### What this PR does / why we need it?
This PR restores #7029, which adds W8A8C8 support for dsv3.2/glm5 using
the `lightning_indexer_quant` ops in the pd-mix stage.

The original PR was reverted by #7288 because the patch did not work
with the recompute scheduler.

This PR also fixes the patching issue so that it works correctly with
the recompute scheduler.

### Does this PR introduce _any_ user-facing change?
Yes. To enable LI C8, users need to set the `enable_sparse_c8` option to
`"true"` in `additional_config`.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: rjg-lyh <1318825571@qq.com>
2026-03-16 15:39:42 +08:00
zhaomingyu13
9320365dab [Test][Feature] Add e2e test for QuaRot model with eagle3 (#7128)
### What this PR does / why we need it?
Add an e2e test for the QuaRot model with eagle3 that runs both the
QuaRot model and the float model, then compares their acceptance rates.
The QuaRot model was adapted for eagle3 in PRs #6914 and #7038.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
2026-03-16 15:35:55 +08:00
Mengqing Cao
e20f0b1a0d [ReleaseNote] Add release note for v0.17.0rc1 (#7240)
### What this PR does / why we need it?
This pull request adds the release notes for `v0.17.0rc1`. It also
updates version numbers across various documentation files, including
`README.md`, `README.zh.md`,
`docs/source/community/versioning_policy.md`, and `docs/source/conf.py`
to reflect the new release.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
2026-03-15 22:47:47 +08:00
pppeng
7e85f2ff97 [CI] Add test_qwen3_5.py (#7133)
### What this PR does / why we need it?
Add test_qwen3_5.py for base TP4 scenarios on Qwen3.5-27B and
Qwen3.5-35B-A3B.

- vLLM version: main
- vLLM main:
4034c3d32e
---------
Signed-off-by: pppeng <zepengliu912@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2026-03-15 22:19:02 +08:00
Mengqing Cao
0c299f79b9 Revert "[Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029)" (#7288)
### What this PR does / why we need it?
This reverts commit 7ed9e9de69, which introduced an issue where the
patch doesn't work with the recompute scheduler enabled.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
2026-03-15 20:19:09 +08:00
yupeng
29f195a91c [Bugfix][LoRA] Fix the bug when runs Qwen3-Reranker-0.6B with LoRA. (#7156)
### What this PR does / why we need it?
Fix the error reported while initializing the Qwen3-Reranker-0.6B model
with `--enable-lora`, and add a test case to verify the fix.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2026-03-15 17:55:42 +08:00
Mengqing Cao
986cd45397 [Version] Drop 0.16.0 support (#7153)
### What this PR does / why we need it?
Drop 0.16.0 support in main
- Fix the eagle proposer break introduced by
https://github.com/vllm-project/vllm/pull/34552, mainly by using the
draft attention group to initialize the attention metadata builder.
- Fix the `ModelRunner` has no attribute `cudagraph_capture_sizes`
error, which is a bug in vLLM v0.17.0 that was fixed by a later PR:
https://github.com/vllm-project/vllm/pull/30515

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
2026-03-13 16:14:15 +08:00
rjg-lyh
7ed9e9de69 [Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029)
### What this PR does / why we need it?
This PR adds W8A8C8 support in dsv3.2/glm5 with the
`lightning_indexer_quant` ops, mainly in the pd-mix stage.

Because the code for the current PD-disaggregated scenario is still
under refactoring and cleanup, this PR prioritizes ensuring the C8
functionality in the pd-mix scenario.

The next steps are planned as follows:
① Once the optimized scatter operator is updated, we will replace the
original operator to improve the performance of storing k_scale.
② Once the code logic for the PD-disaggregated scenario becomes stable,
we will carry out more comprehensive validation and make appropriate
adaptations.
③ Because enabling C8 currently introduces several new operators whose
performance still needs improvement, performance may regress in some
scenarios. Therefore, only after all the operators are fully ready can
we ensure that this feature does not cause any performance degradation.
At that point, we will enable this feature by default and remove the
switch in `additional_config`.


### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
2026-03-13 14:47:42 +08:00
pppeng
6ee7ffb98a Add Qwen3_5 to model list (#7130)
### What this PR does / why we need it?
The pr aims to add new models like Qwen3.5-35B-A3B/Qwen3.5-27B to model
list for testing.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: pppeng <60355449+ppppeng@users.noreply.github.com>
2026-03-13 11:42:28 +08:00