### What this PR does / why we need it?
- Pass GITEE_USERNAME (var) and GITEE_TOKEN (secret) as Docker build
args in nightly image build so Dockerfile can authenticate to Gitee
- In Dockerfile.nightly.a2/a3, embed credentials into clone URL to
avoid auth failure during `git clone`
- In single-node and multi-node PR test workflows, backup the
pre-installed benchmark from the nightly image before wiping
vllm-ascend, then restore it instead of re-cloning from Gitee,
which is inaccessible from fork PR contexts
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.18.0
- vLLM main:
8b6325758c
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
1. Increase nightly multi-node test max-parallel from 1 to 2, and fix
resource conflicts that arise when tests run concurrently.
2. Fix parse-trigger job: Add an if condition so it only runs on
schedule, workflow_dispatch, or PRs labeled nightly-test
3. Adjust nightly schedule: Shift trigger time from 24:00 to 23:45
(UTC+8)
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
This PR refactors the nightly CI workflows (A2 and A3) to support
running tests against a specific PR's code, in addition to the existing
scheduled/dispatch runs using pre-built images.
#### Motivation:
Previously, nightly tests could only be triggered by schedule or
workflow_dispatch, always using the pre-built nightly image. This change
allows developers to trigger nightly tests against their own PR's source
code, enabling early validation without waiting for a nightly build.
#### Changes
Trigger logic (parse-trigger job)
A new parse-trigger job is introduced in both
schedule_nightly_test_a2.yaml and schedule_nightly_test_a3.yaml to
centralize trigger evaluation:
`schedule / workflow_dispatch`: runs all tests with the pre-built image
(existing behavior preserved)
`pull_request (labeled + synchronize)`: runs only when:The PR has the
nightly-test label, and /nightly [test-names] comment exists (latest one
wins)
1. /nightly or /nightly all — runs all tests
2. /nightly test1 test2 — runs only named tests (comma-wrapped for exact
matching)
#### How to trigger
1. Add the nightly-test label to your PR
2. Comment /nightly (all tests) or /nightly test1 test2 (specific tests)
4. Re-triggering: add another /nightly comment and push a new commit
(synchronize event)
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Add DeepSeek-V3.2 nightly ci
Fix PD routing to exclude headless nodes when collecting
prefiller/decoder IPs
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
[CI] Upgrade CANN to 8.5.1
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
The incorrect regular expression syntax `.*[UE4M3|ue4m3].*` actually
ignores all words containing any of the following characters: `u, e, 4,
m, 3, |`
```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
".*UE8M0.*", ".*[UE4M3|ue4m3].*", ".*eles.*", ".*fo.*", ".*ba.*",
".*ot.*", ".*[Tt]h[rR].*"]
```
===fix===>
```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
".*UE8M0.*", ".*(UE4M3|ue4m3]).*", ".*eles.*", ".*fo.*", ".*ba.*",
".*ot.*", ".*[Tt]h[rR].*"]
```
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
9562912cea
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
Fixes the issue where nightly multi-node tests hang during the "wait for
pod ready" stage due to strict shell mode.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
13397841ab
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
This patch bump the mooncake version to the latest
[release](https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.8.post1)
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
test is locally
>>> from mooncake.engine import TransferEngine
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Install clang in dokerfile for triton ascend
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
update triton ascend version in 3.2.0
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
Add DeepSeek-V3.2-W8A8 nightly ci test:
DeepSeek-V3.2-W8A8 1node DP2+TP8
:tests/e2e/nightly/models/test_deepseek_v3_2_w8a8.py
### Does this PR introduce _any_ user-facing change
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Since we have upgrade all the nodes' `cann` HDK version to `25.3rc1`, we
should not limit nightly schedule to the specific nodes
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
fix bug : https://github.com/vllm-project/vllm-ascend/issues/5634
Intermittent CI failure due to a compilation error in the triton
operator
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
This patch initial testing involved connecting two nodes from the HK
region to nightly A2.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Add triton ascend in nightly
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
This patch is a series of refactoring actions, including clarifying the
directory structure of nightly tests, refactoring the config retrieval
logic, and optimizing the workflow, etc. This is the first step:
refactoring the directory structure of nightly to make it more readable
and logical.
- vLLM version: v0.13.0
- vLLM main:
5326c89803
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This pacth mainly do the following things:
1. Bugfix for multi_node_tests log, log names must be unique when
uploading logs.
2. Optimize `get_cluster_ips` logic, increase the max retry times for
robustness
3. Abandoned the existing gh-proxy temporarily until it is stable
enough.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Currently, our multi-node logs only show the master node's logs (via the
Kubernetes API), which is insufficient for effective problem
localization if other nodes experience issues. Therefore, this pull
request adds the ability to upload logs for other nodes.
Next plan: Output structured directory logs, including logs from each
node and the polog.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
1. Optimize multi-node waiting logic
2. Remove the `tee` pipeline for logs, which will lead to hang issue
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
- skip the nightly image build when the github event is pull_request
- set imagepullpolicy as alway for multi_node test
- move multi_node tests ahead to have some resource clean first
- do not relevant nightly image build with nightly tests for tolerance
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Given the current excessively long build time of our nightly-ci, I
recommend installing necessary, confirmed versions of packages in the
Docker image to reduce the time required for integration testing.
Including Mooncake vllm with fixed tags, This is expected to reduce
nightly-ci duration by 2 hours.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Since we have upgraded to CANN 8.3rc1, we will no longer use the
privately maintained Mooncake repository, but instead use the official
release released by Mooncake:
https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.7.post2 .
Next step: this is only a temporary solution. We will integrate mooncake
into the vllm-ascend base image later for easier use. see
https://github.com/vllm-project/vllm-ascend/pull/3989
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR upgrade CANN from 8.2rc1 to 8.3rc1 and remove the CANN version
check logic.
TODO: we notice that UT runs failed with CANN 8.3 image. So the base
image for UT is still 8.2. We'll fix it later.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Because the previous commit hash was accidentally deleted or
overwritten. This patch correct the commit hash available for
https://github.com/AscendTransport/Mooncake to make nightly ci happy
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This patch mainly fix the the problem of not being able to determine the
exit status of the pod's entrypoint script and some other tiny
optimizations:
1. Shorten wait for server timeout
2. fix typo
3. fix the issue of ais_bench failing to correctly access the proxy URL
in a PD separation scenario.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This patch optimize nightly CI:
1. Bug fixes ais_bench get None repo_type error
2. Fix A2 install kubectl error with arm arch
3. Fix the multi_node CI unable to determine whether the job was
successful error
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This patch add multi-node test case for a2
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR adds the 2P1D multi node func/acc/perf test cases, we need test
them daily
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This pull request mainly do the following things:
1. Add a doc for multi-node CI, The main content is the mechanism
principle and how to contribute
2. Simplify the config yaml for more developer-friendly
3. Optimized the mooncake installation script to prevent accidental
failures during installation
4. Fix the workflow to ensure the kubernetes can be apply correctly
5. Add Qwen3-235B-W8A8 disaggregated_prefill test
6. Add GLM-4.5 multi dp test
7. Add 2p1d 4nodes disaggregated_prefill test
8. Refactor nightly tests
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR adds a jinja template for the k8s configuration file, prepare
for the upcoming 4-node CI
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Refactor the multi-machine CI use case. The purpose of this PR is to
increase the ease of adding multi-machine CI use cases, allowing
developers to add multi-machine cluster model testing use cases
(including PD separation) by simply adding a new YAML configuration
file.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: wangli <wangli858794774@gmail.com>