xc-llm-ascend

Author	SHA1	Message	Date
Li Wang	2ee17e50a1	[2/N] Upgrade nightly doc (#5534 ) ### What this PR does / why we need it? Follow up https://github.com/vllm-project/vllm-ascend/pull/5479, upgrade the corresponding doc for developers - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-31 09:11:42 +08:00
Li Wang	e760aae1df	[1/N] Refactor nightly test structure (#5479 ) ### What this PR does / why we need it? This patch is a series of refactoring actions, including clarifying the directory structure of nightly tests, refactoring the config retrieval logic, and optimizing the workflow, etc. This is the first step: refactoring the directory structure of nightly to make it more readable and logical. - vLLM version: v0.13.0 - vLLM main: `5326c89803` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-30 19:03:02 +08:00
Li Wang	1d81bfaed1	Fix nightly (#5413 ) ### What this PR does / why we need it? This pacth mainly do the following things: 1. Bugfix for multi_node_tests log, log names must be unique when uploading logs. 2. Optimize `get_cluster_ips` logic, increase the max retry times for robustness 3. Abandoned the existing gh-proxy temporarily until it is stable enough. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-27 18:16:46 +08:00
Li Wang	c2f776b846	[Nightly] Initial logging for nightly multi-node testing (#5362 ) ### What this PR does / why we need it? Currently, our multi-node logs only show the master node's logs (via the Kubernetes API), which is insufficient for effective problem localization if other nodes experience issues. Therefore, this pull request adds the ability to upload logs for other nodes. Next plan: Output structured directory logs, including logs from each node and the polog. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-26 11:39:07 +08:00
Li Wang	0f92d34a70	[CI] Pull latest vllm-ascend src before tests (#4988 ) ### What this PR does / why we need it? Currently, our image build suffers from errors during cross-compilation, which causing the image to fail to build sometimes(see https://github.com/vllm-project/vllm-ascend/actions/runs/20152861650/job/57849208186). This results in the nightly test code not being the latest version. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-13 19:04:14 +08:00
Li Wang	283bc5c7ba	[Nightly] Optimize nightly CI (#4509 ) ### What this PR does / why we need it? 1. Optimize multi-node waiting logic 2. Remove the `tee` pipeline for logs, which will lead to hang issue ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0 Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-04 22:31:07 +08:00
zhangxinyuehfad	8813832387	[Test] Add GLM-4.5 nightly test (#4225 ) ### What this PR does / why we need it? Add GLM-4.5 nightly test - vLLM version: v0.11.2 Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-12-01 22:31:56 +08:00
zhangxinyuehfad	67f2b3a031	[Test] Add deepseek v3.2 exp nightly test (#4191 ) ### What this PR does / why we need it? - skip the nightly image build when the github event is pull_request - set imagepullpolicy as alway for multi_node test - move multi_node tests ahead to have some resource clean first - do not relevant nightly image build with nightly tests for tolerance - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: wangli <wangli858794774@gmail.com> Co-authored-by: wangli <wangli858794774@gmail.com>	2025-11-14 15:46:10 +08:00
Li Wang	7294f89e43	[CI] Add daily images build for nightly ci (#3989 ) ### What this PR does / why we need it? Given the current excessively long build time of our nightly-ci, I recommend installing necessary, confirmed versions of packages in the Docker image to reduce the time required for integration testing. Including Mooncake vllm with fixed tags, This is expected to reduce nightly-ci duration by 2 hours. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-11-13 20:10:12 +08:00
zhangxinyuehfad	b77b4f1abf	[Test] Add nightly test for DeepSeek-V3.2-Exp (#3908 ) ### What this PR does / why we need it? Add nightly test for DeepSeek-V3.2-Exp ### How was this patch tested? test action： https://github.com/vllm-project/vllm-ascend/actions/runs/19156153634/job/54757008557?pr=3908 - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-11-11 10:29:57 +08:00
Li Wang	259eb25f88	[CI] Quick fix mooncake for nightly-ci (#4028 ) ### What this PR does / why we need it? Since we have upgraded to CANN 8.3rc1, we will no longer use the privately maintained Mooncake repository, but instead use the official release released by Mooncake: https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.7.post2 . Next step: this is only a temporary solution. We will integrate mooncake into the vllm-ascend base image later for easier use. see https://github.com/vllm-project/vllm-ascend/pull/3989 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-11-06 18:46:00 +08:00
wangxiyuan	cc2cd42ad3	Upgrade CANN to 8.3.rc1 (#3945 ) ### What this PR does / why we need it? This PR upgrade CANN from 8.2rc1 to 8.3rc1 and remove the CANN version check logic. TODO: we notice that UT runs failed with CANN 8.3 image. So the base image for UT is still 8.2. We'll fix it later. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-03 20:21:07 +08:00
Li Wang	8f222f21f1	[CI][Nightly] Fix mooncake build (#3958 ) ### What this PR does / why we need it? Fix https://github.com/vllm-project/vllm-ascend/pull/3943 - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-11-03 20:07:47 +08:00
Li Wang	d0cc9c1203	[CI][Nightly] Correct the commit hash available for mooncake (#3943 ) ### What this PR does / why we need it? Because the previous commit hash was accidentally deleted or overwritten. This patch correct the commit hash available for https://github.com/AscendTransport/Mooncake to make nightly ci happy ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-11-01 21:52:16 +08:00
Li Wang	eb0a2ee2d0	[CI] Optimize nightly CI (#3898 ) ### What this PR does / why we need it? This patch mainly fix the the problem of not being able to determine the exit status of the pod's entrypoint script and some other tiny optimizations: 1. Shorten wait for server timeout 2. fix typo 3. fix the issue of ais_bench failing to correctly access the proxy URL in a PD separation scenario. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-10-30 23:42:20 +08:00
Li Wang	4a2ab13743	[CI] Optimize nightly CI (#3858 ) ### What this PR does / why we need it? This patch optimize nightly CI: 1. Bug fixes ais_bench get None repo_type error 2. Fix A2 install kubectl error with arm arch 3. Fix the multi_node CI unable to determine whether the job was successful error ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: `83f478bb19` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-10-29 22:30:19 +08:00
Li Wang	90ae114569	[CI] Fix nightly CI (#3821 ) ### What this PR does / why we need it? This patch fix the nightly CI runs [failure](https://github.com/vllm-project/vllm-ascend/actions/runs/18848144365) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.1 --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-10-28 20:40:03 +08:00
Li Wang	f846bd20e4	[CI] Add multi-node test case for a2 (#3805 ) ### What this PR does / why we need it? This patch add multi-node test case for a2 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-10-27 23:10:17 +08:00
jiangyunfan1	9030106a14	[TEST]Add 2P1D multi node cases for nightly test (#3764 ) ### What this PR does / why we need it? This PR adds the 2P1D multi node func/acc/perf test cases, we need test them daily ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running the test - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` --------- Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com> Signed-off-by: wangli <wangli858794774@gmail.com> Co-authored-by: wangli <wangli858794774@gmail.com>	2025-10-27 23:09:15 +08:00
Li Wang	7f73c28a24	[CI][Doc] Optimize multi-node CI (#3565 ) ### What this PR does / why we need it? This pull request mainly do the following things: 1. Add a doc for multi-node CI, The main content is the mechanism principle and how to contribute 2. Simplify the config yaml for more developer-friendly 3. Optimized the mooncake installation script to prevent accidental failures during installation 4. Fix the workflow to ensure the kubernetes can be apply correctly 5. Add Qwen3-235B-W8A8 disaggregated_prefill test 6. Add GLM-4.5 multi dp test 7. Add 2p1d 4nodes disaggregated_prefill test 8. Refactor nightly tests ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-10-25 09:23:47 +08:00
Li Wang	286ae9003d	[CI] Multi-Node CI scalable (#3611 ) ### What this PR does / why we need it? This PR adds a jinja template for the k8s configuration file, prepare for the upcoming 4-node CI ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-10-22 14:18:43 +08:00
Li Wang	4c4a8458a5	[CI] Refator multi-node CI (#3487 ) ### What this PR does / why we need it? Refactor the multi-machine CI use case. The purpose of this PR is to increase the ease of adding multi-machine CI use cases, allowing developers to add multi-machine cluster model testing use cases (including PD separation) by simply adding a new YAML configuration file. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-10-17 09:04:31 +08:00

22 Commits