[CI][Doc] Optimize multi-node CI (#3565)

### What this PR does / why we need it?
This pull request mainly do the following things:
1. Add a doc for multi-node CI, The main content is the mechanism
principle and how to contribute
2. Simplify the config yaml for more developer-friendly
3. Optimized the mooncake installation script to prevent accidental
failures during installation
4. Fix the workflow to ensure the kubernetes can be apply correctly
5. Add Qwen3-235B-W8A8 disaggregated_prefill test
6. Add GLM-4.5 multi dp test
7. Add 2p1d 4nodes disaggregated_prefill test
8. Refactor nightly tests
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
This commit is contained in:
Li Wang
2025-10-25 09:23:47 +08:00
committed by GitHub
parent 292cf339c3
commit 7f73c28a24
21 changed files with 1165 additions and 378 deletions

View File

@@ -8,7 +8,10 @@ def test_multi_node() -> None:
env_dict = config.envs
# perf_cmd = config.perf_cmd
# acc_cmd = config.acc_cmd
server_port = config.server_port if not config.disaggregated_prefill else config.proxy_port
nodes_info = config.nodes_info
disaggregated_prefill = config.disaggregated_prefill
server_port = config.server_port
proxy_port = config.proxy_port
server_host = config.cluster_ips[0]
with config.launch_server_proxy(DISAGGREGATED_PREFILL_PROXY_SCRIPT):
with RemoteOpenAIServer(
@@ -18,6 +21,9 @@ def test_multi_node() -> None:
server_host=server_host,
env_dict=env_dict,
auto_port=False,
proxy_port=proxy_port,
disaggregated_prefill=disaggregated_prefill,
nodes_info=nodes_info,
max_wait_seconds=2000,
) as remote_server:
# base_url = remote_server.url_root