[CI][Doc] Optimize multi-node CI (#3565)
### What this PR does / why we need it?
This pull request mainly does the following:
1. Add a doc for multi-node CI covering the underlying mechanism and how to contribute
2. Simplify the config YAML to make it more developer-friendly
3. Harden the mooncake installation script to prevent accidental failures during installation
4. Fix the workflow to ensure the Kubernetes manifests are applied correctly
5. Add a Qwen3-235B-W8A8 disaggregated_prefill test
6. Add a GLM-4.5 multi-DP test
7. Add a 2p1d 4-node disaggregated_prefill test
8. Refactor the nightly tests
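Item 3 hardens the mooncake install against transient failures. As an illustration only (the `retry` helper and its arguments here are assumptions, not the PR's actual script), a retry wrapper of the kind such hardening typically uses:

```shell
#!/bin/sh
# Hedged sketch: retry a flaky command a bounded number of times before
# giving up. Illustrative only; not the PR's actual installation script.
retry() {
    max=$1; shift
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$max" ]; then
            echo "command failed after $max attempts: $*" >&2
            return 1
        fi
        echo "attempt $n failed, retrying..." >&2
        n=$((n + 1))
        sleep 2
    done
}

# Example: wrap a flaky download/install step (here a stand-in command).
retry 3 true && echo "install step ok"
```

A wrapper like this keeps one slow mirror or dropped connection from failing the whole multi-node job.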
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
@@ -17,19 +17,24 @@ spec:
```yaml
      - name: vllm-leader
        image: {{ image | default("m.daocloud.io/quay.io/ascend/cann:8.2.rc1-a3-ubuntu22.04-py3.11") }}
        env:
          - name: CONFIG_YAML_PATH
            value: {{ config_file_path | default("tests/e2e/nightly/multi_node/config/models/DeepSeek-V3.yaml") }}
          - name: WORKSPACE
            value: "/root/workspace"
          # Set the vLLM version and vLLM-Ascend version here; once there is a new release, update here.
          - name: VLLM_VERSION
            value: "v0.11.0"
          - name: VLLM_ASCEND_VERSION
            value: {{ vllm_ascend_ref | default("main") }}
          - name: VLLM_ASCEND_REMOTE_URL
            value: {{ vllm_ascend_remote_url | default("https://github.com/vllm-project/vllm-ascend.git") }}
          - name: RESULT_FILE_PATH
            value: {{ result_file_path | default("/root/.cache/tests/ret/test_result.txt") }}
        command:
          - sh
          - -c
          - |
            bash /root/.cache/tests/run.sh
            tail -f /dev/null
        resources:
          limits:
            huawei.com/ascend-1980: "16"
```
@@ -70,19 +75,24 @@ spec:
```yaml
      - name: vllm-worker
        image: {{ image | default("m.daocloud.io/quay.io/ascend/cann:8.2.rc1-a3-ubuntu22.04-py3.11") }}
        env:
          - name: CONFIG_YAML_PATH
            value: {{ config_file_path | default("tests/e2e/nightly/multi_node/config/models/DeepSeek-V3.yaml") }}
          - name: WORKSPACE
            value: "/root/workspace"
          # Set the vLLM version and vLLM-Ascend version here; once there is a new release, update here.
          - name: VLLM_VERSION
            value: "v0.11.0"
          - name: VLLM_ASCEND_VERSION
            value: {{ vllm_ascend_ref | default("main") }}
          - name: VLLM_ASCEND_REMOTE_URL
            value: {{ vllm_ascend_remote_url | default("https://github.com/vllm-project/vllm-ascend.git") }}
          - name: RESULT_FILE_PATH
            value: {{ result_file_path | default("/root/.cache/tests/ret/test_result.txt") }}
        command:
          - sh
          - -c
          - |
            bash /root/.cache/tests/run.sh
            tail -f /dev/null
        resources:
          limits:
            huawei.com/ascend-1980: "16"
```
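Both the leader and the worker containers write their outcome to the path in `RESULT_FILE_PATH`. A hedged sketch of how a CI gate might consume that file (the `check_result` helper and the "PASSED" marker are illustrative assumptions; the real format is whatever run.sh emits):

```shell
#!/bin/sh
# Hedged sketch of a CI gate over the result file; the "PASSED" marker
# string is an assumption, not the actual output format of run.sh.
check_result() {
    file=$1
    if [ ! -f "$file" ]; then
        echo "no result file at $file" >&2
        return 1
    fi
    grep -q "PASSED" "$file"
}

# Example run against a temporary stand-in result file.
tmp=$(mktemp)
echo "PASSED" > "$tmp"
check_result "$tmp" && echo "gate: ok"
rm -f "$tmp"
```

Gating on a file rather than on pod exit codes lets `tail -f /dev/null` keep the pods alive for log collection after the tests finish.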