Files
xc-llm-ascend/.github/workflows/_e2e_nightly_multi_node.yaml
zhangxinyuehfad 1e4017e3fa [CI] support nightly ci for per pr by labels (#6483)
### What this PR does / why we need it?

This PR refactors the nightly CI workflows (A2 and A3) to support
running tests against a specific PR's code, in addition to the existing
scheduled/dispatch runs using pre-built images.

#### Motivation:
Previously, nightly tests could only be triggered by schedule or
workflow_dispatch, always using the pre-built nightly image. This change
allows developers to trigger nightly tests against their own PR's source
code, enabling early validation without waiting for a nightly build.

#### Changes
Trigger logic (parse-trigger job)

A new parse-trigger job is introduced in both
schedule_nightly_test_a2.yaml and schedule_nightly_test_a3.yaml to
centralize trigger evaluation:

`schedule / workflow_dispatch`: runs all tests with the pre-built image
(existing behavior preserved)
`pull_request (labeled + synchronize)`: runs only when the PR has the
`nightly-test` label and a `/nightly [test-names]` comment exists (the latest
comment wins):

1. `/nightly` or `/nightly all` — runs all tests
2. `/nightly test1 test2` — runs only the named tests (names are comma-wrapped
for exact matching; see the sketch below)
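
To illustrate the comma-wrapping idea, here is a minimal bash sketch. It is not
the PR's actual parse-trigger code: `gh`, `jq`, and the variable names
(`PR_NUMBER`, `selected`) are assumptions made for the example.

```bash
# Find the newest /nightly comment on the PR (hypothetical sketch).
latest=$(gh api "repos/${GITHUB_REPOSITORY}/issues/${PR_NUMBER}/comments" \
  --jq '[.[] | select(.body | startswith("/nightly"))] | last | .body // ""')

if [ -z "$latest" ]; then
  selected=none                               # no /nightly comment: run nothing
else
  names=$(echo "${latest#/nightly}" | xargs)  # "/nightly test1 test2" -> "test1 test2"
  if [ -z "$names" ] || [ "$names" = "all" ]; then
    selected=all                              # "/nightly" or "/nightly all"
  else
    selected=",${names// /,},"                # "test1 test2" -> ",test1,test2,"
  fi
fi

# Exact-match probe for a single test name:
#   [ "$selected" = "all" ] || [[ "$selected" == *",${test_name},"* ]]
```

Wrapping both the list and the probe in commas is what keeps `test1` from
matching `test10`.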

#### How to trigger
1. Add the `nightly-test` label to your PR
2. Comment `/nightly` (all tests) or `/nightly test1 test2` (specific tests)
3. Re-trigger by adding another `/nightly` comment and pushing a new commit
(synchronize event)

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-05 16:46:37 +08:00


name: 'e2e nightly test multi_node'

on:
  workflow_call:
    inputs:
      soc_version:
        required: true
        type: string
        description: use a2 or a3
      runner:
        required: false
        type: string
        default: linux-aarch64-a3-0
      image:
        required: false
        type: string
        description: base image for pods
        default: "swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.1-910b-ubuntu22.04-py3.11"
      config_file_path:
        required: true
        type: string
        description: the model config for the multi_node test
      replicas:
        required: false
        default: "1"
        type: string
        description: replicas of the k8s cluster
      size:
        required: false
        default: "2"
        type: string
        description: how many pods will be pulled up via lws.yaml, i.e. the number of nodes we need
      vllm_version:
        required: false
        default: "v0.16.0"
        type: string
        description: vllm version to use
      vllm_ascend_remote_url:
        required: false
        default: https://github.com/vllm-project/vllm-ascend.git
        type: string
        description: used for PR-level tests
      vllm_ascend_ref:
        required: false
        default: main
        type: string
        description: used for PR-level tests
      is_pr_test:
        required: true
        type: boolean
      is_run:
        required: true
        type: boolean
    secrets:
      KUBECONFIG_B64:
        required: true
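
# A caller might wire this reusable workflow up as sketched below. This is
# illustrative only: the job name, config file, and parse-trigger outputs are
# hypothetical, not taken from this repository.
#
#   e2e-a3:
#     needs: parse-trigger
#     uses: ./.github/workflows/_e2e_nightly_multi_node.yaml
#     with:
#       soc_version: a3
#       config_file_path: some_model_config.yaml
#       is_pr_test: true
#       is_run: ${{ needs.parse-trigger.outputs.should_run == 'true' }}
#     secrets:
#       KUBECONFIG_B64: ${{ secrets.KUBECONFIG_B64 }}
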
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
# It's used to activate ascend-toolkit environment variables.
defaults:
  run:
    shell: bash -el {0}

# only cancel in-progress runs of the same workflow
# and ignore the lint / 8 cards test type
concurrency:
  group: ascend-nightly-${{ github.workflow_ref }}-${{ github.ref }}-${{ inputs.soc_version }}
  cancel-in-progress: true

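# Note: inputs.soc_version is part of the group key above, so concurrent A2 and
# A3 runs of the same ref do not cancel each other.
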
jobs:
  e2e:
    name: ${{ inputs.config_file_path }}
    # This is the runner with no NPU, for the k8s controller
    runs-on: ${{ inputs.runner }}
    if: ${{ inputs.is_run }}
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-cpu
    env:
      KUBECONFIG: /tmp/kubeconfig
      NAMESPACE: vllm-project
      LEADER_POD: vllm-0
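    # Pod naming follows LeaderWorkerSet conventions: the leader pod is vllm-0
    # and its workers are vllm-0-1 ... vllm-0-(size-1); the wait and log steps
    # below rely on these names.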
    steps:
      - name: Decode kubeconfig from secrets
        run: |
          # Decode and save kubeconfig
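          # PR runs use a kubeconfig cached on the runner instead of the
          # KUBECONFIG_B64 secret, presumably because pull_request-triggered
          # runs from forks cannot read repository secrets.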
if [ "${{ inputs.is_pr_test }}" = "true" ]; then
echo "PR test mode"
if [ "${{ inputs.soc_version }}" = "a3" ]; then
echo "Using A3 cached kubeconfig"
cp /root/.cache/.kube/kubeconfig.yaml "$KUBECONFIG"
else
echo "Using A2 cached kubeconfig"
cp /root/.cache/.kube/hk_001_kb.yaml "$KUBECONFIG"
fi
else
echo "Decoding kubeconfig from secrets"
echo "${{ secrets.KUBECONFIG_B64 }}" | base64 -d > "$KUBECONFIG"
fi
      - name: Checkout code
        uses: actions/checkout@v6
      - name: Prepare scripts
        run: |
          # prepare the lws entrypoint script
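          # /root/.cache is assumed to be shared with the pods (e.g. via a
          # hostPath mount declared in the lws template), so run.sh can serve
          # as their entrypoint.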
          install -D tests/e2e/nightly/multi_node/scripts/run.sh /root/.cache/tests/run.sh
      - name: Clear resources
        run: |
          set -euo pipefail
          CRD_NAME="${CRD_NAME:-vllm}"
          TIMEOUT=${TIMEOUT:-120}
          SLEEP_INTERVAL=2
          echo "Deleting leaderworkerset [$CRD_NAME] in namespace [$NAMESPACE]..."
          kubectl delete leaderworkerset "$CRD_NAME" -n "$NAMESPACE" --ignore-not-found
          echo "Waiting for all pods starting with 'vllm' to be deleted..."
          START_TIME=$(date +%s)
          while true; do
            NOW=$(date +%s)
            ELAPSED=$((NOW - START_TIME))
            if [[ $ELAPSED -ge $TIMEOUT ]]; then
              echo "Timeout reached ($TIMEOUT seconds), some pods still exist:"
              kubectl get pods -n "$NAMESPACE" | grep '^vllm' || true
              exit 1
            fi
            PODS_EXIST=$(kubectl get pods -n "$NAMESPACE" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null | tr ' ' '\n' | grep '^vllm' || true)
            if [[ -z "$PODS_EXIST" ]]; then
              echo "All vllm pods deleted."
              break
            else
              echo "Waiting for pods to be deleted: $PODS_EXIST"
              sleep $SLEEP_INTERVAL
            fi
          done
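      # Render the LeaderWorkerSet manifest from the per-SoC jinja2 template
      # and apply it to bring up the leader and worker pods.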
      - name: Launch cluster
        id: launcher
        run: |
          set -e
          size="${{ inputs.size }}"
          replicas="${{ inputs.replicas }}"
          image="${{ inputs.image }}"
          config_file_path="${{ inputs.config_file_path }}"
          fail_tag=FAIL_TAG_"${{ inputs.config_file_path }}"
          is_pr_test="${{ inputs.is_pr_test }}"
          vllm_version="${{ inputs.vllm_version }}"
          vllm_ascend_ref="${{ inputs.vllm_ascend_ref }}"
          vllm_ascend_remote_url="${{ inputs.vllm_ascend_remote_url }}"
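          # FAIL_TAG is consumed by the "Stream logs" step below: the test
          # entrypoint in the leader pod presumably prints this tag on failure,
          # and the streamed log is grepped for it.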
echo "FAIL_TAG=${fail_tag}" >> $GITHUB_ENV
required_params=("size" "replicas" "image" "config_file_path" "is_pr_test" "vllm_version" "vllm_ascend_ref" "vllm_ascend_remote_url")
for param in "${required_params[@]}"; do
if [ -z "${!param}" ]; then
echo "Error: Parameter '$param' is required but empty"
exit 1
fi
done
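          # A3 nodes provide 16 NPUs per node, A2 nodes 8; each SoC also has
          # its own lws template.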
if [ "${{ inputs.soc_version }}" = "a3" ]; then
npu_per_node=16
TEMPLATE_FILE="tests/e2e/nightly/multi_node/scripts/lws.yaml.jinja2"
else
npu_per_node=8
TEMPLATE_FILE="tests/e2e/nightly/multi_node/scripts/lws-a2.yaml.jinja2"
fi
jinja2 $TEMPLATE_FILE \
-D size="$size" \
-D replicas="$replicas" \
-D image="$image" \
-D config_file_path="$config_file_path" \
-D npu_per_node="$npu_per_node" \
-D fail_tag="$fail_tag" \
-D is_pr_test="$is_pr_test" \
-D vllm_version="$vllm_version" \
-D vllm_ascend_ref="$vllm_ascend_ref" \
-D vllm_ascend_remote_url="$vllm_ascend_remote_url" \
--outfile lws.yaml
kubectl apply -f ./lws.yaml
      - name: Waiting for pod ready
        run: |
          POD_PREFIX="${POD_PREFIX:-vllm-0}"
          SIZE="${{ inputs.size }}"
          TIMEOUT=1200 # default timeout 20 minutes
          echo "Waiting for Pods in namespace [$NAMESPACE] to become Running and Ready (timeout ${TIMEOUT}s)..."
          START_TIME=$(date +%s)
          while true; do
            NOW=$(date +%s)
            ELAPSED=$((NOW - START_TIME))
            if [[ $ELAPSED -ge $TIMEOUT ]]; then
              echo "Timeout reached after ${ELAPSED}s"
              echo "Dumping pod status for debugging:"
              kubectl get pods -n "$NAMESPACE"
              kubectl describe pod "$LEADER_POD" -n "$NAMESPACE"
              exit 1
            fi
            # 1) check follower pods
            ALL_FOLLOWERS_READY=true
            for ((i=1; i<SIZE; i++)); do
              POD="${POD_PREFIX}-${i}"
              PHASE=$(kubectl get pod "$POD" -n "$NAMESPACE" -o jsonpath='{.status.phase}' 2>/dev/null || echo "NotFound")
              READY=$(kubectl get pod "$POD" -n "$NAMESPACE" -o jsonpath='{.status.containerStatuses[*].ready}' 2>/dev/null)
              echo "Follower [$POD] phase=$PHASE ready=$READY"
              if [[ "$PHASE" != "Running" || "$READY" != "true" ]]; then
                echo "Follower [$POD] not Ready yet..."
                ALL_FOLLOWERS_READY=false
                break
              fi
            done
            # 2) check leader pod
            LEADER_PHASE=$(kubectl get pod "$LEADER_POD" -n "$NAMESPACE" -o jsonpath='{.status.phase}' 2>/dev/null || echo "NotFound")
            LEADER_READY=$(kubectl get pod "$LEADER_POD" -n "$NAMESPACE" -o jsonpath='{.status.containerStatuses[*].ready}' 2>/dev/null)
            echo "Leader [$LEADER_POD] phase=$LEADER_PHASE ready=$LEADER_READY"
            if [[ "$LEADER_PHASE" != "Running" || "$LEADER_READY" != "true" ]]; then
              echo "Leader not Ready yet..."
              ALL_FOLLOWERS_READY=false
            fi
            if [[ "$ALL_FOLLOWERS_READY" == "true" ]]; then
              echo "All follower pods and leader pod are Running and Ready — continuing."
              break
            fi
            sleep 2
          done
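      # Worker logs are written to files under /tmp for artifact upload, while
      # the leader's log is streamed inline and watched for FAIL_TAG.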
      - name: Stream logs
        run: |
          set -euo pipefail
          size="${{ inputs.size }}"
          pids=()
          cleanup() {
            echo "Cleaning up background log streams..."
            for pid in "${pids[@]}"; do
              kill "$pid" 2>/dev/null || true
            done
          }
          trap cleanup EXIT
          for i in $(seq 1 $((size - 1))); do
            POD="vllm-0-${i}"
            echo "==== Collecting logs from worker pod: $POD ===="
            kubectl logs -f "$POD" -n "$NAMESPACE" \
              > "/tmp/${POD}_logs.txt" 2>&1 &
            pids+=($!)
          done
          echo "==== Streaming logs from leader pod: $LEADER_POD ===="
          echo "Looking for logs containing: $FAIL_TAG"
          kubectl logs -f "$LEADER_POD" -n "$NAMESPACE" | while IFS= read -r line; do
            echo "$line"
            if echo "$line" | grep -q "$FAIL_TAG"; then
              exit 1
            fi
          done
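      # Note: only the worker logs end up in /tmp/vllm*_logs.txt; the leader's
      # output lives in the job log and is not part of the uploaded artifact.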
      - name: Upload logs
        if: always()
        uses: actions/upload-artifact@v7
        with:
          name: ${{ inputs.config_file_path }}-pod-logs
          path: /tmp/vllm*_logs.txt
          retention-days: 7
      - name: Post process
        if: always()
        run: |
          kubectl get pods -n "$NAMESPACE" --ignore-not-found=true
          kubectl delete -f ./lws.yaml --ignore-not-found=true || true