### What this PR does / why we need it?
This PR refactors the E2E multicard test suite to improve test case
identification and maintainability. Specifically, it renames various
test functions to be more descriptive (explicitly indicating model
families like Qwen/DeepSeek and parallelism strategies like DP/TP/PP/EP)
and cleans up outdated or redundant test configurations in the offline
distributed inference tests.
**Key Changes:**
1. Test Function Renaming (Standardization): Renamed multiple test
functions across **`tests/e2e/multicard/`** to include clear
suffixes/prefixes regarding the model and parallel strategy. This helps
differentiate test cases in CI logs and prevents naming collisions.
**`test_aclgraph_capture_replay.py`:**
- `test_aclgraph_capture_replay_dp2` ->
`test_aclgraph_capture_replay_metrics_dp2`
**`test_data_parallel.py`:**
- `test_data_parallel_inference` -> `test_qwen_inference_dp2`
**`test_data_parallel_tp2.py`:**
- `test_data_parallel_inference` -> `test_qwen_inference_dp2_tp2`
**`test_expert_parallel.py`:**
- `test_e2e_ep_correctness` -> `test_deepseek_correctness_ep`
**`test_external_launcher.py`:**
- `test_external_launcher` -> `test_qwen_external_launcher`
- `test_moe_external_launcher` -> `test_qwen_moe_external_launcher_ep`
- `test_external_launcher_and_sleepmode` ->
`test_qwen_external_launcher_with_sleepmode`
- `test_external_launcher_and_sleepmode_level2` ->
`test_qwen_external_launcher_with_sleepmode_level2`
- `test_mm_allreduce` ->
`test_qwen_external_launcher_with_matmul_allreduce`
**`test_full_graph_mode.py`:**
- `test_models_distributed_Qwen3_MOE_TP2_WITH_FULL_DECODE_ONLY` ->
`test_qwen_moe_with_full_decode_only`
- `test_models_distributed_Qwen3_MOE_TP2_WITH_FULL` ->
`test_qwen_moe_with_full`
**`test_fused_moe_allgather_ep.py`:**
- `test_generate_with_allgather `->
`test_deepseek_moe_fused_allgather_ep`
- `test_generate_with_alltoall` -> `test_deepseek_moe_fused_alltoall_ep`
**`test_offline_weight_load.py`:**
- `test_offline_weight_load_and_sleepmode` ->
`test_qwen_offline_weight_load_and_sleepmode`
**`test_pipeline_parallel.py`:**
- `test_models` -> `test_models_pp2`
2. Distributed Inference Cleanup
(**`test_offline_inference_distributed.py`**):
**model list changes:**
```
QWEN_DENSE_MODELS = [
- "vllm-ascend/Qwen3-8B-W8A8", "vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8"
+ "vllm-ascend/Qwen3-8B-W8A8",
]
```
```
- QWEN_W4A8_OLD_VERSION_MODELS = [
- "vllm-ascend/Qwen3-8B-W4A8",
- ]
- QWEN_W4A8_NEW_VERSION_MODELS = [
- "vllm-ascend/DeepSeek-V3-W4A8-Pruing",
- "vllm-ascend/DeepSeek-V3.1-W4A8-puring",
- ]
+ DEEPSEEK_W4A8_MODELS = [
+ "vllm-ascend/DeepSeek-V3.1-W4A8-puring",
+ ]
```
**Test Function Changes:**
- removed `test_models_distributed_QwQ`
- removed `test_models_distributed_Qwen3_W8A8`
- removed `test_models_distributed_Qwen3_W4A8DYNAMIC_old_version`
- `test_models_distributed_Qwen3_W4A8DYNAMIC_new_version` ->
`test_models_distributed_Qwen3_W4A8DYNAMIC`
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
287 lines
12 KiB
YAML
287 lines
12 KiB
YAML
name: 'e2e test'
|
|
|
|
on:
|
|
workflow_call:
|
|
inputs:
|
|
vllm:
|
|
required: true
|
|
type: string
|
|
runner:
|
|
required: true
|
|
type: string
|
|
image:
|
|
required: true
|
|
type: string
|
|
type:
|
|
required: true
|
|
type: string
|
|
|
|
jobs:
|
|
e2e:
|
|
name: singlecard
|
|
runs-on: ${{ inputs.runner }}-1
|
|
container:
|
|
image: ${{ inputs.image }}
|
|
env:
|
|
VLLM_LOGGING_LEVEL: ERROR
|
|
VLLM_USE_MODELSCOPE: True
|
|
TRANSFORMERS_OFFLINE: 1
|
|
steps:
|
|
- name: Check npu and CANN info
|
|
run: |
|
|
npu-smi info
|
|
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
|
|
|
|
- name: Config mirrors
|
|
run: |
|
|
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
|
|
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
|
|
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
|
|
apt-get update -y
|
|
apt install git -y
|
|
|
|
- name: Checkout vllm-project/vllm-ascend repo
|
|
uses: actions/checkout@v6.0.1
|
|
|
|
- name: Install system dependencies
|
|
run: |
|
|
apt-get -y install `cat packages.txt`
|
|
apt-get -y install gcc g++ cmake libnuma-dev
|
|
|
|
- name: Checkout vllm-project/vllm repo
|
|
uses: actions/checkout@v6.0.1
|
|
with:
|
|
repository: vllm-project/vllm
|
|
ref: ${{ inputs.vllm }}
|
|
path: ./vllm-empty
|
|
fetch-depth: 1
|
|
|
|
- name: Install vllm-project/vllm from source
|
|
working-directory: ./vllm-empty
|
|
run: |
|
|
VLLM_TARGET_DEVICE=empty pip install -e .
|
|
|
|
- name: Install vllm-project/vllm-ascend
|
|
env:
|
|
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
|
|
run: |
|
|
pip install -r requirements-dev.txt
|
|
pip install -v -e .
|
|
|
|
- name: Run vllm-project/vllm-ascend test
|
|
env:
|
|
VLLM_WORKER_MULTIPROC_METHOD: spawn
|
|
VLLM_USE_MODELSCOPE: True
|
|
PYTORCH_NPU_ALLOC_CONF: max_split_size_mb:256
|
|
if: ${{ inputs.type == 'light' }}
|
|
run: |
|
|
# pytest -sv tests/e2e/singlecard/test_aclgraph_accuracy.py
|
|
# pytest -sv tests/e2e/singlecard/test_quantization.py
|
|
pytest -sv tests/e2e/singlecard/test_vlm.py::test_multimodal_vl
|
|
pytest -sv tests/e2e/singlecard/pooling/test_classification.py::test_classify_correctness
|
|
|
|
- name: Run e2e test
|
|
env:
|
|
VLLM_WORKER_MULTIPROC_METHOD: spawn
|
|
VLLM_USE_MODELSCOPE: True
|
|
PYTORCH_NPU_ALLOC_CONF: max_split_size_mb:256
|
|
if: ${{ inputs.type == 'full' }}
|
|
run: |
|
|
# We found that if running aclgraph tests in batch, it will cause AclmdlRICaptureBegin error. So we run
|
|
# the test separately.
|
|
|
|
pytest -sv tests/e2e/singlecard/test_completion_with_prompt_embeds.py
|
|
pytest -sv tests/e2e/singlecard/test_aclgraph_accuracy.py
|
|
pytest -sv tests/e2e/singlecard/test_aclgraph_mem.py
|
|
pytest -sv tests/e2e/singlecard/test_camem.py
|
|
pytest -sv tests/e2e/singlecard/test_guided_decoding.py
|
|
# torch 2.8 doesn't work with lora, fix me
|
|
#pytest -sv tests/e2e/singlecard/test_ilama_lora.py
|
|
pytest -sv tests/e2e/singlecard/test_profile_execute_duration.py
|
|
pytest -sv tests/e2e/singlecard/test_quantization.py
|
|
pytest -sv tests/e2e/singlecard/test_sampler.py
|
|
pytest -sv tests/e2e/singlecard/test_vlm.py
|
|
pytest -sv tests/e2e/singlecard/test_xlite.py
|
|
pytest -sv tests/e2e/singlecard/pooling/
|
|
pytest -sv tests/e2e/singlecard/compile/test_norm_quant_fusion.py
|
|
|
|
# ------------------------------------ v1 spec decode test ------------------------------------ #
|
|
pytest -sv tests/e2e/singlecard/spec_decode_v1/test_v1_mtp_correctness.py
|
|
pytest -sv tests/e2e/singlecard/spec_decode_v1/test_v1_spec_decode.py
|
|
|
|
e2e-2-cards:
|
|
name: multicard-2
|
|
runs-on: linux-aarch64-a3-2
|
|
container:
|
|
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-a3-ubuntu22.04-py3.11
|
|
env:
|
|
VLLM_LOGGING_LEVEL: ERROR
|
|
VLLM_USE_MODELSCOPE: True
|
|
HCCL_BUFFSIZE: 1024
|
|
TRANSFORMERS_OFFLINE: 1
|
|
steps:
|
|
- name: Check npu and CANN info
|
|
run: |
|
|
npu-smi info
|
|
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
|
|
|
|
- name: Config mirrors
|
|
run: |
|
|
# Fix me: use nginx cache rather than the pypi
|
|
# sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
|
|
# pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
|
|
# pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
|
|
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
|
|
apt-get update -y
|
|
apt install git -y
|
|
|
|
- name: Checkout vllm-project/vllm-ascend repo
|
|
uses: actions/checkout@v6.0.1
|
|
|
|
- name: Install system dependencies
|
|
run: |
|
|
apt-get -y install `cat packages.txt`
|
|
apt-get -y install gcc g++ cmake libnuma-dev
|
|
|
|
- name: Checkout vllm-project/vllm repo
|
|
uses: actions/checkout@v6.0.1
|
|
with:
|
|
repository: vllm-project/vllm
|
|
ref: ${{ inputs.vllm }}
|
|
path: ./vllm-empty
|
|
fetch-depth: 1
|
|
|
|
- name: Install vllm-project/vllm from source
|
|
working-directory: ./vllm-empty
|
|
run: |
|
|
VLLM_TARGET_DEVICE=empty pip install -e .
|
|
|
|
- name: Install vllm-project/vllm-ascend
|
|
env:
|
|
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
|
|
run: |
|
|
pip install -r requirements-dev.txt
|
|
pip install -v -e .
|
|
|
|
- name: Run vllm-project/vllm-ascend test (light)
|
|
env:
|
|
VLLM_WORKER_MULTIPROC_METHOD: spawn
|
|
VLLM_USE_MODELSCOPE: True
|
|
if: ${{ inputs.type == 'light' }}
|
|
run: |
|
|
pytest -sv tests/e2e/multicard/test_qwen3_moe.py::test_models_distributed_Qwen3_MOE_TP2_WITH_EP
|
|
|
|
- name: Run vllm-project/vllm-ascend test (full)
|
|
env:
|
|
VLLM_WORKER_MULTIPROC_METHOD: spawn
|
|
VLLM_USE_MODELSCOPE: True
|
|
if: ${{ inputs.type == 'full' }}
|
|
run: |
|
|
pytest -sv tests/e2e/multicard/test_quantization.py
|
|
pytest -sv tests/e2e/multicard/test_aclgraph_capture_replay.py
|
|
pytest -sv tests/e2e/multicard/test_full_graph_mode.py
|
|
pytest -sv tests/e2e/multicard/test_data_parallel.py
|
|
pytest -sv tests/e2e/multicard/test_expert_parallel.py
|
|
pytest -sv tests/e2e/multicard/test_external_launcher.py
|
|
pytest -sv tests/e2e/multicard/test_single_request_aclgraph.py
|
|
pytest -sv tests/e2e/multicard/test_fused_moe_allgather_ep.py
|
|
# torch 2.8 doesn't work with lora, fix me
|
|
#pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py
|
|
|
|
# To avoid oom, we need to run the test in a single process.
|
|
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_multistream_moe
|
|
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC
|
|
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC
|
|
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_sp_for_qwen3_moe
|
|
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_fc2_for_qwen3_moe
|
|
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen_Dense_with_flashcomm_v1
|
|
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen_Dense_with_prefetch_mlp_weight
|
|
|
|
pytest -sv tests/e2e/multicard/test_pipeline_parallel.py
|
|
pytest -sv tests/e2e/multicard/test_prefix_caching.py
|
|
pytest -sv tests/e2e/multicard/test_qwen3_moe.py
|
|
pytest -sv tests/e2e/multicard/test_offline_weight_load.py
|
|
|
|
e2e-4-cards:
|
|
name: multicard-4
|
|
needs: [e2e, e2e-2-cards]
|
|
if: ${{ needs.e2e.result == 'success' && needs.e2e-2-cards.result == 'success' && inputs.type == 'full' }}
|
|
runs-on: linux-aarch64-a3-4
|
|
container:
|
|
image: m.daocloud.io/quay.io/ascend/cann:8.3.rc2-a3-ubuntu22.04-py3.11
|
|
env:
|
|
VLLM_LOGGING_LEVEL: ERROR
|
|
VLLM_USE_MODELSCOPE: True
|
|
TRANSFORMERS_OFFLINE: 1
|
|
steps:
|
|
- name: Check npu and CANN info
|
|
run: |
|
|
npu-smi info
|
|
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
|
|
|
|
- name: Config mirrors
|
|
run: |
|
|
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
|
|
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
|
|
apt-get update -y
|
|
apt install git wget curl -y
|
|
git config --global url."https://gh-proxy.test.osinfra.cn/https://github.com/".insteadOf https://github.com/
|
|
|
|
- name: Checkout vllm-project/vllm-ascend repo
|
|
uses: actions/checkout@v6.0.1
|
|
with:
|
|
path: ./vllm-ascend
|
|
|
|
- name: Install system dependencies
|
|
run: |
|
|
apt-get -y install `cat packages.txt`
|
|
apt-get -y install gcc g++ cmake libnuma-dev
|
|
|
|
- name: Checkout vllm-project/vllm repo
|
|
uses: actions/checkout@v6.0.1
|
|
with:
|
|
repository: vllm-project/vllm
|
|
ref: ${{ inputs.vllm }}
|
|
path: ./vllm-empty
|
|
|
|
- name: Install vllm-project/vllm from source
|
|
working-directory: ./vllm-empty
|
|
run: |
|
|
VLLM_TARGET_DEVICE=empty pip install -e .
|
|
|
|
- name: Install vllm-project/vllm-ascend
|
|
working-directory: ./vllm-ascend
|
|
run: |
|
|
export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi
|
|
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/devlib
|
|
pip install -r requirements-dev.txt
|
|
pip install -v -e .
|
|
|
|
- name: Run vllm-project/vllm-ascend test for V1 Engine
|
|
working-directory: ./vllm-ascend
|
|
env:
|
|
VLLM_WORKER_MULTIPROC_METHOD: spawn
|
|
VLLM_USE_MODELSCOPE: True
|
|
run: |
|
|
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_multistream_moe
|
|
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC
|
|
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Kimi_K2_Thinking_W4A16
|
|
# pytest -sv tests/e2e/multicard/test_qwen3_moe.py::test_models_distributed_Qwen3_MOE_TP2_WITH_EP
|
|
# pytest -sv tests/e2e/multicard/test_qwen3_moe.py::test_models_distributed_Qwen3_MOE_W8A8_WITH_EP
|
|
pytest -sv tests/e2e/multicard/test_data_parallel_tp2.py
|
|
- name: Install Ascend toolkit & triton_ascend (for Qwen3-Next-80B-A3B-Instruct)
|
|
shell: bash -l {0}
|
|
run: |
|
|
. /usr/local/Ascend/ascend-toolkit/8.3.RC2/bisheng_toolkit/set_env.sh
|
|
python3 -m pip install "https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/triton_ascend-3.2.0.dev2025110717-cp311-cp311-manylinux_2_27_aarch64.whl"
|
|
|
|
- name: Run vllm-project/vllm-ascend Qwen3 Next test
|
|
working-directory: ./vllm-ascend
|
|
shell: bash -el {0}
|
|
env:
|
|
VLLM_WORKER_MULTIPROC_METHOD: spawn
|
|
VLLM_USE_MODELSCOPE: True
|
|
run: |
|
|
. /usr/local/Ascend/ascend-toolkit/8.3.RC2/bisheng_toolkit/set_env.sh
|
|
pytest -sv tests/e2e/multicard/test_qwen3_next.py
|