xc-llm-ascend/.github/workflows/_e2e_test.yaml
huangning1995 ce11fd49f3 [Feature] Batch invariant torch.compile (#6107)
### What this PR does / why we need it?
Building on https://github.com/vllm-project/vllm-ascend/pull/5517, which
enabled batch-invariant (BI) inference in vllm-ascend, we observed that BI
performance in eager mode remains suboptimal.

This PR further integrates batch-invariant with torch.compile, which
raises output token throughput by roughly 3.5x (429.16 vs. 120.85 tok/s)
when tested with Qwen3-0.6B.

### Does this PR introduce _any_ user-facing change?
Previously, enabling both aclgraph and Batch-Invariant would cause an
"ub overflow" error. This occurred because transposed input tensors
could produce incorrect stride() values.

To fix this, we now call .contiguous() on the input tensors before
passing them to Triton kernels. This ensures a contiguous memory layout
and prevents transposed tensors from causing incorrect stride
calculations.
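
To make the stride issue concrete, here is a minimal pure-Python sketch of how a view-based transpose changes strides and what a contiguous copy restores. It is an illustration only, not the code from this PR: the helper names `contiguous_strides`, `transpose` and `is_contiguous` are hypothetical, strides are counted in elements, and the contiguity check is simplified (it ignores the size-1 dimension special case that real tensor libraries handle).

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a densely packed tensor."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def transpose(shape, strides, dim0, dim1):
    """Swap two dimensions as a view: only metadata changes, no data moves."""
    shape, strides = list(shape), list(strides)
    shape[dim0], shape[dim1] = shape[dim1], shape[dim0]
    strides[dim0], strides[dim1] = strides[dim1], strides[dim0]
    return tuple(shape), tuple(strides)


def is_contiguous(shape, strides):
    """Simplified check: strides must equal the row-major strides."""
    return strides == contiguous_strides(shape)


shape = (4, 8)
strides = contiguous_strides(shape)                    # (8, 1)
t_shape, t_strides = transpose(shape, strides, 0, 1)   # (8, 4), (1, 8)

# A kernel that assumes the innermost stride is 1 would misread the
# transposed view. A contiguous copy has the strides it expects.
fixed_strides = contiguous_strides(t_shape)            # (4, 1)

print(t_shape, t_strides, is_contiguous(t_shape, t_strides))
print(t_shape, fixed_strides, is_contiguous(t_shape, fixed_strides))
```

The trade-off is an extra copy whenever the input is not already contiguous, in exchange for a layout the kernel can index safely.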

### Test Plan
pytest -sv --durations=0 tests/e2e/singlecard/test_aclgraph_batch_invariant.py

### Test Result
```
============================================================================ slowest durations ============================================================================
87.37s call     tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_v1_generation_is_deterministic_across_batch_sizes_with_needle
77.39s call     tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_logprobs_bitwise_batch_invariance_bs1_vs_bsN
74.04s call     tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_logprobs_without_batch_invariance_should_fail
73.59s call     tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_simple_generation

(8 durations < 0.005s hidden.  Use -vv to show these durations.)
================================================================ 4 passed, 3 warnings in 312.45s (0:05:12) ================================================================
```
### Performance
export VLLM_BATCH_INVARIANT=1
vllm serve /home/Qwen3-0.6B \
--served-model-name qwen \
--port 8000 \
--max-num-seqs 256 \
--tensor-parallel-size 1 \
--max-model-len 5500 \
--max-num-batched-tokens 5500 \
--reasoning-parser qwen3 \
--gpu-memory-utilization 0.9 \
--compilation_config '{"cudagraph_mode":"FULL_DECODE_ONLY","cudagraph_capture_sizes":[1,2,4,8,16,32]}' \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"enable_weight_nz_layout":true}'

vllm bench serve --served-model-name qwen --trust-remote-code --backend vllm \
--model /home/Qwen3-0.6B/ --endpoint /v1/completions --dataset-name random \
--random-input-len 512 --random-output-len 256 --num-prompts 800 \
--max-concurrency 8

torch.compile batch invariant performance:
```
============ Serving Benchmark Result ============
Successful requests:                     800       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  477.21    
Total input tokens:                      409600    
Total generated tokens:                  204800    
Request throughput (req/s):              1.68      
Output token throughput (tok/s):         429.16    
Peak output token throughput (tok/s):    472.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          1287.48   
---------------Time to First Token----------------
Mean TTFT (ms):                          285.53    
Median TTFT (ms):                        312.70    
P99 TTFT (ms):                           324.22    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.59     
Median TPOT (ms):                        17.50     
P99 TPOT (ms):                           18.44     
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.59     
Median ITL (ms):                         17.45     
P99 ITL (ms):                            18.76     
==================================================
```
Eager mode batch invariant performance:
```
============ Serving Benchmark Result ============
Successful requests:                     800       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  1694.70   
Total input tokens:                      409600    
Total generated tokens:                  204800    
Request throughput (req/s):              0.47      
Output token throughput (tok/s):         120.85    
Peak output token throughput (tok/s):    136.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          362.54    
---------------Time to First Token----------------
Mean TTFT (ms):                          164.29    
Median TTFT (ms):                        129.71    
P99 TTFT (ms):                           1961.66   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          65.81     
Median TPOT (ms):                        65.15     
P99 TPOT (ms):                           72.27     
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.81     
Median ITL (ms):                         64.64     
P99 ITL (ms):                            75.72     
==================================================
```
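
The two result tables can be compared with a few lines of arithmetic. The sketch below copies the relevant figures from the tables above and computes the ratios; the dictionary keys and variable names are made up for illustration.

```python
# Figures copied from the two benchmark tables above.
compiled = {
    "duration_s": 477.21,
    "output_tok_s": 429.16,
    "mean_tpot_ms": 17.59,
    "mean_ttft_ms": 285.53,
}
eager = {
    "duration_s": 1694.70,
    "output_tok_s": 120.85,
    "mean_tpot_ms": 65.81,
    "mean_ttft_ms": 164.29,
}

# Higher is better for throughput, so divide compiled by eager.
throughput_speedup = compiled["output_tok_s"] / eager["output_tok_s"]
# Lower is better for latency and duration, so divide eager by compiled.
tpot_speedup = eager["mean_tpot_ms"] / compiled["mean_tpot_ms"]
duration_speedup = eager["duration_s"] / compiled["duration_s"]
# Mean TTFT moves the other way: it is higher with torch.compile.
ttft_ratio = compiled["mean_ttft_ms"] / eager["mean_ttft_ms"]

print(f"output throughput speedup: {throughput_speedup:.2f}x")
print(f"mean TPOT speedup:         {tpot_speedup:.2f}x")
print(f"benchmark duration ratio:  {duration_speedup:.2f}x")
print(f"mean TTFT (compiled/eager): {ttft_ratio:.2f}x")
```

On these numbers, throughput and per-token latency improve by about 3.5x to 3.7x, while mean TTFT is higher in the torch.compile run even though its P99 TTFT is much lower (324.22 ms vs. 1961.66 ms).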

- vLLM version: v0.13.0
- vLLM main: d68209402d

---------

Signed-off-by: huangning1995 <huangning12@huawei.com>
2026-01-26 09:15:06 +08:00


name: 'e2e test'
on:
  workflow_call:
    inputs:
      vllm:
        required: true
        type: string
      runner:
        required: true
        type: string
      image:
        required: true
        type: string
      type:
        required: true
        type: string
      contains_310:
        required: true
        type: boolean
jobs:
  e2e:
    name: singlecard
    runs-on: linux-aarch64-a2b3-1
    container:
      image: ${{ inputs.image }}
      env:
        VLLM_LOGGING_LEVEL: ERROR
        VLLM_USE_MODELSCOPE: True
        TRANSFORMERS_OFFLINE: 1
    steps:
      - name: Check npu and CANN info
        run: |
          npu-smi info
          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
      - name: Config mirrors
        run: |
          sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
          pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
          pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
          apt-get update -y
          apt install git -y
      - name: Checkout vllm-project/vllm-ascend repo
        uses: actions/checkout@v6
      - name: Install system dependencies
        run: |
          apt-get -y install `cat packages.txt`
          apt-get -y install gcc g++ cmake libnuma-dev clang-15
          update-alternatives --install /usr/bin/clang clang /usr/bin/clang-15 20
          update-alternatives --install /usr/bin/clang++ clang++ /usr/bin/clang++-15 20
      - name: Checkout vllm-project/vllm repo
        uses: actions/checkout@v6
        with:
          repository: vllm-project/vllm
          ref: ${{ inputs.vllm }}
          path: ./vllm-empty
          fetch-depth: 1
      - name: Install vllm-project/vllm from source
        working-directory: ./vllm-empty
        run: |
          VLLM_TARGET_DEVICE=empty pip install -e .
      - name: Install vllm-project/vllm-ascend
        env:
          PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
        run: |
          pip install -r requirements-dev.txt
          pip install -v -e .
      - name: Run vllm-project/vllm-ascend test
        env:
          PYTORCH_NPU_ALLOC_CONF: max_split_size_mb:256
          VLLM_WORKER_MULTIPROC_METHOD: spawn
        if: ${{ inputs.type == 'light' }}
        run: |
          pytest -sv --durations=0 tests/e2e/singlecard/test_aclgraph_accuracy.py::test_piecewise_res_consistency
          pytest -sv --durations=0 tests/e2e/singlecard/test_quantization.py::test_qwen3_w8a8_quant
      - name: Run e2e test
        env:
          VLLM_WORKER_MULTIPROC_METHOD: spawn
          PYTORCH_NPU_ALLOC_CONF: max_split_size_mb:256
        if: ${{ inputs.type == 'full' }}
        run: |
          # We found that if running aclgraph tests in batch, it will cause AclmdlRICaptureBegin error. So we run
          # the test separately.
          # basic
          pytest -sv --durations=0 tests/e2e/singlecard/test_auto_fit_max_mode_len.py
          pytest -sv --durations=0 tests/e2e/singlecard/test_aclgraph_accuracy.py
          pytest -sv --durations=0 tests/e2e/singlecard/test_aclgraph_batch_invariant.py
          pytest -sv --durations=0 tests/e2e/singlecard/test_aclgraph_mem.py
          pytest -sv --durations=0 tests/e2e/singlecard/test_async_scheduling.py
          pytest -sv --durations=0 tests/e2e/singlecard/test_batch_invariant.py
          pytest -sv --durations=0 tests/e2e/singlecard/test_camem.py
          pytest -sv --durations=0 tests/e2e/singlecard/test_completion_with_prompt_embeds.py
          pytest -sv --durations=0 tests/e2e/singlecard/test_cpu_offloading.py
          pytest -sv --durations=0 tests/e2e/singlecard/test_guided_decoding.py
          pytest -sv --durations=0 tests/e2e/singlecard/test_ilama_lora.py
          pytest -sv --durations=0 tests/e2e/singlecard/test_llama32_lora.py
          pytest -sv --durations=0 tests/e2e/singlecard/test_qwen3_multi_loras.py
          pytest -sv --durations=0 tests/e2e/singlecard/test_models.py
          pytest -sv --durations=0 tests/e2e/singlecard/test_multistream_overlap_shared_expert.py
          pytest -sv --durations=0 tests/e2e/singlecard/test_profile_execute_duration.py
          pytest -sv --durations=0 tests/e2e/singlecard/test_quantization.py
          pytest -sv --durations=0 tests/e2e/singlecard/test_sampler.py
          pytest -sv --durations=0 tests/e2e/singlecard/test_vlm.py
          pytest -sv --durations=0 tests/e2e/singlecard/test_xlite.py
          # compile
          pytest -sv --durations=0 tests/e2e/singlecard/compile/test_norm_quant_fusion.py
          # model_runner_v2
          # pytest -sv --durations=0 tests/e2e/singlecard/model_runner_v2/test_basic.py
          # pooling
          pytest -sv --durations=0 tests/e2e/singlecard/pooling/test_classification.py
          pytest -sv --durations=0 tests/e2e/singlecard/pooling/test_embedding.py
          pytest -sv --durations=0 tests/e2e/singlecard/pooling/test_scoring.py
          # spec_decode
          pytest -sv --durations=0 tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py
          pytest -sv --durations=0 tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py
  e2e-2-cards:
    name: multicard-2
    runs-on: linux-aarch64-a3-2
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11
      env:
        VLLM_LOGGING_LEVEL: ERROR
        VLLM_USE_MODELSCOPE: True
        HCCL_BUFFSIZE: 1024
        TRANSFORMERS_OFFLINE: 1
    steps:
      - name: Check npu and CANN info
        run: |
          npu-smi info
          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
      - name: Config mirrors
        run: |
          sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
          pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
          pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
          apt-get update -y
          apt install git -y
      - name: Checkout vllm-project/vllm-ascend repo
        uses: actions/checkout@v6
      - name: Install system dependencies
        run: |
          apt-get -y install `cat packages.txt`
          apt-get -y install gcc g++ cmake libnuma-dev clang-15
          update-alternatives --install /usr/bin/clang clang /usr/bin/clang-15 20
          update-alternatives --install /usr/bin/clang++ clang++ /usr/bin/clang++-15 20
      - name: Checkout vllm-project/vllm repo
        uses: actions/checkout@v6
        with:
          repository: vllm-project/vllm
          ref: ${{ inputs.vllm }}
          path: ./vllm-empty
          fetch-depth: 1
      - name: Install vllm-project/vllm from source
        working-directory: ./vllm-empty
        run: |
          VLLM_TARGET_DEVICE=empty pip install -e .
      - name: Install vllm-project/vllm-ascend
        env:
          PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
        run: |
          pip install -r requirements-dev.txt
          pip install -v -e .
      - name: Run vllm-project/vllm-ascend test (light)
        env:
          VLLM_WORKER_MULTIPROC_METHOD: spawn
        if: ${{ inputs.type == 'light' }}
        run: |
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_qwen3_moe.py::test_qwen3_moe_distributed_mp_tp2_ep
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_deepseek3_2_w8a8_pruning_mtp_tp2_ep
      - name: Run vllm-project/vllm-ascend test (full)
        env:
          VLLM_WORKER_MULTIPROC_METHOD: spawn
        if: ${{ inputs.type == 'full' }}
        run: |
          # this test fail with triton. Fix me.
          # pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_aclgraph_capture_replay.py
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_qwen3_performance.py
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_data_parallel.py
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_expert_parallel.py
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_external_launcher.py
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_full_graph_mode.py
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_ilama_lora_tp2.py
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/spec_decode/test_spec_decode.py
          # To avoid oom, we need to run the test in a single process.
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_deepseek_multistream_moe_tp2
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_qwen3_w4a8_dynamic_tp2
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_qwen3_moe_sp_tp2
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_deepseek_w4a8_accuracy_tp2
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_qwen3_moe_fc2_tp2
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_deepseek_v2_lite_fc1_tp2
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_qwen3_dense_fc1_tp2
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_qwen3_dense_prefetch_mlp_weight_tp2
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_deepseek3_2_w8a8_pruning_mtp_tp2_ep
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_qwen3_w4a4_distributed_tp2
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_offline_weight_load.py
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_pipeline_parallel.py
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_prefix_caching.py
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_quantization.py
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_qwen3_moe.py
          # This test is broken, fix me
          #pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_shared_expert_dp.py
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_single_request_aclgraph.py
      - name: Run vllm-project/vllm-ascend test (non triton)
        if: ${{ inputs.type == 'full' }}
        env:
          VLLM_WORKER_MULTIPROC_METHOD: spawn
        run: |
          python3 -m pip uninstall -y triton-ascend
          pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_aclgraph_capture_replay.py
  e2e-4-cards:
    name: multicard-4
    needs: [e2e-2-cards]
    if: ${{ needs.e2e-2-cards.result == 'success' && inputs.type == 'full' }}
    runs-on: linux-aarch64-a3-4
    container:
      image: m.daocloud.io/quay.io/ascend/cann:8.5.0-a3-ubuntu22.04-py3.11
      env:
        VLLM_LOGGING_LEVEL: ERROR
        VLLM_USE_MODELSCOPE: True
        TRANSFORMERS_OFFLINE: 1
    steps:
      - name: Check npu and CANN info
        run: |
          npu-smi info
          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
      - name: Config mirrors
        run: |
          sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
          pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
          pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
          apt-get update -y
          apt install git wget curl -y
          git config --global url."https://gh-proxy.test.osinfra.cn/https://github.com/".insteadOf https://github.com/
      - name: Checkout vllm-project/vllm-ascend repo
        uses: actions/checkout@v6
        with:
          path: ./vllm-ascend
      - name: Install system dependencies
        run: |
          apt-get -y install `cat packages.txt`
          apt-get -y install gcc g++ cmake libnuma-dev clang-15
          update-alternatives --install /usr/bin/clang clang /usr/bin/clang-15 20
          update-alternatives --install /usr/bin/clang++ clang++ /usr/bin/clang++-15 20
      - name: Checkout vllm-project/vllm repo
        uses: actions/checkout@v6
        with:
          repository: vllm-project/vllm
          ref: ${{ inputs.vllm }}
          path: ./vllm-empty
      - name: Install vllm-project/vllm from source
        working-directory: ./vllm-empty
        run: |
          VLLM_TARGET_DEVICE=empty pip install -e .
      - name: Install vllm-project/vllm-ascend
        working-directory: ./vllm-ascend
        run: |
          export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi
          export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/devlib
          pip install -r requirements-dev.txt
          pip install -v -e .
      - name: Run vllm-project/vllm-ascend test for V1 Engine
        working-directory: ./vllm-ascend
        env:
          VLLM_WORKER_MULTIPROC_METHOD: spawn
        run: |
          pytest -sv --durations=0 tests/e2e/multicard/4-cards/test_data_parallel_tp2.py
          pytest -sv --durations=0 tests/e2e/multicard/4-cards/test_kimi_k2.py
          pytest -sv --durations=0 tests/e2e/multicard/4-cards/test_qwen3_next.py
          # recover once aclgraph stream bug fixed.
          # long_sequence
          # pytest -sv --durations=0 tests/e2e/multicard/4-cards/long_sequence/test_accuracy.py
          # pytest -sv --durations=0 tests/e2e/multicard/4-cards/long_sequence/test_basic.py
          # pytest -sv --durations=0 tests/e2e/multicard/4-cards/long_sequence/test_chunked_prefill.py
          # pytest -sv --durations=0 tests/e2e/multicard/4-cards/long_sequence/test_mtp.py
          # # spec_decode
          # pytest -sv --durations=0 tests/e2e/multicard/4-cards/spec_decode/test_mtp_qwen3_next.py
  e2e_310p:
    name: 310p singlecard
    runs-on: linux-aarch64-310p-1
    if: ${{ inputs.contains_310 }}
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-310p-ubuntu22.04-py3.11
      env:
        VLLM_LOGGING_LEVEL: ERROR
        VLLM_USE_MODELSCOPE: True
        TRANSFORMERS_OFFLINE: 1
    steps:
      - name: Check npu and CANN info
        run: |
          npu-smi info
          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
      - name: Config mirrors
        run: |
          sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
          pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
          pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
          apt-get update -y
          apt install git -y
      - name: Checkout vllm-project/vllm-ascend repo
        uses: actions/checkout@v6
      - name: Install system dependencies
        run: |
          apt-get -y install `cat packages.txt`
          apt-get -y install gcc g++ cmake libnuma-dev
      - name: Checkout vllm-project/vllm repo
        uses: actions/checkout@v6
        with:
          repository: vllm-project/vllm
          ref: ${{ inputs.vllm }}
          path: ./vllm-empty
          fetch-depth: 1
      - name: Install vllm-project/vllm from source
        working-directory: ./vllm-empty
        run: |
          VLLM_TARGET_DEVICE=empty pip install -e .
      - name: Install vllm-project/vllm-ascend
        env:
          PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
        run: |
          pip install -r requirements-dev.txt
          pip install -v -e .
      - name: Run vllm-project/vllm-ascend test
        env:
          PYTORCH_NPU_ALLOC_CONF: max_split_size_mb:256
          VLLM_WORKER_MULTIPROC_METHOD: spawn
        run: |
          pytest -sv --durations=0 tests/e2e/310p/test_offline_inference_310p.py
  e2e_310p-4cards:
    name: 310p multicards 4cards
    runs-on: linux-aarch64-310p-4
    if: ${{ inputs.contains_310 }}
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-310p-ubuntu22.04-py3.11
      env:
        VLLM_LOGGING_LEVEL: ERROR
        VLLM_USE_MODELSCOPE: True
        TRANSFORMERS_OFFLINE: 1
    steps:
      - name: Check npu and CANN info
        run: |
          npu-smi info
          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
      - name: Config mirrors
        run: |
          sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
          pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
          pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
          apt-get update -y
          apt install git -y
      - name: Checkout vllm-project/vllm-ascend repo
        uses: actions/checkout@v6
      - name: Install system dependencies
        run: |
          apt-get -y install `cat packages.txt`
          apt-get -y install gcc g++ cmake libnuma-dev
      - name: Checkout vllm-project/vllm repo
        uses: actions/checkout@v6
        with:
          repository: vllm-project/vllm
          ref: ${{ inputs.vllm }}
          path: ./vllm-empty
          fetch-depth: 1
      - name: Install vllm-project/vllm from source
        working-directory: ./vllm-empty
        run: |
          VLLM_TARGET_DEVICE=empty pip install -e .
      - name: Install vllm-project/vllm-ascend
        env:
          PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
        run: |
          pip install -r requirements-dev.txt
          pip install -v -e .
      - name: Run vllm-project/vllm-ascend test
        env:
          PYTORCH_NPU_ALLOC_CONF: max_split_size_mb:256
          VLLM_WORKER_MULTIPROC_METHOD: spawn
        run: |
          pytest -sv --durations=0 tests/e2e/310p/test_offline_inference_parallel_310p.py
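
The workflow runs each test file in its own pytest invocation because, per its own comment, batching the aclgraph tests triggers an AclmdlRICaptureBegin error. For reproducing that pattern locally, a minimal Python sketch might look like the following. The helper names `build_commands` and `run_separately` are hypothetical and not part of the repository, and unlike the CI step this sketch keeps going after a failure and reports every failing file at the end.

```python
import subprocess
import sys


def build_commands(test_files, python=sys.executable):
    """One pytest command per file, using the same flags as the workflow."""
    return [
        [python, "-m", "pytest", "-sv", "--durations=0", path]
        for path in test_files
    ]


def run_separately(test_files, runner=subprocess.run):
    """Run each file in a fresh process and return the files that failed."""
    failed = []
    for cmd in build_commands(test_files):
        result = runner(cmd)  # each call starts a new interpreter
        if result.returncode != 0:
            failed.append(cmd[-1])
    return failed


if __name__ == "__main__":
    # Pass the test files to run as command-line arguments.
    failures = run_separately(sys.argv[1:])
    if failures:
        print("failed:", *failures, sep="\n  ")
    sys.exit(1 if failures else 0)
```

The `runner` parameter exists only so the loop can be exercised without launching real processes. With the default it calls `subprocess.run` once per file.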