### What this PR does / why we need it?
Integrates uv: Significantly accelerates pip install execution and
resolves concurrency issues caused by traditional pip caching
mechanisms.
Why `pip install uc-manager` is explicitly added:
This project depends on uc-manager. However, installing it via
`uv pip install uc-manager` currently fails due to a known issue. An issue has
already been filed with the upstream uv repository to address this.
Consequently, we explicitly invoke `pip install uc-manager` as a temporary
workaround to ensure the build succeeds.
https://github.com/ModelEngine-Group/unified-cache-management/issues/736
Why use `UV_SYSTEM_PYTHON: 1`:
No virtual environment has been created yet; this configuration has the
same effect as directly using `pip install`.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: tfhddd <2272751277@qq.com>
### What this PR does / why we need it?
Add e2e test cases for the Qwen-VL model adaptation to Ascend 310p
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: gcw_61wqY8cy <wanghengkang1@huawei.com>
### What this PR does / why we need it?
This patch adds a schedule-triggered workflow that automatically updates the e2e
estimated-time values for better load balancing.
1. The workflow will run the full e2e test to get the duration of each
test.
2. The script `update_estimated_time.py` will update the
[config.json](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/scripts/config.yaml)
with the latest measured durations
3. The workflow will submit a pull request that includes changes to
`config.json` automatically
<img width="2484" height="764" alt="image"
src="https://github.com/user-attachments/assets/02f3459c-bb3b-4f8e-9966-8bb2e5c1bbea"
/>
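The update step described above can be sketched roughly as follows. This is a minimal illustration, not the actual `update_estimated_time.py`: the function name, the config layout (suite name mapped to a list of entries with `name` and `estimated_time` keys) and the round-up policy are all assumptions made for the example.

```python
import math


def update_estimated_times(config, durations):
    """Return a copy of config with estimated_time refreshed from durations.

    config:    hypothetical layout {suite: [{"name": ..., "estimated_time": ...}]}
    durations: measured seconds per test file from the latest full e2e run
    """
    updated = {}
    for suite, tests in config.items():
        updated[suite] = [
            # Round up so a test is never budgeted less than it actually took.
            {**t, "estimated_time": math.ceil(durations[t["name"]])}
            if t["name"] in durations
            else dict(t)  # no new measurement: keep the old estimate
            for t in tests
        ]
    return updated
```

Returning a new structure instead of mutating the input makes it easy to diff old and new estimates before opening the pull request.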
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
[CI] Upgrade CANN to 8.5.1
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
Revert the PRs related to speeding up image builds and CI installation:
git revert 8835236181
git revert 64fba51275
git revert 263c2f8e8d
git revert 84b00695f8
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it?
1. Refactor the image workflow to use cache-from to speed up builds

Simultaneously refactored all Dockerfiles by placing layers that rarely
change before those that change frequently, improving build cache hit
rate.
2. Refactor the E2E tests to use vllm-ascend container images, skipping C
compilation when no C code has changed

In this case, the job only replaces the vllm-ascend source code
and installs `requirements-dev.txt`, saving about 10 minutes before the tests start
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
9562912cea
Signed-off-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
9562912cea
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
This pull request adds new end-to-end test cases for Qwen3 models on the
310P hardware platform.
The primary goal is to verify the stability and correctness of these
models across various parallelism strategies, data types, and
quantization methods.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
E2E test
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
---------
Signed-off-by: pu-zhe <zpuaa@outlook.com>
### What this PR does / why we need it?
- This PR removes several self-hosted runner labels from the
`actionlint.yaml` configuration file. These runners are likely no longer
in use, so this change cleans up the configuration and ensures
`actionlint` has an accurate list of available runners.
- Move all Action Dockerfiles into one folder
- Remove the unused `runner` input for the e2e test.
- Update the workflow option version
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This is a configuration change for the CI linter. The correctness will
be verified by `actionlint` running in CI on subsequent pull requests.
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
- Introduced 310P W8A8 Quantization Support: New modules and methods have
been added to enable W8A8 static quantization specifically for the
Ascend 310P platform.
- Platform-Specific Quantization Configuration Loading: The system now
dynamically loads the appropriate quantization configurations
(AscendCompressedTensorsConfig, AscendModelSlimConfig) based on whether
the current hardware is an Ascend 310P device.
- Implemented AscendW8A8LinearMethod310P: A dedicated linear quantization
method for 310P is provided, handling the specifics of weight and
activation quantization, including input parameter broadcasting and
weight data manipulation.
- Extended AscendModelSlimConfig for 310P: A specialized configuration
class for 310P integrates the new W8A8 linear method for both standard
linear layers and vocabulary parallel embeddings, ensuring proper
quantization application.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
Signed-off-by: Shaoxu Cheng <2906339855@qq.com>
### What this PR does / why we need it?
1. Disable early exit on error so that all tests run to
completion.
2. Within each partition, tests are re-sorted by `estimated_time` in
ascending order. This allows the CI to cover as many test cases as
possible in the early stages.
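The two behaviours above can be sketched together in a few lines. This is an illustrative outline only, not the code in `run_suite.py`; `run_partition` and the `run_one` callback are hypothetical names introduced for the example.

```python
def run_partition(tests, run_one):
    """Run every test in ascending estimated-time order; never stop early.

    tests:   iterable of (test_path, estimated_time_seconds) pairs
    run_one: hypothetical callback that runs one test and returns True on success
    """
    failed = []
    # Shortest tests first, so many cases are covered early in the job.
    for name, _est in sorted(tests, key=lambda t: t[1]):
        if not run_one(name):
            failed.append(name)  # record the failure and keep going
    return failed
```

The caller can then fail the job once at the end if the returned list is non-empty, which preserves a red CI status while still reporting every failing test.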
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
This patch adds an auto-partition feature for tests. For example, before this
PR the e2e single-card test ran for 2h40min. With auto-partitioning, test
cases are automatically allocated into the required n parts based on their
estimated duration (greedy strategy) and run in parallel.
As a result, the overall test duration drops to roughly 1/n of the
original.
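A greedy allocation of this kind can be sketched as below. It is a minimal illustration under assumptions, not the implementation in `run_suite.py`: it uses the common "longest first, into the currently lightest partition" heuristic, and the function name and input shape are hypothetical.

```python
import heapq


def greedy_partition(est_times, n):
    """Split tests into n partitions with balanced total estimated time.

    est_times: dict mapping test path -> estimated seconds
    """
    # Min-heap of (current_total_seconds, partition_index).
    heap = [(0.0, i) for i in range(n)]
    heapq.heapify(heap)
    partitions = [[] for _ in range(n)]
    # Place the longest tests first; each goes to the lightest partition.
    # Ties on duration are broken by name to keep the result deterministic.
    for name, seconds in sorted(est_times.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        partitions[idx].append(name)
        heapq.heappush(heap, (total + seconds, idx))
    return partitions
```

Each CI job would then pick `partitions[auto_partition_id]`. The balance is only as good as the estimates, and the slowest single test sets a lower bound on the wall-clock time of any partition.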
### Does this PR introduce _any_ user-facing change?
Before:
e2e single card test takes 2h40min
After:
e2e single card test takes 1h13min
### How was this patch tested?
```shell
python .github/workflows/scripts/run_suite.py --auto-partition-size 2 --auto-partition-id 0
args=Namespace(timeout_per_file=2000, suite='e2e-singlecard', auto_partition_id=0, auto_partition_size=2, continue_on_error=False, enable_retry=False, max_attempts=2, retry_wait_seconds=60, retry_timeout_increase=600)
+----------------+--------------------+
| Suite | Partition |
|----------------+--------------------|
| e2e-singlecard | 1/2 (0-based id=0) |
+----------------+--------------------+
✅ Enabled 13 test(s) (est total 4020.0s):
- tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py (est_time=1800)
- tests/e2e/singlecard/test_aclgraph_accuracy.py (est_time=480)
- tests/e2e/singlecard/test_guided_decoding.py (est_time=354)
- tests/e2e/singlecard/test_batch_invariant.py (est_time=320)
- tests/e2e/singlecard/pooling/test_embedding.py (est_time=270)
- tests/e2e/singlecard/test_quantization.py (est_time=200)
- tests/e2e/singlecard/test_llama32_lora.py (est_time=162)
- tests/e2e/singlecard/test_cpu_offloading.py (est_time=132)
- tests/e2e/singlecard/pooling/test_classification.py (est_time=120)
- tests/e2e/singlecard/test_camem.py (est_time=77)
- tests/e2e/singlecard/compile/test_norm_quant_fusion.py (est_time=70)
- tests/e2e/singlecard/test_auto_fit_max_mode_len.py (est_time=25)
- tests/e2e/singlecard/test_profile_execute_duration.py (est_time=10)
python .github/workflows/scripts/run_suite.py --auto-partition-size 2 --auto-partition-id 1
args=Namespace(timeout_per_file=2000, suite='e2e-singlecard', auto_partition_id=1, auto_partition_size=2, continue_on_error=False, enable_retry=False, max_attempts=2, retry_wait_seconds=60, retry_timeout_increase=600)
+----------------+--------------------+
| Suite | Partition |
|----------------+--------------------|
| e2e-singlecard | 2/2 (0-based id=1) |
+----------------+--------------------+
✅ Enabled 13 test(s) (est total 4025.0s):
- tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py (est_time=1500)
- tests/e2e/singlecard/pooling/test_scoring.py (est_time=500)
- tests/e2e/singlecard/test_aclgraph_batch_invariant.py (est_time=410)
- tests/e2e/singlecard/test_vlm.py (est_time=354)
- tests/e2e/singlecard/test_models.py (est_time=300)
- tests/e2e/singlecard/test_multistream_overlap_shared_expert.py (est_time=200)
- tests/e2e/singlecard/test_sampler.py (est_time=200)
- tests/e2e/singlecard/test_async_scheduling.py (est_time=150)
- tests/e2e/singlecard/test_aclgraph_mem.py (est_time=130)
- tests/e2e/singlecard/test_ilama_lora.py (est_time=95)
- tests/e2e/singlecard/test_completion_with_prompt_embeds.py (est_time=76)
- tests/e2e/singlecard/test_qwen3_multi_loras.py (est_time=65)
- tests/e2e/singlecard/test_xlite.py (est_time=45)
```
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This patch adds the `update_max_model_len` interface.
- vLLM version: v0.14.0
- vLLM main:
d68209402d
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This patch adds new runner labels for the HK region and migrates e2e
single-card testing to these runners.
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Use nginx for package cache to speed up CI
- vLLM version: v0.14.0
- vLLM main:
d68209402d
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Install clang in the Dockerfile for triton-ascend
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
Update the triton-ascend version to 3.2.0
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
Move the qwen3 performance test from nightly to e2e to catch
performance regressions earlier.
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
1. Fix DeepSeek-V3.2-W8A8-Pruning mtp
2. Add DeepSeek-V3.2-W8A8-Pruning e2e test
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
11b6af5280
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Upgrade vllm commit to releases/v0.14.0
- Re-open cases in `tests/e2e/singlecard/pooling/test_scoring.py`, since
the errors before have been fixed by
https://github.com/vllm-project/vllm/pull/32243
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
11b6af5280
Signed-off-by: wjunLu <wjunlu217@gmail.com>
This PR adds the 310 e2e tests back to ensure that related PRs are tested on
310.
1. For light e2e, we run the 310p tests if the changed files are located
in `vllm_ascend/_310p`
2. For full e2e, we always run the 310p tests
3. For the main2main test, we no longer run the 310p tests
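The three rules above amount to a small decision function. The sketch below is illustrative only: the real logic lives in the workflow configuration, and the function name and the workflow labels used as strings here are assumptions.

```python
def should_run_310p(workflow, changed_files):
    """Decide whether the 310p e2e job runs for a given workflow kind.

    workflow:      hypothetical label, one of "light", "full", "main2main"
    changed_files: repository-relative paths touched by the PR
    """
    if workflow == "full":
        return True
    if workflow == "light":
        # Only run when the PR touches the 310p-specific package.
        return any(f.startswith("vllm_ascend/_310p/") for f in changed_files)
    # main2main skips the 310p tests; treating unknown labels the same way
    # is an assumption of this sketch.
    return False
```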
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fix bug https://github.com/vllm-project/vllm-ascend/issues/5634:
intermittent CI failure caused by a compilation error in the triton
operator.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
Fixed an accuracy problem when using eagle3 with sp.
The problem is described in
https://github.com/vllm-project/vllm-ascend/issues/5825.
It also adds a more precise way to determine whether the drafter should
use `sp`.
In addition, it makes the drafter's `eager` mode a real `eager` in the frontend
to avoid an `fx-graph` problem.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
For simplicity, we test it as in
https://github.com/vllm-project/vllm-ascend/issues/5825.
We get the same result as `eagle3` with `sp` disabled.
```text
--------------------------------------------------
total_num_output_tokens: 1000
num_drafts: 437
num_draft_tokens: 1311
num_accepted_tokens: 564
mean acceptance length: 2.29
--------------------------------------------------
acceptance at token 0: 0.62
acceptance at token 1: 0.40
acceptance at token 2: 0.27
acceptance at token 3: 0.00
acceptance at token 4: 0.00
acceptance at token 5: 0.00
```
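The summary figures in the log above are internally consistent, which can be checked with a few lines of arithmetic. The sketch assumes that mean acceptance length is computed as one token per draft step plus the average number of accepted draft tokens per step; that formula is an assumption, not something this PR states.

```python
# Values copied from the log above.
num_drafts = 437
num_draft_tokens = 1311
num_accepted_tokens = 564

# Draft tokens proposed per step (three speculative positions).
tokens_per_draft = num_draft_tokens // num_drafts

# Assumed formula: 1 token per step plus accepted draft tokens per step.
mean_acceptance_length = 1 + num_accepted_tokens / num_drafts

print(tokens_per_draft)                 # 3
print(f"{mean_acceptance_length:.2f}")  # 2.29
```

The per-position acceptance rates for tokens 0 to 2 (0.62 + 0.40 + 0.27 = 1.29) add up to the same accepted-tokens-per-step value, and positions 3 to 5 show 0.00 because only three tokens are drafted per step.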
* vLLM version: v0.13.0
* vLLM main:
2f4e6548ef
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
This PR depends on PR
https://github.com/vllm-project/vllm-ascend/pull/4046 and will only work
once that PR is merged.
This PR aims to solve the issue
https://github.com/vllm-project/vllm-ascend/issues/3240.
The newly added Llama-2-7b-hf and Qwen3-0.6B test cases cover the
scenarios where LoRA weights are added to the q_proj, v_proj, k_proj,
o_proj, gate_proj, up_proj, down_proj, embed_tokens and lm_head modules.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_llama2_lora.py
pytest -sv tests/e2e/singlecard/test_qwen3_multi_loras.py
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: paulyu12 <507435917@qq.com>
### What this PR does / why we need it?
The customized Ascend operators sgmv_expand and sgmv_shrink support only
LoRA ranks 8, 16, 32 and 64. When rank >= 128, the rank is outside the
operators' supported range, causing the model to report an error.
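One way to guard against this constraint is a simple rank check before dispatching to the custom operators. The sketch below is hypothetical and is not taken from this PR: the constant and helper name are invented for illustration, and what the caller does when the check fails (for example, using a generic implementation) is left open.

```python
# Ranks the custom sgmv_expand / sgmv_shrink operators handle, per the text above.
SUPPORTED_SGMV_RANKS = (8, 16, 32, 64)


def can_use_custom_sgmv(rank):
    """Return True if the custom Ascend sgmv operators can handle this LoRA rank.

    Hypothetical helper: callers would take another code path when this is False.
    """
    return rank in SUPPORTED_SGMV_RANKS
```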
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Depends on this commit https://github.com/vllm-project/vllm/pull/31408
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
### What this PR does / why we need it?
Correcting some outdated use cases:
`tests/e2e/singlecard/test_aclgraph_accuracy.py::test_models_output` ->
`tests/e2e/singlecard/test_aclgraph_accuracy.py::test_piecewise_res_consistency`
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: wangli <wangli858794774@gmail.com>
1. Speed up the e2e light test.
2. Create `2-cards` and `4-cards` folders in multicard
3. Move ops to nightly
4. Run tests in alphabetical order
- vLLM version: v0.13.0
- vLLM main:
8be6432bda
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Update the bisheng version to 20260105
- vLLM version: v0.13.0
- vLLM main:
8be6432bda
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
update triton-ascend version to 20260105
- vLLM version: v0.13.0
- vLLM main:
7157596103
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
This fixes a bug that occurred when running `test_camem.py` in the
triton-ascend environment `NPU function error:
aclrtGetMemInfo(ACL_HBM_MEM, &device_free, &device_total)`
- vLLM version: v0.13.0
- vLLM main:
5326c89803
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
Because the installation path changed in the new Bisheng version, the
corresponding source path in the environment variables needs to be
updated.
- vLLM version: v0.13.0
- vLLM main:
7157596103
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
Upgrade vllm commit to 1230
Affected by https://github.com/vllm-project/vllm/pull/27614 (and the
core PR https://github.com/vllm-project/vllm/pull/26866), we have to
make the following changes:
1. Modify `tests/e2e/multicard/test_aclgraph_capture_replay.py` to stay
compatible with both vLLM `v0.13.0` and the latest main commit,
now that vLLM enables async scheduling by default
2. Skip `test_guided_decoding.py` due to xgrammar errors
(https://github.com/vllm-project/vllm-ascend/issues/5524)
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1
---------
Signed-off-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it?
1. Refactor the current test with mtp and eagle cases
2. Add new necessary cases with mtp and eagle
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
Update the triton-ascend version to 1229 and the bisheng version to 1225.
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
#5051 only implements a basic framework for model runner v2, and there
are still some bugs in e2e functionality. This PR aims to enable basic
functionality.
model runner v2 plans:
https://github.com/vllm-project/vllm-ascend/issues/5208
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
The interface of `OffloadingSpec` changed last month
(https://github.com/vllm-project/vllm/pull/27743). This PR fixes the
resulting breakage and adds an e2e test for CPU offloading.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
CI passed with the newly added test.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
Currently, MHA models (e.g., minicpm-2b, Baichuan-7b) encounter
errors when running in piecewise graph mode, with error messages similar
to:
```
(E89999): When layout is TND and PA not enabled, keyT(8) and valueT(8) must be equal to the last element of actualSeqenceLengthKV(5)[FUNC:CheckInputShapeWhenLayoutIsTND][FILE:prompt_flash_attention_tiling.cpp][LINE:3618]
```
The error occurs because the qkv in the Prefill stage is also padded,
causing the shape to be inconsistent with actual_seq_lengths.
This PR adds unpadding logic for kv to fix it.
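The idea behind the unpadding can be sketched as follows. This is a simplified illustration, not the code in this PR: it assumes the padding sits at the end of the token dimension and that the sequence-length list is cumulative, which is what the error message suggests by comparing the key/value token count (8) with its last element (5). The function name is hypothetical.

```python
def unpad_kv(key, value, actual_seq_lengths_kv):
    """Drop padded token rows so the token count matches the real length.

    key, value:            sequences indexed by token along dim 0
    actual_seq_lengths_kv: assumed cumulative lengths; last entry = real token count
    """
    num_tokens = actual_seq_lengths_kv[-1]
    # Slicing along the first dimension removes the trailing padding rows.
    return key[:num_tokens], value[:num_tokens]
```

The same slicing expression works on tensors that support first-dimension slicing, so the sketch carries over from plain lists to framework tensors without changes.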
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
When matmul_and_reduce is enabled, the prefix attribute is required.
However, in some models, the prefix is not passed correctly, causing
errors when starting the service.
The issue of incorrect prefix passing will be fixed in vLLM in the
future.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>