11 Commits

Author SHA1 Message Date
realliujiaxu
5def28dcd3 [Feat]support sequence parallelism by pass for VL models (#5632) 2026-02-27 08:27:41 +08:00
Li-Yongwen
2870f7c8ad [Feat] Support routing replay (#6696)
### What this PR does / why we need it?

[Feat] Support routing replay.
Same as https://github.com/vllm-project/vllm-ascend/pull/6666;
resubmitted because of a DOC failure.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: liyongwen <1310439159@qq.com>
Signed-off-by: Li-Yongwen <63399187+Li-Yongwen@users.noreply.github.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-26 10:22:47 +08:00
bowenli
e3927cc8f5 [Bugfix] fix bug for mtp (#6514)
### What this PR does / why we need it?
fix(mtp): resolve MTP core bugs and enhance eager mode test cases
1. Resolved critical issues in eager mode MTP core execution logic;
2. Fixed functional bugs in the _update_states_after_model_execute
function;
3. Updated and released test_mtp_qwen3_next.py to validate eager mode
acceptance rate.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: Bowen-Leee <caoshankuangren@gmail.com>
2026-02-25 17:50:57 +08:00
jiahao.quan
7221045777 [Attention] add gpt-oss support (#5901)
### What this PR does / why we need it?
Please refer to the following link for the historical conversation
https://github.com/vllm-project/vllm-ascend/pull/4467. We have made
updates in light of the comments from the prior PR review. Given the
refactoring of the attention_v1 component, we have carried out necessary
adjustments to fit the newly revised code.

### Does this PR introduce _any_ user-facing change?

1. Modified the code in the Attention section to adapt to the SWA and
Sink features required by gpt-oss.
2. Modified the code in the MoE section to add support for bias and
swigluoai.

### How was this patch tested?
Please refer to https://github.com/vllm-project/vllm-ascend/pull/4467
for the performance tests. In addition to those, accuracy tests on
AIME2024 have been newly added.

![img_v3_02tu_501e88e3-2217-4565-8edf-b9acf4f43f2g](https://github.com/user-attachments/assets/024f8283-18ab-4d4d-ab12-27917b5d7d06)


- vLLM version: v0.13.0
- vLLM main:
bde38c11df

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: mikequan0425 <mikequan0425@foxmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
Signed-off-by: pu-zhe <zpuaa@outlook.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: luomin2005 <luomin2005@huawei.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: leon_tao <taoyao2@huawei.com>
Co-authored-by: nurxat <738457498@qq.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: mikequan <199741451@qq.com>
Co-authored-by: LI SHENGYONG <49200266+shenchuxiaofugui@users.noreply.github.com>
Co-authored-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
Co-authored-by: pu-zhe <zpuaa@outlook.com>
Co-authored-by: luomin2005 <luomin2005@huawei.com>
Co-authored-by: liziyu <56102866+liziyu179@users.noreply.github.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com>
Co-authored-by: Cao Yi <slightwindsec@gmail.com>
Co-authored-by: Icey <1790571317@qq.com>
Co-authored-by: SILONG ZENG <2609716663@qq.com>
2026-02-12 10:55:34 +08:00
wangyu
c63b7a1188 [Test] Add initial multi modal cases of Qwen2.5-VL-7B-Instruct for disaggregated encoder (#5301)
### What this PR does / why we need it?
This PR adds disaggregated encoder tests for Qwen2.5-VL-7B-Instruct.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test locally and by running CI.

- vLLM version: release/v0.12.0

---------

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
2026-02-06 17:30:17 +08:00
Feng Liu
03a18ad6fd [E2E] add E2E for Prefix Caching cp & Chunked Prefill cp (#5149)
### What this PR does / why we need it?
Add E2E for Prefix Caching cp & Chunked Prefill cp 
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: F.Liu <liufeng248@huawei.com>
Signed-off-by: Feng Liu <46866849+ader47@users.noreply.github.com>
Co-authored-by: F.Liu <liufeng248@huawei.com>
2026-02-03 15:04:14 +08:00
SILONG ZENG
f4a72f0d16 [CI]Disable early exit to complete all tests (#6482)
### What this PR does / why we need it?
1. Disable early exit on the first error so that all tests run to
completion.
2. Within each partition, tests are re-sorted by `estimated_time` in
ascending order. This allows the CI to cover as many test cases as
possible in the early stages.
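
The two changes above can be illustrated with a minimal sketch. The field name `estimated_time` comes from the description; the `run_partition` helper, the dictionary layout, and the test names are hypothetical and are not taken from the actual CI script.

```python
def run_partition(tests, run_one):
    """Run every test in ascending `estimated_time` order; never stop early."""
    failures = []
    for test in sorted(tests, key=lambda t: t["estimated_time"]):
        if not run_one(test["name"]):
            failures.append(test["name"])  # record the failure and keep going
    return failures


# Hypothetical partition; names and durations are illustrative only.
partition = [
    {"name": "test_slow.py", "estimated_time": 1800},
    {"name": "test_fast.py", "estimated_time": 10},
    {"name": "test_medium.py", "estimated_time": 200},
]
executed = []


def fake_runner(name):
    """Stand-in for a real test invocation; pretends the shortest test fails."""
    executed.append(name)
    return name != "test_fast.py"


failed = run_partition(partition, fake_runner)
print(executed)  # shortest test first, and the run continues after the failure
print(failed)
```

Running short tests first means a failure in a long test does not hide the results of the quick ones, and collecting failures instead of exiting gives a full report per partition.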

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-02-03 11:25:51 +08:00
wangxiyuan
b4aafd4293 [Core][Misc] Clean up ProfileExecuteDuration (#6461)
### What this PR does / why we need it?
This PR removes the custom `ProfileExecuteDuration` utility and its
usages across the codebase. This utility was used for profiling
execution duration of different stages in the inference process. It is
replaced by the standard `vllm.v1.utils.record_function_or_nullcontext`,
which integrates with PyTorch's profiler.
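
As a rough illustration of the "record or no-op" pattern that such a utility follows, here is a standard-library sketch. It does not call vLLM or PyTorch; `scope_or_nullcontext`, `_timed_scope`, and the `enabled` flag are hypothetical stand-ins, and the real upstream function may differ in signature and behavior.

```python
import contextlib
import time


@contextlib.contextmanager
def _timed_scope(name, sink):
    """Hypothetical recorder: appends (name, elapsed_seconds) to `sink`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink.append((name, time.perf_counter() - start))


def scope_or_nullcontext(name, sink, enabled):
    """Return a recording context when enabled, otherwise a no-op context."""
    return _timed_scope(name, sink) if enabled else contextlib.nullcontext()


records = []
with scope_or_nullcontext("prepare_inputs", records, enabled=True):
    sum(range(1000))  # work that gets recorded
with scope_or_nullcontext("forward", records, enabled=False):
    sum(range(1000))  # work that runs with no profiling overhead

print([name for name, _ in records])
```

Call sites keep a single `with` statement either way, so profiling can be switched on or off without touching the instrumented code.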

This change simplifies the code by removing a custom implementation in
favor of an upstream utility, improving maintainability. Associated
documentation and tests for `ProfileExecuteDuration` are also removed.

### Does this PR introduce _any_ user-facing change?
The `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` environment variable is removed.

### How was this patch tested?
CI passed. The changes are a cleanup and replacement with a standard
utility. Existing tests cover the functionality. The removed feature had
its own tests which are also removed.

Related RFC: #5304

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-01 20:06:01 +08:00
CodeCat
b2857de43f [ST]Add e2e test for Npugraphex_pass (#6388)
### What this PR does / why we need it?
The custom passes of NPUGraphEX already implement operator fusion, but
these features still lack E2E test cases to validate and guard them.
This PR adds E2E test cases for the AddRMSNormQuant and
SplitQKVNormRope operator fusions under NPUGraphEX that are already in
the codebase.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: cjian <2318164299@qq.com>
2026-01-30 09:14:07 +08:00
wjunLu
4970de4242 [CI] Enable the skipped cases when HDK is upgraded to 25.5.0 (#6195)
### What this PR does / why we need it?
Enable the tests that were skipped due to an outdated driver version:
- tests/e2e/multicard/4-cards/long_sequence/test_accuracy.py
- tests/e2e/multicard/4-cards/long_sequence/test_basic.py
- tests/e2e/multicard/4-cards/long_sequence/test_chunked_prefill.py

and some cases in
- tests/e2e/multicard/2-cards/spec_decode/test_spec_decode.py
- tests/e2e/multicard/2-cards/test_external_launcher.py
- tests/e2e/multicard/2-cards/test_offline_weight_load.py
- tests/e2e/multicard/2-cards/test_quantization.py
- tests/e2e/multicard/4-cards/test_data_parallel_tp2.py

TODO:
- tests/e2e/multicard/4-cards/spec_decode/test_mtp_qwen3_next.py
- tests/e2e/multicard/4-cards/long_sequence/test_mtp.py
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-01-29 22:41:41 +08:00
Li Wang
e35f304419 [CI] Auto partition for test cases (#6379)
### What this PR does / why we need it?
This patch adds an auto-partition feature for tests. For example, before
this PR, the e2e single-card test ran for 2h40min. With auto partition,
test cases are automatically allocated into the required n parts based
on their test duration (greedy strategy) and run in parallel. The
advantage is that the overall test duration becomes roughly 1/n of the
original.
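
One common greedy strategy for this kind of balancing is longest-first assignment to the currently lightest partition. The sketch below shows that approach under stated assumptions: `auto_partition` is a hypothetical helper, not the code in `run_suite.py`, and the six durations are a subset of the estimates printed in the log further down.

```python
import heapq


def auto_partition(tests, size):
    """Greedily split (name, est_time) pairs into `size` balanced partitions."""
    # Min-heap of (accumulated_time, partition_index): lightest partition on top.
    heap = [(0, i) for i in range(size)]
    heapq.heapify(heap)
    partitions = [[] for _ in range(size)]
    # Place the longest tests first, each into the lightest partition so far.
    for name, est in sorted(tests, key=lambda t: t[1], reverse=True):
        total, idx = heapq.heappop(heap)
        partitions[idx].append((name, est))
        heapq.heappush(heap, (total + est, idx))
    return partitions


tests = [
    ("test_v1_spec_decode.py", 1800),
    ("test_mtp_eagle_correctness.py", 1500),
    ("test_scoring.py", 500),
    ("test_aclgraph_accuracy.py", 480),
    ("test_aclgraph_batch_invariant.py", 410),
    ("test_guided_decoding.py", 354),
]
parts = auto_partition(tests, 2)
totals = [sum(est for _, est in part) for part in parts]
print(totals)  # [2634, 2410]
```

The partitions end up close in total time but not identical, which is why the wall-clock saving is near 1/n rather than exactly 1/n.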

### Does this PR introduce _any_ user-facing change?
Before:
the e2e single-card test takes 2h40min
After:
the e2e single-card test takes 1h13min

### How was this patch tested?

```shell
python .github/workflows/scripts/run_suite.py --auto-partition-size 2 --auto-partition-id 0 
args=Namespace(timeout_per_file=2000, suite='e2e-singlecard', auto_partition_id=0, auto_partition_size=2, continue_on_error=False, enable_retry=False, max_attempts=2, retry_wait_seconds=60, retry_timeout_increase=600)
+----------------+--------------------+
| Suite          | Partition          |
|----------------+--------------------|
| e2e-singlecard | 1/2 (0-based id=0) |
+----------------+--------------------+
 Enabled 13 test(s) (est total 4020.0s):
  - tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py (est_time=1800)
  - tests/e2e/singlecard/test_aclgraph_accuracy.py (est_time=480)
  - tests/e2e/singlecard/test_guided_decoding.py (est_time=354)
  - tests/e2e/singlecard/test_batch_invariant.py (est_time=320)
  - tests/e2e/singlecard/pooling/test_embedding.py (est_time=270)
  - tests/e2e/singlecard/test_quantization.py (est_time=200)
  - tests/e2e/singlecard/test_llama32_lora.py (est_time=162)
  - tests/e2e/singlecard/test_cpu_offloading.py (est_time=132)
  - tests/e2e/singlecard/pooling/test_classification.py (est_time=120)
  - tests/e2e/singlecard/test_camem.py (est_time=77)
  - tests/e2e/singlecard/compile/test_norm_quant_fusion.py (est_time=70)
  - tests/e2e/singlecard/test_auto_fit_max_mode_len.py (est_time=25)
  - tests/e2e/singlecard/test_profile_execute_duration.py (est_time=10)

python .github/workflows/scripts/run_suite.py --auto-partition-size 2 --auto-partition-id 1
args=Namespace(timeout_per_file=2000, suite='e2e-singlecard', auto_partition_id=1, auto_partition_size=2, continue_on_error=False, enable_retry=False, max_attempts=2, retry_wait_seconds=60, retry_timeout_increase=600)
+----------------+--------------------+
| Suite          | Partition          |
|----------------+--------------------|
| e2e-singlecard | 2/2 (0-based id=1) |
+----------------+--------------------+
 Enabled 13 test(s) (est total 4025.0s):
  - tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py (est_time=1500)
  - tests/e2e/singlecard/pooling/test_scoring.py (est_time=500)
  - tests/e2e/singlecard/test_aclgraph_batch_invariant.py (est_time=410)
  - tests/e2e/singlecard/test_vlm.py (est_time=354)
  - tests/e2e/singlecard/test_models.py (est_time=300)
  - tests/e2e/singlecard/test_multistream_overlap_shared_expert.py (est_time=200)
  - tests/e2e/singlecard/test_sampler.py (est_time=200)
  - tests/e2e/singlecard/test_async_scheduling.py (est_time=150)
  - tests/e2e/singlecard/test_aclgraph_mem.py (est_time=130)
  - tests/e2e/singlecard/test_ilama_lora.py (est_time=95)
  - tests/e2e/singlecard/test_completion_with_prompt_embeds.py (est_time=76)
  - tests/e2e/singlecard/test_qwen3_multi_loras.py (est_time=65)
  - tests/e2e/singlecard/test_xlite.py (est_time=45)
```
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-29 20:28:10 +08:00