Commit Graph

119 Commits

Author SHA1 Message Date
zhangxinyuehfad
06ccce1ddf [FOLLOWUP] fix name and format in accuracy test (#1288) (#1435)
### What this PR does / why we need it?
fix accuracy test:
1. fix accuracy report
like:https://vllm-ascend--1429.org.readthedocs.build/en/1429/developer_guide/evaluation/accuracy_report/Qwen2.5-7B-Instruct-V0.html
2. fix create pr for report

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-06-26 00:26:54 +08:00
sharonyunyun
941269a6c5 adjusting the communication method in graph mode (#1194)
### What this PR does / why we need it?
Communication performance optimization: replace allreduce with
reduce_scatter+all_gather in MLA layer's TP group,to remove
stridedsliced and all_gather in MOE layer.
when tp > 1, It is enabled during the decode phase of the graph mode
when enable_multistream_moe、MLA, use_v1, and MC2 are used.
According to the end-to-end RL inference test results, this PR can bring
3% gain in the decode stage.

**Before Improvement**
Profiling kernel_details

![image](https://github.com/user-attachments/assets/1bb5dfa1-809b-410a-90c9-c5fd23cff003)
Evaluation

![image](https://github.com/user-attachments/assets/0b8ea0c7-88e7-410f-9ef4-f0cfe910cdc7)

![image](https://github.com/user-attachments/assets/94fde910-c125-4c2e-8de4-88fc3fafc057)

**After Improvement**
Profiling kernel_details

![image](https://github.com/user-attachments/assets/55fac0e0-11f2-4654-8fd4-287949e0b29e)
Evaluation

![image](https://github.com/user-attachments/assets/e923f74b-29c4-4171-9382-40a00cf05df0)

![image](https://github.com/user-attachments/assets/5dba7967-07ea-4926-a8be-804bfd34e3e4)

### Does this PR introduce _any_ user-facing change?
Users need to configure enable_multistream_moe=True

### How was this patch tested?
Add e2e test cases to cover code logic

Signed-off-by: sharonyunyun <zhangying134@huawei.com>
2025-06-25 19:56:49 +08:00
zhangxinyuehfad
0060886a37 [CI]Update accuracy report test (#1288)
### What this PR does / why we need it?
Update accuracy report test
1. Add Record commit hashes and GitHub links for both vllm and
vllm-ascend in accuracy reports
2. Add accuracy result verification checks to ensure output correctness
3. Creat PR via forked repository workflow

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
dense-accuracy-test:
https://github.com/vllm-project/vllm-ascend/actions/runs/15745619485
create pr via forked repository workflow:
https://github.com/zhangxinyuehfad/vllm-ascend/actions/runs/15747013719/job/44385134080
accuracy report pr:
https://github.com/vllm-project/vllm-ascend/pull/1292

Currently, the accuracy report used is old and needs to be merged into
pr, retest, update new report, then close #1292 .


Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-06-25 14:10:34 +08:00
Li Wang
15df8be937 [Doc] Add sleep mode doc (#1295)
### What this PR does / why we need it?
Add sleep related doc and example

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-25 14:07:14 +08:00
Mengqing Cao
52317f92cb [DP] Tiny fix of dp and update example (#1273)
### What this PR does / why we need it?
Add `max_num_tokens_across_dp` to AscendMetadata to fix dp

This pr fixes the bug introduced by
https://github.com/vllm-project/vllm-ascend/pull/1229, which add an arg
`max_num_tokens_across_dp` when dp_size > 1.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-25 11:03:04 +08:00
Mengqing Cao
20767a043c [CI/UT] Fix disaggregated prefill ci (#1313)
### What this PR does / why we need it?
Use eager mode to run disaggregated prefill ci

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new existing test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-24 17:11:00 +08:00
Yikun Jiang
917c6b71af [TEST][DOC] Fix doctest and add system package installation (#1375)
### What this PR does / why we need it?
- Fix
[doctest](https://github.com/vllm-project/vllm-ascend/actions/workflows/vllm_ascend_doctest.yaml?query=event%3Aschedule)
- add system package installation
- Add doc for run doctests
- Cleanup all extra steps in .github/workflows/vllm_ascend_doctest.yaml
- Change schedule job from 4 ---> 12 hours

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- doctest CI passed
- Local test with
`/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh`.

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-23 20:50:33 +08:00
zxdukki
f04c6763d8 [Bugfix] fix env variable in dbo (#1284)
### What this PR does / why we need it?
Fix env variable in dbo to enable dbo in DeepSeek-V3 model. Besides, we
have fixed an known issue in deepseek-dbo.


### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
This patch can be tested with newly added e2e tests:
[tests/multicard/test_offline_inference_distributed.py](https://github.com/vllm-project/vllm-ascend/pull/1285/files#diff-7cd2e6b1bda6b8ad1bedb3276971fe7064aeae4dc0efd41c301c4ede2158c57e).
It can be verified with pytest.

---------

Signed-off-by: zhuohuan <zxdu1997@gmail.com>
2025-06-23 09:07:57 +08:00
Shanshan Shen
21fb68a03a [CI] Update guided decoding ut (#1312)
### What this PR does / why we need it?
Update guided decoding ut.

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-23 09:06:20 +08:00
wemaster
339d6894f6 [CI/UT][bugfix] fix v0 spec decode (#1321)
### What this PR does / why we need it?
1. [PR913](https://github.com/vllm-project/vllm-ascend/pull/913)
introduced an error that caused V0's spec decode function to fail.
[PR1109](https://github.com/vllm-project/vllm-ascend/pull/1109) wanted
to fix this problem. Unfortunately, the fix broke the ngram function. I
fixed the ngram function in this PR. **PS**: Q: Why is there a problem
when ngram is not found when pr1109 is merged? A: The newly introduced
problem will only appear when tp>1, and the use cases on CI are all tp=1
2. In versions after 0.7.3, vllm-ascend deleted some spec decode UTs to
avoid CI taking too long, including eagle speculative UTs, which made CI
unable to take care of the eagle function. I added
it(`test_eagle_correctness.py`) back in this PR
3. Because of the reason mentioned in 2, the current version of Eagle
has a problem. I located and fixed this problem. It was because vllm's
`draft_model_runner.py` was changed and vllm-ascend was not synchronized
in time.
4. Currently, the UTs of v0 and v1 are mixed in the spec_decode
directory. I split them into two directories: spec_decode_v0 and
spec_decode_v1.
5. i found
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_include_gpu_probs_tensor`
and
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_should_modify_greedy_probs_inplace`
have changed in vllm, so i remove it in this pr.

### Does this PR introduce _any_ user-facing change?
This PR fixes the functions of ngram and eagle spec decode in the v0
engine

### How was this patch tested?
tested by CI

Signed-off-by: mengwei805 <mengwei25@huawei.com>
2025-06-23 09:05:13 +08:00
Yikun Jiang
a95afc011e [CI] Enable merge trigger unit test and accuracy test schedule job (#1345)
### What this PR does / why we need it?
- Enable merge trigger unit test and accuracy test schedule job
- Pin lm-eval==0.4.8 to resovle Qwen3 8B accuracy
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-22 17:21:57 +08:00
Yikun Jiang
c30ddb8331 Bump v0.9.1rc1 release (#1349)
### What this PR does / why we need it?
Bump v0.9.1rc1 release

Closes: https://github.com/vllm-project/vllm-ascend/pull/1341
Closes: https://github.com/vllm-project/vllm-ascend/pull/1334

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed


---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: shen-shanshan <467638484@qq.com>
2025-06-22 13:15:36 +08:00
Yikun Jiang
097e7149f7 [Platform] Add initial experimental support for Altlas 300I series (#1333)
### What this PR does / why we need it?
Add initial experimental support for Ascend 310P, this patch squash
below PR into one to help validation:

- https://github.com/vllm-project/vllm-ascend/pull/914
- https://github.com/vllm-project/vllm-ascend/pull/1318
- https://github.com/vllm-project/vllm-ascend/pull/1327


### Does this PR introduce _any_ user-facing change?
User can run vLLM on Altlas 300I DUO series

### How was this patch tested?
CI passed with:
- E2E image build for 310P
- CI test on A2 with e2e test and longterm test
- Unit test missing because need a real 310P image to have the test,
will add in a separate PR later.
- Manually e2e test:
- Qwen2.5-7b-instruct, Qwen2.5-0.5b, Qwen3-0.6B, Qwen3-4B, Qwen3-8B:
https://github.com/vllm-project/vllm-ascend/pull/914#issuecomment-2942989322
  - Pangu MGoE 72B


The patch has been tested locally on Ascend 310P hardware to ensure that
the changes do not break existing functionality and that the new
features work as intended.

#### ENV information

CANN, NNAL version: 8.1.RC1
> [!IMPORTANT]  
> PTA 2.5.1 version >= torch_npu-2.5.1.post1.dev20250528 to support NZ
format and calling NNAL operators on 310P

#### Code example

##### Build vllm-ascend from source code

```shell
# download source code as vllm-ascend
cd vllm-ascend
export SOC_VERSION=Ascend310P3
pip install -v -e .
cd ..
```

##### Run offline inference

```python
from vllm import LLM, SamplingParams
prompts = ["水的沸点是100摄氏度吗?请回答是或者否。", "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。",
           "水的沸点是100摄氏度吗?请回答是或者否。", "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。"]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, max_tokens=10)
# Create an LLM.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
    max_num_seqs=4,
    dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 310P
    disable_custom_all_reduce=True,
    trust_remote_code=True,
    tensor_parallel_size=2,
    compilation_config={"custom_ops":['none', "+rms_norm", "+rotary_embedding"]},
)

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

```

---------

Signed-off-by: Vincent Yuan <farawayboat@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: Vincent Yuan <farawayboat@gmail.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: shen-shanshan <467638484@qq.com>
2025-06-21 09:00:16 +08:00
Yikun Jiang
2009fdb8da [Test] Enable code cov for V1 and enable push trigger (#1164)
### What this PR does / why we need it?
- Enable code cov for V1
- Enable push triggered job

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-21 00:01:05 +08:00
yuancaoyaoHW
00ae250f3c [V1][eagle3] Support eagle3 proposer for v1 (#1032)
### What this PR does / why we need it?
This PR implements the Eagle Pososer feature for vLLM v1, which enables
more efficient speculative decoding by using a draft model to predict
potential future tokens.
- The implementation includes the core Eagle algorithm integration with
vLLM's existing architecture, allowing for faster inference while
maintaining output quality.
- This is needed to significantly improve the generation speed of large
language models without compromising on the quality of generated text.

### Does this PR introduce any user-facing change?
Yes, this PR introduces a new speculative decoding mode that can be
enabled via configuration.
- Users can now choose to use Eagle Pososer by setting appropriate flags
in the inference configuration.
- The API remains backward compatible, with the new functionality being
opt-in.

### How was this patch tested?
CI passed with new unit tests added for the Eagle Pososer functionality.
- Benchmark tests were conducted comparing generation speed and quality
with and without Eagle Pososer.
- Integration tests were performed with various model architectures to
ensure compatibility.
- Manual testing was done using different prompt scenarios to verify
output quality remains consistent.
- we test accept rate on one Ascend 910B npu, The acceptance rate
results are basically consistent with those shown here:
https://github.com/vllm-project/vllm/pull/16937
- Currently, we support scenarios where num_spec_tokens <= 2. When
num_spec_tokens > 2, issues such as insufficient GPU memory and operator
computation errors may occur. We will address this in subsequent
updates.
- We will add support for Eagle v1 in future updates.

### Acceptance Test Script
```bash
SCRIPT="/offline/eagle.py"
DATASET="ShareGpt"
MODEL=Meta-Llama-3.1-8B-Instruct
DRAFT=EAGLE3-LLaMA3.1-Instruct-8B

CUDA_VISIBLE_DEVICES="0" VLLM_USE_V1=1 $PYTHON $SCRIPT \
    --dataset $DATASET \
    --num_spec_tokens 2 \
    --max_num_seqs 1 \
    --model_dir $MODEL \
    --eagle_dir $DRAFT \
    --tp 1 \
    --num_prompts 80
```
### Acceptance Test Results
```bash
██████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [21:22<00:00, 16.03s/it, est. speed input: 4.72 toks/s, output: 13.56 toks/s]
-------------------------------------------------------------------------------------
mean acceptance length: 1.63
-------------------------------------------------------------------------------------
total_counts: 8062
acceptance at token 0: 1.00 (8062 times)
acceptance at token 1: 0.70 (5612 times)
acceptance at token 2: 0.47 (3765 times)
```

Closes: https://github.com/vllm-project/vllm-ascend/issues/1004

---------

Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>
2025-06-20 17:19:54 +08:00
Li Wang
f8029945c3 [Bugfix] Remove cuda related lines and add additional pip mirror (#1252)
### What this PR does / why we need it?
- For npu environment, we should use `PYTORCH_NPU_ALLOC_CONF ` rather
than `PYTORCH_CUDA_ALLOC_CONF`
- Add `PIP_EXTRA_INDEX_URL` to make nightly_benchmarks happy


---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-17 21:25:40 +08:00
Yikun Jiang
9d3cbc0953 [Doctest] add installation doctest (#1179)
### What this PR does / why we need it?
Install doctest

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Related: https://github.com/vllm-project/vllm-ascend/pull/983

Co-authored-by: wangli <wangli858794774@gmail.com>

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
2025-06-17 08:52:26 +08:00
Mengqing Cao
96fa7ff63b [DP][V1] Fix rank set in DP scenario & Bump torch-npu version to 2.5.1.post1.dev20250528 (#1235)
### What this PR does / why we need it?
1. Fix rank set in DP scenario. The new poc version of torch-npu support
setting `ASCEND_RT_VISIBLE_DEVICES` dynamically, thus we could use the
rank set in `DPEngineCoreProc` directly instead of calculating local
rank across dp by hand in the patched `_init_data_parallel`

Closes: https://github.com/vllm-project/vllm-ascend/issues/1170

2. Bump torch-npu version to 2.5.1.post1.dev20250528

Closes: https://github.com/vllm-project/vllm-ascend/pull/1242
Closes: https://github.com/vllm-project/vllm-ascend/issues/1232


### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: Icey <1790571317@qq.com>
2025-06-16 23:09:53 +08:00
wangxiyuan
69b817ed65 [CI] Add unit test framework (#1201)
This PR added the unit test framework to enable ut for vLLM Ascend. Unit
test runs on CPU machines. It'll be ran once lint check is passed the
same as e2e test.

For unit test, this PR created a new folder called `ut` under `tests`
module. All the test file in `ut` should keep the same with the code in
`vllm-ascend`. The file name should be start with `test_` prefix. For
example, in this PR. the `test_ascend_config.py` is added for
`ascend_config.py` test.

A new fille `worker/test_worker_v1.py` is also added as the placeholder.
This file should be the unit test for `vllm-ascend/worker/worker_v1.py`.

Additional, a new `fake_weight` folder is added, it contains the
config.json from `facebook/opt-125m`, so that the test will not always
visit huggingface.

TODO:
We should add all the unit test file one by one in the future.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-16 18:32:28 +08:00
Yikun Jiang
966557a2a3 [Build] Speedup image build (#1216)
### What this PR does / why we need it?
1. Rename workflow name to show OS info
2. Speedup image build:
- PR: only arm64 build on openEuler arm64, only amd64 build on Ubuntu
amd64
- Push/Tag: still keep origin logic use qemu on amd64

This PR actually drop the e2e image build per PR but I think it's fine
consider it's stable enough, if we still meet some problem we can revert
this PR

43-44mins ---> about 8-10 mins

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-16 09:02:53 +08:00
Yikun Jiang
4ce860a2be [CI] Make e2e test to be preemptible and simple (#1217)
### What this PR does / why we need it?
This PR make e2e test to be simple, even bring some repeat code between
single card and multicard, but we will not struggle with across
max-parallel, matrix and concurrency:
1. This PR make e2e test to be preemptible and simple:
- lint ---> e2e (2 parallel) ---> e2e multi-card (1 parallel)
- Anytime you push another PR will cancel previous job, whatever the job
is lint / e2e / multi-cards
2. Use Modelscope rather than hf-mirror
3. Resolve some error like `Canceling since a higher priority waiting
request for pr-XXXX-limit-npu-4 exists`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed
- lint ---> e2e (2 parallel) ---> e2e multi-card (1 parallel)
- e2e test will canceled by update patch

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-15 22:07:43 +08:00
whx
47b507b180 [CI] Recover ut for ascend scheduler only in ci of v1. (#1180)
Last PR [#943 ](https://github.com/vllm-project/vllm-ascend/pull/943)
wrongly open ut of AscendScheduler in V0 ci, this PR fixes this problem
and only run ut of it in V1 ci.

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-06-13 07:51:23 +08:00
Li Wang
37f4469a03 [CI][Benchmark] Add qwen2.5-7b test (#1104)
### What this PR does / why we need it?
- Add qwen2.5-7b performance benchmark, this is a sub pr of #1099, for
v1 test, need more verify
- Fix get commit time after checkout

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-12 10:47:30 +08:00
Li Wang
dd207cb261 [CI][Benchmark] Add new model and v1 test to perf benchmarks (#1099)
### What this PR does / why we need it?
- Add qwen2.5-7b-instruct test
- Add v1 test
---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-12 10:46:41 +08:00
whx
3393d53b36 [Scheduler][MTP] Add support for speculative decoding in AsecendScheduler. (#943)
This PR adds support for speculative decoding in AsecendScheduler.
Also inculde part of support for disaggregated prefill, full support
will be merged in follow-up PR.

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-06-11 20:55:44 +08:00
wangxiyuan
4f5964420e [CI] Upgrade vllm to 0.9.1 (#1165)
1. upgrade vllm to 0.9.1. 0.9.0 is not supported for main branch now.
keep doc to 0.9.0 until we release the first 0.9.1 release.
2. disable V0 test for PR
3. move actionlint check to lint job

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-11 16:33:11 +08:00
sdmyzlp
7bdc606677 Support multistream of shared experts in FusedMoE (#997)
Contains on #1111 for completeness.

<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?
Implement multi-stream parallelism for MoE layers with shared experts,
where computation of shared experts will be overlapped with expert token
dispatch and combine. Also, when multi-stream is enabled, weights of
shared experts will be force to replicate across all cards, regardless
of any tensor parallelism configurations, to avoid AllReduce operations.

With the expected overlaping being:
```
| shared gate_up | shared act |              | shared down |
|    dispatch    | routed gate_up, act, down |   combine   |
```

<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.

- Please clarify why the changes are needed. For instance, the use case
and bug description.

- Fixes #
-->

### Does this PR introduce _any_ user-facing change?
No.

<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->

### How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-11 09:18:38 +08:00
Mengqing Cao
04abfd8721 [CI] Skip test_v1_spec_decode.py::test_ngram_correctness to make longterm CI pass (#1163)
[CI] Skip test_v1_spec_decode.py::test_ngram_correctness to make
longterm CI pass

Related: https://github.com/vllm-project/vllm-ascend/issues/1162

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-11 07:31:13 +08:00
wangxiyuan
95414bae70 [CI] Run e2e after pre check pass (#1132)
Make sure the lint test passed before start the e2e test to save compute
resource.

Updated the patch doc to make sure the CI works as expect.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-10 17:18:09 +08:00
zhangxinyuehfad
e68e81f2ce [CI] Make accuarcy CI and report work (#1078)
### What this PR does / why we need it?
Make accuarcy CI and report work

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manaully review

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-06-10 14:35:44 +08:00
Yikun Jiang
8d00775fce [SpecDecode][CI] Set default values to fix spec decode and fix multicard CI (#1109)
### What this PR does / why we need it?
- Set default values to fix spec decode
- To avoid oom, we need to run the test in a single process

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- CI passed, espcecially multicards CI
- For spec decode test, long term CI passed

Closes: https://github.com/vllm-project/vllm-ascend/pull/1105

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
2025-06-07 11:23:30 +08:00
Li Wang
a2552e10e4 [Worker][V1] Support sleep mode for v1 (#1084)
### What this PR does / why we need it?
 Support sleep mode for v1

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-06 21:54:02 +08:00
Li Wang
11a7df4270 [ModelRunner] Support embedding inputs (#916)
### What this PR does / why we need it?
- Adds support for passing prompt_embeds to LLM.generate as
```bash
llm.generate({"prompt_embeds": input_embeds}, sampling_params)
```
or
```bash
llm.generate(
    [{"prompt_embeds": input_embeds} for input_embeds in inputs_embeds], sampling_params
)
```
- Add `prompt_embeds` to examples

### How was this patch tested?
CI passed with new added/existing test.
and I have test with the example script in this pr, and the output seems
looks good:
```bash

[Single Inference Output]
------------------------------
The capital of France is Paris. Paris is the largest city in France and is
------------------------------
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3966.87it/s]
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  3.99it/s, est. speed input: 177.08 toks/s, output: 63.91 toks/s]

[Batch Inference Outputs]
------------------------------
Q1: Please tell me about the capital of France.
A1: The capital of France is Paris. It is located in the northern part of the

Q2: When is the day longest during the year?
A2: The day is longest during the year at the summer solstice. This typically occurs

Q3: Where is bigger, the moon or the sun?
A3: The sun is significantly bigger than the moon. 

The sun has a diameter of

------------------------------
```

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-06 20:21:13 +08:00
wangxiyuan
e1ab6d318e [Misc] Refactor additional_config (#1029)
More and more config options are added to additional_config. This PR
provide a new AscendConfig to manage these config options by an easier
way to make code cleaner and readable.

 This PR also added the `additional_config` doc for users.

Added the test_ascend_config.py to make sure the new AscendConfig works
as expect.

TODO: Add e2e test with torchair and deepseek once the CI resource is
available.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-05 16:28:01 +08:00
Li Wang
b4cb0eecb6 [CI] Hotfix on benchmark results path (#1076)
### What this PR does / why we need it?
Fix benchmark results path

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-05 12:53:46 +08:00
Li Wang
31dd471574 [CI] Add workflow_dispatch and use main benchmarks directly (#1071)
### What this PR does / why we need it?

This is for the benchmark iteration, which will change the benchmark
scripts while checkouting each commit. So we need ensure the benchmark
scripts always available.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manaully

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-05 10:29:30 +08:00
Yikun Jiang
9e855b70be Adjust concurrency group for each npu workflow (#1068)
### What this PR does / why we need it?
Adjust concurrency group for each npu workflow
- for pd and benchmarks share the static-08-01, so only one job can runs
on
- other job one PR/schedule should have only 1 job runs

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-05 09:17:04 +08:00
Mengqing Cao
afc4c0cd03 [Bugfix] Fix deepseek percision issue and add acc ci for it (#905)
### What this PR does / why we need it?
Fix deepseek percision issue on V0 and add acc ci for it
Fixes https://github.com/vllm-project/vllm-ascend/issues/1062
### How was this patch tested?
CI passed with new added test.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-04 20:26:44 +08:00
Li Wang
517811449e [CI] Re-enable sleep mode test and skip failure breaking CI (#990)
### What this PR does / why we need it?

- Re-enable sleep mode test
- Fix nightly performance benchmark workflow
- Fix model-runner-v1 bug for upstream
[change](https://github.com/vllm-project/vllm/pull/18654)
---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-04 16:24:16 +08:00
Li Wang
eb2701e0b2 [CI] Remove workflow_dispatch and change schedule time (#1056)
### What this PR does / why we need it?

- Remove workflow_dispatch 
-  Change schedule time to 2:00 UTC+8
### Does this PR introduce _any_ user-facing change?


### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <858794774@qq.com>
Co-authored-by: wangli <858794774@qq.com>
2025-06-04 01:19:20 +08:00
Li Wang
06fb5a8d81 [CI][Bugfix] Upgrade escli to v0.2.1 to fix benchmark deps (#1055)
### What this PR does / why we need it?

Update escli-tool to v.0.2.1 to fix deps bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: wangli <858794774@qq.com>
2025-06-04 01:03:56 +08:00
Li Wang
76dacf3fa0 [CI][Benchmark] Optimize performance benchmark workflow (#1039)
### What this PR does / why we need it?

This is a post patch of #1014, for some convenience optimization
- Set cached dataset path for speed
- Use pypi to install escli-tool
- Add benchmark results convert script to have a developer-friendly
result
- Patch the `benchmark_dataset.py` to disable streaming load for
internet
- Add more trigger ways for different purpose, `pr` for debug,
`schedule` for daily test, `dispatch` and `pr-labled` for manual testing
of a single(current) commit
- Disable latency test for `qwen-2.5-vl`, (This script does not support
multi-modal yet)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-03 23:38:34 +08:00
wangxiyuan
543380ceae [CI] Add merge conflict label job (#1050)
Add bot to label merge conflicts, it helps developer and maintainer to
do code review and update clear.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-03 17:32:31 +08:00
Yikun Jiang
f24375f318 Enable accuracy test for PR labeled with "*accuracy-test" (#1040)
### What this PR does / why we need it?
This PR enable accuracy test for PR labeled with "*accuracy-test" and
workflow_dispatch.

Only one model test running for each type test to reduce excution time.

- The dense test costs about `25mins` to complete (gsm8k 7mins, ~mmlu
3h24mins,~ cEval 18mins)
- The vl test costs about `40mins` to complete


In futute, we might consider enable all job test as nightly schedule
job.

Below is mainly changes:
- the dense/vl accuracy test will be triggered by lableling
`accuracy-test` and `ready-for-test`
- the dense accuracy test will be triggered by lableling
`dense-accuracy-test` and `ready-for-test`
- the vl accuracy test will be triggered by lableling `vl-accuracy-test`
and `ready-for-test`
- accuracy test will also be triggered by workflow_dispatch
- Support V1 and V0 for qwen and V0 for VL

For PR test we also generate summary in test summary.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- CI passed with accuracy-test label
- Preview:
https://github.com/vllm-project/vllm-ascend/actions/runs/15407628722?pr=1040

Closes: https://github.com/vllm-project/vllm-ascend/pull/953

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2025-06-03 15:38:13 +08:00
Yikun Jiang
92bc5576d8 Skip benchmarks/** in vllm ascend test (#1041)
### What this PR does / why we need it?
Skip benchmarks/** in vllm ascend test to reduce CI cost

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-01 19:01:26 +08:00
NeverRaR
507ae627ca feat: support compile torchair graph while warming up (#839)
### What this PR does / why we need it?
feat: support compile torchair graph while warming up

Signed-off-by: boying <897013703@qq.com>
2025-05-31 06:03:03 +08:00
Li Wang
d9fb027068 [CI] Add benchmark workflows (#1014)
### What this PR does / why we need it?

Add benchmark workflows

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Run locally

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-05-30 22:42:44 +08:00
XWFAlone
3442fbdb23 [1/N][UT][v1 MTP] add basic v1 mtp features (#890)
### What this PR does / why we need it?
add basic v1 mtp features
please merge it after
https://github.com/vllm-project/vllm-ascend/pull/874 and
https://github.com/vllm-project/vllm-ascend/pull/844.

### Does this PR introduce _any_ user-facing change?
now, we supported basic v1 mtp, only supported tp only、eager mode and
k=1
we will continue to expand more scenarios.

### How was this patch tested?
local tested

Signed-off-by: XWFAlone <xuewenfei2@huawei.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
Co-authored-by: JC-ut0 <xuyexiong@huawei.com>
2025-05-30 08:59:58 +08:00
Mengqing Cao
6eddbd2521 [CI/UT][PD Disaggreate] Initialize PD Disaggreate UT (#889)
Initialize PD Disaggreate UT

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-05-29 10:17:12 +08:00
wangxiyuan
f6e5decc10 [CI] upgrade to vllm 0.9.0 (#959)
Upgrade to vllm 0.9.0.
0.8.5 will not be supported any more.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-28 21:18:41 +08:00