Commit Graph

21 Commits

Author SHA1 Message Date
Li Wang
f60bb474f9 [CI] Enable linux-aarch64-a2 (64GB) and tp2 * 2 max-parallel to speed up CI (#2065)
### What this PR does / why we need it?
Our full CI workflow currently takes about 3 hours in total, which
seriously affects the developer experience, so optimization is urgent.
After this PR, the running time of the full CI is expected to drop to
about 1h40min.

- Enable linux-aarch64-a2 (64GB) to replace linux-arm64-npu (32GB)
- Change TP4 ---> TP2 * 2 max-parallel
- Move DeepSeek-V2-Lite-W8A8 to single card test

### Does this PR introduce _any_ user-facing change?
No


- vLLM version: v0.10.0
- vLLM main:
a2480251ec

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-29 18:59:05 +08:00
Mengqing Cao
ed2ab8a197 [CI/Build] Upgrade CANN to 8.2.RC1 (#1653)
### What this PR does / why we need it?
Upgrade CANN to 8.2.RC1

Backport: https://github.com/vllm-project/vllm-ascend/pull/1653

### Does this PR introduce _any_ user-facing change?
Yes, docker image will use 8.2.RC1

### How was this patch tested?
CI passed

- vLLM version: v0.10.0
- vLLM main:
7728dd77bb

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-26 22:37:46 +08:00
Yikun Jiang
17a430f7b8 Upgrade vLLM to v0.10.0 (#1927)
### What this PR does / why we need it?
- Upgrade to v0.10.0
- Drop v0.9.2 version compatibility
- Add a patch for
`vllm_ascend/patch/worker/patch_common/patch_sampler_gather_logprobs.py`
as a workaround for
f3a683b7c9
in v0.10.0, and add the e2e test `test_models_prompt_logprobs`
- Pin transformers<4.54.0 as a workaround for
https://github.com/vllm-project/vllm-ascend/issues/2034

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Test locally:
`VLLM_USE_MODELSCOPE=true pytest -sv
tests/e2e/singlecard/test_offline_inference.py::test_models_prompt_logprobs`
- CI passed

- vLLM version: v0.9.2
- vLLM main:
7728dd77bb

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-26 15:43:29 +08:00
li chaoran
ff97740b8d Use mirror images (#1912)
### What this PR does / why we need it?
More discussion can be found
[here](https://github.com/ascend-gha-runners/docs/issues/23).

The infra team deployed an internal registry because connections to both
`m.daocloud.io` and `quay.io` were unstable.

Switching CI to the internal registry should improve both connection
stability and download speed.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
tested locally

- vLLM version: v0.9.2
- vLLM main:
6b46c4b653

---------

Signed-off-by: mywaaagh_admin <pkwarcraft@gmail.com>
2025-07-24 10:47:05 +08:00
li chaoran
3e39d7234c [CI] Switching to infra cache server to reduce network pressure (#1792)
### What this PR does / why we need it?
This PR introduces the infra cache server to speed up apt/pip package
installation.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
Tested locally. With this config, network bandwidth usage dropped from
100% to 5% when a new PR was submitted.
<img width="807" height="334" alt="image"
src="https://github.com/user-attachments/assets/16f03bce-4531-4c71-ab6e-8308dc2c022c"
/>


- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

---------

Signed-off-by: mywaaagh_admin <pkwarcraft@gmail.com>
2025-07-18 18:39:25 +08:00
zhangxinyuehfad
1b4a2f3817 [CI] Add accuracy ci for DP and EP and TP and ETP (#1140)
### What this PR does / why we need it?

Add accuracy CI for DP, EP, and TP

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.9.2
- vLLM main:
35514b682a

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-11 17:25:17 +08:00
Shanshan Shen
6af35f60cc [Bugfix][CI] Remove V0 Spec Decode CI (#1656)
### What this PR does / why we need it?

Fix the following error in the long-term test CI:

```bash
modelscope - ERROR - Repo JackFram/llama-68m not exists on either https://www.modelscope.cn/ or https://www.modelscope.ai/
```

Replace the HF model with a ModelScope model.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
71d1d75b7a

---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
2025-07-09 15:53:58 +08:00
Yikun Jiang
e4e9ea02ab Upgrade vLLM version to v0.9.2 (#1652)
### What this PR does / why we need it?

This patch upgrades vLLM to v0.9.2. It does not remove the v0.9.1
compatibility code, to keep the review easy.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
14601f5fba
- Accuracy test with 0.9.2:
https://github.com/vllm-project/vllm-ascend/actions/runs/16121612087

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-08 14:18:17 +08:00
Mengqing Cao
dd22ac38b2 [CI/UT][Refactor] move e2e spec decode and deepseek acc test to per pr (#1136)
### What this PR does / why we need it?
1. run deepseek acc ut per pr --- multicard CI time increased by 9 min
2. run spec decode e2e test on v1 per pr --- singlecard CI time
increased by 3 min (partly disabled because some tests do not work yet)
~~3. align the output of whether dbo is enabled or not~~
    The generated results with and without dbo cannot be aligned.

https://github.com/vllm-project/vllm-ascend/actions/runs/15822900528/job/44600029405?pr=1136
4. skip V0 mtp test due to failure in
https://github.com/vllm-project/vllm-ascend/actions/runs/16012172833/job/45171988816
5. fix some version conflicts
### How was this patch tested?
CI passed with the newly added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-04 18:05:45 +08:00
zhangxinyuehfad
4e910186de [CI/UT] Unify model usage via ModelScope in CI (#1207)
### What this PR does / why we need it?
Unify Model Usage via ModelScope

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-04 10:52:17 +08:00
wemaster
339d6894f6 [CI/UT][bugfix] fix v0 spec decode (#1321)
### What this PR does / why we need it?
1. [PR913](https://github.com/vllm-project/vllm-ascend/pull/913)
introduced an error that caused V0's spec decode function to fail.
[PR1109](https://github.com/vllm-project/vllm-ascend/pull/1109) tried
to fix this problem. Unfortunately, that fix broke the ngram function. I
fixed the ngram function in this PR. **PS**: Q: Why was the ngram
problem not found when PR1109 was merged? A: The newly introduced
problem only appears when tp>1, and the use cases on CI are all tp=1.
2. In versions after 0.7.3, vllm-ascend deleted some spec decode UTs to
avoid CI taking too long, including the eagle speculative UTs, which
left CI unable to cover the eagle function. I added
`test_eagle_correctness.py` back in this PR.
3. As a result of item 2, the current version of Eagle has a problem. I
located and fixed it. The cause was that vllm's
`draft_model_runner.py` was changed and vllm-ascend was not synchronized
in time.
4. Currently, the UTs of v0 and v1 are mixed in the spec_decode
directory. I split them into two directories: spec_decode_v0 and
spec_decode_v1.
5. I found that
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_include_gpu_probs_tensor`
and
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_should_modify_greedy_probs_inplace`
have changed in vllm, so they are removed in this PR.

### Does this PR introduce _any_ user-facing change?
This PR fixes the functions of ngram and eagle spec decode in the v0
engine

### How was this patch tested?
tested by CI

Signed-off-by: mengwei805 <mengwei25@huawei.com>
2025-06-23 09:05:13 +08:00
yuancaoyaoHW
00ae250f3c [V1][eagle3] Support eagle3 proposer for v1 (#1032)
### What this PR does / why we need it?
This PR implements the Eagle Proposer feature for vLLM v1, which enables
more efficient speculative decoding by using a draft model to predict
potential future tokens.
- The implementation includes the core Eagle algorithm integration with
vLLM's existing architecture, allowing for faster inference while
maintaining output quality.
- This is needed to significantly improve the generation speed of large
language models without compromising on the quality of generated text.

### Does this PR introduce any user-facing change?
Yes, this PR introduces a new speculative decoding mode that can be
enabled via configuration.
- Users can now choose to use Eagle Proposer by setting appropriate flags
in the inference configuration.
- The API remains backward compatible, with the new functionality being
opt-in.

### How was this patch tested?
CI passed with new unit tests added for the Eagle Proposer functionality.
- Benchmark tests were conducted comparing generation speed and quality
with and without Eagle Proposer.
- Integration tests were performed with various model architectures to
ensure compatibility.
- Manual testing was done using different prompt scenarios to verify
output quality remains consistent.
- We tested the acceptance rate on one Ascend 910B NPU. The results are
basically consistent with those shown here:
https://github.com/vllm-project/vllm/pull/16937
- Currently, we support scenarios where num_spec_tokens <= 2. When
num_spec_tokens > 2, issues such as insufficient GPU memory and operator
computation errors may occur. We will address this in subsequent
updates.
- We will add support for Eagle v1 in future updates.

### Acceptance Test Script
```bash
SCRIPT="/offline/eagle.py"
DATASET="ShareGpt"
MODEL=Meta-Llama-3.1-8B-Instruct
DRAFT=EAGLE3-LLaMA3.1-Instruct-8B

CUDA_VISIBLE_DEVICES="0" VLLM_USE_V1=1 $PYTHON $SCRIPT \
    --dataset $DATASET \
    --num_spec_tokens 2 \
    --max_num_seqs 1 \
    --model_dir $MODEL \
    --eagle_dir $DRAFT \
    --tp 1 \
    --num_prompts 80
```
### Acceptance Test Results
```bash
██████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [21:22<00:00, 16.03s/it, est. speed input: 4.72 toks/s, output: 13.56 toks/s]
-------------------------------------------------------------------------------------
mean acceptance length: 1.63
-------------------------------------------------------------------------------------
total_counts: 8062
acceptance at token 0: 1.00 (8062 times)
acceptance at token 1: 0.70 (5612 times)
acceptance at token 2: 0.47 (3765 times)
```
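
As a quick check on the figures above, the per-position acceptance rates appear to be each reported count divided by the count at token 0. A minimal Python sketch under that assumption; `acceptance_rates` is a hypothetical helper name, and the counts are copied from the output above:

```python
def acceptance_rates(counts):
    """Return each position's count as a fraction of the count at token 0."""
    base = counts[0]
    return [c / base for c in counts]


# Counts copied from the acceptance test output.
counts = [8062, 5612, 3765]
for pos, rate in enumerate(acceptance_rates(counts)):
    print(f"acceptance at token {pos}: {rate:.2f} ({counts[pos]} times)")
```

Run as written, this prints the same three `acceptance at token` lines as the test output.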

Closes: https://github.com/vllm-project/vllm-ascend/issues/1004

---------

Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>
2025-06-20 17:19:54 +08:00
Mengqing Cao
96fa7ff63b [DP][V1] Fix rank set in DP scenario & Bump torch-npu version to 2.5.1.post1.dev20250528 (#1235)
### What this PR does / why we need it?
1. Fix rank set in DP scenario. The new POC version of torch-npu supports
setting `ASCEND_RT_VISIBLE_DEVICES` dynamically, so we can use the
rank set in `DPEngineCoreProc` directly instead of calculating the local
rank across DP by hand in the patched `_init_data_parallel`.

Closes: https://github.com/vllm-project/vllm-ascend/issues/1170

2. Bump torch-npu version to 2.5.1.post1.dev20250528

Closes: https://github.com/vllm-project/vllm-ascend/pull/1242
Closes: https://github.com/vllm-project/vllm-ascend/issues/1232


### How was this patch tested?
CI passed with the newly added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: Icey <1790571317@qq.com>
2025-06-16 23:09:53 +08:00
wangxiyuan
69b817ed65 [CI] Add unit test framework (#1201)
This PR adds a unit test framework to enable UT for vLLM Ascend. Unit
tests run on CPU machines. Like the e2e tests, they run once the lint
check has passed.

For unit tests, this PR creates a new folder called `ut` under the `tests`
module. The test files in `ut` should mirror the layout of the code in
`vllm-ascend`, and each file name should start with the `test_` prefix.
For example, this PR adds `test_ascend_config.py` to test
`ascend_config.py`.

A new file `worker/test_worker_v1.py` is also added as a placeholder.
This file should hold the unit tests for `vllm-ascend/worker/worker_v1.py`.
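
The naming convention described above can be sketched as a small path mapping. This is only an illustration: `ut_path_for` is a hypothetical helper, and the `vllm_ascend` package root and `tests/ut` test root are assumptions based on the paths mentioned in this log.

```python
from pathlib import PurePosixPath


def ut_path_for(source, package_root="vllm_ascend", ut_root="tests/ut"):
    """Map a source file to the unit test file that should cover it.

    The test file keeps the source file's relative directory and gains
    a `test_` prefix. Both roots are assumed defaults, not confirmed.
    """
    rel = PurePosixPath(source).relative_to(package_root)
    return str(PurePosixPath(ut_root) / rel.parent / f"test_{rel.name}")


print(ut_path_for("vllm_ascend/ascend_config.py"))
print(ut_path_for("vllm_ascend/worker/worker_v1.py"))
```

The two calls correspond to the `test_ascend_config.py` and `worker/test_worker_v1.py` examples in the commit message.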

Additionally, a new `fake_weight` folder is added. It contains the
config.json from `facebook/opt-125m`, so that the tests do not always
need to access Hugging Face.

TODO:
We should add the unit test files one by one in the future.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-16 18:32:28 +08:00
wangxiyuan
4f5964420e [CI] Upgrade vllm to 0.9.1 (#1165)
1. upgrade vllm to 0.9.1. 0.9.0 is no longer supported on the main branch.
keep the docs at 0.9.0 until we publish the first 0.9.1 release.
2. disable V0 test for PR
3. move actionlint check to lint job

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-11 16:33:11 +08:00
Mengqing Cao
04abfd8721 [CI] Skip test_v1_spec_decode.py::test_ngram_correctness to make longterm CI pass (#1163)
[CI] Skip test_v1_spec_decode.py::test_ngram_correctness to make
longterm CI pass

Related: https://github.com/vllm-project/vllm-ascend/issues/1162

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-11 07:31:13 +08:00
Yikun Jiang
9e855b70be Adjust concurrency group for each npu workflow (#1068)
### What this PR does / why we need it?
Adjust concurrency group for each npu workflow
- pd and benchmarks share static-08-01, so only one job can run on it
at a time
- for other jobs, each PR/schedule should have only one job running

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-05 09:17:04 +08:00
Mengqing Cao
afc4c0cd03 [Bugfix] Fix deepseek precision issue and add acc ci for it (#905)
### What this PR does / why we need it?
Fix deepseek precision issue on V0 and add acc CI for it
Fixes https://github.com/vllm-project/vllm-ascend/issues/1062
### How was this patch tested?
CI passed with the newly added test.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-04 20:26:44 +08:00
XWFAlone
3442fbdb23 [1/N][UT][v1 MTP] add basic v1 mtp features (#890)
### What this PR does / why we need it?
add basic v1 mtp features
please merge it after
https://github.com/vllm-project/vllm-ascend/pull/874 and
https://github.com/vllm-project/vllm-ascend/pull/844.

### Does this PR introduce _any_ user-facing change?
Basic v1 MTP is now supported, limited to TP only, eager mode, and
k=1.
We will continue to expand to more scenarios.

### How was this patch tested?
Tested locally.

Signed-off-by: XWFAlone <xuewenfei2@huawei.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
Co-authored-by: JC-ut0 <xuyexiong@huawei.com>
2025-05-30 08:59:58 +08:00
wangxiyuan
f6e5decc10 [CI] upgrade to vllm 0.9.0 (#959)
Upgrade to vllm 0.9.0.
0.8.5 will not be supported any more.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-28 21:18:41 +08:00
wangxiyuan
e2a0c19cea [CI] Refactor CI (#952)
1. remove some useless test functions and files
2. fix format.sh problem
3. enable full test for singlecard and multicard
4. move long term tests to the long_term folder. These tests run only
for labeled PRs and in the daily test. They include the spec decode and
accuracy tests.

## After refactor:
There are 4 test modules
- `singlecard`: contains the test running on one NPU. It'll be run for
each PR and daily test.
- `multicard`: contains the test running on multi NPUs. It'll be run for
each PR and daily test.
- `long_term`: contains the tests that take a long time (currently the
`spec decode` and `accuracy` tests). It'll be run for the PR with
`long-term-test` labeled and daily test.
- `e2e`: contains the test for doc and pd feature. It'll be run for the
PR with `pd-test` labeled and daily test.

## Todo:
1. some tests are skipped; they should be fixed and re-enabled in the
future.
2. pyhccl test for multicard doesn't work at all. It should be enabled
as well.
3. ensure the long-term tests pass in the daily test.

### Known issue
Currently, the `ready` label is required to start the pd test or the long
term test. When `long-term-test` or `pd-test` is added after the other
one, the previously labeled test is re-run. So labeled tests should be
run with the following steps:

1. decide which tests need to run, then add the label: `long-term-test`,
`pd-test`, or both.
2. add the `ready-for-test` label; the tests will then run.
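
The module and label rules above can be summarized as a small selection function. This is a rough sketch, not the actual workflow logic: `modules_to_run` is a hypothetical name, and it assumes the gating label is `ready-for-test` as in step 2.

```python
def modules_to_run(labels, is_daily=False):
    """Return the test modules that would run for a PR or a daily run.

    `labels` is the set of labels on the PR. The rules follow the
    description above; the real CI is defined in workflow files.
    """
    # singlecard and multicard run for every PR and for the daily test.
    modules = ["singlecard", "multicard"]
    if is_daily:
        # The daily test also runs the labeled-only modules.
        return modules + ["long_term", "e2e"]
    # Labeled tests start only once the gating label is present (assumed name).
    if "ready-for-test" in labels:
        if "long-term-test" in labels:
            modules.append("long_term")
        if "pd-test" in labels:
            modules.append("e2e")
    return modules


print(modules_to_run({"long-term-test", "ready-for-test"}))
```

Adding both labels before `ready-for-test` selects both modules in one pass, which is the reason for the ordering in the steps above.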

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-28 06:31:35 +08:00