157 Commits

Author SHA1 Message Date
Mengqing Cao
6ee7f5cf71 [SpecDecode] Add spec decode support (#500)
### What this PR does / why we need it?
Backport: https://github.com/vllm-project/vllm-ascend/pull/252
This adds support for speculative decoding on Ascend, including speculating
with a draft model, by matching n-grams in the prompt, using MLP speculators,
and using EAGLE-based draft models.

Backport: https://github.com/vllm-project/vllm-ascend/pull/423
The spec decode MultiStepWorker now fully supports TP1DraftModelRunner,
supports running the draft_model_runner with multi-step prepare directly on
the NPU, and supports using MLA in the draft_model_runner.

1. Before this PR, `MultiStepWorker` never stepped into the branch
using NPU prepare, only into the branch using CPU prepare (`line 52`
of `vllm_ascend/patch/patch_multi_step_worker.py`). Although this does
not affect the correctness of speculative decoding, and the
performance of the two branches is basically the same in the current
version, I enable entering the NPU branch in this PR. There are two
main changes in `patch_multi_step_worker.py`: first, the
`is_cuda_like()` check is removed and the `TP1DraftModelRunner`
rewritten in vllm_ascend is used; second, the
`supports_gpu_multi_step()` function is made to return true on NPU
devices when the outer Multi_step_worker can work correctly.

2. Before this PR, `TP1DraftModelRunner` supported only Attention on NPU,
not MLA. The relevant adaptation is in
`vllm_ascend/worker/draft_model_runner.py`. Although I don't know why
the `input_positions` of `model_input.attn_metadata` in vllm-ascend
needs to be added in `execute_model`, it is done in `model_runner.py`,
so I made the corresponding change here as well. Otherwise, when the
attention backend is MLA, it reports that `input_positions` cannot be found.

3. I commented out two lines in `draft_model_runner.py` at `line118` to
support the scenario of K>1.
  ```
  # lora_mapping=model_input.lora_mapping,
  # lora_requests=model_input.lora_requests,
  ```
I added comments there. Once vllm-ascend supports the LoRA feature,
these lines can be restored.

TODO:
- [ ] revert the patch when the related issues are addressed in vllm

### How was this patch tested?
CI passed with newly added tests.
- e2e test for medusa proposer:
tests/singlecard/spec_decode/e2e/test_medusa_correctness.py
- e2e test for mlp proposer:
tests/singlecard/spec_decode/e2e/test_mlp_correctness.py
- e2e test for n-gram proposer:
tests/singlecard/spec_decode/e2e/test_ngram_correctness.py

Tests for patched files:
- tests/singlecard/spec_decode/test_dynamic_spec_decode.py
- tests/singlecard/spec_decode/test_multi_step_worker.py
- tests/singlecard/spec_decode/test_ngram_worker.py
- tests/singlecard/spec_decode/test_spec_decode_worker.py

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
2025-04-17 20:16:32 +08:00
Mengqing Cao
b71f193cb0 [Model][Doc] Update model support list (#552)
Update model support list
cc @Yikun plz help review, thanks!

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-04-17 19:32:20 +08:00
whx
20dff4deff [Scheduler] Add AscendScheduler. (#543)
This PR adds AscendScheduler to the vllm v1 engine.
The scheduler currently supports the v0-style prefill-first scheduling
strategy.
More scheduling methods will be supported by this scheduler in the future.
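
As a rough illustration of what a v0-style prefill-first strategy means, the toy sketch below admits waiting (prefill) requests ahead of running (decode) requests within a token budget. It is not the `AscendScheduler` implementation; `Request`, `schedule_step`, and `token_budget` are hypothetical names used only for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    prompt_len: int
    prefilled: bool = False


def schedule_step(waiting: deque, running: list, token_budget: int) -> list:
    """Return the requests chosen for one step, prefills first.

    Toy model: a prefill costs `prompt_len` tokens, a decode costs 1 token.
    """
    scheduled = []
    budget = token_budget
    # Admit waiting requests in FIFO order while their prompts fit the budget.
    while waiting and waiting[0].prompt_len <= budget:
        req = waiting.popleft()
        budget -= req.prompt_len
        req.prefilled = True
        running.append(req)
        scheduled.append(req)
    if scheduled:
        # v0-style: a step that contains prefills does not also run decodes.
        return scheduled
    # No prefill was admitted, so spend the budget on decode steps instead.
    return running[:token_budget]


waiting = deque([Request("a", 6), Request("b", 3), Request("c", 8)])
running = []
step1 = [r.request_id for r in schedule_step(waiting, running, token_budget=10)]
step2 = [r.request_id for r in schedule_step(waiting, running, token_budget=10)]
step3 = [r.request_id for r in schedule_step(waiting, running, token_budget=10)]
print(step1, step2, step3)
```

In this toy run the first two steps are prefill-only batches, and decoding starts only once the waiting queue is empty.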

---------

Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
2025-04-17 19:31:50 +08:00
paulyu12
697908f5cd [Platform][Worker][ModelRunner] Add LoRA & Multi-LoRA support (#521)
### What this PR does / why we need it?
According to this RFC [[RFC]: Join the MultiLora and MultiLora Dynammic
Serving feature develop
#396](https://github.com/vllm-project/vllm-ascend/issues/396) and this
[vLLM Ascend Roadmap Q2 2025
#448](https://github.com/vllm-project/vllm-ascend/issues/448), this PR adds
the relevant code to support (1) Multi-LoRA and (2) Multi-LoRA
Dynamic Serving.

LoRA reference is here: [LoRA
reference](https://docs.vllm.ai/en/latest/features/lora.html)

### Does this PR introduce _any_ user-facing change?

The following OpenAI-compatible HTTP APIs will be supported:
/v1/load_lora_adapter
/v1/unload_lora_adapter
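
For illustration, requests to these endpoints might be built as below. This sketch assumes the request bodies follow upstream vLLM's dynamic LoRA serving documentation (`lora_name` and `lora_path` fields) and that the server was started with runtime LoRA updating enabled; the server address, adapter name, and adapter path are placeholders.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # placeholder server address


def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a LoRA endpoint."""
    return urllib.request.Request(
        url=f"{BASE_URL}{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


load_req = build_request(
    "/v1/load_lora_adapter",
    {"lora_name": "my_adapter", "lora_path": "/path/to/my_adapter"},
)
unload_req = build_request("/v1/unload_lora_adapter", {"lora_name": "my_adapter"})

# To actually send a request against a running server:
# with urllib.request.urlopen(load_req) as resp:
#     print(resp.status, resp.read().decode())
```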

### How was this patch tested?
git clone https://github.com/vllm-project/vllm.git
cd vllm/examples/offline_inference/ && python3 multilora_inference.py

---------

Signed-off-by: paulyu <paulyu0307@gmail.com>
Co-authored-by: paulyu <paulyu0307@gmail.com>
2025-04-17 16:48:46 +08:00
hfadzxy
9935d45728 [CI]Add model basic accuracy test(Qwen2.5-0.5B-Instruct) (#460)
### What this PR does / why we need it?
Add model basic accuracy test(Qwen2.5-0.5B-Instruct)

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-04-17 14:59:56 +08:00
Huazhong Ji
c3d1a3782a Add pyhccl (#503)
This is the first step to support trl vllm serve on Ascend NPU
https://github.com/vllm-project/vllm-ascend/issues/459.
This PR can work properly only when
https://github.com/vllm-project/vllm/pull/16464 is merged into vLLM.

---------

Signed-off-by: hzji210@gmail.com <hzji210@gmail.com>
2025-04-17 14:57:52 +08:00
Li Wang
64fdf4cbef [Doc]Update faq (#536)
### What this PR does / why we need it?
update performance and accuracy faq

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-04-17 14:56:51 +08:00
Mengqing Cao
6061f33670 [Bugfix][Model] Fix api in DeepSeek model (#545)
### What this PR does / why we need it?
Fix api in DeepSeekV2, aligning with the latest code of the main branch
in vllm.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
Test locally with deepseek-v2-lite, and will add CI by @Potabk.
Plz update the model UT after this pr is merged, thx! cc @Potabk

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-04-17 11:56:05 +08:00
Li Wang
9859e7313f [CI]Add global env to runner (#537)
### What this PR does / why we need it?
- add `HF_TOKEN` as global var to the runner
- add `HF_ENDPOINT` as global var to the runner
- change concurrency group, rely on current pr num

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-04-17 10:08:00 +08:00
hfadzxy
00de2ee6ad [Doc] update faq about progress bar display issue (#538)
### What this PR does / why we need it?
update faq about progress bar display issue

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-04-16 16:07:08 +08:00
Mengqing Cao
fe13cd9ea5 [Doc] update faq about w8a8 (#534)
update faq about w8a8

---------

Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-04-16 09:37:21 +08:00
Shanshan Shen
415ed027fa [V1][Platform] Remove supports_structured_output() in platform (#531)
### What this PR does / why we need it?
Remove `supports_structured_output()` in platform. This method is no longer needed because upstream has removed it.

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-04-16 09:30:33 +08:00
wangxiyuan
bbe7ccd366 [MISC] Add patch module (#526)
This PR adds a patch module for vllm:
1. platform patch: the patch is registered when the platform is loaded.
2. worker patch: the patch is registered when the worker is started.

The details are:
1. patch_common: patch for both the main and 0.8.4 versions
2. patch_main: patch for the main version
3. patch_0_8_4: patch for the 0.8.4 version
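
One way to picture the version split is the minimal sketch below. The module names come from the description above, but `select_patch_modules` and its selection logic are assumptions made for illustration, not the PR's actual registration code.

```python
def select_patch_modules(vllm_version: str) -> list:
    """Return the patch module names to apply for a given vLLM version.

    Hypothetical helper: the selection logic is an illustrative assumption.
    """
    modules = ["patch_common"]  # applies to both main and 0.8.4
    if vllm_version == "0.8.4":
        modules.append("patch_0_8_4")
    else:
        # Any other version string is treated as the main branch here.
        modules.append("patch_main")
    return modules


print(select_patch_modules("0.8.4"))
print(select_patch_modules("0.8.5.dev1"))
```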
2025-04-16 09:28:58 +08:00
wangxiyuan
434749d299 [CI] update 0.8.3 to 0.8.4 (#528)
Update 0.8.3 CI to 0.8.4

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-16 09:26:30 +08:00
Li Wang
13480d1238 [CI]Fix workflow (#532)
### What this PR does / why we need it?
make linux-npu-4 runner run parallel for now


Signed-off-by: wangli <wangli858794774@gmail.com>
2025-04-15 19:55:41 +08:00
Shanshan Shen
bcbc04f92b [Doc] Add environment variables doc (#519)
### What this PR does / why we need it?
Add environment variables doc.
---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-04-15 16:09:36 +08:00
eeethenQ
44a8301424 [Feature] Add PD separation feature (#432)
### What this PR does / why we need it?
Adapt the Disaggregated Prefill feature to Ascend devices

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

A usage example is provided along with the PR, in
examples/offline_disaggregated_prefill_npu.py
To run it:
```
export PROMPT_DEVICE_ID=0,1
export DECODE_DEVICE_ID=2,3
python examples/offline_disaggregated_prefill_npu.py
```

---------

Signed-off-by: ZihuiQian <qianzihui@huawei.com>
Co-authored-by: ZihuiQian <qianzihui@huawei.com>
2025-04-15 15:11:35 +08:00
wangxiyuan
c7f6584d75 [V1] clean up V1 code (#505)
Clean up V1 code:
1. remove useless code.
2. format code to be clear.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-15 10:24:02 +08:00
wangxiyuan
f6af1d2471 [MISC] fix logger (#515)
The logger in vllm-ascend doesn't work. This PR fixes the issue.

Fix: https://github.com/vllm-project/vllm-ascend/issues/431

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-15 10:18:05 +08:00
wangxiyuan
5c6d79687c [Doc] Update FAQ (#518)
Update FAQ

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-15 10:17:56 +08:00
wangxiyuan
5fa70b6393 [Build] Update doc (#509)
1. install torch-npu before vllm-ascend to ensure the custom ops build
succeeds.
2. set `COMPILE_CUSTOM_KERNELS=0` if users want to disable the custom ops
build.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-14 14:38:50 +08:00
Shanshan Shen
11ecbfdb31 [Doc] Update FAQ doc (#504)
### What this PR does / why we need it?
Update FAQ doc.
---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-04-14 11:11:40 +08:00
wangxiyuan
9c7428b3d5 [CI] enable custom ops build (#466)
### What this PR does / why we need it?
This PR enables the custom ops build by default.

### Does this PR introduce _any_ user-facing change?

Yes, installing vllm-ascend from source now triggers the custom ops
build step.

### How was this patch tested?
By image build and e2e CI

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-12 10:24:53 +08:00
Icey
d05ea17427 Add openEuler based container image for vLLM Ascend (#489)
### What this PR does / why we need it?

Provide users with openEuler-based vllm images and update the quick
start README accordingly

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

There is no need for performing any test.

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-04-10 14:30:49 +08:00
Li Wang
afdbf77483 [CI] Add new runner and enable QwQ multinpu test (#417)
### What this PR does / why we need it?

- Add a new runner to the continuous integration system and keep the
original CI runner until the new runner runs stably
- Add distributed test cases

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-04-08 16:52:45 +08:00
jinyuxin
5d6239306b [DOC] Update multi_node.md (#468)
### What this PR does / why we need it?
- Added instructions for verifying multi-node communication environment.
- Included explanations of Ray-related environment variables for
configuration.
- Provided detailed steps for launching services in a multi-node
environment.
### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
manually tested.

Signed-off-by: jinyuxin <jinyuxin2@huawei.com>
2025-04-08 14:19:57 +08:00
Mengqing Cao
f6cf92e7d5 [quant][bugfix] fix deepseek quant bug (#478)
see #465

Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
2025-04-08 09:15:56 +08:00
Yikun Jiang
579d858a20 Set torchvision<0.21.0 to match torch/torch_npu version (#479)
### What this PR does / why we need it?
Set torchvision<0.21.0 to match torch/torch_npu version to resolve
`RuntimeError: operator torchvision::nms does not exist`.

Closes: https://github.com/vllm-project/vllm-ascend/issues/477

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-04-08 09:15:42 +08:00
Shanshan Shen
1d88dacf9f [V1][Platform] Add supports_structured_output() method to Platform (#475)
### What this PR does / why we need it?
Add `supports_structured_output()` method to Platform, find more details
at https://github.com/vllm-project/vllm/pull/16148.

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-04-07 19:11:51 +08:00
Yikun Jiang
adabdeea7f Set numpy < 2.0.0 to resolve numpy VersionConflict (#476)
### What this PR does / why we need it?
vLLM bumped the numpy version to 2.x in 8427f70493, which causes a
`pip._vendor.pkg_resources.ContextualVersionConflict: (numpy 2.2.4
(/usr/local/python3.10/lib/python3.10/site-packages),
Requirement.parse('numpy==1.26.4'), {'vllm-ascend'})` failure when
vllm-ascend is installed. This PR resolves the issue by:
- Set numpy < 2.0.0 to resolve numpy VersionConflict
- Sync requirements and toml 
- Reorder


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Closes: https://github.com/vllm-project/vllm-ascend/issues/473

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-04-07 16:07:21 +08:00
Mengqing Cao
344228a5da [deepseek][bugfix] support deepseek quant (#469)
- support deepseek quant
  - add w8a8_dynamic quant
see #391

Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
2025-04-07 10:56:12 +08:00
Li Wang
3f9752f8ee [Bugfix]Lazy import vllm config (#462)
### What this PR does / why we need it?
Lazy import vllm config  to avoid circular imports
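
A common way to break an import cycle is to defer the import until the function that needs it is called. The sketch below shows the general pattern only; it uses the standard-library `json` module as a stand-in for the config module, and `dump_config` is a hypothetical function, not code from this PR.

```python
def dump_config(config: dict) -> str:
    """Serialize a config dict, importing the dependency only when called."""
    # Deferred import: resolved at call time rather than at module load,
    # so importing this module does not trigger the dependency's import.
    import json  # stand-in for the module that would cause a cycle

    return json.dumps(config, sort_keys=True)


print(dump_config({"b": 2, "a": 1}))
```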

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-04-03 16:03:08 +08:00
Pleaplusone
ce8259975e [core] Support custom ascendc kernels in vllm-ascend (#233)
This PR adds support for the custom ascendc kernel rotary_embedding in
vllm-ascend; the related CMakeLists and setuptools changes are also included.

Related: https://github.com/vllm-project/vllm-ascend/issues/156

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-04-03 14:52:34 +08:00
Shanshan Shen
14d9a64047 [ModelRunner][V1] Optimize V1 attention mask (#442)
### What this PR does / why we need it?
Pre-construct a mask matrix to improve the efficiency of attention mask
construction during inference.

Note that the length of the matrix needs to be carefully balanced: a
matrix that is too large will consume excessive VRAM, while a matrix
that is too small will require dynamic concatenation during inference,
leading to performance degradation.

Therefore, an environment variable is added here to dynamically set the
size of the pre-constructed mask matrix based on requirements.
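
The trade-off described above can be sketched roughly as follows. This is not the PR's implementation: the environment variable name `MASK_CACHE_LEN`, its default, and the helper names are hypothetical; plain Python lists stand in for device tensors; and the slow path rebuilds the mask instead of concatenating, for simplicity.

```python
import os

# Hypothetical variable name and default; the PR's actual name is not shown here.
MASK_CACHE_LEN = int(os.environ.get("MASK_CACHE_LEN", "8"))


def build_causal_mask(size: int) -> list:
    """Return a size x size mask where True marks positions to hide."""
    return [[col > row for col in range(size)] for row in range(size)]


# Built once up front; a larger cache costs more memory.
_CACHED_MASK = build_causal_mask(MASK_CACHE_LEN)


def get_mask(seq_len: int) -> list:
    """Slice the cached mask when it is large enough, else build a new one."""
    if seq_len <= MASK_CACHE_LEN:
        # Fast path: the top-left corner of a causal mask is itself causal.
        return [row[:seq_len] for row in _CACHED_MASK[:seq_len]]
    # Slow path: the cache is too small, so pay the construction cost now.
    return build_causal_mask(seq_len)


print(get_mask(3))
```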

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: didongli182 <didongli@huawei.com>
2025-04-02 10:33:53 +08:00
hfadzxy
94bf9c379e [Doc]Add developer guide for using lm-eval (#456)
### What this PR does / why we need it?
Add developer guide for using lm-eval

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
test manually

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-04-01 23:43:51 +08:00
dependabot[bot]
78083d405e Bump actions/setup-python from 5.4.0 to 5.5.0 (#440)
Bumps [actions/setup-python](https://github.com/actions/setup-python)
from 5.4.0 to 5.5.0.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-04-01 14:34:33 +08:00
Mengqing Cao
2dbd763584 [CI] Fix mypy CI (#443)
### What this PR does / why we need it?
Fix CI by updating mypy and pinning the numpy version

_the modification of model_runner_v1 is just to make CI happy_

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-04-01 09:25:33 +08:00
Yikun Jiang
c42e21a5aa [Docs] Add install system dependencies in install doc (#438)
### What this PR does / why we need it?
Add install system dependencies in install doc

Resolve:
```
$ pip install vllm==v0.7.3
CMake Error at CMakeLists.txt:14 (project):
  No CMAKE_CXX_COMPILER could be found.
  Tell CMake where to find the compiler by setting either the environment
  variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
  to the compiler, or to the compiler name if it is in the PATH.
// ... ...
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for vllm
Failed to build vllm
ERROR: Failed to build installable wheels for some pyproject.toml based projects (vllm)
```

Closes: https://github.com/vllm-project/vllm-ascend/issues/439 


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-03-31 14:17:55 +08:00
hfadzxy
7beb4339dc [Doc]Add developer guide for using OpenCompass (#368)
### What this PR does / why we need it?
Add developer guide for using OpenCompass

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

test manually

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-03-31 00:24:25 +08:00
wangxiyuan
b6499ed97d [CI] Use CI pool (#428)
Use CI pool instead of self-host for e2e test to speed up CI.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-03-29 12:42:59 +08:00
wangxiyuan
ca8b1c3e47 [Doc] Add 0.7.3rc2 release note (#419)
Add 0.7.3rc2 release note. We'll release 0.7.3rc2 right now.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-03-29 09:02:08 +08:00
wangxiyuan
31f29b9f30 [Core] Make V1 work and enable V1 engine test (#389)
1. Make sure the version is a string before parsing it in collect_env
2. Add basic V1 engine test

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-03-28 19:34:23 +08:00
wuhuikx
57a84bb7be [Bug Fix] Fix bug of platform for parameter checking (#411)
Fix bug in platform.py to avoid the None value of config parameters.

Signed-off-by: wuhuikx <wuhui_csu@163.com>
2025-03-28 16:31:27 +08:00
Tony
b1557abab6 fix multistep bug,remove uselesscodes (#355)
1. remove useless code in attention.py
2. multistep now uses StatefulModelInputForNPU instead of
StatefulModelInput

Signed-off-by: new-TonyWang <wangtonyyu222@gmail.com>
2025-03-28 09:55:35 +08:00
Yikun Jiang
1864c40520 Add vLLM Ascend Weekly meeting link (#400)
### What this PR does / why we need it?
Add vLLM Ascend Weekly meeting link

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-03-27 09:00:21 +08:00
Zhenyu Zheng
4804b74e95 Update 110-user-story.yml (#402)
Fix a few typos in issue template

Signed-off-by: Zhenyu Zheng <zheng.zhenyu@outlook.com>
2025-03-27 08:58:57 +08:00
Zhenyu Zheng
0b5a9643fd Add an example for user stories (#399)
Add an example for user stories and fix some typos

Add a new section, user story, to the docs to collect user stories of
vllm-ascend; also add an example and the issue template for collecting
user stories

Signed-off-by: Zhenyu Zheng <zheng.zhenyu@outlook.com>
2025-03-26 16:25:57 +08:00
BAI Fan
122505208f FastPatch: Optimized Patch Embedding for Qwen2VL (#345)
### What this PR does / why we need it?
We proposed the FastPatch method, which optimized patch embedding
(Conv3D) for Qwen2VL.


### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
We tested it on benchmarks; the results are satisfactory and better
than the original patch_embed layer.


---------

Signed-off-by: baifanxxx <baifanxxx@gmail.com>
Signed-off-by: zouyida <zouyida@huawei.com>
Co-authored-by: zouyida <zouyida@huawei.com>
2025-03-26 14:28:20 +08:00
Mengqing Cao
d4accf4ec2 [Doc][Model] update LLaVA 1.6 support (#373)
update LLaVA 1.6 support

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-03-26 09:07:55 +08:00
Mengqing Cao
6295d2e9bc [CI/Build][Doc] upgrade torch-npu to 0320 (#392)
### What this PR does / why we need it?
This pr upgrades torch-npu to 0320, so that #321,
https://github.com/vllm-project/vllm-ascend/issues/267#issuecomment-2745045743
could be fixed, and #372 should be reverted after this pr

### Does this PR introduce _any_ user-facing change?
upgrade torch-npu to 0320

### How was this patch tested?
Tested locally with long-sequence inference.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-03-26 09:04:12 +08:00