Commit Graph

1245 Commits

Author SHA1 Message Date
Chenguang Li
202b39a38c Ray Worker Ops Optimization (#136)
### What this PR does / why we need it?
In the case where `backend = ray`, only the main process completes the
`forward_oot` call, while the other worker processes call
`forward_native`. (This bug should also exist when `backend = mp`.)
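
A minimal, self-contained sketch of how such a split can arise. It does not use vLLM's real classes: `ToyOp` and the `platform_is_oot` flag are hypothetical, and the sketch assumes the forward path is chosen once per process from process-local state.

```python
class ToyOp:
    """Toy stand-in for an op with a native and an out-of-tree (OOT) path."""

    def __init__(self, platform_is_oot: bool):
        # Assumption: the forward method is picked once, from state that is
        # local to the process constructing the op.
        self._forward = self.forward_oot if platform_is_oot else self.forward_native

    def forward_native(self, x):
        return ("native", x)

    def forward_oot(self, x):
        return ("oot", x)

    def __call__(self, x):
        return self._forward(x)


# Main process: the OOT platform is known, so the OOT path is used.
main_op = ToyOp(platform_is_oot=True)
# Worker process without that state: falls back to the native path.
worker_op = ToyOp(platform_is_oot=False)

print(main_op(1)[0], worker_op(1)[0])
```

Under that assumption, every worker process needs the same registration state as the main process before its ops are constructed.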

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?
**Environment:**

CANN: 8.0.0
PyTorch: 2.5.1
Torch: 2.5.1rc1
python: 3.10
vllm: branch main 
vllm-ascend: branch main 
The current implementation avoids the Ray Worker initialization issue,
as addressed in the
[PR](https://github.com/vllm-project/vllm-ascend/pull/92). Then, during
the `forward_oot` call, logging will be performed.

**Script:**

```bash
python examples/offline_distributed_inference_npu.py
```

**Result:**
```bash
(NPURayWorkerWrapper pid=3984223) forward_oot run. #############################################
(NPURayWorkerWrapper pid=3984223) forward_oot run. #############################################
(NPURayWorkerWrapper pid=3984223) forward_oot run. #############################################
(NPURayWorkerWrapper pid=3984223) forward_oot run. #############################################
(NPURayWorkerWrapper pid=3984223) forward_oot run. #############################################
forward_oot run. #############################################
forward_oot run. #############################################
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00,  1.96s/it, est. speed input: 2.80 toks/s, output: 51.00 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Alex and I am a 16 year old male. I have been diagnosed with a rare genetic disorder called X-linked recessive. I have been told that I will not be able to have children. I have been told that I will not be able to have children because of the X-linked recessive disorder. I have been told that I will not be able to have children because of the X-linked recessive disorder. I have been told that I will not be able to have children because of'
Prompt: 'The president of the United States is', Generated text: ' Statesman. He is the leader of the country. He is the one who makes the decisions. He is the one who makes the laws. He is the one who makes the rules. He is the one who makes the country strong. He is the one who makes the country happy. He is the one who makes the country safe. He is the one who makes the country free. He is the one who makes the country beautiful. He is the one who makes the country great. He is'
Prompt: 'The capital of France is', Generated text: ' the city of Paris. It is the largest city in France and the second largest city in Europe. It is located in the center of the country, in the south of the country. It is situated on the banks of the Seine River, which flows through the city. The city is surrounded by the Alps and the Pyrenees mountains. The city is also surrounded by the Mediterranean Sea. The city is known for its beautiful architecture, its museums, its parks, and its food. Paris is'
Prompt: 'The future of AI is', Generated text: ' following the path of the internet, and the internet is following the path of the web. The web is a network of interconnected web pages, and the internet is a network of interconnected computers. The web is a network of interconnected computers, and the internet is a network of interconnected computers. The web is a network of interconnected computers, and the internet is a network of interconnected computers. The web is a network of interconnected computers, and the internet is a network of interconnected computers. The web is a network'
```

---------

Signed-off-by: Chenguang Li <757486878@qq.com>
2025-02-21 22:45:15 +08:00
whx
386817b4d1 [Model Runner][Performance] Cache the judgment result of is_encoder_decoder to decrease framework overhead (#138)
In Model Runner, is_encoder_decoder is extracted from model_config to
determine whether vllm is running an enc-dec model. Obtaining this
status requires a long call stack, so the CPU overhead is high. This
PR caches the status in __init__ of ModelInputForNPUBuilder.
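
A minimal sketch of the caching idea, assuming the status does not change after construction. `ToyModelConfig` and `ToyInputBuilder` are hypothetical stand-ins, not the real vllm-ascend classes.

```python
class ToyModelConfig:
    """Hypothetical config whose property is expensive to evaluate."""

    def __init__(self, is_enc_dec: bool):
        self._is_enc_dec = is_enc_dec
        self.lookups = 0  # counts how often the slow path runs

    @property
    def is_encoder_decoder(self) -> bool:
        self.lookups += 1  # stands in for the long call stack
        return self._is_enc_dec


class ToyInputBuilder:
    def __init__(self, model_config: ToyModelConfig):
        # Evaluate the property once and keep the plain bool.
        self.is_encoder_decoder = model_config.is_encoder_decoder

    def build(self, n_steps: int) -> int:
        # The hot loop reads the cached attribute, not the config property.
        return sum(1 for _ in range(n_steps) if not self.is_encoder_decoder)


config = ToyModelConfig(is_enc_dec=False)
builder = ToyInputBuilder(config)
builder.build(1000)
print(config.lookups)  # the slow lookup ran only during construction
```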

Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
2025-02-21 22:43:11 +08:00
Yikun Jiang
d21b3be685 Mark v0.7.1 as unmaintained and v0.7.3 as maintained (#139)
### What this PR does / why we need it?
Mark v0.7.1 as unmaintained and v0.7.3 as maintained:
vLLM released the v0.7.3 version:
https://github.com/vllm-project/vllm/releases/tag/v0.7.3 which includes
several commits:
- https://github.com/vllm-project/vllm/pull/12874
- https://github.com/vllm-project/vllm/pull/12432
- https://github.com/vllm-project/vllm/pull/13208

We should bump the versions to v0.7.3.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-21 22:41:44 +08:00
Yikun Jiang
72a43a61d8 [Docs] Add issue template (#113)
### What this PR does / why we need it?
Add issue templates.

Most of the templates in this PR are from vllm-project/vllm:
https://github.com/vllm-project/vllm/tree/main/.github/ISSUE_TEMPLATE

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Tested on my local repo by setting the default branch to ISSUE_TEMPLATE:
https://github.com/Yikun/vllm-ascend/issues

https://github.com/Yikun/vllm-ascend/issues/new/choose

Closes: https://github.com/vllm-project/vllm-ascend/issues/48

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-21 17:20:21 +08:00
Mengqing Cao
dd425d68f8 [Platform] add dispatch key (#17)
### What this PR does / why we need it?
Add a dispatch key for NPU so that the log is printed correctly.

Now
```
executor_base.py:110] # CPU blocks: 220478, # CPU blocks: 21845
```

After this PR
```
executor_base.py:110] # NPU blocks: 220478, # CPU blocks: 21845
```

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed and log printed as above

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-21 17:10:30 +08:00
wangxiyuan
5f465010de [Core] Cherry pick from 0.7.1 to keep the main code newest (#127)
Cherry pick from 0.7.1 to keep the main code newest

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-02-21 17:07:37 +08:00
Mengqing Cao
36991b2052 [CI] enable CI on all branch (#124)
Enable CI on all branches.
Install torch-npu-2.5.1.dev20250218 so that CI can be enabled on all
branches, in preparation for merging 0.7.1-dev into main.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-21 16:16:48 +08:00
HongtaoYang
fd2cc1b883 [Docs] Add Tutorials for Online Serving on Multi Machine (#120)
Add Tutorials for Online Serving on Multi Machine

---------

Signed-off-by: SidaoY <1024863041@qq.com>
Co-authored-by: yx0716 <jinyx1007@foxmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-02-21 11:03:00 +08:00
Yikun Jiang
3a4ce2aa15 [Docs] Fix vllm and vllm-ascend version (#107)
### What this PR does / why we need it?

Fix vllm and vllm-ascend version 

| branch/tag | vllm_version | vllm_ascend_version | pip_vllm_ascend_version | pip_vllm_version |
|----|----|----|----|----|
| main | main | main | v0.7.1rc1 | v0.7.1 |
| v0.7.1-dev | v0.7.1 | v0.7.1rc1 | v0.7.1rc1 | v0.7.1 |
| v0.7.1rc1 | v0.7.1 | v0.7.1rc1 | v0.7.1rc1 | v0.7.1 |

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-20 11:05:35 +08:00
wangxiyuan
cff03a4913 [CI] change to quay.io (#102)
change docker registry to quay

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-02-19 17:04:46 +08:00
wangxiyuan
fafd70e91c [Doc] Update doc to work with release (#85)
1. Update CANN image name
2. Add pta install step
3. update vllm-ascend docker image name to ghcr
4. update quick_start to use vllm-ascend image directly.
5. fix `note` style

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-02-19 09:51:43 +08:00
Yikun Jiang
17de078d83 [Docs] Add dynamic version in docs (#90)
### What this PR does / why we need it?
Add dynamic version in docs

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview: https://vllm-ascend--90.org.readthedocs.build/en/90/

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-19 08:57:27 +08:00
Mengqing Cao
c18fb09b55 [MISC] set default model to qwen in example (#87)
- Set default model to Qwen2.5-0.5B-Instruct in example
- Remove Ultravox 0.3 because it is not tested currently

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-18 17:09:59 +08:00
Huazhong Ji
8ea8523744 reset default block_size from 16 to 128 (#84)
### What this PR does / why we need it?

Changed default block_size in platform.py from 16 to 128, as Ascend
Devices have a better affinity for block size 128.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: hzji210@gmail.com <hzji210@gmail.com>
2025-02-18 14:19:38 +08:00
wangxiyuan
7606977739 [Doc] Add release note (#59)
Add release note template and init the first release note content

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-18 11:20:06 +08:00
Yikun Jiang
7cc024a2d3 [Docs] Refactor installation doc (#78)
### What this PR does / why we need it?
Refactor installation doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI, preview

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-17 22:12:07 +08:00
Shanshan Shen
7c8bdc3a18 [Doc] Update tutorials (#79)
### What this PR does / why we need it?

Update tutorials.

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?
no.

---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
2025-02-17 22:11:04 +08:00
Shanshan Shen
2a678141d4 [Doc] Add vllm-ascend usage doc & fix doc format (#53)
### What this PR does / why we need it?
1. Add a vllm-ascend tutorial doc for serving the
Qwen/Qwen2.5-7B-Instruct model
2. Fix the format of files in the `docs` dir, e.g. format tables, add
underlines for links, add line feeds...

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?
doc CI passed

---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
2025-02-17 18:37:29 +08:00
Mengqing Cao
c935b7006c [doc] fix feature support (#70)
Check and update the feature support table.

- both multi-step and speculative decoding require adaptation of the corresponding workers
- prompt adapter (finetune method) requires adaptation in worker.py and model_runner.py

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-17 15:43:37 +08:00
Niuya
36ea38fde5 [CI]add file to pytest.ini (#61)
### What this PR does / why we need it?
Add file to pytest.ini. Ignore some quantization methods.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
pytest tests/xxx

Signed-off-by: ShiyaNiu <1025125896@qq.com>
2025-02-17 14:26:04 +08:00
Yikun Jiang
a6f91f70b7 [Doc] Add versioning_policy doc (#62)
### What this PR does / why we need it?

This patch adds the versioning policy doc for vllm-ascend

Reference:
- https://spark.apache.org/versioning-policy.html
- https://docs.openstack.org/project-team-guide/stable-branches.html
- https://github.com/pytorch/pytorch/blob/main/RELEASE.md

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
preview: https://vllm-ascend--62.org.readthedocs.build/en/62/

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-17 14:13:28 +08:00
Mengqing Cao
4544e99d88 [dist] revert communicator patch (#66)
### What this PR does / why we need it?
Revert communicator patch as
https://github.com/vllm-project/vllm/pull/13208 has been merged.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
test locally by
https://github.com/vllm-project/vllm-ascend/pull/30#issuecomment-2650251266

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-17 11:42:33 +08:00
Yikun Jiang
bfbfbce184 [CI] Add container image build ci (#64)
### What this PR does / why we need it?
Add container image build ci:
- Enable branch, tag docker image publish
    - branch image: `vllm-ascend:main`, `vllm-ascend:v0.7.1-dev`
    - tag image: `vllm-ascend:v0.7.1rc1`
- Enable PR docker image build check
- other changes:
    - Prepare the `REPO_OWNER` because ghcr requires lowercase
- Add `Free up disk space` step to avoid `No space left on device` like
https://github.com/vllm-project/vllm-ascend/issues/27
- Setup qemu with image to resolve
https://github.com/docker/setup-qemu-action/issues/198

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
build: CI passed [push
false](https://github.com/vllm-project/vllm-ascend/actions/runs/13347017608/job/37278724158?pr=64)
Note for test case:
1. merge commits to `main`, `v0.7.1-dev` branch 
 main: https://github.com/Yikun/vllm-ascend/actions/runs/13347238961
--> ghcr.io/yikun/vllm-ascend:main OK
v0.7.1-dev:
https://github.com/Yikun/vllm-ascend/actions/runs/13347229912 -->
ghcr.io/yikun/vllm-ascend:v0.7.1-dev OK

2. create pep440 tag from github release: v0.7.1rc1, v0.7.1,
v0.7.1rc1.dev1 all release has latest
 v0.7.5 --> v0.7.5, latest 
 v0.7.5rc1 --> v0.7.5rc1
 v0.7.5rc1.dev1 --> v0.7.5rc1.dev1
 (no latest, add a todo here) v0.7.5rc1.post1 --> v0.7.5rc1.post1

3. create unknown tag from github release:
 create 0.7.1 on v0.7.1-dev: not triggered (only prefix v is triggered)

4. create tag from git: 
 also works, `git tag v0.7.99;git push origin v0.7.99` from
publish-image
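
The tag behaviour listed in the test notes above can be summarized in a small sketch. This is an illustration only, not the workflow's actual implementation: `image_tags` and the regular expression are hypothetical and encode just the cases listed (final releases also get `latest`, pre/dev/post releases do not, tags without the `v` prefix do not trigger).

```python
import re

# A final release looks like v1.2.3; any suffix marks a pre/dev/post release.
FINAL_RELEASE = re.compile(r"^v\d+\.\d+\.\d+$")


def image_tags(git_tag: str) -> list[str]:
    """Return the image tags to publish for a git tag (toy version)."""
    if not git_tag.startswith("v"):
        return []  # tags without the "v" prefix do not trigger a publish
    tags = [git_tag]
    if FINAL_RELEASE.match(git_tag):
        tags.append("latest")  # only final releases move "latest"
    return tags


for tag in ["v0.7.5", "v0.7.5rc1", "v0.7.5rc1.dev1", "v0.7.5rc1.post1", "0.7.1"]:
    print(tag, "->", image_tags(tag))
```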

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-17 09:07:35 +08:00
Yikun Jiang
c1ac822642 [CI] Switch to cann latest version (#63)
### What this PR does / why we need it?
Switch to cann latest version

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-16 13:38:01 +08:00
Mengqing Cao
b88443b6c6 [dist] fix communicator patch (#58)
### What this PR does / why we need it?
Fix the communicator patch so that parallel execution works.
see #52

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-14 10:45:49 +08:00
wangxiyuan
e264987af2 [Doc] Add install doc (#49)
Add official install guide.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-02-14 10:22:15 +08:00
Yikun Jiang
46977f9f06 [Doc] Add sphinx build for vllm-ascend (#55)
### What this PR does / why we need it?

This patch enables the doc build for vllm-ascend

- Add sphinx build for vllm-ascend
- Enable readthedocs for vllm-ascend
- Fix CI:
  - exclude vllm-empty/tests/mistral_tool_use to skip `You need to agree
    to share your contact information to access this model`, which was
    introduced in 314cfade02
  - Install test req to fix
    https://github.com/vllm-project/vllm-ascend/actions/runs/13304112758/job/37151690770:
      ```
      vllm-empty/tests/mistral_tool_use/conftest.py:4: in <module>
          import pytest_asyncio
      E   ModuleNotFoundError: No module named 'pytest_asyncio'
      ```
  - exclude docs PR

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
1. test locally:
    ```bash
    # Install dependencies.
    pip install -r requirements-docs.txt
    
    # Build the docs and preview
    make clean; make html; python -m http.server -d build/html/
    ```
    
    Launch browser and open http://localhost:8000/.

2. CI passed with preview:
    https://vllm-ascend--55.org.readthedocs.build/en/55/

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-13 18:44:17 +08:00
Yikun Jiang
63b11ec7e9 [Doc] Add Quickstart doc (#44)
### What this PR does / why we need it?
This PR adds the quickstart doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-13 16:29:36 +08:00
Huazhong Ji
c8b57d10b2 [Misc] update the dependency version of torch-npu (#50)
### What this PR does / why we need it?

This PR updates the torch-npu dependency version of vllm-ascend, so
that vllm-ascend can be installed in an environment with a later version
(such as torch-npu 2.6.0rc1).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI Test

Signed-off-by: ji-huazhong <hzji210@gmail.com>
2025-02-12 15:50:38 +08:00
Yikun Jiang
28d7691361 [FOLLOWUP][Misc] Remove unused mypy config for base_communicator (#45)
### What this PR does / why we need it?
- Remove the communicator mypy config to address:
https://github.com/vllm-project/vllm-ascend/pull/24#issuecomment-2647696781
- Add mypy.ini to trigger list

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-12 09:17:05 +08:00
Mengqing Cao
86796cf2dd [Misc][Build] Fix packages for finding submodule (#42)
### What this PR does / why we need it?
Fix package discovery so that submodules are found.
Before this PR, the wheel built by pip did not contain the submodule `ops`,
which raised an `ImportError` when importing vllm.


### How was this patch tested?
1. build vllm-ascend wheel by pip
```bash
cd ./vllm-ascend
pip wheel ./ --no-deps
pip install vllm_ascend-0.1.dev11+g07f2a16.d20250211-py3-none-any.whl  # change the file name to match your wheel
```
2. check vllm
```python
import vllm
```

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-11 20:17:43 +08:00
wangxiyuan
f762ee89cc [Communicator] Add monkey patch (#30)
Some PRs for plugin support have not been merged by vllm yet. This PR adds
a monkey patch to vllm-ascend so that vllm-ascend works with vllm directly.

This patch code should be removed once the related functionality is
supported natively by vllm.
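
For readers unfamiliar with the technique, a monkey patch replaces an attribute of an already-loaded module at runtime. The sketch below is generic and does not reproduce the actual patch: the `upstream` namespace, `get_communicator`, and the returned strings are all hypothetical.

```python
import types

# Stand-in for an upstream module that has no plugin hook yet.
upstream = types.SimpleNamespace()
upstream.get_communicator = lambda device: "default-communicator"

# Keep a reference to the original so other devices are unaffected.
_original_get_communicator = upstream.get_communicator


def patched_get_communicator(device):
    """Return a device-specific communicator, else defer to the original."""
    if device == "npu":
        return "npu-communicator"
    return _original_get_communicator(device)


# Apply the patch: callers of upstream.get_communicator now see the new logic.
upstream.get_communicator = patched_get_communicator

print(upstream.get_communicator("npu"), upstream.get_communicator("cuda"))
```

Keeping the original callable around makes the patch easy to remove once upstream provides the hook.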

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-02-11 19:15:35 +08:00
Yikun Jiang
eb189aac81 Followup fix on official doc update (#34)
### What this PR does / why we need it?
- Fix typos: vllm-ascned --> vllm-ascend
- For version info

### Does this PR introduce _any_ user-facing change?
No


### How was this patch tested?
preview

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-11 14:28:26 +08:00
wangxiyuan
51eadc68b9 [Docs] Add official doc index (#29)
Add official doc index. Move the release content to the right place.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-02-11 12:00:27 +08:00
Mengqing Cao
7006835977 [attn] fix device of tensors in attention (#25)
### What this PR does / why we need it?
Fix device of tensors created in `AscendAttentionBackendImpl`.

When a card other than card-0 is specified as the device, a
**device conflict** occurs because the tensors (such as `attn_mask`) are
put on card-0 by default.

This pr creates these tensors on the correct card corresponding to the
input.

### Does this PR introduce _any_ user-facing change?
With this PR, users can specify the device by local rank. A change to
vLLM is also needed; it will be linked to this PR once it is created.

### How was this patch tested?
This was tested locally with the following code. A test case will be added
once the change in vLLM is also completed.
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(model="~/.cache/modelscope/hub/Qwen/Qwen2___5-7B-Instruct", device="npu:1")

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-10 19:20:29 +08:00
wangxiyuan
c59375caff [Misc] version control by setuptools_scm (#21)
Manage the package version with setuptools_scm, consistent with
vllm

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-02-10 09:36:09 +08:00
Yikun Jiang
88714969d4 [Doc] Replace logo official link and update contrib doc (#22)
### What this PR does / why we need it?
Replace logo official link and update contrib doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview:
- 336055be1a/README.zh.md
- 336055be1a/README.md

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-08 15:01:03 +08:00
whx
8fc5dc966a [Worker] Register mindie_turbo while initializing NPUWorker (#13)
Add `try_register_lib` and import mindie-turbo when init.
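
A minimal sketch of what an optional-import helper of this kind can look like. The commit message only names `try_register_lib`; the signature, return value, and logging below are assumptions, not the actual implementation.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def try_register_lib(lib_name: str, lib_info: str = "") -> bool:
    """Import lib_name if it is installed; report whether the import worked."""
    try:
        importlib.import_module(lib_name)
    except ImportError:
        # The optional dependency is missing: carry on without it.
        return False
    if lib_info:
        logger.info(lib_info)
    return True


print(try_register_lib("json"))
print(try_register_lib("surely_not_an_installed_module_xyz"))
```

Swallowing the `ImportError` keeps the worker usable when the optional library is not installed.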

---------

Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
2025-02-07 16:47:17 +08:00
zouyida2002
4495fc6838 bugfix for mrope (#14)
A follow-up fix for the [MRotaryEmbedding
change](bf3b79efb8 (diff-6bc44986c91bf0876240dec03d56c748403691c7fcd90f7a22e7affff7b033ecR839))

Signed-off-by: z00897138 <zhaorifa@huawei.com>
Co-authored-by: z00897138 <zhaorifa@huawei.com>
2025-02-07 16:46:24 +08:00
Mengqing Cao
7d9ae22ecb [CI] use pytest.ini to manage vllm native tests (#5)
### What this PR does / why we need it?
Use `pytest.ini` to manage vllm native tests.
This will convert the original test script whitelist to a blacklist to
prevent missing the newly added test scripts of the upstream vLLM.
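
The difference between the two selection strategies can be shown with a toy example. The file names and helper functions are hypothetical and unrelated to the real `pytest.ini` contents.

```python
def select_whitelist(all_tests, allow):
    """Run only tests that were explicitly listed."""
    return [t for t in all_tests if t in allow]


def select_blacklist(all_tests, deny):
    """Run everything except tests that were explicitly excluded."""
    return [t for t in all_tests if t not in deny]


upstream = ["test_a.py", "test_b.py", "test_cuda_only.py"]
allow = {"test_a.py", "test_b.py"}
deny = {"test_cuda_only.py"}

# Upstream adds a new test file after the lists were written.
upstream.append("test_new.py")

print(select_whitelist(upstream, allow))  # the new file is silently skipped
print(select_blacklist(upstream, deny))   # the new file is picked up
```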

**note**: _we do **not** manage the test scripts of vLLM-Ascend in
`pytest.ini`, because if we do so, there will be conflicts between vLLM
and vLLM-Ascend's `conftest.py`._
### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new and existing tests.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-06 23:57:51 +08:00
Li Wang
8cb5615fb0 [Doc]Add chinese doc (#10)
### What this PR does / why we need it?
This PR adds Chinese documents for vllm-ascend for Chinese-speaking
developers

### Does this PR introduce _any_ user-facing change?
Change as follows
- add README.zh.md
- add environment.zh.md
- add CONTRIBUTING.zh.md

### How was this patch tested?
By CI

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-02-06 14:49:43 +08:00
wangxiyuan
a48b9addef [Doc] Update Readme (#11)
a48b9addef [Doc] Update Readme (#11)
### What this PR does / why we need it?
Add feature and model support matrix

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI test is enough

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-02-06 14:08:44 +08:00
Shanshan Shen
bfccf739e2 [ModelRunner] Refactor model_runner for NPU (#6)
### What this PR does / why we need it?

This PR is a refactoring of model runner, to decouple it from the
classes specifically designed for GPU.

The changes to the model runner are shown below:

![iShot_2025-01-20_21 32 37](https://github.com/user-attachments/assets/e7e14e5f-5367-42cf-bc82-abff35cd73b9)

**Other changes:** I have removed the code for `cuda`, `lora` and
`prompt adapter`, because NPU doesn't support them now.

### Does this PR introduce _any_ user-facing change?

no.

### How was this patch tested?

I have used `AI-ModelScope/gpt2` for testing
`examples/offline_inference_npu.py`, and the results showed that it
worked well.

The test logs are shown below:

```bash
INFO 02-05 09:08:46 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 02-05 09:08:46 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 02-05 09:08:46 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 02-05 09:08:46 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 02-05 09:08:46 __init__.py:44] plugin ascend loaded.
INFO 02-05 09:08:46 __init__.py:177] Platform plugin ascend is activated
INFO 02-05 09:08:48 config.py:2383] Downcasting torch.float32 to torch.float16.
INFO 02-05 09:08:59 config.py:542] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 02-05 09:08:59 llm_engine.py:234] Initializing a V0 LLM engine (v0.1.dev1+gb3a0d01) with config: model='/home/sss/models/AI-ModelScope/gpt2', speculative_config=None, tokenizer='/home/sss/models/AI-ModelScope/gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/sss/models/AI-ModelScope/gpt2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, 
WARNING 02-05 09:09:01 _custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 02-05 09:09:01 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.18it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.18it/s]

INFO 02-05 09:09:11 executor_base.py:110] # CPU blocks: 98557, # CPU blocks: 7281
INFO 02-05 09:09:11 executor_base.py:115] Maximum concurrency for 1024 tokens per request: 1539.95x
INFO 02-05 09:09:12 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 2.13 seconds
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.53it/s, est. speed input: 8.41 toks/s, output: 152.97 toks/s]
Prompt: 'Hello, my name is', Generated text: " John. I'm a writer, and I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm"
Prompt: 'The president of the United States is', Generated text: ' States president. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United'
Prompt: 'The capital of France is', Generated text: ' the capital of the French Republic, and the capital of the French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.'
Prompt: 'The future of AI is', Generated text: '\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future'
```

---------

Signed-off-by: Shanshan Shen <467638484@qq.com>
2025-02-06 09:04:18 +08:00
Yikun Jiang
d5e7756028 [Core] Init vllm-ascend (#3)
### What this PR does / why we need it?
vLLM Ascend plugin (vllm-ascend) is a backend plugin for running vLLM on
the Ascend NPU.

This plugin is the recommended approach for supporting the Ascend
backend within the vLLM community. It adheres to the principles outlined
in the [RFC]: Hardware pluggable, providing a hardware-pluggable
interface that decouples the integration of the Ascend NPU with vLLM.

This patch also includes changes to make CI work and uses caching to speed
up the e2e test, including:
1. Change push (post merge ci) and pull_request (pr ci) trigger branch
   to main
2. Make mypy work by ignoring base_communicator and clearing unused deps
3. Several improvements for vllm_ascend_test:
   - use cache (pip, ms, hf) to speed up e2e test (25mins --> 5mins)
   - switch `git clone` command to `action/checkout` to speed up checkout
     and
   - Enable sv for pytest for better info dump
   - Remove network host to resolve `docker: conflicting options: cannot
     attach both user-defined and non-user-defined network-modes`, which is a
     problem on docker 1.45 but not on 1.39.
4. Adapt MLA decode optimizations:
   cabaf4eff3

### Does this PR introduce _any_ user-facing change?
Yes, init the PR.

### How was this patch tested?
- This is the first PR to make ascend NPU work on vLLM. All code is
tested on ascend with vLLM V0 Engine.
- CI passed

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: wangshuai09 <391746016@qq.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
2025-02-05 10:53:12 +08:00
Simon Mo
eb283428dd Initial commit
2025-01-29 02:44:13 -08:00