1275 Commits

Author SHA1 Message Date
Yikun Jiang
a6f91f70b7 [Doc] Add versioning_policy doc (#62)
### What this PR does / why we need it?

This patch adds the versioning policy doc for vllm-ascend.

Reference:
- https://spark.apache.org/versioning-policy.html
- https://docs.openstack.org/project-team-guide/stable-branches.html
- https://github.com/pytorch/pytorch/blob/main/RELEASE.md

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
preview: https://vllm-ascend--62.org.readthedocs.build/en/62/

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-17 14:13:28 +08:00
Mengqing Cao
4544e99d88 [dist] revert communicator patch (#66)
### What this PR does / why we need it?
Revert communicator patch as
https://github.com/vllm-project/vllm/pull/13208 has been merged.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
test locally by
https://github.com/vllm-project/vllm-ascend/pull/30#issuecomment-2650251266

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-17 11:42:33 +08:00
Yikun Jiang
bfbfbce184 [CI] Add container image build ci (#64)
### What this PR does / why we need it?
Add container image build ci:
- Enable branch, tag docker image publish
    - branch image: `vllm-ascend:main`, `vllm-ascend:v0.7.1-dev`
    - tag image: `vllm-ascend:v0.7.1rc1`
- Enable PR docker image build check
- other changes:
    - Prepare a lowercase `REPO_OWNER`, because ghcr requires lowercase image names
- Add `Free up disk space` step to avoid `No space left on device` like
https://github.com/vllm-project/vllm-ascend/issues/27
- Setup qemu with image to resolve
https://github.com/docker/setup-qemu-action/issues/198

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
build: CI passed [push
false](https://github.com/vllm-project/vllm-ascend/actions/runs/13347017608/job/37278724158?pr=64)
Note for test case:
1. merge commits to the `main` and `v0.7.1-dev` branches
 main: https://github.com/Yikun/vllm-ascend/actions/runs/13347238961
--> ghcr.io/yikun/vllm-ascend:main OK
v0.7.1-dev:
https://github.com/Yikun/vllm-ascend/actions/runs/13347229912 -->
ghcr.io/yikun/vllm-ascend:v0.7.1-dev OK

2. create pep440 tag from github release: v0.7.1rc1, v0.7.1,
v0.7.1rc1.dev1 all release has latest
 v0.7.5 --> v0.7.5, latest 
 v0.7.5rc1 --> v0.7.5rc1
 v0.7.5rc1.dev1 --> v0.7.5rc1.dev1
 (no latest, add a todo here) v0.7.5rc1.post1 --> v0.7.5rc1.post1

3. create unknown tag from github release:
 create 0.7.1 on v0.7.1-dev: not triggered (only tags with the `v` prefix trigger a build)

4. create tag from git: 
 also works, `git tag v0.7.99;git push origin v0.7.99` from
publish-image
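
The tag behaviour observed in the test notes above can be summarised in a small sketch. This is only an illustration of the observed mapping, not the workflow's actual implementation (which lives in the CI configuration); `image_tags` and `image_ref` are hypothetical helper names, and the rule that any suffixed release is excluded from `latest` is an assumption drawn from the cases listed.

```python
import re

# A plain final release such as v0.7.5 (no rc/dev/post suffix).
_FINAL_RELEASE = re.compile(r"v\d+(\.\d+)*")


def image_tags(git_tag: str) -> list[str]:
    """Return the image tags published for a git tag (empty if not triggered)."""
    if not git_tag.startswith("v"):
        # Tags without the "v" prefix (e.g. "0.7.1") do not trigger a publish.
        return []
    tags = [git_tag]
    if _FINAL_RELEASE.fullmatch(git_tag):
        # Only final releases also move the "latest" tag.
        tags.append("latest")
    return tags


def image_ref(owner: str, tag: str, image: str = "vllm-ascend") -> str:
    """Build a ghcr reference; the owner is lowercased because ghcr requires it."""
    return f"ghcr.io/{owner.lower()}/{image}:{tag}"


for tag in ["v0.7.5", "v0.7.5rc1", "v0.7.5rc1.dev1", "v0.7.5rc1.post1", "0.7.1"]:
    print(tag, "-->", image_tags(tag))
print(image_ref("Yikun", "main"))
```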

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-17 09:07:35 +08:00
Yikun Jiang
c1ac822642 [CI] Switch to cann latest version (#63)
### What this PR does / why we need it?
Switch to cann latest version

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-16 13:38:01 +08:00
Mengqing Cao
b88443b6c6 [dist] fix communicator patch (#58)
### What this PR does / why we need it?
fix communicator patch so parallel could work.
see #52

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-14 10:45:49 +08:00
wangxiyuan
e264987af2 [Doc] Add install doc (#49)
Add official install guide.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-02-14 10:22:15 +08:00
Yikun Jiang
46977f9f06 [Doc] Add sphinx build for vllm-ascend (#55)
### What this PR does / why we need it?

This patch enables the doc build for vllm-ascend

- Add sphinx build for vllm-ascend
- Enable readthedocs for vllm-ascend
- Fix CI:
- exclude vllm-empty/tests/mistral_tool_use to skip `You need to agree
to share your contact information to access this model`, which was introduced
in
314cfade02
- Install test req to fix
https://github.com/vllm-project/vllm-ascend/actions/runs/13304112758/job/37151690770:
      ```
      vllm-empty/tests/mistral_tool_use/conftest.py:4: in <module>
          import pytest_asyncio
      E   ModuleNotFoundError: No module named 'pytest_asyncio'
      ```
  - exclude docs PR

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
1. test locally:
    ```bash
    # Install dependencies.
    pip install -r requirements-docs.txt
    
    # Build the docs and preview
    make clean; make html; python -m http.server -d build/html/
    ```
    
    Launch browser and open http://localhost:8000/.

2. CI passed with preview:
    https://vllm-ascend--55.org.readthedocs.build/en/55/

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-13 18:44:17 +08:00
Yikun Jiang
63b11ec7e9 [Doc] Add Quickstart doc (#44)
### What this PR does / why we need it?
This PR adds the quickstart doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-13 16:29:36 +08:00
Huazhong Ji
c8b57d10b2 [Misc] update the dependency version of torch-npu (#50)
### What this PR does / why we need it?

This PR updates the torch-npu dependency version of vllm-ascend, so
that vllm-ascend can be installed in environments with a later version
(such as torch-npu 2.6.0rc1).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI Test

Signed-off-by: ji-huazhong <hzji210@gmail.com>
2025-02-12 15:50:38 +08:00
Yikun Jiang
28d7691361 [FOLLOWUP][Misc] Remove unused mypy config for base_communicator (#45)
### What this PR does / why we need it?
- Remove the unused mypy config for base_communicator to address:
https://github.com/vllm-project/vllm-ascend/pull/24#issuecomment-2647696781
- Add mypy.ini to trigger list

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-12 09:17:05 +08:00
Mengqing Cao
86796cf2dd [Misc][Build] Fix packages for finding submodule (#42)
### What this PR does / why we need it?
Fix the package list so that submodules are found.
Before this PR, the wheel built by pip did not contain the submodule `ops`,
so importing vllm raised an `ImportError`.


### How was this patch tested?
1. build vllm-ascend wheel by pip
```bash
cd ./vllm-ascend
pip wheel ./ --no-deps
pip install vllm_ascend-0.1.dev11+g07f2a16.d20250211-py3-none-any.whl  # change the file name to match your wheel
```
2. check vllm
```python
import vllm
```
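
A more direct check for this kind of packaging problem is to ask the import system whether the submodule is present in the installed package. The sketch below is an assumption about how one might verify it, not part of the original test; `missing_submodules` is a hypothetical helper, and for this PR it would be called as `missing_submodules("vllm_ascend", ["ops"])`.

```python
import importlib.util


def missing_submodules(package: str, submodules: list[str]) -> list[str]:
    """Return the dotted names under `package` that the import system cannot find."""
    missing = []
    for name in submodules:
        dotted = f"{package}.{name}"
        try:
            # find_spec imports the parent package, then looks up the submodule.
            spec = importlib.util.find_spec(dotted)
        except ModuleNotFoundError:
            # Raised when the parent package itself is not installed.
            spec = None
        if spec is None:
            missing.append(dotted)
    return missing


# Demonstrated on a stdlib package so the example runs anywhere.
print(missing_submodules("json", ["decoder", "no_such_submodule"]))
```

An empty list means every expected submodule made it into the installed package.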

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-11 20:17:43 +08:00
wangxiyuan
f762ee89cc [Communicator] Add monkey patch (#30)
Some PRs for plugin support have not been merged into vllm yet. This PR adds a monkey
patch to vllm-ascend so that vllm-ascend works with vllm directly.

This patch code should be removed once the related functionality is supported
natively by vllm.
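
For readers unfamiliar with the technique, a monkey patch replaces an attribute of an already-imported module at runtime. The following is a minimal, generic sketch; the `upstream` module and `get_communicator` function are hypothetical stand-ins and do not correspond to the real vLLM code touched by this PR.

```python
import types

# Stand-in for an upstream module (hypothetical, for illustration only).
upstream = types.ModuleType("upstream")


def _default_get_communicator(device: str) -> str:
    return f"default-communicator({device})"


upstream.get_communicator = _default_get_communicator


def apply_patch() -> None:
    """Swap the upstream function for a plugin-aware one, keeping a fallback."""
    original = upstream.get_communicator

    def patched(device: str) -> str:
        if device.startswith("npu"):
            return f"npu-communicator({device})"
        # Anything the plugin does not handle goes to the original function.
        return original(device)

    upstream.get_communicator = patched


apply_patch()
print(upstream.get_communicator("npu:0"))
print(upstream.get_communicator("cpu"))
```

The patch has to run before other code binds the name with `from upstream import get_communicator`, since such imports keep a reference to the original function.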

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-02-11 19:15:35 +08:00
Yikun Jiang
eb189aac81 Followup fix on official doc update (#34)
### What this PR does / why we need it?
- Fix typos: vllm-ascned --> vllm-ascend
- For version info

### Does this PR introduce _any_ user-facing change?
No


### How was this patch tested?
preview

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-11 14:28:26 +08:00
wangxiyuan
51eadc68b9 [Docs] Add official doc index (#29)
Add official doc index. Move the release content to the right place.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-02-11 12:00:27 +08:00
Mengqing Cao
7006835977 [attn] fix device of tensors in attention (#25)
### What this PR does / why we need it?
Fix device of tensors created in `AscendAttentionBackendImpl`.

When a device other than card-0 is specified, a
**device conflict** occurs because the tensors (such as `attn_mask`) are
put on card-0 by default.

This PR creates these tensors on the card that holds the
input.
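
The general pattern is to derive the device of any helper tensor from the input tensor instead of relying on the default. The sketch below assumes PyTorch is installed and runs on CPU; `build_causal_mask` is a hypothetical helper and not the actual `AscendAttentionBackendImpl` code.

```python
import torch


def build_causal_mask(query: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Build an additive causal mask on the same device and dtype as `query`."""
    mask = torch.full(
        (seq_len, seq_len),
        float("-inf"),
        dtype=query.dtype,
        # Without `device=`, the tensor is created on the default device
        # rather than on the device that holds the input.
        device=query.device,
    )
    # Keep -inf strictly above the diagonal so position i cannot attend to j > i.
    return torch.triu(mask, diagonal=1)


query = torch.zeros(2, 4, 8)  # would live on e.g. "npu:1" in the scenario above
mask = build_causal_mask(query, seq_len=4)
print(mask.device == query.device)
```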

### Does this PR introduce _any_ user-facing change?
With this PR, users can specify the device by local rank. A change to
vLLM is also needed; it will be linked to this PR once created.

### How was this patch tested?
This was tested locally with the following code. A test case will be added once
the corresponding change in vLLM is complete.
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(model="~/.cache/modelscope/hub/Qwen/Qwen2___5-7B-Instruct", device="npu:1")

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-10 19:20:29 +08:00
wangxiyuan
c59375caff [Misc] version control by setuptools_scm (#21)
Manage the package version with setuptools_scm, consistent with
vllm.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-02-10 09:36:09 +08:00
Yikun Jiang
88714969d4 [Doc] Replace logo official link and update contrib doc (#22)
### What this PR does / why we need it?
Replace logo official link and update contrib doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview:
-
336055be1a/README.zh.md
-
336055be1a/README.md

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-08 15:01:03 +08:00
whx
8fc5dc966a [Worker] Register mindie_turbo while initializing NPUWorker (#13)
Add `try_register_lib` and import mindie-turbo when init.
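
The commit does not show the helper itself, so the following is only a sketch of the usual optional-import pattern such a helper might follow. `try_import_optional` is a hypothetical name chosen to avoid implying it matches the real `try_register_lib` signature.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def try_import_optional(lib_name: str) -> bool:
    """Import `lib_name` if it is installed; report success instead of raising."""
    try:
        importlib.import_module(lib_name)
    except ImportError:
        logger.info("Optional library %s not found; continuing without it.", lib_name)
        return False
    logger.info("Optional library %s loaded.", lib_name)
    return True


# Demonstrated with a stdlib module and a name that should not exist.
print(try_import_optional("json"))
print(try_import_optional("no_such_lib_xyz_123"))
```

Calling such a helper during worker initialisation lets an optional accelerator library register itself when present, without making it a hard dependency.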

---------

Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
2025-02-07 16:47:17 +08:00
zouyida2002
4495fc6838 bugfix for mrope (#14)
A follow-up fix for the [MRotaryEmbedding
change](bf3b79efb8 (diff-6bc44986c91bf0876240dec03d56c748403691c7fcd90f7a22e7affff7b033ecR839))

Signed-off-by: z00897138 <zhaorifa@huawei.com>
Co-authored-by: z00897138 <zhaorifa@huawei.com>
2025-02-07 16:46:24 +08:00
Mengqing Cao
7d9ae22ecb [CI] use pytest.ini to manage vllm native tests (#5)
### What this PR does / why we need it?
Use `pytest.ini` to manage vllm native tests.
This converts the original test-script whitelist into a blacklist, so that
test scripts newly added to upstream vLLM are not missed.

**note**: _we do **not** manage the test scripts of vLLM-Ascend in
`pytest.ini`, because if we do so, there will be conflicts between vLLM
and vLLM-Ascend's `conftest.py`._
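
The difference between the two selection strategies can be sketched as follows. This is an illustration of the idea only, not how `pytest.ini` is evaluated; the file names are made up, and "allow"/"deny" correspond to the whitelist/blacklist wording above.

```python
from fnmatch import fnmatch


def select_with_allowlist(discovered: list[str], allow: set[str]) -> list[str]:
    """Run only the tests that are explicitly listed."""
    return [path for path in discovered if path in allow]


def select_with_denylist(discovered: list[str], deny_patterns: list[str]) -> list[str]:
    """Run everything except the tests matching a deny pattern."""
    return [
        path
        for path in discovered
        if not any(fnmatch(path, pattern) for pattern in deny_patterns)
    ]


# Hypothetical upstream test files; the last one was added after the lists were written.
discovered = ["tests/test_a.py", "tests/test_gpu_only.py", "tests/test_new_upstream.py"]

print(select_with_allowlist(discovered, {"tests/test_a.py"}))
print(select_with_denylist(discovered, ["tests/test_gpu_only.py"]))
```

With the allow list, the newly added upstream test is silently skipped; with the deny list, it runs by default.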
### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new and existing tests.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-06 23:57:51 +08:00
Li Wang
8cb5615fb0 [Doc]Add chinese doc (#10)
### What this PR does / why we need it?
This PR adds Chinese documents for vllm-ascend for Chinese-speaking
developers

### Does this PR introduce _any_ user-facing change?
Change as follows
- add README.zh.md
- add environment.zh.md
- add CONTRIBUTING.zh.md

### How was this patch tested?
By CI

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-02-06 14:49:43 +08:00
wangxiyuan
a48b9addef [Doc] Update Readme (#11)
### What this PR does / why we need it?
Add feature and model support matrix

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI test is enough

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-02-06 14:08:44 +08:00
Shanshan Shen
bfccf739e2 [ModelRunner] Refactor model_runner for NPU (#6)
### What this PR does / why we need it?

This PR is a refactoring of model runner, to decouple it from the
classes specifically designed for GPU.

The changes to the model runner are shown below:

![iShot_2025-01-20_21 32
37](https://github.com/user-attachments/assets/e7e14e5f-5367-42cf-bc82-abff35cd73b9)

**Other changes:** I have removed the code for `cuda`, `lora` and `prompt
adapter`, because NPU doesn't support them yet.

### Does this PR introduce _any_ user-facing change?

no.

### How was this patch tested?

I have used `AI-ModelScope/gpt2` for testing
`examples/offline_inference_npu.py`, and the results showed that it
worked well.

The test logs are shown below:

```bash
INFO 02-05 09:08:46 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 02-05 09:08:46 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 02-05 09:08:46 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 02-05 09:08:46 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 02-05 09:08:46 __init__.py:44] plugin ascend loaded.
INFO 02-05 09:08:46 __init__.py:177] Platform plugin ascend is activated
INFO 02-05 09:08:48 config.py:2383] Downcasting torch.float32 to torch.float16.
INFO 02-05 09:08:59 config.py:542] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 02-05 09:08:59 llm_engine.py:234] Initializing a V0 LLM engine (v0.1.dev1+gb3a0d01) with config: model='/home/sss/models/AI-ModelScope/gpt2', speculative_config=None, tokenizer='/home/sss/models/AI-ModelScope/gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/sss/models/AI-ModelScope/gpt2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, 
WARNING 02-05 09:09:01 _custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 02-05 09:09:01 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.18it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.18it/s]

INFO 02-05 09:09:11 executor_base.py:110] # CPU blocks: 98557, # CPU blocks: 7281
INFO 02-05 09:09:11 executor_base.py:115] Maximum concurrency for 1024 tokens per request: 1539.95x
INFO 02-05 09:09:12 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 2.13 seconds
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.53it/s, est. speed input: 8.41 toks/s, output: 152.97 toks/s]
Prompt: 'Hello, my name is', Generated text: " John. I'm a writer, and I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm"
Prompt: 'The president of the United States is', Generated text: ' States president. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United'
Prompt: 'The capital of France is', Generated text: ' the capital of the French Republic, and the capital of the French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.'
Prompt: 'The future of AI is', Generated text: '\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future'
```

---------

Signed-off-by: Shanshan Shen <467638484@qq.com>
2025-02-06 09:04:18 +08:00
Yikun Jiang
d5e7756028 [Core] Init vllm-ascend (#3)
### What this PR does / why we need it?
vLLM Ascend plugin (vllm-ascend) is a backend plugin for running vLLM on
the Ascend NPU.

This plugin is the recommended approach for supporting the Ascend
backend within the vLLM community. It adheres to the principles outlined
in the [RFC]: Hardware pluggable, providing a hardware-pluggable
interface that decouples the integration of the Ascend NPU with vLLM.

This patch also includes changes to make CI work and to use caches to speed up
the e2e test, including:
1. Change push (post merge ci) and pull_request (pr ci) trigger branch
to main
2. Make mypy work by ignoring base_communicator and clearing unused deps
3. Several improvements for vllm_ascend_test:
   - use cache (pip, ms, hf) to speed up e2e test (25mins --> 5mins)
   - switch `git clone` command to `action/checkout` to speed up checkout
and
   - Enable sv for pytest for better info dump
   - Remove network host to resolve `docker: conflicting options: cannot
attach both user-defined and non-user-defined network-modes`, which is a
problem on docker 1.45 but not on 1.39.
4. Adapt MLA decode optimizations:
cabaf4eff3

### Does this PR introduce _any_ user-facing change?
Yes, init the PR.

### How was this patch tested?
- This is the first PR to make ascend NPU work on vLLM. All code is
tested on ascend with vLLM V0 Engine.
- CI passed

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: wangshuai09 <391746016@qq.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
2025-02-05 10:53:12 +08:00
Simon Mo
eb283428dd Initial commit
2025-01-29 02:44:13 -08:00