Shanshan Shen c06af8b2e0 [V1][Core] Add support for V1 Engine (#295)

### What this PR does / why we need it?
Add support for V1 Engine.

Please note that this is only the initial version, and some parts may
still need to be fixed or optimized. Feel free to leave us comments.

### Does this PR introduce _any_ user-facing change?

To use the V1 Engine on an NPU device, set the environment variables
shown below:

```bash
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
```

If you are using vLLM for offline inference, you must add a `__main__`
guard like:

```python
import vllm

if __name__ == '__main__':
    llm = vllm.LLM(...)
```

Find more details
[here](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing).
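
The two requirements above can be combined in a single offline script. The following is a minimal sketch, not the project's reference usage: `configure_v1_env` and `main` are hypothetical helper names, the model and `max_model_len` value are borrowed from the test command in this PR, and it assumes vLLM reads the variables after they are set in the process environment.

```python
import importlib.util
import os


def configure_v1_env(env=None):
    """Set the two variables from this PR, keeping any value already exported."""
    env = os.environ if env is None else env
    env.setdefault("VLLM_USE_V1", "1")
    env.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")
    return env


def main():
    configure_v1_env()
    # Skip inference when vllm is not installed, so the sketch still runs.
    if importlib.util.find_spec("vllm") is None:
        print("vllm is not installed; environment configured only")
        return
    # Import after the environment is configured (assumption, see lead-in).
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", max_model_len=26240)
    params = SamplingParams(temperature=0, max_tokens=7)
    for output in llm.generate(["The future of AI is"], params):
        print(output.outputs[0].text)


# With the spawn start method, worker processes re-import this module,
# so engine creation must stay behind the guard.
if __name__ == "__main__":
    main()
```

Using `setdefault` is a design choice: values exported in the shell take precedence over the defaults in the script.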

### How was this patch tested?
I tested online serving with `Qwen2.5-7B-Instruct` using this
command:

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```

Query the model with input prompts:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 7,
        "temperature": 0
    }'
```
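
The same query can be sent from Python with only the standard library. This is a sketch under the assumption that the server started above is listening on `localhost:8000`; `build_completion_request` and `complete` are hypothetical helper names, and the response is assumed to follow the OpenAI-compatible completions schema (`choices[0].text`).

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="Qwen/Qwen2.5-7B-Instruct",
                             max_tokens=7, temperature=0):
    """Build the POST request equivalent to the curl command above."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server running, `complete("The future of AI is")` returns the completion text. Building the request is kept separate from sending it so the payload can be inspected without a live server.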

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: didongli182 <didongli@huawei.com>

2025-03-20 19:34:44 +08:00


Release note

v0.7.3rc1

🎉 Hello, World! This is the first release candidate of v0.7.3 for vllm-ascend. Please follow the official doc to start the journey.

Quickstart with container: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/quick_start.html
Installation: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/installation.html

Highlights

DeepSeek V3/R1 works well now. Read the official guide to start! #242
Speculative decoding is now supported. #252
The multi-step scheduler is now supported. #300

Core

Bump torch_npu version to dev20250308.3 to improve _exponential accuracy
Added initial support for pooling models. BERT-based models, such as BAAI/bge-base-en-v1.5 and BAAI/bge-reranker-v2-m3, now work. #229

Model

The performance of Qwen2-VL is improved. #241
MiniCPM is now supported. #164

Other

Support MTP (Multi-Token Prediction) for DeepSeek V3/R1. #236
[Docs] Added more model tutorials, including DeepSeek, QwQ, Qwen and Qwen 2.5VL. See the official doc for details.
Pin modelscope<1.23.0 on vLLM v0.7.3 to resolve: https://github.com/vllm-project/vllm/pull/13807

Known issues

In some cases, especially when the input or output is very long, the output may be inaccurate. We are working on it, and it will be fixed in the next release.
Garbled text in model output has been reduced. If you still hit the issue, try changing a generation config value, such as temperature, and try again. There is also a known issue shown below. Any feedback is welcome. #277

v0.7.1rc1

🎉 Hello, World!

We are excited to announce the first release candidate of v0.7.1 for vllm-ascend.

vLLM Ascend Plugin (vllm-ascend) is a community maintained hardware plugin for running vLLM on the Ascend NPU. With this release, users can now enjoy the latest features and improvements of vLLM on the Ascend NPU.

Please follow the official doc to start the journey. Note that this is a release candidate, and there may be some bugs or issues. We appreciate your feedback and suggestions here.

Highlights

Initial support for Ascend NPU on vLLM. #3
DeepSeek is now supported. #88 #68
Qwen, the Llama series, and other popular models are also supported. You can see more details here.

Core

Added the Ascend quantization config option; the implementation is coming soon. #7 #73
Added silu_and_mul and rope ops, and added mix ops to the attention layer. #18

Other

[CI] Enable Ascend CI to actively monitor and improve quality for vLLM on Ascend. #3
[Docker] Add vllm-ascend container image #64
[Docs] Add a live doc #55

Known issues

This release relies on an unreleased torch_npu version. It is already installed in the official container image. Please install it manually if you are using a non-container environment.
Logs such as `No platform detected, vLLM is running on UnspecifiedPlatform` or `Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")` may appear when running vllm-ascend. They do not affect functionality or performance and can be ignored. This has been fixed in this PR, which will be included in v0.7.3 soon.
Logs such as `# CPU blocks: 35064, # CPU blocks: 2730` may appear when running vllm-ascend; the label should read `# NPU blocks:`. This does not affect functionality or performance and can be ignored. It has been fixed in this PR, which will be included in v0.7.3 soon.
