[Lint] Style: reformat markdown files via markdownlint (#5884)

### What this PR does / why we need it?
reformat markdown files via markdownlint

- vLLM version: v0.13.0
- vLLM main: bde38c11df

---------

Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
SILONG ZENG authored on 2026-01-15 09:06:01 +08:00, committed by GitHub
parent 96edd4673f
commit 4811ba62e0
75 changed files with 711 additions and 308 deletions


@@ -28,9 +28,11 @@ jobs:
- name: cp problem matchers
run: |
cp .github/workflows/matchers/actionlint.json "$RUNNER_TEMP/actionlint.json"
cp .github/workflows/matchers/markdownlint.json "$RUNNER_TEMP/markdownlint.json"
cp .github/workflows/matchers/mypy.json "$RUNNER_TEMP/mypy.json"
- run: echo "::add-matcher::$RUNNER_TEMP/actionlint.json"
- run: echo "::add-matcher::$RUNNER_TEMP/markdownlint.json"
- run: echo "::add-matcher::$RUNNER_TEMP/mypy.json"
- name: Checkout vllm-project/vllm repo


@@ -0,0 +1,17 @@
{
"problemMatcher": [
{
"owner": "markdownlint",
"pattern": [
{
"regexp": "^([^:]*):(\\d+):?(\\d+)?\\s([\\w-\\/]*)\\s(.*)$",
"file": 1,
"line": 2,
"column": 3,
"code": 4,
"message": 5
}
]
}
]
}
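For reference, here is a small Python sketch of how this pattern carves up a typical markdownlint-cli output line. The sample line is illustrative, and the `-` inside the character class is escaped because Python's `re` is stricter than the ECMAScript engine GitHub Actions uses:

```python
import re

# Same capture groups as the matcher above; '-' escaped for Python's re module.
pattern = re.compile(r"^([^:]*):(\d+):?(\d+)?\s([\w\-/]*)\s(.*)$")

sample = "docs/source/quick_start.md:12:1 MD013/line-length Line length [Expected: 80]"
file_, line, column, code, message = pattern.match(sample).groups()
# file_, line, column, code, and message become the GitHub annotation fields.
```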

.markdownlint.yaml (new file, 15 lines)

@@ -0,0 +1,15 @@
MD007:
indent: 4
MD013: false
MD024:
siblings_only: true
MD031:
list_items: false
MD029: false
MD036: false
MD033: false
MD041: false
MD046: false
MD052: false
MD053: false
MD059: false


@@ -26,11 +26,12 @@ repos:
# files: ^csrc/.*\.(cpp|hpp|cc|hh|cxx|hxx)$
# types_or: [c++]
# args: [--style=google, --verbose]
- repo: https://github.com/jackdewinter/pymarkdown
rev: v0.9.29
- repo: https://github.com/igorshubovych/markdownlint-cli
rev: v0.45.0
hooks:
- id: pymarkdown
args: [fix]
- id: markdownlint
exclude: '.*\.inc\.md$|.*report_template\.md$|.*contributors\.md$|.*PULL_REQUEST_TEMPLATE\.md$'
stages: [manual] # Only run in CI
- repo: https://github.com/rhysd/actionlint
rev: v1.7.7
hooks:


@@ -19,6 +19,7 @@ vLLM Ascend Plugin
---
*Latest News* 🔥
- [2025/12] We released the new official version [v0.11.0](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.11.0)! Please follow the [official guide](https://docs.vllm.ai/projects/ascend/en/v0.11.0/) to start using vLLM Ascend Plugin on Ascend.
- [2025/09] We released the new official version [v0.9.1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.9.1)! Please follow the [official guide](https://docs.vllm.ai/projects/ascend/en/v0.9.1/tutorials/large_scale_ep.html) to start deploying large-scale Expert Parallelism (EP) on Ascend.
- [2025/08] We hosted the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/7n8OYNrCC_I9SJaybHA_-Q) with vLLM and Tencent! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF).
@@ -28,7 +29,9 @@ vLLM Ascend Plugin
- [2025/03] We hosted the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/VtxO9WXa5fC-mKqlxNUJUQ) with vLLM team! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF).
- [2025/02] vLLM community officially created [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) repo for running vLLM seamlessly on the Ascend NPU.
- [2024/12] We are working with the vLLM community to support [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162).
---
## Overview
vLLM Ascend (`vllm-ascend`) is a community maintained hardware plugin for running vLLM seamlessly on the Ascend NPU.
@@ -42,10 +45,10 @@ By using vLLM Ascend plugin, popular open-source models, including Transformer-l
- Hardware: Atlas 800I A2 Inference series, Atlas A2 Training series, Atlas 800I A3 Inference series, Atlas A3 Training series, Atlas 300I Duo (Experimental)
- OS: Linux
- Software:
* Python >= 3.10, < 3.12
* CANN == 8.3.rc2 (Ascend HDK version refers to [here](https://www.hiascend.com/document/detail/zh/canncommercial/83RC2/releasenote/releasenote_0000.html))
* PyTorch == 2.8.0, torch-npu == 2.8.0
* vLLM (the same version as vllm-ascend)
- Python >= 3.10, < 3.12
- CANN == 8.3.rc2 (Ascend HDK version refers to [here](https://www.hiascend.com/document/detail/zh/canncommercial/83RC2/releasenote/releasenote_0000.html))
- PyTorch == 2.8.0, torch-npu == 2.8.0
- vLLM (the same version as vllm-ascend)
## Getting Started
@@ -57,9 +60,11 @@ Please use the following recommended versions to get started quickly:
|v0.11.0|Latest stable version|[QuickStart](https://docs.vllm.ai/projects/ascend/en/v0.11.0/quick_start.html) and [Installation](https://docs.vllm.ai/projects/ascend/en/v0.11.0/installation.html) for more details|
## Contributing
See [CONTRIBUTING](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/contribution/index.html) for more details; it is a step-by-step guide to help you set up the development environment, build, and test.
We welcome and value any contributions and collaborations:
- Please let us know if you encounter a bug by [filing an issue](https://github.com/vllm-project/vllm-ascend/issues)
- Please use [User forum](https://discuss.vllm.ai/c/hardware-support/vllm-ascend-support) for usage questions and help.
@@ -86,7 +91,7 @@ Please refer to [Versioning policy](https://docs.vllm.ai/projects/ascend/en/late
## Weekly Meeting
- vLLM Ascend Weekly Meeting: https://tinyurl.com/vllm-ascend-meeting
- vLLM Ascend Weekly Meeting: <https://tinyurl.com/vllm-ascend-meeting>
- Wednesday, 15:00 - 16:00 (UTC+8, [Convert to your timezone](https://dateful.com/convert/gmt8?t=15))
## License


@@ -29,7 +29,9 @@ vLLM Ascend Plugin
- [2025/03] We hosted the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/CGDuMoB301Uytnrkc2oyjg) with the vLLM team! You can find the meetup slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF).
- [2025/02] The vLLM community officially created the [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) repo for running vLLM seamlessly on the Ascend NPU.
- [2024/12] We are working with the vLLM community to support [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162).
---
## Overview
vLLM Ascend (`vllm-ascend`) is a community-maintained backend plugin for running vLLM seamlessly on the Ascend NPU.
@@ -43,10 +45,10 @@ vLLM 昇腾插件 (`vllm-ascend`) 是一个由社区维护的让vLLM在Ascend NP
- Hardware: Atlas 800I A2 Inference series, Atlas A2 Training series, Atlas 800I A3 Inference series, Atlas A3 Training series, Atlas 300I Duo (experimental)
- OS: Linux
- Software:
* Python >= 3.10, < 3.12
* CANN == 8.3.rc2 (Ascend HDK version refers to [here](https://www.hiascend.com/document/detail/zh/canncommercial/83RC2/releasenote/releasenote_0000.html))
* PyTorch == 2.8.0, torch-npu == 2.8.0
* vLLM (the same version as vllm-ascend)
- Python >= 3.10, < 3.12
- CANN == 8.3.rc2 (Ascend HDK version refers to [here](https://www.hiascend.com/document/detail/zh/canncommercial/83RC2/releasenote/releasenote_0000.html))
- PyTorch == 2.8.0, torch-npu == 2.8.0
- vLLM (the same version as vllm-ascend)
## Getting Started
@@ -58,13 +60,16 @@ vLLM 昇腾插件 (`vllm-ascend`) 是一个由社区维护的让vLLM在Ascend NP
|v0.11.0|Latest official/stable version|[QuickStart](https://docs.vllm.ai/projects/ascend/en/v0.11.0/quick_start.html) and [Installation](https://docs.vllm.ai/projects/ascend/en/v0.11.0/installation.html) for more details|
## Contributing
See [CONTRIBUTING](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/contribution/index.html) for more details on setting up the development environment, functional testing, and PR submission conventions.
We welcome and value any contributions and collaborations:
- Please let us know about any bug you encounter by [filing an issue](https://github.com/vllm-project/vllm-ascend/issues)
- Please use the [User forum](https://discuss.vllm.ai/c/hardware-support/vllm-ascend-support) for usage questions and help.
## Branch policy
vllm-ascend has a main branch and dev branches:
- **main**: the main branch, which corresponds to the vLLM main branch and is continuously quality-gated by Ascend CI
@@ -86,8 +91,9 @@ vllm-ascend有主干分支和开发分支。
## Weekly Meeting
- vLLM Ascend Weekly Meeting: https://tinyurl.com/vllm-ascend-meeting
- vLLM Ascend Weekly Meeting: <https://tinyurl.com/vllm-ascend-meeting>
- Wednesday, 15:00 - 16:00 (UTC+8, [Convert to your timezone](https://dateful.com/convert/gmt8?t=15))
## License
Apache License 2.0, as found in the [LICENSE](./LICENSE) file.


@@ -1,8 +1,13 @@
# Introduction
# vLLM Ascend Benchmarks
## Introduction
This document outlines the benchmarking methodology for vllm-ascend, aimed at evaluating the performance under a variety of workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance.
# Overview
## Overview
**Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas800I A2 (see [quick_start](../docs/source/quick_start.md) to learn more about the supported devices), with different models (coming soon).
- Latency tests
- Input length: 32 tokens.
- Output length: 128 tokens.
@@ -26,8 +31,10 @@ This document outlines the benchmarking methodology for vllm-ascend, aimed at ev
**Benchmarking Duration**: about 800 seconds for a single model.
# Quick Use
## Prerequisites
## Quick Use
### Prerequisites
Before running the benchmarks, ensure the following:
- vllm and vllm-ascend are installed and properly set up in an NPU environment, as these scripts are specifically designed for NPU devices.
@@ -41,7 +48,7 @@ Before running the benchmarks, ensure the following:
- For the performance benchmark, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) as `dummy`. It will construct random weights based on the passed model without downloading the weights from the internet, which can greatly reduce the benchmark time.
- If you want to customize the benchmark, feel free to add your own models and parameters in the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests). Let's take `Qwen2.5-VL-7B-Instruct` as an example:
```shell
```json
[
{
"test_name": "serving_qwen2_5vl_7B_tp1",
@@ -75,45 +82,46 @@ Before running the benchmarks, ensure the following:
This JSON will be structured and parsed into server parameters and client parameters by the benchmark script. This configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters; for more parameter details, see the vllm benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks).
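As a rough illustration only (this is not the actual benchmark script; the `qps_list`, `server_parameters`, and `client_parameters` keys and the file name below are assumptions based on the test JSON layout described here), such an entry could be flattened into CLI flags like this:

```python
import json

def to_cli_flags(params: dict) -> list[str]:
    """Turn a parameter dict into vLLM-style CLI flags (illustrative only)."""
    flags = []
    for key, value in params.items():
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            if value:
                flags.append(flag)            # boolean switches carry no value
        else:
            flags.extend([flag, str(value)])
    return flags

with open("benchmarks/tests/serving-tests.json") as f:   # hypothetical file name
    case = json.load(f)[0]

print(case["test_name"])
print("qps sweep:", case.get("qps_list", []))
print("server flags:", to_cli_flags(case.get("server_parameters", {})))
print("client flags:", to_cli_flags(case.get("client_parameters", {})))
```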
- **Test Overview**
- Test Name: serving_qwen2_5vl_7B_tp1
- **Test Overview**
- Test Name: serving_qwen2_5vl_7B_tp1
- Queries Per Second (QPS): The test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).
- Queries Per Second (QPS): The test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).
- Server Parameters
- Model: Qwen/Qwen2.5-VL-7B-Instruct
- Server Parameters
- Model: Qwen/Qwen2.5-VL-7B-Instruct
- Tensor Parallelism: 1 (no model parallelism is used; the model runs on a single device or node)
- Tensor Parallelism: 1 (no model parallelism is used; the model runs on a single device or node)
- Swap Space: 16 GB (used to handle memory overflow by swapping to disk)
- Swap Space: 16 GB (used to handle memory overflow by swapping to disk)
- disable_log_stats: disables logging of performance statistics.
- disable_log_stats: disables logging of performance statistics.
- disable_log_requests: disables logging of individual requests.
- disable_log_requests: disables logging of individual requests.
- Trust Remote Code: enabled (allows execution of model-specific custom code)
- Trust Remote Code: enabled (allows execution of model-specific custom code)
- Max Model Length: 16,384 tokens (maximum context length supported by the model)
- Max Model Length: 16,384 tokens (maximum context length supported by the model)
- Client Parameters
- Client Parameters
- Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)
- Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)
- Backend: openai-chat (suggests the client uses the OpenAI-compatible chat API format)
- Backend: openai-chat (suggests the client uses the OpenAI-compatible chat API format)
- Dataset Source: Hugging Face (hf)
- Dataset Source: Hugging Face (hf)
- Dataset Split: train
- Dataset Split: train
- Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent)
- Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent)
- Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)
- Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)
- Number of Prompts: 200 (the total number of prompts used during the test)
- Number of Prompts: 200 (the total number of prompts used during the test)
## Run benchmarks
### Run benchmarks
#### Use benchmark script
### Use benchmark script
The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run the following command in the vllm-ascend root directory:
```shell
@@ -134,11 +142,13 @@ Once the script completes, you can find the results in the benchmarks/results fo
These files contain detailed benchmarking results for further analysis.
### Use benchmark cli
#### Use benchmark cli
For more flexible and customized use, a benchmark CLI is also provided to run online/offline benchmarks.
Similarly, let's take the `Qwen2.5-VL-7B-Instruct` benchmark as an example:
#### Online serving
##### Online serving
1. Launch the server:
```shell
@@ -156,7 +166,8 @@ Similarly, lets take `Qwen2.5-VL-7B-Instruct` benchmark as an example:
--request-rate 16
```
#### Offline
##### Offline
- **Throughput**
```shell


@@ -10,6 +10,7 @@
{serving_tests_markdown_table}
## Offline tests
### Latency tests
- Input length: 32 tokens.


@@ -1,6 +1,6 @@
# vLLM Ascend Plugin documents
Live doc: https://docs.vllm.ai/projects/ascend
Live doc: <https://docs.vllm.ai/projects/ascend>
## Build the docs
@@ -20,5 +20,6 @@ python -m http.server -d _build/html/
```
Launch your browser and open:
- English version: http://localhost:8000
- Chinese version: http://localhost:8000/zh_CN
- English version: <http://localhost:8000>
- Chinese version: <http://localhost:8000/zh_CN>


@@ -1,12 +1,15 @@
# Governance
## Mission
As a vital component of vLLM, the vLLM Ascend project is dedicated to providing an easy, fast, and cheap LLM Serving for everyone on Ascend NPUs and to actively contributing to the enrichment of vLLM.
## Principles
vLLM Ascend follows the vLLM community's code of conduct: [vLLM - CODE OF CONDUCT](https://github.com/vllm-project/vllm/blob/main/CODE_OF_CONDUCT.md)
## Governance - Mechanics
vLLM Ascend is an open-source project under the vLLM community, where the authority to appoint roles is ultimately determined by the vLLM community. It adopts a hierarchical technical governance structure.
- Contributor:


@@ -12,6 +12,7 @@ Each vLLM Ascend release is versioned as `v[major].[minor].[micro][rcN][.postN]`
- **Post releases**: Typically issued **on demand** to address minor errors in a final release. Different from [PEP-440 post release note](https://peps.python.org/pep-0440/#post-releases) convention, these versions include actual bug fixes, as the final release version must strictly align with the vLLM final release format (`v[major].[minor].[micro]`). Any post version must be published as a patch version of the final release.
For example:
- `v0.7.x`: first final release to match the vLLM `v0.7.x` version.
- `v0.7.3rc1`: first pre version of vLLM Ascend.
- `v0.7.3.post1`: post release for the `v0.7.3` release if it has some minor errors.
@@ -49,6 +50,7 @@ If you're using v0.7.3, don't forget to install [mindie-turbo](https://pypi.org/
:::
For main branch of vLLM Ascend, we usually make it compatible with the latest vLLM release and a newer commit hash of vLLM. Please note that this table is usually updated. Please check it regularly.
| vLLM Ascend | vLLM | Python | Stable CANN | PyTorch/torch_npu |
|-------------|--------------|------------------|-------------|--------------------|
| main | bde38c11df0ea066a740efe9b77fff5418be45df, v0.13.0 tag | >= 3.10, < 3.12 | 8.3.RC2 | 2.8.0 / 2.8.0 |
@@ -94,6 +96,7 @@ vLLM Ascend includes two branches: main and dev.
Commits should typically be merged into the main branch first, and only then backported to the dev branch, to reduce maintenance costs as much as possible.
### Maintenance branch and EOL
The table below lists branch states.
| Branch | Time Frame | Summary |
@@ -121,7 +124,8 @@ Usually, each minor version of vLLM (such as 0.7) corresponds to a vLLM Ascend v
| Branch | State | RFC Link | Scheduled Merge Time | Mentor |
|------------|--------------|---------------------------------------|------------|--------|
|rfc/long_seq_optimization|Maintained|https://github.com/vllm-project/vllm/issues/22693|930|wangxiyuan|
|rfc/long_seq_optimization|Maintained|<https://github.com/vllm-project/vllm/issues/22693>|930|wangxiyuan|
- Branch: The feature branch should be created with a prefix `rfc/` followed by the feature name, such as `rfc/feature-name`.
- State: The state of the feature branch is `Maintained` until it is merged into the main branch or deleted.
- RFC Link: The feature branch should be created with a corresponding RFC issue. The creation of a feature branch requires an RFC and approval from at least two maintainers.
@@ -131,11 +135,13 @@ Usually, each minor version of vLLM (such as 0.7) corresponds to a vLLM Ascend v
### Backward compatibility
For the main branch, vLLM Ascend should work with the vLLM main branch and the latest 1 or 2 releases. To ensure backward compatibility, do as follows:
- Both main branch and target vLLM release, such as the vLLM main branch and vLLM 0.8.4, are tested by Ascend E2E CI.
- To make sure that code changes are compatible with the latest 1 or 2 vLLM releases, vLLM Ascend introduces a version check mechanism inside the code. It checks the version of the installed vLLM package first to decide which code logic to use. If users hit the `InvalidVersion` error, it may indicate that they have installed a dev or editable version of the vLLM package. In this case, we provide the env variable `VLLM_VERSION` to let users specify the version of the vLLM package to use (see the sketch after this list).
- Document changes should be compatible with the latest 1 or 2 vLLM releases. Notes should be added if there are any breaking changes.
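A minimal sketch of what such a version check can look like (this is not vllm-ascend's actual code; the helper name is hypothetical, and only the `VLLM_VERSION` escape hatch and the `InvalidVersion` behaviour are taken from the text above):

```python
import os
from importlib.metadata import version as installed_version

from packaging.version import InvalidVersion, Version

def resolve_vllm_version() -> Version:
    # VLLM_VERSION lets users override a dev/editable install whose version
    # string would otherwise fail to parse.
    raw = os.environ.get("VLLM_VERSION") or installed_version("vllm")
    try:
        return Version(raw)
    except InvalidVersion as err:
        raise RuntimeError(
            f"Cannot parse vLLM version {raw!r}; set VLLM_VERSION to the "
            "release your installation is based on."
        ) from err

if resolve_vllm_version() >= Version("0.13.0"):
    pass  # code path for the newer vLLM release
else:
    pass  # fallback path for the older supported release
```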
## Document branch policy
To reduce maintenance costs, **all branch documentation content should remain consistent, and version differences can be controlled via variables in [docs/source/conf.py](https://github.com/vllm-project/vllm-ascend/blob/main/docs/source/conf.py)**. While this is not a simple task, it is a principle we should strive to follow.
| Version | Purpose | Code Branch |
@@ -151,6 +157,7 @@ Notes:
- `version` documentation: keep updating the `releases/vX.Y.Z` branch documentation to fix doc bugs.
## Software dependency management
- `torch-npu`: Ascend Extension for PyTorch (torch-npu) releases a stable version to [PyPi](https://pypi.org/project/torch-npu)
every 3 months, a development version (aka the POC version) every month, and a nightly version every day.
The PyPi stable version **CAN** be used in vLLM Ascend final version, the monthly dev version **ONLY CAN** be used in


@@ -1,6 +1,7 @@
# Contributing
## Building and Testing
It's recommended to set up a local development environment to build vllm-ascend and run tests
before you submit a PR.


@@ -116,7 +116,7 @@ This section assumes that you already have a [Kubernetes](https://kubernetes.io/
- Step 1. Install LWS CRD resources
See https://lws.sigs.k8s.io/docs/installation/, which can be used as a reference
See <https://lws.sigs.k8s.io/docs/installation/>, which can be used as a reference
- Step 2. Deploy the following yaml file `lws.yaml` as what you want
@@ -318,14 +318,14 @@ Since our script is Kubernetes-friendly, we need to actively pass in some cluste
`cluster_hosts: ["xxx.xxx.xxx.188", "xxx.xxx.xxx.212"]`
- Step 2. Install develop environment
- Install vllm-ascend develop packages on every cluster host
- Install vllm-ascend develop packages on every cluster host
``` bash
cd /vllm-workspace/vllm-ascend
python3 -m pip install -r requirements-dev.txt
```
- Install AISBench on the first host(leader node) in cluster_hosts
- Install AISBench on the first host(leader node) in cluster_hosts
``` bash
export AIS_BENCH_TAG="v3.0-20250930-master"


@@ -248,7 +248,7 @@ This will reproduce the E2E test. See [vllm_ascend_test.yaml](https://github.com
To run nightly multi-node test cases locally, refer to the `Running Locally` section of [Multi Node Test](./multi_node_test.md).
#### E2E test example:
#### E2E test example
- Offline test example: [`tests/e2e/singlecard/test_offline_inference.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_offline_inference.py)
- Online test examples: [`tests/e2e/singlecard/test_prompt_embedding.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_prompt_embedding.py)


@@ -1,8 +1,11 @@
# Using AISBench
This document guides you to conduct accuracy testing using [AISBench](https://gitee.com/aisbench/benchmark/tree/master). AISBench provides accuracy and performance evaluation for many datasets.
## Online Server
### 1. Start the vLLM server
You can run a docker container to start the vLLM server on a single NPU:
```{code-block} bash
@@ -44,7 +47,7 @@ vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 35000 &
The vLLM server has started successfully if you see logs like the following:
```
```shell
INFO: Started server process [9446]
INFO: Waiting for application startup.
INFO: Application startup complete.
@@ -220,7 +223,7 @@ ais_bench --models vllm_api_general_chat --datasets aime2024_gen_0_shot_chat_pro
After each dataset execution, you can get the result from saved files such as `outputs/default/20250628_151326`. Here is an example:
```
```shell
20250628_151326/
├── configs # Combined configuration file for model tasks, dataset tasks, and result presentation tasks
│ └── 20250628_151326_29317.py
@@ -276,7 +279,7 @@ ais_bench --models vllm_api_stream_chat --datasets textvqa_gen_base64 --summariz
After execution, you can get the result from saved files. Here is an example:
```
```shell
20251031_070226/
|-- configs # Combined configuration file for model tasks, dataset tasks, and result presentation tasks
| `-- 20251031_070226_122485.py


@@ -34,7 +34,7 @@ vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
If the vLLM server is started successfully, you can see information shown below:
```
```shell
INFO: Started server process [6873]
INFO: Waiting for application startup.
INFO: Application startup complete.
@@ -42,7 +42,7 @@ INFO: Application startup complete.
Once your server is started, you can query the model with input prompts in a new terminal:
```
```shell
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
@@ -67,7 +67,7 @@ pip install gradio plotly evalscope
You can use `evalscope eval` to run GSM8K for accuracy testing:
```
```shell
evalscope eval \
--model Qwen/Qwen2.5-7B-Instruct \
--api-url http://localhost:8000/v1 \
@@ -101,7 +101,7 @@ pip install evalscope[perf] -U
You can use `evalscope perf` to run perf testing:
```
```shell
evalscope perf \
--url "http://localhost:8000/v1/chat/completions" \
--parallel 5 \


@@ -1,8 +1,11 @@
# Using lm-eval
This document guides you to conduct accuracy testing using [lm-eval][1].
## Online Server
### 1. Start the vLLM server
You can run a docker container to start the vLLM server on a single NPU:
```{code-block} bash
@@ -34,7 +37,7 @@ vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 4096 &
The vLLM server has started successfully if you see logs like the following:
```
```shell
INFO: Started server process [9446]
INFO: Waiting for application startup.
INFO: Application startup complete.
@@ -44,7 +47,7 @@ INFO: Application startup complete.
You can query the result with input prompts:
```
```shell
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
@@ -71,7 +74,7 @@ curl http://localhost:8000/v1/completions \
The output format matches the following:
```
```json
{
"id": "cmpl-2f678e8bdf5a4b209a3f2c1fa5832e25",
"object": "text_completion",
@@ -108,7 +111,7 @@ pip install lm-eval[api]
Run the following command:
```
```shell
# Only test gsm8k dataset in this demo
lm_eval \
--model local-completions \
@@ -119,7 +122,7 @@ lm_eval \
After 30 minutes, the output is as shown below:
```
```shell
The markdown format results is as below:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
@@ -130,6 +133,7 @@ The markdown format results is as below:
```
## Offline Server
### 1. Run docker container
You can run a docker container on a single NPU:
@@ -161,6 +165,7 @@ docker run --rm \
```
### 2. Run GSM8K using lm-eval for accuracy testing
Install lm-eval in the container:
```bash
@@ -170,7 +175,7 @@ pip install lm-eval
Run the following command:
```
```shell
# Only test gsm8k dataset in this demo
lm_eval \
--model vllm \
@@ -181,7 +186,7 @@ lm_eval \
After 1 to 2 minutes, the output is shown below:
```
```shell
The markdown format results is as below:
Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|


@@ -1,4 +1,5 @@
# Using OpenCompass
This document guides you to conduct accuracy testing using [OpenCompass](https://github.com/open-compass/opencompass).
## 1. Online Server
@@ -33,7 +34,7 @@ vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
The vLLM server has started successfully if you see information like the following:
```
```shell
INFO: Started server process [6873]
INFO: Waiting for application startup.
INFO: Application startup complete.
@@ -41,7 +42,7 @@ INFO: Application startup complete.
Once your server is started, you can query the model with input prompts in a new terminal.
```
```shell
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
@@ -53,6 +54,7 @@ curl http://localhost:8000/v1/completions \
```
## 2. Run C-Eval using OpenCompass for accuracy testing
Install OpenCompass and configure the environment variables in the container:
```bash
@@ -107,13 +109,13 @@ models = [
Run the following command:
```
```shell
python3 run.py opencompass/configs/eval_vllm_ascend_demo.py --debug
```
After 1 to 2 minutes, the output is shown below:
```
```shell
The markdown format results is as below:
| dataset | version | metric | mode | Qwen2.5-7B-Instruct-vLLM-API |


@@ -4,7 +4,7 @@
In LLM inference, each token requires nearly a thousand operator executions, and when the host launches operators more slowly than the device executes them, the workload becomes host-bound. In severe cases, the device will be idle for more than half of the time. To solve this problem, we use graph mode in LLM inference.
```
```shell
eager mode:
host: | launch op1 | launch op2 | launch op3 | launch op4 | launch op5 |
@@ -38,11 +38,12 @@ But in reality, graph mode is not that simple.
Because a graph can only replay the ops captured before, without doing tiling or checking the graph input, we need to ensure the consistency of the graph input. However, the model input's shape depends on the requests scheduled by the Scheduler, so we cannot guarantee that consistency.
Obviously, we can solve this problem by capturing the biggest shape and padding all of the model inputs to it, but that brings a lot of redundant computing and makes performance worse. Instead, we can capture multiple graphs with different shapes and pad the model input to the nearest graph, which greatly reduces redundant computing. However, when `max_num_batched_tokens` is very large, the number of graphs that need to be captured also becomes very large. We also know that when the input tensor's shape is large, the computing time is long, and graph mode is not necessary in that case. So all we need to do is:
1. Set a threshold;
2. When `num_scheduled_tokens` is bigger than the threshold, use `eager_mode`;
3. Capture multiple graphs within a range below the threshold;
```
```shell
| graph1 |
| graph2 |
| graph3 |


@@ -21,6 +21,7 @@ vLLM Ascend Currently supports Mooncake Store for KV Cache Pool. To enable Moonc
For step-by-step deployment and configuration, please refer to the [KV Pool User Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/kv_pool.html).
## How it works?
The KV Cache Pool integrates multiple memory tiers (HBM, DRAM, SSD, etc.) through a connector-based architecture.
Each connector implements a unified interface for storing, retrieving, and transferring KV blocks between tiers, depending on access frequency and hardware bandwidth.
@@ -28,6 +29,7 @@ Each connector implements a unified interface for storing, retrieving, and trans
When combined with vLLM's Prefix Caching mechanism, the pool enables efficient caching both locally (in HBM) and globally (via Mooncake), ensuring that frequently used prefixes remain hot while less frequently accessed KV data can spill over to lower-cost memory.
### 1. Combining KV Cache Pool with HBM Prefix Caching
Prefix Caching with HBM is already supported by the vLLM V1 Engine.
By introducing KV Connector V1, users can seamlessly combine HBM-based Prefix Caching with Mooncake-backed KV Pool.
@@ -54,17 +56,22 @@ To Enable this feature, we need to setup both Mooncake Connector and Mooncake St
For details, please also refer to the Mooncake Connector Store Deployment Guide.
## How is MooncakestoreConnectorV1 Implemented?
**MooncakestoreConnectorV1** inherits the KV Connector V1 class in vLLM V1: by implementing the required methods defined in the KV connector V1 base class, one can integrate a third-party KV cache transfer/storage backend into the vLLM framework.
MooncakeStoreConnectorV1 is also largely inspired by LMCacheConnectorV1 in terms of the `Lookup Engine`/`Lookup Client` design for looking up KV cache keys, and the `ChunkedTokenDatabase` class for processing tokens into prefix-aware hashes as well as other hashing-related designs. On top of this, we have also added our own designs, including `KVTransferThread`, which allows async `get` and `put` of KV caches with multi-threading, and NPU-related data transfer optimizations such as removing the `LocalBuffer` in LMCache to avoid redundant data transfer.
The KV Connector methods that need to be implemented can be categorized into scheduler-side methods that are called in V1 scheduler and worker-side methods that are called in V1 worker, namely:
### KV Connector Scheduler-Side Methods:
### KV Connector Scheduler-Side Methods
`get_num_new_matched_tokens`: Get the prefix-cache hit length in tokens by looking up the KV pool.
`update_states_after_alloc`: Update KVConnector state after temporary buffer alloc.
`build_connector_meta`: Attach the connector metadata to the request object.
`request_finished`: Once a request is finished, determine whether request blocks should be freed now or will be sent asynchronously and freed later.
### Connector Worker-Side Methods:
### Connector Worker-Side Methods
`register_kv_caches`: Register KV cache buffers needed for KV cache transfer.
`start_load_kv`: Perform KV cache load operation that transfers KV cache from storage to device.
`wait_for_layer_load`: Optional; Wait for layer load in layerwise + async KV load scenario.
@@ -73,6 +80,7 @@ The KV Connector methods that need to be implemented can be categorized into sch
`get_finished`: Get requests that have finished KV transfer; `done_sending` if `put` finished, `done_reciving` if `get` finished.
## DFX
1. When looking up a key in the KV Pool, if we cannot find the key, there is no cache hit for this specific block; we return no hit for this block and do not look up further blocks for the current request.
2. Similarly, when we try to put a block into the KV Pool and fail, we do not put further blocks (subject to change).


@@ -1,13 +1,15 @@
# Prepare inputs for model forwarding
## Purpose
Information required to perform model forward pass:
- the inputs
- the corresponding attention metadata of the inputs
- the inputs
- the corresponding attention metadata of the inputs
The following diagram shows what we should prepare for model inference.
```
```shell
+---------------+
inputs --> | |
| model | --> output
@@ -20,8 +22,11 @@ Therefore, as long as we have these two pieces of information mentioned above, w
This document will explain **how we obtain the inputs and their corresponding attention metadata**.
## Overview
### 1. Obtain inputs
The workflow of obtaining inputs:
1. Get `token positions`: relative position of each token within its request sequence.
2. Get `token indices`: index of each scheduled token in the token table.
@@ -33,7 +38,9 @@ At last, these `Token IDs` are required to be fed into a model, and also, `posit
**Note**: The `Token IDs` are the inputs of a model, so we also call them `Inputs IDs`.
### 2. Build inputs attention metadata
A model requires these attention metadata during the forward pass:
- `query start location`: start and end location of each request corresponding to the scheduled tokens.
- `sequence length`: length of each request including both computed tokens and newly scheduled tokens.
- `number of computed tokens`: number of computed tokens for each request.
@@ -45,7 +52,9 @@ A model requires these attention metadata during the forward pass:
- `attention mask`: mask matrix applied to attention scores before softmax to control which tokens can attend to each other (usually a causal attention).
## Before start
There are mainly three types of variables.
- token level: represents one attribute corresponding to each scheduled token, so the length of this variable is the number of scheduled tokens
- request level: represents one attribute of each scheduled request, whose length usually is the number of scheduled requests. (`query start location` is a special case, which has one more element)
- system level:
@@ -55,10 +64,11 @@ There are mainly three types of variables.
**Note**: Both of these tables come from the `_update_states` method before **preparing inputs**. You can take a look if you need more inspiration.
### Tips
Simply put, a `token ID` is an **integer** (usually `int32`), which represents a token.
Example of `Token ID`:
```
```shell
| Token ID | Token |
|--------------|---------------|
| 0 | [PAD] |
@@ -76,19 +86,24 @@ Example of `Token ID`:
```
## Go through details
Assumptions:
- maximum number of tokens can be scheduled at once: 10
- `block size`: 2
- Totally schedule 3 requests. Their prompt lengths are 3, 2, and 8 respectively.
- `max model length`: 12 (the maximum token count can be handled at one request sequence in a model).
These assumptions are configured in the beginning when starting vLLM. They are not fixed, so you can manually set them.
### Step 1: All requests in the prefill phase
#### Obtain inputs
As the maximum number of tokens that can be scheduled is 10, the scheduled tokens of each request can be represented as `{'0': 3, '1': 2, '2': 5}`. Note that `request_2` uses chunked prefill, leaving 3 prompt tokens unscheduled.
##### 1. Get token positions:
##### 1. Get token positions
First, determine which request each token belongs to: tokens 0–2 are assigned to **request_0**, tokens 3–4 to **request_1**, and tokens 5–9 to **request_2**. To represent this mapping, we use `request indices`, for example, `request indices`: `[0, 0, 0, 1, 1, 2, 2, 2, 2, 2]`.
For each request, use **the number of computed tokens** + **the relative position of current scheduled tokens** (`request_0: [0 + 0, 0 + 1, 0 + 2]`, `request_1: [0 + 0, 0 + 1]`, `request_2: [0 + 0, 0 + 1,..., 0 + 4]`) and then concatenate them together (`[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`).
@@ -97,13 +112,15 @@ Note: there is more efficient way (using `request indices`) to create positions
Finally, `token positions` can be obtained as `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`. This variable is **token level**.
##### 2. Get token indices:
##### 2. Get token indices
The shape of the current **Token IDs table** is `(max num request, max model len)`.
Why are these `T_3_5`, `T_3_6`, `T_3_7` in this table without being scheduled?
- We fill all Token IDs of one request sequence into this table at once, but we only retrieve the tokens we scheduled this time. We then retrieve the remaining Token IDs next time.
```
```shell
| T_0_0 | T_0_1 | T_0_2 | ? | ? | ? | ? | ? | ? | ? | ? | ? |
| T_1_0 | T_1_1 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? |
| T_2_0 | T_2_1 | T_3_2 | T_3_3 | T_3_4 | T_3_5 | T_3_6 | T_3_7 | ? | ? | ? | ? |
@@ -120,19 +137,22 @@ Let's say `M = max model len`. Then we can use `token positions` together with `
So `token indices` = `[0 + 0 * M, 1 + 0 * M, 2 + 0 * M, 0 + 1 * M, 1 + 1 * M, 0 + 2 * M, 1 + 2 * M, 2 + 2 * M, 3 + 2 * M, 4 + 2 * M]` = `[0, 1, 2, 12, 13, 24, 25, 26, 27, 28]`
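A tiny numeric sketch of the formula above (toy arrays only, not vLLM's real data structures):

```python
import numpy as np

M = 12  # max model len
positions = np.array([0, 1, 2, 0, 1, 0, 1, 2, 3, 4])        # token level
request_indices = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])  # token level

token_indices = positions + request_indices * M
print(token_indices)  # [ 0  1  2 12 13 24 25 26 27 28]
```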
##### 3. Retrieve the Token IDs
We use `token indices` to select out the corresponding `Input IDs` from the token table. The pseudocode is as follows:
```
```shell
input_ids = token_table[token_indices]
```
As mentioned before, we refer to these `Token IDs` as `Input IDs`.
- `Input IDs` = `[T_0_0, T_0_1, T_0_2, T_1_0, T_1_1, T_2_0, T_2_1, T_3_2, T_3_3, T_3_4]`
#### Build inputs attention metadata
In the current **Block Table**, we use the first block (i.e. block_0) to mark the unused block. The shape of the block table is `(max num request, max model len / block size)`, where `max model len / block size = 12 / 2 = 6`.
```
```shell
| 1 | 2 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 6 | 0 | 0 | 0 |
@@ -144,13 +164,14 @@ In the current **Block Table**, we use the first block (i.e. block_0) to mark th
The KV cache block in the device memory is like:
```
```shell
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ......
```
Let's say `K = max model len / block size = 6`; then we can get each token's `device block number`.
The workflow for obtaining the slot mapping:
1. Get `block table indices` using `K`, `positions` and `request indices`.
Purpose: For each token, it could be used to select `device block number` from `block table`.
@@ -168,6 +189,7 @@ The workflow of achieving slot mapping:
Purpose: we can use `slot mapping` to store Token IDs into token slots.
Details:
1. (**Token level**) Use a simple formula to calculate `block table indices`: `request indices * K + positions / block size`. So it equals `[0 * 6 + 0 / 2, 0 * 6 + 1 / 2, 0 * 6 + 2 / 2, 1 * 6 + 0 / 2, 1 * 6 + 1 / 2, 2 * 6 + 0 / 2, 2 * 6 + 1 / 2, 2 * 6 + 2 / 2, 2 * 6 + 3 / 2, 2 * 6 + 4 / 2] = [0, 0, 1, 6, 6, 12, 12, 13, 13, 14]`. This can be used to select the `device block number` from the `block table` (see the numeric sketch after this list).
2. (**Token level**) Use `block table indices` to select out the `device block number` for each scheduled token. The pseudocode is `block_numbers = block_table[block_table_indices]`. So `device block number = [1, 1, 2, 3, 3, 4, 4, 5, 5, 6]`.
3. (**Token level**) `block offsets` can be computed as `block offsets = positions % block size = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]`.
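The following toy snippet recomputes the numbers above. It is not vLLM's implementation, and it assumes a slot is numbered `device block number * block size + block offset`, which matches the step-2 values shown later in this document:

```python
import numpy as np

K, block_size = 6, 2
positions = np.array([0, 1, 2, 0, 1, 0, 1, 2, 3, 4])
request_indices = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
block_table = np.array([[1, 2, 0, 0, 0, 0],
                        [3, 0, 0, 0, 0, 0],
                        [4, 5, 6, 0, 0, 0]])

block_table_indices = request_indices * K + positions // block_size
block_numbers = block_table.reshape(-1)[block_table_indices]  # [1 1 2 3 3 4 4 5 5 6]
block_offsets = positions % block_size                         # [0 1 0 0 1 0 1 0 1 0]
slot_mapping = block_numbers * block_size + block_offsets      # assumed slot numbering
```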
@@ -185,9 +207,11 @@ Details:
- `attention mask`: For all requests that initiate a prefill process, we simply create only one mask matrix for reuse across different requests. The shape of this mask matrix is `5 * 5`:
### Step 2: Chunked prefill
In Step 2, we no longer provide explanations or perform calculations; instead, we directly present the final result.
#### Obtain inputs
Scheduled tokens of each request: `{'0': 1, '1': 1, '2': 3}`
1. `request indices`: `[0, 1, 2, 2, 2]`
@@ -195,7 +219,7 @@ Scheduled token of each request: `{'0': 1, '1': 1, '2': 3}`
Current **Token IDs table**:
```
```shell
| T_0_0 | T_0_1 | T_0_2 | T_0_3 | ? | ? | ? | ? | ? | ? | ? | ? |
| T_1_0 | T_1_1 | T_1_2 | ? | ? | ? | ? | ? | ? | ? | ? | ? |
| T_2_0 | T_2_1 | T_3_2 | T_3_3 | T_3_4 | T_3_5 | T_3_6 | T_3_7 | ? | ? | ? | ? |
@@ -211,11 +235,12 @@ Current **Token IDs table**:
4. `Input IDs`: `[T_0_3, T_1_2, T_3_5, T_3_6, T_3_7]`
#### Build inputs attention metadata
We allocate blocks `7` and `8` to `request_1` and `request_2` respectively, as they need more space on the device to store the KV cache produced by token generation or chunked prefill.
Current **Block Table**:
```
```shell
| 1 | 2 | 0 | 0 | 0 | 0 |
| 3 | 7 | 0 | 0 | 0 | 0 |
| 4 | 5 | 6 | 8 | 0 | 0 |
@@ -227,7 +252,7 @@ Current **Block Table**:
KV cache block in the device memory:
```
```shell
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ......
```
@@ -237,6 +262,7 @@ KV cache block in the device memory:
4. (**Token level**) `slot mapping`: `[5, 14, 13, 16, 17]`
Scheduled token count: `[1, 1, 3]`
- `query start location`: `[0, 1, 2, 5]`
- `sequence length`: `[4, 3, 8]`
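As a small sketch (toy code, not vLLM's implementation), both vectors follow directly from the per-request counts:

```python
import numpy as np

num_scheduled = np.array([1, 1, 3])  # tokens scheduled this step, per request
num_computed = np.array([3, 2, 5])   # tokens already computed, per request

query_start_loc = np.concatenate(([0], np.cumsum(num_scheduled)))  # [0 1 2 5]
seq_lens = num_computed + num_scheduled                             # [4 3 8]
```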
@@ -254,6 +280,7 @@ Scheduled token count:`[1, 1, 3]`
Each token has a `1 * 8` vector, and there are 5 scheduled tokens.
## At last
If you understand step 1 and step 2, you will understand all the following steps.
Hope this document helps you better understand how vLLM prepares inputs for model forwarding. If you have any good ideas, you are welcome to contribute.


@@ -20,6 +20,7 @@ Its main objective is to eliminate duplicated storage of the KV cache by shardin
DCP primarily influences the Decode logic, as well as the logic for chunked prefill and cached prefill.
## How to Use CP?
Please refer to the [context parallel user guide](../../user_guide/feature_guide/context_parallel.md) for detailed information.
## How It Works?


@@ -15,6 +15,7 @@ This feature addresses the need to optimize the **Time Per Output Token (TPOT)**
## Usage
vLLM Ascend currently supports two types of connectors for handling KV cache management:
- **MooncakeConnector**: D nodes pull KV cache from P nodes.
- **MooncakeLayerwiseConnector**: P nodes push KV cache to D nodes in a layered manner.
@@ -35,7 +36,7 @@ Our design diagram is shown below, illustrating the pull and push schemes respec
![alt text](../../assets/disaggregated_prefill_pull.png)
![alt text](../../assets/disaggregated_prefill_push.png)
#### Mooncake Connector:
#### Mooncake Connector
1. The request is sent to the Proxy's `_handle_completions` endpoint.
2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_tokens=1`, and `min_tokens=1`.
@@ -43,7 +44,7 @@ Our design diagram is shown below, illustrating the pull and push schemes respec
4. The Proxy calls `select_decoder` to choose a D node and forwards the request.
5. On the D node, the scheduler marks the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, calls `kv_connector_no_forward` to pull the remote KV cache, then notifies the P node to release KV cache and proceeds with decoding to return the result.
#### Mooncake Layerwise Connector:
#### Mooncake Layerwise Connector
1. The request is sent to the Proxy's `_handle_completions` endpoint.
2. The Proxy calls `select_decoder` to choose a D node and forwards the request, configuring `kv_transfer_params` with `do_remote_prefill=True` and setting the `metaserver` endpoint.
@@ -55,6 +56,7 @@ Our design diagram is shown below, illustrating the pull and push schemes respec
### 3. Interface Design
Taking MooncakeConnector as an example, the system is organized into three primary classes:
- **MooncakeConnector**: Base class that provides core interfaces.
- **MooncakeConnectorScheduler**: Interface for scheduling the connectors within the engine core, responsible for managing KV cache transfer requirements and completion.
- **MooncakeConnectorWorker**: Interface for managing KV cache registration and transfer in worker processes.


@@ -1,18 +1,22 @@
# Expert Parallelism Load Balancer (EPLB)
## Why We Need EPLB?
When using Expert Parallelism (EP), different experts are assigned to different NPUs. Given that the load of various experts may vary depending on the current workload, it is crucial to maintain balanced loads across different NPUs. We adopt a redundant experts strategy by duplicating heavily-loaded experts. Then, we heuristically pack these duplicated experts onto NPUs to ensure load balancing across them. Moreover, thanks to the group-limited expert routing used in MoE models, we also attempt to place experts of the same group on the same node to reduce inter-node data traffic, whenever possible.
To facilitate reproduction and deployment, vLLM Ascend provides the deployed EP load balancing algorithm in `vllm_ascend/eplb/core/policy`. The algorithm computes a balanced expert replication and placement plan based on the estimated expert loads; a toy sketch follows the figure below. Note that the exact method for predicting expert loads is outside the scope of this repository. A common method is to use a moving average of historical statistics.
![eplb](../../assets/eplb.png)
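The following is only a toy sketch of that idea, not the policy code in `vllm_ascend/eplb/core/policy`: estimate per-expert load with a moving average, split the heaviest experts into redundant replicas, and greedily pack replicas onto the least-loaded NPU.

```python
import heapq

def moving_average_load(history, alpha=0.8):
    """Estimate per-expert load from historical per-step load lists (toy)."""
    estimate = [0.0] * len(history[0])
    for step_loads in history:
        estimate = [alpha * e + (1 - alpha) * l for e, l in zip(estimate, step_loads)]
    return estimate

def pack_experts(loads, num_npus, num_redundant):
    """Replicate the heaviest experts and greedily balance replicas across NPUs (toy).

    A real policy would also avoid co-locating replicas of the same expert
    and respect expert-group/node constraints; this sketch ignores both.
    """
    replicas = sorted(((load, eid) for eid, load in enumerate(loads)), reverse=True)
    for i in range(num_redundant):          # split the heaviest experts in two
        load, eid = replicas[i]
        replicas[i] = (load / 2, eid)
        replicas.append((load / 2, eid))
    heap = [(0.0, npu) for npu in range(num_npus)]
    heapq.heapify(heap)
    placement = {npu: [] for npu in range(num_npus)}
    for load, eid in sorted(replicas, reverse=True):
        total, npu = heapq.heappop(heap)    # least-loaded NPU gets the next replica
        placement[npu].append(eid)
        heapq.heappush(heap, (total + load, npu))
    return placement

plan = pack_experts(moving_average_load([[1, 5, 2, 9], [2, 6, 2, 8]]),
                    num_npus=2, num_redundant=1)
```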
## How to Use EPLB?
Please refer to the EPLB section of the user guide for detailed information: [How to Use EPLB](../../user_guide/feature_guide/eplb_swift_balancer.md)
## How It Works?
**EPLB Module Architecture**
```
```shell
vllm_ascend
├── eplb
│ ├── adaptor
@@ -35,6 +39,7 @@ vllm_ascend
**1. Adaptor Module**
*Handles registration and adaptation for different MoE model types*
- `abstract_adaptor.py`
Abstract base class defining unified registration interfaces for EPLB adapters
- `vllm_adaptor.py`
@@ -42,17 +47,18 @@ vllm_ascend
**2. Core Module**
*Implements core algorithms, updates, and asynchronous processing*
- **Policy Submodule**
*Load balancing algorithms with factory pattern instantiation*
- `policy_abstract.py`
- `policy_abstract.py`
Abstract class for load balancing strategy interfaces
- `policy_dynamic_ep.py`
- `policy_dynamic_ep.py`
Default implementation of open-source EPLB paper algorithm
- `policy_dynamic_ep_v2.py`
- `policy_dynamic_ep_v2.py`
Enhanced version optimizing expert swaps for low-bandwidth devices (e.g., A2)
- `policy_flashlb.py`
- `policy_flashlb.py`
Threshold-based adjustment reducing operational costs through layer-wise fluctuation detection
- `policy_factory.py`
- `policy_factory.py`
Strategy factory for automatic algorithm instantiation
- `eplb_device_transfer_loader.py`
@@ -63,12 +69,14 @@ vllm_ascend
Asynchronous algorithm orchestration and result processing
**3. System Components**
- `eplb_updator.py`
Central coordinator for load balancing during inference workflows
- `utils.py`
General utilities for EPLB interface registration
*Key Optimizations:*
1. Maintained original structure while improving technical clarity
2. Standardized terminology
3. Enhanced algorithm differentiation through concise descriptors
@@ -76,14 +84,19 @@ vllm_ascend
5. Preserved file/class relationships while optimizing readability
### Default Algorithm
#### Hierarchical Load Balancing
When the number of server nodes evenly divides the number of expert groups, we use the hierarchical load balancing policy to leverage group-limited expert routing. We first pack the expert groups onto nodes evenly, ensuring balanced loads across different nodes. Then, we replicate the experts within each node. Finally, we pack the replicated experts onto individual NPUs to ensure load balancing across them. The hierarchical load balancing policy can be used in the prefilling stage with a smaller expert-parallel size.
#### Global Load Balancing
In other cases, we use the global load balancing policy, which replicates experts globally regardless of expert groups, and packs the replicated experts onto individual NPUs. This policy can be adopted in the decoding stage with a larger expert-parallel size.
### Add a New EPLB Policy
If you want to add a new eplb policy to vllm_ascend, you must follow these steps:
1. Inherit the `EplbPolicy` abstract class of `policy_abstract.py` and override the `rebalance_experts` interface, ensuring consistent input parameters `current_expert_table`, `expert_workload` and return types `newplacement`.
For example:
@@ -113,6 +126,7 @@ class RandomLoadBalance(EplbPolicy):
2. To add a new EPLB algorithm, include the policy type and its corresponding implementation class in the `PolicyFactory` of `policy_factory.py`.
### Add a New MoE Model
**Implementation Guide for Model Integration**
1. **Adapter File Modification**
@@ -154,12 +168,17 @@ class RandomLoadBalance(EplbPolicy):
- Benchmark against baseline implementations (e.g., Qwen3-MoE)
*Key Implementation Notes:*
- Preserve existing interface contracts in abstract classes
- Use decorators for non-intrusive patch integration
- Leverage `eplb_utils.py` for shared expert mapping operations
## DFX
### Parameter Validation
#### Integer Parameters
All integer input parameters must explicitly specify their maximum and minimum values and be subject to valid value validation. For example, `num_iterations_eplb_update` must be greater than 0:
```python
@@ -176,6 +195,7 @@ All integer input parameters must explicitly specify their maximum and minimum v
```
#### File Path
The file path for EPLB must be checked for legality, such as whether the file path is valid and whether it has appropriate read and write permissions. For example:
```python
@@ -203,20 +223,27 @@ The file path for EPLB must be checked for legality, such as whether the file pa
```
### Function Specifications
#### Initialization Function
All EPLB parameters must be initialized by default during initialization, with specified parameter types and default values for proper handling.
#### General Functions
All method arguments must specify parameter types and default values, and functions must include default return value handling for default arguments. It is recommended to use `try-except` blocks to handle the function body, specifying the type of exception captured and the failure handling (e.g., logging exceptions or returning a failure status).
### Consistency
#### Expert Map
The expert map must be globally unique during initialization and update. In a multi-node scenario during initialization, distributed communication should be used to verify the consistency of expert maps across each rank. If they are inconsistent, the user should be notified which ranks have inconsistent maps.
During the update process, if only a few layers or the expert table of a certain rank has been changed, the updated expert table must be synchronized with the EPLB's context to ensure global consistency.
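A hedged sketch of such a consistency check (not vllm-ascend code; it assumes the expert map is a flat iterable of integers that can be hashed on every rank, and uses `torch.distributed.all_gather_object`):

```python
import torch.distributed as dist

def check_expert_map_consistency(expert_map) -> None:
    """Raise with the offending ranks if expert maps differ across ranks (sketch)."""
    world_size = dist.get_world_size()
    digests = [None] * world_size
    # Gather a cheap digest of the local map from every rank.
    dist.all_gather_object(digests, hash(tuple(map(int, expert_map))))
    bad_ranks = [r for r, d in enumerate(digests) if d != digests[0]]
    if bad_ranks:
        raise RuntimeError(f"Expert map mismatch on ranks {bad_ranks}")
```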
#### Expert Weight
When updating expert weights, ensure that the memory allocated for the expert weights has been released, or that the expert (referring to the old version) is no longer in use.
## Limitation
Before using EPLB, add `export DYNAMIC_EPLB="true"` when starting the serving script.
Before performing load data collection (or performance data collection), add `export EXPERT_MAP_RECORD="true"` when starting the serving script.


@@ -16,7 +16,7 @@ We should keep in mind that Patch is not the best way to make vLLM Ascend compat
In `vllm_ascend/patch`, you can see the code structure as follows:
```
```shell
vllm_ascend
├── patch
│ ├── platform
@@ -27,10 +27,10 @@ vllm_ascend
```
- **platform**: The patch code in this directory is for patching the code in vLLM main process. It's called by `vllm_ascend/platform::NPUPlatform::pre_register_and_update` very early when vLLM is initialized.
- For online mode, vLLM process calls the platform patch in `vllm/vllm/engine/arg_utils.py::AsyncEngineArgs.add_cli_args` when parsing the cli args.
- For offline mode, vLLM process calls the platform patch in `vllm/vllm/engine/arg_utils.py::EngineArgs.create_engine_config` when parsing the input parameters.
- For online mode, vLLM process calls the platform patch in `vllm/vllm/engine/arg_utils.py::AsyncEngineArgs.add_cli_args` when parsing the cli args.
- For offline mode, vLLM process calls the platform patch in `vllm/vllm/engine/arg_utils.py::EngineArgs.create_engine_config` when parsing the input parameters.
- **worker**: The patch code in this directory is for patching the code in vLLM worker process. It's called by `vllm_ascend/worker/worker::NPUWorker::__init__` when the vLLM worker process is initialized.
- For both online and offline mode, vLLM engine core process calls the worker patch in `vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker` when initializing the worker process.
- For both online and offline mode, vLLM engine core process calls the worker patch in `vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker` when initializing the worker process.
## How to write a patch
@@ -54,7 +54,7 @@ Before writing a patch, following the principle above, we should patch the least
5. Import the patch file in `__init__.py`. In this example, add `import vllm_ascend.patch.platform.patch_distributed` into `vllm_ascend/patch/platform/__init__.py`.
6. Add the description of the patch in `vllm_ascend/patch/__init__.py`. The description format is as follows:
```
```python
# ** File: <The patch file name> **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `<The target patch module in vLLM>`
@@ -71,5 +71,6 @@ Before writing a patch, following the principle above, we should patch the least
7. Add the Unit Test and E2E Test. Any newly added code in vLLM Ascend should contain the Unit Test and E2E Test as well. You can find more details in [test guide](../contribution/testing.md)
## Limitation
1. In V1 Engine, vLLM starts three kinds of process: Main process, EngineCore process and Worker process. Now vLLM Ascend only can patch the code in Main process and Worker process by default. If you want to patch the code running in EngineCore process, you should patch EngineCore process entirely during setup. Find the entire code in `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely.
2. If you are running edited vLLM code, the version of vLLM may be changed automatically. For example, if you run the edited vLLM based on v0.9.n, the version of vLLM may be changed to v0.9.nxxx. In this case, the patch for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend can't distinguish the version of the vLLM you're using. In this case, you can set the environment variable `VLLM_VERSION` to specify the version of the vLLM you're using, and then the patch for v0.10.0 should work.


@@ -53,7 +53,7 @@ To restrict the operators that are captured, configure the `list` block:
- `scope` (list[str]): In PyTorch pynative scenarios this field restricts the dump range. Provide two module or API names that follow the tool's naming convention to lock a range; only data between the two names will be dumped. Examples:
```
```json
"scope": ["Module.conv1.Conv2d.forward.0", "Module.fc2.Linear.forward.0"]
"scope": ["Cell.conv1.Conv2d.forward.0", "Cell.fc2.Dense.backward.0"]
"scope": ["Tensor.add.0.forward", "Functional.square.2.forward"]
@@ -62,9 +62,9 @@ To restrict the operators that are captured, configure the `list` block:
The `level` setting determines what can be provided—modules when `level=L0`, APIs when `level=L1`, and either modules or APIs when `level=mix`.
- `list` (list[str]): Custom operator list. Options include:
- Supply the full names of specific APIs in PyTorch pynative scenarios to only dump those APIs. Example: `"list": ["Tensor.permute.1.forward", "Tensor.transpose.2.forward", "Torch.relu.3.backward"]`.
- When `level=mix`, you can provide module names so that the dump expands to everything produced while the module is running. Example: `"list": ["Module.module.language_model.encoder.layers.0.mlp.ParallelMlp.forward.0"]`.
- Provide a substring such as `"list": ["relu"]` to dump every API whose name contains the substring. When `level=mix`, modules whose names contain the substring are also expanded.
- Supply the full names of specific APIs in PyTorch pynative scenarios to only dump those APIs. Example: `"list": ["Tensor.permute.1.forward", "Tensor.transpose.2.forward", "Torch.relu.3.backward"]`.
- When `level=mix`, you can provide module names so that the dump expands to everything produced while the module is running. Example: `"list": ["Module.module.language_model.encoder.layers.0.mlp.ParallelMlp.forward.0"]`.
- Provide a substring such as `"list": ["relu"]` to dump every API whose name contains the substring. When `level=mix`, modules whose names contain the substring are also expanded.
Example configuration:
@@ -188,7 +188,7 @@ Use `msprobe graph_visualize` to generate results that can be opened inside `tb_
Replace the paths with your dump directories before invoking `msprobe graph_visualize`. **If you only need to build a single graph**, omit `bench_path` to visualize one dump.
Multi-rank scenarios (single rank, multi-rank, or multi-step multi-rank) are also supported. `npu_path` or `bench_path` must contain folders named `rank+number`, and every rank folder must contain a non-empty `construct.json` together with `dump.json` and `stack.json`. If any `construct.json` is empty, verify that the dump level includes `L0` or `mix`. When comparing graphs, both `npu_path` and `bench_path` must contain the same set of rank folders so they can be paired one-to-one.
```
```shell
├── npu_path or bench_path
| ├── rank0
| | ├── dump_tensor_data (only when the `tensor` option is enabled)

View File

@@ -200,10 +200,12 @@ echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```
Purpose
- Forces all CPU cores to run under the `performance` governor
- Disables dynamic frequency scaling (e.g., `ondemand`, `powersave`)
Benefits
- Keeps CPU cores at maximum frequency
- Reduces latency jitter
- Improves predictability for inference workloads
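A quick way to confirm the change took effect (assuming the standard cpufreq sysfs layout):

```shell
# Every core should report "performance" after applying the command above
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
```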
@@ -224,6 +226,7 @@ Benefits
- Improves stability for large in-memory models
Notes
- For inference workloads, swap can introduce second-level latency
- Recommended values are `0` or `1`
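As a minimal sketch, assuming the standard `vm.swappiness` sysctl is the knob this section refers to, the recommended value can be applied and persisted like this:

```shell
# Apply the recommended value now and keep it across reboots (drop-in path is a common convention)
sysctl -w vm.swappiness=0
echo "vm.swappiness = 0" >> /etc/sysctl.d/99-inference-tuning.conf
```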
@@ -244,6 +247,7 @@ Benefits
- Improves performance stability on NUMA systems
Recommended For
- Multi-socket servers
- Ascend / NPU deployments with explicit NUMA binding
- Systems with manually managed CPU and memory affinity
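For illustration only, explicit NUMA binding of an inference process might look like the following (NUMA node and model are placeholders):

```shell
# Pin both CPU and memory allocations to NUMA node 0 before launching the server
numactl --cpunodebind=0 --membind=0 vllm serve Qwen/Qwen2.5-7B-Instruct
```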
@@ -255,14 +259,17 @@ sysctl -w kernel.sched_migration_cost_ns=50000
```
Purpose
- Increases the cost for the scheduler to migrate tasks between CPU cores
Benefits
- Reduces frequent thread migration
- Improves CPU cache locality
- Lowers latency jitter for inference workloads
Parameter Details
- Unit: nanoseconds (ns)
- Typical recommended range: 50000-100000
- Higher values encourage threads to stay on the same CPU core
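To make the setting survive reboots, it can be written to a sysctl drop-in (the value below is at the low end of the recommended range; the file path is a common convention):

```shell
echo "kernel.sched_migration_cost_ns = 50000" >> /etc/sysctl.d/99-inference-tuning.conf
sysctl --system   # reload all sysctl configuration files
```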

View File

@@ -1,4 +1,5 @@
# Performance Benchmark
This document details the benchmark methodology for vllm-ascend, aimed at evaluating the performance under a variety of workloads. To maintain alignment with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) script provided by the vllm project.
**Benchmark Coverage**: We measure offline E2E latency and throughput, and fixed-QPS online serving benchmarks. For more details, see [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
@@ -38,10 +39,12 @@ pip install -r benchmarks/requirements-bench.txt
```
## 3. Run basic benchmarks
This section introduces how to perform performance testing using the benchmark suite built into vLLM.
### 3.1 Dataset
VLLM supports a variety of (datasets)[https://github.com/vllm-project/vllm/blob/main/vllm/benchmarks/datasets.py].
VLLM supports a variety of [datasets](https://github.com/vllm-project/vllm/blob/main/vllm/benchmarks/datasets.py).
<style>
th {

View File

@@ -5,19 +5,20 @@ The execution duration of each stage (including pre/post-processing, model forwa
**To reduce the performance overhead, we add this feature, using the NPU event timestamp mechanism to observe the device execution time asynchronously.**
## Usage
* Use the environment variable `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` to enable this feature.
* Use the non-blocking API `ProfileExecuteDuration().capture_async` to set observation points asynchronously when you need to observe the execution duration.
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages.
**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execution duration profiling. Execute the script as follows:**
```
```shell
VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py
```
## Example Output
```
```shell
5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms
5695:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms
5697:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.81ms [prepare input and forward]:10.29ms [forward]:3.99ms

View File

@@ -15,6 +15,7 @@ pip install msserviceprofiler==1.2.2
```
### 1 Preparation
Before starting the service, set the environment variable `SERVICE_PROF_CONFIG_PATH` to point to the profiling configuration file, and set the environment variable `PROFILING_SYMBOLS_PATH` to specify the YAML configuration file for the symbols that need to be imported. After that, start the vLLM service according to your deployment method.
```bash
@@ -32,6 +33,7 @@ The file `ms_service_profiler_config.json` is the profiling configuration. If it
`service_profiling_symbols.yaml` is the configuration file containing the profiling points to be imported. You can choose **not** to set the `PROFILING_SYMBOLS_PATH` environment variable, in which case the default configuration file will be used. If the file does not exist at the path you specified, likewise, the system will generate a configuration file at your specified path for future configuration. You can customize it according to the instructions in the `Symbols Configuration File` section below.
### 2 Enable Profiling
To enable the performance data collection switch, change the `enable` field from `0` to `1` in the configuration file `ms_service_profiler_config.json`. This can be accomplished by executing the following sed command:
```bash
@@ -39,6 +41,7 @@ sed -i 's/"enable":\s*0/"enable": 1/' ./ms_service_profiler_config.json
```
### 3 Send Requests
Choose a request-sending method that suits your actual profiling needs:
```bash
@@ -65,6 +68,7 @@ msserviceprofiler analyze --input-path=./ --output-path output
### 5 View Results
After analysis, the `output` directory will contain:
- `chrome_tracing.json`: Chrome tracing format data, which can be opened in [MindStudio Insight](https://www.hiascend.com/document/detail/zh/mindstudio/81RC1/GUI_baseddevelopmenttool/msascendinsightug/Insight_userguide_0002.html).
- `profiler.db`: Performance data in database format.
- `request.csv`: Request-related data.
@@ -77,7 +81,9 @@ After analysis, the `output` directory will contain:
---
## Appendix
(profiling-configuration-file)=
### 1 Profiling Configuration File
The profiling configuration file controls profiling parameters and behavior.
@@ -116,6 +122,7 @@ The configuration is in JSON format. Main parameters:
---
(symbols-configuration-file)=
### 2 Symbols Configuration File
The symbols configuration file defines which functions/methods to profile and supports flexible configuration with custom attribute collection.

View File

@@ -19,6 +19,7 @@ Currently, **ONLY** Atlas A2 series(Ascend-cann-kernels-910b)Atlas A3 series(
- [Experimental] Currently for 310I Duo the stable version is vllm-ascend v0.10.0rc1.
Below series are NOT supported yet:
- Atlas 200I A2 (Ascend-cann-kernels-310b) unplanned yet
- Ascend 910, Ascend 910 Pro B (Ascend-cann-kernels-910) unplanned yet
@@ -39,6 +40,7 @@ docker pull quay.nju.edu.cn/ascend/vllm-ascend:$TAG
```
#### Load Docker Images for offline environment
If you want to use container image for offline environments (no internet connection), you need to download container image in an environment with internet access:
**Exporting Docker images:**
@@ -85,13 +87,14 @@ Find more details [<u>here</u>](https://docs.vllm.ai/projects/ascend/en/latest/u
### 6. How to solve the problem of "Failed to infer device type" or "libatb.so: cannot open shared object file"?
Basically, the reason is that the NPU environment is not configured correctly. You can:
1. try `source /usr/local/Ascend/nnal/atb/set_env.sh` to enable NNAL package.
2. try `source /usr/local/Ascend/ascend-toolkit/set_env.sh` to enable CANN package.
3. try `npu-smi info` to check whether the NPU is working.
If none of the above steps work, you can run the following Python code to check for errors:
```
```python
import torch
import torch_npu
import vllm
@@ -100,6 +103,7 @@ import vllm
If none of the above steps work, feel free to submit a GitHub issue.
### 7. How does vllm-ascend work with vLLM?
vllm-ascend is a hardware plugin for vLLM. Basically, the version of vllm-ascend is the same as the version of vllm. For example, if you use vllm 0.9.1, you should use vllm-ascend 0.9.1 as well. For the main branch, we make sure `vllm-ascend` and `vllm` stay compatible on each commit.
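For example, a version-matched installation (release numbers are illustrative) looks like:

```shell
# Keep the two packages on the same release line
pip install vllm==0.9.1 vllm-ascend==0.9.1
```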
### 8. Does vllm-ascend support Prefill Disaggregation feature?
@@ -129,9 +133,11 @@ vllm-ascend is tested in three aspects, functions, performance, and accuracy.
Finally, for each release, we'll publish the performance and accuracy test reports in the future.
### 12. How to fix the error "InvalidVersion" when using vllm-ascend?
The problem is usually caused by the installation of a dev or editable version of the vLLM package. In this case, we provide the environment variable `VLLM_VERSION` to let users specify the version of vLLM package to use. Please set the environment variable `VLLM_VERSION` to the version of the vLLM package you have installed. The format of `VLLM_VERSION` should be `X.Y.Z`.
### 13. How to handle the out-of-memory issue?
OOM errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, you can refer to [vLLM OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#out-of-memory).
In scenarios where NPUs have limited high bandwidth memory (HBM) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:
@@ -142,7 +148,8 @@ In scenarios where NPUs have limited high bandwidth memory (HBM) capacity, dynam
- **Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, you can use `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` to enable virtual memory feature to mitigate memory fragmentation caused by frequent dynamic memory size adjustments during runtime. See details in: [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html).
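As a hedged example combining these mitigations (model name and sizes are placeholders):

```shell
# Enable expandable segments and leave more headroom for runtime allocations
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
vllm serve Qwen/Qwen2.5-7B-Instruct --gpu-memory-utilization 0.85 --max-model-len 8192
```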
### 14. Failed to enable NPU graph mode when running DeepSeek.
### 14. Failed to enable NPU graph mode when running DeepSeek
Enabling NPU graph mode for DeepSeek may trigger an error. This is because when both MLA and NPU graph mode are active, the number of queries per KV head must be 32, 64, or 128. However, DeepSeek-V2-Lite has only 16 attention heads, which results in 16 queries per KV head, a value outside the supported range. Support for NPU graph mode on DeepSeek-V2-Lite will be added in a future update.
And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tensor parallel split, num_heads/num_kv_heads is {32, 64, 128}.
@@ -152,10 +159,12 @@ And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tenso
[rank0]: EZ9999: [PID: 62938] 2025-05-27-06:52:12.455.807 numHeads / numKvHeads = 8, MLA only support {32, 64, 128}.[FUNC:CheckMlaAttrs][FILE:incre_flash_attention_tiling_check.cc][LINE:1218]
```
### 15. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend.
### 15. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend
You may encounter the problem of C compilation failure when reinstalling vllm-ascend from source using pip. If the installation fails, use `python setup.py install` (recommended) to install, or use `python setup.py clean` to clear the cache.
### 16. How to generate deterministic results when using vllm-ascend?
There are several factors that affect output certainty:
1. Sampler method: using **Greedy sample** by setting `temperature=0` in `SamplingParams`, e.g.:
@@ -193,18 +202,20 @@ export ATB_LLM_LCOC_ENABLE=0
```
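For online serving, the same greedy setting can be applied per request; a minimal sketch (endpoint and model name are placeholders):

```shell
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-0.5B-Instruct", "prompt": "The future of AI is", "temperature": 0, "max_tokens": 32}'
```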
### 17. How to fix the error "ImportError: Please install vllm[audio] for audio support" for the Qwen2.5-Omni model
The `Qwen2.5-Omni` model requires the `librosa` package. Install the `qwen-omni-utils` package to ensure all dependencies are met: `pip install qwen-omni-utils`.
This package installs `librosa` and its related dependencies, resolving the `ImportError: No module named 'librosa'` issue and ensuring that the audio processing functionality works correctly.
### 18. How to troubleshoot and resolve size capture failures resulting from stream resource exhaustion, and what are the underlying causes?
```
```shell
error example in detail:
ERROR 09-26 10:48:07 [model_runner_v1.py:3029] ACLgraph sizes capture fail: RuntimeError:
ERROR 09-26 10:48:07 [model_runner_v1.py:3029] ACLgraph has insufficient available streams to capture the configured number of sizes.Please verify both the availability of adequate streams and the appropriateness of the configured size count.
```
Recommended mitigation strategies:
1. Manually configure the compilation_config parameter with a reduced size set: '{"cudagraph_capture_sizes":[size1, size2, size3, ...]}'.
2. Employ ACLgraph's full graph mode as an alternative to the piece-wise approach.
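A minimal sketch of strategy 1, assuming the service is launched via `vllm serve` (the size list is illustrative):

```shell
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8]}'
```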
@@ -212,6 +223,7 @@ Root cause analysis:
The current stream requirement calculation for size captures only accounts for measurable factors including: data parallel size, tensor parallel size, expert parallel configuration, piece graph count, multistream overlap shared expert settings, and HCCL communication mode (AIV/AICPU). However, numerous unquantifiable elements, such as operator characteristics and specific hardware features, consume additional streams outside of this calculation framework, resulting in stream resource exhaustion during size capture operations.
### 19. How to install custom version of torch_npu?
torch-npu will be overridden when installing vllm-ascend. If you need to install a specific version of torch-npu, you can manually install the specified version of torch-npu after vllm-ascend is installed.
### 20. On certain systems (e.g., Kylin OS), `docker pull` may fail with an `invalid tar header` error
@@ -241,10 +253,12 @@ This is often due to system compatibility issues. You can resolve this by using
Copy the `vllm_ascend_<tag>.tar` file (where `<tag>` is the image tag you used) to your target machine
### 21. Why am I getting an error when executing the script to start a Docker container? The error message is: "operation not permitted".
### 21. Why am I getting an error when executing the script to start a Docker container? The error message is: "operation not permitted"
When using `--shm-size`, you may need to add the `--privileged=true` flag to your `docker run` command to grant the container necessary permissions. Please be aware that using `--privileged=true` grants the container extensive privileges on the host system, which can be a security risk. Only use this option if you understand the implications and trust the container's source.
### 22. How to achieve low latency in a small batch scenario?
The performance of `torch_npu.npu_fused_infer_attention_score` in small batch scenarios is not satisfactory, mainly due to the lack of a flash decoding function. We offer an alternative operator in `tools/install_flash_infer_attention_score_ops_a2.sh` and `tools/install_flash_infer_attention_score_ops_a3.sh`; you can install it with the following instructions:
```bash

View File

@@ -18,6 +18,7 @@ This document describes how to install vllm-ascend manually.
| NNAL | == 8.3.RC2 | Required for libatb.so, enables advanced tensor operations |
There are two installation methods:
- **Using pip**: first prepare env manually or via CANN image, then install `vllm-ascend` using pip.
- **Using docker**: use the `vllm-ascend` pre-built docker image directly.
@@ -184,6 +185,7 @@ If you encounter other problems during compiling, it is probably because unexpec
`vllm-ascend` offers Docker images for deployment. You can just pull the **prebuilt image** from the image repository [ascend/vllm-ascend](https://quay.io/repository/ascend/vllm-ascend?tab=tags) and run it with bash.
Supported images are as follows.
| image name | Hardware | OS |
|-|-|-|
| image-tag | Atlas A2 | Ubuntu |
@@ -306,16 +308,17 @@ Prompt: 'The future of AI is', Generated text: ' not bright\n\nThere is no doubt
```
## Multi-node Deployment
### Verify Multi-Node Communication
First, check physical layer connectivity, then verify each node, and finally verify the inter-node connectivity.
#### Physical Layer Requirements:
#### Physical Layer Requirements
- The physical machines must be located on the same WLAN, with network connectivity.
- All NPUs are connected with optical modules, and the connection status must be normal.
#### Each Node Verification:
#### Each Node Verification
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
@@ -362,8 +365,10 @@ Execute the following commands on each node in sequence. The results must all be
::::
:::::
#### Interconnect Verification:
#### Interconnect Verification
##### 1. Get NPU IP Addresses
:::::{tab-set}
:sync-group: multi-node

View File

@@ -3,6 +3,7 @@
## Prerequisites
### Supported Devices
- Atlas A2 training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
- Atlas 800I A2 inference series (Atlas 800I A2)
- Atlas A3 training series (Atlas 800T A3, Atlas 900 A3 SuperPoD, Atlas 9000 A3 SuperPoD)
@@ -140,7 +141,7 @@ vllm serve Qwen/Qwen2.5-0.5B-Instruct &
If you see a log as below:
```
```shell
INFO: Started server process [3594]
INFO: Waiting for application startup.
INFO: Application startup complete.
@@ -183,7 +184,7 @@ vLLM is serving as a background process, you can use `kill -2 $VLLM_PID` to stop
The output is as below:
```
```shell
INFO: Shutting down FastAPI HTTP server.
INFO: Shutting down
INFO: Waiting for application shutdown.

View File

@@ -69,6 +69,7 @@ docker run --rm \
If you want to deploy multi-node environment, you need to set up environment on each node.
## Deployment
### Service-oriented Deployment
- `DeepSeek-R1-W8A8`: requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes.
@@ -119,6 +120,7 @@ vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
**Notice:**
The parameters are explained as follows:
- Setting the environment variable `VLLM_ASCEND_ENABLE_MLAPO=1` enables a fusion operator that can significantly improve performance, though it requires more NPU memory. It is therefore recommended to enable this option when sufficient NPU memory is available.
- Setting the environment variable `VLLM_ASCEND_BALANCE_SCHEDULING=1` enables balance scheduling. This may help increase output throughput and reduce TPOT in v1 scheduler. However, TTFT may degrade in some scenarios. Furthermore, enabling this feature is not recommended in scenarios where PD is separated.
- For single-node deployment, we recommend using `dp4tp4` instead of `dp2tp8`.
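A hedged launch fragment showing how these switches fit together (flags other than those discussed above are placeholders):

```shell
export VLLM_ASCEND_ENABLE_MLAPO=1          # fusion operator, requires extra NPU memory
export VLLM_ASCEND_BALANCE_SCHEDULING=1    # balance scheduling; avoid with disaggregated PD
vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
  --data-parallel-size 4 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel
```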
@@ -284,6 +286,7 @@ lm_eval \
3. After execution, you can get the result.
## Performance
### Using AISBench
Refer to [Using AISBench for performance evaluation](../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
@@ -295,6 +298,7 @@ Run performance evaluation of `DeepSeek-R1-W8A8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
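For instance, a serving benchmark against an already-running endpoint might look like this (host, port, and dataset settings are placeholders):

```shell
vllm bench serve \
  --model vllm-ascend/DeepSeek-R1-W8A8 \
  --host 127.0.0.1 --port 8000 \
  --dataset-name random --random-input-len 1024 --random-output-len 512 \
  --num-prompts 200
```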

View File

@@ -23,6 +23,7 @@ Refer to [feature guide](../user_guide/feature_guide/index.md) to get the featur
## Environment Preparation
### Model Weight
- `DeepSeek-V3.1`(BF16 version): [Download model weight](https://www.modelscope.cn/models/deepseek-ai/DeepSeek-V3.1).
- `DeepSeek-V3.1-w8a8`(Quantized version without mtp): [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-w8a8).
- `DeepSeek-V3.1_w8a8mix_mtp`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8). Please modify `torch_dtype` from `float16` to `bfloat16` in `config.json`.
@@ -131,6 +132,7 @@ vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
**Notice:**
The parameters are explained as follows:
- Setting the environment variable `VLLM_ASCEND_ENABLE_MLAPO=1` enables a fusion operator that can significantly improve performance, though it requires more NPU memory. It is therefore recommended to enable this option when sufficient NPU memory is available.
- Setting the environment variable `VLLM_ASCEND_BALANCE_SCHEDULING=1` enables balance scheduling. This may help increase output throughput and reduce TPOT in v1 scheduler. However, TTFT may degrade in some scenarios. Furthermore, enabling this feature is not recommended in scenarios where PD is separated.
- For single-node deployment, we recommend using `dp4tp4` instead of `dp2tp8`.
@@ -257,7 +259,8 @@ vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
We recommend using Mooncake for deployment: [Mooncake](./pd_disaggregation_mooncake_multi_node.md).
Take Atlas 800 A3 (64G × 16) for example, we recommend deploying 2P1D (4 nodes) rather than 1P1D (2 nodes), because there is not enough NPU memory to serve high concurrency in the 1P1D case.
- `DeepSeek-V3.1-w8a8-mtp-QuaRot 2P1D Layerwise` requires 4 Atlas 800 A3 (64G × 16).
To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need to deploy a `launch_dp_program.py` script and a `run_dp_template.sh` script on each node and deploy a `proxy.sh` script on prefill master node to forward requests.
@@ -659,6 +662,7 @@ curl http://<node0_ip>:<port>/v1/completions \
Here are two accuracy evaluation methods.
### Using AISBench
1. Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details.
2. After execution, you can get the result, here is the result of `DeepSeek-V3.1-w8a8-mtp-QuaRot` in `vllm-ascend:0.11.0rc1` for reference only.
@@ -669,6 +673,7 @@ Here are two accuracy evaluation methods.
| gsm8k | - | accuracy | gen | 96.28 | 1 Atlas 800 A3 (64G × 16) |
### Using Language Model Evaluation Harness
Not test yet.
## Performance
@@ -684,6 +689,7 @@ Run performance evaluation of `DeepSeek-V3.1-w8a8-mtp-QuaRot` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

View File

@@ -153,9 +153,10 @@ In this tutorial, we suppose you downloaded the model weight to `/root/.cache/`.
We'd like to show the deployment guide of `DeepSeek-V3.2` on multi-node environment with 1P1D for better performance.
Before you start, please
1. prepare the script `launch_online_dp.py` on each node.
```
```python
import argparse
import multiprocessing
import os
@@ -260,7 +261,7 @@ Before you start, please
1. Prefill node 0
```
```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.105 # change to your own ip
@@ -333,7 +334,7 @@ Before you start, please
2. Prefill node 1
```
```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.113 # change to your own ip
@@ -406,7 +407,7 @@ Before you start, please
3. Decode node 0
```
```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.117 # change to your own ip
@@ -484,7 +485,7 @@ Before you start, please
4. Decode node 1
```
```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.181 # change to your own ip
@@ -564,28 +565,28 @@ Once the preparation is done, you can start the server with the following comman
1. Prefill node 0
```
```shell
# change ip to your own
python launch_online_dp.py --dp-size 2 --tp-size 16 --dp-size-local 1 --dp-rank-start 0 --dp-address 141.61.39.105 --dp-rpc-port 12890 --vllm-start-port 9100
```
2. Prefill node 1
```
```shell
# change ip to your own
python launch_online_dp.py --dp-size 2 --tp-size 16 --dp-size-local 1 --dp-rank-start 1 --dp-address 141.61.39.105 --dp-rpc-port 12890 --vllm-start-port 9100
```
3. Decode node 0
```
```shell
# change ip to your own
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 0 --dp-address 141.61.39.117 --dp-rpc-port 12777 --vllm-start-port 9100
```
4. Decode node 1
```
```shell
# change ip to your own
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 4 --dp-address 141.61.39.117 --dp-rpc-port 12777 --vllm-start-port 9100
```
@@ -656,6 +657,7 @@ Run performance evaluation of `DeepSeek-V3.2-W8A8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

View File

@@ -17,6 +17,7 @@ Refer to [feature guide](../user_guide/feature_guide/index.md) to get the featur
## Environment Preparation
### Model Weight
- `GLM-4.5`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.5).
- `GLM-4.6`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.6).
- `GLM-4.7`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.7).
@@ -102,6 +103,7 @@ vllm serve /weight/glm4.5_w8a8_with_float_mtp \
**Notice:**
The parameters are explained as follows:
- For single-node deployment, we recommend using `dp1tp16` and turning off expert parallelism in low-latency scenarios.
- `--async-scheduling` enables asynchronous scheduling, a technique used to optimize inference efficiency. It allows non-blocking task scheduling to improve concurrency and throughput, especially when processing large-scale models.
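A minimal sketch of a `dp1tp16` launch with asynchronous scheduling enabled (other flags omitted for brevity):

```shell
vllm serve /weight/glm4.5_w8a8_with_float_mtp \
  --tensor-parallel-size 16 \
  --async-scheduling
```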
@@ -118,6 +120,7 @@ Not test yet.
Here are two accuracy evaluation methods.
### Using AISBench
1. Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details.
2. After execution, you can get the result, here is the result of `GLM4.6` in `vllm-ascend:main` (after `vllm-ascend:0.13.0rc1`) for reference only.
@@ -144,6 +147,7 @@ Run performance evaluation of `GLM-4.x` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

View File

@@ -44,6 +44,7 @@ docker run --rm \
```
## Verify the Quantized Model
Please edit the value of `"quantization_config.config_groups.group_0.targets"` from `["Linear"]` to `["MoE"]` in the `config.json` of the original model downloaded from [Hugging Face](https://huggingface.co/moonshotai/Kimi-K2-Thinking).
```json

View File

@@ -19,6 +19,7 @@ Refer to [feature guide](../user_guide/feature_guide/index.md) to get the featur
### Model Weight
The models below require 1 Atlas 800I A2 (64G × 8) node or 1 Atlas 800 A3 (64G × 16) node:
- `Qwen2.5-VL-3B-Instruct`: [Download model weight](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct)
- `Qwen2.5-VL-7B-Instruct`: [Download model weight](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct)
- `Qwen2.5-VL-32B-Instruct`:[Download model weight](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct)
@@ -188,7 +189,7 @@ print(generated_text)
If you run this script successfully, you can see the info shown below:
```
```shell
**Visual Components:**
1. **Abstract Geometric Icon (Left Side):**
@@ -284,7 +285,7 @@ print(generated_text)
If you run this script successfully, you can see the info shown below:
```
```shell
The image displays a logo and text related to the Qwen model, which is an artificial intelligence (AI) language model developed by Alibaba Cloud. Here is a detailed description of the elements in the image:
### **1. Logo:**
@@ -469,6 +470,7 @@ INFO 12-05 08:51:20 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 toke
### Using Language Model Evaluation Harness
The accuracy of some models is already within our CI monitoring scope, including:
- `Qwen2.5-VL-7B-Instruct`
- `Qwen3-VL-8B-Instruct`
@@ -547,6 +549,7 @@ lm_eval \
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

View File

@@ -153,11 +153,13 @@ Results and logs are saved to `benchmark/outputs/default/`. A sample accuracy re
Refer to [Using AISBench for performance evaluation](../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
### Using vLLM Benchmark
Run performance evaluation of `Qwen2.5-7B-Instruct` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

View File

@@ -196,6 +196,7 @@ Run performance evaluation of `Qwen2.5-Omni-7B` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

View File

@@ -124,19 +124,21 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
```
**Notice:**
- [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B#processing-long-texts) originally only supports 40960 context(max_position_embeddings). If you want to use it and its related quantization weights to run long seqs (such as 128k context), it is required to use yarn rope-scaling technique.
- For vLLM versions equal to or newer than `v0.12.0`, use the parameter: `--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \`.
- For vLLM versions below `v0.12.0`, use the parameter: `--rope_scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \`.
If you are using weights like [Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507), which originally support long contexts, there is no need to add this parameter.
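Putting this together, a hedged long-context launch on vLLM v0.12.0 or newer could look like the following (the context length is illustrative):

```shell
vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
  --max-model-len 131072 \
  --hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}'
```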
The parameters are explained as follows:
- `--data-parallel-size` 1 and `--tensor-parallel-size` 8 are common settings for data parallelism (DP) and tensor parallelism (TP) sizes.
- `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request.
- `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency.
- `--max-num-batched-tokens` represents the maximum number of tokens that the model can process in a single step. Currently, vLLM v1 scheduling enables ChunkPrefill/SplitFuse by default, which means:
- (1) If the input length of a request is greater than `--max-num-batched-tokens`, it will be divided into multiple rounds of computation according to `--max-num-batched-tokens`;
- (2) Decode requests are prioritized for scheduling, and prefill requests are scheduled only if there is available capacity.
- Generally, if `--max-num-batched-tokens` is set to a larger value, the overall latency will be lower, but the pressure on GPU memory (activation value usage) will be greater.
- `--gpu-memory-utilization` represents the proportion of HBM that vLLM will use for actual inference. Its essential function is to calculate the available kv_cache size. During the warm-up phase (referred to as profile run in vLLM), vLLM records the peak GPU memory usage during an inference process with an input size of `--max-num-batched-tokens`. The available kv_cache size is then calculated as: `--gpu-memory-utilization` * HBM size - peak GPU memory usage. Therefore, the larger the value of `--gpu-memory-utilization`, the more kv_cache can be used. However, since the GPU memory usage during the warm-up phase may differ from that during actual inference (e.g., due to uneven EP load), setting `--gpu-memory-utilization` too high may lead to OOM (Out of Memory) issues during actual inference. The default value is `0.9`.
- `--enable-expert-parallel` indicates that EP is enabled. Note that vLLM does not support a mixed approach of ETP and EP; that is, MoE can either use pure EP or pure TP.
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable it, remove this option.
@@ -147,7 +149,8 @@ The parameters are explained as follows:
- `export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that Flashcomm1 optimization is enabled. Currently, this optimization is only supported for MoE in scenarios where tp_size > 1.
### Multi-node Deployment with MP (Recommended)
Assume you have Atlas 800 A3 (64G*16) nodes (or 2 * A2), and want to deploy the `Qwen3-VL-235B-A22B-Instruct` model across multiple nodes.
Assume you have Atlas 800 A3 (64G*16) nodes (or 2* A2), and want to deploy the `Qwen3-VL-235B-A22B-Instruct` model across multiple nodes.
Node 0
@@ -239,7 +242,7 @@ vllm serve vllm-ascend/Qwen3-235B-A22B \
If the service starts successfully, the following information will be displayed on node 0:
```
```shell
INFO: Started server process [44610]
INFO: Waiting for application startup.
INFO: Application startup complete.
@@ -298,6 +301,7 @@ Run performance evaluation of `Qwen3-235B-A22B-w8a8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
@@ -392,6 +396,7 @@ Reference test results:
| 720 | 144 | 4717.45 | 48.69 | 2761.72 |
Note:
1. Setting `export VLLM_ASCEND_ENABLE_FUSED_MC2=1` enables MoE fused operators that reduce the time consumed by MoE in both prefill and decode. This is an experimental feature that currently only supports W8A8 quantization on Atlas A3 servers. If you encounter any problems when using this feature, you can disable it by setting `export VLLM_ASCEND_ENABLE_FUSED_MC2=0` and report the issue to the vLLM-Ascend community. A toggle example is shown after these notes.
2. Here we disable prefix cache because of random datasets. You can enable prefix cache if requests have long common prefix.
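In practice, the switch is just an environment variable exported before starting the service:

```shell
export VLLM_ASCEND_ENABLE_FUSED_MC2=1   # set to 0 to disable the fused MoE path if problems occur
```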
@@ -597,7 +602,7 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
PD proxy:
```
```shell
python load_balance_proxy_server_example.py --port 12347 --prefiller-hosts prefill_node_1_ip --prefiller-port 8000 --decoder-hosts decode_node_1_ip --decoder-ports 8000
```
@@ -624,4 +629,5 @@ Reference test results:
| 2880 | 576 | 3735.98 | 52.07 | 8593.44 |
Note:
1. We recommend setting `export VLLM_ASCEND_ENABLE_FUSED_MC2=2` in this scenario (typically EP32 for Qwen3-235B). This enables a different MoE fusion operator.

View File

@@ -36,7 +36,7 @@ docker run --rm \
:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded,
see https://www.modelscope.cn/models/vllm-ascend/Qwen3-32B-W4A4
see <https://www.modelscope.cn/models/vllm-ascend/Qwen3-32B-W4A4>
:::
```bash

View File

@@ -1,6 +1,7 @@
# Qwen3-8B-W4A8
## Run Docker Container
:::{note}
w4a8 quantization feature is supported by v0.9.1rc2 and later.
:::
@@ -27,9 +28,10 @@ docker run --rm \
```
## Install modelslim and Convert Model
:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded,
see https://www.modelscope.cn/models/vllm-ascend/Qwen3-8B-W4A8
see <https://www.modelscope.cn/models/vllm-ascend/Qwen3-8B-W4A8>
:::
```bash
@@ -69,6 +71,7 @@ python quant_qwen.py \
```
## Verify the Quantized Model
The converted model files look like:
```bash

View File

@@ -11,6 +11,7 @@ This document will show the main verification steps of the model, including supp
The Qwen3 Dense models were first supported in [v0.8.4rc2](https://github.com/vllm-project/vllm-ascend/blob/main/docs/source/user_guide/release_notes.md#v084rc2---20250429)
## **Note**
This example requires version **v0.11.0rc2**. Earlier versions may lack certain features.
## Supported Features
@@ -105,15 +106,17 @@ If you want to deploy multi-node environment, you need to set up environment on
In this section, we will demonstrate best practices for adjusting hyperparameters in vLLM-Ascend to maximize inference throughput performance. By tailoring service-level configurations to fit different use cases, you can ensure that your system performs optimally across various scenarios. We will guide you through how to fine-tune hyperparameters based on observed phenomena, such as max_model_len, max_num_batched_tokens, and cudagraph_capture_sizes, to achieve the best performance.
The specific example scenario is as follows:
- The machine environment is an Atlas 800 A3 (64G*16)
- The LLM is Qwen3-32B-W8A8
- The data scenario is a fixed-length input of 3.5K and an output of 1.5K.
- The parallel configuration requirement is DP=1&TP=4
- If the machine environment is an **Atlas 800I A2(64G*8)**, the deployment approach stays identical.
### Run docker container:
### Run docker container
#### **Note**
- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
- v0.11.0rc2-a3 is image tag, replace this with your actual tag.
- replace this with your actual port: '-p 8113:8113'.
@@ -191,6 +194,7 @@ vllm serve /model/Qwen3-32B-W8A8 \
```
#### **Note**
- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
- If the model is not a quantized model, remove the `--quantization ascend` parameter.
@@ -219,6 +223,7 @@ curl http://localhost:8113/v1/chat/completions -H "Content-Type: application/jso
Run the following script to execute offline inference on multi-NPU.
#### **Note**
- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
- If the model is not a quantized model, remove the `quantization="ascend"` parameter.
@@ -290,6 +295,7 @@ Run performance evaluation of `Qwen3-32B-W8A8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
@@ -297,6 +303,7 @@ There are three `vllm bench` subcommand:
Take the `serve` as an example. Run the code as follows.
#### **Note**
- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
```shell
@@ -306,19 +313,23 @@ vllm bench serve --model /model/Qwen3-32B-W8A8 --served-model-name qwen3 --port
After about several minutes, you can get the performance evaluation result.
## Key Optimization Points
In this section, we will cover the key optimization points that can significantly improve the performance of Qwen Dense models. These techniques are designed to enhance throughput and efficiency across various scenarios.
### 1. Rope Optimization
Rope optimization enhances the model's efficiency by modifying the position encoding process. Specifically, it ensures that the cos_sin_cache computation and the associated index selection operation are performed only in the first layer of the forward pass. For subsequent layers, the position encoding is reused directly, eliminating redundant calculations and significantly speeding up inference in the decode phase.
This optimization is enabled by default and does not require any additional environment variables to be set.
### 2. AddRMSNormQuant Fusion
AddRMSNormQuant fusion merges the residual addition, RMSNorm (root mean square normalization), and quantization operations into a single fused kernel, allowing for more efficient memory access and computation, thereby enhancing throughput.
This optimization is enabled by default and does not require any additional environment variables to be set.
### 3. FlashComm_v1
FlashComm_v1 significantly improves performance in large-batch scenarios by decomposing the traditional allreduce collective communication into reduce-scatter and all-gather. This breakdown reduces the RMSNorm computation along the token dimension, leading to more efficient processing. In quantization scenarios, FlashComm_v1 also reduces communication overhead by transferring lower-bit data, which further minimizes end-to-end latency during the prefill phase.
It is important to note that the decomposition of the allreduce communication into reduce-scatter and all-gather operations only provides benefits in high-concurrency scenarios, where there is no significant communication degradation. In other cases, this decomposition may result in noticeable performance degradation. To mitigate this, the current implementation uses a threshold-based approach, where FlashComm_v1 is only enabled if the actual token count for each inference schedule exceeds the threshold. This ensures that the feature is only activated in scenarios where it improves performance, avoiding potential degradation in lower-concurrency situations.
@@ -326,11 +337,13 @@ It is important to note that the decomposition of the allreduce communication in
This optimization requires setting the environment variable `VLLM_ASCEND_ENABLE_FLASHCOMM1 = 1` to be enabled.
### 4. Matmul and ReduceScatter Fusion
Once FlashComm_v1 is enabled, an additional optimization can be applied. This optimization fuses matrix multiplication and ReduceScatter operations, along with tiling optimization. The Matmul computation is treated as one pipeline, while the ReduceScatter and dequant operations are handled in a separate pipeline. This approach significantly reduces communication steps, improves computational efficiency, and allows for better resource utilization, resulting in enhanced throughput, especially in large-scale distributed environments.
This optimization is automatically enabled once FlashComm_v1 is activated. However, due to an issue with performance degradation in small-concurrency scenarios after this fusion, a threshold-based approach is currently used to mitigate this problem. The optimization is only applied when the token count exceeds the threshold, ensuring that it is not enabled in cases where it could negatively impact performance.
### 5. Weight Prefetching
Weight prefetching optimizes memory usage by preloading weights into the cache before they are needed, minimizing delays caused by memory access during model execution.
In dense model scenarios, the MLP's gate_up_proj and down_proj linear layers often exhibit relatively high MTE utilization. To address this, we create a separate pipeline specifically for weight prefetching, which runs in parallel with the original vector computation pipeline, such as RMSNorm and SiLU, before the MLP. This approach allows the weights to be preloaded to L2 cache ahead of time, reducing MTE utilization during the MLP computations and indirectly improving Cube computation efficiency by minimizing resource contention and optimizing data flow.
@@ -340,16 +353,19 @@ It is important to emphasize that, since we use vector computations to hide the
This optimization requires setting the environment variable `VLLM_ASCEND_ENABLE_PREFETCH_MLP = 1` to be enabled.
### 6. Zerolike Elimination
This elimination removes unnecessary operations related to zero-like tensors in Attention forward, improving the efficiency of matrix operations and reducing memory usage.
This optimization is enabled by default and does not require any additional environment variables to be set.
### 7. FullGraph Optimization
ACLGraph offers several key optimizations to improve model execution efficiency. By replaying the entire model execution graph at once, we significantly reduce dispatch latency compared to multiple smaller replays. This approach also stabilizes multi-device performance, as capturing the model as a single static graph mitigates dispatch fluctuations across devices. Additionally, consolidating graph captures frees up streams, allowing for the capture of more graphs and optimizing resource usage, ultimately leading to improved system efficiency and reduced overhead.
The configuration `compilation_config = {"cudagraph_mode": "FULL_DECODE_ONLY"}` is used when starting the service. This setup is necessary to enable ACLGraph's full decode-only mode.
### 8. Asynchronous Scheduling
Asynchronous scheduling is a technique used to optimize inference efficiency. It allows non-blocking task scheduling to improve concurrency and throughput, especially when processing large-scale models.
This optimization is enabled by setting `--async-scheduling`.
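A consolidated sketch of the switches from points 3 to 8 above, applied to the Qwen3-32B-W8A8 TP=4 scenario (paths and port are placeholders):

```shell
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1     # point 3: FlashComm_v1 (also activates the fusion in point 4)
export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1   # point 5: MLP weight prefetching
vllm serve /model/Qwen3-32B-W8A8 \
  --tensor-parallel-size 4 \
  --quantization ascend \
  --async-scheduling \
  --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
  --port 8113
```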
@@ -359,16 +375,19 @@ This optimization is enabled by setting `--async-scheduling`.
Building on the specific example scenarios outlined earlier, this section highlights the key tuning points that played a crucial role in achieving optimal performance. By focusing on the most impactful adjustments to hyperparameters and optimizations, we'll emphasize the strategies that can be leveraged to maximize throughput, minimize latency, and ensure efficient resource utilization in various environments. These insights will help guide you in fine-tuning your own configurations for the best possible results.
### 1. Prefetch Buffer Size
Setting the right prefetch buffer size is essential for optimizing weight loading and the size of this prefetch buffer is directly related to the time that can be hidden by vector computations. To achieve a near-perfect overlap between the prefetch and computation streams, you can flexibly adjust the buffer size by profiling and observing the degree of overlap at different buffer sizes.
For example, in the real-world scenario mentioned above, we set the prefetch buffer size for the gate_up_proj and down_proj in the MLP to 18 MB. At this value, the vector computations of RMSNorm and SiLU can effectively hide the prefetch stream, thereby accelerating the Matmul computations of the two linear layers.
### 2. Max-num-batched-tokens
The max-num-batched-tokens parameter determines the maximum number of tokens that can be processed in a single batch. Adjusting this value helps to balance throughput and memory usage. Setting this value too small can negatively impact end-to-end performance, as fewer tokens are processed per batch, potentially leading to inefficiencies. Conversely, setting it too large increases the risk of Out of Memory (OOM) errors due to excessive memory consumption.
In the above real-world scenario, we not only conducted extensive testing to determine the most cost-effective value, but also took into account the accumulation of decode tokens when enabling chunked prefill. If the value is set too small, a single request may be chunked multiple times, and during the early stages of inference, a batch may contain only a small number of decode tokens. This can result in the end-to-end throughput falling short of expectations.
### 3. Cudagraph_capture_sizes
The cudagraph_capture_sizes parameter controls the granularity of graph captures during the inference process. Adjusting this value determines how much of the computation graph is captured at once, which can significantly impact both performance and memory usage.
If this list is not manually specified, it will be filled with a series of evenly distributed values, which typically ensures good performance. However, if you want to fine-tune it further, manually specifying the values will yield better results, as shown in the sketch below. This is because if the batch size falls between two sizes, the framework automatically pads the token count to the larger size, which often causes actual performance to deviate from expectations or even degrade.
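For example, a manually specified list matched to the expected decode batch sizes (the values are illustrative, not a recommendation) can be passed at startup:

```shell
vllm serve /model/Qwen3-32B-W8A8 \
  --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [8, 16, 24, 32, 48, 64]}'
```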

View File

@@ -174,6 +174,7 @@ Run performance evaluation of `Qwen3-Next` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

View File

@@ -7,11 +7,13 @@ Qwen3-Omni is the natively end-to-end multilingual omni-modal foundation models.
This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node deployment, accuracy and performance evaluation.
## Supported Features
Refer to [supported features](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user_guide/support_matrix/supported_models.html) to get the model's supported feature matrix.
Refer to [feature guide](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user_guide/feature_guide/index.html) to get the feature's configuration.
## Environment Preparation
### Model Weight
- `Qwen3-Omni-30B-A3B-Thinking` requires 2 NPU cards (64G × 2). [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-Thinking)
@@ -77,7 +79,9 @@ ffmpeg -version
```
## Deployment
### Single-node Deployment
#### Offline Inference on Multi-NPU
Run the following script to execute offline inference on multi-NPU:
@@ -177,6 +181,7 @@ vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --tensor-parallel-size 2 --enable_ex
```
## Functional Verification
Once your server is started, you can query the model with input prompts.
```bash
@@ -225,7 +230,8 @@ Here are accuracy evaluation methods.
### Using EvalScope
As an example, take the `gsm8k` `omnibench` `bbh` dataset as a test dataset, and run accuracy evaluation of `Qwen3-Omni-30B-A3B-Thinking` in online mode.
1. Refer to Using evalscope(https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html#install-evalscope-using-pip) for `evalscope`installation.
1. Refer to Using evalscope(<https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html#install-evalscope-using-pip>) for `evalscope` installation.
2. Run `evalscope` to execute the accuracy evaluation.
```bash
@@ -258,11 +264,13 @@ evalscope eval \
## Performance
### Using vLLM Benchmark
Run performance evaluation of `Qwen3-Omni-30B-A3B-Thinking` as an example.
Refer to vllm benchmark for more details.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

View File

@@ -86,7 +86,8 @@ If you want to deploy multi-node environment, you need to set up environment on
## Deployment
### Multi-node Deployment with MP (Recommended)
Assume you have Atlas 800 A3 (64G*16) nodes (or 2 * A2), and want to deploy the `Qwen3-VL-235B-A22B-Instruct` model across multiple nodes.
Assume you have Atlas 800 A3 (64G*16) nodes (or 2* A2), and want to deploy the `Qwen3-VL-235B-A22B-Instruct` model across multiple nodes.
Node 0
@@ -179,12 +180,13 @@ vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
```
The parameters are explained as follows:
- `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request.
- `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency.
- `--max-num-batched-tokens` represents the maximum number of tokens that the model can process in a single step. Currently, vLLM v1 scheduling enables ChunkPrefill/SplitFuse by default, which means:
- (1) If the input length of a request is greater than `--max-num-batched-tokens`, it will be divided into multiple rounds of computation according to `--max-num-batched-tokens`;
- (2) Decode requests are prioritized for scheduling, and prefill requests are scheduled only if there is available capacity.
- Generally, if `--max-num-batched-tokens` is set to a larger value, the overall latency will be lower, but the pressure on GPU memory (activation value usage) will be greater.
- `--gpu-memory-utilization` represents the proportion of HBM that vLLM will use for actual inference. Its essential function is to calculate the available kv_cache size. During the warm-up phase (referred to as profile run in vLLM), vLLM records the peak GPU memory usage during an inference process with an input size of `--max-num-batched-tokens`. The available kv_cache size is then calculated as: `--gpu-memory-utilization` * HBM size - peak GPU memory usage. Therefore, the larger the value of `--gpu-memory-utilization`, the more kv_cache can be used. However, since the GPU memory usage during the warm-up phase may differ from that during actual inference (e.g., due to uneven EP load), setting `--gpu-memory-utilization` too high may lead to OOM (Out of Memory) issues during actual inference. The default value is `0.9`.
- `--enable-expert-parallel` indicates that EP is enabled. Note that vLLM does not support a mixed approach of ETP and EP; that is, MoE can either use pure EP or pure TP.
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable it, remove this option.
@@ -196,7 +198,7 @@ The parameters are explained as follows:
If the service starts successfully, the following information will be displayed on node 0:
```
```shell
INFO: Started server process [44610]
INFO: Waiting for application startup.
INFO: Application startup complete.
@@ -259,6 +261,7 @@ Run performance evaluation of `Qwen3-VL-235B-A22B-Instruct` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

View File

@@ -1,6 +1,7 @@
# Qwen3-Embedding
## Introduction
The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B). This guide describes how to run the model with vLLM Ascend. Note that only 0.9.2rc1 and higher versions of vLLM Ascend support the model.
## Supported Features
@@ -16,11 +17,15 @@ Refer to [supported features](../user_guide/support_matrix/supported_models.md)
- `Qwen3-Embedding-0.6B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B)
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
### Installation
You can use our official docker image to run `Qwen3-Embedding` series models.
- Start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).
If you don't want to use the docker image above, you can also build everything from source:
- Install `vllm-ascend` from source, refer to [installation](../installation.md).
## Deployment

View File

@@ -1,6 +1,7 @@
# Qwen3-Reranker
## Introduction
The Qwen3 Reranker model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B). This guide describes how to run the model with vLLM Ascend. Note that only 0.9.2rc1 and higher versions of vLLM Ascend support the model.
## Supported Features
@@ -18,10 +19,13 @@ Refer to [supported features](../user_guide/support_matrix/supported_models.md)
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
### Installation
You can use our official docker image to run `Qwen3-Reranker` series models.
- Start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).
If you don't want to use the docker image above, you can also build everything from source:
- Install `vllm-ascend` from source, refer to [installation](../installation.md).
## Deployment

View File

@@ -301,15 +301,16 @@ bash proxy.sh
**Notice:**
The parameters are explained as follows:
- `--tensor-parallel-size` 16 is a common setting for the tensor parallelism (TP) size.
- `--prefill-context-parallel-size` 2 is a common setting for the prefill context parallelism (PCP) size.
- `--decode-context-parallel-size` 8 is a common setting for the decode context parallelism (DCP) size.
- `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request.
- `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency.
- `--max-num-batched-tokens` represents the maximum number of tokens that the model can process in a single step. Currently, vLLM v1 scheduling enables ChunkPrefill/SplitFuse by default, which means:
- (1) If the input length of a request is greater than `--max-num-batched-tokens`, it will be divided into multiple rounds of computation according to `--max-num-batched-tokens`;
- (2) Decode requests are prioritized for scheduling, and prefill requests are scheduled only if there is available capacity.
- Generally, if `--max-num-batched-tokens` is set to a larger value, the overall latency will be lower, but the pressure on GPU memory (activation value usage) will be greater.
- (1) If the input length of a request is greater than `--max-num-batched-tokens`, it will be divided into multiple rounds of computation according to `--max-num-batched-tokens`;
- (2) Decode requests are prioritized for scheduling, and prefill requests are scheduled only if there is available capacity.
- Generally, if `--max-num-batched-tokens` is set to a larger value, the overall latency will be lower, but the pressure on GPU memory (activation value usage) will be greater.
- `--gpu-memory-utilization` represents the proportion of HBM that vLLM will use for actual inference. Its essential function is to calculate the available kv_cache size. During the warm-up phase (referred to as profile run in vLLM), vLLM records the peak GPU memory usage during an inference process with an input size of `--max-num-batched-tokens`. The available kv_cache size is then calculated as: `--gpu-memory-utilization` * HBM size - peak GPU memory usage. Therefore, the larger the value of `--gpu-memory-utilization`, the more kv_cache can be used. However, since the GPU memory usage during the warm-up phase may differ from that during actual inference (e.g., due to uneven EP load), setting `--gpu-memory-utilization` too high may lead to OOM (Out of Memory) issues during actual inference. The default value is `0.9`.
- `--enable-expert-parallel` indicates that EP is enabled. Note that vLLM does not support a mixed approach of ETP and EP; that is, MoE can either use pure EP or pure TP.
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable it, remove this option.
@@ -321,6 +322,7 @@ The parameters are explained as follows:
- `export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1` indicates that context parallel is enabled. This environment variable is required in the PD architecture but not needed in the pd co-locate deployment scenario. It will be removed in the future.
**Notice:**
- tensor-parallel-size needs to be divisible by decode-context-parallel-size.
- decode-context-parallel-size must be less than or equal to tensor-parallel-size.
@@ -351,6 +353,7 @@ Run performance evaluation of `DeepSeek-V3.1-w8a8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

View File

@@ -15,6 +15,7 @@ Using the `Qwen3-235B-A22B-w8a8`(Quantized version) model as an example, use 1 A
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
### Run with Docker
Start a Docker container on each node.
```{code-block} bash
@@ -103,19 +104,21 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
```
**Notice:**
- For vLLM versions below `v0.12.0`, use the parameter: `--rope_scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \`
- For vLLM version `v0.12.0`, use the parameter: `--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \`
The parameters are explained as follows:
- `--tensor-parallel-size` 8 is a common setting for the tensor parallelism (TP) size.
- `--prefill-context-parallel-size` 2 is a common setting for the prefill context parallelism (PCP) size.
- `--decode-context-parallel-size` 2 is a common setting for the decode context parallelism (DCP) size.
- `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request.
- `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency.
- `--max-num-batched-tokens` represents the maximum number of tokens that the model can process in a single step. Currently, vLLM v1 scheduling enables ChunkPrefill/SplitFuse by default, which means:
- (1) If the input length of a request is greater than `--max-num-batched-tokens`, it will be divided into multiple rounds of computation according to `--max-num-batched-tokens`;
- (2) Decode requests are prioritized for scheduling, and prefill requests are scheduled only if there is available capacity.
- Generally, if `--max-num-batched-tokens` is set to a larger value, the overall latency will be lower, but the pressure on GPU memory (activation value usage) will be greater.
- (1) If the input length of a request is greater than `--max-num-batched-tokens`, it will be divided into multiple rounds of computation according to `--max-num-batched-tokens`;
- (2) Decode requests are prioritized for scheduling, and prefill requests are scheduled only if there is available capacity.
- Generally, if `--max-num-batched-tokens` is set to a larger value, the overall latency will be lower, but the pressure on GPU memory (activation value usage) will be greater.
- `--gpu-memory-utilization` represents the proportion of HBM that vLLM will use for actual inference. Its essential function is to calculate the available kv_cache size. During the warm-up phase (referred to as profile run in vLLM), vLLM records the peak GPU memory usage during an inference process with an input size of `--max-num-batched-tokens`. The available kv_cache size is then calculated as: `--gpu-memory-utilization` * HBM size - peak GPU memory usage. Therefore, the larger the value of `--gpu-memory-utilization`, the more kv_cache can be used. However, since the GPU memory usage during the warm-up phase may differ from that during actual inference (e.g., due to uneven EP load), setting `--gpu-memory-utilization` too high may lead to OOM (Out of Memory) issues during actual inference. The default value is `0.9`.
- `--enable-expert-parallel` indicates that EP is enabled. Note that vLLM does not support a mixed approach of ETP and EP; that is, MoE can either use pure EP or pure TP.
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable it, remove this option.
@@ -126,6 +129,7 @@ The parameters are explained as follows:
- `export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that Flashcomm1 optimization is enabled. Currently, this optimization is only supported for MoE in scenarios where tp_size > 1.
**Notice:**
- tp_size needs to be divisible by dcp_size
- decode context parallel size must be less than or equal to max_dcp_size, where max_dcp_size = tensor_parallel_size // total_num_kv_heads (see the sketch below).
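A minimal sketch of these two checks, using illustrative values (the KV head count is model-dependent and assumed here):

```python
# Illustrative check of the constraints above; values are assumptions, not real settings.
tensor_parallel_size = 8
total_num_kv_heads = 4             # model-dependent (assumption)
decode_context_parallel_size = 2

max_dcp_size = tensor_parallel_size // total_num_kv_heads
assert tensor_parallel_size % decode_context_parallel_size == 0, \
    "tp_size must be divisible by dcp_size"
assert decode_context_parallel_size <= max_dcp_size, \
    "dcp_size must be <= tensor_parallel_size // total_num_kv_heads"
```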
@@ -156,6 +160,7 @@ Run performance evaluation of `Qwen3-235B-A22B-w8a8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

View File

@@ -126,6 +126,7 @@ for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
:::::
## Run with Docker
Start a Docker container on each node.
```{code-block} bash
@@ -172,7 +173,7 @@ docker run --rm \
## Install Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.Installation and Compilation Guide: https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries.
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and Compilation Guide: <https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.
First, we need to obtain the Mooncake project. Refer to the following command:
```shell
@@ -224,10 +225,12 @@ export LD_LIBRARY_PATH=/usr/local/lib64/python3.11/site-packages/mooncake:$LD_LI
We can run the following scripts to launch a server on the prefiller/decoder node, respectively. Please note that each P/D node will occupy ports ranging from kv_port to kv_port + num_chips to initialize socket listeners. To avoid any issues, port conflicts should be prevented. Additionally, ensure that each node's engine_id is uniquely assigned to avoid conflicts.
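The port occupancy rule above can be made concrete with a small sketch; `kv_port` and `num_chips` below are placeholder values:

```python
# Illustrative only: ports a single P/D node would occupy for its socket listeners.
kv_port = 20000    # starting kv_port configured for this node (placeholder)
num_chips = 8      # number of NPUs on the node (placeholder)

occupied = list(range(kv_port, kv_port + num_chips))
print(occupied)    # [20000, 20001, ..., 20007]; keep these ports free of other services
```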
### launch_online_dp.py
Use `launch_online_dp.py` to launch external dp vllm servers.
[launch\_online\_dp.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/launch_online_dp.py)
### run_dp_template.sh
Modify `run_dp_template.sh` on each node.
[run\_dp\_template.sh](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/run_dp_template.sh)

View File

@@ -56,6 +56,7 @@ for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
```
## Run with Docker
Start a Docker container.
```{code-block} bash
@@ -93,7 +94,7 @@ docker run --rm \
## Install Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.Installation and Compilation Guide: https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries.
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and Compilation Guide: <https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.
First, we need to obtain the Mooncake project. Refer to the following command:
```shell

View File

@@ -8,12 +8,12 @@ Multi-node inference is suitable for scenarios where the model cannot be deploye
## Verify Multi-Node Communication Environment
### Physical Layer Requirements:
### Physical Layer Requirements
* The physical machines must be located on the same LAN, with network connectivity.
* All NPUs are connected with optical modules, and the connection status must be normal.
### Verification Process:
### Verification Process
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
@@ -32,7 +32,8 @@ Execute the following commands on each node in sequence. The results must all be
cat /etc/hccn.conf
```
### NPU Interconnect Verification:
### NPU Interconnect Verification
#### 1. Get NPU IP Addresses
```bash
@@ -47,7 +48,9 @@ hccn_tool -i 0 -ping -g address 10.20.0.20
```
## Set Up and Start the Ray Cluster
### Setting Up the Basic Container
To ensure a consistent execution environment across all nodes, including the model path and Python environment, it is advised to use Docker images.
For setting up a multi-node inference cluster with Ray, **containerized deployment** is the preferred approach. Containers should be started on both the primary and secondary nodes, with the `--net=host` option to enable proper network connectivity.
@@ -88,6 +91,7 @@ docker run --rm \
```
### Start Ray Cluster
After setting up the containers and installing vllm-ascend on each node, follow the steps below to start the Ray cluster and execute inference tasks.
Choose one machine as the primary node and the others as secondary nodes. Before proceeding, use `ip addr` to check your `nic_name` (network interface name).
@@ -133,9 +137,10 @@ Once the cluster is started on multiple nodes, execute `ray status` and `ray lis
After Ray is successfully started, the following content will appear:\
A local Ray instance has started successfully.\
Dashboard URL: The access address for the Ray Dashboard (default: http://localhost:8265); Node status (CPU/memory resources, number of healthy nodes); Cluster connection address (used for adding multiple nodes).
Dashboard URL: The access address for the Ray Dashboard (default: <http://localhost:8265>); Node status (CPU/memory resources, number of healthy nodes); Cluster connection address (used for adding multiple nodes).
## Start the Online Inference Service on Multi-node scenario
In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster.
**You only need to run the vllm command on one node.**

View File

@@ -87,7 +87,7 @@ The details of each configuration option are as follows:
An example of additional configuration is as follows:
```
```python
{
"weight_prefetch_config": {
"enabled": True,

View File

@@ -9,11 +9,11 @@ This guide shows how to run **prefilldecode (PD) disaggregation** on Huawei A
Large language model inference naturally splits into two phases:
- **Prefill**
- Processes input tokens and builds the key-value (KV) cache.
- Batch-friendly, high throughput, well suited to parallel NPU execution.
- Processes input tokens and builds the key-value (KV) cache.
- Batch-friendly, high throughput, well suited to parallel NPU execution.
- **Decode**
- Consumes the KV cache to generate output tokens.
- Latency-sensitive, memory-intensive, more sequential.
- Consumes the KV cache to generate output tokens.
- Latency-sensitive, memory-intensive, more sequential.
From the client's perspective, this still looks like a single Chat / Completions endpoint.
@@ -43,7 +43,7 @@ This section uses the `deepseek-ai/DeepSeek-V2-Lite` example, but you can swap i
### 2.2 Deploy Prefill-Decode Disaggregated DeepSeek-V2-Lite on Kubernetes
A concrete example is provided in Kthena as https://github.com/volcano-sh/kthena/blob/main/examples/model-serving/prefill-decode-disaggregation.yaml
A concrete example is provided in Kthena as <https://github.com/volcano-sh/kthena/blob/main/examples/model-serving/prefill-decode-disaggregation.yaml>
Deploy it with below command:

View File

@@ -25,6 +25,7 @@ Together, these effects allow practitioners to better balance memory, communicat
## Supported Scenarios
### Models
Fine-grained TP is **model-agnostic** and supports all standard dense transformer architectures, including Llama, Qwen, DeepSeek (base/dense variants), and others.
### Component & Execution Mode Support
@@ -37,20 +38,24 @@ Finegrained TP is **model-agnostic** and supports all standard dense transformer
| **LM-head** | ✅ | ✅ | ✅ | ✅ | ✅ |
> ⚠️ Note:
>
> - `o_proj` TP is only supported in Graph mode during Decode, because dummy_run in eager mode will not trigger o_proj.
> - `mlp` TP supports dense models, or dense layers in MoE models. For example, the first three dense layers of DeepSeek-R1.
### Configuration Limit:
### Configuration Limit
The Fine-Grained TP size for any component must:
- Be **≤ the data-parallel (DP) size**, and
- **Evenly divide the DP size** (i.e., `dp_size % tp_size == 0`) to ensure valid device assignment and communication grouping.
- Be **≤ the data-parallel (DP) size**, and
- **Evenly divide the DP size** (i.e., `dp_size % tp_size == 0`) to ensure valid device assignment and communication grouping.
> ⚠️ Violating these constraints will result in runtime errors or undefined behavior.
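A minimal validation sketch for the constraints above; the sizes are illustrative, and the helper name is hypothetical rather than part of vLLM Ascend:

```python
def check_finegrained_tp(dp_size: int, tp_size: int) -> None:
    """Hypothetical helper: enforce tp_size <= dp_size and dp_size % tp_size == 0."""
    if tp_size > dp_size or dp_size % tp_size != 0:
        raise ValueError(
            f"fine-grained TP size {tp_size} must be <= DP size {dp_size} "
            f"and evenly divide it"
        )

check_finegrained_tp(dp_size=8, tp_size=4)   # ok
# check_finegrained_tp(dp_size=8, tp_size=3) # would raise ValueError
```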
---
## How to Use Fine-grained TP
### Configuration Format:
### Configuration Format
Fine-grained TP is controlled via the `finegrained_tp_config` field inside `--additional-config`.
@@ -65,7 +70,7 @@ Finegrained TP is controlled via the `finegrained_tp_config` field inside `--add
}'
```
### Example Usage:
### Example Usage
```bash
vllm serve deepseek-ai/DeepSeek-R1 \
@@ -96,6 +101,7 @@ To evaluate the effectiveness of fine-grained TP in large-scale service scenario
| **Total** | **9.72 GB** | — |
- We achieved significant gains in terms of high memory capacity on a single card, as well as the benefits of TPOT.
---
## ✅ Deployment Recommendations

View File

@@ -1,9 +1,11 @@
# Multi Token Prediction (MTP)
## Why We Need MTP
MTP boosts inference performance by parallelizing the prediction of multiple tokens, shifting from single-token to multi-token generation. This approach significantly increases generation throughput and achieves multiplicative acceleration in inference speed—all without compromising output quality.
## How to Use MTP
To enable MTP for DeepSeek-V3 models, add the following parameter when starting the service:
--speculative_config ' {"method": "mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": False} '
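For offline inference, the same configuration can be passed programmatically. A minimal sketch, assuming a local DeepSeek-V3 checkpoint path (placeholder) and a vLLM version that accepts `speculative_config` as a dict:

```python
from vllm import LLM, SamplingParams

# Placeholder model path and parallelism; adjust to your deployment.
llm = LLM(
    model="/path/to/DeepSeek-V3",
    tensor_parallel_size=8,
    speculative_config={
        "method": "mtp",
        "num_speculative_tokens": 1,
        "disable_padded_drafter_batch": False,
    },
)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```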
@@ -15,7 +17,7 @@ To enable MTP for DeepSeek-V3 models, add the following parameter when starting
### Module Architecture
```
```shell
vllm_ascend
├── sample
│ ├── rejection_sample.py
@@ -28,7 +30,7 @@ vllm_ascend
- *rejection_sample.py*: During decoding, the main model processes the previous round's output token and the predicted token together (computing 1+k tokens simultaneously). The first token is always correct, while the second token, referred to as the **bonus token**, is uncertain since it is derived from speculative prediction, so we employ a **Greedy Strategy** and a **Rejection Sampling Strategy** to determine whether the bonus token should be accepted. The module structure consists of an `AscendRejectionSampler` class with a forward method that implements the specific sampling logic.
```
```shell
rejection_sample.py
├── AscendRejectionSampler
│ ├── forward
@@ -37,9 +39,10 @@ rejection_sample.py
**2. spec_decode**
This section encompasses the model preprocessing for spec-decode, primarily structured as follows: it includes loading the model, executing a dummy run, and generating token ids. These steps collectively form the model data construction and forward invocation for a single spec-decode operation.
- *mtp_proposer.py*: Configure vLLM-Ascend to use speculative decoding where proposals are generated by the DeepSeek MTP layer.
```
```shell
mtp_proposer.py
├── Proposer
│ ├── load_model
@@ -52,6 +55,7 @@ mtp_proposer.py
### Algorithm
**1. Reject_Sample**
- *Greedy Strategy*
Verify whether the token generated by the main model matches the speculative token predicted by MTP in the previous round. If they match exactly, accept the bonus token; otherwise, reject it and any subsequent tokens derived from that speculation.
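A minimal sketch of this greedy check (illustrative only, not the actual `AscendRejectionSampler` logic): draft tokens are compared position by position with the main model's greedy outputs, and acceptance stops at the first mismatch.

```python
def greedy_accept(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    """Accept draft tokens only while they match the main model's greedy picks."""
    accepted = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break
        accepted.append(draft)
    return accepted

# The second draft token diverges, so only the first one is accepted.
print(greedy_accept([11, 42, 7], [11, 99, 7]))  # [11]
```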
@@ -76,7 +80,7 @@ If the bonus token is accepted, the MTP model performs inference for (num_specul
- Currently, the spec_decode scenario only supports methods such as ngram, eagle, eagle3, and mtp. If an incorrect parameter is passed for the method, the code will raise an error to alert the user that an incorrect method was provided.
```
```python
def get_spec_decode_method(method,
vllm_config,
device,
@@ -93,9 +97,10 @@ def get_spec_decode_method(method,
```
### Integer Validation
- The current npu_fused_infer_attention_score operator only supports integers less than 16 per decode round. Therefore, the maximum supported value for MTP is 15. If a value greater than 15 is provided, the code will raise an error and alert the user.
```
```python
if self.speculative_config:
spec_token_num = self.speculative_config.num_speculative_tokens
self.decode_threshold += spec_token_num
@@ -105,5 +110,6 @@ if self.speculative_config:
```
## Limitation
- Because only a single layer of weights is exposed in DeepSeek's MTP, accuracy and performance are not effectively guaranteed in scenarios where MTP > 1 (especially MTP ≥ 3). Moreover, due to current operator limitations, MTP supports a maximum of 15.
- In the fullgraph mode with MTP > 1, the capture size of each aclgraph must be an integer multiple of (num_speculative_tokens + 1).

View File

@@ -5,6 +5,7 @@
This guide shows how to use Context Parallel, a long sequence inference optimization technique. Context Parallel includes `PCP` (Prefill Context Parallel) and `DCP` (Decode Context Parallel), which reduces NPU memory usage and improves inference speed in long sequence LLM inference.
## Benefits of Context Parallel
Context parallel mainly solves the problem of serving long context requests. As prefill and decode present quite different characteristics and have quite different SLO (service level objectives), we need to implement context parallel separately for them. The major considerations are:
- For long context prefill, we can use context parallel to reduce TTFT (time to first token) by amortizing the computation time of the prefill across query tokens.
@@ -13,13 +14,16 @@ Context parallel mainly solves the problem of serving long context requests. As
To learn more about the theory and implementation details of context parallel, please refer to the [context parallel developer guide](../../developer_guide/feature_guide/context_parallel.md).
## Supported Scenarios
Currently context parallel can be used together with most other features, supported features are as follows:
| | Eager | Graph | Prefix <br> Cache | Chunked <br> Prefill | SpecDecode <br> (MTP) | PD <br> disaggregation | MLAPO |
| ------- | ----- | ----- | ------ | ------ | ----- | ----- | ----- |
| **PCP** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅|
| **DCP** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
## How to use Context Parallel
You can enable `PCP` and `DCP` by `prefill_context_parallel_size` and `decode_context_parallel_size`, refer to the following example:
- Offline example:
@@ -53,6 +57,7 @@ You can enable `PCP` and `DCP` by `prefill_context_parallel_size` and `decode_co
The total world_size is `tensor_parallel_size` * `prefill_context_parallel_size`, so the examples above need 4 NPUs for each.
## Constraints
- While using DCP, the following constraints must be met:
- For MLA based model, such as Deepseek-R1:
- `tensor_parallel_size >= decode_context_parallel_size`
@@ -63,7 +68,7 @@ The total world_size is `tensor_parallel_size` * `prefill_context_parallel_size`
- While using Context Parallel in scenarios that require KV cache transfer (e.g. KV pooling, PD disaggregation), to simplify KV cache transmission, `cp_kv_cache_interleave_size` must be set to the same value as the KV cache `block_size` (default: 128), which instructs CP to split the KV cache in a block-interleaved style. For example:
```
```shell
vllm serve deepseek-ai/DeepSeek-V2-Lite \
--tensor-parallel-size 2 \
--decode-context-parallel-size 2 \
@@ -73,16 +78,19 @@ The total world_size is `tensor_parallel_size` * `prefill_context_parallel_size`
```
## Experimental Results
To evaluate the effectiveness of Context Parallel in long sequence LLM inference scenarios, we use **DeepSeek-R1-W8A8** and **Qwen3-235B** and deploy PD-disaggregated instances in an environment of 64 Ascend 910C*64G (A3) cards; the configuration and performance data are as follows.
- DeepSeek-R1-W8A8:
| Configuration | Input length <br> 32k | Input length <br> 64k | Input length <br> 128k |
| ----------------------------- | ------------------------- | ------------------------- | ------------------------- |
| P node: (DP2 TP8 EP16) *2 <br> D node: (DP32 EP32) *1 | TTFT: 9.3s <br> TPOT: 72ms | TTFT: 22.8s <br> TPOT: 74ms | TTFT: 73.2s <br> TPOT: 82ms |
| P node: (PCP2 TP8 DCP8 EP16) *2 <br> D node: (DP32 EP32) *1 | TTFT: 7.9s <br> TPOT: 74ms | TTFT: 15.9s <br> TPOT: 78ms | TTFT: 46.0s <br> TPOT: 83ms |
| P node: (DP2 TP8 EP16) *2 <br> D node: (DP32 EP32)*1 | TTFT: 9.3s <br> TPOT: 72ms | TTFT: 22.8s <br> TPOT: 74ms | TTFT: 73.2s <br> TPOT: 82ms |
| P node: (PCP2 TP8 DCP8 EP16) *2 <br> D node: (DP32 EP32)*1 | TTFT: 7.9s <br> TPOT: 74ms | TTFT: 15.9s <br> TPOT: 78ms | TTFT: 46.0s <br> TPOT: 83ms |
- Qwen3-235B:
| Configuration | Input length <br> 32k | Input length <br> 64k | Input length <br> 120k |
| ----------------------------- | ------------------------- | ------------------------- | ------------------------- |
| P node: (DP2 TP8 EP16) *2 <br> D node: (DP32 EP32) *1 | TTFT: 5.1s <br> TPOT: 65ms | TTFT: 13.1s <br> TPOT: 85ms | TTFT: 33.9s <br> TPOT: 120ms |
| P node: (PCP2 TP8 DCP2 EP16) *2 <br> D node: (DP32 EP32) *1 | TTFT: 3.0s <br> TPOT: 66ms | TTFT: 8.9s <br> TPOT: 86ms | TTFT: 22.7s <br> TPOT: 121ms |
| P node: (DP2 TP8 EP16) *2 <br> D node: (DP32 EP32)*1 | TTFT: 5.1s <br> TPOT: 65ms | TTFT: 13.1s <br> TPOT: 85ms | TTFT: 33.9s <br> TPOT: 120ms |
| P node: (PCP2 TP8 DCP2 EP16) *2 <br> D node: (DP32 EP32)*1 | TTFT: 3.0s <br> TPOT: 66ms | TTFT: 8.9s <br> TPOT: 86ms | TTFT: 22.7s <br> TPOT: 121ms |

View File

@@ -29,9 +29,11 @@ We are working on further improvements and this feature will support more XPUs i
```
### Supported Models
So far, dynamic batch performs better on several dense models including Qwen and Llama (from 8B to 32B) with `tensor_parallel_size=8`. For different models, a proper `SLO_limits_for_dynamic_batch` parameter is needed. The empirical value of this parameter is generally `35, 50, or 75`. Therefore, some additional tests are needed to select the best parameter.
## Usage
Dynamic batch is used in the online inference. A fully executable example is as follows:
```shell

View File

@@ -14,9 +14,12 @@ Expert balancing for MoE models in LLM serving is essential for optimal performa
## Support Scenarios
### Models:
### Models
DeepseekV3/V3.1/R1, Qwen3-MOE
### MOE QuantType:
### MOE QuantType
W8A8-dynamic
## How to Use EPLB
@@ -37,6 +40,7 @@ vllm serve Qwen/Qwen3-235B-A22 \
```
### Static EPLB
#### Initial Setup (Record Expert Map)
We need to add the environment variable `export EXPERT_MAP_RECORD="true"` to record the expert map. Generate the initial expert distribution map using `expert_map_record_path`. This creates a baseline configuration for future deployments.
@@ -54,6 +58,7 @@ vllm serve Qwen/Qwen3-235B-A22 \
```
#### Subsequent Deployments (Use Recorded Map)
Load the pre-recorded expert map for consistent performance. This avoids recalculating distributions at runtime.
```shell
@@ -66,6 +71,7 @@ vllm serve Qwen/Qwen3-235B-A22 \
```
## Critical Considerations
1. Parameter Tuning:
- num_iterations_eplb_update: Higher values (e.g., 400+) for stable workloads; lower values (e.g., 100-200) for fluctuating traffic.
- num_wait_worker_iterations: Should be ≥ 30 to avoid premature balancing during startup.

View File

@@ -13,33 +13,36 @@ The functionality of [external DP](https://docs.vllm.ai/en/latest/serving/data_p
This tutorial will introduce the usage of them.
### Prerequisites:
### Prerequisites
- Python 3.10+
- Install dependencies needed by load-balance proxy server:
```
```shell
pip install fastapi httpx uvicorn
```
## Starting External DP Servers
First, you need to have at least two vLLM servers running in data parallel. These can be mock servers or actual vLLM servers. Note that this proxy also works with only one vLLM server running, but it will fall back to direct request forwarding, which provides no load balancing benefit.
You can start external vLLM DP servers one by one manually or use the launch script in `examples/external_online_dp`. For scenarios with a large DP size across multiple nodes, we recommend using our launch script for convenience.
### Manually Launch
```
```shell
# This example shows how to manually launch a vLLM service with DP size 2 in one node.
vllm serve --host 0.0.0.0 --port 8100 --data-parallel-size 2 --data-parallel-rank 0 ... # vLLM DP0
vllm serve --host 0.0.0.0 --port 8101 --data-parallel-size 2 --data-parallel-rank 1 ... # vLLM DP1
```
### Use Launch Script
First, modify `examples/external_online_dp/run_dp_template.sh` according to your vLLM configuration. Then you can use `examples/external_online_dp/launch_online_dp.py` to launch multiple vLLM instances with one command on each node. It will internally call `examples/external_online_dp/run_dp_template.sh` for each DP rank with the proper DP-related parameters.
An example of running external DP in one single node:
```
```shell
cd examples/external_online_dp
# running DP4 TP4 in a node with 16 NPUs
python launch_online_dp.py --dp-size 4 --tp-size 4 --dp-size-local 4 --dp-rank-start 0 --dp-address x.x.x.x --dp-rpc-port 12342
@@ -47,7 +50,7 @@ python launch_online_dp.py --dp-size 4 --tp-size 4 --dp-size-local 4 --dp-rank-s
An example of running external DP in two nodes:
```
```shell
cd examples/external_online_dp
# running DP4 TP4 in two nodes with 8 NPUs each
# Node 0 holds DP0 DP1 and node 1 holds DP2 DP3
@@ -61,16 +64,18 @@ python launch_online_dp.py --dp-size 4 --tp-size 4 --dp-size-local 2 --dp-rank-s
```
## Starting Load-balance Proxy Server
After all vLLM DP instances are launched, you can launch the load-balance proxy server, which serves as the entrypoint for incoming requests and load balances them between the vLLM DP instances.
The proxy server has the following features:
- Load balances requests to multiple vLLM servers based on request length.
- Supports OpenAI-compatible `/v1/completions` and `/v1/chat/completions` endpoints.
- Streams responses from backend servers to clients.
To run the proxy server, you need to specify the host and port for each vLLM DP Instance:
```
```shell
# For example, we have already started two DP instances in single node:
# python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address x.x.x.x --dp-rpc-port 12342
# By default, launch_online_dp.py will launch vLLM instances from starting port 9000,

View File

@@ -11,10 +11,12 @@ This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. P
From v0.9.1rc1 with the V1 Engine, vLLM Ascend runs models in graph mode by default to keep the same behavior as vLLM. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model.
There are two kinds for graph mode supported by vLLM Ascend:
- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, Qwen and Deepseek series models are well tested.
- **XliteGraph**: This is the openeuler xlite graph mode. In v0.11.0, only Llama, Qwen dense series models, and Qwen3-vl are supported.
## Using ACLGraph
ACLGraph is enabled by default. Taking Qwen series models as an example, using the V1 Engine is enough.
Offline example:

View File

@@ -3,19 +3,21 @@
## Environmental Dependencies
* Software:
* Python >= 3.10, < 3.12
* CANN == 8.3.rc2
* PyTorch == 2.8.0, torch-npu == 2.8.0
* vLLM: main branch
* vLLM-Ascend: main branch
* Python >= 3.10, < 3.12
* CANN == 8.3.rc2
* PyTorch == 2.8.0, torch-npu == 2.8.0
* vLLM: main branch
* vLLM-Ascend: main branch
### KV Pool Parameter Description
**kv_connector_extra_config**: Additional configurable parameters for pooling.
**lookup_rpc_port**: Port for RPC communication between the pooling scheduler process and the worker process; each instance requires a unique port configuration.
**load_async**: Whether to enable asynchronous loading. The default value is false.
**backend**: Sets the storage backend for kvpool, with the default being mooncake.
### Environment Variable Configuration
To guarantee uniform hash generation, it is required to synchronize the PYTHONHASHSEED environment variable across all nodes upon enabling KV Pool.
```bash
@@ -23,6 +25,7 @@ export PYTHONHASHSEED=0
```
## Example of using Mooncake as a KV Pool backend
* Software:
* Check NPU HCCN Configuration:
@@ -35,7 +38,7 @@ export PYTHONHASHSEED=0
* Install Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
Installation and Compilation Guide: https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries.
Installation and Compilation Guide: <https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.
First, we need to obtain the Mooncake project. Refer to the following command:
```shell
@@ -75,8 +78,8 @@ export PYTHONHASHSEED=0
**Note:**
- Adjust the Python path according to your specific Python installation
- Ensure `/usr/local/lib` and `/usr/local/lib64` are in your `LD_LIBRARY_PATH`
* Adjust the Python path according to your specific Python installation
* Ensure `/usr/local/lib` and `/usr/local/lib64` are in your `LD_LIBRARY_PATH`
```shell
export LD_LIBRARY_PATH=/usr/local/lib64/python3.11/site-packages/mooncake:$LD_LIBRARY_PATH
@@ -88,7 +91,7 @@ export PYTHONHASHSEED=0
The environment variable **MOONCAKE_CONFIG_PATH** is configured to the full path where mooncake.json is located.
```
```shell
{
"metadata_server": "P2PHANDSHAKE",
"protocol": "ascend",
@@ -108,7 +111,7 @@ The environment variable **MOONCAKE_CONFIG_PATH** is configured to the full path
Under the mooncake folder:
```
```shell
mooncake_master --port 50088 --eviction_high_watermark_ratio 0.9 --eviction_ratio 0.1
```
@@ -122,13 +125,13 @@ Using `MultiConnector` to simultaneously utilize both `MooncakeConnectorV1` and
`prefill` Node
```
```shell
bash multi_producer.sh
```
The content of the multi_producer.sh script:
```
```shell
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONHASHSEED=0 
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
@@ -195,13 +198,13 @@ python3 -m vllm.entrypoints.openai.api_server \
`decode` Node
```
```shell
bash multi_consumer.sh
```
The content of multi_consumer.sh:
```
```shell
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export PYTHONHASHSEED=0 
@@ -259,7 +262,7 @@ python3 -m vllm.entrypoints.openai.api_server \
Currently, the key-value pool in PD Disaggregate only stores the kv cache generated by the Prefill node by default. For models using MLA, the Decode node can now also store the kv cache for use by the Prefill node, enabled by adding `consumer_is_to_put: true` to the AscendStoreConnector. If the Prefill node enables PP, `prefill_pp_size` or `prefill_pp_layer_partition` also needs to be set. Example as follows:
```
```python
{
"kv_connector": "AscendStoreConnector",
"kv_role": "kv_consumer",
@@ -273,9 +276,9 @@ Currently, the key-value pool in PD Disaggregate only stores the kv cache genera
}
```
#### 2、Start proxy_server.
#### 2. Start proxy_server
```
```shell
python vllm-ascend/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py \
--host localhost\
--prefiller-hosts localhost \
@@ -292,13 +295,13 @@ Configure the localhost, port, and model weight path in the command to your own
Short question:
```
```shell
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_tokens": 200, "temperature":0.0 }'
```
Long question:
```
```shell
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_tokens": 256, "temperature":0.0 }'
```
@@ -306,13 +309,13 @@ curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json"
#### 1. Run Mixed Department Script
```
```shell
bash mixed_department.sh
```
Content of mixed_department.sh:
```
```shell
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export MOONCAKE_CONFIG_PATH="/xxxxxx/mooncake.json"
@@ -351,12 +354,12 @@ Configure the localhost, port, and model weight path in the command to your own
Short question:
```
```shell
curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_tokens": 200, "temperature":0.0 }'
```
Long question:
```
```shell
curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_tokens": 256, "temperature":0.0 }'
```

View File

@@ -7,12 +7,12 @@ Take the deepseek model as an example, use 8 Atlas 800T A3 servers to deploy the
## Verify Multi-Node Communication Environment
### Physical Layer Requirements:
### Physical Layer Requirements
- The physical machines must be located on the same WLAN, with network connectivity.
- All NPUs must be interconnected. For the Atlas A2 generation, intra-node connectivity is via HCCS, and inter-node connectivity is via RDMA. For the Atlas A3 generation, both intra-node and inter-node connectivity are via HCCS.
### Verification Process:
### Verification Process
:::::{tab-set}
::::{tab-item} A3

View File

@@ -5,12 +5,14 @@
**Layer Shard Linear** is a memory-optimization feature designed for large language model (LLM) inference. It addresses the high memory pressure caused by **repeated linear operators across many layers** that share identical structure but have distinct weights.
Instead of replicating all weights on every device, **Layer Shard Linear shards the weights of a "series" of such operators across the NPU devices in a communication group**:
- The **i-th layer's linear weight** is stored **only on device `i % K`**, where `K` is the number of devices in the group.
- Other devices hold a lightweight **shared dummy tensor** during initialization and fetch the real weight **on-demand via asynchronous broadcast** during the forward pass.
As illustrated in the figure below, this design enables the broadcast to prefetch weights ahead of use: while the current layer (e.g., MLA or MOE) is being computed, the system **asynchronously broadcasts the next layer's weight** in the background. Because the attention computation in the MLA module is sufficiently latency-bound, the weight transfer for `o_proj` is **fully overlapped with computation**, making the communication **latency-free from the perspective of end-to-end inference**.
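A minimal sketch of the layer-to-device assignment described above; the names are illustrative, not the actual implementation:

```python
def owner_device(layer_idx: int, group_size: int) -> int:
    """Device that stores the real weight for this layer; the others keep a dummy tensor."""
    return layer_idx % group_size

# With 61 layers sharded across a group of 8 devices, device 0 owns roughly 1/8 of the weights.
layers_on_device_0 = [i for i in range(61) if owner_device(i, 8) == 0]
print(layers_on_device_0)  # [0, 8, 16, 24, 32, 40, 48, 56]
```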
This approach **preserves exact computational semantics** while **significantly reducing NPU memory footprint**, especially critical for:
- Extremely deep architectures (e.g., DeepSeek-V3/R1 with 61 layers);
- Models using **[DSA-CP](https://github.com/vllm-project/vllm-ascend/pull/4702)** or **[FlashComm2](https://github.com/vllm-project/vllm-ascend/pull/4188)**, where the full `O` (output) projection matrix must reside in memory per layer;
- Scenarios where **attention computation latency fully overlaps** (hides) the communication cost of weight broadcasting.
@@ -18,6 +20,7 @@ This approach **preserves exact computational semantics** while **significantly
---
### Flowchart
![layer shard](./images/layer_sharding.png)
> **Figure.** Layer Shard Linear workflow: weights are sharded by layer across devices (top), and during forward execution (bottom), asynchronous broadcast pre-fetches the next layer's weight while the current layer computes—enabling zero-overhead weight loading.
@@ -68,4 +71,4 @@ vllm serve \
--additional-config '{
"layer_sharding": ["q_b_proj", "o_proj"]
}'
```
```

View File

@@ -1,6 +1,7 @@
# LoRA Adapters Guide
## Overview
Like vLLM, vllm-ascend supports LoRA as well. The usage and more details can be found in [vLLM official document](https://docs.vllm.ai/en/latest/features/lora.html).
You can refer to [Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html#list-of-text-only-language-models) to find which models support LoRA in vLLM.
@@ -8,11 +9,12 @@ You can refer to [Supported Models](https://docs.vllm.ai/en/latest/models/suppor
You can run LoRA with ACLGraph mode now. Please refer to [Graph Mode Guide](./graph_mode.md) for a better LoRA performance.
Address for downloading models:\
base model: https://www.modelscope.cn/models/vllm-ascend/Llama-2-7b-hf/files \
base model: <https://www.modelscope.cn/models/vllm-ascend/Llama-2-7b-hf/files> \
lora model:
https://www.modelscope.cn/models/vllm-ascend/llama-2-7b-sql-lora-test/files
<https://www.modelscope.cn/models/vllm-ascend/llama-2-7b-sql-lora-test/files>
## Example
We provide a simple LoRA example here, which enables the ACLGraph mode by default.
```shell

View File

@@ -3,6 +3,7 @@
This guide shows how to use Speculative Decoding with vLLM Ascend. Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
## Speculating by matching n-grams in the prompt
The following code configures vLLM Ascend to use speculative decoding where proposals are generated by matching n-grams in the prompt.
- Offline inference

View File

@@ -89,6 +89,7 @@ INFO: Application startup complete.
Congratulations, you have successfully started the vLLM server with UCM connector!
## Evaluating UCM Prefix Caching Performance
After launching the vLLM server with `UCMConnector` enabled, the easiest way to observe the prefix caching effect is to run the built-in `vllm bench` CLI. Executing the following command **twice** in a separate terminal shows the improvement clearly.
```bash
@@ -109,32 +110,34 @@ vllm bench serve \
```
### After the first execution
The `vllm bench` terminal prints the benchmark result:
```
```shell
---------------Time to First Token----------------
Mean TTFT (ms): 15323.87
```
Inspecting the vLLM server logs reveals entries like:
```
```shell
INFO ucm_connector.py:228: request_id: xxx, total_blocks_num: 125, hit hbm: 0, hit external: 0
```
This indicates that for the first inference request, UCM did not hit any cached KV blocks. As a result, the full 16K-token prefill must be computed, leading to a relatively large TTFT.
### After the second execution
Running the same benchmark again produces:
```
```shell
---------------Time to First Token----------------
Mean TTFT (ms): 1920.68
```
The vLLM server logs now contain similar entries:
```
```shell
INFO ucm_connector.py:228: request_id: xxx, total_blocks_num: 125, hit hbm: 0, hit external: 125
```

View File

@@ -1,13 +1,17 @@
# Release Notes
## v0.13.0rc1 - 2025.12.27
This is the first release candidate of v0.13.0 for vLLM Ascend. We landed lots of bug fixes, performance improvements and feature support in this release. Any feedback is welcome to help us improve vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/latest) to get started.
### Highlights
- Improved the performance of DeepSeek V3.2, please refer to [tutorials](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/DeepSeek-V3.2.html)
- Qwen3-Next MTP with chunked prefill is supported now [#4770](https://github.com/vllm-project/vllm-ascend/pull/4770), please refer to [tutorials](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/Qwen3-Next.html)
- [Experimental] Prefill Context Parallel and Decode Context Parallel are supported, but notice that it is an experimental feature now, welcome any feedback. please refer to [context parallel feature guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/context_parallel.html)
### Features
- Support openPangu Ultra MoE [4615](https://github.com/vllm-project/vllm-ascend/pull/4615)
- A new quantization method W8A16 is supported now. [#4541](https://github.com/vllm-project/vllm-ascend/pull/4541)
- Cross-machine Disaggregated Prefill is supported now. [#5008](https://github.com/vllm-project/vllm-ascend/pull/5008)
@@ -17,7 +21,9 @@ This is the first release candidate of v0.13.0 for vLLM Ascend. We landed lots o
- Enhance all-reduce skipping logic for MoE models in NPUModelRunner [#5329](https://github.com/vllm-project/vllm-ascend/pull/5329)
### Performance
Some general performance improvements:
- Add l2norm triton kernel [#4595](https://github.com/vllm-project/vllm-ascend/pull/4595)
- Add new pattern for AddRmsnormQuant with SP, which could only take effect in graph mode. [#5077](https://github.com/vllm-project/vllm-ascend/pull/5077)
- Add async exponential while model executing. [#4501](https://github.com/vllm-project/vllm-ascend/pull/4501)
@@ -25,6 +31,7 @@ Some general performance improvement:
- To optimize the performance in small batch size scenario, an attention operator with flash decoding function is offered, please refer to item 22 in [FAQs](https://docs.vllm.ai/projects/ascend/en/latest/faqs.html) to enable it.
### Other
- OOM error on VL models is fixed now. We're keeping an eye on it; if you hit the OOM problem again, please submit an issue. [#5136](https://github.com/vllm-project/vllm-ascend/pull/5136)
- Fixed an accuracy bug of Qwen3-Next-MTP when batched inferring. [#4932](https://github.com/vllm-project/vllm-ascend/pull/4932)
- Fix npu-cpu offloading interface change bug. [#5290](https://github.com/vllm-project/vllm-ascend/pull/5290)
@@ -32,6 +39,7 @@ Some general performance improvement:
- Fix unsuitable moe_comm_type under ep=1 scenario [#5388](https://github.com/vllm-project/vllm-ascend/pull/5388)
### Deprecation & Breaking Changes
- `VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE` is removed and `VLLM_ASCEND_ENABLE_PREFETCH_MLP` is recommended as its replacement, since the two were always enabled together. [#5272](https://github.com/vllm-project/vllm-ascend/pull/5272)
- `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` is dropped now. [#5270](https://github.com/vllm-project/vllm-ascend/pull/5270)
- `VLLM_ASCEND_ENABLE_NZ` is disabled for the float weight case, since we noticed that the performance is not good in some float cases. Feel free to set it to 2 if you are sure it works for your case. [#4878](https://github.com/vllm-project/vllm-ascend/pull/4878)
@@ -39,24 +47,29 @@ Some general performance improvement:
- `dump_config` in `additional_config` is renamed to `dump_config_path` and the type is changed from `dict` to `string`. [#5296](https://github.com/vllm-project/vllm-ascend/pull/5296)
### Dependencies
- vLLM version has been upgraded to 0.13.0 and drop 0.12.0 support. [#5146](https://github.com/vllm-project/vllm-ascend/pull/5146)
- Transformers version requirement has been upgraded to >= 4.57.3. [#5250](https://github.com/vllm-project/vllm-ascend/pull/5250)
### Known Issues
- Qwen3-Next doesn't support the long sequence scenario, and `gpu-memory-utilization` should be limited according to the doc to run Qwen3-Next. We'll improve this in the next release.
- The functional break on Qwen3-Next when the input/output is around 3.5k/1.5k is fixed, but the fix introduces a performance regression. We'll fix it in the next release. [#5357](https://github.com/vllm-project/vllm-ascend/issues/5357)
- There is a precision issue with curl on ultra-short sequences in DeepSeek-V3.2. We'll fix it in the next release. [#5370](https://github.com/vllm-project/vllm-ascend/issues/5370)
## v0.11.0 - 2025.12.16
We're excited to announce the release of v0.11.0 for vLLM Ascend. This is the official release for v0.11.0. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/v0.11.0) to get started. We'll consider releasing a post version in the future if needed. This release note only contains the important changes and notes since v0.11.0rc3.
### Highlights
- Improved the performance for deepseek 3/3.1. [#3995](https://github.com/vllm-project/vllm-ascend/pull/3995)
- Fixed the accuracy bug for qwen3-vl. [#4811](https://github.com/vllm-project/vllm-ascend/pull/4811)
- Improved the performance of sample. [#4153](https://github.com/vllm-project/vllm-ascend/pull/4153)
- Eagle3 is back now. [#4721](https://github.com/vllm-project/vllm-ascend/pull/4721)
### Other
- Improved the performance for kimi-k2. [#4555](https://github.com/vllm-project/vllm-ascend/pull/4555)
- Fixed a quantization bug for deepseek3.2-exp. [#4797](https://github.com/vllm-project/vllm-ascend/pull/4797)
- Fixed qwen3-vl-moe bug under high concurrency. [#4658](https://github.com/vllm-project/vllm-ascend/pull/4658)
@@ -65,15 +78,18 @@ We're excited to announce the release of v0.11.0 for vLLM Ascend. This is the of
- Fixed the version incompatibility issue for openEuler docker image. [#4745](https://github.com/vllm-project/vllm-ascend/pull/4745)
### Deprecation announcement
- LLMdatadist connector has been deprecated, it'll be removed in v0.12.0rc1
- Torchair graph has been deprecated, it'll be removed in v0.12.0rc1
- Ascend scheduler has been deprecated, it'll be removed in v0.12.0rc1
### Upgrade notice
- torch-npu is upgraded to 2.7.1.post1. Please note that the package is pushed to the [pypi mirror](https://mirrors.huaweicloud.com/ascend/repos/pypi/torch-npu/), so it's hard to add it as an automatic dependency. Please install it by yourself.
- CANN is upgraded to 8.3.rc2.
### Known Issues
- Qwen3-Next doesn't support the expert parallel and MTP features in this release, and it will hit OOM if the input is too long. We'll improve it in the next release.
- Deepseek 3.2 only works with torchair graph mode in this release. We'll make it work with aclgraph mode in the next release.
- Qwen2-audio doesn't work by default. Temporary solution is to set `--gpu-memory-utilization` to a suitable value, such as 0.8.
@@ -84,11 +100,13 @@ We're excited to announce the release of v0.11.0 for vLLM Ascend. This is the of
This is the first release candidate of v0.12.0 for vLLM Ascend. We landed lots of bug fixes, performance improvements and feature support in this release. Any feedback is welcome to help us improve vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/latest) to get started.
### Highlights
- DeepSeek 3.2 is stable and its performance is improved. In this release, you don't need to install any other packages. Follow the [official tutorial](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/DeepSeek-V3.2.html) to start using it.
- Async scheduler is more stable and ready to enable now. Please set `--async-scheduling` to enable it.
- More new models, such as Qwen3-omni, DeepSeek OCR, PaddleOCR, OpenCUA are supported now.
### Core
- [Experimental] Full decode only graph mode is supported now. Although it is not enabled by default, we suggest enabling it via `--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'` in most cases. Let us know if you hit any error. We'll keep improving it and enable it by default in the next few releases.
- Lots of Triton kernels are added. The performance of vLLM Ascend, especially Qwen3-Next and DeepSeek 3.2, is improved. Please note that Triton is not installed or enabled by default, but we suggest enabling it in most cases. You can download and install it by hand from the [package url](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/triton_ascend-3.2.0.dev2025110717-cp311-cp311-manylinux_2_27_aarch64.whl). If you're running vLLM Ascend on x86, you need to build triton-ascend yourself from [source](https://gitcode.com/Ascend/triton-ascend)
- Lots of Ascend ops are added to improve performance. This means that from this release, vLLM Ascend only works with custom ops built, so we removed the env `COMPILE_CUSTOM_KERNELS`. You cannot set it to 0 now.
@@ -101,6 +119,7 @@ This is the first release candidate of v0.12.0 for vLLM Ascend. We landed lots o
- Official doc has been improved. We refactored the tutorial to make it clearer. The user guide and developer guide are more complete now. We'll keep improving them.
### Other
- [Experimental] Mooncake layerwise connector is supported now.
- [Experimental] [KV cache pool](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/feature_guide/KV_Cache_Pool_Guide.html) feature is added
- [Experimental] A new graph mode `xlite` is introduced. It performs good with some models. Following the [official tutorial](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/graph_mode.html#using-xlitegraph) to start using it.
@@ -113,11 +132,13 @@ This is the first release candidate of v0.12.0 for vLLM Ascend. We landed lots o
- msserviceprofiler tool is added to help user to profile the model performance. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/performance_and_debug/service_profiling_guide.html) to get started.
### Upgrade Note
- The vLLM Ascend self-maintained modeling files have been removed. The related Python entrypoints are removed as well, so please uninstall the old version of vLLM Ascend in your env before upgrading.
- CANN is upgraded to 8.3.RC2, Pytorch and torch-npu are upgraded to 2.8.0. Don't forget to install them.
- Python 3.9 support is dropped to stay consistent with vLLM v0.12.0.
### Known Issues
- DeepSeek 3/3.1 and Qwen3 don't work with FULL_DECODE_ONLY graph mode. We'll fix it in the next release. [#4990](https://github.com/vllm-project/vllm-ascend/pull/4990)
- Hunyuan OCR doesn't work. We'll fix it in the next release. [#4989](https://github.com/vllm-project/vllm-ascend/pull/4989) [#4992](https://github.com/vllm-project/vllm-ascend/pull/4992)
- DeepSeek 3.2 doesn't work with chat templates because vLLM v0.12.0 doesn't support it. We'll support it in the next v0.13.0rc1 version.
@@ -126,14 +147,17 @@ This is the first release candidate of v0.12.0 for vLLM Ascend. We landed lots o
- The speculative decoding method `suffix` doesn't work. We'll fix it in the next release. You can cherry-pick this commit to fix the issue: [#5010](https://github.com/vllm-project/vllm-ascend/pull/5010)
## v0.11.0rc3 - 2025.12.03
This is the third release candidate of v0.11.0 for vLLM Ascend. For quality reasons, we released a new rc before the official release. Thanks for all your feedback. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/v0.11.0) to get started.
### Highlights
- torch-npu is upgraded to 2.7.1.post1. Please note that the package is published on a [pypi mirror](https://mirrors.huaweicloud.com/ascend/repos/pypi/torch-npu/), so it cannot be declared as an automatic dependency. Please install it yourself.
- The NZ weight loader is disabled to speed up dense models. Please note that this is a temporary solution. If you see performance regress, please let us know. We'll keep improving it. [#4495](https://github.com/vllm-project/vllm-ascend/pull/4495)
- Mooncake is installed in the official Docker image now. You can use it directly in the container. [#4506](https://github.com/vllm-project/vllm-ascend/pull/4506)
### Other
- Fix an OOM issue for MoE models. [#4367](https://github.com/vllm-project/vllm-ascend/pull/4367)
- Fix a hang issue with multimodal models when running with DP > 1. [#4393](https://github.com/vllm-project/vllm-ascend/pull/4393)
- Fix some bugs for EPLB. [#4416](https://github.com/vllm-project/vllm-ascend/pull/4416)
@@ -142,19 +166,23 @@ This is the third release candidate of v0.11.0 for vLLM Ascend. For quality reas
- Fix a functional bug when running Qwen2.5 VL under high concurrency. [#4553](https://github.com/vllm-project/vllm-ascend/pull/4553)
## v0.11.0rc2 - 2025.11.21
This is the second release candidate of v0.11.0 for vLLM Ascend. In this release, we fixed many bugs to improve quality. Thanks for all your feedback. We'll keep working on bug fixes and performance improvements. The v0.11.0 official release will come soon. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/v0.11.0) to get started.
### Highlights
- CANN is upgraded to 8.3.RC2. [#4332](https://github.com/vllm-project/vllm-ascend/pull/4332)
- Ngram spec decode method is back now. [#4092](https://github.com/vllm-project/vllm-ascend/pull/4092)
- The performance of aclgraph is improved by updating default capture size. [#4205](https://github.com/vllm-project/vllm-ascend/pull/4205)
### Core
- Speed up vLLM startup time. [#4099](https://github.com/vllm-project/vllm-ascend/pull/4099)
- Kimi k2 with quantization works now. [#4190](https://github.com/vllm-project/vllm-ascend/pull/4190)
- Fix a bug for qwen3-next. It's more stable now. [#4025](https://github.com/vllm-project/vllm-ascend/pull/4025)
### Other
- Fix an issue for full decode only mode. Full graph mode is more stable now. [#4106](https://github.com/vllm-project/vllm-ascend/pull/4106) [#4282](https://github.com/vllm-project/vllm-ascend/pull/4282)
- Fix an allgather ops bug for DeepSeek V3 series models. [#3711](https://github.com/vllm-project/vllm-ascend/pull/3711)
- Fix some bugs for EPLB feature. [#4150](https://github.com/vllm-project/vllm-ascend/pull/4150) [#4334](https://github.com/vllm-project/vllm-ascend/pull/4334)
@@ -165,6 +193,7 @@ This is the second release candidate of v0.11.0 for vLLM Ascend. In this release
- The required audio library is installed in the container. [#4324](https://github.com/vllm-project/vllm-ascend/pull/4324)
### Known Issues
- Ray + EP doesn't work; if you run vLLM Ascend with Ray, please disable expert parallelism. [#4123](https://github.com/vllm-project/vllm-ascend/pull/4123)
- `response_format` parameter is not supported yet. We'll support it soon. [#4175](https://github.com/vllm-project/vllm-ascend/pull/4175)
- The CPU bind feature doesn't work in multi-instance cases (such as multiple DP instances on one node). We'll fix it in the next release.
@@ -175,11 +204,13 @@ This is the first release candidate of v0.11.0 for vLLM Ascend. Please follow th
v0.11.0 will be the next official release version of vLLM Ascend. We'll release it in the next few days. Any feedback is welcome to help us improve v0.11.0.
### Highlights
- CANN is upgraded to 8.3.RC1. Torch-npu is upgraded to 2.7.1. [#3945](https://github.com/vllm-project/vllm-ascend/pull/3945) [#3896](https://github.com/vllm-project/vllm-ascend/pull/3896)
- PrefixCache and Chunked Prefill are enabled by default. [#3967](https://github.com/vllm-project/vllm-ascend/pull/3967)
- W4A4 quantization is supported now. [#3427](https://github.com/vllm-project/vllm-ascend/pull/3427) Official tutorial is available at [here](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/single_npu_qwen3_w4a4.html.
- W4A4 quantization is supported now. [#3427](https://github.com/vllm-project/vllm-ascend/pull/3427) An official tutorial is available [here](<https://docs.vllm.ai/projects/ascend/en/latest/tutorials/single_npu_qwen3_w4a4.html>).
### Core
- Performance of Qwen3 and DeepSeek V3 series models is improved.
- Mooncake layerwise connector is supported now [#2602](https://github.com/vllm-project/vllm-ascend/pull/2602). Find tutorial [here](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html).
- MTP > 1 is supported now. [#2708](https://github.com/vllm-project/vllm-ascend/pull/2708)
@@ -187,6 +218,7 @@ v0.11.0 will be the next official release version of vLLM Ascend. We'll release
- Pooling models, such as bge-m3, are supported now. [#3171](https://github.com/vllm-project/vllm-ascend/pull/3171)
### Other
- Refactor the MoE module to make it clearer and easier to understand; performance has improved in both quantized and non-quantized scenarios.
- Refactor model register module to make it easier to maintain. We'll remove this module in Q4 2025. [#3004](https://github.com/vllm-project/vllm-ascend/pull/3004)
- Torchair is deprecated. We'll remove it once the performance of ACL Graph is good enough. The deadline is Q1 2026.
@@ -194,6 +226,7 @@ v0.11.0 will be the next official release version of vLLM Ascend. We'll release
- Refactor the linear module to support the flashcomm1 and flashcomm2 features from the [flashcomm](https://arxiv.org/pdf/2412.04964) paper. [#3004](https://github.com/vllm-project/vllm-ascend/pull/3004) [#3334](https://github.com/vllm-project/vllm-ascend/pull/3334)
### Known Issues
- Memory may leak and the service may hang after serving for a long time. This is a torch-npu bug; we'll upgrade and fix it soon.
- The accuracy of Qwen2.5 VL is not very good. This is a bug caused by CANN; we'll fix it soon.
- For long-sequence input cases, there is sometimes no response and KV cache usage keeps growing. This is a scheduler bug. We are working on it.
@@ -243,6 +276,7 @@ This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the
- Fixed the performance regression with non-MLA models when using the default scheduler. [#2894](https://github.com/vllm-project/vllm-ascend/pull/2894)
### Others
- The performance of W8A8 quantization is improved. [#2275](https://github.com/vllm-project/vllm-ascend/pull/2275)
- The performance is improved for moe models. [#2689](https://github.com/vllm-project/vllm-ascend/pull/2689) [#2842](https://github.com/vllm-project/vllm-ascend/pull/2842)
- Fixed a resource limit error when applying speculative decoding and aclgraph. [#2472](https://github.com/vllm-project/vllm-ascend/pull/2472)
@@ -261,6 +295,7 @@ This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the
- Fixed the aclgraph stream error for MoE models. [#2827](https://github.com/vllm-project/vllm-ascend/pull/2827)
### Known Issues
- The server will hang when running Prefill Decode Disaggregation with different TP sizes for P and D. It's fixed by a [vLLM commit](https://github.com/vllm-project/vllm/pull/23917) which is not included in v0.10.2. You can cherry-pick this commit to fix the issue.
- The HBM usage of Qwen3-Next is higher than expected. It is a [known issue](https://github.com/vllm-project/vllm-ascend/issues/2884) and we are working on it. You can set `max_model_len` and `gpu_memory_utilization` to suitable values based on your parallel configuration to avoid OOM errors.
- We noticed that LoRA does not work with this release due to the KV cache refactor. We will fix it soon. [#2941](https://github.com/vllm-project/vllm-ascend/issues/2941)
@@ -271,11 +306,13 @@ This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the
This is the 1st release candidate of v0.10.1 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to get started.
### Highlights
- LoRA performance is much improved by adding custom kernels contributed by China Merchants Bank. [#2325](https://github.com/vllm-project/vllm-ascend/pull/2325)
- Support Mooncake TransferEngine for KV cache registration and a pull_blocks-style disaggregated prefill implementation. [#1568](https://github.com/vllm-project/vllm-ascend/pull/1568)
- Custom ops can now be captured into aclgraph. [#2113](https://github.com/vllm-project/vllm-ascend/pull/2113)
### Core
- Added MLP tensor parallel to improve performance, but note that this will increase memory usage. [#2120](https://github.com/vllm-project/vllm-ascend/pull/2120)
- openEuler is upgraded to 24.03. [#2631](https://github.com/vllm-project/vllm-ascend/pull/2631)
- Added custom lmhead tensor parallel to achieve reduced memory consumption and improved TPOT performance. [#2309](https://github.com/vllm-project/vllm-ascend/pull/2309)
@@ -285,31 +322,31 @@ This is the 1st release candidate of v0.10.1 for vLLM Ascend. Please follow the
### Others
- Bug fixes:
* Updated the graph capture size calculation, somehow alleviated the problem that NPU stream not enough in some scenarios. [#2511](https://github.com/vllm-project/vllm-ascend/pull/2511)
* Fixed bugs and refactor cached mask generation logic. [#2442](https://github.com/vllm-project/vllm-ascend/pull/2442)
* Fixed the nz format does not work in quantization scenarios. [#2549](https://github.com/vllm-project/vllm-ascend/pull/2549)
* Fixed the accuracy issue on Qwen series caused by enabling `enable_shared_pert_dp` by default. [#2457](https://github.com/vllm-project/vllm-ascend/pull/2457)
* Fixed the accuracy issue on models whose rope dim is not equal to head dim, e.g., GLM4.5. [#2601](https://github.com/vllm-project/vllm-ascend/pull/2601)
- Updated the graph capture size calculation, somehow alleviated the problem that NPU stream not enough in some scenarios. [#2511](https://github.com/vllm-project/vllm-ascend/pull/2511)
- Fixed bugs and refactor cached mask generation logic. [#2442](https://github.com/vllm-project/vllm-ascend/pull/2442)
- Fixed the nz format does not work in quantization scenarios. [#2549](https://github.com/vllm-project/vllm-ascend/pull/2549)
- Fixed the accuracy issue on Qwen series caused by enabling `enable_shared_pert_dp` by default. [#2457](https://github.com/vllm-project/vllm-ascend/pull/2457)
- Fixed the accuracy issue on models whose rope dim is not equal to head dim, e.g., GLM4.5. [#2601](https://github.com/vllm-project/vllm-ascend/pull/2601)
- Performance improved through a lot of prs:
* Removed torch.cat and replaced it with List[0]. [#2153](https://github.com/vllm-project/vllm-ascend/pull/2153)
* Converted the format of gmm to nz. [#2474](https://github.com/vllm-project/vllm-ascend/pull/2474)
* Optimized parallel strategies to reduce communication overhead. [#2198](https://github.com/vllm-project/vllm-ascend/pull/2198)
* Optimized reject sampler in greedy situation. [#2137](https://github.com/vllm-project/vllm-ascend/pull/2137)
- Removed torch.cat and replaced it with List[0]. [#2153](https://github.com/vllm-project/vllm-ascend/pull/2153)
- Converted the format of gmm to nz. [#2474](https://github.com/vllm-project/vllm-ascend/pull/2474)
- Optimized parallel strategies to reduce communication overhead. [#2198](https://github.com/vllm-project/vllm-ascend/pull/2198)
- Optimized reject sampler in greedy situation. [#2137](https://github.com/vllm-project/vllm-ascend/pull/2137)
- A batch of refactoring PRs to enhance the code architecture:
* Refactor on MLA. [#2465](https://github.com/vllm-project/vllm-ascend/pull/2465)
* Refactor on torchair fused_moe. [#2438](https://github.com/vllm-project/vllm-ascend/pull/2438)
* Refactor on allgather/mc2-related fused_experts. [#2369](https://github.com/vllm-project/vllm-ascend/pull/2369)
* Refactor on torchair model runner. [#2208](https://github.com/vllm-project/vllm-ascend/pull/2208)
* Refactor on CI. [#2276](https://github.com/vllm-project/vllm-ascend/pull/2276)
- Refactor on MLA. [#2465](https://github.com/vllm-project/vllm-ascend/pull/2465)
- Refactor on torchair fused_moe. [#2438](https://github.com/vllm-project/vllm-ascend/pull/2438)
- Refactor on allgather/mc2-related fused_experts. [#2369](https://github.com/vllm-project/vllm-ascend/pull/2369)
- Refactor on torchair model runner. [#2208](https://github.com/vllm-project/vllm-ascend/pull/2208)
- Refactor on CI. [#2276](https://github.com/vllm-project/vllm-ascend/pull/2276)
- Parameter changes:
* Added `lmhead_tensor_parallel_size` in `additional_config`, set it to enable lmhead tensor parallel. [#2309](https://github.com/vllm-project/vllm-ascend/pull/2309)
* Some unused environment variables `HCCN_PATH`, `PROMPT_DEVICE_ID`, `DECODE_DEVICE_ID`, `LLMDATADIST_COMM_PORT` and `LLMDATADIST_SYNC_CACHE_WAIT_TIME` are removed. [#2448](https://github.com/vllm-project/vllm-ascend/pull/2448)
* Environment variable `VLLM_LLMDD_RPC_PORT` is renamed to `VLLM_ASCEND_LLMDD_RPC_PORT` now. [#2450](https://github.com/vllm-project/vllm-ascend/pull/2450)
* Added `VLLM_ASCEND_ENABLE_MLP_OPTIMIZE` in environment variables, Whether to enable mlp optimize when tensor parallel is enabled. This feature provides better performance in eager mode. [#2120](https://github.com/vllm-project/vllm-ascend/pull/2120)
* Removed `MOE_ALL2ALL_BUFFER` and `VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ` in environment variables. [#2612](https://github.com/vllm-project/vllm-ascend/pull/2612)
* Added `enable_prefetch` in `additional_config`, Whether to enable weight prefetch. [#2465](https://github.com/vllm-project/vllm-ascend/pull/2465)
* Added `mode` in `additional_config.torchair_graph_config`, When using reduce-overhead mode for torchair, mode needs to be set. [#2461](https://github.com/vllm-project/vllm-ascend/pull/2461)
* `enable_shared_expert_dp` in `additional_config` is disabled by default now, and it is recommended to be enabled when inferencing with deepseek. [#2457](https://github.com/vllm-project/vllm-ascend/pull/2457)
- Added `lmhead_tensor_parallel_size` in `additional_config`, set it to enable lmhead tensor parallel. [#2309](https://github.com/vllm-project/vllm-ascend/pull/2309)
- Some unused environment variables `HCCN_PATH`, `PROMPT_DEVICE_ID`, `DECODE_DEVICE_ID`, `LLMDATADIST_COMM_PORT` and `LLMDATADIST_SYNC_CACHE_WAIT_TIME` are removed. [#2448](https://github.com/vllm-project/vllm-ascend/pull/2448)
- Environment variable `VLLM_LLMDD_RPC_PORT` is renamed to `VLLM_ASCEND_LLMDD_RPC_PORT` now. [#2450](https://github.com/vllm-project/vllm-ascend/pull/2450)
- Added `VLLM_ASCEND_ENABLE_MLP_OPTIMIZE` in environment variables, Whether to enable mlp optimize when tensor parallel is enabled. This feature provides better performance in eager mode. [#2120](https://github.com/vllm-project/vllm-ascend/pull/2120)
- Removed `MOE_ALL2ALL_BUFFER` and `VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ` in environment variables. [#2612](https://github.com/vllm-project/vllm-ascend/pull/2612)
- Added `enable_prefetch` in `additional_config`, Whether to enable weight prefetch. [#2465](https://github.com/vllm-project/vllm-ascend/pull/2465)
- Added `mode` in `additional_config.torchair_graph_config`, When using reduce-overhead mode for torchair, mode needs to be set. [#2461](https://github.com/vllm-project/vllm-ascend/pull/2461)
- `enable_shared_expert_dp` in `additional_config` is disabled by default now, and it is recommended to be enabled when inferencing with deepseek. [#2457](https://github.com/vllm-project/vllm-ascend/pull/2457)
### Known Issues
@@ -336,6 +373,7 @@ Please note that this release note will list all the important changes from last
- Dynamic and Static EPLB support is added. This feature is still experimental.
### Note
The following notes are especially for reference when upgrading from last final release (v0.7.3):
- V0 Engine is not supported from this release. Please always set `VLLM_USE_V1=1` to use V1 engine with vLLM Ascend.
@@ -356,6 +394,7 @@ The following notes are especially for reference when upgrading from last final
- Fix file not found error with shutil.rmtree in torchair mode. [#2506](https://github.com/vllm-project/vllm-ascend/pull/2506)
### Known Issues
- When running MoE models, Aclgraph mode only works with tensor parallelism. DP/EP doesn't work in this release.
- Pipeline parallelism is not supported in this release for V1 engine.
- If you use W4A8 quantization with eager mode, please set `VLLM_ASCEND_MLA_PARALLEL=1` to avoid OOM errors.
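A one-line sketch of that workaround, taken directly from the note above:

```shell
# Sketch only: workaround for W4A8 quantization with eager mode, per the known issue above.
export VLLM_ASCEND_MLA_PARALLEL=1
```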
@@ -397,10 +436,12 @@ This is the 3rd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
This is the 1st release candidate of v0.10.0 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to get started. V0 is completely removed from this version.
### Highlights
- Disaggregate prefill works with V1 engine now. You can take a try with DeepSeek model [#950](https://github.com/vllm-project/vllm-ascend/pull/950), following this [tutorial](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/README.md).
- W4A8 quantization method is supported for dense and MoE model now. [#2060](https://github.com/vllm-project/vllm-ascend/pull/2060) [#2172](https://github.com/vllm-project/vllm-ascend/pull/2172)
### Core
- Ascend PyTorch adapter (torch_npu) has been upgraded to `2.7.1.dev20250724`. [#1562](https://github.com/vllm-project/vllm-ascend/pull/1562) And CANN has been upgraded to `8.2.RC1`. [#1653](https://github.com/vllm-project/vllm-ascend/pull/1653) Don't forget to update them in your environment or use the latest images.
- vLLM Ascend works on Atlas 800I A3 now, and the image on A3 will be released from this version on. [#1582](https://github.com/vllm-project/vllm-ascend/pull/1582)
- Kimi-K2 with W8A8 quantization, Qwen3-Coder and GLM-4.5 are supported in vLLM Ascend; please follow this [tutorial](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/multi_node_kimi.md.html) to have a try. [#2162](https://github.com/vllm-project/vllm-ascend/pull/2162)
@@ -412,36 +453,37 @@ This is the 1st release candidate of v0.10.0 for vLLM Ascend. Please follow the
### Others
- Bug fixes:
* Fix functional problem of multi-modality models like Qwen2-audio with Aclgraph. [#1803](https://github.com/vllm-project/vllm-ascend/pull/1803)
* Fix the process group creating error with external launch scenario. [#1681](https://github.com/vllm-project/vllm-ascend/pull/1681)
* Fix the functional problem with guided decoding. [#2022](https://github.com/vllm-project/vllm-ascend/pull/2022)
* Fix the accuracy issue with common MoE models in DP scenario. [#1856](https://github.com/vllm-project/vllm-ascend/pull/1856)
- Fix functional problem of multi-modality models like Qwen2-audio with Aclgraph. [#1803](https://github.com/vllm-project/vllm-ascend/pull/1803)
- Fix the process group creating error with external launch scenario. [#1681](https://github.com/vllm-project/vllm-ascend/pull/1681)
- Fix the functional problem with guided decoding. [#2022](https://github.com/vllm-project/vllm-ascend/pull/2022)
- Fix the accuracy issue with common MoE models in DP scenario. [#1856](https://github.com/vllm-project/vllm-ascend/pull/1856)
- Performance improved through a lot of prs:
* Caching sin/cos instead of calculate it every layer. [#1890](https://github.com/vllm-project/vllm-ascend/pull/1890)
* Improve shared expert multi-stream parallelism [#1891](https://github.com/vllm-project/vllm-ascend/pull/1891)
* Implement the fusion of allreduce and matmul in prefill phase when tp is enabled. Enable this feature by setting `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE` to `1`. [#1926](https://github.com/vllm-project/vllm-ascend/pull/1926)
* Optimize Quantized MoE Performance by Reducing All2All Communication. [#2195](https://github.com/vllm-project/vllm-ascend/pull/2195)
* Use AddRmsNormQuant ops in the custom model to optimize Qwen3's performance [#1806](https://github.com/vllm-project/vllm-ascend/pull/1806)
* Use multicast to avoid padding decode request to prefill size [#1555](https://github.com/vllm-project/vllm-ascend/pull/1555)
* The performance of LoRA has been improved. [#1884](https://github.com/vllm-project/vllm-ascend/pull/1884)
- Caching sin/cos instead of calculate it every layer. [#1890](https://github.com/vllm-project/vllm-ascend/pull/1890)
- Improve shared expert multi-stream parallelism [#1891](https://github.com/vllm-project/vllm-ascend/pull/1891)
- Implement the fusion of allreduce and matmul in prefill phase when tp is enabled. Enable this feature by setting `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE` to `1`. [#1926](https://github.com/vllm-project/vllm-ascend/pull/1926)
- Optimize Quantized MoE Performance by Reducing All2All Communication. [#2195](https://github.com/vllm-project/vllm-ascend/pull/2195)
- Use AddRmsNormQuant ops in the custom model to optimize Qwen3's performance [#1806](https://github.com/vllm-project/vllm-ascend/pull/1806)
- Use multicast to avoid padding decode request to prefill size [#1555](https://github.com/vllm-project/vllm-ascend/pull/1555)
- The performance of LoRA has been improved. [#1884](https://github.com/vllm-project/vllm-ascend/pull/1884)
- A batch of refactoring prs to enhance the code architecture:
* Torchair model runner refactor [#2205](https://github.com/vllm-project/vllm-ascend/pull/2205)
* Refactoring forward_context and model_runner_v1. [#1979](https://github.com/vllm-project/vllm-ascend/pull/1979)
* Refactor AscendMetaData Comments. [#1967](https://github.com/vllm-project/vllm-ascend/pull/1967)
* Refactor torchair utils. [#1892](https://github.com/vllm-project/vllm-ascend/pull/1892)
* Refactor torchair worker. [#1885](https://github.com/vllm-project/vllm-ascend/pull/1885)
* Register activation customop instead of overwrite forward_oot. [#1841](https://github.com/vllm-project/vllm-ascend/pull/1841)
- Torchair model runner refactor [#2205](https://github.com/vllm-project/vllm-ascend/pull/2205)
- Refactoring forward_context and model_runner_v1. [#1979](https://github.com/vllm-project/vllm-ascend/pull/1979)
- Refactor AscendMetaData Comments. [#1967](https://github.com/vllm-project/vllm-ascend/pull/1967)
- Refactor torchair utils. [#1892](https://github.com/vllm-project/vllm-ascend/pull/1892)
- Refactor torchair worker. [#1885](https://github.com/vllm-project/vllm-ascend/pull/1885)
- Register activation customop instead of overwrite forward_oot. [#1841](https://github.com/vllm-project/vllm-ascend/pull/1841)
- Parameter changes:
* `expert_tensor_parallel_size` in `additional_config` is removed now, and the EP and TP is aligned with vLLM now. [#1681](https://github.com/vllm-project/vllm-ascend/pull/1681)
* Add `VLLM_ASCEND_MLA_PA` in environ variables, use this to enable mla paged attention operator for deepseek mla decode.
* Add `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE` in environ variables, enable `MatmulAllReduce` fusion kernel when tensor parallel is enabled. This feature is supported in A2, and eager mode will get better performance.
* Add `VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ` in environ variables, Whether to enable moe all2all seq, this provides a basic framework on the basis of alltoall for easy expansion.
- `expert_tensor_parallel_size` in `additional_config` is removed now, and the EP and TP is aligned with vLLM now. [#1681](https://github.com/vllm-project/vllm-ascend/pull/1681)
- Add `VLLM_ASCEND_MLA_PA` in environ variables, use this to enable mla paged attention operator for deepseek mla decode.
- Add `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE` in environ variables, enable `MatmulAllReduce` fusion kernel when tensor parallel is enabled. This feature is supported in A2, and eager mode will get better performance.
- Add `VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ` in environ variables, Whether to enable moe all2all seq, this provides a basic framework on the basis of alltoall for easy expansion.
- UT coverage reached 76.34% after a batch of prs followed by this rfc: [#1298](https://github.com/vllm-project/vllm-ascend/issues/1298)
- Sequence Parallelism works for Qwen3 MoE. [#2209](https://github.com/vllm-project/vllm-ascend/issues/2209)
- Chinese online document is added now. [#1870](https://github.com/vllm-project/vllm-ascend/issues/1870)
### Known Issues
- Aclgraph cannot work with DP + EP currently; the main gap is that the number of NPU streams Aclgraph needs to capture the graph is not enough. [#2229](https://github.com/vllm-project/vllm-ascend/issues/2229)
- There is an accuracy issue on W8A8 dynamic quantized DeepSeek with multistream enabled. This will be fixed in the next release. [#2232](https://github.com/vllm-project/vllm-ascend/issues/2232)
- In Qwen3 MoE, SP cannot be incorporated into the Aclgraph. [#2246](https://github.com/vllm-project/vllm-ascend/issues/2246)
@@ -449,14 +491,17 @@ This is the 1st release candidate of v0.10.0 for vLLM Ascend. Please follow the
- When running MTP with DP > 1, we need to disable metrics logger due to some issue on vLLM. [#2254](https://github.com/vllm-project/vllm-ascend/issues/2254)
## v0.9.1rc2 - 2025.08.04
This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/v0.9.1/) to get started.
### Highlights
- MoE and dense W4A8 quantization is supported now: [#1320](https://github.com/vllm-project/vllm-ascend/pull/1320) [#1910](https://github.com/vllm-project/vllm-ascend/pull/1910) [#1275](https://github.com/vllm-project/vllm-ascend/pull/1275) [#1480](https://github.com/vllm-project/vllm-ascend/pull/1480)
- Dynamic EPLB support in [#1943](https://github.com/vllm-project/vllm-ascend/pull/1943)
- Disaggregated Prefill support for the V1 Engine, with continued development and stabilization of the feature, including performance enhancements and bug fixes for single-machine setups: [#1953](https://github.com/vllm-project/vllm-ascend/pull/1953) [#1612](https://github.com/vllm-project/vllm-ascend/pull/1612) [#1361](https://github.com/vllm-project/vllm-ascend/pull/1361) [#1746](https://github.com/vllm-project/vllm-ascend/pull/1746) [#1552](https://github.com/vllm-project/vllm-ascend/pull/1552) [#1801](https://github.com/vllm-project/vllm-ascend/pull/1801) [#2083](https://github.com/vllm-project/vllm-ascend/pull/2083) [#1989](https://github.com/vllm-project/vllm-ascend/pull/1989)
### Model Improvement
- DeepSeek DBO support and improvement: [#1285](https://github.com/vllm-project/vllm-ascend/pull/1285) [#1291](https://github.com/vllm-project/vllm-ascend/pull/1291) [#1328](https://github.com/vllm-project/vllm-ascend/pull/1328) [#1420](https://github.com/vllm-project/vllm-ascend/pull/1420) [#1445](https://github.com/vllm-project/vllm-ascend/pull/1445) [#1589](https://github.com/vllm-project/vllm-ascend/pull/1589) [#1759](https://github.com/vllm-project/vllm-ascend/pull/1759) [#1827](https://github.com/vllm-project/vllm-ascend/pull/1827) [#2093](https://github.com/vllm-project/vllm-ascend/pull/2093)
- DeepSeek MTP improvement and bugfix: [#1214](https://github.com/vllm-project/vllm-ascend/pull/1214) [#943](https://github.com/vllm-project/vllm-ascend/pull/943) [#1584](https://github.com/vllm-project/vllm-ascend/pull/1584) [#1473](https://github.com/vllm-project/vllm-ascend/pull/1473) [#1294](https://github.com/vllm-project/vllm-ascend/pull/1294) [#1632](https://github.com/vllm-project/vllm-ascend/pull/1632) [#1694](https://github.com/vllm-project/vllm-ascend/pull/1694) [#1840](https://github.com/vllm-project/vllm-ascend/pull/1840) [#2076](https://github.com/vllm-project/vllm-ascend/pull/2076) [#1990](https://github.com/vllm-project/vllm-ascend/pull/1990) [#2019](https://github.com/vllm-project/vllm-ascend/pull/2019)
- Qwen3 MoE support improvement and bugfix around graph mode and DP: [#1940](https://github.com/vllm-project/vllm-ascend/pull/1940) [#2006](https://github.com/vllm-project/vllm-ascend/pull/2006) [#1832](https://github.com/vllm-project/vllm-ascend/pull/1832)
@@ -466,6 +511,7 @@ This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
- Ray: Fixed the device error when using Ray, added initialize_cache, and improved warning info: [#1234](https://github.com/vllm-project/vllm-ascend/pull/1234) [#1501](https://github.com/vllm-project/vllm-ascend/pull/1501)
### Graph Mode Improvement
- Fix DeepSeek with mc2 in [#1269](https://github.com/vllm-project/vllm-ascend/pull/1269)
- Fix accuracy problem for deepseek V3/R1 models with torchair graph in long sequence predictions in [#1332](https://github.com/vllm-project/vllm-ascend/pull/1332)
- Fix torchair_graph_batch_sizes bug in [#1570](https://github.com/vllm-project/vllm-ascend/pull/1570)
@@ -492,12 +538,14 @@ This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
- Turn off aclgraph when enabling TorchAir in [#2154](https://github.com/vllm-project/vllm-ascend/pull/2154)
### Operator Improvement
- Added custom AscendC kernel `vocabparallelembedding` [#796](https://github.com/vllm-project/vllm-ascend/pull/796)
- Fixed rope sin/cos cache bug in [#1267](https://github.com/vllm-project/vllm-ascend/pull/1267)
- Refactored AscendFusedMoE (#1229) in [#1264](https://github.com/vllm-project/vllm-ascend/pull/1264)
- Used fused ops npu_top_k_top_p in sampler [#1920](https://github.com/vllm-project/vllm-ascend/pull/1920)
### Core:
### Core
- Upgraded CANN to 8.2.rc1 in [#2036](https://github.com/vllm-project/vllm-ascend/pull/2036)
- Upgraded torch-npu to 2.5.1.post1 in [#2135](https://github.com/vllm-project/vllm-ascend/pull/2135)
- Upgraded python to 3.11 in [#2136](https://github.com/vllm-project/vllm-ascend/pull/2136)
@@ -540,6 +588,7 @@ This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
- Added D2H & initRoutingQuantV2 to improve prefill perf in [#2038](https://github.com/vllm-project/vllm-ascend/pull/2038).
### Docs
- Provide an e2e guide for execute duration profiling [#1113](https://github.com/vllm-project/vllm-ascend/pull/1113)
- Add Referer header for CANN package download url. [#1192](https://github.com/vllm-project/vllm-ascend/pull/1192)
- Add reinstall instructions doc [#1370](https://github.com/vllm-project/vllm-ascend/pull/1370)
@@ -548,6 +597,7 @@ This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
- Fix errors and non-standard parts in examples/disaggregate_prefill_v1/README.md in [#1965](https://github.com/vllm-project/vllm-ascend/pull/1965)
### Known Issues
- Full graph mode support is not yet available for specific hardware types with full_cuda_graph enabled. [#2182](https://github.com/vllm-project/vllm-ascend/issues/2182)
- Qwen3 MoE aclgraph mode with TP fails when EP is enabled, due to a bincount error. [#2226](https://github.com/vllm-project/vllm-ascend/issues/2226)
- As mentioned in the v0.9.1rc1 release note, Atlas 300I series support will NOT be included.
@@ -557,11 +607,13 @@ This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to get started. From this release, the V1 engine is enabled by default, so there is no need to set `VLLM_USE_V1=1` any more. This release is also the last version to support the V0 engine; the V0 code will be cleaned up in the future.
### Highlights
- Pooling model works with V1 engine now. You can take a try with Qwen3 embedding model [#1359](https://github.com/vllm-project/vllm-ascend/pull/1359).
- The performance on Atlas 300I series has been improved. [#1591](https://github.com/vllm-project/vllm-ascend/pull/1591)
- aclgraph mode works with Moe models now. Currently, only Qwen3 Moe is well tested. [#1381](https://github.com/vllm-project/vllm-ascend/pull/1381)
### Core
- Ascend PyTorch adapter (torch_npu) has been upgraded to `2.5.1.post1.dev20250619`. Don't forget to update it in your environment. [#1347](https://github.com/vllm-project/vllm-ascend/pull/1347)
- The GatherV3 error has been fixed with aclgraph mode. [#1416](https://github.com/vllm-project/vllm-ascend/pull/1416)
- W8A8 quantization works on Atlas 300I series now. [#1560](https://github.com/vllm-project/vllm-ascend/pull/1560)
@@ -569,6 +621,7 @@ This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [
- The pre-built wheel package now requires lower version of glibc. Users can use it by `pip install vllm-ascend` directly. [#1582](https://github.com/vllm-project/vllm-ascend/pull/1582)
### Others
- The official doc has been updated for a better reading experience. For example, more deployment tutorials are added and the user/developer docs are updated. More guides are coming soon.
- Fix accuracy problem for deepseek V3/R1 models with torchair graph in long sequence predictions. [#1331](https://github.com/vllm-project/vllm-ascend/pull/1331)
- A new env variable `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` has been added. It enables the fused allgather-experts kernel for Deepseek V3/R1 models. The default value is `0`. [#1335](https://github.com/vllm-project/vllm-ascend/pull/1335)
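A one-line sketch of opting in; the variable name and its default come from the note above:

```shell
# Sketch only: enable the fused allgather-experts kernel for DeepSeek V3/R1 (default is 0).
export VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP=1
```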
@@ -581,23 +634,24 @@ This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [
### Known Issues
- Pipeline parallel does not work with ray and graph mode: https://github.com/vllm-project/vllm-ascend/issues/1751 https://github.com/vllm-project/vllm-ascend/issues/1754
- Pipeline parallel does not work with ray and graph mode: <https://github.com/vllm-project/vllm-ascend/issues/1751> <https://github.com/vllm-project/vllm-ascend/issues/1754>
### New Contributors
- @xleoken made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1357
- @lyj-jjj made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1335
- @sharonyunyun made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1194
- @Pr0Wh1teGivee made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1308
- @leo-pony made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1374
- @zeshengzong made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1452
- @GDzhu01 made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1477
- @Agonixiaoxiao made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1531
- @zhanghw0354 made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1476
- @farawayboat made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1591
- @ZhengWG made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1196
- @wm901115nwpu made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1654
**Full Changelog**: https://github.com/vllm-project/vllm-ascend/compare/v0.9.1rc1...v0.9.2rc1
- @xleoken made their first contribution in <https://github.com/vllm-project/vllm-ascend/pull/1357>
- @lyj-jjj made their first contribution in <https://github.com/vllm-project/vllm-ascend/pull/1335>
- @sharonyunyun made their first contribution in <https://github.com/vllm-project/vllm-ascend/pull/1194>
- @Pr0Wh1teGivee made their first contribution in <https://github.com/vllm-project/vllm-ascend/pull/1308>
- @leo-pony made their first contribution in <https://github.com/vllm-project/vllm-ascend/pull/1374>
- @zeshengzong made their first contribution in <https://github.com/vllm-project/vllm-ascend/pull/1452>
- @GDzhu01 made their first contribution in <https://github.com/vllm-project/vllm-ascend/pull/1477>
- @Agonixiaoxiao made their first contribution in <https://github.com/vllm-project/vllm-ascend/pull/1531>
- @zhanghw0354 made their first contribution in <https://github.com/vllm-project/vllm-ascend/pull/1476>
- @farawayboat made their first contribution in <https://github.com/vllm-project/vllm-ascend/pull/1591>
- @ZhengWG made their first contribution in <https://github.com/vllm-project/vllm-ascend/pull/1196>
- @wm901115nwpu made their first contribution in <https://github.com/vllm-project/vllm-ascend/pull/1654>
**Full Changelog**: <https://github.com/vllm-project/vllm-ascend/compare/v0.9.1rc1...v0.9.2rc1>
## v0.9.1rc1 - 2025.06.22
@@ -611,12 +665,14 @@ This is the 1st release candidate of v0.9.1 for vLLM Ascend. Please follow the [
After careful consideration, the above features **will NOT be included in the v0.9.1-dev branch (v0.9.1 final release)**, taking into account the v0.9.1 release quality and the rapid feature iteration. We will improve this from 0.9.2rc1 onward.
### Core
- Ascend PyTorch adapter (torch_npu) has been upgraded to `2.5.1.post1.dev20250528`. Don't forget to update it in your environment. [#1235](https://github.com/vllm-project/vllm-ascend/pull/1235)
- Support Atlas 300I series container image. You can get it from [quay.io](https://quay.io/repository/vllm/vllm-ascend)
- Fix token-wise padding mechanism to make multi-card graph mode work. [#1300](https://github.com/vllm-project/vllm-ascend/pull/1300)
- Upgrade vLLM to 0.9.1 [#1165](https://github.com/vllm-project/vllm-ascend/pull/1165)
### Other Improvements
- Initial support Chunked Prefill for MLA. [#1172](https://github.com/vllm-project/vllm-ascend/pull/1172)
- An example of best practices to run DeepSeek with ETP has been added. [#1101](https://github.com/vllm-project/vllm-ascend/pull/1101)
- Performance improvements for DeepSeek using the TorchAir graph. [#1098](https://github.com/vllm-project/vllm-ascend/pull/1098), [#1131](https://github.com/vllm-project/vllm-ascend/pull/1131)
@@ -631,21 +687,24 @@ After careful consideration, above features **will NOT be included in v0.9.1-dev
- Add unit test framework [#1201](https://github.com/vllm-project/vllm-ascend/pull/1201)
### Known Issues
- In some cases, the vLLM process may crash with a **GatherV3** error when **aclgraph** is enabled. We are working on this issue and will fix it in the next release. [#1038](https://github.com/vllm-project/vllm-ascend/issues/1038)
- Prefix cache feature does not work with the Ascend Scheduler but without chunked prefill enabled. This will be fixed in the next release. [#1350](https://github.com/vllm-project/vllm-ascend/issues/1350)
### Full Changelog
https://github.com/vllm-project/vllm-ascend/compare/v0.9.0rc2...v0.9.1rc1
<https://github.com/vllm-project/vllm-ascend/compare/v0.9.0rc2...v0.9.1rc1>
### New Contributors
- @farawayboat made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1333
- @yzim made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1159
- @chenwaner made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1098
- @wangyanhui-cmss made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1184
- @songshanhu07 made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1186
- @yuancaoyaoHW made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1032
**Full Changelog**: https://github.com/vllm-project/vllm-ascend/compare/v0.9.0rc2...v0.9.1rc1
- @farawayboat made their first contribution in <https://github.com/vllm-project/vllm-ascend/pull/1333>
- @yzim made their first contribution in <https://github.com/vllm-project/vllm-ascend/pull/1159>
- @chenwaner made their first contribution in <https://github.com/vllm-project/vllm-ascend/pull/1098>
- @wangyanhui-cmss made their first contribution in <https://github.com/vllm-project/vllm-ascend/pull/1184>
- @songshanhu07 made their first contribution in <https://github.com/vllm-project/vllm-ascend/pull/1186>
- @yuancaoyaoHW made their first contribution in <https://github.com/vllm-project/vllm-ascend/pull/1032>
**Full Changelog**: <https://github.com/vllm-project/vllm-ascend/compare/v0.9.0rc2...v0.9.1rc1>
## v0.9.0rc2 - 2025.06.10
@@ -722,19 +781,23 @@ This is the first post release of 0.7.3. Please follow the [official doc](https:
We are excited to announce the release of 0.7.3 for vllm-ascend. This is the first official release. The functionality, performance, and stability of this release are fully tested and verified. We encourage you to try it out and provide feedback. We'll post bug fix versions in the future if needed. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/v0.7.3) to start the journey.
### Highlights
- This release includes all features landed in the previous release candidates ([v0.7.1rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.1rc1), [v0.7.3rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc1), [v0.7.3rc2](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc2)), and all the features are fully tested and verified. Visit the official doc to get the detailed [feature](https://docs.vllm.ai/projects/ascend/en/v0.7.3/user_guide/suppoted_features.html) and [model](https://docs.vllm.ai/projects/ascend/en/v0.7.3/user_guide/supported_models.html) support matrix.
- Upgrade CANN to 8.1.RC1 to enable the chunked prefill and automatic prefix caching features. You can enable them now.
- Upgrade PyTorch to 2.5.1. vLLM Ascend no longer relies on the dev version of torch-npu now. Now users don't need to install the torch-npu by hand. The 2.5.1 version of torch-npu will be installed automatically. [#662](https://github.com/vllm-project/vllm-ascend/pull/662)
- Integrate MindIE Turbo into vLLM Ascend to improve DeepSeek V3/R1, Qwen 2 series performance. [#708](https://github.com/vllm-project/vllm-ascend/pull/708)
### Core
- LoRA, Multi-LoRA and Dynamic Serving are supported now. The performance will be improved in the next release. Please follow the official doc for more usage information. Thanks for the contribution from China Merchants Bank. [#700](https://github.com/vllm-project/vllm-ascend/pull/700)
### Models
- The performance of Qwen2 vl and Qwen2.5 vl is improved. [#702](https://github.com/vllm-project/vllm-ascend/pull/702)
- The performance of `apply_penalties` and `topKtopP` ops are improved. [#525](https://github.com/vllm-project/vllm-ascend/pull/525)
### Others
- Fixed an issue that may lead to a CPU memory leak. [#691](https://github.com/vllm-project/vllm-ascend/pull/691) [#712](https://github.com/vllm-project/vllm-ascend/pull/712)
- A new environment variable `SOC_VERSION` is added. If you hit any SoC detection error when building with custom ops enabled, please set `SOC_VERSION` to a suitable value (see the sketch below). [#606](https://github.com/vllm-project/vllm-ascend/pull/606)
- openEuler container image supported with v0.7.3-openeuler tag. [#665](https://github.com/vllm-project/vllm-ascend/pull/665)
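A minimal sketch of such a build, assuming a local source checkout and that the `COMPILE_CUSTOM_KERNELS` switch described in the v0.7.3rc2 notes still applies; the SoC value below is a hypothetical example and must match your actual hardware:

```shell
# Sketch only: override SoC detection while building with custom ops enabled.
# "Ascend910B1" is a hypothetical placeholder value, not a recommendation.
export SOC_VERSION=Ascend910B1
COMPILE_CUSTOM_KERNELS=1 pip install -e /path/to/vllm-ascend
```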
@@ -745,11 +808,13 @@ We are excited to announce the release of 0.7.3 for vllm-ascend. This is the fir
This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to start the journey. Now you can enable the V1 engine by setting the environment variable `VLLM_USE_V1=1`; see the feature support status of vLLM Ascend [here](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_features.html).
### Highlights
- Upgrade CANN version to 8.1.RC1 to support chunked prefill and automatic prefix caching (`--enable_prefix_caching`) when V1 is enabled [#747](https://github.com/vllm-project/vllm-ascend/pull/747)
- Optimize Qwen2 VL and Qwen 2.5 VL [#701](https://github.com/vllm-project/vllm-ascend/pull/701)
- Improve DeepSeek V3 eager mode and graph mode performance; now you can use `--additional_config={'enable_graph_mode': True}` to enable graph mode. [#598](https://github.com/vllm-project/vllm-ascend/pull/598) [#719](https://github.com/vllm-project/vllm-ascend/pull/719)
### Core
- Upgrade vLLM to 0.8.5.post1 [#715](https://github.com/vllm-project/vllm-ascend/pull/715)
- Fix early return in CustomDeepseekV2MoE.forward during profile_run [#682](https://github.com/vllm-project/vllm-ascend/pull/682)
- Adapts for new quant model generated by modelslim [#719](https://github.com/vllm-project/vllm-ascend/pull/719)
@@ -759,6 +824,7 @@ This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the [
- Fix `PYTHON_INCLUDE_PATH` typo in setup.py [#762](https://github.com/vllm-project/vllm-ascend/pull/762)
### Others
- Add Qwen3-0.6B test [#717](https://github.com/vllm-project/vllm-ascend/pull/717)
- Add nightly CI [#668](https://github.com/vllm-project/vllm-ascend/pull/668)
- Add accuracy test report [#542](https://github.com/vllm-project/vllm-ascend/pull/542)
@@ -768,15 +834,18 @@ This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the [
This is the second release candidate of v0.8.4 for vllm-ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to start the journey. Some experimental features are included in this version, such as W8A8 quantization and EP/DP support. We'll make them stable enough in the next release.
### Highlights
- Qwen3 and Qwen3MoE are supported now. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/single_npu.html) to run the quick demo. [#709](https://github.com/vllm-project/vllm-ascend/pull/709)
- Ascend W8A8 quantization method is supported now. Please take the [official doc](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/multi_npu_quantization.html) for example. Any [feedback](https://github.com/vllm-project/vllm-ascend/issues/619) is welcome. [#580](https://github.com/vllm-project/vllm-ascend/pull/580)
- DeepSeek V3/R1 works with DP, TP and MTP now. Please note that it's still in experimental status. Let us know if you hit any problem. [#429](https://github.com/vllm-project/vllm-ascend/pull/429) [#585](https://github.com/vllm-project/vllm-ascend/pull/585) [#626](https://github.com/vllm-project/vllm-ascend/pull/626) [#636](https://github.com/vllm-project/vllm-ascend/pull/636) [#671](https://github.com/vllm-project/vllm-ascend/pull/671)
### Core
- The ACLGraph feature is supported with the V1 engine now. It's disabled by default because this feature relies on the CANN 8.1 release. We'll make it available by default in the next release. [#426](https://github.com/vllm-project/vllm-ascend/pull/426)
- Upgrade PyTorch to 2.5.1. vLLM Ascend no longer relies on the dev version of torch-npu now. Now users don't need to install the torch-npu by hand. The 2.5.1 version of torch-npu will be installed automatically. [#661](https://github.com/vllm-project/vllm-ascend/pull/661)
### Others
- MiniCPM model works now. [#645](https://github.com/vllm-project/vllm-ascend/pull/645)
- openEuler container image supported with `v0.8.4-openeuler` tag and customs Ops build is enabled by default for openEuler OS. [#689](https://github.com/vllm-project/vllm-ascend/pull/689)
- Fix ModuleNotFoundError bug to make Lora work [#600](https://github.com/vllm-project/vllm-ascend/pull/600)
@@ -809,18 +878,22 @@ This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the
## v0.7.3rc2 - 2025.03.29
This is 2nd release candidate of v0.7.3 for vllm-ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/v0.7.3) to start the journey.
- Quickstart with container: https://docs.vllm.ai/projects/ascend/en/v0.7.3/quick_start.html
- Installation: https://docs.vllm.ai/projects/ascend/en/v0.7.3/installation.html
- Quickstart with container: <https://docs.vllm.ai/projects/ascend/en/v0.7.3/quick_start.html>
- Installation: <https://docs.vllm.ai/projects/ascend/en/v0.7.3/installation.html>
### Highlights
- Add the Ascend Custom Ops framework. Developers can now write custom ops using AscendC. An example op, `rotary_embedding`, is added. More tutorials will come soon. Custom Ops compilation is disabled by default when installing vllm-ascend; set `COMPILE_CUSTOM_KERNELS=1` to enable it (see the sketch below). [#371](https://github.com/vllm-project/vllm-ascend/pull/371)
- The V1 engine is basically supported in this release. Full support will be done in the 0.8.X releases. If you hit any issue or have any requirement for the V1 engine, please tell us [here](https://github.com/vllm-project/vllm-ascend/issues/414). [#376](https://github.com/vllm-project/vllm-ascend/pull/376)
- The prefix cache feature works now. You can set `enable_prefix_caching=True` to enable it. [#282](https://github.com/vllm-project/vllm-ascend/pull/282)
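As a sketch of the custom-ops switch described above, with the source checkout path as a placeholder:

```shell
# Sketch only: build vllm-ascend from source with the custom AscendC ops enabled
# (custom kernel compilation is off by default in this release).
COMPILE_CUSTOM_KERNELS=1 pip install -e /path/to/vllm-ascend
```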
### Core
- Bump torch_npu version to dev20250320.3 to improve accuracy to fix `!!!` output problem. [#406](https://github.com/vllm-project/vllm-ascend/pull/406)
### Models
- The performance of Qwen2-vl is improved by optimizing patch embedding (Conv3D). [#398](https://github.com/vllm-project/vllm-ascend/pull/398)
### Others
@@ -831,28 +904,34 @@ This is 2nd release candidate of v0.7.3 for vllm-ascend. Please follow the [offi
## v0.7.3rc1 - 2025.03.14
🎉 Hello, World! This is the first release candidate of v0.7.3 for vllm-ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/v0.7.3) to start the journey.
- Quickstart with container: https://docs.vllm.ai/projects/ascend/en/v0.7.3/quick_start.html
- Installation: https://docs.vllm.ai/projects/ascend/en/v0.7.3/installation.html
- Quickstart with container: <https://docs.vllm.ai/projects/ascend/en/v0.7.3/quick_start.html>
- Installation: <https://docs.vllm.ai/projects/ascend/en/v0.7.3/installation.html>
### Highlights
- DeepSeek V3/R1 works well now. Read the [official guide](https://docs.vllm.ai/projects/ascend/en/v0.7.3/tutorials/multi_node.html) to start! [#242](https://github.com/vllm-project/vllm-ascend/pull/242)
- Speculative decoding feature is supported. [#252](https://github.com/vllm-project/vllm-ascend/pull/252)
- Multi step scheduler feature is supported. [#300](https://github.com/vllm-project/vllm-ascend/pull/300)
### Core
- Bump torch_npu version to dev20250308.3 to improve `_exponential` accuracy
- Added initial support for pooling models. Bert based model, such as `BAAI/bge-base-en-v1.5` and `BAAI/bge-reranker-v2-m3` works now. [#229](https://github.com/vllm-project/vllm-ascend/pull/229)
### Models
- The performance of Qwen2-VL is improved. [#241](https://github.com/vllm-project/vllm-ascend/pull/241)
- MiniCPM is now supported [#164](https://github.com/vllm-project/vllm-ascend/pull/164)
### Others
- Support MTP(Multi-Token Prediction) for DeepSeek V3/R1 [#236](https://github.com/vllm-project/vllm-ascend/pull/236)
- [Docs] Added more model tutorials, include DeepSeek, QwQ, Qwen and Qwen 2.5VL. See the [official doc](https://docs.vllm.ai/projects/ascend/en/v0.7.3/tutorials/index.html) for detail
- Pin modelscope<1.23.0 on vLLM v0.7.3 to resolve: https://github.com/vllm-project/vllm/pull/13807
- Pin modelscope<1.23.0 on vLLM v0.7.3 to resolve: <https://github.com/vllm-project/vllm/pull/13807>
### Known Issues
- In [some cases](https://github.com/vllm-project/vllm-ascend/issues/324), especially when the input/output is very long, the accuracy of output may be incorrect. We are working on it. It'll be fixed in the next release.
- Improved and reduced the garbled code in model output. But if you still hit the issue, try to change the generation config value, such as `temperature`, and try again. There is also a known issue shown below. Any [feedback](https://github.com/vllm-project/vllm-ascend/issues/267) is welcome. [#277](https://github.com/vllm-project/vllm-ascend/pull/277)

View File

@@ -2,7 +2,7 @@
The feature support principle of vLLM Ascend is: **aligned with vLLM**. We are also actively collaborating with the community to accelerate support.
Functional call: https://docs.vllm.ai/en/latest/features/tool_calling/
Functional call: <https://docs.vllm.ai/en/latest/features/tool_calling/>
You can check the [support status of vLLM V1 Engine][v1_user_guide]. Below is the feature support status of vLLM Ascend:

View File

@@ -1,6 +1,6 @@
# Supported Models
Get the latest info here: https://github.com/vllm-project/vllm-ascend/issues/1608
Get the latest info here: <https://github.com/vllm-project/vllm-ascend/issues/1608>
## Text-Only Language Models

View File

@@ -2,29 +2,29 @@
## Environmental Dependencies
* Software:
* Python >= 3.10, < 3.12
* CANN == 8.3.rc2
* PyTorch == 2.8.0, torch-npu == 2.8.0
* vLLM (same version as vllm-ascend)
* mooncake-transfer-engine reference documentation: https://github.com/kvcache-ai/Mooncake/blob/main/doc/zh/ascend_transport.md
* Software:
* Python >= 3.10, < 3.12
* CANN == 8.3.rc2
* PyTorch == 2.8.0, torch-npu == 2.8.0
* vLLM (same version as vllm-ascend)
* mooncake-transfer-engine reference documentation: <https://github.com/kvcache-ai/Mooncake/blob/main/doc/zh/ascend_transport.md>
The vLLM version must match the one required by the vllm-ascend main branch. For example, as of 2025/07/30, the versions are:
* vllm: v0.10.1
* vllm-ascend: v0.10.1rc1
* vllm: v0.10.1
* vllm-ascend: v0.10.1rc1
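As a sketch, the transfer-engine dependency listed above is typically installed from PyPI, assuming the package is published under the same name used in the Mooncake documentation:

```shell
# Sketch only: install the Ascend transport dependency; pin a version as your environment requires.
pip install mooncake-transfer-engine
```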
## run
### 1. Run `prefill` Node
```
```shell
bash run_prefill.sh
```
Content of the run_prefill.sh script
```
```shell
export HCCL_EXEC_TIMEOUT=204
export HCCL_CONNECT_TIMEOUT=120
export HCCL_IF_IP=localhost
@@ -85,13 +85,13 @@ Set `GLOO_SOCKET_IFNAME`, `TP_SOCKET_IFNAME`, and `HCCL_SOCKET_IFNAME` to the co
### 2. Run `decode` Node
```
```shell
bash run_decode.sh
```
Content of the run_decode.sh script
```
```shell
export HCCL_EXEC_TIMEOUT=204
export HCCL_CONNECT_TIMEOUT=120
export HCCL_IF_IP=localhost
@@ -135,9 +135,9 @@ vllm serve "/xxxxx/DeepSeek-V2-Lite-Chat" \
}'
```
### 3. Start proxy_server. ###
### 3. Start proxy_server
```
```shell
cd /vllm-ascend/examples/disaggregate_prefill_v1/
python load_balance_proxy_server_example.py --host localhost --prefiller-hosts host1 host2 --prefiller-ports 8100 8101 --decoder-hosts host3 host4 --decoder-ports 8200 8201
```
@@ -152,7 +152,7 @@ python load_balance_proxy_server_example.py --host localhost --prefiller-hosts h
Set the IP address in the inference file to the actual IP address. Set the model variable to the path of the model. Ensure that the path is the same as that in the shell script.
```
```shell
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
"model": "model_path",
"prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?",

View File

@@ -1,12 +1,14 @@
Here is an example showing how to use `launch_online_dp.py` to launch external DP vLLM servers. Users can easily launch external DP servers by following the steps below:
### Modify parameters in `run_dp_template.sh`
`run_dp_template.sh` is a template script used to launch each DP vLLM instance separately. It is called by `launch_online_dp.py` in multiple threads, and most of its configuration is set by `launch_online_dp.py`. Parameters you need to set manually include:
1. The IP and socket_ifname of your machine. If running on multiple nodes, please make sure the script on each node is set with the correct IP and socket_ifname of that node.
2. vLLM serving-related parameters, including model_path and other configurations. Note that the port, DP-related parameters and tp_size are set by `launch_online_dp.py`; all the other vLLM parameters in this file only serve as an example and you are free to modify them for your purpose.
### Run `launch_online_dp.py` with CL arguments
All the arguments that can be set by users are:
1. `--dp-size`: global data parallel size, must be set
@@ -39,6 +41,7 @@ python launch_online_dp.py --dp-size 4 --tp-size 4 --dp-size-local 2 --dp-rank-s
```
### (Optional) Run `dp_load_balance_proxy_server.py` to load balance requests between external dp servers
Using external DP servers means that you need to handle load balancing between multiple DP instances outside of vLLM by implementing your own proxy server. Here we provide an example of a request-length-aware DP load-balance proxy server. The arguments of `dp_load_balance_proxy_server.py` include:
1. `--port`: port of proxy server, default 8000