[v0.11.0][Doc] Update doc (#3852)

### What this PR does / why we need it?
Update doc


Signed-off-by: hfadzxy <starmoon_zhang@163.com>
zhangxinyuehfad
2025-10-29 11:32:12 +08:00
committed by GitHub
parent 6188450269
commit 75de3fa172
49 changed files with 724 additions and 701 deletions


@@ -17,7 +17,7 @@
## Contributors
vLLM Ascend every release would not have been possible without the following contributors:
Every release of vLLM Ascend would not have been possible without the following contributors:
Updated on 2025-09-30:


@@ -1,48 +1,48 @@
# Governance
## Mission
As a vital component of vLLM, the vLLM Ascend project is dedicated to providing an easy, fast, and cheap LLM Serving for Everyone on Ascend NPU, and to actively contribute to the enrichment of vLLM.
As a vital component of vLLM, the vLLM Ascend project is dedicated to providing an easy, fast, and cheap LLM Serving for everyone on Ascend NPUs and to actively contributing to the enrichment of vLLM.
## Principles
vLLM Ascend follows the vLLM community's code of conduct[vLLM - CODE OF CONDUCT](https://github.com/vllm-project/vllm/blob/main/CODE_OF_CONDUCT.md)
vLLM Ascend follows the vLLM community's code of conduct: [vLLM - CODE OF CONDUCT](https://github.com/vllm-project/vllm/blob/main/CODE_OF_CONDUCT.md)
## Governance - Mechanics
vLLM Ascend is an open-source project under the vLLM community, where the authority to appoint roles is ultimately determined by the vLLM community. It adopts a hierarchical technical governance structure.
- Contributor:
**Responsibility:** Help new contributors on boarding, handle and respond to community questions, review RFCs, code
**Responsibility:** Help new contributors with onboarding, handle and respond to community questions, and review RFCs and code.
**Requirements:** Complete at least 1 contribution. Contributor is someone who consistently and actively participates in a project, included but not limited to issue/review/commits/community involvement.
**Requirements:** Complete at least 1 contribution. A contributor is someone who consistently and actively participates in a project, including but not limited to issues, reviews, commits, and community involvement.
Contributors will be empowered [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) Github repo `Triage` permissions (`Can read and clone this repository. Can also manage issues and pull requests`) to help community developers collaborate more efficiently.
Contributor permissions are granted through the `Triage` role on the [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) GitHub repo, which allows reading and cloning the repository and managing issues and pull requests, facilitating efficient collaboration between community developers.
- Maintainer:
**Responsibility:** Develop the project's vision and mission. Maintainers are responsible for driving the technical direction of the entire project and ensuring its overall success, possessing code merge permissions. They formulate the roadmap, review contributions from community members, continuously contribute code, and actively engage in community activities (such as regular meetings/events).
**Responsibility:** Develop the project's vision and mission. Maintainers are responsible for shaping the technical direction of the project and ensuring its long-term success. With code merge permissions, they lead roadmap planning, review community contributions, make ongoing code improvements, and actively participate in community engagement—such as regular meetings and events.
**Requirements:** Deep understanding of vLLM and vLLM Ascend codebases, with a commitment to sustained code contributions. Competency in design/development/PR review workflows.
- **Review Quality:** Actively participate in community code reviews, ensuring high-quality code integration.
- **Quality Contribution:** Successfully develop and deliver at least one major feature while maintaining consistent high-quality contributions.
- **Community Involvement:** Actively address issues, respond to forum inquiries, participate in discussions, and engage in community-driven tasks.
**Requirements:** Deep understanding of vLLM and vLLM Ascend code bases, with a commitment to sustained code contributions and competency in design, development, and PR review workflows.
Requires approval from existing Maintainers. The vLLM community has the final decision-making authority.
- **Review quality:** Actively participate in community code reviews, ensuring high-quality code integration.
- **Quality contribution:** Successfully develop and deliver at least one major feature while maintaining consistent high-quality contributions.
- **Community involvement:** Actively address issues, respond to forum inquiries, participate in discussions, and engage in community-driven tasks.
Maintainer will be empowered [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) Github repo write permissions (`Can read, clone, and push to this repository. Can also manage issues and pull requests`).
The approval from existing Maintainers is required. The vLLM community has the final decision-making authority.
Maintainers will be granted write access to the [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) GitHub repo. This includes permission to read, clone, and push to the repository, as well as manage issues and pull requests.
## Nominating and Removing Maintainers
### The Principles
- Membership in vLLM Ascend is given to individuals on merit basis after they demonstrated strong expertise of the vLLM / vLLM Ascend through contributions, reviews and discussions.
- Membership in vLLM Ascend is given to individuals on merit basis after they demonstrate their strong expertise in vLLM/vLLM Ascend through contributions, reviews, and discussions.
- For membership in the maintainer group the individual has to demonstrate strong and continued alignment with the overall vLLM / vLLM Ascend principles.
- For membership in the maintainer group, individuals have to demonstrate strong and continued alignment with the overall vLLM/vLLM Ascend principles.
- Light criteria of moving module maintenance to emeritus status if they dont actively participate over long periods of time.
- Maintainers who have been inactive for a long time may be transitioned to **emeritus** status under lenient criteria.
- The membership is for an individual, not a company.
### Nomination and Removal
- Nomination: Anyone can nominate someone to become a maintainer (include self-nominate). All existing maintainers are responsible for evaluating the nomination. The nominator should provide nominee's info around the strength of the candidate to be a maintainer, include but not limited to review quality, quality contribution, community involvement.
- Removal: Anyone can nominate a person to be removed from maintainer position (include self-nominate). All existing maintainers are responsible for evaluating the nomination. The nominator should provide nominee's info, include but not limited to lack of activity, conflict with the overall direction and other information that makes them unfit to be a maintainer.
- Nomination: Anyone can nominate a candidate to become a maintainer, including self-nominations. All existing maintainers are responsible for reviewing and evaluating each nomination. The nominator should provide relevant information about the nominee's qualifications—such as review quality, quality contribution, and community involvement—among other strengths.
- Removal: Anyone may nominate an individual for removal from the maintainer role, including self-nominations. All current maintainers are responsible for reviewing and evaluating such nominations. The nominator should provide relevant information about the nominee—such as prolonged inactivity, misalignment with the project's overall direction, or other factors that may render them unsuitable for the maintainer position.


@@ -1,16 +1,16 @@
# User Stories
# User stories
Read case studies on how users and developers solves real, everyday problems with vLLM Ascend
Read case studies on how users and developers solve real, everyday problems with vLLM Ascend
- [LLaMA-Factory](./llamafactory.md) is an easy-to-use and efficient platform for training and fine-tuning large language models, it supports vLLM Ascend to speed up inference since [LLaMA-Factory#7739](https://github.com/hiyouga/LLaMA-Factory/pull/7739), gain 2x performance enhancement of inference.
- [LLaMA-Factory](./llamafactory.md) is an easy-to-use and efficient platform for training and fine-tuning large language models. It supports vLLM Ascend to speed up inference since [LLaMA-Factory#7739](https://github.com/hiyouga/LLaMA-Factory/pull/7739), gaining 2x performance enhancement in inference.
- [Huggingface/trl](https://github.com/huggingface/trl) is a cutting-edge library designed for post-training foundation models using advanced techniques like SFT, PPO and DPO, it uses vLLM Ascend since [v0.17.0](https://github.com/huggingface/trl/releases/tag/v0.17.0) to support RLHF on Ascend NPU.
- [Huggingface/trl](https://github.com/huggingface/trl) is a cutting-edge library designed for post-training foundation models using advanced techniques like SFT, PPO and DPO. It uses vLLM Ascend since [v0.17.0](https://github.com/huggingface/trl/releases/tag/v0.17.0) to support RLHF on Ascend NPUs.
- [MindIE Turbo](https://pypi.org/project/mindie-turbo) is an LLM inference engine acceleration plug-in library developed by Huawei on Ascend hardware, which includes self-developed large language model optimization algorithms and optimizations related to the inference engine framework. It supports vLLM Ascend since [2.0rc1](https://www.hiascend.com/document/detail/zh/mindie/20RC1/AcceleratePlugin/turbodev/mindie-turbo-0001.html).
- [MindIE Turbo](https://pypi.org/project/mindie-turbo) is an LLM inference engine acceleration plugin library developed by Huawei on Ascend hardware, which includes self-developed LLM optimization algorithms and optimizations related to the inference engine framework. It supports vLLM Ascend since [2.0rc1](https://www.hiascend.com/document/detail/zh/mindie/20RC1/AcceleratePlugin/turbodev/mindie-turbo-0001.html).
- [GPUStack](https://github.com/gpustack/gpustack) is an open-source GPU cluster manager for running AI models. It supports vLLM Ascend since [v0.6.2](https://github.com/gpustack/gpustack/releases/tag/v0.6.2), see more GPUStack performance evaluation info on [link](https://mp.weixin.qq.com/s/pkytJVjcH9_OnffnsFGaew).
- [GPUStack](https://github.com/gpustack/gpustack) is an open-source GPU cluster manager for running AI models. It supports vLLM Ascend since [v0.6.2](https://github.com/gpustack/gpustack/releases/tag/v0.6.2). See more GPUStack performance evaluation information at [this link](https://mp.weixin.qq.com/s/pkytJVjcH9_OnffnsFGaew).
- [verl](https://github.com/volcengine/verl) is a flexible, efficient and production-ready RL training library for large language models (LLMs), uses vLLM Ascend since [v0.4.0](https://github.com/volcengine/verl/releases/tag/v0.4.0), see more info on [verl x Ascend Quickstart](https://verl.readthedocs.io/en/latest/ascend_tutorial/ascend_quick_start.html).
- [verl](https://github.com/volcengine/verl) is a flexible, efficient, and production-ready RL training library for LLMs. It uses vLLM Ascend since [v0.4.0](https://github.com/volcengine/verl/releases/tag/v0.4.0). See more information on [verl x Ascend Quickstart](https://verl.readthedocs.io/en/latest/ascend_tutorial/ascend_quick_start.html).
:::{toctree}
:caption: More details


@@ -1,19 +1,19 @@
# LLaMA-Factory
**About / Introduction**
**Introduction**
[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) is an easy-to-use and efficient platform for training and fine-tuning large language models. With LLaMA-Factory, you can fine-tune hundreds of pre-trained models locally without writing any code.
LLaMA-Facotory users need to evaluate and inference the model after fine-tuning the model.
LLaMA-Factory users need to evaluate and run inference with the model after fine-tuning.
**The Business Challenge**
**Business challenge**
LLaMA-Factory used transformers to perform inference on Ascend NPU, but the speed was slow.
LLaMA-Factory uses Transformers to perform inference on Ascend NPUs, but the speed is slow.
**Solving Challenges and Benefits with vLLM Ascend**
**Benefits with vLLM Ascend**
With the joint efforts of LLaMA-Factory and vLLM Ascend ([LLaMA-Factory#7739](https://github.com/hiyouga/LLaMA-Factory/pull/7739)), the performance of LLaMA-Factory in the model inference stage has been significantly improved. According to the test results, the inference speed of LLaMA-Factory has been increased to 2x compared to the transformers version.
With the joint efforts of LLaMA-Factory and vLLM Ascend ([LLaMA-Factory#7739](https://github.com/hiyouga/LLaMA-Factory/pull/7739)), LLaMA-Factory has achieved significant performance gains during model inference. Benchmark results show that its inference speed is now up to 2× faster compared to the Transformers implementation.
**Learn more**
See more about LLaMA-Factory and how it uses vLLM Ascend for inference on the Ascend NPU in the following documentation: [LLaMA-Factory Ascend NPU Inference](https://llamafactory.readthedocs.io/en/latest/advanced/npu_inference.html).
See more details about LLaMA-Factory and how it uses vLLM Ascend for inference on Ascend NPUs in [LLaMA-Factory Ascend NPU Inference](https://llamafactory.readthedocs.io/en/latest/advanced/npu_inference.html).


@@ -4,21 +4,21 @@ Starting with vLLM 0.7.x, the vLLM Ascend Plugin ([vllm-project/vllm-ascend](htt
## vLLM Ascend Plugin versions
Each vLLM Ascend release will be versioned: `v[major].[minor].[micro][rcN][.postN]` (such as
Each vLLM Ascend release is versioned as `v[major].[minor].[micro][rcN][.postN]` (such as
`v0.7.3rc1`, `v0.7.3`, `v0.7.3.post1`)
- **Final releases**: will typically be released every **3 months**, will take the vLLM upstream release plan and Ascend software product release plan into comprehensive consideration.
- **Pre releases**: will typically be released **on demand**, ending with rcN, represents the Nth release candidate version, to support early testing by our users prior to a final release.
- **Post releases**: will typically be released **on demand** to support to address minor errors in a final release. It's different from [PEP-440 post release note](https://peps.python.org/pep-0440/#post-releases) suggestion, it will contain actual bug fixes considering that the final release version should be matched strictly with the vLLM final release version (`v[major].[minor].[micro]`). The post version has to be published as a patch version of the final release.
- **Final releases**: Typically scheduled every three months, with careful alignment to the vLLM upstream release cycle and the Ascend software product roadmap.
- **Pre releases**: Typically issued **on demand**, labeled with rcN to indicate the Nth release candidate. They are intended to support early testing by users ahead of the final release.
- **Post releases**: Typically issued **on demand** to address minor errors in a final release. Different from [PEP-440 post release note](https://peps.python.org/pep-0440/#post-releases) convention, these versions include actual bug fixes, as the final release version must strictly align with the vLLM final release format (`v[major].[minor].[micro]`). Any post version must be published as a patch version of the final release.
For example:
- `v0.7.x`: it's the first final release to match the vLLM `v0.7.x` version.
- `v0.7.3rc1`: will be the first pre version of vLLM Ascend.
- `v0.7.3.post1`: will be the post release if the `v0.7.3` release has some minor errors.
- `v0.7.x`: first final release to match the vLLM `v0.7.x` version.
- `v0.7.3rc1`: first pre version of vLLM Ascend.
- `v0.7.3.post1`: post release for the `v0.7.3` release if it has some minor errors.
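For illustration, these tags decompose cleanly with the `packaging` library. The snippet below is only a sketch of the scheme, not part of the project's release tooling:
```python
from packaging.version import Version

# Illustrative only: decompose v[major].[minor].[micro][rcN][.postN] tags.
for tag in ("v0.7.3rc1", "v0.7.3", "v0.7.3.post1"):
    v = Version(tag.lstrip("v"))
    print(tag, v.release, v.pre, v.post)
# v0.7.3rc1    (0, 7, 3) ('rc', 1) None
# v0.7.3       (0, 7, 3) None      None
# v0.7.3.post1 (0, 7, 3) None      1
```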
## Release Compatibility Matrix
## Release compatibility matrix
Following is the Release Compatibility Matrix for vLLM Ascend Plugin:
The table below is the release compatibility matrix for vLLM Ascend Plugin.
| vLLM Ascend | vLLM | Python | Stable CANN | PyTorch/torch_npu | MindIE Turbo |
|-------------|--------------|------------------|-------------|--------------------|--------------|
@@ -40,7 +40,7 @@ Following is the Release Compatibility Matrix for vLLM Ascend Plugin:
## Release cadence
### release window
### Release window
| Date | Event |
|------------|-------------------------------------------|
@@ -66,70 +66,70 @@ Following is the Release Compatibility Matrix for vLLM Ascend Plugin:
## Branch policy
vLLM Ascend has main branch and dev branch.
vLLM Ascend includes two branches: main and dev.
- **main**: main branch corresponds to the vLLM main branch and latest 1 or 2 release version. It is continuously monitored for quality through Ascend CI.
- **main**: corresponds to the vLLM main branch and latest 1 or 2 release version. It is continuously monitored for quality through Ascend CI.
- **vX.Y.Z-dev**: development branch, created with part of new releases of vLLM. For example, `v0.7.3-dev` is the dev branch for vLLM `v0.7.3` version.
Usually, a commit should be ONLY first merged in the main branch, and then backported to the dev branch to reduce maintenance costs as much as possible.
Commits should typically be merged into the main branch first, and only then backported to the dev branch, to reduce maintenance costs as much as possible.
### Maintenance branch and EOL:
The branch status will be in one of the following states:
### Maintenance branch and EOL
The table below lists branch states.
| Branch | Time frame | Summary |
|-------------------|----------------------------------|----------------------------------------------------------------------|
| Maintained | Approximately 2-3 minor versions | All bugfixes are appropriate. Releases produced, CI commitment. |
| Unmaintained | Community interest driven | All bugfixes are appropriate. No Releases produced, No CI commitment |
| End of Life (EOL) | N/A | Branch no longer accepting changes |
| Branch | Time Frame | Summary |
| ----------------- | -------------------------------- | --------------------------------------------------------- |
| Maintained | Approximately 2-3 minor versions | Bugfixes received; releases produced; CI commitment |
| Unmaintained | Community-interest driven | Bugfixes received; no releases produced; no CI commitment |
| End of Life (EOL) | N/A | Branch no longer accepting changes |
### Branch state
### Branch states
Note that vLLM Ascend will only be released for a certain vLLM release version rather than all versions. Hence, You might see only part of versions have dev branches (such as only `0.7.1-dev` / `0.7.3-dev` but no `0.7.2-dev`), this is as expected.
Note that vLLM Ascend will only be released for a certain vLLM release version, not for every version. Hence, you may notice that some versions have corresponding dev branches (e.g., `0.7.1-dev` and `0.7.3-dev`), while others do not (e.g., `0.7.2-dev`).
Usually, each minor version of vLLM (such as 0.7) will correspond to a vLLM Ascend version branch and support its latest version (for example, we plan to support version 0.7.3) as following shown:
Usually, each minor version of vLLM (such as 0.7) corresponds to a vLLM Ascend version branch and supports its latest version (such as 0.7.3), as shown below:
| Branch | Status | Note |
|------------|--------------|--------------------------------------|
| main | Maintained | CI commitment for vLLM main branch and vLLM 0.9.2 branch |
| v0.9.1-dev | Maintained | CI commitment for vLLM 0.9.1 version |
| v0.7.3-dev | Maintained | CI commitment for vLLM 0.7.3 version |
| v0.7.1-dev | Unmaintained | Replaced by v0.7.3-dev |
| Branch | State | Note |
| ---------- | ------------ | -------------------------------------------------------- |
| main | Maintained | CI commitment for vLLM main branch and vLLM 0.9.2 branch |
| v0.9.1-dev | Maintained | CI commitment for vLLM 0.9.1 version |
| v0.7.3-dev | Maintained | CI commitment for vLLM 0.7.3 version |
| v0.7.1-dev | Unmaintained | Replaced by v0.7.3-dev |
### Feature branches
| Branch | Status | RFC link | Merge plan | Mentor |
| Branch | State | RFC Link | Scheduled Merge Time | Mentor |
|------------|--------------|---------------------------------------|------------|--------|
|rfc/long_seq_optimization|Maintained|https://github.com/vllm-project/vllm/issues/22693|930|wangxiyuan|
- Branch: The feature branch should be created with a prefix `rfc/` followed by the feature name, such as `rfc/feature-name`.
- Status: The status of the feature branch is `Maintained` until it is merged into the main branch or deleted.
- RFC link: The feature branch should be created with a corresponding RFC issue. The creation of a feature branch requires an RFC and approval from at least two maintainers.
- Merge plan: The final goal of a feature branch is to merge it into the main branch. If it exceeds 3 months, the mentor maintainer should evaluate whether to delete the branch.
- State: The state of the feature branch is `Maintained` until it is merged into the main branch or deleted.
- RFC Link: The feature branch should be created with a corresponding RFC issue. The creation of a feature branch requires an RFC and approval from at least two maintainers.
- Scheduled Merge Time: The final goal of a feature branch is to be merged into the main branch. If it remains unmerged for more than three months, the mentor maintainer should evaluate whether to delete the branch.
- Mentor: The mentor should be a vLLM Ascend maintainer who is responsible for the feature branch.
### Backward compatibility
For main branch, vLLM Ascend should works with vLLM main branch and latest 1 or 2 release version. So to ensure the backward compatibility, we will do the following:
- Both main branch and target vLLM release is tested by Ascend E2E CI. For example, currently, vLLM main branch and vLLM 0.8.4 are tested now.
- For code changes, we will make sure that the changes are compatible with the latest 1 or 2 vLLM release version as well. In this case, vLLM Ascend introduced a version check machinism inner the code. It'll check the version of installed vLLM package first to decide which code logic to use. If users hit the `InvalidVersion` error, it sometimes means that they have installed an dev/editable version of vLLM package. In this case, we provide the env variable `VLLM_VERSION` to let users specify the version of vLLM package to use.
- For documentation changes, we will make sure that the changes are compatible with the latest 1 or 2 vLLM release version as well. Note should be added if there are any breaking changes.
For the main branch, vLLM Ascend should work with the vLLM main branch and the latest 1 or 2 releases. To ensure backward compatibility, we do the following:
- Both main branch and target vLLM release, such as the vLLM main branch and vLLM 0.8.4, are tested by Ascend E2E CI.
- To make sure that code changes are compatible with the latest 1 or 2 vLLM releases, vLLM Ascend introduces a version check mechanism inside the code. It checks the version of the installed vLLM package first to decide which code logic to use. If users hit the `InvalidVersion` error, it may indicate that they have installed a dev or editable version of the vLLM package. In this case, we provide the environment variable `VLLM_VERSION` to let users specify the vLLM package version to use.
- Document changes should be compatible with the latest 1 or 2 vLLM releases. Notes should be added if there are any breaking changes.
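For illustration, below is a minimal sketch of this version-check idea, assuming the `packaging` library; it is not the actual vLLM Ascend implementation:
```python
import os

from packaging.version import InvalidVersion, Version

import vllm


def installed_vllm_version() -> Version:
    """Resolve the vLLM version, honoring the VLLM_VERSION override."""
    raw = os.getenv("VLLM_VERSION", vllm.__version__)
    try:
        return Version(raw)
    except InvalidVersion as e:
        # Dev/editable installs may carry unparsable version strings;
        # users then set VLLM_VERSION to a release version explicitly.
        raise RuntimeError(
            "Cannot parse the vLLM version; set the VLLM_VERSION "
            "environment variable to a release version such as 0.8.4"
        ) from e


# Pick a code path based on the installed release.
if installed_vllm_version() >= Version("0.8.4"):
    ...  # logic for the newer vLLM API
else:
    ...  # fallback for the previous release
```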
## Document Branch Policy
## Document branch policy
To reduce maintenance costs, **all branch documentation content should remain consistent, and version differences can be controlled via variables in [docs/source/conf.py](https://github.com/vllm-project/vllm-ascend/blob/main/docs/source/conf.py)**. While this is not a simple task, it is a principle we should strive to follow.
| Version | Purpose | Code Branch |
|-----|-----|---------|
| latest | Doc for the latest dev branch | vX.Y.Z-dev (Will be `main` after the first final release) |
| version | Doc for historical released versions | Git tags, like vX.Y.Z[rcN] |
| stablenot yet released | Doc for latest final release branch | Will be `vX.Y.Z-dev` after the first official release |
| stable (not yet released) | Doc for latest final release branch | Will be `vX.Y.Z-dev` after the first official release |
As shown above:
Notes:
- `latest` documentation: Matches the current maintenance branch `vX.Y.Z-dev` (Will be `main` after the first final release). Continuously updated to ensure usability for the latest release.
- `version` documentation: Corresponds to specific released versions (e.g., `v0.7.3`, `v0.7.3rc1`). No further updates after release.
- `latest` documentation: Matches the current maintenance branch `vX.Y.Z-dev` (will be `main` after the first final release). It is continuously updated to ensure usability for the latest release.
- `version` documentation: Corresponds to specific released versions (e.g., `v0.7.3`, `v0.7.3rc1`). There are no further updates after release.
- `stable` documentation (**not yet released**): Official release documentation. Updates are allowed in real-time after release, typically based on vX.Y.Z-dev. Once stable documentation is available, non-stable versions should display a header warning: `You are viewing the latest developer preview docs. Click here to view docs for the latest stable release.`.
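As an illustration of this principle, version-dependent values can be centralized as substitutions in `docs/source/conf.py` and referenced from the docs (for example, the `|cann_image_tag|` placeholders used elsewhere in this documentation). The snippet below is a hypothetical sketch; the names and values are illustrative, not necessarily the project's actual ones:
```python
# Hypothetical excerpt from docs/source/conf.py: keep the doc text
# identical across branches and vary only these substitution values.
myst_substitutions = {
    "vllm_ascend_version": "v0.9.2rc1",                   # illustrative value
    "cann_image_tag": "8.1.rc1-910b-ubuntu22.04-py3.10",  # illustrative value
}
```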
## Software Dependency Management
## Software dependency management
- `torch-npu`: Ascend Extension for PyTorch (torch-npu) releases a stable version to [PyPi](https://pypi.org/project/torch-npu)
every 3 months, a development version (aka the POC version) every month, and a nightly version every day.
The PyPi stable version **CAN** be used in vLLM Ascend final version, the monthly dev version **ONLY CANN** be used in
vLLM Ascend RC version for rapid iteration, the nightly version **CANNOT** be used in vLLM Ascend any version and branches.
The PyPi stable version **CAN** be used in a vLLM Ascend final version, the monthly dev version can **ONLY** be used in
a vLLM Ascend RC version for rapid iteration, and the nightly version **CANNOT** be used in any vLLM Ascend version or branch.


@@ -1,16 +1,16 @@
# Contributing
## Building and testing
It's recommended to set up a local development environment to build and test
## Building and Testing
It's recommended to set up a local development environment to build vllm-ascend and run tests
before you submit a PR.
### Setup development environment
### Set up a development environment
Theoretically, the vllm-ascend build is only supported on Linux because
`vllm-ascend` dependency `torch_npu` only supports Linux.
But you can still set up dev env on Linux/Windows/macOS for linting and basic
test as following commands:
But you can still set up a development environment on Linux/Windows/macOS for linting and running basic
tests.
#### Run lint locally
@@ -27,13 +27,13 @@ cd vllm-ascend
# Install lint requirement and enable pre-commit hook
pip install -r requirements-lint.txt
# Run lint (You need install pre-commits deps via proxy network at first time)
# Run lint (you need to install pre-commit deps via a proxy network the first time)
bash format.sh
```
#### Run CI locally
After complete "Run lint" setup, you can run CI locally:
After completing "Run lint" setup, you can run CI locally:
```{code-block} bash
:substitutions:
@@ -68,9 +68,9 @@ git commit -sm "your commit info"
🎉 Congratulations! You have completed the development environment setup.
### Test locally
### Testing locally
You can refer to [Testing](./testing.md) doc to help you setup testing environment and running tests locally.
You can refer to [Testing](./testing.md) to set up a testing environment and run tests locally.
## DCO and Signed-off-by
@@ -88,7 +88,7 @@ Only specific types of PRs will be reviewed. The PR title is prefixed appropriat
- `[Platform]` for new features or optimization in platform.
- `[Worker]` for new features or optimization in worker.
- `[Core]` for new features or optimization in the core vllm-ascend logic (such as platform, attention, communicators, model runner)
- `[Kernel]` changes affecting compute kernels and ops.
- `[Kernel]` for changes affecting compute kernels and ops.
- `[Bugfix]` for bug fixes.
- `[Doc]` for documentation fixes and improvements.
- `[Test]` for tests (such as unit tests).


@@ -1,10 +1,10 @@
# Testing
This secition explains how to write e2e tests and unit tests to verify the implementation of your feature.
This document explains how to write E2E tests and unit tests to verify the implementation of your feature.
## Setup test environment
## Set up a test environment
The fastest way to setup test environment is to use the main branch container image:
The fastest way to set up a test environment is to use the main branch's container image:
:::::{tab-set}
:sync-group: e2e
@@ -13,7 +13,7 @@ The fastest way to setup test environment is to use the main branch container im
:selected:
:sync: cpu
You can run the unit tests on CPU with the following steps:
You can run the unit tests on CPUs with the following steps:
```{code-block} bash
:substitutions:
@@ -22,7 +22,7 @@ cd ~/vllm-project/
# ls
# vllm vllm-ascend
# Use mirror to speedup download
# Use mirror to speed up download
# docker pull quay.nju.edu.cn/ascend/cann:|cann_image_tag|
export IMAGE=quay.io/ascend/cann:|cann_image_tag|
docker run --rm --name vllm-ascend-ut \
@@ -30,7 +30,7 @@ docker run --rm --name vllm-ascend-ut \
-v ~/.cache:/root/.cache \
-ti $IMAGE bash
# (Optional) Configure mirror to speedup download
# (Optional) Configure mirror to speed up download
sed -i 's|ports.ubuntu.com|mirrors.huaweicloud.com|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.huaweicloud.com/repository/pypi/simple/
@@ -136,13 +136,13 @@ pip install -r requirements-dev.txt
## Running tests
### Unit test
### Unit tests
There are several principles to follow when writing unit tests:
- The test file path should be consistent with source file and start with `test_` prefix, such as: `vllm_ascend/worker/worker_v1.py` --> `tests/ut/worker/test_worker_v1.py`
- The vLLM Ascend test are using unittest framework, see [here](https://docs.python.org/3/library/unittest.html#module-unittest) to understand how to write unit tests.
- All unit tests can be run on CPU, so you must mock the device-related function to host.
- The test file path should be consistent with the source file and start with the `test_` prefix, such as: `vllm_ascend/worker/worker_v1.py` --> `tests/ut/worker/test_worker_v1.py`
- The vLLM Ascend tests use the unittest framework. See [here](https://docs.python.org/3/library/unittest.html#module-unittest) to understand how to write unit tests.
- All unit tests can be run on CPUs, so you must mock device-related functions on the host.
- Example: [tests/ut/test_ascend_config.py](https://github.com/vllm-project/vllm-ascend/blob/main/tests/ut/test_ascend_config.py).
- You can run the unit tests using `pytest`:
@@ -161,7 +161,7 @@ TORCH_DEVICE_BACKEND_AUTOLOAD=0 pytest -sv tests/ut
::::
::::{tab-item} Single card
::::{tab-item} Single-card
:sync: single
```bash
@@ -175,7 +175,7 @@ pytest -sv tests/ut/test_ascend_config.py
::::
::::{tab-item} Multi cards test
::::{tab-item} Multi-card
:sync: multi
```bash
@@ -193,7 +193,7 @@ pytest -sv tests/ut/test_ascend_config.py
### E2E test
Although vllm-ascend CI provide [e2e test](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml) on Ascend CI, you can run it
Although vllm-ascend CI provides the [E2E test](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml) on Ascend CI, you can run it
locally.
:::::{tab-set}
@@ -202,10 +202,10 @@ locally.
::::{tab-item} Local (CPU)
:sync: cpu
You can't run e2e test on CPU.
You can't run the E2E test on CPUs.
::::
::::{tab-item} Single card
::::{tab-item} Single-card
:selected:
:sync: single
@@ -223,12 +223,12 @@ VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.
::::
::::{tab-item} Multi cards test
::::{tab-item} Multi-card
:sync: multi
```bash
cd /vllm-workspace/vllm-ascend/
# Run all single card the tests
# Run all the single card tests
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/
# Run a certain test script
@@ -242,7 +242,7 @@ VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/test_offline_inference.p
:::::
This will reproduce e2e test: [vllm_ascend_test.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml).
This will reproduce the E2E test. See [vllm_ascend_test.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml).
#### E2E test example:
@@ -251,8 +251,8 @@ This will reproduce e2e test: [vllm_ascend_test.yaml](https://github.com/vllm-pr
- Correctness test example: [`tests/e2e/singlecard/test_aclgraph.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_aclgraph.py)
- Reduced Layer model test example: [test_torchair_graph_mode.py - DeepSeek-V3-Pruning](https://github.com/vllm-project/vllm-ascend/blob/20767a043cccb3764214930d4695e53941de87ec/tests/e2e/multicard/test_torchair_graph_mode.py#L48)
The CI resource is limited, you might need to reduce layer number of the model, below is an example of how to generate a reduced layer model:
1. Fork the original model repo in modelscope, we need all the files in the repo except for weights.
The CI resource is limited, and you might need to reduce the number of layers of a model. Below is an example of how to generate a reduced layer model:
1. Fork the original model repo in modelscope. All the files in the repo except for weights are required.
2. Set `num_hidden_layers` to the expected number of layers, e.g., `{"num_hidden_layers": 2,}`
3. Copy the following python script as `generate_random_weight.py`. Set the relevant parameters `MODEL_LOCAL_PATH`, `DIST_DTYPE` and `DIST_MODEL_PATH` as needed:
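If the script is not at hand, here is a minimal, hypothetical sketch of what it can look like, assuming a transformers-loadable model; the actual script may differ:
```python
# Hypothetical sketch of generate_random_weight.py: instantiate the
# reduced-layer model from its config (random init, no real weights)
# and save it. Paths and dtype below are illustrative placeholders.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

MODEL_LOCAL_PATH = "./DeepSeek-V3-Pruning"        # forked, reduced-layer repo
DIST_DTYPE = torch.bfloat16
DIST_MODEL_PATH = "./DeepSeek-V3-Pruning-random"  # output directory

config = AutoConfig.from_pretrained(MODEL_LOCAL_PATH, trust_remote_code=True)
# from_config (unlike from_pretrained) randomly initializes all weights,
# matching the reduced num_hidden_layers set in step 2.
model = AutoModelForCausalLM.from_config(
    config, torch_dtype=DIST_DTYPE, trust_remote_code=True
)
model.save_pretrained(DIST_MODEL_PATH)
```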
@@ -275,11 +275,11 @@ This will reproduce e2e test: [vllm_ascend_test.yaml](https://github.com/vllm-pr
### Run doctest
vllm-ascend provides a `vllm-ascend/tests/e2e/run_doctests.sh` command to run all doctests in the doc files.
The doctest is a good way to make sure the docs are up to date and the examples are executable, you can run it locally as follows:
The doctest is a good way to make sure docs stay current and examples remain executable. You can run it locally as follows:
```bash
# Run doctest
/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh
```
This will reproduce the same environment as the CI: [vllm_ascend_doctest.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_doctest.yaml).
This will reproduce the same environment as the CI. See [vllm_ascend_doctest.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_doctest.yaml).


@@ -2,7 +2,7 @@
This document will guide you through model inference stress testing and accuracy testing using [EvalScope](https://github.com/modelscope/evalscope).
## 1. Online serving
## 1. Online server
You can run a docker container to start the vLLM server on a single NPU:
@@ -31,7 +31,7 @@ docker run --rm \
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```
If your service start successfully, you can see the info shown below:
If the vLLM server starts successfully, you can see the information shown below:
```
INFO: Started server process [6873]
@@ -39,7 +39,7 @@ INFO: Waiting for application startup.
INFO: Application startup complete.
```
Once your server is started, you can query the model with input prompts in new terminal:
Once your server is started, you can query the model with input prompts in a new terminal:
```
curl http://localhost:8000/v1/completions \
@@ -54,7 +54,7 @@ curl http://localhost:8000/v1/completions \
## 2. Install EvalScope using pip
You can install EvalScope by using:
You can install EvalScope as follows:
```bash
python3 -m venv .venv-evalscope
@@ -62,9 +62,9 @@ source .venv-evalscope/bin/activate
pip install gradio plotly evalscope
```
## 3. Run gsm8k accuracy test using EvalScope
## 3. Run GSM8K using EvalScope for accuracy testing
You can `evalscope eval` run gsm8k accuracy test:
You can use `evalscope eval` to run GSM8K for accuracy testing:
```
evalscope eval \
@@ -76,7 +76,7 @@ evalscope eval \
--limit 10
```
After 1-2 mins, the output is as shown below:
After 1 to 2 minutes, the output is shown below:
```shell
+---------------------+-----------+-----------------+----------+-------+---------+---------+
@@ -86,7 +86,7 @@ After 1-2 mins, the output is as shown below:
+---------------------+-----------+-----------------+----------+-------+---------+---------+
```
See more detail in: [EvalScope doc - Model API Service Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation).
See more detail in [EvalScope doc - Model API Service Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation).
## 4. Run model inference stress testing using EvalScope
@@ -98,7 +98,7 @@ pip install evalscope[perf] -U
### Basic usage
You can use `evalscope perf` run perf test:
You can use `evalscope perf` to run perf testing:
```
evalscope perf \
@@ -113,7 +113,7 @@ evalscope perf \
### Output results
After 1-2 mins, the output is as shown below:
After 1 to 2 minutes, the output is shown below:
```shell
Benchmarking summary:
@@ -172,4 +172,4 @@ Percentile results:
+------------+----------+---------+-------------+--------------+---------------+----------------------+
```
See more detail in: [EvalScope doc - Model Inference Stress Testing](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage).
See more detail in [EvalScope doc - Model Inference Stress Testing](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage).


@@ -1,8 +1,8 @@
# Using lm-eval
This document will guide you have a accuracy testing using [lm-eval][1].
This document guides you through accuracy testing using [lm-eval][1].
## Online Server
### 1. start the vLLM server
### 1. Start the vLLM server
You can run a docker container to start the vLLM server on a single NPU:
```{code-block} bash
@@ -31,7 +31,7 @@ docker run --rm \
vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 4096 &
```
Started the vLLM server successfully,if you see log as below:
The vLLM server has started successfully if you see logs as below:
```
INFO: Started server process [9446]
@@ -39,9 +39,9 @@ INFO: Waiting for application startup.
INFO: Application startup complete.
```
### 2. Run gsm8k accuracy test using lm-eval
### 2. Run GSM8K using lm-eval for accuracy testing
You can query result with input prompts:
You can query the result with input prompts:
```
curl http://localhost:8000/v1/completions \
@@ -98,7 +98,7 @@ The output format matches the following:
}
```
Install lm-eval in the container.
Install lm-eval in the container:
```bash
export HF_ENDPOINT="https://hf-mirror.com"
@@ -116,7 +116,7 @@ lm_eval \
--output_path ./
```
After 30 mins, the output is as shown below:
After 30 minutes, the output is as shown below:
```
The markdown format results are as below:
@@ -158,8 +158,8 @@ docker run --rm \
/bin/bash
```
### 2. Run gsm8k accuracy test using lm-eval
Install lm-eval in the container.
### 2. Run GSM8K using lm-eval for accuracy testing
Install lm-eval in the container:
```bash
export HF_ENDPOINT="https://hf-mirror.com"
@@ -177,7 +177,7 @@ lm_eval \
--batch_size auto
```
After 1-2 mins, the output is as shown below:
After 1 to 2 minutes, the output is shown below:
```
The markdown format results are as below:
@@ -189,9 +189,9 @@ Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
```
## Use offline Datasets
## Use Offline Datasets
Take gsm8k(single dataset) and mmlu(multi-subject dataset) as examples, and you can see more from [here][2].
Take GSM8K (single dataset) and MMLU (multi-subject dataset) as examples, and you can see more from [here][2].
```bash
# set HF_DATASETS_OFFLINE when using offline datasets
@@ -205,7 +205,7 @@ cd lm_eval/tasks/gsm8k
cd lm_eval/tasks/mmlu/default
```
set [gsm8k.yaml][3] as follows:
Set [gsm8k.yaml][3] as follows:
```yaml
tag:
@@ -230,7 +230,7 @@ training_split: train
fewshot_split: train
test_split: test
doc_to_text: 'Q: {{question}}
A(Please follow the summarize the result at the end with the format of "The answer is xxx", where xx is the result.):'
A(Please follow the summarized result at the end with the format of "The answer is xxx", where xx is the result.):'
doc_to_target: "{{answer}}" #" {{answer.split('### ')[-1].rstrip()}}"
metric_list:
- metric: exact_match
@@ -268,7 +268,7 @@ metadata:
version: 3.0
```
set [_default_template_yaml][4] as follows:
Set [_default_template_yaml][4] as follows:
```yaml
# set dataset_path according to the downloaded dataset


@@ -1,7 +1,7 @@
# Using OpenCompass
This document will guide you have a accuracy testing using [OpenCompass](https://github.com/open-compass/opencompass).
This document guides you through accuracy testing using [OpenCompass](https://github.com/open-compass/opencompass).
## 1. Online Serving
## 1. Online Server
You can run a docker container to start the vLLM server on a single NPU:
@@ -30,7 +30,7 @@ docker run --rm \
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```
If your service start successfully, you can see the info shown below:
The vLLM server has started successfully if you see information as below:
```
INFO: Started server process [6873]
@@ -38,7 +38,7 @@ INFO: Waiting for application startup.
INFO: Application startup complete.
```
Once your server is started, you can query the model with input prompts in new terminal:
Once your server is started, you can query the model with input prompts in a new terminal:
```
curl http://localhost:8000/v1/completions \
@@ -51,8 +51,8 @@ curl http://localhost:8000/v1/completions \
}'
```
## 2. Run ceval accuracy test using OpenCompass
Install OpenCompass and configure the environment variables in the container.
## 2. Run C-Eval using OpenCompass for accuracy testing
Install OpenCompass and configure the environment variables in the container:
```bash
# Pin Python 3.10 due to:
@@ -64,7 +64,7 @@ export DATASET_SOURCE=ModelScope
git clone https://github.com/open-compass/opencompass.git
```
Add `opencompass/configs/eval_vllm_ascend_demo.py` with the following content:
Add the following content to `opencompass/configs/eval_vllm_ascend_demo.py`:
```python
from mmengine.config import read_base
@@ -110,7 +110,7 @@ Run the following command:
python3 run.py opencompass/configs/eval_vllm_ascend_demo.py --debug
```
After 1-2 mins, the output is as shown below:
After 1 to 2 minutes, the output is shown below:
```
The markdown format results are as below:


@@ -1,11 +1,11 @@
# Prepare inputs for model forwarding
## Purpose
What information should we have in order to perform model forward pass?
Information required to perform a model forward pass:
- the inputs
- the corresponding attention metadata of the inputs
The following diagram shows what we should prepare for the model inference.
The following diagram shows what we should prepare for model inference.
```
+---------------+
@@ -17,47 +17,47 @@ attn_meta --> | |
Therefore, as long as we have these two pieces of information mentioned above, we can perform the model's forward propagation.
This article will explain **how we obtain the inputs and their corresponding attention metadata** which are on the left part of above diagram.
This document will explain **how we obtain the inputs and their corresponding attention metadata**.
## Overview
### 1. Obtain inputs
The workflow of obtain inputs:
1. Get `token positions`: The relative position of each token within its request sequence.
The workflow of obtaining inputs:
1. Get `token positions`: relative position of each token within its request sequence.
2. Get `token indices`: the index of each scheduled token in the token table.
2. Get `token indices`: index of each scheduled token in the token table.
3. Get `Token IDs`: Using token indices to retrieve the Token IDs from **token id table**.
3. Get `Token IDs`: using token indices to retrieve the Token IDs from **token id table**.
At last, these `Token IDs` required to feed into the model, and also, `positions` should be send into model to create `Rope` (Rotary positional embedding). Both of them are the inputs of a model.
Finally, these `Token IDs` are fed into the model, and `positions` are also sent into the model to create `RoPE` (rotary positional embedding). Both of them are inputs of the model.
**Note**: The `Token IDs` are the inputs of a model, so we also call them `Input IDs`.
**Note**: because the `Token IDs` is the inputs of the model, so we will call it `Inputs IDs`
### 2. Build inputs attention metadata
The model requires these attention metadata during the forward pass:
- `query start location`: represents the start and end location of each request corresponding to the scheduled tokens.
- `sequence length`: the length of each request including both computed tokens and newly scheduled tokens.
- `number of computed tokens`: the number of computed tokens for each request.
- `number of requests`: the number of requests in this batch.
- `number of tokens`: Total number of scheduled tokens in this batch.
A model requires these attention metadata during the forward pass:
- `query start location`: start and end location of each request corresponding to the scheduled tokens.
- `sequence length`: length of each request including both computed tokens and newly scheduled tokens.
- `number of computed tokens`: number of computed tokens for each request.
- `number of requests`: number of requests in this batch.
- `number of tokens`: total number of scheduled tokens in this batch.
- **`block table`**: translates the logical address (within its sequence) of each block to its global physical address in the device's memory.
- `max query len`: the longest scheduled tokens length in this requests batch.
- `slot mapping`: the indices of each token that input token will be stored into.
- `attention mask`: The mask matrix applied to attention scores before softmax to control which tokens can attend to each other. (usually a causal attention)
- `max query len`: the longest scheduled tokens length in this request batch.
- `slot mapping`: index of the slot that each input token will be stored into.
- `attention mask`: mask matrix applied to attention scores before softmax to control which tokens can attend to each other (usually a causal attention).
## Before you start
There are mainly three types of variables.
- token level: represents one attribute corresponding to each scheduled token, so the length of this variable is the number of scheduled tokens
- request level: represents one attribute of each scheduled request, which length usually is the number of scheduled requests. (`query start location` is a special case, which has one more element)
- request level: represents one attribute of each scheduled request, whose length usually is the number of scheduled requests. (`query start location` is a special case, which has one more element)
- system level:
1. **Token IDs table**: store the token ids (i.e. the inputs of the model) of each request. The shape of this table is `(max num request, max model len)`. Here, `max num request` is maximum count of concurrent requests allowed in a forward batch and `max model len` is the max token count can be handled at one request sequence in this model.
1. **Token IDs table**: stores the token IDs (i.e. the inputs of a model) of each request. The shape of this table is `(max num request, max model len)`. Here, `max num request` is the maximum count of concurrent requests allowed in a forward batch and `max model len` is the maximum token count that can be handled at one request sequence in this model.
2. **Block table**: translates the logical address (within its sequence) of each block to its global physical address in the device's memory. The shape of this table is `(max num request, max model len / block size)`
**Note**: How were these two tables formed?
- Both of them are come from the `_update_states` method before **prepare inputs**. You can take a look if you need more inspiration.
**Note**: Both of these tables come from the `_update_states` method before **preparing inputs**. You can take a look if you need more inspiration.
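For concreteness, here is a toy sketch of the two tables, using the example numbers from the walkthrough below (illustrative shapes only):
```python
import numpy as np

max_num_request = 3   # max concurrent requests in a forward batch
max_model_len = 12    # max token count per request sequence
block_size = 2

# Token IDs table: one row of token IDs per request slot (-1 = empty).
token_id_table = np.full((max_num_request, max_model_len), -1, dtype=np.int32)
# Block table: logical block index -> physical device block number.
block_table = np.zeros(
    (max_num_request, max_model_len // block_size), dtype=np.int32
)
```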
### Tips
What is `Token ID`?
For simple, a `token ID` is an **integer** (usually `int32`), which represents a token.
example of `Token ID`:
Simply put, a `token ID` is an **integer** (usually `int32`), which represents a token.
Example of `Token ID`:
```
| Token ID | Token |
@@ -77,30 +77,32 @@ example of `Token ID`:
```
## Go through details
Make a simple example, assumption:
- max tokens can be scheduled at once: 10.
Assumptions:
- maximum number of tokens that can be scheduled at once: 10
- `block size`: 2
- Totally schedule 3 requests. Their prompt lengths are 3, 2, and 8 respectively.
- `max model length`: 12 (the max token count can be handled at one request sequence in this model).
- `max model length`: 12 (the maximum token count that can be handled in one request sequence in a model).
These assumption are configured in the beginning when starting the vllm. They are not fixed, so you can manually set them.
These assumptions are configured in the beginning when starting vLLM. They are not fixed, so you can manually set them.
### Step 1: All requests in the prefill phase
#### Obtain inputs
Due to the max schedule token count limitation is 10, The scheduled token of each request: `{'0': 3, '1': 2, '2': 5}`. Note that the `request_2` is in chunked prefill, still has 3 prompt tokens not be scheduled.
As the maximum number of tokens that can be scheduled is 10, the scheduled tokens of each request can be represented as `{'0': 3, '1': 2, '2': 5}`. Note that `request_2` uses chunked prefill, leaving 3 prompt tokens unscheduled.
##### 1. Get token positions:
First, find out each token belong to which request: the 0~2 tokens belong to request_0, 3~4 tokens belong to request_1 and 5~9 tokens belong to request_2. So, we can use `request indices` to point out each token belongs to which request. `request indices`: `[0, 0, 0, 1, 1, 2, 2, 2, 2, 2]`
First, determine which request each token belongs to: tokens 0~2 are assigned to **request_0**, tokens 3~4 to **request_1**, and tokens 5~9 to **request_2**. To represent this mapping, we use `request indices`: `[0, 0, 0, 1, 1, 2, 2, 2, 2, 2]`.
For each request, use **the number of tokens already computed** + **the relative position in current scheduled tokens**: `request_0: [0 + 0, 0 + 1, 0 + 2]`, `request_1: [0 + 0, 0 + 1]`, `request_2: [0 + 0, 0 + 1,..., 0 + 4]` and then concat them together: `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`. Note: there is more efficient way (using `request indices`) to create positions in actual code.
For each request, use **the number of computed tokens** + **the relative position of current scheduled tokens** (`request_0: [0 + 0, 0 + 1, 0 + 2]`, `request_1: [0 + 0, 0 + 1]`, `request_2: [0 + 0, 0 + 1,..., 0 + 4]`) and then concatenate them together (`[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`).
Finally, `token opsitions` is `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`. This variable is **token level**
Note: there is a more efficient way (using `request indices`) to create positions in the actual code.
Finally, `token positions` can be obtained as `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`. This variable is **token level**.
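This step can be sketched with numpy as follows (illustrative only, not the actual implementation):
```python
import numpy as np

num_scheduled = np.array([3, 2, 5])  # scheduled tokens per request
num_computed = np.array([0, 0, 0])   # computed tokens per request (prefill)

request_indices = np.repeat(np.arange(3), num_scheduled)
# -> [0 0 0 1 1 2 2 2 2 2]

# Relative position of each token within its request's scheduled chunk.
chunk_starts = np.cumsum(num_scheduled) - num_scheduled
rel_positions = np.arange(num_scheduled.sum()) - np.repeat(chunk_starts, num_scheduled)

positions = num_computed[request_indices] + rel_positions
print(positions)  # [0 1 2 0 1 0 1 2 3 4]
```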
##### 2. Get token indices:
Current **Token IDs table**, which shape is `(max num request, max model len)`.
The shape of the current **Token IDs table** is `(max num request, max model len)`.
Why these `T_3_5`, `T_3_6`, `T_3_7` are in this table even them are not scheduled this time?
- We will fill all Token IDs in one request sequence to this table at once, but we only retrieve the tokens we scheduled this time. Then we will retrieve the remain Token IDs next time.
Why are `T_3_5`, `T_3_6`, and `T_3_7` in this table even though they are not scheduled this time?
- We fill all Token IDs of one request sequence into this table at once, but we only retrieve the tokens we scheduled this time. Then we retrieve the remaining Token IDs next time.
```
| T_0_0 | T_0_1 | T_0_2 | ? | ? | ? | ? | ? | ? | ? | ? | ? |
@@ -112,26 +114,24 @@ Why these `T_3_5`, `T_3_6`, `T_3_7` are in this table even them are not schedule
......
```
Note that the `T_x_x` is an `int32`
Note that `T_x_x` is an `int32`.
Let's say `M = max model len`, Then we can use `token positions` together with the `request indices` of each token to construct `token indices`.
Let's say `M = max model len`. Then we can use `token positions` together with `request indices` of each token to construct `token indices`.
So `token indices` = `[0 + 0 * M, 1 + 0 * M, 2 + 0 * M, 0 + 1 * M, 1 + 1 * M, 0 + 2 * M, 1 + 2 * M, 2 + 2 * M, 3 + 2 * M, 4 + 2 * M]` = `[0, 1, 2, 12, 13, 24, 25, 26, 27, 28]`
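A self-contained numpy sketch of this computation (illustrative):
```python
import numpy as np

M = 12  # max model len
request_indices = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
positions = np.array([0, 1, 2, 0, 1, 0, 1, 2, 3, 4])

token_indices = positions + request_indices * M
print(token_indices)  # [ 0  1  2 12 13 24 25 26 27 28]
```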
##### 3. Retrieve the Token IDs
As mentioned before, we will refer to these `Token IDs` as `Input IDs`.
We use the `token indices` to select out the corresponding `Input IDs` from the token table, The Pseudocode like:
We use `token indices` to select out the corresponding `Input IDs` from the token table. The pseudocode is as follows:
```
input_ids = token_table[token_indices]
```
As mentioned before, we will refer these Token IDs as Inputs IDs:
As mentioned before, we refer to these `Token IDs` as `Input IDs`.
- `Input IDs` = `[T_0_0, T_0_1, T_0_2, T_1_0, T_1_1, T_2_0, T_2_1, T_3_2, T_3_3, T_3_4]`
#### Build inputs attention metadata
Current **Block Table**, we use the first block (i.e. block_0) to mark the unused block. The shape of the block is `(max num request, max model len / block size)`, the `max model len / block size = 12 / 2 = 6`
In the current **Block Table**, we use the first block (i.e. block_0) to mark the unused block. The shape of the block table is `(max num request, max model len / block size)`, where `max model len / block size = 12 / 2 = 6`.
```
| 1 | 2 | 0 | 0 | 0 | 0 |
@@ -143,42 +143,53 @@ Current **Block Table**, we use the first block (i.e. block_0) to mark the unuse
......
```
The kv cache block in the device memory is like:
The KV cache block in the device memory is like:
```
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ......
```
Let's say `K = max model len / block size = 6`, we can get token `device block number` from
Let's say `K = max model len / block size = 6`; then we can get each token's `device block number` from the block table.
The workflow for computing the `slot mapping`:
1. get `block table indices` using `K`, `positions` and `request indices`. Purpose: For each token, it could be used to select the `device block number` from `block table`.
2. get `device block number` using `block table indices`. Purpose: `device block number` indicates each token belong to which device block.
3. get `block offsets` using `positions` and `block size`. Purpose: `block offsets` indicates the offsets of each token within a block.
4. construct `slot mapping` using `device block number` and `block offsets`. Purpose: we can use `slot mapping` to store the Token IDs into token slots.
1. Get `block table indices` using `K`, `positions` and `request indices`.
Purpose: For each token, it could be used to select `device block number` from `block table`.
2. Get `device block number` using `block table indices`.
Purpose: `device block number` indicates which device block each token belongs to.
3. Get `block offsets` using `positions` and `block size`.
Purpose: `block offsets` indicates the offsets of each token within a block.
4. Construct `slot mapping` using `device block number` and `block offsets`.
Purpose: we can use `slot mapping` to store Token IDs into token slots.
Details:
1. Using a simple formula to calculate the `block table indices`: `request indices * K + positions / block size`. So it equal to `[0 * 6 + 0 / 2, 0 * 6 + 1 / 2, 0 * 6 + 2 / 2, 1 * 6 + 0 / 2, 1 * 6 + 1 / 2, 2 * 6 + 0 / 2, 2 * 6 + 1 / 2, 2 * 6 + 2 / 2, 2 * 6 + 3 / 2, 2 * 6 + 4 / 2] = [0, 0, 1, 6, 6, 12, 12, 13, 13, 14]`. This could be used to select the `device block number` from `block table`. **token level**
2. Using the `block table indices` to select out the `device block number` for each scheduled token. The Pseudocode like: `block_numbers = block_table[block_table_indices]`. So `device block number = [1, 1, 2, 3, 3, 4, 4, 5, 5, 6]`**token level**
3. `block offsets` could be computed by `block offsets = positions % block size = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]`. **token level**
1. (**Token level**) Use a simple formula to calculate `block table indices`: `request indices * K + positions / block size` (integer division). This equals `[0 * 6 + 0 / 2, 0 * 6 + 1 / 2, 0 * 6 + 2 / 2, 1 * 6 + 0 / 2, 1 * 6 + 1 / 2, 2 * 6 + 0 / 2, 2 * 6 + 1 / 2, 2 * 6 + 2 / 2, 2 * 6 + 3 / 2, 2 * 6 + 4 / 2] = [0, 0, 1, 6, 6, 12, 12, 13, 13, 14]`. These indices are used to select each token's `device block number` from the `block table`.
2. (**Token level**) Use `block table indices` to select the `device block number` for each scheduled token. The pseudocode is `block_numbers = block_table[block_table_indices]`. So `device block number = [1, 1, 2, 3, 3, 4, 4, 5, 5, 6]`.
3. (**Token level**) `block offsets` can be computed by `block offsets = positions % block size = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]`.
4. Finally, use `block offsets` and `device block number` to create `slot mapping`: `device block number * block size + block offsets = [2, 3, 4, 6, 7, 8, 9, 10, 11, 12]`, as in the sketch below.
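Putting the four steps together, here is a minimal NumPy sketch using the step-1 values (array names are ours; only the first three block-table rows are shown, with block 0 marking unused entries):
```python
import numpy as np

K = 6           # max model len / block size
BLOCK_SIZE = 2

# Step-1 inputs from the walkthrough above.
request_indices = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
positions       = np.array([0, 1, 2, 0, 1, 0, 1, 2, 3, 4])
# Flattened block table; one row per request, block 0 means "unused".
block_table = np.array([
    [1, 2, 0, 0, 0, 0],
    [3, 0, 0, 0, 0, 0],
    [4, 5, 6, 0, 0, 0],
]).reshape(-1)

block_table_indices = request_indices * K + positions // BLOCK_SIZE
block_numbers = block_table[block_table_indices]    # device block number per token
block_offsets = positions % BLOCK_SIZE              # offset of each token in its block
slot_mapping = block_numbers * BLOCK_SIZE + block_offsets

print(slot_mapping.tolist())  # [2, 3, 4, 6, 7, 8, 9, 10, 11, 12]
```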
First, we know the scheduled token count is `[3, 2, 5]` **request level**
(**Request level**) As we know the scheduled token count is `[3, 2, 5]`:
- So, we can use prefix sum to calculate the `query start location`: `[0, 3, 5, 10]`. **request level**
- Because in step_1 all the tokens in prefill, computed tokens count is 0, then `sequence length` = `[3, 2, 5]`. **request level**
- As mentioned above, `number of computed tokens` are all 0: `[0, 0, 0]`. **request level**
- `number of requests`: `3`.
- `number of tokens`: `[3, 2, 5]`. **request level**
- `max query len`: `5`.
- `slot mapping`: `[2, 3, 4, 6, 7, 8, 9, 10, 11, 12]`. **token level**
- `attention mask`: For all request do prefill, we simply create only one mask matrix for reuse across different requests. The shape of this mask matrix is `5 * 5`:
- (**Request level**) Use prefix sum to calculate `query start location`: `[0, 3, 5, 10]`.
- (**Request level**) All tokens in step 1 are in the prefill stage and the computed token count is 0, so `sequence length` = `[3, 2, 5]`.
- (**Request level**) As mentioned above, `number of computed tokens` are all 0s: `[0, 0, 0]`.
- `number of requests`: `3`
- (**Request level**) `number of tokens`: `[3, 2, 5]`
- `max query len`: `5`
- (**Token level**) `slot mapping`: `[2, 3, 4, 6, 7, 8, 9, 10, 11, 12]`
- `attention mask`: Since all requests are prefilling, we create a single mask matrix and reuse it across requests. The shape of this mask matrix is `5 * 5`, as sketched below.
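A minimal sketch of such a shared causal mask, assuming a boolean lower-triangular convention (vLLM's actual mask dtype and values may differ):
```python
import numpy as np

max_query_len = 5
# Row i lets token i attend to positions 0..i; all prefill requests reuse this matrix.
attn_mask = np.tril(np.ones((max_query_len, max_query_len), dtype=bool))
```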
### Step 2: Chunked prefill
In Step 2, we will no longer provide explanations or perform calculations; instead, we will directly present the final result.
In Step 2, we no longer provide explanations or perform calculations; instead, we directly present the final result.
#### Obtain inputs
The scheduled token of each request: `{'0': 1, '1': 1, '2': 3}`.
Scheduled tokens per request: `{'0': 1, '1': 1, '2': 3}`
1. `request indices`: `[0, 1, 2, 2, 2]`
2. `token positions`: `[3, 2, 5, 6, 7]`
@@ -195,13 +206,15 @@ Current **Token IDs table**:
......
```
**Note**: The **T_0_3**, **T_1_2** are new Token IDs of request_0, request_1 respectively. them are sampled from the output of the model.
**Note**: **T_0_3**, **T_1_2** are new Token IDs of **request_0** and **request_1** respectively. They are sampled from the output of the model.
3. `token indices`: `[3, 14, 29, 30, 31]`
4. `Input IDs`: `[T_0_3, T_1_2, T_2_5, T_2_6, T_2_7]`
#### Build inputs attention metadata
Current **Block Table**. **Note**: We allocate the `7` and `8` block to `request_1` and `request_2` respectively. Because they need more space in device to store kv cache after generate new tokens or chunked prefill new tokens.
We allocate blocks `7` and `8` to `request_1` and `request_2` respectively, as they need more device memory to store the KV cache after generating new tokens or chunk-prefilling new tokens.
Current **Block Table**:
```
| 1 | 2 | 0 | 0 | 0 | 0 |
@@ -213,27 +226,35 @@ Current **Block Table**. **Note**: We allocate the `7` and `8` block to `request
......
```
The kv cache block in the device memory is still like:
KV cache block in the device memory:
```
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ......
```
1. `block table indices`: `[1, 7, 14, 15, 15]`. **token level**
2. `device block number`: `[2, 7, 6, 8, 8]`. **token level**
3. `block offsets`: `[1, 0, 1, 0, 1]` **token level**
4. `slot mapping`: `[5, 14, 13, 16, 17]` **token level**
1. (**Token level**) `block table indices`: `[1, 7, 14, 15, 15]`
2. (**Token level**) `device block number`: `[2, 7, 6, 8, 8]`
3. (**Token level**) `block offsets`: `[1, 0, 1, 0, 1]`
4. (**Token level**) `slot mapping`: `[5, 14, 13, 16, 17]`
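These values can be reproduced by rerunning the step-1 sketch with the step-2 inputs; note that the block-table rows now include the newly allocated blocks `7` and `8`:
```python
import numpy as np

K, BLOCK_SIZE = 6, 2
request_indices = np.array([0, 1, 2, 2, 2])
positions       = np.array([3, 2, 5, 6, 7])
block_table = np.array([
    [1, 2, 0, 0, 0, 0],
    [3, 7, 0, 0, 0, 0],   # block 7 newly allocated to request_1
    [4, 5, 6, 8, 0, 0],   # block 8 newly allocated to request_2
]).reshape(-1)

block_table_indices = request_indices * K + positions // BLOCK_SIZE  # [1, 7, 14, 15, 15]
block_numbers = block_table[block_table_indices]                     # [2, 7, 6, 8, 8]
slot_mapping = block_numbers * BLOCK_SIZE + positions % BLOCK_SIZE   # [5, 14, 13, 16, 17]
```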
scheduled token count is `[1, 1, 3]`
Scheduled token count: `[1, 1, 3]`
- `query start location`: `[0, 1, 2, 5]`
- `sequence length`: `[4, 3, 8]`
- `number of computed tokens`: `[3, 2, 5]`
- `number of requests`: `3`
- `max query len`: `3`
- `slot mapping`: `[5, 14, 13, 16, 17]`
- `attention mask`: `5 * 8` Each token will have a `1 * 8` vector, and there are 5 scheduled tokens.
- `attention mask`: `5 * 8`
Each token has a `1 * 8` vector, and there are 5 scheduled tokens.
## At last
If you under stand the step_1 and step_2, you will know the all following steps.
If you understand step_1 and step_2, you will understand all the following steps.
Hope this article can help you get better understand to how vllm prepare inputs for model forwarding. If you have any good idea, welcome to contribute to us.
Hope this document helps you better understand how vLLM prepares inputs for model forwarding. If you have any good ideas, contributions are welcome.

View File

@@ -1,16 +1,16 @@
# Patch in vLLM Ascend
vLLM Ascend is a platform plugin for vLLM. Due to the release cycle of vLLM and vLLM Ascend is different, and the hardware limitation in some case, we need to patch some code in vLLM to make it compatible with vLLM Ascend.
vLLM Ascend is a platform plugin for vLLM. Because vLLM and vLLM Ascend have different release cycles, and because of hardware limitations in some cases, we need to patch some code in vLLM to make it compatible with vLLM Ascend.
In vLLM Ascend code, we provide a patch module `vllm_ascend/patch` to address the change for vLLM.
In vLLM Ascend code, we provide a patch module `vllm_ascend/patch` to adapt to changes in vLLM.
## Principle
We should keep in mind that Patch is not the best way to make vLLM Ascend compatible. It's just a temporary solution. The best way is to contribute the change to vLLM to make it compatible with vLLM Ascend originally. In vLLM Ascend, we have the basic principle for Patch strategy:
We should keep in mind that Patch is not the best way to make vLLM Ascend compatible. It's just a temporary solution. The best way is to contribute the change to vLLM upstream so that vLLM is compatible with vLLM Ascend natively. In vLLM Ascend, we have the following basic principles for the patch strategy:
1. Less is more. Please do not patch unless it's the only way currently.
2. Once a patch is added, it's required to describe the future plan for removing the patch.
3. Anytime, clean the patch code is welcome.
3. Cleaning up patch code is welcome at any time.
## How it works
@@ -27,18 +27,18 @@ vllm_ascend
```
- **platform**: The patch code in this directory is for patching the code in vLLM main process. It's called by `vllm_ascend/platform::NPUPlatform::pre_register_and_update` very early when vLLM is initialized.
- For online mode, vLLM process calls the platform patch here `vllm/vllm/engine/arg_utils.py::AsyncEngineArgs.add_cli_args` when parsing the cli args.
- For offline mode, vLLM process calls the platform patch here `vllm/vllm/engine/arg_utils.py::EngineArgs.create_engine_config` when parsing the input parameters.
- For online mode, vLLM process calls the platform patch in `vllm/vllm/engine/arg_utils.py::AsyncEngineArgs.add_cli_args` when parsing the cli args.
- For offline mode, vLLM process calls the platform patch in `vllm/vllm/engine/arg_utils.py::EngineArgs.create_engine_config` when parsing the input parameters.
- **worker**: The patch code in this directory is for patching the code in vLLM worker process. It's called by `vllm_ascend/worker/worker_v1::NPUWorker::__init__` when the vLLM worker process is initialized.
- For both online and offline mode, vLLM engine core process calls the worker patch here `vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker` when initializing the worker process.
- For both online and offline mode, vLLM engine core process calls the worker patch in `vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker` when initializing the worker process.
## How to write a patch
Before writing a patch, following the principles above, we should patch as little code as possible. If necessary, we can patch the code in either the **platform** or the **worker** folder. Here is an example of patching the `distributed` module in vLLM.
1. Decide which version of vLLM we should patch. For example, after analysis, here we want to patch both 0.10.0 and main of vLLM.
1. Decide which version of vLLM we should patch. For example, after analysis, here we want to patch both `0.10.0` and `main` of vLLM.
2. Decide which process we should patch. For example, here `distributed` belongs to the vLLM main process, so we should patch `platform`.
3. Create the patch file in the right folder. The file should be named as `patch_{module_name}.py`. The example here is `vllm_ascend/patch/platform/patch_distributed.py`.
3. Create the patch file in the right folder. The file should be named as `patch_{module_name}.py`. The example here is `vllm_ascend/patch/platform/patch_common/patch_distributed.py`.
4. Write your patch code in the new file. Here is an example:
```python
@@ -51,7 +51,7 @@ Before writing a patch, following the principle above, we should patch the least
vllm.distributed.parallel_state.destroy_model_parallel = patch_destroy_model_parallel
```
5. Import the patch file in `__init__.py`. In this example, add `import vllm_ascend.patch.platform.patch_distributed` into `vllm_ascend/patch/platform/__init__.py`.
5. Import the patch file in `__init__.py`. In this example, add `import vllm_ascend.patch.platform.patch_common.patch_distributed` into `vllm_ascend/patch/platform/patch_common/__init__.py`.
6. Add the description of the patch in `vllm_ascend/patch/__init__.py`. The description format is as follows:
```
@@ -71,5 +71,5 @@ Before writing a patch, following the principle above, we should patch the least
7. Add unit tests and E2E tests. Any newly added code in vLLM Ascend should include unit tests and E2E tests as well. You can find more details in the [test guide](../contribution/testing.md)
## Limitation
1. In V1 Engine, vLLM starts three kinds of process: Main process, EngineCore process and Worker process. Now vLLM Ascend only support patch the code in Main process and Worker process by default. If you want to patch the code runs in EngineCore process, you should patch EngineCore process entirely during setup, the entry code is here `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely.
2. If you are running an edited vLLM code, the version of the vLLM may be changed automatically. For example, if you runs an edited vLLM based on v0.9.n, the version of vLLM may be change to v0.9.nxxx, in this case, the patch for v0.9.n in vLLM Ascend would not work as expect, because that vLLM Ascend can't distinguish the version of vLLM you're using. In this case, you can set the environment variable `VLLM_VERSION` to specify the version of vLLM you're using, then the patch for v0.10.0 should work.
1. In V1 Engine, vLLM starts three kinds of processes: the main process, the EngineCore process, and the Worker process. Currently, vLLM Ascend can only patch code in the main process and the Worker process by default. If you want to patch code running in the EngineCore process, you should patch the EngineCore process entirely during setup; the entry code is in `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely.
2. If you are running edited vLLM code, the version of vLLM may be changed automatically. For example, if you run edited vLLM based on v0.9.n, the version of vLLM may be changed to v0.9.nxxx. In this case, the patch for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend can't distinguish the version of vLLM you're using. You can set the environment variable `VLLM_VERSION` to specify the version of vLLM you're using, and then the patch for v0.9.n should work.

View File

@@ -5,22 +5,22 @@ This guide demonstrates how to integrate a novel or customized model into vllm-a
## Step 1: Implementing Models with `torch` and `torch_npu`
This section provides instructions for implementing new models compatible with vllm and vllm-ascend.
This section provides instructions for implementing new models compatible with vLLM and vllm-ascend.
**Before starting:**
- Verify whether your model already exists in vllm's [models](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) directory.
- Verify whether your model already exists in vLLM's [models](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) directory.
- Use existing models' implementation as templates to accelerate your development.
### Method 1: Implementing New Models from Scratch
Follow vllm's [OPT model adaptation](https://docs.vllm.ai/en/stable/contributing/model/basic.html) example for guidance.
Follow vLLM's [OPT model adaptation](https://docs.vllm.ai/en/stable/contributing/model/basic.html) example for guidance.
**Key implementation requirements:**
1. Place model files in `vllm_ascend/models/` directory.
2. Standard module structure for decoder-only LLMs (please checkout vllm's implementations for other kinds of model):
2. Standard module structure for decoder-only LLMs (please checkout vLLM's implementations for other kinds of models):
- `*ModelForCausalLM` (top-level wrapper)
- `*Model` (main architecture)
@@ -31,7 +31,7 @@ Follow vllm's [OPT model adaptation](https://docs.vllm.ai/en/stable/contributing
`*` denotes your model's unique identifier.
:::
3. Critical Implementation Details:
3. Critical implementation details:
All modules must include a `prefix` argument in `__init__()`.
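As an illustration (the class and argument names below are hypothetical), a module signature satisfying this requirement might look like:
```python
import torch.nn as nn

class CustomAttention(nn.Module):
    def __init__(self, config, prefix: str = ""):
        super().__init__()
        # `prefix` identifies this module in the weight hierarchy
        # (e.g. "model.layers.0.self_attn") and is used during weight loading.
        self.prefix = prefix
```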
@@ -42,13 +42,13 @@ All modules must include a `prefix` argument in `__init__()`.
| `*ModelForCausalLM` | `get_input_embeddings`, `compute_logits`, `load_weights` |
| `*Model` | `get_input_embeddings`, `load_weights` |
4. Attention Backend Integration:
4. Attention backend integration:
Importing attention via `from vllm.attention import Attention` can automatically leverage the attention backend routing of vllm-ascend (see: `get_attn_backend_cls()` in `vllm_ascend/platform.py`).
5. Tensor Parallelism:
5. Tensor parallelism:
Use vllm's parallel layers (`ColumnParallelLinear`, `VocabParallelEmbedding`, etc.) to implement models supporting tensor parallelism. Note that Ascend-specific customizations are implemented in `vllm_ascend/ops/` directory (RMSNorm, VocabParallelEmbedding, etc.).
Use vLLM's parallel layers (`ColumnParallelLinear`, `VocabParallelEmbedding`, etc.) to implement models supporting tensor parallelism. Note that Ascend-specific customizations are implemented in the `vllm_ascend/ops/` directory (RMSNorm, VocabParallelEmbedding, etc.).
**Reference Implementation Template** (assumed path: `vllm_ascend/models/custom_model.py`):
@@ -133,7 +133,7 @@ class CustomModelForCausalLM(nn.Module):
### Method 2: Customizing Existing vLLM Models
For most use cases, extending existing implementations is preferable. We demonstrate an example to inherit from base classes and implement a custom deepseek model below (assumed path: `vllm_ascend/models/deepseek_v2.py`).
For most use cases, extending existing implementations is preferable. We demonstrate an example to inherit from base classes and implement a custom DeepSeek model below (assumed path: `vllm_ascend/models/deepseek_v2.py`).
```python
from typing import List, Optional
@@ -171,12 +171,12 @@ class CustomDeepseekV2ForCausalLM(DeepseekV2ForCausalLM):
```
:::{note}
For a complete implementation reference, see: `vllm_ascend/models/deepseek_v2.py`.
For a complete implementation reference, see `vllm_ascend/models/deepseek_v2.py`.
:::
## Step 2: Registering Custom Models using ModelRegistry Plugins in vLLM
vllm provides a plugin mechanism for registering externally implemented models without modifying its codebase.
vLLM provides a plugin mechanism for registering externally implemented models without modifying the codebase.
To integrate your implemented model from the `vllm_ascend/models/` directory:
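A minimal registration sketch (the module path and class name below are placeholders for your own implementation):
```python
from vllm import ModelRegistry

def register_model():
    # A lazy "module:Class" string avoids importing the model at plugin-load time.
    ModelRegistry.register_model(
        "CustomModelForCausalLM",
        "vllm_ascend.models.custom_model:CustomModelForCausalLM",
    )
```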
@@ -220,33 +220,33 @@ The first argument of `vllm.ModelRegistry.register_model()` indicates the unique
## Step 3: Verification
### Case 1: Overriding Existing vLLM Model Architecture
### Case 1: Overriding Existing vLLM Model Architectures
If you're registering a customized model architecture based on vllm's existing implementation (overriding vllm's original class), when executing vllm offline/online inference (using any model), you'll observe warning logs similar to the following output from `vllm/models_executor/models/registry.py`.
If you're registering a customized model architecture based on vLLM's existing implementation (overriding vLLM's original class), when executing vLLM offline/online inference (using any model), you'll observe warning logs similar to the following output from `vllm/model_executor/models/registry.py`.
```bash
Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend/models/deepseek_v2:CustomDeepseekV2ForCausalLM.
```
### Case 2: Registering New Model Architecture
### Case 2: Registering New Model Architectures
If you're registering a novel model architecture not present in vllm (creating a completely new class), current logs won't provide explicit confirmation by default. It's recommended to add the following logging statement at the end of the `register_model` method in `vllm/models_executor/models/registry.py`.
If you're registering a novel model architecture not present in vLLM (creating a completely new class), current logs won't provide explicit confirmation by default. It's recommended to add the following logging statement at the end of the `register_model` method in `vllm/model_executor/models/registry.py`.
```python
logger.info(f"model_arch: {model_arch} has been registered here!")
```
After adding this line, you will see confirmation logs shown below when running vllm offline/online inference (using any model).
After adding this line, you will see confirmation logs shown below when running vLLM offline/online inference (using any model).
```bash
model_arch: CustomModelForCausalLM has been registered here!
```
This log output confirms your novel model architecture has been successfully registered in vllm.
This log output confirms your novel model architecture has been successfully registered in vLLM.
## Step 4: Testing
After adding a new model, we should do basic functional test (offline/online inference), accuracy test and performance benchmark for the model.
After adding a new model, we should do basic functional test (offline/online inference), accuracy test, and performance benchmark for the model.
Find more details at:

View File

@@ -1,3 +1,3 @@
# Adding a New Multi-Modal Model
# Adding a New Multimodal Model
**_Comming soon ..._**
**_Coming soon ..._**

View File

@@ -1,6 +1,6 @@
# Optimization and Tuning
This guide aims to help users to improve vllm-ascend performance on system level. It includes OS configuration, library optimization, deploy guide and so on. Any feedback is welcome.
This guide aims to help users to improve vllm-ascend performance on system level. It includes OS configuration, library optimization, deployment guide and so on. Any feedback is welcome.
## Preparation
@@ -57,10 +57,10 @@ pip install modelscope pandas datasets gevent sacrebleu rouge_score pybind11 pyt
VLLM_USE_MODELSCOPE=true
```
Please follow the [Installation Guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) to make sure vllm, vllm-ascend and mindie-turbo is installed correctly.
Please follow the [Installation Guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) to make sure vLLM, vllm-ascend, and MindIE Turbo are installed correctly.
:::{note}
Make sure your vllm and vllm-ascend are installed after your python configuration completed, because these packages will build binary files using the python in current environment. If you install vllm, vllm-ascend and mindie-turbo before chapter 1.1, the binary files will not use the optimized python.
Make sure your vLLM and vllm-ascend are installed after your Python configuration is completed, because these packages will build binary files using the Python in the current environment. If you install vLLM, vllm-ascend, and MindIE Turbo before completing section 1.1, the binary files will not use the optimized Python.
:::
## Optimizations
@@ -69,7 +69,7 @@ Make sure your vllm and vllm-ascend are installed after your python configuratio
#### 1.1. Install optimized `python`
Python supports **LTO** and **PGO** optimization starting from version `3.6` and above, which can be enabled at compile time. And we have offered compilation optimized `python` packages directly to users for the sake of convenience. You can also reproduce the `python` build follow this [tutorial](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0063.html) according to your specific scenarios.
Python supports **LTO** and **PGO** optimization starting from version `3.6`, which can be enabled at compile time. For convenience, we offer compilation-optimized `python` packages directly to users. You can also reproduce the `python` build by following this [tutorial](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0063.html) according to your specific scenarios.
```{code-block} bash
:substitutions:
@@ -101,7 +101,7 @@ export PATH=/usr/bin:/usr/local/python/bin:$PATH
#### 2.1. jemalloc
**jemalloc** is a memory allocator that improves performance for multi-threads scenario and can reduce memory fragment. jemalloc use thread local memory manager to allocate variables, which can avoid lock competition between multi-threads and can hugely optimize performance.
**jemalloc** is a memory allocator that improves performance in multi-thread scenarios and reduces memory fragmentation. jemalloc uses a thread-local memory manager to allocate variables, which avoids lock contention between threads and can significantly improve performance.
```{code-block} bash
:substitutions:
@@ -144,18 +144,18 @@ Memory optimization:
```{code-block} bash
:substitutions:
# Upper limit of memory block splitting allowed (MB), Setting this parameter can prevent large memory blocks from being split.
# Upper limit of memory block splitting allowed (MB): Setting this parameter can prevent large memory blocks from being split.
export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250"
# When operators on the communication stream have dependencies, they all need to be ended before being released for reuse. The logic of multi-stream reuse is to release the memory on the communication stream in advance so that the computing stream can be reused.
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
```
Schedule optimization:
Scheduling optimization:
```{code-block} bash
:substitutions:
# Optimize operator delivery queue, this will affect the memory peak value, and may degrade if the memory is tight.
# Optimize operator delivery queue. This will affect the memory peak value, and may degrade if the memory is tight.
export TASK_QUEUE_ENABLE=2
# This will greatly improve the CPU bottleneck model and ensure the same performance for the NPU bottleneck model.
@@ -168,7 +168,7 @@ export CPU_AFFINITY_CONF=1
There are some performance tuning features in HCCL, which are controlled by environment variables.
You can configure HCCL to use "AIV" mode to optimize performance by setting the environment variable shown below. In "AIV" mode, the communication is scheduled by AI vector core directly with ROCE, instead of being scheduled by AI cpu.
You can configure HCCL to use "AIV" mode to optimize performance by setting the environment variable shown below. In "AIV" mode, the communication is scheduled by AI vector core directly with RoCE, instead of being scheduled by AI CPU.
```{code-block} bash
:substitutions:
@@ -177,7 +177,7 @@ export HCCL_OP_EXPANSION_MODE="AIV"
Plus, there are more features for performance optimization in specific scenarios, which are shown below.
- `HCCL_INTRA_ROCE_ENABLE`: Use RDMA link instead of SDMA link between two 8Ps as the mesh interconnect link, find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0044.html).
- `HCCL_RDMA_TC`: Use this var to configure traffic class of RDMA network card, find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html).
- `HCCL_RDMA_SL`: Use this var to configure service level of RDMA network card, find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html).
- `HCCL_BUFFSIZE`: Use this var to control the cache size for sharing data between two NPUs, find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html).
- `HCCL_INTRA_ROCE_ENABLE`: Use RDMA link instead of SDMA link between two 8Ps as the mesh interconnect link. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0044.html).
- `HCCL_RDMA_TC`: Use this var to configure the traffic class of the RDMA NIC. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html).
- `HCCL_RDMA_SL`: Use this var to configure the service level of the RDMA NIC. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html).
- `HCCL_BUFFSIZE`: Use this var to control the cache size for sharing data between two NPUs. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html).

View File

@@ -1,7 +1,7 @@
# Performance Benchmark
This document details the benchmark methodology for vllm-ascend, aimed at evaluating the performance under a variety of workloads. To maintain alignment with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) script provided by the vllm project.
**Benchmark Coverage**: We measure offline e2e latency and throughput, and fixed-QPS online serving benchmarks, for more details see [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
**Benchmark Coverage**: We measure offline E2E latency and throughput, and fixed-QPS online serving benchmarks. For more details, see [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
## 1. Run docker container
@@ -37,7 +37,7 @@ pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/si
pip install -r benchmarks/requirements-bench.txt
```
## 3. (Optional)Prepare model weights
## 3. (Optional) Prepare model weights
For faster running speed, we recommend downloading the model in advance
```bash
@@ -68,7 +68,7 @@ Run benchmark script:
bash benchmarks/scripts/run-performance-benchmarks.sh
```
After about 10 mins, the output is as shown below:
After about 10 mins, the output is shown below:
```bash
online serving:
@@ -180,7 +180,7 @@ Total num prompt tokens: 42659
Total num output tokens: 43545
```
The result json files are generated into the path `benchmark/results`
The result json files are generated into the path `benchmark/results`.
These files contain detailed benchmarking results for further analysis.
```bash

View File

@@ -9,7 +9,7 @@ The execution duration of each stage (including pre/post-processing, model forwa
* Use the non-blocking API `ProfileExecuteDuration().capture_async` to set observation points asynchronously when you need to observe the execution duration.
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages.
**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execute duration profiling. Execute the script as follows:**
**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execution duration profiling. Execute the script as follows:**
```
VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py

View File

@@ -9,21 +9,21 @@
### 1. What devices are currently supported?
Currently, **ONLY** Atlas A2 series(Ascend-cann-kernels-910b)Atlas A3 series(Atlas-A3-cann-kernels) and Atlas 300I(Ascend-cann-kernels-310p) series are supported:
Currently, **ONLY** the Atlas A2 series (Ascend-cann-kernels-910b), Atlas A3 series (Atlas-A3-cann-kernels), and Atlas 300I series (Ascend-cann-kernels-310p) are supported:
- Atlas A2 Training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
- Atlas 800I A2 Inference series (Atlas 800I A2)
- Atlas A3 Training series (Atlas 800T A3, Atlas 900 A3 SuperPoD, Atlas 9000 A3 SuperPoD)
- Atlas 800I A3 Inference series (Atlas 800I A3)
- [Experimental] Atlas 300I Inference series (Atlas 300I Duo). Currently for 310I Duo the stable version is vllm-ascend v0.10.0rc1.
- Atlas A2 training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
- Atlas 800I A2 inference series (Atlas 800I A2)
- Atlas A3 training series (Atlas 800T A3, Atlas 900 A3 SuperPoD, Atlas 9000 A3 SuperPoD)
- Atlas 800I A3 inference series (Atlas 800I A3)
- [Experimental] Atlas 300I inference series (Atlas 300I Duo). Currently, for the Atlas 300I Duo, the stable version is vllm-ascend v0.10.0rc1.
Below series are NOT supported yet:
- Atlas 200I A2 (Ascend-cann-kernels-310b) unplanned yet
- Ascend 910, Ascend 910 Pro B (Ascend-cann-kernels-910) unplanned yet
From a technical view, vllm-ascend support would be possible if the torch-npu is supported. Otherwise, we have to implement it by using custom ops. We are also welcome to join us to improve together.
From a technical point of view, vllm-ascend support would be possible as long as torch-npu supports the device. Otherwise, we have to implement it using custom operators. You are also welcome to join us and improve it together.
### 2. How to get our docker containers?
### 2. How to get our Docker containers?
You can get our containers at `Quay.io`, e.g., [<u>vllm-ascend</u>](https://quay.io/repository/ascend/vllm-ascend?tab=tags) and [<u>cann</u>](https://quay.io/repository/ascend/cann?tab=tags).
@@ -35,8 +35,8 @@ TAG=v0.7.3rc2
docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:$TAG
```
#### Load Docker Images for offline environment
If you want to use container image for offline environments (no internet connection), you need to download container image in a environment with internet access:
#### Load Docker images for the offline environment
If you want to use container images for offline environments (without Internet connection), you need to download the container image in an environment with Internet access:
**Exporting Docker images:**
@@ -62,20 +62,20 @@ docker load -i vllm-ascend-$TAG.tar.gz
docker images | grep vllm-ascend
```
### 3. What models does vllm-ascend supports?
### 3. What models does vllm-ascend support?
Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_models.html).
### 4. How to get in touch with our community?
There are many channels that you can communicate with our community developers / users:
There are many channels that you can communicate with our community developers and users:
- Submit a GitHub [<u>issue</u>](https://github.com/vllm-project/vllm-ascend/issues?page=1).
- Join our [<u>weekly meeting</u>](https://docs.google.com/document/d/1hCSzRTMZhIB8vRq1_qOOjx4c9uYUxvdQvDsMV2JcSrw/edit?tab=t.0#heading=h.911qu8j8h35z) and share your ideas.
- Join our [<u>WeChat</u>](https://github.com/vllm-project/vllm-ascend/issues/227) group and ask your quenstions.
- Join our ascend channel in [<u>vLLM forums</u>](https://discuss.vllm.ai/c/hardware-support/vllm-ascend-support/6) and publish your topics.
- Join our [<u>WeChat</u>](https://github.com/vllm-project/vllm-ascend/issues/227) group and ask your questions.
- Join the Ascend channel in [<u>vLLM forums</u>](https://discuss.vllm.ai/c/hardware-support/vllm-ascend-support/6) and publish your topics.
### 5. What features does vllm-ascend V1 supports?
### 5. What features does vllm-ascend V1 support?
Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_features.html).
@@ -86,7 +86,7 @@ Basically, the reason is that the NPU environment is not configured correctly. Y
2. try `source /usr/local/Ascend/ascend-toolkit/set_env.sh` to enable CANN package.
3. try `npu-smi info` to check whether the NPU is working.
If all above steps are not working, you can try the following code with python to check whether there is any error:
If all above steps are not working, you can try the following code with Python to check whether there is any error:
```
import torch
@@ -94,72 +94,72 @@ import torch_npu
import vllm
```
If all above steps are not working, feel free to submit a GitHub issue.
If the problem still persists, feel free to submit a GitHub issue.
### 7. How does vllm-ascend perform?
Currently, only some models are improved. Such as `Qwen2.5 VL`, `Qwen3`, `Deepseek V3`. Others are not good enough. From 0.9.0rc2, Qwen and Deepseek works with graph mode to play a good performance. What's more, you can install `mindie-turbo` with `vllm-ascend v0.7.3` to speed up the inference as well.
Currently, the performance is improved on some models, such as `Qwen2.5 VL`, `Qwen3`, and `Deepseek V3`. From 0.9.0rc2, Qwen and DeepSeek work with graph mode to deliver good performance. What's more, you can install `mindie-turbo` with `vllm-ascend v0.7.3` to speed up the inference as well.
### 8. How vllm-ascend work with vllm?
vllm-ascend is a plugin for vllm. Basically, the version of vllm-ascend is the same as the version of vllm. For example, if you use vllm 0.7.3, you should use vllm-ascend 0.7.3 as well. For main branch, we will make sure `vllm-ascend` and `vllm` are compatible by each commit.
### 8. How does vllm-ascend work with vllm?
vllm-ascend is a plugin for vllm. Basically, the version of vllm-ascend is the same as the version of vllm. For example, if you use vllm 0.7.3, you should use vllm-ascend 0.7.3 as well. For the main branch, we make sure `vllm-ascend` and `vllm` are compatible at each commit.
### 9. Does vllm-ascend support Prefill Disaggregation feature?
### 9. Does vllm-ascend support the prefill-decode disaggregation feature?
Currently, only 1P1D is supported on V0 Engine. For V1 Engine or NPND support, We will make it stable and supported by vllm-ascend in the future.
Currently, only 1P1D is supported on V0 Engine. For V1 Engine or NPND support, we will make it stable and supported by vllm-ascend in the future.
### 10. Does vllm-ascend support quantization method?
### 10. Does vllm-ascend support quantization methods?
Currently, w8a8 quantization is already supported by vllm-ascend originally on v0.8.4rc2 or higher, If you're using vllm 0.7.3 version, w8a8 quantization is supporeted with the integration of vllm-ascend and mindie-turbo, please use `pip install vllm-ascend[mindie-turbo]`.
Currently, W8A8 quantization is natively supported by vllm-ascend on v0.8.4rc2 or higher. If you're using vllm 0.7.3, W8A8 quantization is supported via the integration of vllm-ascend and mindie-turbo; please use `pip install vllm-ascend[mindie-turbo]`.
### 11. How to run w8a8 DeepSeek model?
### 11. How to run a W8A8 DeepSeek model?
Please following the [inferencing tutorail](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html) and replace model to DeepSeek.
Follow the [inference tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html) and replace the model with DeepSeek.
### 12. There is no output in log when loading models using vllm-ascend, How to solve it?
### 12. How to solve the problem that there is no output in the log when loading models using vllm-ascend?
If you're using vllm 0.7.3 version, this is a known progress bar display issue in VLLM, which has been resolved in [this PR](https://github.com/vllm-project/vllm/pull/12428), please cherry-pick it locally by yourself. Otherwise, please fill up an issue.
If you're using vllm 0.7.3, this is a known progress bar display issue in vLLM, which has been resolved in [this PR](https://github.com/vllm-project/vllm/pull/12428); please cherry-pick it locally by yourself. Otherwise, please file an issue.
### 13. How vllm-ascend is tested
### 13. How is vllm-ascend tested?
vllm-ascend is tested by functional test, performance test and accuracy test.
vllm-ascend is tested in three aspects: functionality, performance, and accuracy.
- **Functional test**: we added CI, includes portion of vllm's native unit tests and vllm-ascend's own unit testson vllm-ascend's test, we test basic functionality、popular models availability and [supported features](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_features.html) via e2e test
- **Functional test**: We added CI, including part of vLLM's native unit tests and vllm-ascend's own unit tests. In vllm-ascend's tests, we cover basic functionality, popular model availability, and [supported features](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_features.html) through E2E tests.
- **Performance test**: we provide [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks) tools for end-to-end performance benchmark which can easily to re-route locally, we'll publish a perf website to show the performance test results for each pull request
- **Performance test**: We provide [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks) tools for E2E performance benchmarking, which can easily be reproduced locally. We will publish a perf website to show the performance test results for each pull request.
- **Accuracy test**: we're working on adding accuracy test to CI as well.
- **Accuracy test**: We are working on adding accuracy test to the CI as well.
Finnall, for each release, we'll publish the performance test and accuracy test report in the future.
Finally, we plan to publish performance and accuracy test reports for each release.
### 14. How to fix the error "InvalidVersion" when using vllm-ascend?
It's usually because you have installed an dev/editable version of vLLM package. In this case, we provide the env variable `VLLM_VERSION` to let users specify the version of vLLM package to use. Please set the env variable `VLLM_VERSION` to the version of vLLM package you have installed. The format of `VLLM_VERSION` should be `X.Y.Z`.
The problem is usually caused by the installation of a dev or editable version of the vLLM package. In this case, we provide the environment variable `VLLM_VERSION` to let users specify the version of the vLLM package to use. Please set the environment variable `VLLM_VERSION` to the version of the vLLM package you have installed. The format of `VLLM_VERSION` should be `X.Y.Z`.
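For example, if your editable install is based on vLLM 0.10.0 (the version shown here is illustrative):
```bash
export VLLM_VERSION=0.10.0
```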
### 15. How to handle Out Of Memory?
OOM errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, you can refer to [vLLM's OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#out-of-memory).
### 15. How to handle the out-of-memory issue?
OOM errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, you can refer to [vLLM OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#out-of-memory).
In scenarios where NPUs have limited HBM (High Bandwidth Memory) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:
In scenarios where NPUs have limited high bandwidth memory (HBM) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:
- **Adjust `--gpu-memory-utilization`**: If unspecified, will use the default value of `0.9`. You can decrease this param to reserve more memory to reduce fragmentation risks. See more note in: [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/serving/engine_args.html#vllm.engine.arg_utils-_engine_args_parser-cacheconfig).
- **Adjust `--gpu-memory-utilization`**: If unspecified, the default value is `0.9`. You can decrease this value to reserve more memory to reduce fragmentation risks. See details in: [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/serving/engine_args.html#vllm.engine.arg_utils-_engine_args_parser-cacheconfig).
- **Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, you can `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` to enable virtual memory feature to mitigate memory fragmentation caused by frequent dynamic memory size adjustments during runtime, see more note in: [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html).
- **Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, you can use `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` to enable virtual memory feature to mitigate memory fragmentation caused by frequent dynamic memory size adjustments during runtime. See details in: [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html).
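For example, a hypothetical launch combining both knobs (the model name and utilization value are illustrative):
```bash
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
vllm serve Qwen/Qwen2.5-0.5B-Instruct --gpu-memory-utilization 0.85
```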
### 16. Failed to enable NPU graph mode when running DeepSeek?
You may encounter the following error if running DeepSeek with NPU graph mode enabled. The allowed number of queries per kv when enabling both MLA and Graph mode only support {32, 64, 128}, **Thus this is not supported for DeepSeek-V2-Lite**, as it only has 16 attention heads. The NPU graph mode support on DeepSeek-V2-Lite will be done in the future.
### 16. Failed to enable NPU graph mode when running DeepSeek.
You may encounter the following error when running DeepSeek with NPU graph mode enabled. The allowed number of queries per KV when enabling both MLA and graph mode is {32, 64, 128}. **Thus this is not supported for DeepSeek-V2-Lite**, as it only has 16 attention heads. NPU graph mode support for DeepSeek-V2-Lite will be implemented in the future.
And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tensor parallel split, num_heads / num_kv_heads in {32, 64, 128}.
And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure that, after the tensor parallel split, `num_heads / num_kv_heads` is in {32, 64, 128}.
```bash
[rank0]: RuntimeError: EZ9999: Inner Error!
[rank0]: EZ9999: [PID: 62938] 2025-05-27-06:52:12.455.807 numHeads / numKvHeads = 8, MLA only support {32, 64, 128}.[FUNC:CheckMlaAttrs][FILE:incre_flash_attention_tiling_check.cc][LINE:1218]
```
### 17. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend?
You may encounter the problem of C compilation failure when reinstalling vllm-ascend from source using pip. If the installation fails, it is recommended to use `python setup.py install` to install, or use `python setup.py clean` to clear the cache.
### 17. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend.
You may encounter the problem of C compilation failure when reinstalling vllm-ascend from source using pip. If the installation fails, use `python setup.py install` (recommended) to install, or use `python setup.py clean` to clear the cache.
### 18. How to generate determinitic results when using vllm-ascend?
### 18. How to generate deterministic results when using vllm-ascend?
There are several factors that affect output certainty:
1. Sampler Method: using **Greedy sample** by setting `temperature=0` in `SamplingParams`, e.g.:
1. Sampler method: using **Greedy sample** by setting `temperature=0` in `SamplingParams`, e.g.:
```python
from vllm import LLM, SamplingParams
@@ -184,7 +184,7 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
2. Set the following enveriments parameters:
2. Set the following environment parameters:
```bash
export LCCL_DETERMINISTIC=1
@@ -193,9 +193,9 @@ export ATB_MATMUL_SHUFFLE_K_ENABLE=0
export ATB_LLM_LCOC_ENABLE=0
```
### 19. How to fix the error "ImportError: Please install vllm[audio] for audio support" for Qwen2.5-Omni model
The `Qwen2.5-Omni` model requires the `librosa` package to be installed, you need to install the `qwen-omni-utils` package to ensure all dependencies are met `pip install qwen-omni-utils`,
this package will install `librosa` and its related dependencies, resolving the `ImportError: No module named 'librosa'` issue and ensuring audio processing functionality works correctly.
### 19. How to fix the error "ImportError: Please install vllm[audio] for audio support" for the Qwen2.5-Omni model
The `Qwen2.5-Omni` model requires the `librosa` package. You need to install the `qwen-omni-utils` package (`pip install qwen-omni-utils`) to ensure all dependencies are met.
This package will install `librosa` and its related dependencies, resolving the `ImportError: No module named 'librosa'` issue and ensuring that the audio processing functionality works correctly.
### 20. How to troubleshoot and resolve size capture failures resulting from stream resource exhaustion, and what are the underlying causes?
@@ -207,10 +207,10 @@ ERROR 09-26 10:48:07 [model_runner_v1.py:3029] ACLgraph has insufficient availab
Recommended mitigation strategies:
1. Manually configure the compilation_config parameter with a reduced size set: '{"cudagraph_capture_sizes":[size1, size2, size3, ...]}'.
2. Employ ACLgraph's full graph mode as an alternative to the piece-wise approach.
2. Employ ACLGraph's full graph mode as an alternative to the piece-wise approach.
Root cause analysis:
The current stream requirement calculation for size captures only accounts for measurable factors including: data parallel size, tensor parallel size, expert parallel configuration, piece graph count, multistream overlap shared expert settings, and HCCL communication mode (AIV/AICPU). However, numerous unquantifiable elements - such as operator characteristics and specific hardware features - consume additional streams outside of this calculation framework, resulting in stream resource exhaustion during size capture operations.
The current stream requirement calculation for size captures only accounts for measurable factors including: data parallel size, tensor parallel size, expert parallel configuration, piece graph count, multistream overlap shared expert settings, and HCCL communication mode (AIV/AICPU). However, numerous unquantifiable elements, such as operator characteristics and specific hardware features, consume additional streams outside of this calculation framework, resulting in stream resource exhaustion during size capture operations.
### 21. Installing vllm-ascend will overwrite the existing torch-npu package
### 21. Installing vllm-ascend will overwrite the existing torch-npu package.
Installing vllm-ascend will overwrite the existing torch-npu package. If you need to install a specific version of torch-npu, you can manually install the specified version of torch-npu after installing vllm-ascend.

View File

@@ -16,13 +16,13 @@ This document describes how to install vllm-ascend manually.
| torch-npu | >= 2.7.1.dev20250724 | Required for vllm-ascend. No need to install it manually; it will be installed automatically in the steps below |
| torch | >= 2.7.1 | Required for torch-npu and vllm |
You have 2 way to install:
There are two installation methods:
- **Using pip**: first prepare env manually or via CANN image, then install `vllm-ascend` using pip.
- **Using docker**: use the `vllm-ascend` pre-built docker image directly.
## Configure a new environment
Before installing, you need to make sure firmware/driver and CANN are installed correctly, refer to [link](https://ascend.github.io/docs/sources/ascend/quick_install.html) for more details.
Before installation, you need to make sure the firmware/driver and CANN are installed correctly; refer to the [Ascend Environment Setup Guide](https://ascend.github.io/docs/sources/ascend/quick_install.html) for more details.
### Configure hardware environment
@@ -71,11 +71,11 @@ docker run --rm \
You can also install CANN manually:
```bash
# Create a virtual environment
# Create a virtual environment.
python -m venv vllm-ascend-env
source vllm-ascend-env/bin/activate
# Install required python packages.
# Install required Python packages.
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple attrs 'numpy<2.0.0' decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py wheel typing_extensions
# Download and install the CANN package.
@@ -102,11 +102,11 @@ source /usr/local/Ascend/nnal/atb/set_env.sh
::::{tab-item} Before using docker
:sync: docker
No more extra step if you are using `vllm-ascend` prebuilt docker image.
No more extra step if you are using `vllm-ascend` prebuilt Docker image.
::::
:::::
Once it's done, you can start to set up `vllm` and `vllm-ascend`.
Once it is done, you can start to set up `vllm` and `vllm-ascend`.
## Setup vllm and vllm-ascend
@@ -117,7 +117,7 @@ Once it's done, you can start to set up `vllm` and `vllm-ascend`.
:selected:
:sync: pip
First install system dependencies and config pip mirror:
First install system dependencies and configure pip mirror:
```bash
# Using apt-get with mirror
@@ -129,7 +129,7 @@ apt-get update -y && apt-get install -y gcc g++ cmake libnuma-dev wget git curl
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
**[Optional]** Then config the extra-index of `pip` if you are working on a x86 machine or using torch-npu dev version:
**[Optional]** Then configure the extra-index of `pip` if you are working on an x86 machine or using torch-npu dev version:
```bash
# For torch-npu dev version or x86 machine
@@ -158,26 +158,26 @@ or build from **source code**:
```{code-block} bash
:substitutions:
# Install vLLM
# Install vLLM.
git clone --depth 1 --branch |vllm_version| https://github.com/vllm-project/vllm
cd vllm
VLLM_TARGET_DEVICE=empty pip install -v -e .
cd ..
# Install vLLM Ascend
# Install vLLM Ascend.
git clone --depth 1 --branch |vllm_ascend_version| https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -v -e .
cd ..
```
vllm-ascend will build custom ops by default. If you don't want to build it, set `COMPILE_CUSTOM_KERNELS=0` environment to disable it.
vllm-ascend builds custom operators by default. If you don't want to build them, set the `COMPILE_CUSTOM_KERNELS=0` environment variable to disable it.
:::
```{note}
If you are building from v0.7.3-dev and intend to use sleep mode feature, you should set `COMPILE_CUSTOM_KERNELS=1` manually.
To build custom ops, gcc/g++ higher than 8 and c++ 17 or higher is required. If you're using `pip install -e .` and encounter a torch-npu version conflict, please install with `pip install --no-build-isolation -e .` to build on system env.
If you encounter other problems during compiling, it is probably because unexpected compiler is being used, you may export `CXX_COMPILER` and `C_COMPILER` in env to specify your g++ and gcc locations before compiling.
To build custom operators, gcc/g++ newer than 8 and C++17 or higher are required. If you're using `pip install -e .` and encounter a torch-npu version conflict, install with `pip install --no-build-isolation -e .` to build in the system environment.
If you encounter other problems during compilation, it is probably because an unexpected compiler is being used; you may export `CXX_COMPILER` and `C_COMPILER` in the environment to specify your g++ and gcc locations before compiling.
```
::::
@@ -220,7 +220,7 @@ docker run --rm \
-it $IMAGE bash
```
The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html)(`pip install -e`) to help developer immediately take place changes without requiring a new installation.
The default workdir is `/workspace`. vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html) (`pip install -e`) to help developers make changes take effect immediately without requiring a new installation.
::::
:::::

View File

@@ -3,11 +3,11 @@
## Prerequisites
### Supported Devices
- Atlas A2 Training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
- Atlas 800I A2 Inference series (Atlas 800I A2)
- Atlas A3 Training series (Atlas 800T A3, Atlas 900 A3 SuperPoD, Atlas 9000 A3 SuperPoD)
- Atlas 800I A3 Inference series (Atlas 800I A3)
- [Experimental] Atlas 300I Inference series (Atlas 300I Duo)
- Atlas A2 training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
- Atlas 800I A2 inference series (Atlas 800I A2)
- Atlas A3 training series (Atlas 800T A3, Atlas 900 A3 SuperPoD, Atlas 9000 A3 SuperPoD)
- Atlas 800I A3 inference series (Atlas 800I A3)
- [Experimental] Atlas 300I inference series (Atlas 300I Duo)
## Setup environment using container
@@ -71,11 +71,11 @@ yum update -y && yum install -y curl
::::
:::::
The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html)(`pip install -e`) to help developer immediately take place changes without requiring a new installation.
The default workdir is `/workspace`. vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html) (`pip install -e`) to help developers make changes take effect immediately without requiring a new installation.
## Usage
You can use Modelscope mirror to speed up download:
You can use ModelScope mirror to speed up download:
<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
@@ -128,7 +128,7 @@ the following command to start the vLLM server with the
vllm serve Qwen/Qwen2.5-0.5B-Instruct &
```
If you see log as below:
If you see a log as below:
```
INFO: Started server process [3594]
@@ -139,7 +139,7 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Congratulations, you have successfully started the vLLM server!
You can query the list the models:
You can query the list of models:
<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
@@ -162,8 +162,7 @@ curl http://localhost:8000/v1/completions \
}' | python3 -m json.tool
```
vLLM is serving as background process, you can use `kill -2 $VLLM_PID` to stop the background process gracefully,
it's equal to `Ctrl-C` to stop foreground vLLM process:
vLLM is running as a background process; you can use `kill -2 $VLLM_PID` to stop it gracefully, which is similar to pressing `Ctrl-C` to stop a foreground vLLM process:
<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
@@ -172,7 +171,7 @@ it's equal to `Ctrl-C` to stop foreground vLLM process:
kill -2 "$VLLM_PID"
```
You will see output as below:
The output is as below:
```
INFO: Shutting down FastAPI HTTP server.
@@ -181,6 +180,6 @@ INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
```
Finally, you can exit container by using `ctrl-D`.
Finally, you can exit the container by using `ctrl-D`.
::::
:::::

View File

@@ -1,7 +1,7 @@
# Multi-Node (DeepSeek V3.2)
:::{note}
Only machines with aarch64 is supported currently, x86 is coming soon. This guide take A3 as the example.
Only machines with AArch64 are supported currently; x86 will be supported soon. This guide takes A3 as an example.
:::
## Verify Multi-Node Communication Environment
@@ -80,14 +80,14 @@ for i in {0..15}; do hccn_tool -i $i -ip -g | grep ipaddr; done
hccn_tool -i 0 -ping -g address 10.20.0.20
```
## Deploy DeepSeek-V3.2-Exp with vLLM-Ascend:
## Deploy DeepSeek-V3.2-Exp with vLLM-Ascend
Currently, we provide a all-in-one image (include CANN 8.2RC1 + [SparseFlashAttention/LightningIndexer](https://gitcode.com/cann/cann-recipes-infer/tree/master/ops/ascendc) + [MLAPO](https://github.com/vllm-project/vllm-ascend/pull/3226)). You can also build your own image refer to [link](https://github.com/vllm-project/vllm-ascend/issues/3278).
Currently, we provide an all-in-one image (including CANN 8.2RC1 + [SparseFlashAttention/LightningIndexer](https://gitcode.com/cann/cann-recipes-infer/tree/master/ops/ascendc) + [MLAPO](https://github.com/vllm-project/vllm-ascend/pull/3226)). You can also build your own image by referring to [this issue](https://github.com/vllm-project/vllm-ascend/issues/3278).
- `DeepSeek-V3.2-Exp`: requreid 2 Atlas 800 A3(64G*16) nodes or 4 Atlas 800 A2(64G*8). [Model weight link](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-BF16)
- `DeepSeek-V3.2-Exp-w8a8`: requreid 1 Atlas 800 A3(64G*16) node or 2 Atlas 800 A2(64G*8). [Model weight link](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-w8a8)
- `DeepSeek-V3.2-Exp`: requires 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. [Model weight link](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-BF16)
- `DeepSeek-V3.2-Exp-w8a8`: requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Model weight link](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-w8a8)
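As a rough sanity check (assuming ~671B total parameters for DeepSeek-V3.2-Exp): BF16 weights occupy about 671B × 2 bytes ≈ 1.3 TB, which exceeds the ~1 TB (64G × 16) of a single A3 node, hence two nodes; the w8a8 variant roughly halves this to ≈ 0.7 TB, which fits on one A3 node.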
Run the following command to start the container in each node(This guide suppose you have download the weight to /root/.cache already):
Run the following command to start the container on each node (you should download the weights to /root/.cache in advance):
:::::{tab-set}
::::{tab-item} A2 series
@@ -180,13 +180,13 @@ docker run --rm \
:::::{tab-set}
::::{tab-item} DeepSeek-V3.2-Exp A3 series
Run the following scripts on two nodes respectively
Run the following scripts on two nodes respectively.
:::{note}
Before launch the inference server, ensure the following environment variables are set for multi node communication
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::
**node0**
**Node 0**
```shell
#!/bin/sh
@@ -225,7 +225,7 @@ vllm serve /root/.cache/Modelers_Park/DeepSeek-V3.2-Exp \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
```
**node1**
**Node 1**
```shell
#!/bin/sh
@@ -297,9 +297,9 @@ vllm serve vllm-ascend/DeepSeek-V3.2-Exp-W8A8 \
::::
::::{tab-item} DeepSeek-V3.2-Exp-W8A8 A2 series
Run the following scripts on two nodes respectively
Run the following scripts on two nodes respectively.
**node0**
**Node 0**
```shell
#!/bin/sh
@@ -341,7 +341,7 @@ vllm serve vllm-ascend/DeepSeek-V3.2-Exp-W8A8 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
```
**node1**
**Node 1**
```shell
#!/bin/sh

View File

@@ -3,18 +3,18 @@
## Getting Started
vLLM-Ascend now supports Data Parallel (DP) deployment, enabling model weights to be replicated across multiple NPUs or instances, each processing independent batches of requests. This is particularly useful for scaling throughput across devices while maintaining high resource utilization.
Each DP rank is deployed as a separate core engine process which communicates with front-end process(es) via ZMQ sockets. Data Parallel can be combined with Tensor Parallel, in which case each DP engine owns a number of per-NPU worker processes equal to the TP size.
Each DP rank is deployed as a separate "core engine" process which communicates with front-end process(es) via ZMQ sockets. Data Parallel can be combined with Tensor Parallel, in which case each DP engine owns a number of per-NPU worker processes equal to the TP size.
For Mixture-of-Experts (MoE) models — especially advanced architectures like DeepSeek that utilize Multi-head Latent Attention (MLA) — a hybrid parallelism approach is recommended:
- Use **Data Parallelism (DP)** for attention layers, which are replicated across devices and handle separate batches.
- Use **Expert or Tensor Parallelism (EP/TP)** for expert layers, which are sharded across devices to distribute the computation.
- Use **Data Parallel (DP)** for attention layers, which are replicated across devices and handle separate batches.
- Use **Expert or Tensor Parallel (EP/TP)** for expert layers, which are sharded across devices to distribute the computation.
This division enables attention layers to be replicated across Data Parallel (DP) ranks, enabling them to process different batches independently. Meanwhile, expert layers are partitioned (sharded) across devices using Expert or Tensor Parallelism(DP*TP), maximizing hardware utilization and efficiency.
This division allows attention layers to be replicated across DP ranks so that they process different batches independently, while expert layers are partitioned (sharded) across devices using EP/TP, maximizing hardware utilization and efficiency.
In these cases the data parallel ranks are not completely independent, forward passes must be aligned and expert layers across all ranks are required to synchronize during every forward pass, even if there are fewer requests to be processed than DP ranks.
In these cases, the data parallel ranks are not completely independent. Forward passes must be aligned and expert layers across all ranks are required to synchronize during every forward pass, even if there are fewer requests to be processed than DP ranks.
For MoE models, when any requests are in progress in any rank, we must ensure that empty dummy forward passes are performed in all ranks which dont currently have any requests scheduled. This is handled via a separate DP `Coordinator` process which communicates with all of the ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).
For MoE models, when any requests are in progress in any rank, we must ensure that empty "dummy" forward passes are performed in all ranks that don't currently have any requests scheduled. This is handled via a separate DP `Coordinator` process, which communicates with all of the ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).
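As an illustrative sketch only (the model and parallel sizes here are assumptions, not a recommendation), such a hybrid layout maps onto vLLM's standard parallelism flags:

```bash
# 2 DP ranks x 8-way TP: attention is replicated across the 2 DP ranks,
# while expert layers are sharded across the EP group of size DP x TP = 16
vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
    --data-parallel-size 2 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel
```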
## Verify Multi-Node Communication Environment
@@ -56,8 +56,8 @@ for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
hccn_tool -i 0 -ping -g address 10.20.0.20
```
## Run with docker
Assume you have two Atlas 800 A2(64G*8) nodes, and want to deploy the `deepseek-v3.1-w8a8` quantitative model across multi-node.
## Run with Docker
Assume you have two Atlas 800 A2 (64G*8) nodes, and want to deploy the `deepseek-v3.1-w8a8` quantized model across multiple nodes.
```{code-block} bash
:substitutions:
@@ -66,7 +66,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note if you are running bridge network with docker, Please expose available ports for multiple nodes communication in advance
# Note: If you are running a bridge network with Docker, please expose the ports required for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
@@ -91,13 +91,13 @@ docker run --rm \
-it $IMAGE bash
```
Run the following scripts on two nodes respectively
Run the following scripts on two nodes respectively.
:::{note}
Before launch the inference server, ensure the following environment variables are set for multi node communication
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::
**node0**
**Node 0**
```shell
#!/bin/sh
@@ -116,8 +116,8 @@ export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export HCCL_BUFFSIZE=1024
# The w8a8 weight can obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-W8A8
# If you want to the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-W8A8
# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
--host 0.0.0.0 \
--port 8004 \
@@ -139,7 +139,7 @@ vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
**node1**
**Node 1**
```shell
#!/bin/sh
@@ -185,7 +185,7 @@ vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
The Deployment view looks like:
The deployment view looks like:
![alt text](../assets/multi_node_dp_deepseek.png)
Once your server is started, you can query the model with input prompts:
@@ -201,8 +201,8 @@ curl http://{ node0 ip:8004 }/v1/completions \
}'
```
## Run benchmarks
For details please refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks)
## Run Benchmarks
For details, refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
```shell
export VLLM_USE_MODELSCOPE=true

View File

@@ -2,10 +2,10 @@
## Verify Multi-Node Communication Environment
referring to [multi_node.md](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html#verification-process)
Refer to [multi_node.md](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html#verification-process).
## Run with docker
Assume you have two Atlas 800 A3(64G*16) nodes(or 4 * A2), and want to deploy the `Kimi-K2-Instruct-W8A8` quantitative model across multi-node.
## Run with Docker
Assume you have two Atlas 800 A3 (64G*16) nodes or four A2 nodes, and want to deploy the `Kimi-K2-Instruct-W8A8` quantized model across multiple nodes.
```{code-block} bash
:substitutions:
@@ -14,7 +14,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note if you are running bridge network with docker, Please expose available ports for multiple nodes communication in advance
# Note: If you are running a bridge network with Docker, please expose the ports required for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
@@ -47,13 +47,13 @@ docker run --rm \
-it $IMAGE bash
```
Run the following scripts on two nodes respectively
Run the following scripts on two nodes respectively.
:::{note}
Before launch the inference server, ensure the following environment variables are set for multi node communication
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::
**node0**
**Node 0**
```shell
#!/bin/sh
@@ -72,8 +72,8 @@ export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
# The w8a8 weight can obtained from https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8
# If you want to the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8
# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
--host 0.0.0.0 \
--port 8004 \
@@ -96,7 +96,7 @@ vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
**node1**
**Node 1**
```shell
#!/bin/sh
@@ -141,7 +141,7 @@ vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
The Deployment view looks like:
The deployment view looks like:
![alt text](../assets/multi_node_dp_kimi.png)
Once your server is started, you can query the model with input prompts:

View File

@@ -2,9 +2,9 @@
## Getting Started
vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide take one-by-one steps to verify these features with constrained resources.
vLLM-Ascend now supports prefill-decode (PD) disaggregation with Expert Parallel (EP) options. This guide walks through verifying these features step by step with constrained resources.
Take the Qwen3-30B-A3B model as an example, use vllm-ascend v0.10.1rc1 (with vLLM v0.10.1.1) on 3 Atlas 800T A2 servers to deploy the "1P2D" architecture. Assume the ip of the prefiller server is 192.0.0.1, and the decoder servers are 192.0.0.2 (decoder 1) and 192.0.0.3 (decoder 2). On each server, use 2 NPUs to deploy one service instance.
Using the Qwen3-30B-A3B model as an example, use vllm-ascend v0.10.1rc1 (with vLLM v0.10.1.1) on 3 Atlas 800T A2 servers to deploy the "1P2D" architecture. Assume the IP address of the prefiller server is 192.0.0.1, and the decoder servers are 192.0.0.2 (decoder 1) and 192.0.0.3 (decoder 2). On each server, use 2 NPUs to deploy one service instance.
## Verify Multi-Node Communication Environment
@@ -49,7 +49,7 @@ for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
## Generate Ranktable
The rank table is a JSON file that specifies the mapping of Ascend NPU ranks to nodes. For more details please refer to the [vllm-ascend examples](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/README.md). Execute the following commands for reference.
The rank table is a JSON file that specifies the mapping of Ascend NPU ranks to nodes. For more details, please refer to the [vllm-ascend examples](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/README.md). Execute the following commands for reference.
```shell
cd vllm-ascend/examples/disaggregate_prefill_v1/
@@ -58,7 +58,7 @@ bash gen_ranktable.sh --ips <prefiller_node1_local_ip> <prefiller_node2_local_ip
[--local-device-ids <id_1>,<id_2>,<id_3>...]
```
Assume that we use device 0,1 on the prefiller server node and device 6,7 on both of the decoder server nodes. Take the following commands as an example. (`--local-device-ids` is necessary if you specify certain NPU devices on the local server.)
Assume that we use devices 0 and 1 on the prefiller server node and devices 6 and 7 on both of the decoder server nodes. The following commands are for reference. (`--local-device-ids` is necessary if you specify certain NPU devices on the local server.)
```shell
# On the prefiller node
@@ -77,20 +77,20 @@ bash gen_ranktable.sh --ips 192.0.0.1 192.0.0.2 192.0.0.3 \
--npus-per-node 2 --network-card-name eth0 --prefill-device-cnt 2 --decode-device-cnt 4 --local-device-ids 6,7
```
Rank table will generated at /vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
The rank table will be generated at `/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json`.
|Parameter | meaning |
|Parameter | Meaning |
| --- | --- |
| --ips | Each node's local ip (prefiller nodes should be front of decoder nodes) |
| --npus-per-node | Each node's npu clips |
| --ips | Each node's local IP address (prefiller nodes should be listed before decoder nodes) |
| --npus-per-node | The number of NPU chips on each node |
| --network-card-name | The physical machines' NIC |
|--prefill-device-cnt | Npu clips used for prefill |
|--decode-device-cnt |Npu clips used for decode |
|--prefill-device-cnt | NPU chips used for prefill |
|--decode-device-cnt |NPU chips used for decode |
|--local-device-ids |Optional. No need if using all devices on the local node. |
## Prefiller / Decoder Deployment
## Prefiller/Decoder Deployment
We can run the following scripts to launch a server on the prefiller/decoder node respectively.
We can run the following scripts to launch a server on the prefiller/decoder node, respectively.
:::::{tab-set}
@@ -214,9 +214,9 @@ vllm serve /model/Qwen3-30B-A3B \
:::::
## Example proxy for Deployment
## Example Proxy for Deployment
Run a proxy server on the same node with prefiller service instance. You can get the proxy program in the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
Run a proxy server on the same node with the prefiller service instance. You can get the proxy program in the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
```shell
python load_balance_proxy_server_example.py \

View File

@@ -2,7 +2,7 @@
## Getting Started
vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide take one-by-one steps to verify these features with constrained resources.
vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide walks through verifying these features step by step with constrained resources.
Take the Qwen3-235B model as an example, use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the IP addresses of the prefiller servers are 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs (16 chips) to deploy one service instance.
@@ -60,7 +60,7 @@ Mooncake is the serving platform for Kimi, a leading LLM service provided by Moo
git clone -b pooling_async_memecpy_v1 https://github.com/AscendTransport/Mooncake
```
Update and install Python
Update and install Python.
```shell
apt-get update
@@ -97,11 +97,11 @@ cp mooncake-transfer-engine/src/transport/ascend_transport/hccl_transport/ascend
cp mooncake-transfer-engine/src/libtransfer_engine.so /usr/local/Ascend/ascend-toolkit/latest/python/site-packages/
```
## Prefiller / Decoder Deployment
## Prefiller/Decoder Deployment
We can run the following scripts to launch a server on the prefiller/decoder node respectively. Please note that each P/D node will occupy ports ranging from kv_port to kv_port + num_chips to initialize socket listeners. To avoid any issues, port conflicts should be prevented. Additionally, ensure that each node's engine_id is uniquely assigned to avoid conflicts.
We can run the following scripts to launch a server on the prefiller/decoder node, respectively. Please note that each P/D node will occupy ports ranging from kv_port to kv_port + num_chips to initialize socket listeners. To avoid any issues, port conflicts should be prevented. Additionally, ensure that each node's engine_id is uniquely assigned to avoid conflicts.
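For example, assuming a hypothetical `kv_port=8200` on a node with 16 chips, ports from 8200 up to 8200 + 16 = 8216 would be occupied, so any other instance on the same machine should use a `kv_port` above that range and a distinct engine_id.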
### layerwise
### Layerwise
:::::{tab-set}
@@ -223,7 +223,7 @@ vllm serve /model/Qwen3-235B-A22B-W8A8 \
::::
::::{tab-item} Decoder node 1 (master Node)
::::{tab-item} Decoder node 1 (master node)
```shell
unset ftp_proxy
@@ -284,7 +284,7 @@ vllm serve /model/Qwen3-235B-A22B-W8A8 \
::::
::::{tab-item} Decoder node 2 (primary Node)
::::{tab-item} Decoder node 2 (primary node)
```shell
unset ftp_proxy
@@ -348,7 +348,7 @@ vllm serve /model/Qwen3-235B-A22B-W8A8 \
:::::
### non-layerwise
### Non-layerwise
:::::{tab-set}
@@ -595,13 +595,13 @@ vllm serve /model/Qwen3-235B-A22B-W8A8 \
:::::
## Example proxy for Deployment
## Example Proxy for Deployment
Run a proxy server on the same node with prefiller service instance. You can get the proxy program in the repository's examples: [load\_balance\_proxy\_layerwise\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py) or [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
Run a proxy server on the same node with the prefiller service instance. You can get the proxy program in the repository's examples: [load\_balance\_proxy\_layerwise\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py) or [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
:::::{tab-set}
::::{tab-item} layerwise
::::{tab-item} Layerwise
```shell
python load_balance_proxy_layerwise_server_example.py \
@@ -615,7 +615,7 @@ python load_balance_proxy_layerwise_server_example.py \
::::
::::{tab-item} non-layerwise
::::{tab-item} Non-layerwise
```shell
python load_balance_proxy_server_example.py \

View File

@@ -1,15 +1,15 @@
# Multi-Node-DP (Qwen3-VL-235B-A22B)
:::{note}
Qwen3 VL rely on the newest version of `transformers`(>4.56.2). Please install it from source until it's released.
Qwen3 VL relies on the newest version of `transformers` (>4.56.2). Please install it from source.
:::
## Verify Multi-Node Communication Environment
referring to [multi_node.md](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html#verification-process)
Refer to [multi_node.md](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html#verification-process).
## Run with docker
Assume you have an Atlas 800 A3(64G*16) nodes(or 2 * A2), and want to deploy the `Qwen3-VL-235B-A22B-Instruct` model across multi-node.
## Run with Docker
Assume you have Atlas 800 A3 (64G*16) nodes (or two A2 nodes), and want to deploy the `Qwen3-VL-235B-A22B-Instruct` model across multiple nodes.
```{code-block} bash
:substitutions:
@@ -49,13 +49,13 @@ docker run --rm \
-it $IMAGE bash
```
Run the following scripts on two nodes respectively
Run the following scripts on two nodes respectively.
:::{note}
Before launch the inference server, ensure the following environment variables are set for multi node communication
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::
node0
Node 0
```shell
#!/bin/sh
@@ -93,7 +93,7 @@ vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
--gpu-memory-utilization 0.8 \
```
node1
Node 1
```shell
#!/bin/sh
@@ -136,7 +136,7 @@ vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
--gpu-memory-utilization 0.8 \
```
If the service starts successfully, the following information will be displayed on node0:
If the service starts successfully, the following information will be displayed on node 0:
```shell
INFO: Started server process [44610]

View File

@@ -1,10 +1,10 @@
# Multi-Node-Ray (Qwen/Qwen3-235B-A22B)
Multi-node inference is suitable for the scenarios that the model cannot be deployed on a single machine. In such cases, the model can be distributed using tensor parallelism or pipeline parallelism. The specific parallelism strategies will be covered in the following sections. To successfully deploy multi-node inference, the following three steps need to be completed:
Multi-node inference is suitable for scenarios where the model cannot be deployed on a single machine. In such cases, the model can be distributed using tensor parallelism or pipeline parallelism. The specific parallelism strategies will be covered in the following sections. To successfully deploy multi-node inference, the following three steps need to be completed:
* **Verify Multi-Node Communication Environment**
* **Set Up and Start the Ray Cluster**
* **Start the Online Inference Service on multinode**
* **Start the Online Inference Service on Multi-node**
## Verify Multi-Node Communication Environment
@@ -48,9 +48,9 @@ hccn_tool -i 0 -ping -g address 10.20.0.20
## Set Up and Start the Ray Cluster
### Setting Up the Basic Container
To ensure a consistent execution environment across all nodes, including the model path and Python environment, it is recommended to use Docker images.
To ensure a consistent execution environment across all nodes, including the model path and Python environment, it is advised to use Docker images.
For setting up a multi-node inference cluster with Ray, **containerized deployment** is the preferred approach. Containers should be started on both the master and worker nodes, with the `--net=host` option to enable proper network connectivity.
For setting up a multi-node inference cluster with Ray, **containerized deployment** is the preferred approach. Containers should be started on both the primary and secondary nodes, with the `--net=host` option to enable proper network connectivity.
Below is an example container setup command, which should be executed on **all nodes**:
@@ -89,13 +89,13 @@ docker run --rm \
### Start Ray Cluster
After setting up the containers and installing vllm-ascend on each node, follow the steps below to start the Ray cluster and execute inference tasks.
Choose one machine as the head node and the others as worker nodes. Before proceeding, use `ip addr` to check your `nic_name` (network interface name).
Choose one machine as the primary node and the others as secondary nodes. Before proceeding, use `ip addr` to check your `nic_name` (network interface name).
Set the `ASCEND_RT_VISIBLE_DEVICES` environment variable to specify the NPU devices to use. For Ray versions above 2.1, also set the `RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` variable to avoid device recognition issues.
Below are the commands for the head and worker nodes:
Below are the commands for the primary and secondary nodes:
**Head node**:
**Primary node**:
:::{note}
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect.
@@ -112,7 +112,7 @@ export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --head
```
**Worker node**:
**Secondary node**:
:::{note}
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect. Updating the environment variables requires restarting the Ray cluster.
@@ -130,7 +130,7 @@ ray start --address='{head_node_ip}:6379' --node-ip-address={local_ip}
Once the cluster is started on multiple nodes, execute `ray status` and `ray list nodes` to verify the Ray cluster's status. You should see the correct number of nodes and NPUs listed.
## Start the Online Inference Service on multinode scenario
## Start the Online Inference Service in a Multi-node Scenario
In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster.
**You only need to run the vllm command on one node.**
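A minimal sketch (the parallel size assumes two 8-NPU nodes in the Ray cluster):

```bash
# Run on one node only; vLLM dispatches workers across the Ray cluster
vllm serve Qwen/Qwen3-235B-A22B \
    --tensor-parallel-size 16 \
    --distributed-executor-backend ray
```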

View File

@@ -27,7 +27,7 @@ docker run --rm \
-it $IMAGE bash
```
Setup environment variables:
Set up environment variables:
```bash
# Load model from ModelScope to speed up download
@@ -39,13 +39,13 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
### Online Inference on Multi-NPU
Run the following script to start the vLLM server on Multi-NPU:
Run the following script to start the vLLM server on multi-NPU:
```bash
vllm serve Qwen/QwQ-32B --max-model-len 4096 --port 8000 -tp 4
```
Once your server is started, you can query the model with input prompts
Once your server is started, you can query the model with input prompts.
```bash
curl http://localhost:8000/v1/completions \

View File

@@ -27,7 +27,7 @@ docker run --rm \
-it $IMAGE bash
```
Setup environment variables:
Set up environment variables:
```bash
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
@@ -43,7 +43,7 @@ git clone https://gitcode.com/ascend-tribe/pangu-pro-moe-model.git
### Online Inference on Multi-NPU
Run the following script to start the vLLM server on Multi-NPU:
Run the following script to start the vLLM server on multi-NPU:
```bash
vllm serve /path/to/pangu-pro-moe-model \

View File

@@ -1,8 +1,8 @@
# Multi-NPU (QwQ 32B W8A8)
## Run docker container
## Run Docker Container
:::{note}
w8a8 quantization feature is supported by v0.8.4rc2 or higher
The w8a8 quantization feature is supported by v0.8.4rc2 and later.
:::
```{code-block} bash
@@ -28,7 +28,7 @@ docker run --rm \
-it $IMAGE bash
```
## Install modelslim and convert model
## Install ModelSlim and Convert Model
:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded;
see https://www.modelscope.cn/models/vllm-ascend/QwQ-32B-W8A8
@@ -53,8 +53,8 @@ SAVE_PATH=/home/models/QwQ-32B-w8a8
python3 quant_qwen.py --model_path $MODEL_PATH --save_directory $SAVE_PATH --calib_file ../common/boolq.jsonl --w_bit 8 --a_bit 8 --device_type npu --anti_method m1 --trust_remote_code True
```
## Verify the quantized model
The converted model files looks like:
## Verify the Quantized Model
The converted model files look like:
```bash
.
@@ -68,10 +68,10 @@ The converted model files looks like:
`-- tokenizer_config.json
```
Run the following script to start the vLLM server with quantized model:
Run the following script to start the vLLM server with the quantized model:
:::{note}
The value "ascend" for "--quantization" argument will be supported after [a specific PR](https://github.com/vllm-project/vllm-ascend/pull/877) is merged and released, you can cherry-pick this commit for now.
The value "ascend" for "--quantization" argument will be supported after [a specific PR](https://github.com/vllm-project/vllm-ascend/pull/877) is merged and released. You can cherry-pick this commit for now.
:::
```bash
@@ -93,10 +93,10 @@ curl http://localhost:8000/v1/completions \
}'
```
Run the following script to execute offline inference on multi-NPU with quantized model:
Run the following script to execute offline inference on multi-NPU with the quantized model:
:::{note}
To enable quantization for ascend, quantization method must be "ascend"
To enable quantization on Ascend, the quantization method must be "ascend".
:::
```python

View File

@@ -27,7 +27,7 @@ docker run --rm \
-it $IMAGE bash
```
Setup environment variables:
Set up environment variables:
```bash
# Load model from ModelScope to speed up download
@@ -41,13 +41,13 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
Run the following script to start the vLLM server on multi-NPU:
For an Atlas A2 with 64GB of NPU card memory, tensor-parallel-size should be at least 2, and for 32GB of memory, tensor-parallel-size should be at least 4.
For an Atlas A2 with 64 GB of NPU card memory, tensor-parallel-size should be at least 2, and for 32 GB of memory, tensor-parallel-size should be at least 4.
```bash
vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 4 --enable_expert_parallel
```
Once your server is started, you can query the model with input prompts
Once your server is started, you can query the model with input prompts.
```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{

View File

@@ -1,7 +1,7 @@
# Multi-NPU (Qwen3-Next)
```{note}
The Qwen3 Next are using [Triton Ascend](https://gitee.com/ascend/triton-ascend) which is currently experimental. In future versions, there may be behavioral changes around stability, accuracy and performance improvement.
Qwen3 Next uses [Triton Ascend](https://gitee.com/ascend/triton-ascend), which is currently experimental. In future versions, there may be behavioral changes related to stability, accuracy, and performance.
```
## Run vllm-ascend on Multi-NPU with Qwen3 Next
@@ -31,7 +31,7 @@ docker run --rm \
-it $IMAGE bash
```
Setup environment variables:
Set up environment variables:
```bash
# Load model from ModelScope to speed up download
@@ -41,7 +41,7 @@ export VLLM_USE_MODELSCOPE=True
### Install Triton Ascend
:::::{tab-set}
::::{tab-item} Linux (aarch64)
::::{tab-item} Linux (AArch64)
[Triton Ascend](https://gitee.com/ascend/triton-ascend) is required when you run Qwen3 Next. Please follow the instructions below to install it and its dependencies.
@@ -72,7 +72,7 @@ Coming soon ...
### Inference on Multi-NPU
Please make sure you already executed the command:
Please make sure you have already executed the command:
```bash
source /usr/local/Ascend/8.3.RC1/bisheng_toolkit/set_env.sh
@@ -81,15 +81,15 @@ source /usr/local/Ascend/8.3.RC1/bisheng_toolkit/set_env.sh
:::::{tab-set}
::::{tab-item} Online Inference
Run the following script to start the vLLM server on Multi-NPU:
Run the following script to start the vLLM server on multi-NPU:
For an Atlas A2 with 64GB of NPU card memory, tensor-parallel-size should be at least 4, and for 32GB of memory, tensor-parallel-size should be at least 8.
For an Atlas A2 with 64 GB of NPU card memory, tensor-parallel-size should be at least 4, and for 32 GB of memory, tensor-parallel-size should be at least 8.
```bash
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --max-model-len 4096 --gpu-memory-utilization 0.7 --enforce-eager
```
Once your server is started, you can query the model with input prompts
Once your server is started, you can query the model with input prompts.
```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{

View File

@@ -1,13 +1,13 @@
# Single Node (Atlas 300I series)
# Single Node (Atlas 300I Series)
```{note}
1. This Atlas 300I series is currently experimental. In future versions, there may be behavioral changes around model coverage, performance improvement.
2. Currently, the 310I series only supports eager mode and the data type is float16.
3. There are some known issues for running vLLM on 310p series, you can refer to vllm-ascend [<u>#3316</u>](https://github.com/vllm-project/vllm-ascend/issues/3316),
[<u>#2795</u>](https://github.com/vllm-project/vllm-ascend/issues/2795), you can use v0.10.0rc1 version first.
1. The Atlas 300I series is currently experimental. In future versions, there may be behavioral changes related to model coverage and performance improvement.
2. Currently, the 310I series only supports eager mode and the float16 data type.
3. There are some known issues when running vLLM on the 310P series; refer to vllm-ascend [<u>#3316</u>](https://github.com/vllm-project/vllm-ascend/issues/3316) and
[<u>#2795</u>](https://github.com/vllm-project/vllm-ascend/issues/2795). You can use the v0.10.0rc1 version for now.
```
## Run vLLM on Altlas 300I series
## Run vLLM on Atlas 300I Series
Run docker container:
@@ -38,7 +38,7 @@ docker run --rm \
-it $IMAGE bash
```
Setup environment variables:
Set up environment variables:
```bash
# Load model from ModelScope to speed up download
@@ -50,7 +50,7 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
### Online Inference on NPU
Run the following script to start the vLLM server on NPU(Qwen3-0.6B:1 card, Qwen2.5-7B-Instruct:2 cards, Pangu-Pro-MoE-72B: 8 cards):
Run the following script to start the vLLM server on NPU (Qwen3-0.6B: 1 card, Qwen2.5-7B-Instruct: 2 cards, Pangu-Pro-MoE-72B: 8 cards):
:::::{tab-set}
:sync-group: inference
@@ -170,7 +170,7 @@ vllm serve /home/pangu-pro-moe-mode/ \
```
Once your server is started, you can query the model with input prompts
Once your server is started, you can query the model with input prompts.
```bash
export question="你是谁?"

View File

@@ -26,7 +26,7 @@ docker run --rm \
-it $IMAGE bash
```
Setup environment variables:
Set up environment variables:
```bash
# Load model from ModelScope to speed up download
@@ -164,7 +164,7 @@ vllm serve Qwen/Qwen3-8B --max_model_len 26240 --enforce-eager
:::::
:::{note}
Add `--max_model_len` option to avoid ValueError that the Qwen2.5-7B model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (26240). This will differ with different NPU series base on the HBM size. Please modify the value according to a suitable value for your NPU series.
Add the `--max_model_len` option to avoid a ValueError caused by the Qwen2.5-7B model's max seq len (32768) being larger than the maximum number of tokens that can be stored in the KV cache (26240). This limit differs across NPU series depending on the HBM size. Please adjust the value to suit your NPU series.
:::
If your service starts successfully, you can see the info shown below:

View File

@@ -26,7 +26,7 @@ docker run --rm \
-it $IMAGE bash
```
Setup environment variables:
Set up environment variables:
```bash
# Load model from ModelScope to speed up download
@@ -119,4 +119,4 @@ The sport referenced is baseball, and the nursery rhyme is 'Mary Had a Little La
### Online Serving on Single NPU
Currently, vllm's OpenAI-compatible server doesn't support audio inputs, find more details [<u>here</u>](https://github.com/vllm-project/vllm/issues/19977).
Currently, vLLM's OpenAI-compatible server doesn't support audio inputs. Find more details [<u>here</u>](https://github.com/vllm-project/vllm/issues/19977).

View File

@@ -26,7 +26,7 @@ docker run --rm \
-it $IMAGE bash
```
Setup environment variables:
Set up environment variables:
```bash
# Load model from ModelScope to speed up download
@@ -149,10 +149,10 @@ vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
```
:::{note}
Add `--max_model_len` option to avoid ValueError that the Qwen2.5-VL-7B-Instruct model's max seq len (128000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series base on the HBM size. Please modify the value according to a suitable value for your NPU series.
Add the `--max_model_len` option to avoid a ValueError caused by the Qwen2.5-VL-7B-Instruct model's max seq len (128000) being larger than the maximum number of tokens that can be stored in the KV cache. This limit differs across NPU series depending on the HBM size. Please adjust the value to suit your NPU series.
:::
If your service start successfully, you can see the info shown below:
If your service starts successfully, you can see the info shown below:
```bash
INFO: Started server process [2736]

View File

@@ -2,9 +2,9 @@
The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B). This guide describes how to run the model with vLLM Ascend. Note that only 0.9.2rc1 and higher versions of vLLM Ascend support the model.
## Run docker container
## Run Docker Container
Take Qwen3-Embedding-8B model as an example, first run the docker container with the following command:
Using the Qwen3-Embedding-8B model as an example, first run the docker container with the following command:
```{code-block} bash
:substitutions:
@@ -26,7 +26,7 @@ docker run --rm \
-it $IMAGE bash
```
Setup environment variables:
Set up environment variables:
```bash
# Load model from ModelScope to speed up download
@@ -42,7 +42,7 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
vllm serve Qwen/Qwen3-Embedding-8B --task embed
```
Once your server is started, you can query the model with input prompts
Once your server is started, you can query the model with input prompts.
```bash
curl http://localhost:8000/v1/embeddings -H "Content-Type: application/json" -d '{

View File

@@ -1,8 +1,8 @@
# Single-NPU (Qwen3 8B W4A8)
## Run docker container
## Run Docker Container
:::{note}
w4a8 quantization feature is supported by v0.9.1rc2 or higher
The w4a8 quantization feature is supported by v0.9.1rc2 and later.
:::
```{code-block} bash
@@ -25,7 +25,7 @@ docker run --rm \
-it $IMAGE bash
```
## Install modelslim and convert model
## Install ModelSlim and Convert Model
:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded;
see https://www.modelscope.cn/models/vllm-ascend/Qwen3-8B-W4A8
@@ -65,8 +65,8 @@ python quant_qwen.py \
--w_method HQQ
```
## Verify the quantized model
The converted model files looks like:
## Verify the Quantized Model
The converted model files look like:
```bash
.
@@ -84,13 +84,13 @@ The converted model files looks like:
`-- tokenizer_config.json
```
Run the following script to start the vLLM server with quantized model:
Run the following script to start the vLLM server with the quantized model:
```bash
vllm serve /home/models/Qwen3-8B-w4a8 --served-model-name "qwen3-8b-w4a8" --max-model-len 4096 --quantization ascend
```
Once your server is started, you can query the model with input prompts
Once your server is started, you can query the model with input prompts.
```bash
curl http://localhost:8000/v1/completions \
@@ -105,10 +105,10 @@ curl http://localhost:8000/v1/completions \
}'
```
Run the following script to execute offline inference on Single-NPU with quantized model:
Run the following script to execute offline inference on single-NPU with the quantized model:
:::{note}
To enable quantization for ascend, quantization method must be "ascend"
To enable quantization on Ascend, the quantization method must be "ascend".
:::
```python

View File

@@ -1,6 +1,6 @@
# Additional Configuration
additional configuration is a mechanism provided by vLLM to allow plugins to control inner behavior by their own. vLLM Ascend uses this mechanism to make the project more flexible.
Additional configuration is a mechanism provided by vLLM that allows plugins to control inner behavior on their own. vLLM Ascend uses this mechanism to make the project more flexible.
## How to use
@@ -22,52 +22,52 @@ LLM(model="Qwen/Qwen3-8B", additional_config={"config_key":"config_value"})
### Configuration options
The following table lists the additional configuration options available in vLLM Ascend:
The following table lists additional configuration options available in vLLM Ascend:
| Name | Type | Default | Description |
|-------------------------------------|------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| `torchair_graph_config` | dict | `{}` | The config options for torchair graph mode |
| `ascend_scheduler_config` | dict | `{}` | The config options for ascend scheduler |
| `weight_prefetch_config` | dict | `{}` | The config options for weight prefetch |
| `refresh` | bool | `false` | Whether to refresh global ascend config content. This value is usually used by rlhf or ut/e2e test case. |
| `expert_map_path` | str | `None` | When using expert load balancing for the MOE model, an expert map path needs to be passed in. |
| `kv_cache_dtype` | str | `None` | When using the kv cache quantization method, kv cache dtype needs to be set, currently only int8 is supported. |
| `enable_shared_expert_dp` | bool | `False` | When the shared expert in DP, it has better performance but consumes more memory. Currently only DeepSeek series models are supported to use. |
| `torchair_graph_config` | dict | `{}` | Configuration options for torchair graph mode |
| `ascend_scheduler_config` | dict | `{}` | Configuration options for ascend scheduler |
| `weight_prefetch_config` | dict | `{}` | Configuration options for weight prefetch |
| `refresh` | bool | `false` | Whether to refresh global Ascend configuration content. This is usually used by RLHF or UT/e2e test cases. |
| `expert_map_path` | str | `None` | When using expert load balancing for an MoE model, an expert map path needs to be passed in. |
| `kv_cache_dtype` | str | `None` | When using the KV cache quantization method, KV cache dtype needs to be set, currently only int8 is supported. |
| `enable_shared_expert_dp` | bool | `False` | When the expert is shared in DP, it delivers better performance but consumes more memory. Currently only DeepSeek series models are supported. |
| `lmhead_tensor_parallel_size` | int | `None` | The custom tensor parallel size of lmhead. |
| `oproj_tensor_parallel_size` | int | `None` | The custom tensor parallel size of oproj. |
| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable multistream shared expert. This option only takes effects on moe models with shared experts. |
| `dynamic_eplb` | bool | `False` | Whether to enable dynamic eplb |
| `num_iterations_eplb_update` | int | `400` | Forward iterations when eplb would begin |
| `gate_eplb` | bool | `False` | Whether to enale eplb only once. |
| `num_wait_worker_iterations` | int | `30` | The forward iterations when eplb worker will finish cpu task. In our test default value 30 would cover most cases. |
| `expert_map_record_path` | str | `None` | When dynamic eplb is completed, save the current expert load heatmap to the specified path. |
| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable multistream shared expert. This option only takes effects on MoE models with shared experts. |
| `dynamic_eplb` | bool | `False` | Whether to enable dynamic EPLB. |
| `num_iterations_eplb_update` | int | `400` | Forward iterations when EPLB begins. |
| `gate_eplb` | bool | `False` | Whether to enable EPLB only once. |
| `num_wait_worker_iterations` | int | `30` | The number of forward iterations after which the EPLB worker finishes CPU tasks. In our tests, the default value of 30 covers most cases. |
| `expert_map_record_path` | str | `None` | When dynamic EPLB is completed, save the current expert load heatmap to the specified path. |
| `init_redundancy_expert` | int | `0` | Specify redundant experts during initialization. |
The details of each config option are as follows:
The details of each configuration option are as follows:
**torchair_graph_config**
| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| `enabled` | bool | `False` | Whether to enable torchair graph mode. Currently only DeepSeek series models and PanguProMoE are supported to use torchair graph mode |
| `mode` | str | `None` | When using reduce-overhead mode for torchair, mode needs to be set |
| `enable_multistream_mla`| bool | `False` | Whether to put vector ops of MLA to another stream. This option only takes effects on models using MLA (e.g., DeepSeek). |
| `enable_view_optimize` | bool | `True` | Whether to enable torchair view optimization |
| `enabled` | bool | `False` | Whether to enable torchair graph mode. Currently only DeepSeek series models and PanguProMoE are supported. |
| `mode` | str | `None` | When using reduce-overhead mode for torchair, it needs to be set. |
| `enable_multistream_mla`| bool | `False` | Whether to put vector operators of MLA to another stream. This option only takes effect on models using MLA (for example, DeepSeek). |
| `enable_view_optimize` | bool | `True` | Whether to enable torchair view optimization. |
| `enable_frozen_parameter` | bool | `True` | Whether to fix the memory address of weights during inference to reduce the input address refresh time during graph execution. |
| `use_cached_graph` | bool | `False` | Whether to use cached graph |
| `graph_batch_sizes` | list[int] | `[]` | The batch size for torchair graph cache |
| `graph_batch_sizes_init` | bool | `False` | Init graph batch size dynamically if `graph_batch_sizes` is empty |
| `enable_kv_nz`| bool | `False` | Whether to enable kvcache NZ layout. This option only takes effects on models using MLA (e.g., DeepSeek). |
| `use_cached_graph` | bool | `False` | Whether to use cached graph. |
| `graph_batch_sizes` | list[int] | `[]` | The batch size for torchair graph cache. |
| `graph_batch_sizes_init` | bool | `False` | Init graph batch size dynamically if `graph_batch_sizes` is empty. |
| `enable_kv_nz`| bool | `False` | Whether to enable KV Cache NZ layout. This option only takes effect on models using MLA (for example, DeepSeek). |
| `enable_super_kernel` | bool | `False` | Whether to enable the super kernel to fuse operators in DeepSeek MoE layers. This option only takes effect on MoE models using dynamic w8a8 quantization.|
**ascend_scheduler_config**
| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| `enabled` | bool | `False` | Whether to enable ascend scheduler for V1 engine|
| `enable_pd_transfer` | bool | `False` | Whether to enable pd transfer. When using it, decode is started only when prefill of all requests is done. This option only takes effects on offline inference. |
| `decode_max_num_seqs` | int | `0` | Whether to change max_num_seqs of decode phase when enable pd transfer. This option only takes effects when enable_pd_transfer is True. |
| `max_long_partial_prefills` | Union[int, float] | `float('inf')` | the maximum number of prompts longer than long_prefill_token_threshold that will be prefilled concurrently. |
| `enabled` | bool | `False` | Whether to enable ascend scheduler for V1 engine.|
| `enable_pd_transfer` | bool | `False` | Whether to enable P-D transfer. When it is enabled, decode is started only when prefill of all requests is done. This option only takes effect on offline inference. |
| `decode_max_num_seqs` | int | `0` | Whether to change max_num_seqs of decode phase when P-D transfer is enabled. This option only takes effect when enable_pd_transfer is True. |
| `max_long_partial_prefills` | Union[int, float] | `float('inf')` | The maximum number of prompts longer than long_prefill_token_threshold that will be prefilled concurrently. |
| `long_prefill_token_threshold` | Union[int, float] | `float('inf')` | A request is considered long if the prompt is longer than this number of tokens. |
ascend_scheduler_config also supports the options from [vllm scheduler config](https://docs.vllm.ai/en/stable/api/vllm/config.html#vllm.config.SchedulerConfig). For example, you can add `enable_chunked_prefill: True` to ascend_scheduler_config as well, as shown below.
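A minimal sketch of passing such options online (the model name is just an example):

```bash
# Enable the ascend scheduler and pass a vLLM scheduler option through it
vllm serve Qwen/Qwen3-8B \
    --additional-config='{"ascend_scheduler_config": {"enabled": true, "enable_chunked_prefill": true}}'
```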

View File

@@ -2,7 +2,7 @@
## Overview
Expert balancing for MoE models in LLM serving is essential for optimal performance. Dynamically changing experts during inference can negatively impact TTFT (Time To First Token) and TPOT (Tokens Per Output Token) due to stop-the-world operations. SwiftBalancer enables asynchronous expert load balancing with zero-overhead expert movement, ensuring seamless service continuity.
Expert balancing for MoE models in LLM serving is essential for optimal performance. Dynamically changing experts during inference can negatively impact TTFT (Time To First Token) and TPOT (Time Per Output Token) due to stop-the-world operations. SwiftBalancer enables asynchronous expert load balancing with zero-overhead expert movement, ensuring seamless service continuity.
## EPLB Effects
@@ -16,7 +16,7 @@ Expert balancing for MoE models in LLM serving is essential for optimal performa
### Dynamic EPLB
We need to add environment variable `export DYNAMIC_EPLB="true"` to enable vllm eplb. Enable dynamic balancing with auto-tuned parameters. Adjust num_iterations_eplb_update and num_wait_worker_iterations based on workload patterns.
We need to add the environment variable `export PYTHONOPTIMIZE=1` to get context of the vllm process. Enable dynamic balancing with auto-tuned parameters. Adjust num_iterations_eplb_update and num_wait_worker_iterations based on workload patterns.
```shell
vllm serve Qwen/Qwen3-235B-A22 \
@@ -32,7 +32,7 @@ vllm serve Qwen/Qwen3-235B-A22 \
### Static EPLB
#### Initial Setup (Record Expert Map)
We need to add environment variable `export EXPERT_MAP_RECORD="true"` to record expert map.Generate the initial expert distribution map using expert_map_record_path. This creates a baseline configuration for future deployments.
Generate the initial expert distribution map using expert_map_record_path. This creates a baseline configuration for future deployments.
```shell
vllm serve Qwen/Qwen3-235B-A22 \
@@ -61,16 +61,16 @@ vllm serve Qwen/Qwen3-235B-A22 \
## Critical Considerations
1. Parameter Tuning:
- num_iterations_eplb_update: Higher values (e.g., 400+) for stable workloads; lower values (e.g., 100-200) for fluctuating traffic.
- num_wait_worker_iterations: Should be ≥30 to avoid premature balancing during startup.
- num_wait_worker_iterations: Should be ≥ 30 to avoid premature balancing during startup.
- init_redundancy_expert: Must match tensor-parallel size (e.g., 16 for 16 GPUs) to ensure sufficient redundancy.
2. Hardware Requirements:
- Ensure all GPUs have identical memory capacity and compute capabilities.
- Network bandwidth must support expert redistribution traffic (≥10Gbps recommended).
- Ensure that all GPUs have identical memory capacity and compute capabilities.
- Network bandwidth must support expert redistribution traffic (≥ 10 Gbps recommended).
3. Model Compatibility:
- Only MoE models with explicit expert parallelism support (e.g., Qwen3-235B-A22) are compatible.
- Verify model architecture supports dynamic expert routing via --enable-expert-parallel.
- Verify model architecture supports dynamic expert routing through --enable-expert-parallel.
4. Gating Configuration:
- When gate_eplb=true, validate that the gating mechanism can handle expert movement without routing errors.
@@ -83,7 +83,7 @@ vllm serve Qwen/Qwen3-235B-A22 \
6. Startup Behavior:
- Initial requests may experience higher latency during the first balancing cycle (typically 1-2 minutes).
- Avoid sudden traffic spikes during warm-up phase.
- Avoid sudden traffic spikes during the warm-up phase.
7. Common Pitfalls:
- Incorrect tensor-parallel-size vs. actual GPU count → causes resource underutilization.

View File

@@ -4,11 +4,11 @@
This feature is currently experimental. In future versions, there may be behavioral changes related to configuration, coverage, and performance improvement.
```
This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. Please note that graph mode is only available on V1 Engine. And only Qwen, DeepSeek series models are well tested from 0.9.0rc1. We'll make it stable and generalize in the next release.
This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. Please note that graph mode is only available on the V1 Engine, and only Qwen and DeepSeek series models are well tested since 0.9.0rc1. We will make it stable and generalized in the next release.
## Getting Started
From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode by default to keep the same behavior with vLLM. If you hit any issues, please feel free to open an issue on GitHub and fallback to eager mode temporarily by set `enforce_eager=True` when initializing the model.
From v0.9.1rc1 with V1 Engine, vLLM Ascend runs models in graph mode by default to keep the same behavior as vLLM. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model.
There are two kinds of graph mode supported by vLLM Ascend:
- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, only Qwen series models are well tested.
@@ -17,7 +17,7 @@ There are two kinds for graph mode supported by vLLM Ascend:
## Using ACLGraph
ACLGraph is enabled by default. Taking Qwen series models as an example, simply using the V1 Engine is enough.
offline example:
Offline example:
```python
import os
@@ -28,7 +28,7 @@ model = LLM(model="Qwen/Qwen2-7B-Instruct")
outputs = model.generate("Hello, how are you?")
```
online example:
Online example:
```shell
vllm serve Qwen/Qwen2-7B-Instruct
@@ -36,9 +36,9 @@ vllm serve Qwen/Qwen2-7B-Instruct
## Using TorchAirGraph
If you want to run DeepSeek series models with graph mode, you should use [TorchAirGraph](https://www.hiascend.com/document/detail/zh/Pytorch/700/modthirdparty/torchairuseguide/torchair_0002.html). In this case, additional config is required.
If you want to run DeepSeek series models with the graph mode, you should use [TorchAirGraph](https://www.hiascend.com/document/detail/zh/Pytorch/700/modthirdparty/torchairuseguide/torchair_0002.html). In this case, additional configuration is required.
offline example:
Offline example:
```python
import os
@@ -49,19 +49,19 @@ model = LLM(model="deepseek-ai/DeepSeek-R1-0528", additional_config={"torchair_g
outputs = model.generate("Hello, how are you?")
```
online example:
Online example:
```shell
vllm serve deepseek-ai/DeepSeek-R1-0528 --additional-config='{"torchair_graph_config": {"enabled": true}, "ascend_scheduler_config": {"enabled": true}}'
```
You can find more detail about additional config [here](../configuration/additional_config.md).
You can find more details about additional configuration [here](../configuration/additional_config.md).
## Fallback to Eager Mode
## Falling Back to Eager Mode
If both `ACLGraph` and `TorchAirGraph` fail to run, you should fallback to eager mode.
If both `ACLGraph` and `TorchAirGraph` fail to run, you should fall back to eager mode.
offline example:
Offline example:
```python
import os
@@ -71,7 +71,7 @@ model = LLM(model="someother_model_weight", enforce_eager=True)
outputs = model.generate("Hello, how are you?")
```
online example:
Online example:
```shell
vllm serve Qwen/Qwen2-7B-Instruct --enforce-eager

View File

@@ -8,7 +8,7 @@ You can refer to [Supported Models](https://docs.vllm.ai/en/latest/models/suppor
You can run LoRA with ACLGraph mode now. Please refer to the [Graph Mode Guide](./graph_mode.md) for better LoRA performance.
## Example
We show a simple LoRA example here, which enables the ACLGraph mode as default.
We provide a simple LoRA example here, which enables the ACLGraph mode by default.
```shell
vllm serve meta-llama/Llama-2-7b \
@@ -20,4 +20,4 @@ vllm serve meta-llama/Llama-2-7b \
We have implemented LoRA-related AscendC operators, such as bgmv_shrink, bgmv_expand, sgmv_shrink and sgmv_expand. You can find them under the "csrc/kernels" directory of [vllm-ascend repo](https://github.com/vllm-project/vllm-ascend.git).
When you install vllm and vllm-ascend, those operators mentioned above will be compiled and installed automatically. If you don't want to use AscendC operators when you run vllm-ascend, you should set `COMPILE_CUSTOM_KERNELS=0` and reinstall vllm-ascend. To require more instructions about installation and compilation, you can refer to [installation guide](../../installation.md).
When you install vllm and vllm-ascend, the operators mentioned above will be compiled and installed automatically. If you do not want to use AscendC operators when you run vllm-ascend, set `COMPILE_CUSTOM_KERNELS=0` and reinstall vllm-ascend. For more instructions about installation and compilation, refer to the [installation guide](../../installation.md).

View File

@@ -2,13 +2,13 @@
Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of its weights and activation values, thereby saving memory and improving inference speed.
Since 0.9.0rc2 version, quantization feature is experimentally supported in vLLM Ascend. Users can enable quantization feature by specifying `--quantization ascend`. Currently, only Qwen, DeepSeek series models are well tested. Well support more quantization algorithm and models in the future.
Since version 0.9.0rc2, the quantization feature is experimentally supported by vLLM Ascend. Users can enable the quantization feature by specifying `--quantization ascend`. Currently, only Qwen, DeepSeek series models are well tested. We will support more quantization algorithms and models in the future.
## Install modelslim
## Install ModelSlim
To quantize a model, users should install [ModelSlim](https://gitee.com/ascend/msit/blob/master/msmodelslim/README.md) which is the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.
To quantize a model, you should install [ModelSlim](https://gitee.com/ascend/msit/blob/master/msmodelslim/README.md), the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, built upon the Ascend platform.
Install modelslim:
Install ModelSlim:
```bash
# The branch (br_release_MindStudio_8.1.RC2_TR5_20260624) has been verified
@@ -23,16 +23,16 @@ pip install accelerate
## Quantize model
:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded,
see https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8
This conversion process will require a larger CPU memory, please ensure that the RAM size is greater than 2TB
You can choose to convert the model yourself or use the quantized model we uploaded.
See https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8.
This conversion process requires a large amount of CPU memory; ensure that the RAM size is greater than 2 TB.
:::
### Adapts and change
### Adapt to changes
1. Ascend does not support the `flash_attn` library. To run the model, you need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and comment out certain parts of the code in `modeling_deepseek.py` located in the weights folder.
2. The current version of transformers does not support loading weights in FP8 quantization format. you need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and delete the quantization related fields from `config.json` in the weights folder
2. The current version of transformers does not support loading weights in FP8 quantization format. You need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and delete the quantization-related fields from `config.json` in the weights folder.
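For step 2, the deletion can be scripted. Below is a minimal sketch; the checkpoint path is an assumption, and the field name follows the DeepSeek-style `config.json`:

```python
# Minimal sketch, assuming the weights folder contains a DeepSeek-style
# config.json with a top-level "quantization_config" entry.
import json

config_path = "/path/to/DeepSeek-R1/config.json"  # hypothetical weights folder

with open(config_path) as f:
    config = json.load(f)

# Drop the FP8 quantization fields so transformers can load the weights.
config.pop("quantization_config", None)

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```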
### Generate the w8a8 weights
### Generate the W8A8 weights
```bash
cd example/DeepSeek
@@ -63,7 +63,7 @@ Here is the full converted model files except safetensors:
## Run the model
Now, you can run the quantized models with vLLM Ascend. Here is the example for online and offline inference.
Now, you can run the quantized model with vLLM Ascend. Examples for online and offline inference are provided as follows:
### Offline inference
@@ -93,26 +93,25 @@ for output in outputs:
### Online inference
Enable quantization by specifying `--quantization ascend`, for more details, see DeepSeek-V3-W8A8 [tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html)
Enable quantization by specifying `--quantization ascend`. For more details, see the [DeepSeek-V3-W8A8 Tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html).
## FAQs
### 1. How to solve the KeyError: 'xxx.layers.0.self_attn.q_proj.weight' problem?
### 1. How to solve the KeyError "xxx.layers.0.self_attn.q_proj.weight"?
First, make sure you specify `ascend` quantization method. Second, check if your model is converted by this `br_release_MindStudio_8.1.RC2_TR5_20260624` modelslim version. Finally, if it still doesn't work, please
submit a issue, maybe some new models need to be adapted.
First, make sure you specify `ascend` as the quantization method. Second, check whether your model was converted with the `br_release_MindStudio_8.1.RC2_TR5_20260624` ModelSlim version. Finally, if it still does not work, submit an issue; some new models may need to be adapted.
### 2. How to solve the error "Could not locate the configuration_deepseek.py"?
Please convert DeepSeek series models using `br_release_MindStudio_8.1.RC2_TR5_20260624` modelslim, this version has fixed the missing configuration_deepseek.py error.
Please convert DeepSeek series models using the `br_release_MindStudio_8.1.RC2_TR5_20260624` ModelSlim version, which has fixed the missing configuration_deepseek.py error.
### 3. When converting deepseek series models with modelslim, what should you pay attention?
### 3. What should be considered when converting DeepSeek series models with ModelSlim?
When the mla portion of the weights used `W8A8_DYNAMIC` quantization, if torchair graph mode is enabled, please modify the configuration file in the CANN package to prevent incorrect inference results.
When the MLA portion of the weights uses `W8A8_DYNAMIC` quantization and the torchair graph mode is enabled, modify the configuration file in the CANN package to prevent incorrect inference results.
The operation steps are as follows:
1. Search in the CANN package directory used, for example:
1. Search in the CANN package directory, for example:
find /usr/local/Ascend/ -name fusion_config.json
2. Add `"AddRmsNormDynamicQuantFusionPass":"off",` and `"MultiAddRmsNormDynamicQuantFusionPass":"off",` to the fusion_config.json you find, the location is as follows:

View File

@@ -8,9 +8,9 @@ Since the generation and training phases may employ different model parallelism
## Getting started
With `enable_sleep_mode=True`, the way we manage memory(malloc, free) in vllm will under a specific memory pool, during loading model and initialize kv_caches, we tag the memory as a map: `{"weight": data, "kv_cache": data}`.
With `enable_sleep_mode=True`, memory management (malloc, free) in vLLM is handled under a specific memory pool. During model loading and KV cache initialization, we tag the memory as a map: `{"weight": data, "kv_cache": data}`.
The engine(v0/v1) supports two sleep levels to manage memory during idle periods:
The engine (v0/v1) supports two sleep levels to manage memory during idle periods:
- Level 1 Sleep
- Action: Offloads model weights and discards the KV cache.
@@ -20,16 +20,16 @@ The engine(v0/v1) supports two sleep levels to manage memory during idle periods
- Level 2 Sleep
- Action: Discards both model weights and KV cache.
- Memory: The content of both the model weights and kv cache is forgotten.
- Memory: The content of both the model weights and KV cache is forgotten.
- Use Case: Ideal when switching to a different model or updating the current one.
Since this feature uses the low-level API [AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html), in order to use sleep mode, you should follow the [installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) and building from source, if you are using v0.7.3, remember to set `export COMPILE_CUSTOM_KERNELS=1`, for the latest version(v0.9.x+), the environment variable `COMPILE_CUSTOM_KERNELS` will be set 1 by default while building from source.
Since this feature uses the low-level API [AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html), in order to use sleep mode, you should follow the [installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) and build from source. If you are using v0.7.3, remember to set `export COMPILE_CUSTOM_KERNELS=1`. For the latest version (v0.9.x+), the environment variable `COMPILE_CUSTOM_KERNELS` will be set to 1 by default while building from source.
## Usage
The following is a simple example of how to use sleep mode.
- offline inference:
- Offline inference:
```python
import os
@@ -68,9 +68,9 @@ The following is a simple example of how to use sleep mode.
assert output[0].outputs[0].text == output2[0].outputs[0].text
```
- online serving:
- Online serving:
:::{note}
Considering there may be a risk of malicious access, please make sure you are under a dev-mode, and explicit specify the develop env: `VLLM_SERVER_DEV_MODE` to expose these endpoints(sleep/wake up).
Considering there may be a risk of malicious access, please make sure you are in dev mode, and explicitly set the environment variable `VLLM_SERVER_DEV_MODE` to expose these endpoints (sleep/wake up).
:::
```bash

View File

@@ -2,34 +2,34 @@
## Overview
### What is Structured Output?
### What is structured output?
LLMs can be unpredictable when you need output in specific formats. Think of asking a model to generate JSON - without guidance, it might produce valid text that breaks JSON specification. **Structured Output (also called Guided Decoding)** enables LLMs to generate outputs that follow a desired structure while preserving the non-deterministic nature of the system.
LLMs can be unpredictable when you need output in specific formats. Think of asking a model to generate JSON: without guidance, it might produce valid text that breaks the JSON specification. **Structured Output (also known as Guided Decoding)** enables LLMs to generate outputs that follow a desired structure while preserving the non-deterministic nature of the system.
In simple terms, structured decoding gives LLMs a template to follow. Users provide a schema that influences the models output, ensuring compliance with the desired structure.
In simple terms, structured decoding gives LLMs a "template" to follow. Users provide a schema that "influences" the model output, ensuring compliance with the desired structure.
![structured decoding](./images/structured_output_1.png)
### Structured Output in vllm-ascend
### Structured output in vllm-ascend
Currently, vllm-ascend supports **xgrammar** and **guidance** backend for structured output with vllm v1 engine.
Currently, vllm-ascend supports **xgrammar** and **guidance** backends for structured output with the vLLM V1 engine.
XGrammar introduces a new technique that batch constrained decoding via pushdown automaton (PDA). You can think of a PDA as a collection of FSMs, and each FSM represents a context-free grammar (CFG). One significant advantage of PDA is its recursive nature, allowing us to execute multiple state transitions. They also include additional optimisation (for those who are interested) to reduce grammar compilation overhead. Besides, you can also find more details about guidance by yourself.
XGrammar introduces a new technique for batch constrained decoding through a pushdown automaton (PDA). You can think of a PDA as a “collection of FSMs, where each FSM represents a context-free grammar (CFG).” One significant advantage of the PDA is its recursive nature, allowing multiple state transitions to be executed. It also includes additional optimizations (for those who are interested) to reduce grammar compilation overhead. You can likewise explore the guidance backend for more details on your own.
## How to Use Structured Output?
## How to use structured output?
### Online Inference
### Online inference
You can also generate structured outputs using the OpenAI's Completions and Chat API. The following parameters are supported, which must be added as extra parameters:
You can generate structured outputs using OpenAI's Completions and Chat APIs. The following parameters are supported and must be added as extra parameters:
- `guided_choice`: the output will be exactly one of the choices.
- `guided_regex`: the output will follow the regex pattern.
- `guided_json`: the output will follow the JSON schema.
- `guided_grammar`: the output will follow the context free grammar.
Structured outputs are supported by default in the OpenAI-Compatible Server. You can choose to specify the backend to use by setting the `--guided-decoding-backend` flag to vllm serve. The default backend is `auto`, which will try to choose an appropriate backend based on the details of the request. You may also choose a specific backend, along with some options.
Structured outputs are supported by default in an OpenAI-compatible server. You can specify the backend by passing the `--guided-decoding-backend` flag to `vllm serve`. The default backend is `auto`, which will try to choose an appropriate backend based on the details of the request. You may also choose a specific backend, along with some options.
Now let´s see an example for each of the cases, starting with the guided_choice, as it´s the easiest one:
The following are examples for each of the cases, starting with guided_choice, as it's the easiest one:
```python
from openai import OpenAI
@@ -64,7 +64,7 @@ completion = client.chat.completions.create(
print(completion.choices[0].message.content)
```
One of the most relevant features in structured text generation is the option to generate a valid JSON with pre-defined fields and formats. For this we can use the guided_json parameter in two different ways:
One of the most relevant features in structured text generation is the option to generate valid JSON with pre-defined fields and formats. To achieve this, we can use the guided_json parameter in two different ways:
- Using a JSON Schema.
- Defining a Pydantic model and then extracting the JSON Schema from it (a sketch of this route follows below).
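A minimal sketch of the Pydantic route (the server address, served model name, and schema fields are assumptions):

```python
# Minimal sketch, assuming an OpenAI-compatible vLLM server on localhost:8000.
from openai import OpenAI
from pydantic import BaseModel


class CarDescription(BaseModel):  # hypothetical schema
    brand: str
    model: str
    car_type: str


client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
    model="Qwen/Qwen2-7B-Instruct",  # hypothetical served model
    messages=[{"role": "user", "content": "Generate a JSON describing a car."}],
    extra_body={"guided_json": CarDescription.model_json_schema()},
)
print(completion.choices[0].message.content)
```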
@@ -101,7 +101,7 @@ completion = client.chat.completions.create(
print(completion.choices[0].message.content)
```
Finally we have the guided_grammar option, which is probably the most difficult to use, but it´s really powerful. It allows us to define complete languages like SQL queries. It works by using a context free EBNF grammar. As an example, we can use to define a specific format of simplified SQL queries:
Finally, we have the guided_grammar option, which is probably the most difficult to use, but it's really powerful. It allows us to define complete languages like SQL queries. It works by using a context-free EBNF grammar. As an example, we can define a specific format of simplified SQL queries:
```python
simplified_sql_grammar = """
@@ -133,16 +133,16 @@ print(completion.choices[0].message.content)
Find more examples [here](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/structured_outputs.py).
### Offline Inference
### Offline inference
To use Structured Output, we'll need to configure the guided decoding using the class `GuidedDecodingParams` inside `SamplingParams`. The main available options inside `GuidedDecodingParams` are:
To use structured output, we need to configure the guided decoding using the class `GuidedDecodingParams` inside `SamplingParams`. The main available options inside `GuidedDecodingParams` are:
- json
- regex
- choice
- grammar
One example for the usage of the choice parameter is shown below:
One example for using the choice parameter is shown below:
```python
from vllm import LLM, SamplingParams

View File

@@ -1,4 +1,4 @@
# Release note
# Release Notes
## v0.11.0rc0 - 2025.09.30
@@ -17,7 +17,7 @@ This is the special release candidate of v0.11.0 for vLLM Ascend. Please follow
- Mooncake store kvcache connector is supported now. [#2913](https://github.com/vllm-project/vllm-ascend/pull/2913)
- CPU offload connector is supported now. [#1659](https://github.com/vllm-project/vllm-ascend/pull/1659)
### Other
### Others
- Qwen3-next is stable now. [#3007](https://github.com/vllm-project/vllm-ascend/pull/3007)
- Fixed a lot of bugs introduced in v0.10.2 by Qwen3-next. [#2964](https://github.com/vllm-project/vllm-ascend/pull/2964) [#2781](https://github.com/vllm-project/vllm-ascend/pull/2781) [#3070](https://github.com/vllm-project/vllm-ascend/pull/3070) [#3113](https://github.com/vllm-project/vllm-ascend/pull/3113)
@@ -30,8 +30,8 @@ This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the
### Highlights
- Add support for Qwen3 Next. Please note that expert parallel and MTP feature doesn't work with this release. We'll make it work enough soon. Follow the [official guide](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_qwen3_next.html) to get start [#2917](https://github.com/vllm-project/vllm-ascend/pull/2917)
- Add quantization support for aclgraph [#2841](https://github.com/vllm-project/vllm-ascend/pull/2841)
- Added support for Qwen3-Next. Please note that the expert parallel and MTP features don't work with this release. We will make them work soon. Follow the [official guide](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_qwen3_next.html) to get started. [#2917](https://github.com/vllm-project/vllm-ascend/pull/2917)
- Added quantization support for aclgraph. [#2841](https://github.com/vllm-project/vllm-ascend/pull/2841)
### Core
@@ -41,15 +41,15 @@ This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the
- Improved the performance with async scheduler enabled. [#2783](https://github.com/vllm-project/vllm-ascend/pull/2783)
- Fixed the performance regression with non-MLA models when using the default scheduler. [#2894](https://github.com/vllm-project/vllm-ascend/pull/2894)
### Other
- The performance of w8a8 quantization is improved. [#2275](https://github.com/vllm-project/vllm-ascend/pull/2275)
- The performance of moe model is improved. [#2689](https://github.com/vllm-project/vllm-ascend/pull/2689) [#2842](https://github.com/vllm-project/vllm-ascend/pull/2842)
### Others
- The performance of W8A8 quantization is improved. [#2275](https://github.com/vllm-project/vllm-ascend/pull/2275)
- The performance is improved for moe models. [#2689](https://github.com/vllm-project/vllm-ascend/pull/2689) [#2842](https://github.com/vllm-project/vllm-ascend/pull/2842)
- Fixed a resource limit error when applying speculative decoding and aclgraph. [#2472](https://github.com/vllm-project/vllm-ascend/pull/2472)
- Fixed the git config error in docker images. [#2746](https://github.com/vllm-project/vllm-ascend/pull/2746)
- Fixed the git config error in Docker images. [#2746](https://github.com/vllm-project/vllm-ascend/pull/2746)
- Fixed the sliding window attention bug with prefill. [#2758](https://github.com/vllm-project/vllm-ascend/pull/2758)
- The official doc for Prefill Decode Disaggregation with Qwen3 is added. [#2751](https://github.com/vllm-project/vllm-ascend/pull/2751)
- The official doc for Prefill-Decode Disaggregation with Qwen3 is added. [#2751](https://github.com/vllm-project/vllm-ascend/pull/2751)
- `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` env works again. [#2740](https://github.com/vllm-project/vllm-ascend/pull/2740)
- A new improvement for oproj in deepseek is added. Set `oproj_tensor_parallel_size` to enable this feature[#2167](https://github.com/vllm-project/vllm-ascend/pull/2167)
- A new improvement for oproj in deepseek is added. Set `oproj_tensor_parallel_size` to enable this feature. [#2167](https://github.com/vllm-project/vllm-ascend/pull/2167)
- Fix a bug that deepseek with torchair doesn't work as expected when `graph_batch_sizes` is set. [#2760](https://github.com/vllm-project/vllm-ascend/pull/2760)
- Avoid duplicate generation of sin_cos_cache in rope when kv_seqlen > 4k. [#2744](https://github.com/vllm-project/vllm-ascend/pull/2744)
- The performance of Qwen3 dense model is improved with flashcomm_v1. Set `VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1` and `VLLM_ASCEND_ENABLE_FLASHCOMM=1` to enable it. [#2779](https://github.com/vllm-project/vllm-ascend/pull/2779)
@@ -59,10 +59,10 @@ This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the
- Add warm_up_atb step to speed up the inference. [#2823](https://github.com/vllm-project/vllm-ascend/pull/2823)
- Fixed the aclgraph stream error for MoE models. [#2827](https://github.com/vllm-project/vllm-ascend/pull/2827)
### Known issue
- The server will be hang when running Prefill Decode Disaggregation with different TP size for P and D. It's fixed by [vLLM commit](https://github.com/vllm-project/vllm/pull/23917) which is not included in v0.10.2. You can pick this commit to fix the issue.
- The HBM usage of Qwen3 Next is higher than expected. It's a [known issue](https://github.com/vllm-project/vllm-ascend/issues/2884) and we're working on it. You can set `max_model_len` and `gpu_memory_utilization` to suitable value basing on your parallel config to avoid oom error.
- We notice that lora doesn't work with this release due to the refactor of kv cache. We'll fix it soon. [2941](https://github.com/vllm-project/vllm-ascend/issues/2941)
### Known Issues
- The server will hang when running Prefill Decode Disaggregation with different TP sizes for P and D. It is fixed by a [vLLM commit](https://github.com/vllm-project/vllm/pull/23917) which is not included in v0.10.2. You can cherry-pick this commit to fix the issue.
- The HBM usage of Qwen3-Next is higher than expected. It is a [known issue](https://github.com/vllm-project/vllm-ascend/issues/2884) and we are working on it. You can set `max_model_len` and `gpu_memory_utilization` to suitable values based on your parallel configuration to avoid OOM errors.
- We notice that LoRA does not work with this release due to the refactor of KV cache. We will fix it soon. [#2941](https://github.com/vllm-project/vllm-ascend/issues/2941)
- Please do not enable chunked prefill with prefix cache when running with the Ascend scheduler; otherwise, performance will degrade and accuracy may be incorrect. [#2943](https://github.com/vllm-project/vllm-ascend/issues/2943)
## v0.10.1rc1 - 2025.09.04
@@ -75,40 +75,40 @@ This is the 1st release candidate of v0.10.1 for vLLM Ascend. Please follow the
- Support capturing custom ops into aclgraph now. [#2113](https://github.com/vllm-project/vllm-ascend/pull/2113)
### Core
- Add MLP tensor parallel to improve performance, but note that this will increase memory usage. [#2120](https://github.com/vllm-project/vllm-ascend/pull/2120)
- Added MLP tensor parallel to improve performance, but note that this will increase memory usage. [#2120](https://github.com/vllm-project/vllm-ascend/pull/2120)
- openEuler is upgraded to 24.03. [#2631](https://github.com/vllm-project/vllm-ascend/pull/2631)
- Add custom lmhead tensor parallel to achieve reduced memory consumption and improved TPOT performance. [#2309](https://github.com/vllm-project/vllm-ascend/pull/2309)
- Added custom lmhead tensor parallel to achieve reduced memory consumption and improved TPOT performance. [#2309](https://github.com/vllm-project/vllm-ascend/pull/2309)
- Qwen3 MoE/Qwen2.5 support torchair graph now. [#2403](https://github.com/vllm-project/vllm-ascend/pull/2403)
- Support Sliding Window Attention with AscendScheduler, thus fixing the Gemma3 accuracy issue. [#2528](https://github.com/vllm-project/vllm-ascend/pull/2528)
### Other
### Others
- Bug fixes:
* Update the graph capture size calculation, somehow alleviated the problem that npu stream not enough in some scenarios [#2511](https://github.com/vllm-project/vllm-ascend/pull/2511)
* Fix bugs and refactor cached mask generation logic. [#2442](https://github.com/vllm-project/vllm-ascend/pull/2442)
* Fix the nz format does not work in quantization scenarios. [#2549](https://github.com/vllm-project/vllm-ascend/pull/2549)
* Fix accuracy issue on Qwen series caused by enabling `enable_shared_pert_dp` by default. [#2457](https://github.com/vllm-project/vllm-ascend/pull/2457)
* Fix accuracy issue on models whose rope dim is not equal to head dim, e.g., GLM4.5. [#2601](https://github.com/vllm-project/vllm-ascend/pull/2601)
* Updated the graph capture size calculation, which somewhat alleviated the problem of NPU streams running out in some scenarios. [#2511](https://github.com/vllm-project/vllm-ascend/pull/2511)
* Fixed bugs and refactored the cached mask generation logic. [#2442](https://github.com/vllm-project/vllm-ascend/pull/2442)
* Fixed the issue that the nz format does not work in quantization scenarios. [#2549](https://github.com/vllm-project/vllm-ascend/pull/2549)
* Fixed the accuracy issue on Qwen series caused by enabling `enable_shared_pert_dp` by default. [#2457](https://github.com/vllm-project/vllm-ascend/pull/2457)
* Fixed the accuracy issue on models whose rope dim is not equal to head dim, e.g., GLM4.5. [#2601](https://github.com/vllm-project/vllm-ascend/pull/2601)
- Performance improved through a number of PRs:
* Remove torch.cat and replace it by List[0]. [#2153](https://github.com/vllm-project/vllm-ascend/pull/2153)
* Convert the format of gmm to nz. [#2474](https://github.com/vllm-project/vllm-ascend/pull/2474)
* Optimize parallel strategies to reduce communication overhead [#2198](https://github.com/vllm-project/vllm-ascend/pull/2198)
* Optimize reject sampler in greedy situation [#2137](https://github.com/vllm-project/vllm-ascend/pull/2137)
- A batch of refactoring prs to enhance the code architecture:
* Removed torch.cat and replaced it with List[0]. [#2153](https://github.com/vllm-project/vllm-ascend/pull/2153)
* Converted the format of gmm to nz. [#2474](https://github.com/vllm-project/vllm-ascend/pull/2474)
* Optimized parallel strategies to reduce communication overhead. [#2198](https://github.com/vllm-project/vllm-ascend/pull/2198)
* Optimized reject sampler in greedy situation. [#2137](https://github.com/vllm-project/vllm-ascend/pull/2137)
- A batch of refactoring PRs to enhance the code architecture:
* Refactor on MLA. [#2465](https://github.com/vllm-project/vllm-ascend/pull/2465)
* Refactor on torchair fused_moe. [#2438](https://github.com/vllm-project/vllm-ascend/pull/2438)
* Refactor on allgather/mc2-related fused_experts. [#2369](https://github.com/vllm-project/vllm-ascend/pull/2369)
* Refactor on torchair model runner. [#2208](https://github.com/vllm-project/vllm-ascend/pull/2208)
* Refactor on CI. [#2276](https://github.com/vllm-project/vllm-ascend/pull/2276)
- Parameter changes:
* Add `lmhead_tensor_parallel_size` in `additional_config`, set it to enable lmhead tensor parallel. [#2309](https://github.com/vllm-project/vllm-ascend/pull/2309)
* Some unused environ variables `HCCN_PATH`, `PROMPT_DEVICE_ID`, `DECODE_DEVICE_ID`, `LLMDATADIST_COMM_PORT` and `LLMDATADIST_SYNC_CACHE_WAIT_TIME` are removed. [#2448](https://github.com/vllm-project/vllm-ascend/pull/2448)
* Environ variable `VLLM_LLMDD_RPC_PORT` is renamed to `VLLM_ASCEND_LLMDD_RPC_PORT` now. [#2450](https://github.com/vllm-project/vllm-ascend/pull/2450)
* Add `VLLM_ASCEND_ENABLE_MLP_OPTIMIZE` in environ variables, Whether to enable mlp optimize when tensor parallel is enabled, this feature in eager mode will get better performance. [#2120](https://github.com/vllm-project/vllm-ascend/pull/2120)
* Remove `MOE_ALL2ALL_BUFFER` and `VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ` in environ variables.[#2612](https://github.com/vllm-project/vllm-ascend/pull/2612)
* Add `enable_prefetch` in `additional_config`, whether to enable weight prefetch. [#2465](https://github.com/vllm-project/vllm-ascend/pull/2465)
* Add `mode` in `additional_config.torchair_graph_config`, When using reduce-overhead mode for torchair, mode needs to be set. [#2461](https://github.com/vllm-project/vllm-ascend/pull/2461)
* `enable_shared_expert_dp` in `additional_config` is disabled by default now, and it is recommended to enable when inferencing with deepseek. [#2457](https://github.com/vllm-project/vllm-ascend/pull/2457)
* Added `lmhead_tensor_parallel_size` in `additional_config`; set it to enable lmhead tensor parallel. [#2309](https://github.com/vllm-project/vllm-ascend/pull/2309)
* Some unused environment variables `HCCN_PATH`, `PROMPT_DEVICE_ID`, `DECODE_DEVICE_ID`, `LLMDATADIST_COMM_PORT` and `LLMDATADIST_SYNC_CACHE_WAIT_TIME` are removed. [#2448](https://github.com/vllm-project/vllm-ascend/pull/2448)
* Environment variable `VLLM_LLMDD_RPC_PORT` is renamed to `VLLM_ASCEND_LLMDD_RPC_PORT` now. [#2450](https://github.com/vllm-project/vllm-ascend/pull/2450)
* Added the `VLLM_ASCEND_ENABLE_MLP_OPTIMIZE` environment variable, which controls whether to enable the MLP optimization when tensor parallel is enabled; this feature gets better performance in eager mode. [#2120](https://github.com/vllm-project/vllm-ascend/pull/2120)
* Removed the `MOE_ALL2ALL_BUFFER` and `VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ` environment variables. [#2612](https://github.com/vllm-project/vllm-ascend/pull/2612)
* Added `enable_prefetch` in `additional_config`, which controls whether to enable weight prefetch. [#2465](https://github.com/vllm-project/vllm-ascend/pull/2465)
* Added `mode` in `additional_config.torchair_graph_config`. When using the reduce-overhead mode for torchair, `mode` needs to be set. [#2461](https://github.com/vllm-project/vllm-ascend/pull/2461)
* `enable_shared_expert_dp` in `additional_config` is disabled by default now, and it is recommended to be enabled when inferencing with deepseek. [#2457](https://github.com/vllm-project/vllm-ascend/pull/2457)
### Known Issues
@@ -208,7 +208,7 @@ This is the 1st release candidate of v0.10.0 for vLLM Ascend. Please follow the
- Torchair graph mode works with tp > 4 now. [#1508](https://github.com/vllm-project/vllm-ascend/issues/1508)
- MTP supports torchair graph mode now. [#2145](https://github.com/vllm-project/vllm-ascend/pull/2145)
### Other
### Others
- Bug fixes:
* Fix functional problem of multi-modality models like Qwen2-audio with Aclgraph. [#1803](https://github.com/vllm-project/vllm-ascend/pull/1803)
@@ -255,7 +255,7 @@ This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
- Dynamic EPLB support in [#1943](https://github.com/vllm-project/vllm-ascend/pull/1943)
- Disaggregated Prefilling support for V1 Engine and improvement, continued development and stabilization of the disaggregated prefill feature, including performance enhancements and bug fixes for single-machine setups: [#1953](https://github.com/vllm-project/vllm-ascend/pull/1953) [#1612](https://github.com/vllm-project/vllm-ascend/pull/1612) [#1361](https://github.com/vllm-project/vllm-ascend/pull/1361) [#1746](https://github.com/vllm-project/vllm-ascend/pull/1746) [#1552](https://github.com/vllm-project/vllm-ascend/pull/1552) [#1801](https://github.com/vllm-project/vllm-ascend/pull/1801) [#2083](https://github.com/vllm-project/vllm-ascend/pull/2083) [#1989](https://github.com/vllm-project/vllm-ascend/pull/1989)
### Models improvement:
### Model Improvement
- DeepSeek DBO support and improvement: [#1285](https://github.com/vllm-project/vllm-ascend/pull/1285) [#1291](https://github.com/vllm-project/vllm-ascend/pull/1291) [#1328](https://github.com/vllm-project/vllm-ascend/pull/1328) [#1420](https://github.com/vllm-project/vllm-ascend/pull/1420) [#1445](https://github.com/vllm-project/vllm-ascend/pull/1445) [#1589](https://github.com/vllm-project/vllm-ascend/pull/1589) [#1759](https://github.com/vllm-project/vllm-ascend/pull/1759) [#1827](https://github.com/vllm-project/vllm-ascend/pull/1827) [#2093](https://github.com/vllm-project/vllm-ascend/pull/2093)
- DeepSeek MTP improvement and bugfix: [#1214](https://github.com/vllm-project/vllm-ascend/pull/1214) [#943](https://github.com/vllm-project/vllm-ascend/pull/943) [#1584](https://github.com/vllm-project/vllm-ascend/pull/1584) [#1473](https://github.com/vllm-project/vllm-ascend/pull/1473) [#1294](https://github.com/vllm-project/vllm-ascend/pull/1294) [#1632](https://github.com/vllm-project/vllm-ascend/pull/1632) [#1694](https://github.com/vllm-project/vllm-ascend/pull/1694) [#1840](https://github.com/vllm-project/vllm-ascend/pull/1840) [#2076](https://github.com/vllm-project/vllm-ascend/pull/2076) [#1990](https://github.com/vllm-project/vllm-ascend/pull/1990) [#2019](https://github.com/vllm-project/vllm-ascend/pull/2019)
- Qwen3 MoE support improvement and bugfix around graph mode and DP: [#1940](https://github.com/vllm-project/vllm-ascend/pull/1940) [#2006](https://github.com/vllm-project/vllm-ascend/pull/2006) [#1832](https://github.com/vllm-project/vllm-ascend/pull/1832)
@@ -264,12 +264,12 @@ This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
- Qwen2.5 VL improvement via mrope/padding mechanism improvement: [#1261](https://github.com/vllm-project/vllm-ascend/pull/1261) [#1705](https://github.com/vllm-project/vllm-ascend/pull/1705) [#1929](https://github.com/vllm-project/vllm-ascend/pull/1929) [#2007](https://github.com/vllm-project/vllm-ascend/pull/2007)
- Ray: Fix the device error when using ray and add initialize_cache and improve warning info: [#1234](https://github.com/vllm-project/vllm-ascend/pull/1234) [#1501](https://github.com/vllm-project/vllm-ascend/pull/1501)
### Graph mode improvement:
### Graph Mode Improvement
- Fix DeepSeek with mc2 in [#1269](https://github.com/vllm-project/vllm-ascend/pull/1269)
- Fix accuracy problem for deepseek V3/R1 models with torchair graph in long sequence predictions in [#1332](https://github.com/vllm-project/vllm-ascend/pull/1332)
- Fix torchair_graph_batch_sizes bug in [#1570](https://github.com/vllm-project/vllm-ascend/pull/1570)
- Enable the limit of tp <= 4 for torchair graph mode in [#1404](https://github.com/vllm-project/vllm-ascend/pull/1404)
- Fix rope accruracy bug [#1887](https://github.com/vllm-project/vllm-ascend/pull/1887)
- Fix rope accuracy bug [#1887](https://github.com/vllm-project/vllm-ascend/pull/1887)
- Support multistream of shared experts in FusedMoE [#997](https://github.com/vllm-project/vllm-ascend/pull/997)
- Enable kvcache_nz for the decode process in torchair graph mode [#1098](https://github.com/vllm-project/vllm-ascend/pull/1098)
- Fix chunked-prefill with torchair case to resolve UnboundLocalError: local variable 'decode_hs_or_q_c' issue in [#1378](https://github.com/vllm-project/vllm-ascend/pull/1378)
@@ -290,55 +290,55 @@ This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
- Fix DeepSeek OOM issue in extreme `--gpu-memory-utilization` scenario in [#1829](https://github.com/vllm-project/vllm-ascend/pull/1829)
- Turn off aclgraph when enabling TorchAir in [#2154](https://github.com/vllm-project/vllm-ascend/pull/2154)
### Ops improvement:
- add custom ascendc kernel vocabparallelembedding [#796](https://github.com/vllm-project/vllm-ascend/pull/796)
- fix rope sin/cos cache bug in [#1267](https://github.com/vllm-project/vllm-ascend/pull/1267)
- Refactoring AscendFusedMoE (#1229) in [#1264](https://github.com/vllm-project/vllm-ascend/pull/1264)
- Use fused ops npu_top_k_top_p in sampler [#1920](https://github.com/vllm-project/vllm-ascend/pull/1920)
### Operator Improvement
- Added custom AscendC kernel vocabparallelembedding [#796](https://github.com/vllm-project/vllm-ascend/pull/796)
- Fixed rope sin/cos cache bug in [#1267](https://github.com/vllm-project/vllm-ascend/pull/1267)
- Refactored AscendFusedMoE (#1229) in [#1264](https://github.com/vllm-project/vllm-ascend/pull/1264)
- Used fused ops npu_top_k_top_p in sampler [#1920](https://github.com/vllm-project/vllm-ascend/pull/1920)
### Core:
- Upgrade CANN to 8.2.rc1 in [#2036](https://github.com/vllm-project/vllm-ascend/pull/2036)
- Upgrade torch-npu to 2.5.1.post1 in [#2135](https://github.com/vllm-project/vllm-ascend/pull/2135)
- Upgrade python to 3.11 in [#2136](https://github.com/vllm-project/vllm-ascend/pull/2136)
- Disable quantization in mindie_turbo in [#1749](https://github.com/vllm-project/vllm-ascend/pull/1749)
- fix v0 spec decode in [#1323](https://github.com/vllm-project/vllm-ascend/pull/1323)
- Enable `ACL_OP_INIT_MODE=1` directly only when using V0 spec decode in [#1271](https://github.com/vllm-project/vllm-ascend/pull/1271)
- Upgraded CANN to 8.2.rc1 in [#2036](https://github.com/vllm-project/vllm-ascend/pull/2036)
- Upgraded torch-npu to 2.5.1.post1 in [#2135](https://github.com/vllm-project/vllm-ascend/pull/2135)
- Upgraded python to 3.11 in [#2136](https://github.com/vllm-project/vllm-ascend/pull/2136)
- Disabled quantization in mindie_turbo in [#1749](https://github.com/vllm-project/vllm-ascend/pull/1749)
- Fixed v0 spec decode in [#1323](https://github.com/vllm-project/vllm-ascend/pull/1323)
- Enabled `ACL_OP_INIT_MODE=1` directly only when using V0 spec decode in [#1271](https://github.com/vllm-project/vllm-ascend/pull/1271)
- Refactoring forward_context and model_runner_v1 in [#1422](https://github.com/vllm-project/vllm-ascend/pull/1422)
- Fix sampling params in [#1423](https://github.com/vllm-project/vllm-ascend/pull/1423)
- add a switch for enabling NZ layout in weights and enable NZ for GMM. in [#1409](https://github.com/vllm-project/vllm-ascend/pull/1409)
- Fixed sampling params in [#1423](https://github.com/vllm-project/vllm-ascend/pull/1423)
- Added a switch for enabling NZ layout in weights and enabled NZ for GMM in [#1409](https://github.com/vllm-project/vllm-ascend/pull/1409)
- Resolved bug in ascend_forward_context in [#1449](https://github.com/vllm-project/vllm-ascend/pull/1449) [#1554](https://github.com/vllm-project/vllm-ascend/pull/1554) [#1598](https://github.com/vllm-project/vllm-ascend/pull/1598)
- Address PrefillCacheHit state to fix prefix cache accuracy bug in [#1492](https://github.com/vllm-project/vllm-ascend/pull/1492)
- Fix load weight error and add new e2e case in [#1651](https://github.com/vllm-project/vllm-ascend/pull/1651)
- Optimize the number of rope-related index selections in deepseek. in [#1614](https://github.com/vllm-project/vllm-ascend/pull/1614)
- add mc2 mask in [#1642](https://github.com/vllm-project/vllm-ascend/pull/1642)
- Fix static EPLB log2phy condition and improve unit test in [#1667](https://github.com/vllm-project/vllm-ascend/pull/1667) [#1896](https://github.com/vllm-project/vllm-ascend/pull/1896) [#2003](https://github.com/vllm-project/vllm-ascend/pull/2003)
- add chunk mc2 for prefill in [#1703](https://github.com/vllm-project/vllm-ascend/pull/1703)
- Fix mc2 op GroupCoordinator bug in [#1711](https://github.com/vllm-project/vllm-ascend/pull/1711)
- Fix the failure to recognize the actual type of quantization in [#1721](https://github.com/vllm-project/vllm-ascend/pull/1721)
- Fix deepseek bug when tp_size == 1 in [#1755](https://github.com/vllm-project/vllm-ascend/pull/1755)
- Fixed load weight error and added a new e2e case in [#1651](https://github.com/vllm-project/vllm-ascend/pull/1651)
- Optimized the number of rope-related index selections in deepseek in [#1614](https://github.com/vllm-project/vllm-ascend/pull/1614)
- Added mc2 mask in [#1642](https://github.com/vllm-project/vllm-ascend/pull/1642)
- Fixed static EPLB log2phy condition and improved unit tests in [#1667](https://github.com/vllm-project/vllm-ascend/pull/1667) [#1896](https://github.com/vllm-project/vllm-ascend/pull/1896) [#2003](https://github.com/vllm-project/vllm-ascend/pull/2003)
- Added chunk mc2 for prefill in [#1703](https://github.com/vllm-project/vllm-ascend/pull/1703)
- Fixed mc2 op GroupCoordinator bug in [#1711](https://github.com/vllm-project/vllm-ascend/pull/1711)
- Fixed the failure to recognize the actual type of quantization in [#1721](https://github.com/vllm-project/vllm-ascend/pull/1721)
- Fixed DeepSeek bug when tp_size == 1 in [#1755](https://github.com/vllm-project/vllm-ascend/pull/1755)
- Added support for delay-free blocks in prefill nodes in [#1691](https://github.com/vllm-project/vllm-ascend/pull/1691)
- Moe alltoallv communication optimization for unquantized RL training & alltoallv support dpo in [#1547](https://github.com/vllm-project/vllm-ascend/pull/1547)
- Adapt dispatchV2 interface in [#1822](https://github.com/vllm-project/vllm-ascend/pull/1822)
- Fix disaggregate prefill hang issue in long output in [#1807](https://github.com/vllm-project/vllm-ascend/pull/1807)
- Fix flashcomm_v1 when engine v0 in [#1859](https://github.com/vllm-project/vllm-ascend/pull/1859)
- ep_group is not equal to word_size in some cases. in [#1862](https://github.com/vllm-project/vllm-ascend/pull/1862)
- Fix wheel glibc version incompatibility in [#1808](https://github.com/vllm-project/vllm-ascend/pull/1808)
- Fix mc2 process group to resolve self.cpu_group is None in [#1831](https://github.com/vllm-project/vllm-ascend/pull/1831)
- Pin vllm version to v0.9.1 to make mypy check passed in [#1904](https://github.com/vllm-project/vllm-ascend/pull/1904)
- Apply npu_moe_gating_top_k_softmax for moe to improve perf in [#1902](https://github.com/vllm-project/vllm-ascend/pull/1902)
- Fix bug in path_decorator when engine v0 in [#1919](https://github.com/vllm-project/vllm-ascend/pull/1919)
- Avoid performing cpu all_reduce in disaggregated-prefill scenario. in [#1644](https://github.com/vllm-project/vllm-ascend/pull/1644)
- add super kernel in decode moe in [#1916](https://github.com/vllm-project/vllm-ascend/pull/1916)
- [Prefill Perf] Parallel Strategy Optimizations (VRAM-for-Speed Tradeoff) in [#1802](https://github.com/vllm-project/vllm-ascend/pull/1802)
- Remove unnecessary reduce_results access in shared_experts.down_proj in [#2016](https://github.com/vllm-project/vllm-ascend/pull/2016)
- Optimize greedy reject sampler with vectorization. in [#2002](https://github.com/vllm-project/vllm-ascend/pull/2002)
- Make multiple Ps and Ds work on a single machine in [#1936](https://github.com/vllm-project/vllm-ascend/pull/1936)
- Fixes the shape conflicts between shared & routed experts for deepseek model when tp > 1 and multistream_moe enabled in [#2075](https://github.com/vllm-project/vllm-ascend/pull/2075)
- Add cpu binding support [#2031](https://github.com/vllm-project/vllm-ascend/pull/2031)
- Add with_prefill cpu allreduce to handle D-node recomputatio in [#2129](https://github.com/vllm-project/vllm-ascend/pull/2129)
- Add D2H & initRoutingQuantV2 to improve prefill perf in [#2038](https://github.com/vllm-project/vllm-ascend/pull/2038)
- MoE alltoallv communication optimization for unquantized RL training & alltoallv support dpo in [#1547](https://github.com/vllm-project/vllm-ascend/pull/1547)
- Adapted dispatchV2 interface in [#1822](https://github.com/vllm-project/vllm-ascend/pull/1822)
- Fixed disaggregate prefill hang issue in long output in [#1807](https://github.com/vllm-project/vllm-ascend/pull/1807)
- Fixed flashcomm_v1 when engine v0 in [#1859](https://github.com/vllm-project/vllm-ascend/pull/1859)
- ep_group is not equal to world_size in some cases in [#1862](https://github.com/vllm-project/vllm-ascend/pull/1862).
- Fixed wheel glibc version incompatibility in [#1808](https://github.com/vllm-project/vllm-ascend/pull/1808).
- Fixed mc2 process group to resolve self.cpu_group is None in [#1831](https://github.com/vllm-project/vllm-ascend/pull/1831).
- Pinned vllm version to v0.9.1 to make the mypy check pass in [#1904](https://github.com/vllm-project/vllm-ascend/pull/1904).
- Applied npu_moe_gating_top_k_softmax for moe to improve perf in [#1902](https://github.com/vllm-project/vllm-ascend/pull/1902).
- Fixed bug in path_decorator when engine v0 in [#1919](https://github.com/vllm-project/vllm-ascend/pull/1919).
- Avoided performing CPU all_reduce in the disaggregated-prefill scenario in [#1644](https://github.com/vllm-project/vllm-ascend/pull/1644).
- Added super kernel in decode MoE in [#1916](https://github.com/vllm-project/vllm-ascend/pull/1916)
- [Prefill Perf] Parallel Strategy Optimizations (VRAM-for-Speed Tradeoff) in [#1802](https://github.com/vllm-project/vllm-ascend/pull/1802).
- Removed unnecessary reduce_results access in shared_experts.down_proj in [#2016](https://github.com/vllm-project/vllm-ascend/pull/2016).
- Optimized greedy reject sampler with vectorization in [#2002](https://github.com/vllm-project/vllm-ascend/pull/2002).
- Made multiple Ps and Ds work on a single machine in [#1936](https://github.com/vllm-project/vllm-ascend/pull/1936).
- Fixed the shape conflicts between shared & routed experts for deepseek model when tp > 1 and multistream_moe enabled in [#2075](https://github.com/vllm-project/vllm-ascend/pull/2075).
- Added CPU binding support [#2031](https://github.com/vllm-project/vllm-ascend/pull/2031).
- Added with_prefill cpu allreduce to handle D-node recomputation in [#2129](https://github.com/vllm-project/vllm-ascend/pull/2129).
- Added D2H & initRoutingQuantV2 to improve prefill perf in [#2038](https://github.com/vllm-project/vllm-ascend/pull/2038).
### Docs:
### Docs
- Provide an e2e guide for execute duration profiling [#1113](https://github.com/vllm-project/vllm-ascend/pull/1113)
- Add Referer header for CANN package download url. [#1192](https://github.com/vllm-project/vllm-ascend/pull/1192)
- Add reinstall instructions doc [#1370](https://github.com/vllm-project/vllm-ascend/pull/1370)
@@ -349,7 +349,7 @@ This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
### Known Issues
- Full graph mode support is not yet available for specific hardware types with full_cuda_graph enabled. [#2182](https://github.com/vllm-project/vllm-ascend/issues/2182)
- Qwen3 MoE aclgraph mode with tp failed when enabling ep due to a bincount error. [#2226](https://github.com/vllm-project/vllm-ascend/issues/2226)
- As mentioend in v0.9.1rc1 release note, Altlas 300I series support will NOT be included.
- As mentioned in the v0.9.1rc1 release note, Atlas 300I series support will NOT be included.
## v0.9.2rc1 - 2025.07.11
@@ -367,7 +367,7 @@ This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [
- Fix the accuracy problem when deploying models with parallel parameters. [#1678](https://github.com/vllm-project/vllm-ascend/pull/1678)
- The pre-built wheel package now requires a lower version of glibc. Users can use it via `pip install vllm-ascend` directly. [#1582](https://github.com/vllm-project/vllm-ascend/pull/1582)
### Other
### Others
- Official doc has been updated for a better reading experience. For example, more deployment tutorials are added, and user/developer docs are updated. More guides are coming soon.
- Fix accuracy problem for deepseek V3/R1 models with torchair graph in long sequence predictions. [#1331](https://github.com/vllm-project/vllm-ascend/pull/1331)
- A new env variable `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` has been added. It enables the fused allgather-experts kernel for Deepseek V3/R1 models. The default value is `0`. [#1335](https://github.com/vllm-project/vllm-ascend/pull/1335)
@@ -473,13 +473,13 @@ This is the 1st release candidate of v0.9.0 for vllm-ascend. Please follow the [
- Input embedding feature works with V0 Engine now. [#916](https://github.com/vllm-project/vllm-ascend/pull/916)
- Sleep mode feature works with V1 Engine now. [#1084](https://github.com/vllm-project/vllm-ascend/pull/1084)
### Model
### Models
- Qwen2.5 VL works with V1 Engine now. [#736](https://github.com/vllm-project/vllm-ascend/pull/736)
- LLama4 works now. [#740](https://github.com/vllm-project/vllm-ascend/pull/740)
- A new kind of DeepSeek model called dual-batch overlap (DBO) is added. Please set `VLLM_ASCEND_ENABLE_DBO=1` to use it. [#941](https://github.com/vllm-project/vllm-ascend/pull/941)
### Other
### Others
- Online serving with ascend quantization works now. [#877](https://github.com/vllm-project/vllm-ascend/pull/877)
- A batch of bugs for graph mode and moe model have been fixed. [#773](https://github.com/vllm-project/vllm-ascend/pull/773) [#771](https://github.com/vllm-project/vllm-ascend/pull/771) [#774](https://github.com/vllm-project/vllm-ascend/pull/774) [#816](https://github.com/vllm-project/vllm-ascend/pull/816) [#817](https://github.com/vllm-project/vllm-ascend/pull/817) [#819](https://github.com/vllm-project/vllm-ascend/pull/819) [#912](https://github.com/vllm-project/vllm-ascend/pull/912) [#897](https://github.com/vllm-project/vllm-ascend/pull/897) [#961](https://github.com/vllm-project/vllm-ascend/pull/961) [#958](https://github.com/vllm-project/vllm-ascend/pull/958) [#913](https://github.com/vllm-project/vllm-ascend/pull/913) [#905](https://github.com/vllm-project/vllm-ascend/pull/905)
@@ -498,10 +498,10 @@ This is the first post release of 0.7.3. Please follow the [official doc](https:
### Highlights
- Qwen3 and Qwen3MOE is supported now. The performance and accuracy of Qwen3 is well tested. You can try it now. Mindie Turbo is recomanded to improve the performance of Qwen3. [#903](https://github.com/vllm-project/vllm-ascend/pull/903) [#915](https://github.com/vllm-project/vllm-ascend/pull/915)
- Qwen3 and Qwen3MOE are supported now. The performance and accuracy of Qwen3 are well tested. You can try it now. Mindie Turbo is recommended to improve the performance of Qwen3. [#903](https://github.com/vllm-project/vllm-ascend/pull/903) [#915](https://github.com/vllm-project/vllm-ascend/pull/915)
- Added a new performance guide. The guide aims to help users improve vllm-ascend performance at the system level. It includes OS configuration, library optimization, deploy guide and so on. [#878](https://github.com/vllm-project/vllm-ascend/pull/878) [Doc Link](https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/developer_guide/performance/optimization_and_tuning.html)
### Bug Fix
### Bug Fixes
- Qwen2.5-VL works for RLHF scenarios now. [#928](https://github.com/vllm-project/vllm-ascend/pull/928)
- Users can launch the model from online weights now, e.g., from huggingface or modelscope directly. [#858](https://github.com/vllm-project/vllm-ascend/pull/858) [#918](https://github.com/vllm-project/vllm-ascend/pull/918)
@@ -529,11 +529,11 @@ We are excited to announce the release of 0.7.3 for vllm-ascend. This is the fir
### Core
- LoRA, Multi-LoRA and Dynamic Serving are supported now. The performance will be improved in the next release. Please follow the official doc for more usage information. Thanks for the contribution from China Merchants Bank. [#700](https://github.com/vllm-project/vllm-ascend/pull/700)
### Model
### Models
- The performance of Qwen2 vl and Qwen2.5 vl is improved. [#702](https://github.com/vllm-project/vllm-ascend/pull/702)
- The performance of the `apply_penalties` and `topKtopP` ops is improved. [#525](https://github.com/vllm-project/vllm-ascend/pull/525)
### Other
### Others
- Fixed an issue that may lead to a CPU memory leak. [#691](https://github.com/vllm-project/vllm-ascend/pull/691) [#712](https://github.com/vllm-project/vllm-ascend/pull/712)
- A new environment variable `SOC_VERSION` is added. If you hit any SoC detection error when building with custom ops enabled, please set `SOC_VERSION` to a suitable value. [#606](https://github.com/vllm-project/vllm-ascend/pull/606)
- openEuler container image supported with v0.7.3-openeuler tag. [#665](https://github.com/vllm-project/vllm-ascend/pull/665)
@@ -557,7 +557,7 @@ This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the [
- Optimize NPU memory usage to make DeepSeek R1 W8A8 32K model len work. [#728](https://github.com/vllm-project/vllm-ascend/pull/728)
- Fix `PYTHON_INCLUDE_PATH` typo in setup.py [#762](https://github.com/vllm-project/vllm-ascend/pull/762)
### Other
### Others
- Add Qwen3-0.6B test [#717](https://github.com/vllm-project/vllm-ascend/pull/717)
- Add nightly CI [#668](https://github.com/vllm-project/vllm-ascend/pull/668)
- Add accuracy test report [#542](https://github.com/vllm-project/vllm-ascend/pull/542)
@@ -575,7 +575,7 @@ This is the second release candidate of v0.8.4 for vllm-ascend. Please follow th
- ACLGraph feature is supported with the V1 engine now. It's disabled by default because this feature relies on the CANN 8.1 release. We'll make it available by default in the next release. [#426](https://github.com/vllm-project/vllm-ascend/pull/426)
- Upgrade PyTorch to 2.5.1. vLLM Ascend no longer relies on the dev version of torch-npu. Users don't need to install torch-npu by hand now; the 2.5.1 version of torch-npu will be installed automatically. [#661](https://github.com/vllm-project/vllm-ascend/pull/661)
### Other
### Others
- MiniCPM model works now. [#645](https://github.com/vllm-project/vllm-ascend/pull/645)
- openEuler container image supported with `v0.8.4-openeuler` tag and custom Ops build is enabled by default for openEuler OS. [#689](https://github.com/vllm-project/vllm-ascend/pull/689)
- Fix ModuleNotFoundError bug to make LoRA work. [#600](https://github.com/vllm-project/vllm-ascend/pull/600)
@@ -588,7 +588,7 @@ This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the
### Highlights
- vLLM V1 engine experimental support is included in this version. You can visit [official guide](https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html) to get more detail. By default, vLLM will fallback to V0 if V1 doesn't work, please set `VLLM_USE_V1=1` environment if you want to use V1 forcely.
- vLLM V1 engine experimental support is included in this version. You can visit the [official guide](https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html) to get more detail. By default, vLLM will fall back to V0 if V1 doesn't work. Please set the `VLLM_USE_V1=1` environment variable if you want to use V1 forcibly.
- LoRA, Multi-LoRA and Dynamic Serving are supported now. The performance will be improved in the next release. Please follow the [official doc](https://docs.vllm.ai/en/latest/features/lora.html) for more usage information. Thanks for the contribution from China Merchants Bank. [#521](https://github.com/vllm-project/vllm-ascend/pull/521).
- Sleep Mode feature is supported. Currently it only works on the V0 engine. V1 engine support will come soon. [#513](https://github.com/vllm-project/vllm-ascend/pull/513)
@@ -599,7 +599,7 @@ This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the
- Spec decode feature works now. Currently it only works on the V0 engine. V1 engine support will come soon. [#500](https://github.com/vllm-project/vllm-ascend/pull/500)
- Structured output feature works now on the V1 Engine. Currently it only supports the xgrammar backend, while using the guidance backend may produce some errors. [#555](https://github.com/vllm-project/vllm-ascend/pull/555)
### Other
### Others
- A new communicator `pyhccl` is added. It's used to call the CANN HCCL library directly instead of going through `torch.distributed`. More usage of it will be added in the next release [#503](https://github.com/vllm-project/vllm-ascend/pull/503)
- The custom ops build is enabled by default. You should install packages like `gcc` and `cmake` first to build `vllm-ascend` from source. Set the `COMPILE_CUSTOM_KERNELS=0` environment variable to disable the compilation if you don't need it. [#466](https://github.com/vllm-project/vllm-ascend/pull/466)
@@ -612,17 +612,17 @@ This is 2nd release candidate of v0.7.3 for vllm-ascend. Please follow the [offi
- Installation: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/installation.html
### Highlights
- Add Ascend Custom Ops framewrok. Developers now can write customs ops using AscendC. An example ops `rotary_embedding` is added. More tutorials will come soon. The Custom Ops compilation is disabled by default when installing vllm-ascend. Set `COMPILE_CUSTOM_KERNELS=1` to enable it. [#371](https://github.com/vllm-project/vllm-ascend/pull/371)
- Add Ascend Custom Ops framework. Developers can now write custom ops using AscendC. An example op, `rotary_embedding`, is added. More tutorials will come soon. The custom ops compilation is disabled by default when installing vllm-ascend. Set `COMPILE_CUSTOM_KERNELS=1` to enable it. [#371](https://github.com/vllm-project/vllm-ascend/pull/371)
- The V1 engine is basically supported in this release. Full support will come in the 0.8.X releases. If you hit any issue or have any requirement for the V1 engine, please tell us [here](https://github.com/vllm-project/vllm-ascend/issues/414). [#376](https://github.com/vllm-project/vllm-ascend/pull/376)
- Prefix cache feature works now. You can set `enable_prefix_caching=True` to enable it, as in the sketch after this list. [#282](https://github.com/vllm-project/vllm-ascend/pull/282)
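As referenced above, a minimal sketch of enabling prefix caching through the offline API. The model name and prompts are placeholders:

```python
from vllm import LLM, SamplingParams

# With prefix caching enabled, requests sharing a long common prefix
# (e.g. the same system prompt) can reuse cached KV blocks.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_prefix_caching=True)

shared_prefix = "You are a concise assistant for Ascend NPU questions. " * 20
for question in ["What is an NPU?", "What is HCCL?"]:
    out = llm.generate([shared_prefix + question], SamplingParams(max_tokens=32))
    print(out[0].outputs[0].text)
```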
### Core
- Bump torch_npu version to dev20250320.3 to improve accuracy and fix the `!!!` output problem. [#406](https://github.com/vllm-project/vllm-ascend/pull/406)
### Model
### Models
- The performance of Qwen2-vl is improved by optimizing patch embedding (Conv3D). [#398](https://github.com/vllm-project/vllm-ascend/pull/398)
### Other
### Others
- Fixed a bug to make sure the multi-step scheduler feature works. [#349](https://github.com/vllm-project/vllm-ascend/pull/349)
- Fixed a bug to make the prefix cache feature work with correct accuracy. [#424](https://github.com/vllm-project/vllm-ascend/pull/424)
@@ -642,18 +642,18 @@ This is 2nd release candidate of v0.7.3 for vllm-ascend. Please follow the [offi
- Bump torch_npu version to dev20250308.3 to improve `_exponential` accuracy
- Added initial support for pooling models. BERT-based models, such as `BAAI/bge-base-en-v1.5` and `BAAI/bge-reranker-v2-m3`, work now. [#229](https://github.com/vllm-project/vllm-ascend/pull/229)
### Model
### Models
- The performance of Qwen2-VL is improved. [#241](https://github.com/vllm-project/vllm-ascend/pull/241)
- MiniCPM is now supported [#164](https://github.com/vllm-project/vllm-ascend/pull/164)
### Other
### Others
- Support MTP (Multi-Token Prediction) for DeepSeek V3/R1 [#236](https://github.com/vllm-project/vllm-ascend/pull/236)
- [Docs] Added more model tutorials, including DeepSeek, QwQ, Qwen and Qwen2.5-VL. See the [official doc](https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/tutorials/index.html) for details
- Pin modelscope<1.23.0 on vLLM v0.7.3 to resolve: https://github.com/vllm-project/vllm/pull/13807
### Known issues
### Known Issues
- In [some cases](https://github.com/vllm-project/vllm-ascend/issues/324), especially when the input/output is very long, the output accuracy may be incorrect. We are working on it; it'll be fixed in the next release.
- Improved and reduced the garbled code in model output. But if you still hit the issue, try to change the generation config value, such as `temperature`, and try again. There is also a knonwn issue shown below. Any [feedback](https://github.com/vllm-project/vllm-ascend/issues/267) is welcome. [#277](https://github.com/vllm-project/vllm-ascend/pull/277)
- Improved and reduced garbled model output. If you still hit the issue, try changing generation config values, such as `temperature`, and try again. There is also a known issue shown below. Any [feedback](https://github.com/vllm-project/vllm-ascend/issues/267) is welcome. [#277](https://github.com/vllm-project/vllm-ascend/pull/277)
## v0.7.1rc1 - 2025.02.19
@@ -676,13 +676,13 @@ Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/v0.7.1-de
- Added the Ascend quantization config option; the implementation will come soon. [#7](https://github.com/vllm-project/vllm-ascend/pull/7) [#73](https://github.com/vllm-project/vllm-ascend/pull/73)
- Add `silu_and_mul` and `rope` ops, and add mixed ops into the attention layer. [#18](https://github.com/vllm-project/vllm-ascend/pull/18)
### Other
### Others
- [CI] Enable Ascend CI to actively monitor and improve quality for vLLM on Ascend. [#3](https://github.com/vllm-project/vllm-ascend/pull/3)
- [Docker] Add vllm-ascend container image [#64](https://github.com/vllm-project/vllm-ascend/pull/64)
- [Docs] Add a [live doc](https://vllm-ascend.readthedocs.org) [#55](https://github.com/vllm-project/vllm-ascend/pull/55)
### Known issues
### Known Issues
- This release relies on an unreleased torch_npu version. It has already been installed within the official container image. Please [install](https://vllm-ascend.readthedocs.io/en/v0.7.1rc1/installation.html) it manually if you are using a non-container environment.
- There are logs like `No platform detected, vLLM is running on UnspecifiedPlatform` or `Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")` shown when running vllm-ascend. They don't affect any functionality or performance; you can just ignore them. This has been fixed in this [PR](https://github.com/vllm-project/vllm/pull/12432), which will be included in v0.7.3 soon.

View File

@@ -1,6 +1,6 @@
# Features and models
# Features and Models
This section provides a detailed supported matrix by vLLM Ascend.
This section provides a detailed support matrix for vLLM Ascend.
:::{toctree}
:caption: Support Matrix

View File

@@ -1,4 +1,4 @@
# Feature Support
# Supported Features
The feature support principle of vLLM Ascend is: **aligned with vLLM**. We are also actively collaborating with the community to accelerate support.
@@ -6,11 +6,11 @@ You can check the [support status of vLLM V1 Engine][v1_user_guide]. Below is th
| Feature | Status | Next Step |
|-------------------------------|----------------|------------------------------------------------------------------------|
| Chunked Prefill | 🟢 Functional | Functional, see detail note: [Chunked Prefill][cp] |
| Automatic Prefix Caching | 🟢 Functional | Functional, see detail note: [vllm-ascend#732][apc] |
| Chunked Prefill | 🟢 Functional | Functional, see detailed note: [Chunked Prefill][cp] |
| Automatic Prefix Caching | 🟢 Functional | Functional, see detailed note: [vllm-ascend#732][apc] |
| LoRA | 🟢 Functional | [vllm-ascend#396][multilora], [vllm-ascend#893][v1 multilora] |
| Speculative decoding | 🟢 Functional | Basic support |
| Pooling | 🟢 Functional | CI needed and adapting more models; V1 support rely on vLLM support. |
| Pooling                       | 🟢 Functional | CI needed to adapt to more models; V1 support relies on vLLM support.   |
| Enc-dec | 🟡 Planned | vLLM should support this feature first. |
| Multi Modality | 🟢 Functional | [Tutorial][multimodal], optimizing and adapting more models |
| LogProbs | 🟢 Functional | CI needed |
@@ -18,20 +18,20 @@ You can check the [support status of vLLM V1 Engine][v1_user_guide]. Below is th
| Async output | 🟢 Functional | CI needed |
| Beam search | 🟢 Functional | CI needed |
| Guided Decoding | 🟢 Functional | [vllm-ascend#177][guided_decoding] |
| Tensor Parallel | 🟢 Functional | Make TP >4 work with graph mode |
| Tensor Parallel | 🟢 Functional | Make TP >4 work with graph mode. |
| Pipeline Parallel | 🟢 Functional | Write official guide and tutorial. |
| Expert Parallel | 🟢 Functional | Dynamic EPLB support. |
| Expert Parallel | 🟢 Functional | Support dynamic EPLB. |
| Data Parallel | 🟢 Functional | Data Parallel support for Qwen3 MoE. |
| Prefill Decode Disaggregation | 🟢 Functional | Functional, xPyD is supported. |
| Quantization | 🟢 Functional | W8A8 available; working on more quantization method support(W4A8, etc) |
| Graph Mode | 🔵 Experimental| Experimental, see detail note: [vllm-ascend#767][graph_mode] |
| Quantization                  | 🟢 Functional | W8A8 available; working on support for more quantization methods (W4A8, etc.) |
| Graph Mode | 🔵 Experimental| Experimental, see detailed note: [vllm-ascend#767][graph_mode] |
| Sleep Mode | 🟢 Functional | |
- 🟢 Functional: Fully operational, with ongoing optimizations.
- 🔵 Experimental: Experimental support, interfaces and functions may change.
- 🚧 WIP: Under active development, will be supported soon.
- 🟡 Planned: Scheduled for future implementation (some may have open PRs/RFCs).
- 🔴 NO plan / Deprecated: No plan or deprecated by vLLM.
- 🔴 NO plan/Deprecated: No plan or deprecated by vLLM.
[v1_user_guide]: https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html
[multimodal]: https://vllm-ascend.readthedocs.io/en/latest/tutorials/single_npu_multimodal.html

View File

@@ -1,20 +1,22 @@
# Model Support
# Supported Models
Get the newest info here: https://github.com/vllm-project/vllm-ascend/issues/1608
Get the latest info here: https://github.com/vllm-project/vllm-ascend/issues/1608
## Text-only Language Models
## Text-Only Language Models
### Generative Models
| Model | Supported | Note |
| Model | Support | Note |
|-------------------------------|-----------|----------------------------------------------------------------------|
| DeepSeek v3 | ✅ | |
| DeepSeek V3/3.1 | ✅ | |
| DeepSeek V3.2 EXP | ✅ | |
| DeepSeek R1 | ✅ | |
| DeepSeek Distill (Qwen/Llama) | ✅ | |
| Qwen3 | ✅ | |
| Qwen3-based | ✅ | |
| Qwen3-Coder | ✅ | |
| Qwen3-Moe | ✅ | |
| Qwen3-Next | ✅ | |
| Qwen2.5 | ✅ | |
| Qwen2 | ✅ | |
| Qwen2-based | ✅ | |
@@ -32,17 +34,17 @@ Get the newest info here: https://github.com/vllm-project/vllm-ascend/issues/160
| Gemma-3 | ✅ | |
| Phi-3/4 | ✅ | |
| Mistral/Mistral-Instruct | ✅ | |
| GLM-4.5 | ✅ | |
| GLM-4.5 | ✅ | |
| GLM-4 | ❌ | [#2255](https://github.com/vllm-project/vllm-ascend/issues/2255) |
| GLM-4-0414 | ❌ | [#2258](https://github.com/vllm-project/vllm-ascend/issues/2258) |
| ChatGLM | ❌ | [#554](https://github.com/vllm-project/vllm-ascend/issues/554) |
| DeepSeek v2.5 | 🟡 | Need test |
| DeepSeek V2.5 | 🟡 | Need test |
| Mllama | 🟡 | Need test |
| MiniMax-Text | 🟡 | Need test |
### Pooling Models
| Model | Supported | Note |
| Model | Support | Note |
|-------------------------------|-----------|----------------------------------------------------------------------|
| Qwen3-Embedding | ✅ | |
| Molmo | ✅ | [1942](https://github.com/vllm-project/vllm-ascend/issues/1942) |
@@ -52,10 +54,12 @@ Get the newest info here: https://github.com/vllm-project/vllm-ascend/issues/160
### Generative Models
| Model | Supported | Note |
| Model | Support | Note |
|--------------------------------|---------------|----------------------------------------------------------------------|
| Qwen2-VL | ✅ | |
| Qwen2.5-VL | ✅ | |
| Qwen3-VL | ✅ | |
| Qwen3-VL-MOE | ✅ | |
| Qwen2.5-Omni | ✅ | [1760](https://github.com/vllm-project/vllm-ascend/issues/1760) |
| QVQ | ✅ | |
| LLaVA 1.5/1.6 | ✅ | [1962](https://github.com/vllm-project/vllm-ascend/issues/1962) |
@@ -76,4 +80,4 @@ Get the newest info here: https://github.com/vllm-project/vllm-ascend/issues/160
| GLM-4V | ❌ | [2260](https://github.com/vllm-project/vllm-ascend/issues/2260) |
| InternVL2.0/2.5/3.0<br>InternVideo2.5/Mono-InternVL | ❌ | [2064](https://github.com/vllm-project/vllm-ascend/issues/2064) |
| Whisper | ❌ | [2262](https://github.com/vllm-project/vllm-ascend/issues/2262) |
| Ultravox | 🟡 Need test | |
| Ultravox | 🟡 | Need test |