### What this PR does / why we need it?
Fix the multi-DP padding logic for eager mode: it caused a rank-0 load imbalance in kimi-k2.5-w4a8 because all padding tokens were routed to rank 0. The fix also applies to other models running with multiple DP ranks.
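Conceptually, the fix pads each DP rank's local batch up to the per-step maximum token count instead of appending all padding tokens to rank 0. A minimal sketch of that balancing (the function name and inputs are hypothetical, not the actual vLLM Ascend code):

```python
def padding_per_rank(num_tokens_per_rank):
    """Return how many padding tokens each DP rank adds locally.

    Each rank pads its own batch up to the max token count across
    ranks, so no single rank (previously rank 0) absorbs all padding.
    Illustrative sketch only, not the actual vLLM Ascend implementation.
    """
    max_tokens = max(num_tokens_per_rank)
    return [max_tokens - n for n in num_tokens_per_rank]

# Four DP ranks with uneven local batches: the padding spreads across
# ranks instead of piling onto rank 0.
print(padding_per_rank([7, 5, 8, 6]))  # -> [1, 3, 0, 2]
```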
- before
hbm usage:
<img width="2229" height="733" alt="image"
src="https://github.com/user-attachments/assets/50479b6d-cfd0-4206-8e80-974024652997"
/>
performance:
| Concurrency | NumPrompts | QPS | TTFT_Avg | TTFT_P50 | TPOT_Avg | TPOT_P50 | TPOT_P90 |
|---|---|---|---|---|---|---|---|
| 1 | 15 | 0.0179 | 1667.7803 | 1673.3437 | 35.2973 | 35.2775 | 35.3784 |
| 32 | 480 | 0.4725 | 2764.8027 | 1905.2137 | 40.8030 | 40.6978 | 41.0179 |
| 64 | 960 | 0.7820 | 4123.7096 | 3485.6153 | 48.0461 | 48.1598 | 48.2971 |
| 100 | 1500 | 1.0852 | 6216.7988 | 5714.0082 | 52.9323 | 53.0613 | 54.6304 |
| 108 | 1620 | 1.1040 | 6277.4892 | 5798.7425 | 56.3862 | 56.9224 | 57.2901 |
| 116 | 1740 | 1.1680 | 6563.3293 | 6039.5659 | 56.9894 | 57.4027 | 57.5786 |
| 128 | 1920 | 1.2555 | 7822.5551 | 7604.1662 | 57.7660 | 58.1768 | 58.2717 |
| 192 | 2880 | 1.4314 | 9212.1953 | 9131.3461 | 58.9905 | 59.1683 | 59.2791 |
| 256 | 3840 | 1.4480 | 9028.0812 | 8913.7937 | 59.0092 | 59.2385 | 59.3516 |
- after
hbm usage:
<img width="2246" height="1005" alt="image"
src="https://github.com/user-attachments/assets/d0936481-5a58-4bc5-a6f1-b92735d47885"
/>
performance:
| Concurrency | NumPrompts | QPS | TTFT_Avg | TTFT_P50 | TPOT_Avg | TPOT_P50 | TPOT_P90 |
|---|---|---|---|---|---|---|---|
| 1 | 15 | 0.0181 | 601.4171 | 600.9774 | 35.6270 | 35.6254 | 35.6480 |
| 32 | 480 | 0.4455 | 720.8782 | 724.2889 | 45.4250 | 45.4755 | 45.6318 |
| 64 | 960 | 0.8445 | 729.6209 | 728.2149 | 47.0464 | 47.0896 | 47.1985 |
| 100 | 1500 | 1.2601 | 723.4834 | 724.6673 | 48.3108 | 48.3844 | 48.5355 |
| 108 | 1620 | 1.3409 | 727.1509 | 720.6772 | 48.8962 | 48.9409 | 49.0489 |
| 116 | 1740 | 1.4080 | 679.9799 | 677.6119 | 49.1253 | 49.1983 | 49.3087 |
| 128 | 1920 | 1.4155 | 680.6284 | 674.9436 | 49.2193 | 49.2450 | 49.3763 |
| 192 | 2880 | 1.4422 | 684.6577 | 676.7833 | 49.2059 | 49.2264 | 49.3229 |
| 256 | 3840 | 1.4558 | 685.2462 | 678.1709 | 49.2191 | 49.2351 | 49.3419 |
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: fny-coder <985619145@qq.com>
vLLM Ascend Plugin
Latest News 🔥
- [2026/02] We released the new official version v0.13.0! Please follow the official guide to start using vLLM Ascend Plugin on Ascend.
- [2025/12] We released the new official version v0.11.0! Please follow the official guide to start using vLLM Ascend Plugin on Ascend.
- [2025/09] We released the new official version v0.9.1! Please follow the official guide to start deploying large-scale Expert Parallelism (EP) on Ascend.
- [2025/08] We hosted the vLLM Beijing Meetup with vLLM and Tencent! Please find the meetup slides here.
- [2025/06] User stories page is now live! It kicks off with LLaMA-Factory/verl/TRL/GPUStack to demonstrate how vLLM Ascend assists Ascend users in enhancing their experience across fine-tuning, evaluation, reinforcement learning (RL), and deployment scenarios.
- [2025/06] Contributors page is now live! All contributions deserve to be recorded, thanks for all contributors.
- [2025/05] We've released the first official version v0.7.3! We collaborated with the vLLM community to publish a blog post sharing our practice: Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU.
- [2025/03] We hosted the vLLM Beijing Meetup with vLLM team! Please find the meetup slides here.
- [2025/02] vLLM community officially created vllm-project/vllm-ascend repo for running vLLM seamlessly on the Ascend NPU.
- [2024/12] We are working with the vLLM community to support [RFC]: Hardware pluggable.
Overview
vLLM Ascend (vllm-ascend) is a community maintained hardware plugin for running vLLM seamlessly on the Ascend NPU.
It is the recommended approach for supporting the Ascend backend within the vLLM community. It adheres to the principles outlined in the [RFC]: Hardware pluggable, providing a hardware-pluggable interface that decouples the integration of the Ascend NPU with vLLM.
By using vLLM Ascend plugin, popular open-source models, including Transformer-like, Mixture-of-Experts (MoE), Embedding, Multi-modal LLMs can run seamlessly on the Ascend NPU.
Prerequisites
- Hardware: Atlas 800I A2 Inference series, Atlas A2 Training series, Atlas 800I A3 Inference series, Atlas A3 Training series, Atlas 300I Duo (Experimental)
- OS: Linux
- Software:
- Python >= 3.10, < 3.12
- CANN == 8.5.0 (for the matching Ascend HDK version, refer to here)
- PyTorch == 2.9.0, torch-npu == 2.9.0
- vLLM (the same version as vllm-ascend)
Getting Started
Please use the following recommended versions to get started quickly:
| Version | Release type | Doc |
|---|---|---|
| v0.17.0rc1 | Latest release candidate | See QuickStart and Installation for more details |
| v0.13.0 | Latest stable version | See QuickStart and Installation for more details |
Contributing
See CONTRIBUTING for more details, which is a step-by-step guide to help you set up the development environment, build and test.
We welcome and value any contributions and collaborations:
- Please let us know if you encounter a bug by filing an issue
- Please use User forum for usage questions and help.
Branch
vllm-ascend has a main branch and a dev branch.
- main: main branch, corresponds to the vLLM main branch, and is continuously monitored for quality through Ascend CI.
- releases/vX.Y.Z: development branch, created alongside new releases of vLLM. For example,
releases/v0.13.0 is the dev branch for the vLLM v0.13.0 version.
Below are the maintained branches:
| Branch | Status | Note |
|---|---|---|
| main | Maintained | CI commitment for vLLM main branch and vLLM v0.17.0 tag |
| v0.7.1-dev | Unmaintained | Only doc fixes are allowed |
| v0.7.3-dev | Maintained | CI commitment for vLLM 0.7.3 version, only bug fixes are allowed, and no new release tags anymore. |
| v0.9.1-dev | Maintained | CI commitment for vLLM 0.9.1 version |
| v0.11.0-dev | Maintained | CI commitment for vLLM 0.11.0 version |
| releases/v0.13.0 | Maintained | CI commitment for vLLM 0.13.0 version |
| rfc/feature-name | Maintained | Feature branches for collaboration |
Please refer to Versioning policy for more details.
Weekly Meeting
- vLLM Ascend Weekly Meeting: https://tinyurl.com/vllm-ascend-meeting
- Wednesday, 15:00 - 16:00 (UTC+8, Convert to your timezone)
License
Apache License 2.0, as found in the LICENSE file.
