Commit Graph

2741 Commits

Author SHA1 Message Date
zhangxinyuehfad
d781902ce9 [v0.18.0][CI] Fix releases/v0.18.0 ci test only support vllm v0.18.0 (#7686)
### What this PR does / why we need it?
Fix the releases/v0.18.0 CI so that the test only supports vLLM v0.18.0.

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-26 18:36:04 +08:00
zhangxinyuehfad
124bb00158 [CI][v0.18.0] Build nightly image for releases/v0.18.0 per pr (#7662)
### What this PR does / why we need it?
This patch adds a per-PR image build for the `releases/v0.18.0` branch. Due to
the limitations of the Quay naming convention, the image tag should not be
named the same as the branch name, so we use the tag `releases-v0.18.0` for
the daily build.

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-26 16:48:51 +08:00
cvSoldier
2db33868a4 [kernel] Recompilation optimization triggered by triton function parameter optimization (#7645)
### What this PR does / why we need it?
Some parameters of the Triton operators are unnecessarily declared with the
`constexpr` modifier. When these parameters change, recompilation is
triggered, which significantly hurts model performance, so these parameters
need to be corrected.
Main branch PR: https://github.com/vllm-project/vllm-ascend/pull/7483
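A minimal sketch of the issue, using a generic Triton kernel (not one of the kernels touched by this PR): every distinct value of a `tl.constexpr` argument triggers a separate compilation, while a plain runtime argument does not, so frequently changing values should stay runtime arguments.

```python
import triton
import triton.language as tl

@triton.jit
def scale_kernel(
    x_ptr, out_ptr,
    n_elements,                # runtime argument: changing it does not recompile
    scale,                     # runtime argument after the fix (was constexpr before)
    BLOCK_SIZE: tl.constexpr,  # compile-time constant: each new value recompiles
):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * scale, mask=mask)
```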

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

---------

Signed-off-by: cvSoldier <610496306@qq.com>
2026-03-26 16:31:34 +08:00
Mr.WXS
dba34d4915 [v0.18.0][Triton][Qwen3.5] delete expr for kernels args (#7646)
### What this PR does / why we need it?
Some parameters of the Triton operators are unnecessarily declared with the
`constexpr` modifier. When these parameters change, recompilation is
triggered, which significantly hurts model performance, so these parameters
need to be corrected.
backport: https://github.com/vllm-project/vllm-ascend/pull/7482


Signed-off-by: w30012745 <wangxiaoshuai2@h-partners.com>
Co-authored-by: w30012745 <wangxiaoshuai2@h-partners.com>
2026-03-25 23:31:27 +08:00
Wangbei25
dd55736ee4 fix incompatibility between fc1 and non-sp-padding (#7643)
cherry pick https://github.com/vllm-project/vllm-ascend/pull/7614
### What this PR does / why we need it?
Fix the incompatibility between fc1 and non-SP padding.
After PR
[non-sp-padding](https://github.com/vllm-project/vllm-ascend/pull/7297),
enabling flashcomm1 for kimi2.5 raises an error: the expanded size of the
tensor does not match the existing size at non-singleton dimension 0.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM-Ascend main: 9976e685b7

Signed-off-by: Wangbei25 <wangbei41@huawie.com>
Co-authored-by: Wangbei25 <wangbei41@huawie.com>
2026-03-25 23:23:37 +08:00
wangbj127
2ad0ca52a6 Qwen3.5 MoE supports flashcomm v1 (#7644)
cherry pick from https://github.com/vllm-project/vllm-ascend/pull/7486
### What this PR does / why we need it?
Multimodal models like Qwen3.5 MoE do embedding in the model_runner, so
when flash comm is enabled, the first AllGather operation should be
skipped.
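A minimal sketch of the idea under assumed names (`maybe_all_gather` and `embedding_done_in_runner` are illustrative, not actual vllm-ascend attributes): when the model_runner has already produced embeddings for the full batch, the first flash comm AllGather would duplicate work and is skipped.

```python
import torch
import torch.distributed as dist

def maybe_all_gather(hidden_states: torch.Tensor,
                     embedding_done_in_runner: bool,
                     is_first_gather: bool) -> torch.Tensor:
    # Flash comm v1 normally gathers the sharded activations before they are used.
    # If embeddings were already computed for the full batch in the model_runner,
    # the very first gather is redundant, so it is skipped.
    if is_first_gather and embedding_done_in_runner:
        return hidden_states
    gathered = [torch.empty_like(hidden_states) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, hidden_states)
    return torch.cat(gathered, dim=0)
```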

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- vLLM version: v0.18.0
- vLLM main:
8b6325758c

---------

Signed-off-by: Wangbingjie <wangbj1207@126.com>
Signed-off-by: wangbj127 <256472688+wangbj127@users.noreply.github.com>
2026-03-25 23:09:33 +08:00
Wang Kunpeng
ff1860bd81 [CI]fix lint (#7641)
### What this PR does / why we need it?
This pull request addresses a linting issue by reordering a specific
configuration assignment within the `apply_config_platform_defaults`
method in `vllm_ascend/platform.py`. This change ensures compliance with
code style guidelines without altering the functional behavior of the
system.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2026-03-25 18:48:10 +08:00
linfeng-yuan
05a561129e [Graph][Bugfix] Set default cudagraph max capture size via platform defaults (#7572)
### What this PR does / why we need it?

This PR lets NPU platform provide its own default
`max_cudagraph_capture_size` via
`NPUPlatform.apply_config_platform_defaults()`.

Previously, when cudagraph sizing was left unset, Ascend inherited
vLLM's upstream default heuristic in `_set_cudagraph_sizes()`, which
uses `max_num_seqs * decode_query_len * 2`. This PR changes Ascend's
default to `min(max_num_seqs * decode_query_len, 512)` while keeping the
rest of vLLM's cudagraph sizing logic unchanged.
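A minimal sketch of the described default, assuming illustrative config attribute names (the real `NPUPlatform.apply_config_platform_defaults()` differs in detail):

```python
def apply_config_platform_defaults(vllm_config) -> None:
    compilation = vllm_config.compilation_config
    scheduler = vllm_config.scheduler_config
    spec = vllm_config.speculative_config

    if compilation.max_cudagraph_capture_size is not None:
        return  # explicit user setting is preserved
    if compilation.cudagraph_capture_sizes:
        return  # explicit capture sizes are preserved

    # decode_query_len computation is an assumption for this sketch:
    # one token per step plus any speculative tokens.
    decode_query_len = 1 + (spec.num_speculative_tokens if spec else 0)

    # Upstream default would be max_num_seqs * decode_query_len * 2;
    # on Ascend the default is capped at 512 instead.
    compilation.max_cudagraph_capture_size = min(
        scheduler.max_num_seqs * decode_query_len, 512)
```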

### Does this PR introduce _any_ user-facing change?

Yes, but only for Ascend when users do not explicitly configure
cudagraph sizing.

If `max_cudagraph_capture_size` and `cudagraph_capture_sizes` are both
unset, we now use `max_num_seqs * decode_query_len` (capped at `512`)
instead of the upstream `* 2` default. Explicit user settings are
unchanged.

### How was this patch tested?

Add unit tests to cover:

- default max injection via `apply_config_platform_defaults()`
- explicit `max_cudagraph_capture_size` is preserved
- explicit `cudagraph_capture_sizes` are preserved
- Ascend default max no longer uses the upstream `* 2`
- late `_set_cudagraph_sizes()` recomputation reuses the current max
input

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-25 17:57:19 +08:00
linfeng-yuan
d452d04656 [A5][bugfix] Fix fused MoE A5 MXFP8 scale normalization, load-balance routing and gating_topk ops (#7573)
### What this PR does / why we need it?
This PR fixes A5 MXFP8 MoE scale handling in the fused MoE path.

- It normalizes MXFP8 activation scales to the packed 3D layout expected
by A5 kernels, including both precomputed dynamic_scale inputs and gmm1
output scales before they are consumed by downstream grouped matmul ops.
- It also refines the MXFP8 force load-balancing path in profiling runs.
- This PR also enables `npu_gating_top_k` from torch_npu instead of the
custom op when running on the Ascend 950 chip.
### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI and E2E serving tests on Ascend950DT passed.

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-25 17:20:28 +08:00
Shaoxu Cheng
e0e585a109 [310P]: add torch chunk gated delta rule and 910b parity ut (#7594)
### What this PR does / why we need it?
RFC https://github.com/vllm-project/vllm-ascend/issues/7394
Add a PyTorch implementation of the chunk gated delta rule on 310P.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
UT

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-25 16:46:43 +08:00
Marck
17da96658f [ModelLoader][Feature] Add rfork support for fast model loading (#7392)
### What this PR does / why we need it?
Support a new load format: RFORK

For implementation details of this feature, please refer to #7441


### Does this PR introduce _any_ user-facing change?

Add a new option for load-format: rfork

e.g.
```bash
vllm serve /workspace/models/Qwen3-8B --load-format rfork
```

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: Marck <1412354149@qq.com>
2026-03-25 16:40:30 +08:00
pichangping
6ddfc41312 [bugfix] Fixed the error issue when overlaying MTP and full decode on DSV3.1 C8. (#7571)

### What this PR does / why we need it?
DeepSeek V3.1 C8 had a hanging issue when MTP and full-graph mode were
enabled together; this pull request resolves that issue.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

---------

Signed-off-by: pichangping <1337510399@qq.com>
2026-03-25 14:36:26 +08:00
lilinsiman
95d33f05c2 [eagle3][pcp] fix acceptance rate for eagle3 and pcp enabled (#7549)
### What this PR does / why we need it?
Fix the position-3 acceptance rate when eagle3 and PCP are enabled.

Detail:
In the merged graph of eagle_proposer, the logic was changed from updating
common_attn_metadata once before the forward pass of the draft model to
updating all three positions of common_attn_metadata in the merged graph
before performing the forward pass of the model. As a result, the updates
for position 2 and position 3 affected the update for position 1.

For example, in the following field:
`common_attn_metadata.block_table_tensor[:batch_size] =
common_attn_metadata.block_table_tensor[block_indices]`

When updating the block_table_tensor for position 2, the modification was
made at the original address of common_attn_metadata, so the parameter for
position 1 was also modified even though the forward pass for position 1
had not yet been performed. Therefore, a copy of block_table_tensor needs
to be made, and the modification performed on the new address, to ensure
complete isolation between positions.
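A minimal sketch of the described fix, with names taken from the PR text (the real eagle_proposer code differs): cloning the tensor before the per-position update keeps position 1 isolated from positions 2 and 3.

```python
import torch

def update_block_table(common_attn_metadata, block_indices, batch_size):
    # Before: the in-place write below aliased the tensor shared by all positions.
    # common_attn_metadata.block_table_tensor[:batch_size] = \
    #     common_attn_metadata.block_table_tensor[block_indices]

    # After: copy first, then modify the copy, so earlier positions still see
    # the original values when their forward pass runs.
    block_table = common_attn_metadata.block_table_tensor.clone()
    block_table[:batch_size] = block_table[block_indices]
    common_attn_metadata.block_table_tensor = block_table
```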

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
tests and ut

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2026-03-25 11:52:04 +08:00
meihanc
114ec75a06 [bugfix][CI] fix '_OpNamespace' 'vllm' object has no attribute 'qkv_rmsnorm_rope' (#7620)
### What this PR does / why we need it?
Fix '_OpNamespace' 'vllm' object has no attribute 'qkv_rmsnorm_rope' by
uninstalling triton.

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-03-25 11:05:34 +08:00
Li Wang
8e3f8bab57 [Nightly] Nightly pre-build image (#7388)
### What this PR does / why we need it?
This pull request refactors the nightly image build and simplifies the
logic across multiple workflows.
1. The nightly image build becomes a prerequisite when the tests are
triggered by `schedule` or `workflow_dispatch`.
2. Simplify the pull-request selection logic.
3. Next step: implement replaceable nightly tests. Specifically, if
nightly tests are manually triggered, they can accept any optional
Docker image to meet the needs of different commits (which means the
image is customizable).
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-25 09:24:01 +08:00
Yaphets24
8977be1df3 [Bugfix]Fix deepseek 3.2 C8 precision by rotary tensor (#7537)
### What this PR does / why we need it?
During the attention quantization process of DeepSeek V3.2, it is
necessary to retrieve the Hadamard matrix from the weights to facilitate
the computation.

### Does this PR introduce _any_ user-facing change?
No, but there will be two new tensors in the quantized weights.

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

---------

Signed-off-by: mayumeng <m30059191@china.huawei.com>
Co-authored-by: mayumeng <m30059191@china.huawei.com>
2026-03-25 09:18:00 +08:00
Ronald
d96440924a adapt to main2main for model runner v2 (#7578)
### What this PR does / why we need it?
This PR adapts model runner v2 to the newest commit of the vLLM main
branch. Please refer to
https://github.com/vllm-project/vllm-ascend/issues/5208
### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-03-25 09:08:44 +08:00
Zhu Yi Lin
fc3ec100bc [Patch] Fix balance scheduling (#7611)
### What this PR does / why we need it?
This PR introduces a "balance scheduling" feature, enabled by the
`VLLM_ASCEND_BALANCE_SCHEDULING` environment variable. This feature
adjusts the scheduling logic to better balance the load across
data-parallel workers, preventing a single worker from blocking
scheduling for others. This can improve overall throughput.

Additionally, this PR includes a number of other updates and fixes to
the scheduler, syncing it with a more recent version of the upstream
vLLM scheduler. These changes include:
- Handling for paused scheduler state.
- Support for Mamba block-aligned splits.
- Handling for streaming requests.
- Refinements in preemption logic and resource management (KV cache,
encoder cache).
- General code refactoring for clarity and correctness.

Fixes #

### Does this PR introduce _any_ user-facing change?
Yes, this PR introduces a new feature controlled by the
`VLLM_ASCEND_BALANCE_SCHEDULING` environment variable. When enabled, the
scheduling behavior changes, which could affect performance and request
throughput.

### How was this patch tested?
CI passed. Further testing should be done to validate the performance
and correctness of the new scheduling logic under various workloads,
with and without the feature flag enabled.

Signed-off-by: GDzhu01 <809721801@qq.com>
2026-03-25 08:57:06 +08:00
Shaoxu Cheng
3f4087a8f0 [310P]fused recurrent gated delta rule pytorch core and ut (#7398)
### What this PR does / why we need it?
RFC https://github.com/vllm-project/vllm-ascend/issues/7394
Add a PyTorch implementation of the fused recurrent gated delta rule on
310P.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
UT
- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-03-25 08:53:14 +08:00
drizzlezyk
54879467c4 [CI] refine issue triage rules, wan regex and update stale setting (#7531)

### What this PR does / why we need it?
- Update the issue labeler regex for wan to match numeric suffixes only
(see the regex sketch after this list), in both:
  - the standalone wan label rule
  - the multi-modality-generate aggregate rule
- Add title-based gate conditions in the issue triage workflow so
auto-labeling runs only for the expected templates:
`[Bug]:` / `[Installation]:` / `[Usage]:` / `[Doc]:`
- Adjust the scheduled stale workflow configuration for the
awaiting-feedback processing block.
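A hedged illustration of the intended matching behavior only; the pattern below is an assumption, not the actual expression in `.github/issue-labeler.yml`:

```python
import re

# Assumed pattern: "wan" followed by a numeric suffix such as 2 or 2.1.
WAN_PATTERN = re.compile(r"\bwan\d+(?:\.\d+)?\b", re.IGNORECASE)

assert WAN_PATTERN.search("Support wan2 video generation")        # matches
assert WAN_PATTERN.search("bug in wan2.1 pipeline")                # matches
assert WAN_PATTERN.search("wana model fails to load") is None      # no match
```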

### Does this PR introduce _any_ user-facing change?
- No runtime/API user-facing change.
- This PR only updates repository automation behavior in GitHub
workflows and issue labeling rules.

### How was this patch tested?
- Performed config-level validation by reviewing diffs and final YAML
content for:
- .github/issue-labeler.yml
- .github/workflows/bot_issue_manage.yaml
- .github/workflows/schedule_stale_manage.yaml
- Verified the wan regex now requires a numeric suffix (e.g., `wan2`, `wan2.1`)
and no longer matches alphabetic suffix forms (e.g., `wana`).
- Verified triage workflow includes title-based if conditions for
expected issue templates.
- Verified stale workflow’s awaiting-feedback block reflects the
intended configuration adjustment.
- No unit/e2e tests were added because this PR changes GitHub Actions
and labeling configuration only.

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

---------

Signed-off-by: drizzlezyk <drizzlezyk@163.com>
2026-03-24 20:11:31 +08:00
SILONG ZENG
1e3c1e76bf [Lint]Add lint hooks for clang-format, shellcheck, forbidden imports, and boolean context manager checks (#7511)
### What this PR does / why we need it?
This PR introduces several upstream `vllm`-aligned lint hooks into
`vllm-ascend` and makes them part of the actual `pre-commit` flow.

Main changes in this PR:
- add `check-boolean-context-manager` to catch boolean expressions in
`with` statements
- add `check-forbidden-imports` to forbid direct `re` imports and
disallowed direct `triton` imports
- enable shell script linting through `tools/shellcheck.sh`
- add root `.clang-format` aligned with upstream `vllm`, enable
`clang-format` in `pre-commit`, temporarily **exclude all `csrc/**`**
from `clang-format` to avoid bringing a large native code reformat into
this PR

This PR focuses on landing the smaller and immediately useful lint
alignment first, without mixing in the larger requirements-management
migration.

### Does this PR introduce _any_ user-facing change?
No.

This PR only updates repository lint configuration, static checks, and
internal import/style enforcement. It does not change runtime behavior
or public interfaces.

### How was this patch tested?
Tested locally in the project virtual environment.

Commands used:
```bash
bash format.sh
```
Verified checks passed:
``` bash
ruff check...............................................................Passed
ruff format..............................................................Passed
codespell................................................................Passed
typos....................................................................Passed
clang-format.............................................................Passed
Lint GitHub Actions workflow files.......................................Passed
Lint shell scripts.......................................................Passed
Lint PNG exports from excalidraw.........................................Passed
Check for spaces in all filenames........................................Passed
Enforce __init__.py in Python packages...................................Passed
Check for forbidden imports..............................................Passed
Check for boolean ops in with-statements.................................Passed
Suggestion...............................................................Passed
- hook id: suggestion
- duration: 0s

To bypass pre-commit hooks, add --no-verify to git commit.
```
**note:**
clang-format is enabled but currently excludes all csrc/**


- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-03-24 20:03:01 +08:00
rjg-lyh
d1a83a72f7 [doc] add enable_sparse_c8 option in configuration options (#7600)
### What this PR does / why we need it?
This PR adds enable_sparse_c8 option in configuration options

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

Signed-off-by: rjg-lyh <1318825571@qq.com>
2026-03-24 19:36:34 +08:00
zouyida2052
0210cc0b07 lower log level in PD Disaggregation (#7589)
### What this PR does / why we need it?
This log is printed too frequently and is unnecessary, so its level is
lowered from INFO to DEBUG.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

---------

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2026-03-24 18:03:17 +08:00
lhp-deep
0e3186f07c [model_runner_v2]:optimize the performance of the _compute_slot_mappings_kernel (#7575)
### What this PR does / why we need it?

This PR optimizes the `_compute_slot_mappings_kernel` for Ascend NPUs to
improve performance. The key changes include:
- A new Triton kernel implementation (`_compute_slot_mappings_kernel`)
with NPU-specific optimizations, such as using `tl.gather` to handle
non-contiguous memory access and replacing modulo operations.
- A new method `compute_slot_mappings` in `AscendBlockTables` to use
this new kernel.
- An end-to-end test to verify the correctness of the new kernel against
the reference GPU implementation.

The optimization is needed to avoid performance degradation from scalar
computation on Ascend devices.
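For reference, a pure-PyTorch sketch of what a slot-mapping computation does (illustrative only; the PR's NPU Triton kernel is not reproduced here): each token's slot is its block id from the block table times the block size plus the in-block offset.

```python
import torch

def compute_slot_mappings_ref(block_table: torch.Tensor,  # [num_reqs, max_blocks]
                              positions: torch.Tensor,    # [num_tokens]
                              req_indices: torch.Tensor,  # [num_tokens]
                              block_size: int) -> torch.Tensor:
    # Look up the physical block for each token, then add the in-block offset.
    block_ids = block_table[req_indices, positions // block_size]
    return block_ids * block_size + positions % block_size
```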
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

---------

Signed-off-by: lhp-deep <liuhaopeng1@huawei.com>
2026-03-24 17:29:14 +08:00
realliujiaxu
5d12446573 [Feat][SP] Support SP for VL MoE models (#7044)
### What this PR does / why we need it?

2nd PR for https://github.com/vllm-project/vllm-ascend/issues/5712,
extend SP to VL MoE models.


### Does this PR introduce _any_ user-facing change?
remove `sp_threshold` in additional config and reuse `sp_min_token_num`
from vLLM.


### How was this patch tested?
- Model: Qwen3-VL-30B-A3B, 
- TP4 DP2
- 100 reqs
- max concurrency 1

| Seq length | Mean TTFT (ms) main | Mean TTFT (ms) this PR |
|------------|---------------------|------------------------|
| 4k         | 429.40               | 323.3                  |
| 16k        | 1297.01              | 911.74                |

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2026-03-24 17:16:00 +08:00
LeeWenquan
9615bc33fd Fix Qwen3Next CI Config (#7561)
### What this PR does / why we need it?
This PR modifies the Qwen3Next nightly CI config:
(1) Add a nightly CI.
(2) Set a more precise accuracy standard.

- vLLM version: v0.18.0
- vLLM main:
6a9cceb219

Signed-off-by: Your Name <you@example.com>
Co-authored-by: Your Name <you@example.com>
2026-03-24 17:08:17 +08:00
panchao-hub
d98a0727c8 [Feat] Add npugraph_ex enablement logging (#7574)
### What this PR does / why we need it?

- Replace local logging with vllm.logger for consistency
- Add info log when enable_npugraph_ex is enabled
- Add info log when enable_static_kernel is enabled
- Unify logging message format to use config switch names consistently
- This helps users understand which compilation optimizations are active

### Does this PR introduce _any_ user-facing change?

Yes. Users will now see informational log messages when
enable_npugraph_ex or enable_static_kernel features are enabled,
providing better visibility into the compilation optimization settings
being used.

### How was this patch tested?

- Code passes all pre-commit hooks (ruff check, ruff format, codespell,
typos)
- Follows project coding conventions and style guidelines
- Logger import matches the pattern used elsewhere in the codebase

Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
2026-03-24 17:04:48 +08:00
Angazenn
bdb65319a9 [UT] Align input arguments with Ascend(Yarn)RotaryEmbedding with vLLM and add ut (#7358)
### What this PR does / why we need it?
This PR adds missing arguments in `AscendRotaryEmbedding` and
`AscendYarnRotaryEmbedding` to conform with vLLM. Besides, a corresponding
unit test is introduced.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Angazenn <supperccell@163.com>
2026-03-24 16:02:56 +08:00
liziyu
568b6d0601 [P/D] Check wildcard address for layerwise connector (#7389)
### What this PR does / why we need it?
Check the wildcard address for the layerwise connector.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-03-24 15:50:06 +08:00
liziyu
73cadecfb4 [P/D] [Bugfix] fix mooncake layerconnector dead when update_decoder_info fail (#7514)
### What this PR does / why we need it?
Fix the mooncake layerwise connector dying when update_decoder_info fails.
In the scenario where node D is dead, node P failing to update_decoder_info
should not cause node P to die as well.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
by CI

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-03-24 15:49:46 +08:00
zxr2333
67aad1fce8 [BugFix][P/D] fix padding error on FullGraph mode && fix layerwise connector mamba accuracy (#7506)
### What this PR does / why we need it?
1. When FullGraph mode is used, the branches in the Triton operator are
compiled and fixed during graph capture. This invalidates the branch
condition in the `fused_recurrent_gated_delta_rule` operator that checks
whether `ssm_state_indices >= 0` before writing to the SSM cache, so the
write is performed regardless of the value. The operator then computes
address offsets and writes to the SSM cache based on the -1 offset after
-1 is used for padding in the vLLM GDN backend. Since the conv cache and
SSM cache in the vLLM Ascend implementation are actually a single
continuous tensor divided into two parts, this leads to data overwriting
and the generation of NaN values.
This PR addresses the two cases where padding with -1 is required in the
GDN metadata builder, replacing the padding with 0 in both to avoid the
memory overwriting, because block 0 is a reserved block (see the sketch
after this list).
2. Fix a layerwise connector bug for mamba cache sending on heterogeneous
TP.
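A minimal sketch of the padding change from item 1, with illustrative names (this is not the actual GDN metadata builder code): padding with 0 instead of -1 is safe because block 0 is reserved, so the always-taken write lands in a harmless location.

```python
import torch

def pad_state_indices(ssm_state_indices: torch.Tensor) -> torch.Tensor:
    # Before: padded entries were -1, which the captured graph still used as an
    # address offset, overwriting neighbouring cache memory.
    # After: map all negative (padding) entries to the reserved block 0.
    return torch.where(ssm_state_indices < 0,
                       torch.zeros_like(ssm_state_indices),
                       ssm_state_indices)
```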

- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-03-24 15:15:55 +08:00
LeeWenquan
475b4b0cea Revert "GMM custom operator optimization in small batch scenarios (vllm-project#7100)" (#7557)
### What this PR does / why we need it?
This reverts commit 42bcad7e9b. That commit causes an accuracy decrease
for Qwen3Next on 150 items of gsm8k: 98 -> 91.

- vLLM version: v0.18.0
- vLLM main:
6a9cceb219

Signed-off-by: Your Name <you@example.com>
Co-authored-by: Your Name <you@example.com>
2026-03-24 14:24:44 +08:00
Shaoxu Cheng
83bd77c983 [310p]: add rmsnorm gated fallback and unit test (#7424)
### What this PR does / why we need it?
RFC #7394
310P cannot use the fused `rmsnormgated` operator and must fall back to
the native implementation.

### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
ut
- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-24 09:00:11 +08:00
jiaojiao
1de805ce0a [Ops][Misc] Refactor and optimize CausalConv1d for Ascend (#7495)
### What this PR does / why we need it?
During the prefill phase of Qwen3-Next and Qwen3.5, the
`torch.ops._C_ascend.causal_conv1d_fn` operator exhibits significant
performance bottlenecks. To address this, we have re-implemented the
optimization using `torch.ops._C_ascend.npu_causal_conv1d_custom`.
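For context, a pure-PyTorch reference of a depthwise causal 1D convolution, illustrative only; neither the original `causal_conv1d_fn` nor the new custom op is reproduced here:

```python
import torch
import torch.nn.functional as F

def causal_conv1d_ref(x: torch.Tensor,      # [batch, channels, seq_len]
                      weight: torch.Tensor, # [channels, 1, kernel_size], depthwise
                      bias=None) -> torch.Tensor:
    kernel_size = weight.shape[-1]
    # Pad only on the left so each output position sees no future inputs.
    x = F.pad(x, (kernel_size - 1, 0))
    return F.conv1d(x, weight, bias, groups=x.shape[1])
```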

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
1. Accuracy test:
```
[2026-03-20 16:44:22,961] [ais_bench] [INFO] Start launch task state board ...
+-----------------------------+-----------+------------+-------------+----------+-------------------------------------------+---------------------+
| Task Name                   |   Process | Progress   | Time Cost   | Status   | Log Path                                  | Extend Parameters   |
+=============================+===========+============+=============+==========+===========================================+=====================+
| vllm-api-general-chat/gsm8k |   2918978 | NA         | 0:00:01     | finish   | logs/eval/vllm-api-general-chat/gsm8k.out | None                |
+-----------------------------+-----------+------------+-------------+----------+-------------------------------------------+---------------------+
[2026-03-20 16:44:34,284] [ais_bench] [INFO] Evaluation tasks completed.
[2026-03-20 16:44:34,287] [ais_bench] [INFO] Summarizing evaluation results...
dataset    version    metric    mode      vllm-api-general-chat
---------  ---------  --------  ------  -----------------------
gsm8k      271d0b     accuracy  gen                       96.21
```
2. Modified unit test:
`pytest -sv
/home/c30006096/vllm-ascend/tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_causal_conv1d.py::test_ascend_causal_conv1d`

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

Signed-off-by: wenba0 <3054239545@qq.com>
Signed-off-by: jiaojiao <56385650+wenba0@users.noreply.github.com>
2026-03-24 00:07:12 +08:00
ZhuQi-seu
e942b62d74 [features]support split qkv rmsnorm rmope for qwen3.5 (#7368)
### What this PR does / why we need it?
Qwen3.5 full attention supports enabling the split_qkv_rmsnorm_mrope
fusion operator.

### How was this patch tested?
vLLM version: v0.16.0
vLLM-Ascend main: https://github.com/vllm-project/vllm-ascend/pull/6730
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: ZhuQi-seu <zhuqi12@huawei.com>
2026-03-23 23:58:12 +08:00
Nengjun Ma
8e0789bb36 [CI] Recover pd disaggregated encoder test case that been incorrectly skipped (#7505)
### What this PR does / why we need it?
[CI] Recover the pd disaggregated encoder test case that was incorrectly
skipped in PR: https://github.com/vllm-project/vllm-ascend/pull/7412

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-03-23 21:41:28 +08:00
Nengjun Ma
fcba91a392 Main2main Upgrade vllm commit to 0320 17:00 (#7510)
### What this PR does / why we need it?
Main2main Upgrade vllm commit to 0320 17:00

1. Fix: vLLM refactored `_moe_forward` to call
`runner.forward_impl_chunked()` when `runner.use_dp_chunking` is True.
vLLM PR: "[MoE Refactor] DefaultMoERunner simplification"
[#33049](https://github.com/vllm-project/vllm/pull/33049)

2. Fix: vLLM moved the call to `self._set_compile_ranges()` in
`VllmConfig.__post_init__` from **before** `check_and_update_config()`
to **after** it (to allow platforms to lower `max_num_batched_tokens`
first). vLLM PR: "fix(xpu): Re-compute compile ranges after
platform-specific config updates"
[#37523](https://github.com/vllm-project/vllm/pull/37523)


### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
2026-03-23 21:37:41 +08:00
weijinqian0
bdd90c0088 [model_runner_v2]optimize the performance of the post_update. (#7496)
### What this PR does / why we need it?
- This PR aims to enhance the operator performance in the `post_update`
phase of `model_runner_v2` on NPUs. By optimizing the relevant
operations, it is expected to improve the overall efficiency and speed
of the model running on NPU hardware, which is crucial for scenarios
where high-performance inference is required.
- When bs = 256, the time cost is reduced from 26 us to 11 us.

### Does this PR introduce _any_ user-facing change?
No, there are no changes to the API, interface, or other high-level
behaviors that would directly affect the user's code or interaction with
the system beyond the performance improvement.

### How was this patch tested?
CI passed with new added/existing tests. In addition to the regular CI
tests, specific benchmark tests were conducted on NPU hardware to
measure the performance improvement of the `post_update` operators.

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2026-03-23 20:29:55 +08:00
lijiahang226
170dcbda62 [Feature] Support DeepSeek for A5 (#7232)
### What this PR does / why we need it?

Add A5 mla operators to support running DeepSeek models on A5.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: Li Jiahang <216526138+lijiahang226@users.noreply.github.com>
2026-03-23 20:28:26 +08:00
Shaoxu Cheng
13397e9cb7 [310p] Add a PyTorch implementation of the GDN gating operator on 310P (#7430)
### What this PR does / why we need it?
RFC #7394
Add a PyTorch implementation of the GDN gating operator on 310P.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
UT

- vLLM version: v0.17.0
- vLLM main:
4497431df6

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-23 20:26:39 +08:00
meihanc
e344a53127 [bugfix][CI]Skip e2e log summary when the log file is missing or empty (#7552)
### What this PR does / why we need it?
Avoid failing `ci_log_summary.py` when the e2e log file is missing or
empty.

Tested in CI:
https://github.com/vllm-project/vllm-ascend/actions/runs/23428406256/job/68149271871
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

---------

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-03-23 20:25:59 +08:00
zhangxinyuehfad
886756aea0 [Bugfix][CI] Fix aisbench installation to avoid Gitee authentication (#7536)
### What this PR does / why we need it?
- Pass GITEE_USERNAME (var) and GITEE_TOKEN (secret) as Docker build
  args in nightly image build so Dockerfile can authenticate to Gitee
- In Dockerfile.nightly.a2/a3, embed credentials into clone URL to
  avoid auth failure during `git clone`
- In single-node and multi-node PR test workflows, backup the
  pre-installed benchmark from the nightly image before wiping
  vllm-ascend, then restore it instead of re-cloning from Gitee,
  which is inaccessible from fork PR contexts

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-23 20:16:51 +08:00
SILONG ZENG
ffd195b0fe [Bugfix]Remove conflicting triton after vllm-ascend install on x86 (#7497)
### What this PR does / why we need it?
This PR fixes the x86 image issue where both `triton` and
`triton-ascend` are installed in the final environment.
- https://github.com/vllm-project/vllm-ascend/issues/7359

We confirmed the root cause is not that `triton` fails to uninstall
after the upstream `vllm` installation. Instead, during the
`vllm-ascend` installation step, pip resolves and installs upstream
`triton` again alongside `triton-ascend` on x86 platforms. This leads to
module conflicts at runtime because both distributions provide the
`triton` Python package.

To fix this, this PR updates all Dockerfiles to remove upstream `triton`
immediately after installing `vllm-ascend`, while keeping the
`triton-ascend` version resolved by `vllm-ascend` itself.
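A hedged side note, not part of the PR itself: one quick way to confirm whether both distributions ended up installed in an image is to inspect the installed package metadata.

```python
from importlib.metadata import distributions

# Collect the names of all installed distributions in the current environment.
installed = {dist.metadata["Name"].lower() for dist in distributions()}

if {"triton", "triton-ascend"} <= installed:
    print("Conflict: both triton and triton-ascend are installed; "
          "remove upstream triton (e.g. pip uninstall -y triton).")
```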

Affected files:
- `Dockerfile`
- `Dockerfile.a3`
- `Dockerfile.310p`
- `Dockerfile.openEuler`
- `Dockerfile.a3.openEuler`
- `Dockerfile.310p.openEuler`

### Does this PR introduce _any_ user-facing change?
Yes.

For x86 container images, the final Python environment will no longer
keep upstream `triton` alongside `triton-ascend`. This avoids importing
the wrong Triton package and fixes related runtime failures.

### How was this patch tested?
Root cause validation was performed by reproducing the installation flow
locally and checking the package state after each step.

Observed during `vllm-ascend` installation on x86:
- `triton-ascend` was installed as expected
- upstream `triton` was also installed again in the same step
``` bash
export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip cache purge

Successfully installed aiofiles-25.1.0 arctic-inference-0.1.1 blinker-1.9.0 cmake-4.2.3 fastapi-0.123.10 
flask-3.1.3 h2-4.3.0 hpack-4.1.0 hypercorn-0.18.0 hyperframe-6.1.0 itsdangerous-2.2.0 numpy-1.26.4 
opencv-python-headless-4.11.0.86 pandas-3.0.1 pandas-stubs-3.0.0.260204 priority-2.0.0 pybind11-3.0.2 
python-dateutil-2.9.0.post0 quart-0.20.0 setuptools-scm-9.2.2 six-1.17.0 starlette-0.50.0 torch-2.9.0+cpu 
torch-npu-2.9.0 torchaudio-2.9.0+cpu torchvision-0.24.0+cpu triton-3.6.0 triton-ascend-3.2.0 
vllm_ascend-0.17.0rc2.dev51+geb92e7d50 werkzeug-3.1.6 wheel-0.46.3 wsproto-1.3.2 xgrammar-0.1.32
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with 
the system package manager, possibly rendering your system unusable. It is recommended to use a virtual 
environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what 
you are doing and want to suppress this warning.
Files removed: 423 (1025.9 MB)
Directories removed: 5
```

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-03-23 20:14:42 +08:00
liuhy1213-cell
fb283b5820 [CI] Add nightly CI test cases for the GLM-5 (#7429)
### What this PR does / why we need it?
Add nightly CI test cases for the GLM-5
Add model download for the GLM-5

https://github.com/vllm-project/vllm-ascend/actions/runs/23286178651/job/67710409642#logs
- vLLM version: v0.17.0
- vLLM main:
b31e9326a7
---------
Signed-off-by: liuhaiyang27 <liuhaiyang27@huawei.com>
Signed-off-by: liuhy1213-cell <liuhy1213@gmail.com>
Co-authored-by: liuhaiyang27 <liuhaiyang27@huawei.com>
2026-03-23 19:14:19 +08:00
drslark
41dadd4312 [main][bugfix] Solved the problem of the d node getting stuck in the pd-separation scenario (#7534)
### What this PR does / why we need it?
A problem of the D node getting stuck in the PD-separation scenario is
solved.

We found that it crashes at `torch.nn.functional.linear(x, weight, bias)`
after being stuck for a long time, and that the shapes on each DP node
were not aligned; this is the root cause.

- vLLM version: v0.18.0
- vLLM main:
4034c3d32e

Signed-off-by: drslark <slarksblood@qq.com>
2026-03-23 18:53:07 +08:00
Zetong Li
a253235a59 [Doc] Add note for unsupported PCP + FULL (#7559)
### What this PR does / why we need it?
This PR adds a note to the docs that FULL mode is not supported in the PCP
scenario.

Signed-off-by: Zetong Li <slippersss@126.com>
2026-03-23 17:34:51 +08:00
Levi
9976e685b7 [Bugfix][eager][oom] fix rank0 load imbalance by no padding when multi dp (#7297)
### What this PR does / why we need it?
Fix the multi-DP padding logic for eager mode, because it causes rank0
load imbalance in kimi-k2.5-w4a8, with all the padding tokens routed to
rank0. The fix can also apply to other models in multi-DP.
- before
hbm usage:
<img width="2229" height="733" alt="image"
src="https://github.com/user-attachments/assets/50479b6d-cfd0-4206-8e80-974024652997"
/>

performance:
```shell
Concurrency  NumPrompts   QPS          TTFT_Avg     TTFT_P50     TPOT_Avg     TPOT_P50     TPOT_P90    
============ ============ ============ ============ ============ ============ ============ ============
1            15           0.0179       1667.7803    1673.3437    35.2973      35.2775      35.3784     
32           480          0.4725       2764.8027    1905.2137    40.8030      40.6978      41.0179     
64           960          0.7820       4123.7096    3485.6153    48.0461      48.1598      48.2971     
100          1500         1.0852       6216.7988    5714.0082    52.9323      53.0613      54.6304     
108          1620         1.1040       6277.4892    5798.7425    56.3862      56.9224      57.2901     
116          1740         1.1680       6563.3293    6039.5659    56.9894      57.4027      57.5786     
128          1920         1.2555       7822.5551    7604.1662    57.7660      58.1768      58.2717     
192          2880         1.4314       9212.1953    9131.3461    58.9905      59.1683      59.2791     
256          3840         1.4480       9028.0812    8913.7937    59.0092      59.2385      59.3516  
```


- after
hbm usage:
<img width="2246" height="1005" alt="image"
src="https://github.com/user-attachments/assets/d0936481-5a58-4bc5-a6f1-b92735d47885"
/>


performance:
```shell
Concurrency  NumPrompts   QPS          TTFT_Avg     TTFT_P50     TPOT_Avg     TPOT_P50     TPOT_P90    
============ ============ ============ ============ ============ ============ ============ ============
1            15           0.0181       601.4171     600.9774     35.6270      35.6254      35.6480     
32           480          0.4455       720.8782     724.2889     45.4250      45.4755      45.6318     
64           960          0.8445       729.6209     728.2149     47.0464      47.0896      47.1985     
100          1500         1.2601       723.4834     724.6673     48.3108      48.3844      48.5355     
108          1620         1.3409       727.1509     720.6772     48.8962      48.9409      49.0489     
116          1740         1.4080       679.9799     677.6119     49.1253      49.1983      49.3087     
128          1920         1.4155       680.6284     674.9436     49.2193      49.2450      49.3763     
192          2880         1.4422       684.6577     676.7833     49.2059      49.2264      49.3229     
256          3840         1.4558       685.2462     678.1709     49.2191      49.2351      49.3419    
```
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: fny-coder <985619145@qq.com>
2026-03-23 17:05:02 +08:00
Nengjun Ma
8e2c59e1ee Main2main upgrade vllm commit to 03 19 17:00 (#7478)
### What this PR does / why we need it?
Upgrade vllm commit to 2026.03.19.

1. Fix: socket removed from StatelessProcessGroup. Upstream vLLM PR
[#36330](https://github.com/vllm-project/vllm/pull/36330) ("elastic_ep:
Fix stateless group port races") refactored StatelessProcessGroup and
removed the `socket: socket.socket | None` field. The socket ownership was
moved to a new `create_tcp_store()` helper instead of being stored as a
field on the dataclass.

2. Fix: the `virtual_engine` parameter was removed from
`set_forward_context()`. Upstream: "[V0 Deprecation] Deprecate virtual engine"
[#37195](https://github.com/vllm-project/vllm/pull/37195)

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-03-23 16:25:57 +08:00
LICO67373
caa71e50ca [Perf] Simplify FIA prefill context merge path (#7293)
### What this PR does / why we need it?
This PR simplifies and hardens MLA prefill context merging in
`vllm_ascend/attention/mla_v1.py` after FIA migration by directly
building `out_list/lse_list` (without temporary chunk buffers or
`cat/stack/split`) and using `reshape` for safe flattening of
non-contiguous tensors.
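As a general PyTorch aside (not the `mla_v1.py` code itself), the reason `reshape` is the safe choice here is that it handles non-contiguous tensors where `view` would fail:

```python
import torch

x = torch.randn(4, 8).transpose(0, 1)  # non-contiguous after transpose
flat = x.reshape(-1)                   # works: copies to contiguous memory if needed
# x.view(-1) would raise a RuntimeError here because x is not contiguous.
```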

### Does this PR introduce _any_ user-facing change?
No. This is an internal refactor/stability improvement only; no
API/interface behavior changes.

### How was this patch tested?
- Verified tensor shape/data flow for `npu_attention_update` inputs
(`out_list/lse_list`) after refactor.
- Confirmed no lint errors in the modified file.
- CI UT coverage on attention/MLA paths is used for validation.

vLLM version: `v0.17.0`  
vLLM main: `vllm-project/vllm@4034c3d`

---------

Signed-off-by: lico67373 <918688502@qq.com>
2026-03-23 07:47:42 +00:00
dependabot[bot]
da866cc168 [CI] Bump docker/build-push-action from 6 to 7 (#7541)
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6 to 7.

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-23 15:46:12 +08:00