Commit Graph

2597 Commits

Author SHA1 Message Date
Cao Yi
50441e4650 [BugFix][MTP] Fix prefill misclassified as decode when prompt tokens == num_spec_tokens + 1 (#6835)
## Problem
When MTP is enabled, prefill requests with `prompt_tokens ==
num_spec_tokens + 1` are incorrectly classified as decode requests,
causing accuracy issues.

## Root Cause
The `uniform_decode` condition only checked:
- `max_num_scheduled_tokens == uniform_decode_query_len`
- `num_tokens == max_num_scheduled_tokens * num_reqs`

This is insufficient because a prefill request with a specific prompt
length can satisfy these conditions as well.

## Fix
Add `is_all_decode` check to ensure all requests have
`num_computed_tokens > 0` before classifying as uniform decode, since
decode requests must have computed at least one token.
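
A minimal sketch of the corrected classification, following the description
above (variable names are illustrative, not the exact model-runner code):

```python
# Illustrative sketch only; mirrors the description above, not necessarily
# the exact vllm-ascend model-runner code.
def is_uniform_decode(max_num_scheduled_tokens: int,
                      uniform_decode_query_len: int,
                      num_tokens: int,
                      num_reqs: int,
                      num_computed_tokens_per_req: list[int]) -> bool:
    # Original conditions: every request schedules exactly
    # uniform_decode_query_len tokens and the totals are consistent.
    uniform = (max_num_scheduled_tokens == uniform_decode_query_len
               and num_tokens == max_num_scheduled_tokens * num_reqs)
    # New condition: a real decode request has already computed at least one
    # token, while a prefill with prompt_tokens == num_spec_tokens + 1 has not.
    is_all_decode = all(n > 0 for n in num_computed_tokens_per_req)
    return uniform and is_all_decode
```
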
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2026-03-05 17:33:10 +08:00
dsxsteven
91c39ebae6 [BugFix] [dcp] Fix GQA Model Error when Enable both DP and DCP (#7012)
### What this PR does / why we need it?
For GQA models, when both DP and DCP are enabled (with PCP disabled), the
key-value pairs were not being captured correctly; this is now fixed.


Signed-off-by: dsxsteven <dsxsteven@sina.com>
2026-03-05 16:51:08 +08:00
zhangxinyuehfad
1e4017e3fa [CI] support nightly ci for per pr by labels (#6483)
### What this PR does / why we need it?

This PR refactors the nightly CI workflows (A2 and A3) to support
running tests against a specific PR's code, in addition to the existing
scheduled/dispatch runs using pre-built images.

#### Motivation:
Previously, nightly tests could only be triggered by schedule or
workflow_dispatch, always using the pre-built nightly image. This change
allows developers to trigger nightly tests against their own PR's source
code, enabling early validation without waiting for a nightly build.

#### Changes
Trigger logic (parse-trigger job)

A new parse-trigger job is introduced in both
schedule_nightly_test_a2.yaml and schedule_nightly_test_a3.yaml to
centralize trigger evaluation:

- `schedule / workflow_dispatch`: runs all tests with the pre-built image
(existing behavior preserved)
- `pull_request (labeled + synchronize)`: runs only when the PR has the
nightly-test label and a `/nightly [test-names]` comment exists (the latest
one wins; a parsing sketch follows this list):

1. `/nightly` or `/nightly all` — runs all tests
2. `/nightly test1 test2` — runs only the named tests (comma-wrapped for
exact matching)
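
A rough sketch of how the latest `/nightly` comment could be turned into a
test filter (hypothetical helper shown for illustration; the real logic lives
in the workflow's parse-trigger job):

```python
# Hypothetical parser for the latest "/nightly ..." comment; illustrative only.
def parse_nightly_comment(comment: str) -> str:
    parts = comment.strip().split()
    if not parts or parts[0] != "/nightly":
        return ""                       # not a nightly trigger
    names = [p for p in parts[1:] if p != "all"]
    if not names:
        return "all"                    # "/nightly" or "/nightly all"
    # Comma-wrap names so downstream matching is exact, e.g. ",test1,test2,"
    return "," + ",".join(names) + ","
```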

#### How to trigger
1. Add the nightly-test label to your PR
2. Comment /nightly (all tests) or /nightly test1 test2 (specific tests)
3. Re-triggering: add another /nightly comment and push a new commit
(synchronize event)

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-05 16:46:37 +08:00
zhangxinyuehfad
a6745b8577 [CI] fix test_qwen3_moe_external_launcher_ep_tp2 (#6951)
### What this PR does / why we need it?
Fix test_qwen3_moe_external_launcher_ep_tp2 by waiting for NPU memory to be
freed (wait_until_npu_memory_free).

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-05 16:43:45 +08:00
tanhaoan333
1f2a083597 [bugfix]Qwen-Omni quantization model_type bugfix (#7007)
### What this PR does / why we need it?
[bugfix]Qwen-Omni quantization model_type bugfix
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>
2026-03-05 16:34:34 +08:00
realliujiaxu
1a7f845696 [Feat][Worker] NPUWorker Profiler profile_prefix full adaptation (RFC #6954) (#6968)
## What this PR does / why we need it?

Implements [RFC
#6954](https://github.com/vllm-project/vllm-ascend/issues/6954):
NPUWorker Profiler profile_prefix full adaptation for API parity with
upstream vLLM.

### Changes
- **Lazy profiler init**: Defer profiler creation until first
`profile(is_start=True)` call
- **profile_prefix param**: Add `profile_prefix` to `profile()`; compute
`trace_name` from prefix + `get_worker_rank_suffix()`
- **Refactor `_init_profiler` → `_create_profiler(trace_name)`**: Pass
`worker_name` to `tensorboard_trace_handler` so each worker writes a unique
trace file, with no collisions in multi-worker setups (see the sketch after
this list)
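
A simplified sketch of the lazy-init pattern and trace naming described above,
using the stock torch.profiler API for illustration (the actual NPUWorker uses
the Ascend profiler and `get_worker_rank_suffix()`, so details differ):

```python
import torch

class WorkerWithLazyProfiler:
    """Illustrative only; not the real NPUWorker implementation."""

    def __init__(self, rank: int, trace_dir: str = "./traces"):
        self.rank = rank
        self.trace_dir = trace_dir
        self.profiler = None  # deferred: nothing is created at init time

    def _create_profiler(self, trace_name: str):
        # worker_name makes each worker write a distinct trace file
        handler = torch.profiler.tensorboard_trace_handler(
            self.trace_dir, worker_name=trace_name)
        return torch.profiler.profile(on_trace_ready=handler)

    def profile(self, is_start: bool, profile_prefix: str = "profile"):
        if is_start:
            if self.profiler is None:
                # trace_name = prefix + per-worker rank suffix
                self.profiler = self._create_profiler(
                    f"{profile_prefix}_rank{self.rank}")
            self.profiler.start()
        elif self.profiler is not None:
            self.profiler.stop()
```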

### Testing
- Unit tests updated/added in `tests/ut/worker/test_worker_v1.py`
- `pytest tests/ut/worker/test_worker_v1.py::TestNPUWorker` passed

## Does this PR introduce _any_ user-facing change?
Yes. Trace file naming may differ (more descriptive with worker rank
suffix). `profile(is_start=True, profile_prefix="warmup")` now
supported.

## How was this patch tested?
- Unit tests: `pytest tests/ut/worker/test_worker_v1.py::TestNPUWorker`
- Manual: vLLM serve with profiler config, start/stop profile, verified
trace files

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2026-03-05 16:18:34 +08:00
LeeWenquan
3047b724b3 Add GemmaRmsNorm ACLGraph Support (#6473)
### What this PR does / why we need it?
1. New Custom NPU Operation: Introduced npu_gemma_rms_norm in
csrc/torch_binding.cpp to provide optimized Gemma RMS Normalization
support for Ascend NPUs. This function includes logic to handle dynamic
shapes for the gamma tensor. (A reference sketch of the math follows this
list.)
2. PyTorch Operator Registration: The new npu_gemma_rms_norm operation
has been registered with the PyTorch custom operator library, making it
accessible from Python.
3. Meta-Implementation for ACLGraph: A corresponding meta-implementation,
npu_gemma_rms_norm_meta, was added in csrc/torch_binding_meta.cpp. This
is crucial for symbolic tracing, allowing the custom kernel to be
captured and optimized by ACLGraph.
4. Python Frontend Integration: The vllm_ascend/ops/layernorm.py file
was updated to use the newly added
torch.ops._C_ascend.npu_gemma_rms_norm for Gemma RMS Normalization,
replacing the generic torch_npu.npu_rms_norm.
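
For reference, the math the fused kernel implements is Gemma-style RMS
normalization, which applies the learned weight as (1 + gamma) rather than
gamma; a plain eager-PyTorch sketch (not the kernel itself):

```python
import torch

def gemma_rms_norm(x: torch.Tensor, gamma: torch.Tensor,
                   eps: float = 1e-6) -> torch.Tensor:
    # Normalize by the root-mean-square over the last dimension
    variance = x.float().pow(2).mean(dim=-1, keepdim=True)
    x_normed = x.float() * torch.rsqrt(variance + eps)
    # Gemma applies the learned weight as (1 + gamma), unlike standard RMSNorm
    return (x_normed * (1.0 + gamma.float())).to(x.dtype)
```
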
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
Signed-off-by: LeeWenquan <83354342+SunnyLee151064@users.noreply.github.com>
2026-03-05 16:15:07 +08:00
LI SHENGYONG
5a3744c542 [EPLB] The profiling can collect the time required for adjusting the eplb. (#7001)
### What this PR does / why we need it?
To analyze the overhead of the dynamic EPLB adjustment framework in
detail, we added the time consumed by the adjustment to the information
printed in profiling mode.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


![Snipaste_2026-03-05_11-42-28](https://github.com/user-attachments/assets/41c2b82a-5dfa-4e39-8b50-f4649deed30c)

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-03-05 16:10:57 +08:00
songjianquan
43c8da3574 [Feat]fused_qkvzba_split_reshape supports token number greater than 65536 (#6740)
### What this PR does / why we need it?

This pull request optimizes the fused_qkvzba_split_reshape_cat Triton
kernel for Qwen3-Next GatedDeltaNet model and removes the previous
conditional restrictions in the forward pass.
Key changes:
1. Refactored Triton kernel implementation: The
fused_qkvzba_split_reshape_cat_kernel has been optimized with a new
loop-based approach that supports arbitrary num_v_heads / num_k_heads
ratios and batch sizes. The kernel now uses configurable ROWS_PER_ITER
for better memory utilization.
2. The optimized kernel now handles all scenarios directly without
requiring a fallback path using fix_query_key_value_ordering and
torch.cat.

### Does this PR introduce _any_ user-facing change?
No. This is an internal optimization of the Triton kernel implementation
and does not introduce any user-facing changes.

### How was this patch tested?
CI is expected to pass with existing tests.

- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: songjianquan <songjianquan1@huawei.com>
Co-authored-by: songjianquan <songjianquan1@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-03-05 14:41:38 +08:00
wangxiyuan
13777bf3f0 [Spec Decode]clean up spec decode interface (#6947)
This pull request refactors the speculative decoding proposer interface
to align with upstream vLLM, removing the local `Proposer` interface and
renaming methods to `propose`.

This is the first step. In the future we should remove the class
register and just add few Ascend specified method once the arch in vLLM
is ready.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-03-05 14:30:10 +08:00
rjg-lyh
2bd9c35788 [perf][refactor] Refactor and optimize sfa_v1.py for dsv3.2/glm5 (#6874)
### What this PR does / why we need it?
This PR refactors sfa_v1.py to improve code readability and usability,
fixes a code bug, and enhances performance through the replacement of
certain operators.

### changes
- **improve code readability**: Optimizes parts of the code structure in
sfa_v1.py, adds supplementary comments for key code blocks, removes some
unused variables, and improves the naming of certain functions and
variables.

- **resolve a redundant double write to k_cache**: Fixed redundant
double writes of k_cache in the indexer_select module (in both the
`forward` function and `indexer_select_post_process`), improving
performance to some extent.

- **replace `scatter` ops with `reshape_and_cache`**: This optimization
replaces two separate cache storage operations on `k_nope` and `k_pe`
with a single call to the `reshape_and_cache` operator, improving
performance. The original `scatter` operator reorders slot_mapping for
generality, introducing significant scalar computation. In contrast, the
`reshape_and_cache` operator eliminates this redundant reordering step,
reducing unnecessary computation time (a rough before/after sketch
follows this list).
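
Conceptually, the change looks roughly like the following (placeholder tensor
and operator names; not the exact sfa_v1.py code):

```python
import torch

def write_k_cache_before(k_cache_nope, k_cache_rope, k_nope, k_pe,
                         slot_mapping):
    # Before: two separate scatter-style writes, each internally reordering
    # slot_mapping (scalar-heavy on NPU).
    k_cache_nope.index_copy_(0, slot_mapping, k_nope)
    k_cache_rope.index_copy_(0, slot_mapping, k_pe)

def write_k_cache_after(k_cache_nope, k_cache_rope, k_nope, k_pe,
                        slot_mapping, reshape_and_cache):
    # After: one fused call stores both parts and skips the reordering step.
    reshape_and_cache(k_nope, k_pe, k_cache_nope, k_cache_rope, slot_mapping)
```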

### performance comparison
Setup: 4*A3, 1P1D, P dp2tp16, D dp8tp4, input/output: 64K/3K

| Variant | TTFT | TPOT | TPS |
| :--- | :--- | :--- | :--- |
| origin | **28s** | 26ms | **820 token/s** |
| fixed redundant double writes of k_cache | **24s** | 26ms | **840 token/s** |
| replace scatter ops with reshape_and_cache | **24s** | 26ms | **850 token/s** |

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
2026-03-05 14:27:11 +08:00
Ronald
77e009d9fc [Feature] Add docs of batch invariance and make some extra operators patch (#6910)
### What this PR does / why we need it?

This PR adds docs for batch invariance and patches some extra operators
according to the validation results.
Please see https://github.com/vllm-project/vllm-ascend/issues/5487 to
track progress.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-03-05 09:12:40 +08:00
tanhaoan333
f8315f5717 [bugfix] Qwen2.5VL accuracy issue (#6975)
### What this PR does / why we need it?
The attention mechanism in the ViT model architecture of Qwen2.5VL
consists of two parts and does not support using cache to pass sequence
lengths.
### Does this PR introduce _any_ user-facing change?
remove seq_lens_cache
### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>
2026-03-04 22:02:29 +08:00
zhangxinyuehfad
566c367a10 [CI] Add DeepSeek-V3.2 large EP nightly ci (#6378)
### What this PR does / why we need it?

Add DeepSeek-V3.2 nightly ci

Fix PD routing to exclude headless nodes when collecting
prefiller/decoder IPs

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-04 16:15:56 +08:00
Zhujiyang2
c3c265648f [Ops][BugFix] Fix RoPE shape mismatch for mtp models with flashcomm v1 enabled (#6939)
### What this PR does / why we need it?
When using a draft model (e.g., in MTP speculative decoding) with shared
expert data parallelism (enabled via flashcomm), a shape mismatch error
occurs in the rotary embedding calculation for models like GLM-4.7. This
is because the positions tensor has an incorrect shape for this specific
configuration.

This PR fixes the issue by adding a check in
AscendRotaryEmbedding.forward_oot. If the model is a draft model and
shared expert DP is enabled, it processes the positions tensor using
torch.ops.vllm.maybe_all_gather_and_maybe_unpad to ensure its shape is
correct before applying the rotary embedding. This resolves the shape
mismatch error.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>
2026-03-04 16:02:08 +08:00
SILONG ZENG
95b44d7b73 [bugfix]fix file not found error in nightly of single-node (#6976)
### What this PR does / why we need it?
1. The **main image build** takes approximately **two hours**. The main
image build time needs to be moved forward to **21:00 (UTC+8)** to ensure
that the nightly image build can use the latest main image.
``` bash
schedule:
   # UTC+8: 8am, 12pm, 16pm, 22pm
   - cron: '0 0,4,8,14 * * *'
```
--->
``` bash
schedule:
   # UTC+8: 8am, 12pm, 16pm, 21pm
   - cron: '0 0,4,8,13 * * *'
```
Link:
https://github.com/vllm-project/vllm-ascend/actions/runs/22632712302/job/65641055135#step:8:26

2. The nightly test is encountering the following error: 
``` bash
ImportError: ascend_transport.so: cannot open shared object file: No such file or directory.
```
The path needs to be added:
``` bash
echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib" >> ~/.bashrc
```
Link:
https://github.com/vllm-project/vllm-ascend/actions/runs/22632712302/job/65641054911#step:7:529
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-03-04 11:47:26 +08:00
zhaomingyu13
52d9086f64 [Bugfix] Fix the acceptance rate drop issue when applying eagle3 to QuaRot model (#6914)
### What this PR does / why we need it?
When using the target model after rotational quantization, the
acceptance rate decreases because the fc weight of the draft model has
not undergone rotational quantization (issue: #6445). We fixed this issue
by performing rotational quantization on the fc weight of the draft model
in the same way as the main model when loading the draft model.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
2026-03-04 11:29:49 +08:00
Li Wang
d431d7d526 [CI] Enable auto upgrade e2e estimated time for auto-partition suites (#6840)
### What this PR does / why we need it?
This patch adds a schedule-triggered workflow that auto-updates the e2e
estimated time for better load balance:
1. The workflow runs the full e2e test suite to get the duration of each
test.
2. The script `update_estimated_time.py` updates the
[config.json](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/scripts/config.yaml)
according to the latest timings.
3. The workflow automatically submits a pull request that includes the
changes to `config.json`.
<img width="2484" height="764" alt="image"
src="https://github.com/user-attachments/assets/02f3459c-bb3b-4f8e-9966-8bb2e5c1bbea"
/>


### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-04 10:38:34 +08:00
NJX
c7fd7a25f7 [Doc][Misc] Fix msprobe_guide.md documentation issues (#6965)
## What this PR does / why we need it?

Fixes several documentation issues in the msprobe debugging guide as
reported in #6065:

1. **Remove unnecessary `cat` heredoc wrapper**: The example
configuration section used a `cat <<'JSON'` bash wrapper around the JSON
config. Simplified to a plain JSON code block.
2. **Fix duplicate chapter numbering**: Two sections were both numbered
'2'. Renumbered sections sequentially (0-6).
3. **Fix msprobe command**: Changed `msprobe graph_visualize` to
`msprobe -f pytorch graph` in section 5.2 Visualization.
4. **Remove backward-related content**: Since vllm is inference-only (no
training), removed all backward pass references including backward
tensor examples, parameter gradient examples, and backward descriptions
from dump.json explanations.

## Does this PR introduce _any_ user-facing change?

Documentation improvement only. No code changes.

## How was this patch tested?

Manual review of the markdown file to verify all 4 issues from #6065 are
addressed.

Closes #6065
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: NJX-njx <3771829673@qq.com>
2026-03-04 10:28:31 +08:00
SILONG ZENG
859f2c25b9 [Nightly][Refactor]Migrate nightly single-node model tests from .py to .yaml (#6503)
### What this PR does / why we need it?
This PR refactors the nightly single-node model test by migrating test
configurations from Python scripts to a more maintainable `YAML-based`
format.

| Original PR | Python (`.py`) | YAML (`.yaml`) |
| :--- | :--- | :--- |
| [#3568](https://github.com/vllm-project/vllm-ascend/pull/3568) |
`test_deepseek_r1_0528_w8a8_eplb.py` | `DeepSeek-R1-0528-W8A8.yaml` |
| [#3631](https://github.com/vllm-project/vllm-ascend/pull/3631) |
`test_deepseek_r1_0528_w8a8.py` | `DeepSeek-R1-0528-W8A8.yaml` |
| [#5874](https://github.com/vllm-project/vllm-ascend/pull/5874) |
`test_deepseek_r1_w8a8_hbm.py` | `DeepSeek-R1-W8A8-HBM.yaml` |
| [#3908](https://github.com/vllm-project/vllm-ascend/pull/3908) |
`test_deepseek_v3_2_w8a8.py` | `DeepSeek-V3.2-W8A8.yaml` |
| [#5682](https://github.com/vllm-project/vllm-ascend/pull/5682) |
`test_kimi_k2_thinking.py` | `Kimi-K2-Thinking.yaml` |
| [#4111](https://github.com/vllm-project/vllm-ascend/pull/4111) |
`test_mtpx_deepseek_r1_0528_w8a8.py` | `MTPX-DeepSeek-R1-0528-W8A8.yaml`
|
| [#3733](https://github.com/vllm-project/vllm-ascend/pull/3733) |
`test_prefix_cache_deepseek_r1_0528_w8a8.py` |
`Prefix-Cache-DeepSeek-R1-0528-W8A8.yaml` |
| [#6543](https://github.com/vllm-project/vllm-ascend/pull/6543) |
`test_qwen3_235b_w8a8.py` | `Qwen3-235B-A22B-W8A8.yaml` |
| [#6543](https://github.com/vllm-project/vllm-ascend/pull/6543) |
`test_qwen3_235b_a22b_w8a8_eplb.py` | `Qwen3-235B-A22B-W8A8.yaml` |
| [#3973](https://github.com/vllm-project/vllm-ascend/pull/3973) |
`test_qwen3_30b_w8a8.py` | `Qwen3-30B-A3B-W8A8.yaml` |
| [#3541](https://github.com/vllm-project/vllm-ascend/pull/3541) |
`test_qwen3_32b_int8.py` | `Qwen3-32B-Int8.yaml` |
| [#3757](https://github.com/vllm-project/vllm-ascend/pull/3757) |
`test_qwq_32b.py` | `QwQ-32B.yaml` |
| [#5616](https://github.com/vllm-project/vllm-ascend/pull/5616) |
`test_qwen3_next_w8a8.py` | `Qwen3-Next-80B-A3B-Instruct-W8A8.yaml` |
| [#3541](https://github.com/vllm-project/vllm-ascend/pull/3541) |
`test_qwen2_5_vl_7b.py` | `Qwen2.5-VL-7B-Instruct.yaml` |
| [#5301](https://github.com/vllm-project/vllm-ascend/pull/5301) |
`test_qwen2_5_vl_7b_epd.py` | `Qwen2.5-VL-7B-Instruct-EPD.yaml` |
| [#3707](https://github.com/vllm-project/vllm-ascend/pull/3707) |
`test_qwen2_5_vl_32b.py` | `Qwen2.5-VL-32B-Instruct.yaml` |
| [#3676](https://github.com/vllm-project/vllm-ascend/pull/3676) |
`test_qwen3_32b_int8_a3_feature_stack3.py` |
`Qwen3-32B-Int8-A3-Feature-Stack3.yaml` |
| [#3709](https://github.com/vllm-project/vllm-ascend/pull/3709) |
`test_prefix_cache_qwen3_32b_int8.py` |
`Prefix-Cache-Qwen3-32B-Int8.yaml` |
| [#5395](https://github.com/vllm-project/vllm-ascend/pull/5395) |
`test_qwen3_next.py` | `Qwen3-Next-80B-A3B-Instruct-A2.yaml` |
| [#3474](https://github.com/vllm-project/vllm-ascend/pull/3474) |
`test_qwen3_32b.py` | `Qwen3-32B.yaml` |
| [#3541](https://github.com/vllm-project/vllm-ascend/pull/3541) |
`test_qwen3_32b_int8.py` | `Qwen3-32B-Int8-A2.yaml` |
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-03-03 20:13:43 +08:00
Cao Yi
a0a904a3d4 [BugFix] Improve GDN layer detection for multimodal models (#6941)
## Summary
- Enhanced `check_gdn_layer()` function to properly detect GDN layers in
multimodal models
- Added support for checking `text_config.layer_types` in addition to
root-level `layer_types`
- Fixed potential None reference errors when `layer_types` attribute is
missing

## Changes
- Modified `vllm_ascend/utils.py`:
  - Replaced `hasattr()` check with safer `getattr()` approach
  - Added fallback to empty list when `layer_types` is None
- Added secondary check for `text_config.layer_types` to support models
like Qwen-Omni

## Motivation
Previous implementation only checked `layer_types` at the root config
level, which failed to detect GDN layers in multimodal models where this
information is nested under `text_config`. Additionally, it could raise
errors when `layer_types` was None.
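
An illustrative sketch of the detection logic described above (the real helper
is `check_gdn_layer()` in `vllm_ascend/utils.py`; the layer-type string below
is a placeholder):

```python
def check_gdn_layer(hf_config) -> bool:
    # Safer getattr() with a fallback to an empty list when layer_types is None
    layer_types = getattr(hf_config, "layer_types", None) or []
    if not layer_types:
        # Multimodal models (e.g. Qwen-Omni) nest this under text_config
        text_config = getattr(hf_config, "text_config", None)
        layer_types = getattr(text_config, "layer_types", None) or []
    return "linear_attention" in layer_types  # placeholder GDN layer-type name
```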

---

Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>
Co-authored-by: SlightwindSec <slightwindsec@gmail.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code)
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>
2026-03-03 20:08:39 +08:00
weiguihua2
5b05b3a090 [feat]ds3.2 pcp support mtp and chunkprefill (#6917)
### What this PR does / why we need it?
ds3.2 pcp supports the combination of MTP and chunkprefill features.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2026-03-03 19:03:50 +08:00
Frank Chen
b771ca9a47 [CPU binding] Implement global CPU slicing and improve IRQ binding for Ascend NPUs (#6945)
### What this PR does / why we need it?

This PR introduces global CPU slicing for Ascend NPUs to ensure
non-overlapping CPU partitions, addresses IRQ binding logical errors on
A3, and enhances the logic for determining total NPUs in CPU allocation.
These changes are necessary to optimize CPU resource management and
improve system stability.

- **Global CPU Slicing**: Introduced a global CPU slicing mechanism for
Ascend NPUs to ensure non-overlapping CPU partitions across multiple
processes or data parallel groups, preventing resource contention (see the
sketch after this list).
- **Improved IRQ Binding for A3 Devices**: Refined the IRQ binding logic
specifically for Ascend A3 devices, correctly mapping logical NPU IDs to
physical card and chip IDs for accurate npu-smi queries and preventing
multi-process overwrite of IRQ settings.
- **Enhanced NPU Count Determination**: Improved the logic for
determining the total number of logical NPUs, prioritizing NPU mapping
information to ensure more accurate CPU allocation.
- **Minimum CPU Requirement**: Established a minimum requirement of 5
CPUs per NPU for binding, reserving specific cores for IRQ, main, ACL,
and release operations to ensure stable operation.
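
A minimal sketch of the disjoint per-NPU CPU slicing idea (illustrative only;
the real logic additionally handles NUMA topology, IRQ cores, and npu-smi
mapping):

```python
def slice_cpus(allowed_cpus: list[int], num_npus: int,
               min_per_npu: int = 5) -> list[list[int]]:
    per_npu = len(allowed_cpus) // num_npus
    if per_npu < min_per_npu:
        raise ValueError("not enough CPUs: need at least 5 per NPU for binding")
    # Disjoint, non-overlapping slices: NPU i gets cpus[i*per_npu:(i+1)*per_npu]
    return [allowed_cpus[i * per_npu:(i + 1) * per_npu]
            for i in range(num_npus)]
```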

### Does this PR introduce _any_ user-facing change?

No user-facing changes are introduced.

### How was this patch tested?

CI passed with new added/existing tests.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: c00818886 <chenchuwei@huawei.com>
2026-03-03 17:20:52 +08:00
linfeng-yuan
700423156f [Triton] Centralize Ascend extension op dispatch in triton_utils (#6937)
### What this PR does / why we need it?

This pull request refactors the dispatch mechanism for the
**triton-ascend-specific operators** `insert_slice`, `extract_slice`,
and `get_element` to ensure compatibility with both CANN 8.5 and 9.0.

A unified helper function, `_resolve_triton_ascend_op`, has been
introduced in `vllm_ascend/ops/triton/triton_utils.py`. This function
dynamically resolves these operators by first attempting to import them
from the `triton.language.extra.cann.extension` module, which is present
in newer CANN versions. If that fails, it falls back to the standard
`triton.language` module.

This approach centralizes operator dispatch logic, allowing individual
Triton kernels to use these functions without being aware of the
underlying Triton/CANN version. All call sites have been updated to use
these new unified functions.
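
The fallback-import pattern described above looks roughly like this
(simplified; the real helper lives in
`vllm_ascend/ops/triton/triton_utils.py`):

```python
import importlib

def _resolve_triton_ascend_op(name: str):
    try:
        # Newer CANN: the op lives in the triton-ascend extension module
        ext = importlib.import_module("triton.language.extra.cann.extension")
        return getattr(ext, name)
    except (ImportError, AttributeError):
        # Older CANN: fall back to the standard triton.language namespace
        import triton.language as tl
        return getattr(tl, name)
```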

### Does this PR introduce _any_ user-facing change?

No. This is an internal refactoring of operator implementations and does
not introduce any user-facing changes.

### How was this patch tested?

CI is expected to pass with existing tests.

**Testing Context:**
- vLLM version: v0.16.0
- vLLM main: `15d76f74e2fdb12a95ea00f0ca283acf6219a2b7`

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-03 17:10:30 +08:00
linfeng-yuan
cb893bcdb0 [csrc][bugfix] Add compile-time Ascend950/910_95 compatibility for custom ops between CANN8.5 and 9.0 (#6936)
### What this PR does / why we need it?
Remove hardcoded ASCEND910_95 usage in csrc custom-op host/tiling code
and
select the SoC target at CMake configure time.

- Probe CANN headers with check_cxx_source_compiles:
prefer platform_ascendc::SocVersion::ASCEND950, fallback to
ASCEND910_95.
- Export the selected enum/config string via shared compile definitions
(VLLM_ASCEND_950_SOC_ENUM / VLLM_ASCEND_950_SOC_CONFIG).
- Apply the shared macros to affected paths (moe_gating_top_k,
add_rms_norm_bias) to avoid per-file hardcoding.
- Keep behavior unchanged; this is an internal build-compatibility fix
for CANN 8.5 and 9.x.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-03 17:08:22 +08:00
Shaoxu Cheng
2064afe380 [300I][Bugfix] fix unquant model weight nd2nz error (#6851)
### What this PR does / why we need it?
- This PR fixes an issue with weight format conversion for unquantized
models running on Ascend 310P devices.

- The changes refactor the logic for converting weights to the
FRACTAL_NZ format. Previously, this was handled in a 310P-specific
linear layer implementation (`AscendUnquantizedLinearMethod310`). This
implementation has been removed, and the logic is now centralized in the
`maybe_trans_nz` utility function. This function now checks whether the
device is a 310P and applies the NZ format cast accordingly for
`float16`/`bfloat16` weights (a minimal sketch follows this list).

- This refactoring simplifies the code by removing platform-specific
duplication and ensures correct weight handling for unquantized models
on 310P.
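
A minimal sketch of the centralized helper described above (assumes torch_npu
is available and uses 29 as the ACL FRACTAL_NZ format id; the real function in
vllm_ascend may differ):

```python
import torch
import torch_npu  # Ascend PyTorch adapter

ACL_FORMAT_FRACTAL_NZ = 29

def maybe_trans_nz(weight: torch.Tensor, is_310p: bool) -> torch.Tensor:
    if is_310p and weight.dtype in (torch.float16, torch.bfloat16):
        # Cast the unquantized weight to the NZ layout expected on 310P
        return torch_npu.npu_format_cast(weight, ACL_FORMAT_FRACTAL_NZ)
    return weight
```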

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
ut and local test
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-03 15:57:26 +08:00
zzzzwwjj
f19f7b1fe2 [doc] fix supported_models (#6930)
### What this PR does / why we need it?

Add experimental supported models/features to supported_models.md

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2026-03-03 09:47:50 +08:00
starmountain1997
248d07566f [CI] nightly test timeout (#6912)
### What this PR does / why we need it?

The nightly test is currently failing due to a
[timeout](https://github.com/vllm-project/vllm-ascend/actions/runs/22547280169/job/65326335134).

As noted in #6778, this issue can be resolved by applying this fix.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

run nightly test.

Co-authored-by: guozr <guozr1997@hotmail.com>
2026-03-03 09:31:46 +08:00
Xiaoshuang Wang
f7a8befc20 [CI] Upgrade CANN to 8.5.1 (#6897)
### What this PR does / why we need it?
[CI] Upgrade CANN to 8.5.1

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-03-03 09:02:42 +08:00
tanhaoan333
15f6564976 [Model]Add Qwen3-Omni quantization Ascend NPU adaptation and optimization (#6828)
### What this PR does / why we need it?
This pull request is for quantization adaptation of Qwen3Omni, and it
achieves operator-level optimization and AUT (Auto-Quantization Tuning)
component optimization through patch-based modifications.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>
2026-03-03 00:07:23 +08:00
wangxiaoteng888
dfa9ff7f2a [P/D][v0.16.0]Adapt to RecomputeScheduler in vLLM 0.16.0 (#6898)
### What this PR does / why we need it?
Adapt the recompute feature to vLLM 0.16.0, where the D node forwards
recompute requests to the P node.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-03-02 23:24:03 +08:00
pu-zhe
5899438a86 [Feat][310p] 310P support w8a8s quantization and saving w8a8sc state (#6878)
### What this PR does / why we need it?
This pull request introduces significant enhancements for 310P device
support, primarily by enabling W8A8S quantization and facilitating the
saving of models with W8A8SC state outputs. It provides an example
script for saving sharded and compressed model states, implements the
core W8A8S quantization method, and integrates metadata generation
within the 310P worker to accurately describe the quantization types of
saved parameters. These changes aim to improve efficiency and
compatibility for quantized models on 310P hardware.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
W8A8S accuracy test and W8A8SC states save.
<img width="886" height="184" alt="image"
src="https://github.com/user-attachments/assets/e9bcac54-1f69-4d3a-a5b8-221a147ef99d"
/>

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>
2026-03-02 20:09:15 +08:00
linfeng-yuan
68d8d20ca2 [misc] move mxfp_compat into device to decouple from quantization init chain (#6918)
### What this PR does / why we need it?
`mxfp_compat` only provides dtype/symbol compatibility helpers for
different `torch_npu` versions, but it was placed under
`vllm_ascend.quantization`. Importing it from device/ops paths could
trigger `quantization/__init__.py` and pull in heavy quantization method
dependencies, increasing startup coupling and causing import-cycle risk
(especially on 310P paths).

### Does this PR introduce _any_ user-facing change?
No functional behavior change intended.

### How was this patch tested?
CI passed.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-02 18:17:01 +08:00
pu-zhe
632801b0ad [CI][310P] Add 310p tracked files in CI light. (#6923)
### What this PR does / why we need it?
Add 310p tracked files in CI light:
- `vllm_ascend/attention/attention_v1.py`
- `vllm_ascend/ops/fused_moe/**`
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI test
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: pu-zhe <zpuaa@outlook.com>
2026-03-02 18:03:46 +08:00
whx
16c879cdf7 [Triton][Config] Add muls_add triton kernel and refactor AscendCompilationConfig (#5518)
### What this PR does / why we need it?
Add muls_add triton kernel with related fusion pass. What's more, this
PR refactors `AscendCompilationConfig` and deletes `NpugraphExConfig`.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
CI passed with new added test.


- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2026-03-02 17:54:25 +08:00
realliujiaxu
8547520726 [Doc][Misc] Update AGENTS.md with sign-off and PR template requirements (#6892)
### What this PR does / why we need it?

This PR enhances the `AGENTS.md` contribution guidelines with the
following improvements:

1. **Add sign-off requirement** - All commits must include
"Signed-off-by:" line using `git commit -s`
2. **Add PR title format** - Document the `[Type][Module] Description`
format for PR titles
3. **Add PR template reference** - Guide contributors to follow
`.github/PULL_REQUEST_TEMPLATE.md`
4. **Add linting check step** - Require running `bash format.sh ci`
before pushing
5. **Add lint note for all file types** - Emphasize markdown files also
need lint checking
6. **Add fork workflow guidance** - Clarify to push to fork repository,
not main repository
7. **Clarify test section** - Provide specific examples for "How was
this patch tested?"
8. **Add vLLM version note** - Warn about preserving auto-added vLLM
version info

### Does this PR introduce _any_ user-facing change?

No. This is a documentation-only update for contributors.

### How was this patch tested?

- Verified the markdown formatting renders correctly
- Confirmed all links and references are valid
- Ran `bash format.sh ci` locally and all checks passed

---

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2026-03-02 16:44:59 +08:00
Yuzhou Tong
9180dd6c51 [BugFix][PCP] Fix precision bugs for pcp/dcp in PD disaggregation (#6876)
### What this PR does / why we need it?
Fix a bug in PD disaggregation with PCP/DCP: some conditions only considered
MLA while ignoring DSA.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
15d76f74e2
- vLLM Ascend main: 81fb7d5779

Signed-off-by: tongyuzhou <tongyuzhou1@huawei.com>
Co-authored-by: tongyuzhou <tongyuzhou1@huawei.com>
2026-03-02 16:11:00 +08:00
Shaoxu Cheng
ddc78dbade [300I] support decode-only aclgraph mode (#6849)
### What this PR does / why we need it?
Adds a decode-only aclgraph mode for 310P, which still has some limitations:
- Due to the event-id hardware limit, the number of graphs is limited.
- The CANN version that supports this feature is not yet available outside
Huawei.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
local test
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-02 14:15:14 +08:00
dependabot[bot]
86c9109d16 Bump actions/upload-artifact from 6 to 7 (#6906)
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 6 to 7.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-02 14:08:28 +08:00
dependabot[bot]
002ec24dd8 Bump actions/download-artifact from 7 to 8 (#6907)
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 7 to 8.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-02 14:07:59 +08:00
Eric-dot
3c66a970f2 add mxfp8 moe quantization (#6670)
### What this PR does / why we need it?
Support mxfp8 quantization (Qwen MoE).
Uses an adaptor to make the hardware-specific behavior clearer and more
maintainable.
### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
13397841ab

---------

Signed-off-by: fangrongcan <17343701736@163.com>
Signed-off-by: wangyao-i <iwangyao@outlook.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Eric-dot <60131170+Eric-dot@users.noreply.github.com>
Co-authored-by: fangrongcan <f00876277@china.huawei.com>
Co-authored-by: wangyao-i <iwangyao@outlook.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
2026-03-02 11:04:06 +08:00
wjunLu
c324053b44 [CI] Revert speedup image building and CI Installation related PRs (#6891)
### What this PR does / why we need it?

Revert speedup image building and CI Installation related PRs

git revert 8835236181
git revert 64fba51275
git revert 263c2f8e8d
git revert 84b00695f8


### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-03-02 08:53:10 +08:00
Frank Chen
a77fe932e4 [Platform] Fix CPU binding logic (#6889)
### What this PR does / why we need it?

- Rework CpuAlloc.handle_no_affinity() to build available NUMA nodes
after allowed_cpus filtering, assign NPUs to NUMA nodes via round‑robin,
and split CPUs per NPU with disjoint slices for better balance.
- Improve bind_memory() robustness by deriving the target NUMA from each
NPU’s CPU pool, validating NUMA existence, and skipping binding when
data is missing.
- bind_memory() now only bind the single NUMA node that corresponds to
NPU id, instead of 2 NUMA nodes.
- Fix the issue that all NPUs bind to 0th NUMA node when DP16 due to
global NPU id is not visible across DP domain.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added/updated unit tests:

test_cpu_binding.py
1.   test_binding_mode_table covers A2 vs A3 binding mode mapping.
2. test_build_cpu_pools_fallback_to_numa_balanced covers fallback when
affinity info is missing.
3. TestBindingSwitch.test_is_arm_cpu covers ARM/x86/unknown arch
detection.
4.   test_bind_cpus_skip_non_arm covers non‑ARM skip path in bind_cpus.

test_worker_v1.py
1. Updated mocks for enable_cpu_binding default True to align with new
config default.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
2026-03-01 20:30:43 +08:00
realliujiaxu
5e24b26a54 [Bugfix] rename enable_flash_comm_v1 back to enable_sp (#6883)
### What this PR does / why we need it?

PR #5632 introduced a bug by replacing some branches gated by enable_sp
with enable_flash_comm_v1. As a result, when enable_shared_expert_dp is
enabled alone (i.e., VLLM_ASCEND_ENABLE_FLASHCOMM1=0 and
VLLM_ASCEND_ENABLE_FLASHCOMM=0), the behavior becomes inconsistent with
the previous logic and leads to accuracy issues. This PR restores the
original enable_sp-based branching to recover expected behavior and
accuracy.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

#### 1. start server
``` bash
vllm serve /home/weights/DeepSeek-V2-Lite-W8A8/  \
    --port 8001 \
    --served-model-name auto \
    --max-model-len 1024 \
    --enforce-eager \
    --tensor-parallel-size 2 \
    --data-parallel-size 2 \
    --gpu-memory-utilization 0.9 \
    --enable-expert-parallel \
    --additional-config '{"enable_shared_expert_dp": true}'
```

#### 2. curl
```bash
curl -s http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "auto",
  "messages": [
    {"role": "user", "content": "Hello. I have a question. Who are you?"}
  ],
  "max_tokens": 10,
  "temperature": 0.0,
  "ignore_eos_token": true
}'
```

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2026-03-01 20:22:50 +08:00
wjunLu
8835236181 [Image] Fix docker image merge tag settings (#6884)
### What this PR does / why we need it?
Fix docker image merge tag settings, to use tag as the branch name.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-03-01 12:20:57 +08:00
Bai Yongbin
9d09488b4a [Feat] support basic pcp&dcp for qwen3next (#6091)
### What this PR does / why we need it?
This PR implements Context Parallelism (CP) support for the Qwen3-Next
model, including PCP (prefill context parallelism) and DCP (decode
context parallelism).

- vLLM version: v0.15.0
- vLLM main:
f176443446

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: Bai Yongbin <845473182@qq.com>
Co-authored-by: SunnyLee219 <3294305115@qq.com>
Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2026-02-28 21:44:08 +08:00
wjunLu
64fba51275 [Bugfix] Fix openEuler dockerfile error (#6871)
### What this PR does / why we need it?
This pull request addresses several issues within the openEuler
Dockerfiles : `Dockerfile.a3.openEuler` and `Dockerfile.openEuler` to
ensure correct installation and setup of the Mooncake dependency. The
changes primarily involve fixing incorrect file paths for the
mooncake_installer.sh script and streamlining the declaration of the
MOONCAKE_TAG build argument, leading to more robust and accurate
container builds.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-02-28 20:55:18 +08:00
starmountain1997
5ffae03156 [bugfix] fix capture shape in sp_eagle_fullgraph (#6846)
### What this PR does / why we need it?

This was meant to be merged in #6536, but I accidentally restored a
commit. You can find the relevant discussion
[here](https://github.com/vllm-project/vllm-ascend/pull/6536#issuecomment-3882883471).

Since `self.pass_config.enable_sp` is forcibly set to `False` in the
[source
code](f176443446/vllm/config/compilation.py (L1066)),
this section will no longer verify whether the generated cudagraph
shapes are multiples of both `uniform_decode_query_len`
(`num_speculative_tokens + 1`) and `tensor_parallel_size`.

This PR enables the `num_speculative_tokens + 1` and
`tensor_parallel_size` check upfront. Therefore, it won't silently round
up the `cudagraph_size` and throw a cryptic error for the user.

A typical example of this cryptic error looks like:
```
ValueError: could not broadcast input array from shape (196,) into shape (14,)
```
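
A minimal sketch of the upfront check described above (illustrative; names do
not match the actual implementation):

```python
def validate_capture_sizes(capture_sizes, num_speculative_tokens, tp_size):
    uniform_decode_query_len = num_speculative_tokens + 1
    for size in capture_sizes:
        if size % uniform_decode_query_len or size % tp_size:
            # Fail fast instead of silently rounding up the cudagraph size
            raise ValueError(
                f"capture size {size} must be a multiple of both "
                f"{uniform_decode_query_len} (num_speculative_tokens + 1) "
                f"and {tp_size} (tensor_parallel_size)")
```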

### Does this PR introduce _any_ user-facing change?

no.

### How was this patch tested?

Have passed all test.

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: lilinsiman <lilinsiman@gmail.com>
Co-authored-by: drslark <slarksblood@qq.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-28 17:30:02 +08:00
zyz111222
81fb7d5779 [Doc] add 310P3 guidance of PaddleOCR-VL (#6837)
### What this PR does / why we need it?
Add 310P3 guidance for the PaddleOCR-VL model; refresh PaddleOCR-VL.md in
docs/source/tutorials/.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
by CI

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: zouyizhou <zouyizhou@huawei.com>
2026-02-28 16:03:07 +08:00
luomin2005
3cc8bf15da Support platform.get_device_uuid function (#6777)
### What this PR does / why we need it?
Support the platform.get_device_uuid function.
Currently, torch.npu.get_device_properties returns the uuid as all
zeros. vllm-ascend implements the interface first; once
torch.npu.get_device_properties returns the real uuid, vllm-ascend will
support it without modification.
More details:
https://github.com/vllm-project/vllm-ascend/issues/6669
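
A hypothetical illustration of the interface described above (requires
torch_npu; assumes the `uuid` field mentioned above, and may differ from the
actual vllm-ascend implementation):

```python
import torch
import torch_npu  # Ascend PyTorch adapter

def get_device_uuid(device_id: int = 0) -> str:
    props = torch.npu.get_device_properties(device_id)
    # torch_npu currently reports an all-zero uuid; once the real value is
    # returned, this passthrough needs no change.
    return str(props.uuid)
```
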
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea
root@localhost:/workspace/l00614971/vllm_test# python vllm_test.py 
INFO 02-24 09:43:48 [__init__.py:43] Available plugins for group
vllm.platform_plugins:
INFO 02-24 09:43:48 [__init__.py:45] - ascend -> vllm_ascend:register
INFO 02-24 09:43:48 [__init__.py:48] All plugins in this group will be
loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 02-24 09:43:48 [__init__.py:217] Platform plugin ascend is
activated
device_uuid =  00000000-0000-0000-0000-000000000000

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: luomin2005 <luomin2005@huawei.com>
Co-authored-by: liziyu <56102866+liziyu179@users.noreply.github.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-02-28 14:17:12 +08:00