### What this PR does / why we need it?
This PR optimizes the `_compute_slot_mappings_kernel` for Ascend NPUs to
improve performance. The key changes include:
- A new Triton kernel implementation (`_compute_slot_mappings_kernel`)
with NPU-specific optimizations, such as using `tl.gather` to handle
non-contiguous memory access and replacing modulo operations.
- A new method `compute_slot_mappings` in `AscendBlockTables` to use
this new kernel.
- An end-to-end test to verify the correctness of the new kernel against
the reference GPU implementation.
The optimization is needed to avoid performance degradation from scalar
computation on Ascend devices.
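
For reference, the mapping this kernel produces can be sketched in plain Python. This is a minimal sketch of the usual paged-KV slot arithmetic, not the Triton kernel itself; the function name, the single-request block-table layout, and the subtraction-based remainder are illustrative assumptions.

```python
def compute_slot_mapping(block_table, positions, block_size):
    """Map each token position to a physical KV-cache slot.

    block_table[i] is the physical block id of the i-th logical block
    (hypothetical layout for a single request).
    """
    slots = []
    for pos in positions:
        logical_block = pos // block_size
        # Remainder without a modulo operator: pos - floor(pos / bs) * bs.
        offset = pos - logical_block * block_size
        slots.append(block_table[logical_block] * block_size + offset)
    return slots
```

The lookup `block_table[logical_block]` is the non-contiguous access that a gather-style load would cover in a vectorized kernel.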
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.18.0
- vLLM main:
ed359c497a
---------
Signed-off-by: lhp-deep <liuhaopeng1@huawei.com>
### What this PR does / why we need it?
2nd PR for https://github.com/vllm-project/vllm-ascend/issues/5712,
extend SP to VL MoE models.
### Does this PR introduce _any_ user-facing change?
remove `sp_threshold` in additional config and reuse `sp_min_token_num`
from vLLM.
### How was this patch tested?
- Model: Qwen3-VL-30B-A3B,
- TP4 DP2
- 100 reqs
- max concurrency 1
| Seq length | Mean TTFT (ms) main | Mean TTFT (ms) this PR |
|------------|---------------------|------------------------|
| 4k | 429.40 | 323.3 |
| 16k | 1297.01 | 911.74 |
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
This PR modifies the Qwen3-Next nightly CI config:
(1) Add a nightly CI.
(2) Set a more precise accuracy standard.
- vLLM version: v0.18.0
- vLLM main:
6a9cceb219
Signed-off-by: Your Name <you@example.com>
Co-authored-by: Your Name <you@example.com>
### What this PR does / why we need it?
- Replace local logging with vllm.logger for consistency
- Add info log when enable_npugraph_ex is enabled
- Add info log when enable_static_kernel is enabled
- Unify logging message format to use config switch names consistently
- This helps users understand which compilation optimizations are active
### Does this PR introduce _any_ user-facing change?
Yes. Users will now see informational log messages when
enable_npugraph_ex or enable_static_kernel features are enabled,
providing better visibility into the compilation optimization settings
being used.
### How was this patch tested?
- Code passes all pre-commit hooks (ruff check, ruff format, codespell,
typos)
- Follows project coding conventions and style guidelines
- Logger import matches the pattern used elsewhere in the codebase
Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
### What this PR does / why we need it?
This PR adds the missing arguments to `AscendRotaryEmbedding` and
`AscendYarnRotaryEmbedding` to conform with vLLM. A corresponding unit
test is also added.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
Check the wildcard address for the layerwise connector.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
Fix the Mooncake layerwise connector dying when `update_decoder_info`
fails. When node D is dead, a failed `update_decoder_info` call on node P
should not cause node P to die as well.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
by CI
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
1. In FullGraph mode, the branches in a Triton operator are compiled and
fixed during graph capture. As a result, the branch in the
`fused_recurrent_gated_delta_rule` operator that checks whether
`ssm_state_indices >= 0` before writing to the SSM cache no longer takes
effect, and the write is performed regardless of the value. Because the
vLLM GDN backend pads with -1, the operator computes address offsets
from -1 and writes to the SSM cache at that offset. Since the conv cache
and SSM cache in the vLLM Ascend implementation are a single contiguous
tensor divided into two parts, this overwrites data and produces NaN
values.
This PR addresses the two cases where the GDN metadata builder pads with
-1. In both, the padding is replaced with 0 to avoid the memory
overwrite, because block 0 is a reserved block.
2. Fix layerwise connector bug for mamba cache sending on heterogeneous
TP.
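
The padding issue in item 1 can be illustrated with a toy offset calculation. This is a minimal sketch under stated assumptions: the helper names are hypothetical, and the layout (conv region first, SSM region starting at `ssm_base` in one flat buffer) is an assumed simplification of the shared tensor described above.

```python
PAD = -1


def sanitize_state_indices(indices, reserved_block=0):
    """Replace -1 padding with the reserved block id, so an unconditional
    write lands in a block that nothing reads from."""
    return [reserved_block if i == PAD else i for i in indices]


def write_offsets(indices, ssm_base, block_stride):
    """Flat offsets an unconditional write would touch, assuming the SSM
    region starts at ssm_base inside a buffer shared with the conv cache."""
    return [ssm_base + i * block_stride for i in indices]


# With -1 padding, the second write falls below ssm_base (conv region).
unsafe = write_offsets([2, PAD], ssm_base=8, block_stride=4)
safe = write_offsets(sanitize_state_indices([2, PAD]), ssm_base=8, block_stride=4)
```

Here `unsafe` contains an offset below `ssm_base`, while every offset in `safe` stays inside the SSM region.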
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
This reverts commit 42bcad7e9b. The commit caused an accuracy drop for
Qwen3-Next on 150 items of GSM8K (98 -> 91).
- vLLM version: v0.18.0
- vLLM main:
6a9cceb219
Signed-off-by: Your Name <you@example.com>
Co-authored-by: Your Name <you@example.com>
### What this PR does / why we need it?
RFC #7394
310P cannot use the fused `rmsnormgated` operator and must fall back to
the native implementation.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
ut
- vLLM version: v0.17.0
- vLLM main:
4497431df6
---------
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
### What this PR does / why we need it?
During the prefill phase of Qwen3-Next and Qwen3.5, the
`torch.ops._C_ascend.causal_conv1d_fn` operator exhibits significant
performance bottlenecks. To address this, we have re-implemented the
optimization using `torch.ops._C_ascend.npu_causal_conv1d_custom`.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
1. Accuracy test
```
[2026-03-20 16:44:22,961] [ais_bench] [INFO] Start launch task state board ...
+-----------------------------+-----------+------------+-------------+----------+-------------------------------------------+---------------------+
| Task Name | Process | Progress | Time Cost | Status | Log Path | Extend Parameters |
+=============================+===========+============+=============+==========+===========================================+=====================+
| vllm-api-general-chat/gsm8k | 2918978 | NA | 0:00:01 | finish | logs/eval/vllm-api-general-chat/gsm8k.out | None |
+-----------------------------+-----------+------------+-------------+----------+-------------------------------------------+---------------------+
[2026-03-20 16:44:34,284] [ais_bench] [INFO] Evaluation tasks completed.
[2026-03-20 16:44:34,287] [ais_bench] [INFO] Summarizing evaluation results...
dataset version metric mode vllm-api-general-chat
--------- --------- -------- ------ -----------------------
gsm8k 271d0b accuracy gen 96.21
```
2. Modified UT
`pytest -sv
/home/c30006096/vllm-ascend/tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_causal_conv1d.py::test_ascend_causal_conv1d`
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
Signed-off-by: wenba0 <3054239545@qq.com>
Signed-off-by: jiaojiao <56385650+wenba0@users.noreply.github.com>
### What this PR does / why we need it?
Qwen3.5 full attention supports enabling the split_qkv_rmsnorm_mrope
fusion operator.
### How was this patch tested?
vLLM version: v0.16.0
vLLM-Ascend main: https://github.com/vllm-project/vllm-ascend/pull/6730
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: ZhuQi-seu <zhuqi12@huawei.com>
### What this PR does / why we need it?
[CI] Restore the PD-disaggregated encoder test case that was incorrectly
skipped in PR: https://github.com/vllm-project/vllm-ascend/pull/7412
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
NA
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
Main2main: upgrade the vLLM commit to 0320 17:00.
1. Adapt to vLLM refactoring `_moe_forward` to call
`runner.forward_impl_chunked()` when `runner.use_dp_chunking` is True.
vLLM PR: "[MoE Refactor] DefaultMoERunner simplification"
[#33049](https://github.com/vllm-project/vllm/pull/33049)
2. Adapt to vLLM moving the call to `self._set_compile_ranges()` in
`VllmConfig.__post_init__` from **before** `check_and_update_config()`
to **after** it (to allow platforms to lower `max_num_batched_tokens`
first). vLLM PR: "fix(xpu): Re-compute compile ranges after
platform-specific config updates"
[#37523](https://github.com/vllm-project/vllm/pull/37523)
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
NA
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
### What this PR does / why we need it?
- This PR aims to enhance the operator performance in the `post_update`
phase of `model_runner_v2` on NPUs. By optimizing the relevant
operations, it is expected to improve the overall efficiency and speed
of the model running on NPU hardware, which is crucial for scenarios
where high-performance inference is required.
- When bs = 256, the time cost is reduced from 26 us to 11 us.
### Does this PR introduce _any_ user-facing change?
No, there are no changes to the API, interface, or other high-level
behaviors that would directly affect the user's code or interaction with
the system beyond the performance improvement.
### How was this patch tested?
CI passed with new added/existing tests. In addition to the regular CI
tests, specific benchmark tests were conducted on NPU hardware to
measure the performance improvement of the `post_update` operators.
---------
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
### What this PR does / why we need it?
RFC #7394
Add a PyTorch implementation of the GDN gating operator on 310P.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
UT
- vLLM version: v0.17.0
- vLLM main:
4497431df6
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
### What this PR does / why we need it?
- Pass GITEE_USERNAME (var) and GITEE_TOKEN (secret) as Docker build
args in nightly image build so Dockerfile can authenticate to Gitee
- In Dockerfile.nightly.a2/a3, embed credentials into clone URL to
avoid auth failure during `git clone`
- In single-node and multi-node PR test workflows, backup the
pre-installed benchmark from the nightly image before wiping
vllm-ascend, then restore it instead of re-cloning from Gitee,
which is inaccessible from fork PR contexts
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.18.0
- vLLM main:
8b6325758c
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
This PR fixes the x86 image issue where both `triton` and
`triton-ascend` are installed in the final environment.
- https://github.com/vllm-project/vllm-ascend/issues/7359
We confirmed the root cause is not that `triton` fails to uninstall
after the upstream `vllm` installation. Instead, during the
`vllm-ascend` installation step, pip resolves and installs upstream
`triton` again alongside `triton-ascend` on x86 platforms. This leads to
module conflicts at runtime because both distributions provide the
`triton` Python package.
To fix this, this PR updates all Dockerfiles to remove upstream `triton`
immediately after installing `vllm-ascend`, while keeping the
`triton-ascend` version resolved by `vllm-ascend` itself.
Affected files:
- `Dockerfile`
- `Dockerfile.a3`
- `Dockerfile.310p`
- `Dockerfile.openEuler`
- `Dockerfile.a3.openEuler`
- `Dockerfile.310p.openEuler`
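
A quick way to check an environment for this conflict is to list the installed distributions and look for both names. This is a minimal sketch using only `importlib.metadata` from the standard library; the helper names are hypothetical and are not part of this PR.

```python
from importlib import metadata


def conflicting_triton(installed_names):
    """Return True when both upstream triton and triton-ascend are present."""
    names = {n.lower().replace("_", "-") for n in installed_names}
    return {"triton", "triton-ascend"} <= names


def installed_distribution_names():
    """Names of all distributions visible to the current interpreter."""
    names = []
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken metadata
            names.append(name)
    return names
```

Calling `conflicting_triton(installed_distribution_names())` inside the built image should return `False` once upstream `triton` has been removed.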
### Does this PR introduce _any_ user-facing change?
Yes.
For x86 container images, the final Python environment will no longer
keep upstream `triton` alongside `triton-ascend`. This avoids importing
the wrong Triton package and fixes related runtime failures.
### How was this patch tested?
Root cause validation was performed by reproducing the installation flow
locally and checking the package state after each step.
Observed during `vllm-ascend` installation on x86:
- `triton-ascend` was installed as expected
- upstream `triton` was also installed again in the same step
``` bash
export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip cache purge
Successfully installed aiofiles-25.1.0 arctic-inference-0.1.1 blinker-1.9.0 cmake-4.2.3 fastapi-0.123.10
flask-3.1.3 h2-4.3.0 hpack-4.1.0 hypercorn-0.18.0 hyperframe-6.1.0 itsdangerous-2.2.0 numpy-1.26.4
opencv-python-headless-4.11.0.86 pandas-3.0.1 pandas-stubs-3.0.0.260204 priority-2.0.0 pybind11-3.0.2
python-dateutil-2.9.0.post0 quart-0.20.0 setuptools-scm-9.2.2 six-1.17.0 starlette-0.50.0 torch-2.9.0+cpu
torch-npu-2.9.0 torchaudio-2.9.0+cpu torchvision-0.24.0+cpu triton-3.6.0 triton-ascend-3.2.0
vllm_ascend-0.17.0rc2.dev51+geb92e7d50 werkzeug-3.1.6 wheel-0.46.3 wsproto-1.3.2 xgrammar-0.1.32
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with
the system package manager, possibly rendering your system unusable. It is recommended to use a virtual
environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what
you are doing and want to suppress this warning.
Files removed: 423 (1025.9 MB)
Directories removed: 5
```
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
This PR fixes the D node getting stuck in the PD-separation scenario.
After being stuck for a long time, the node crashes at
`torch.nn.functional.linear(x, weight, bias)`.
The root cause is that the shapes on the DP nodes were not aligned.
- vLLM version: v0.18.0
- vLLM main:
4034c3d32e
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
This PR adds a note to the docs stating that FULL mode is not supported
in the PCP scenario.
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
Upgrade the vLLM commit to 2026.03.19.
1. Fix socket removed from `StatelessProcessGroup`. Upstream vLLM PR
[#36330](https://github.com/vllm-project/vllm/pull/36330) ("elastic_ep:
Fix stateless group port races") refactored `StatelessProcessGroup` and
removed the `socket: socket.socket | None` field. The socket ownership
was moved to a new `create_tcp_store()` helper instead of being stored
as a field on the dataclass.
2. Fix `virtual_engine` parameter removed from `set_forward_context()`.
Upstream: "[V0 Deprecation] Deprecate virtual engine"
[#37195](https://github.com/vllm-project/vllm/pull/37195)
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
NA
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
This PR simplifies and hardens MLA prefill context merging in
`vllm_ascend/attention/mla_v1.py` after FIA migration by directly
building `out_list/lse_list` (without temporary chunk buffers or
`cat/stack/split`) and using `reshape` for safe flattening of
non-contiguous tensors.
### Does this PR introduce _any_ user-facing change?
No. This is an internal refactor/stability improvement only; no
API/interface behavior changes.
### How was this patch tested?
- Verified tensor shape/data flow for `npu_attention_update` inputs
(`out_list/lse_list`) after refactor.
- Confirmed no lint errors in the modified file.
- CI UT coverage on attention/MLA paths is used for validation.
vLLM version: `v0.17.0`
vLLM main: `vllm-project/vllm@4034c3d`
---------
Signed-off-by: lico67373 <918688502@qq.com>
### What this PR does / why we need it?
This PR refactors the communication group of MC2 to keep it consistent
with vllm's EP group, making it compatible with PP.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
### What this PR does / why we need it?
Because the new A5 MMEncoder operator was merged, the 310P can no longer
run any VL models. This PR fixes that issue. details at #7046
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
e2e
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
### What this PR does / why we need it?
As vllm-ascend main no longer maintains v0.17.0, we apply only the
single branch in the eagle proposer. Otherwise, it raises an error in
v0.18.0.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.18.0
- vLLM main:
8b6325758c
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
**Background:**
PR https://github.com/vllm-project/vllm-ascend/pull/6448 has introduced
a `seq_lens` CPU cache mechanism, which will considerably benefit the
performance for VL models but may lead to accuracy issues. Thus, we have
reverted it.
**Proposed Change:**
In PR https://github.com/vllm-project/vllm/pull/36605, we have supported
custom processing logic for OOT MMEncoder kernels in vLLM. Thus, we can
pre-compute `seq_lens` (rather than `cu_seqlens`) and put it on CPU
before ViT vision blocks to avoid redundant computation.
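
The relation between the two representations can be sketched in plain Python. This is an illustrative sketch only: the PR operates on tensors, and the helper names here are hypothetical.

```python
from itertools import accumulate


def to_cu_seqlens(seq_lens):
    """Cumulative boundaries: [3, 5, 2] -> [0, 3, 8, 10]."""
    return [0, *accumulate(seq_lens)]


def to_seq_lens(cu_seqlens):
    """Per-sequence lengths recovered from cumulative boundaries."""
    return [end - start for start, end in zip(cu_seqlens, cu_seqlens[1:])]
```

Computing `seq_lens` once on the CPU before the vision blocks means each block can reuse the same list instead of deriving it again.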
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
#### ✅ Functional Test
Run Qwen2.5-VL:
```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--limit-mm-per-prompt '{"image": 1}'
```
Output:
```bash
"The text in the illustration is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The font appears to be modern and clean, with \"TONGYI\" having a slightly bolder and more prominent appearance compared to \"Qwen.\" The overall design is simple and professional."
```
> [!NOTE]
> Since PR https://github.com/vllm-project/vllm/pull/36605 only modified
> `Qwen3-VL` modeling files, this PR has no effect on the `Qwen2.5-VL` model.
---
Run Qwen3-VL:
```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--limit-mm-per-prompt '{"image": 1}'
```
Output:
```bash
"The text in the illustration is **“TONGYI Qwen”**.\n\n### How it looks:\n- **“TONGYI”** is written in **uppercase letters** in a **bold, modern sans-serif font**, colored **blue**.\n- **“Qwen”** is written in **lowercase letters** in a **slightly thinner, elegant sans-serif font**, colored **dark gray**.\n- The two lines of text are stacked vertically, with TONG."
```
---
#### ✅ Benchmark
Launch the server:
```
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--dtype bfloat16 \
--limit-mm-per-prompt '{"image": 1}' \
--max-model-len 16384 \
--max-num-batched-tokens 16384
```
Run benchmark:
```
vllm bench serve \
--model /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--backend openai-chat \
--endpoint /v1/chat/completions \
--dataset-name hf \
--hf-split train \
--dataset-path lmarena-ai/vision-arena-bench-v0.1 \
--num-prompts 500 \
--request-rate 10 \
--burstiness 5 \
--no-stream
```
Before this PR:
```
============ Serving Benchmark Result ============
Successful requests: 500
Failed requests: 0
Request rate configured (RPS): 10.00
Benchmark duration (s): 78.58
Total input tokens: 33418
Total generated tokens: 61431
Request throughput (req/s): 6.36
Output token throughput (tok/s): 781.78
Peak output token throughput (tok/s): 2475.00
Peak concurrent requests: 383.00
Total token throughput (tok/s): 1207.07
---------------Time to First Token----------------
Mean TTFT (ms): 7116.24
Median TTFT (ms): 4295.84
P99 TTFT (ms): 18370.87
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 245.78
Median TPOT (ms): 264.03
P99 TPOT (ms): 334.38
---------------Inter-token Latency----------------
Mean ITL (ms): 246.99
Median ITL (ms): 117.71
P99 ITL (ms): 1327.55
==================================================
```
After this PR:
```
============ Serving Benchmark Result ============
Successful requests: 500
Failed requests: 0
Request rate configured (RPS): 10.00
Benchmark duration (s): 77.44
Total input tokens: 33418
Total generated tokens: 61522
Request throughput (req/s): 6.46
Output token throughput (tok/s): 794.40
Peak output token throughput (tok/s): 2691.00
Peak concurrent requests: 369.00
Total token throughput (tok/s): 1225.91
---------------Time to First Token----------------
Mean TTFT (ms): 6888.64
Median TTFT (ms): 4128.82
P99 TTFT (ms): 17487.94
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 240.14
Median TPOT (ms): 259.18
P99 TPOT (ms): 313.15
---------------Inter-token Latency----------------
Mean ITL (ms): 241.84
Median ITL (ms): 121.08
P99 ITL (ms): 1470.33
==================================================
```
**Performance Metrics:**
| Metric | Before this PR | After this PR | Comparison |
| :----- | :------------- | :------------ | :--------- |
| **Throughput** | | | |
| Request throughput (req/s) | 6.36 | 6.46 | +1.57% ↑ |
| Output token throughput (tok/s) | 781.78 | 794.40 | +1.61% ↑ |
| Total token throughput (tok/s) | 1,207.07 | 1,225.91 | +1.56% ↑ |
| Peak output token throughput (tok/s) | 2,475 | 2,691 | +8.73% ↑ |
| **Latency** | | | |
| Benchmark duration (s) | 78.58 | 77.44 | -1.45% ↓ |
| Mean TTFT (ms) | 7,116.24 | 6,888.64 | -3.20% ↓ |
| Median TTFT (ms) | 4,295.84 | 4,128.82 | -3.89% ↓ |
| P99 TTFT (ms) | 18,370.87 | 17,487.94 | -4.81% ↓ |
| Mean TPOT (ms) | 245.78 | 240.14 | -2.29% ↓ |
| Median TPOT (ms) | 264.03 | 259.18 | -1.84% ↓ |
| P99 TPOT (ms) | 334.38 | 313.15 | -6.35% ↓ |
| Mean ITL (ms) | 246.99 | 241.84 | -2.09% ↓ |
| Median ITL (ms) | 117.71 | 121.08 | +2.86% ↑ |
| P99 ITL (ms) | 1,327.55 | 1,470.33 | +10.76% ↑ |
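
The Comparison column is the relative change against the "Before" value. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def pct_change(before, after):
    """Relative change in percent; negative means the value went down."""
    return (after - before) / before * 100.0


# Values taken from the table above.
req_throughput = pct_change(6.36, 6.46)
p99_tpot = pct_change(334.38, 313.15)
p99_itl = pct_change(1327.55, 1470.33)
```

For throughput rows a positive value is an improvement, while for latency rows a negative value is.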
**🤖 AI Summary:**
- The most notable improvement is in P99 TPOT, which dropped **-6.35%**
from 334.38ms → 313.15ms, indicating reduced tail latency for per-token
generation under heavy load.
- TTFT improved across all percentiles: mean dropped **-3.20%** (7,116ms
→ 6,889ms), median **-3.89%** (4,296ms → 4,129ms), and P99 **-4.81%**
(18,371ms → 17,488ms), reflecting faster time-to-first-token across the
board.
- TPOT also improved consistently, with mean down **-2.29%** (245.78ms →
240.14ms) and median down **-1.84%** (264.03ms → 259.18ms), showing a
modest but steady reduction in per-token generation time.
- Throughput saw a slight uplift of roughly **+1.6%** across request,
output token, and total token throughput. Peak output token throughput
jumped **+8.73%** (2,475 → 2,691 tok/s), suggesting better burst
handling capacity.
- P99 ITL increased **+10.76%** (1,328ms → 1,470ms), the largest
regression in the run. Median ITL also ticked up **+2.86%** (117.71ms →
121.08ms). These tail-latency spikes may reflect scheduling variability
under peak concurrency and could be within run-to-run noise, but are
worth monitoring.
- Overall, the PR delivers a consistent improvement in both throughput
and latency, with the caveat that P99 inter-token latency regressed —
likely a transient effect given that mean ITL still improved by
**-2.09%**.
---
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
Fix the precision issue in dispatch_ffn_combine_bf16 operator.
Remove redundant synchronization operations in dispatch_ffn_combine
operator.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: guanguan0308 <1546542263@qq.com>
### What this PR does / why we need it?
Follow https://github.com/vllm-project/vllm/pull/37425,
https://github.com/vllm-project/vllm-omni/pull/1982
Copied from them:
We noticed that `hasattr(self.model, "flush_pending_metadata")` costs 6ms
per decode step when profiling Qwen3 Omni.
The original `CUDAGraphWrapper.__getattr__` raises:
```python
raise AttributeError(f"... cudagraph wrapper: {self.runnable}")
```
When `hasattr()` is called for a non-existent attribute, Python internally
calls `__getattr__`, which constructs this `AttributeError`. The
`{self.runnable}` triggers `__repr__()` on the underlying model (e.g.,
`Qwen3OmniMoeForConditionalGeneration`), which recursively traverses the
entire `nn.Module` tree to generate an 18,000+ character string. This
takes ~6-7ms per call.
Since `hasattr(self.model, "flush_pending_metadata")` is called every
decode step in the Talker forward path, this adds ~6ms overhead per
step, severely impacting audio inter-chunk latency (ICL).
```Python
hasattr(self.model, "flush_pending_metadata")
→ getattr(self.model, "flush_pending_metadata")
→ not found in CUDAGraphWrapper.__dict__
→ not found in the CUDAGraphWrapper class hierarchy
→ triggers CUDAGraphWrapper.__getattr__("flush_pending_metadata")
→ hasattr(self.runnable, "flush_pending_metadata") # runnable also doesn't have it
→ executes raise AttributeError(f"... {self.runnable}")
→ Python needs to construct the exception object
→ the f-string triggers self.runnable.__repr__()
→ Qwen3OmniMoeForConditionalGeneration.__repr__()
→ recursively traverses the entire nn.Module tree
→ generates a 18,000+ character string
→ takes ~6 ms
→ AttributeError object is created
→ hasattr catches the AttributeError and returns False
→ the 18,000-character string is immediately discarded (no one ever sees it)
```
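
The effect can be reproduced without vLLM. The sketch below uses stand-in classes (`Model`, `SlowWrapper`, and `FastWrapper` are hypothetical names, not the real `CUDAGraphWrapper`) and counts how often the expensive `__repr__` runs. Formatting only the type name in the error message is one possible way to avoid the cost, shown here as an assumption rather than as the exact upstream change.

```python
class Model:
    """Stand-in for a large nn.Module; counts how often repr() is built."""
    repr_calls = 0

    def __repr__(self):
        Model.repr_calls += 1
        return "Model(" + "layer, " * 1000 + ")"


class SlowWrapper:
    def __init__(self, runnable):
        self.runnable = runnable

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if hasattr(self.runnable, name):
            return getattr(self.runnable, name)
        # The f-string formats the whole wrapped object via __repr__.
        raise AttributeError(f"{name} not found in wrapper: {self.runnable}")


class FastWrapper:
    def __init__(self, runnable):
        self.runnable = runnable

    def __getattr__(self, name):
        if hasattr(self.runnable, name):
            return getattr(self.runnable, name)
        # Only the class name is formatted, so __repr__ is never called.
        raise AttributeError(
            f"{name} not found in wrapper of {type(self.runnable).__name__}"
        )
```

Both wrappers make `hasattr(wrapper, "flush_pending_metadata")` return `False`; only the first one pays for building the full representation each time.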
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
See https://github.com/vllm-project/vllm-omni/pull/1982
- vLLM version: v0.17.0
- vLLM main:
4497431df6
---------
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
### What this PR does / why we need it?
Replace text-match assertions with a two-tier logprob accuracy check:
- Prefill (token 0): assert token ID is identical between eager baseline
and compiled mode, then verify logprob matches within `atol`.
- Decode (tokens 1-2): if chosen tokens match, compare logprobs
directly; if they differ, cross-lookup the baseline token in the
compiled model's top-20 distribution and assert the assigned logprob is
within `decode_atol` (defaults to 2x atol). This tolerates minor argmax
drift caused by floating-point differences while still catching
distribution divergence.
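
The two tiers can be sketched as a single predicate. This is a minimal sketch, not the test code from this PR: the function name and the `comp_topk` layout (token id to logprob) are hypothetical, and it assumes that matching decode tokens are compared with `atol` and that the cross-lookup compares against the baseline logprob.

```python
def check_token(base_id, base_lp, comp_id, comp_lp, comp_topk, atol,
                is_prefill, decode_atol=None):
    """Return True when the compiled-mode step agrees with the eager baseline.

    comp_topk maps token id -> logprob for the compiled model's top-k
    candidates at this step (assumed layout).
    """
    if decode_atol is None:
        decode_atol = 2 * atol
    if is_prefill:
        # Tier 1: token must be identical and logprob within atol.
        return base_id == comp_id and abs(base_lp - comp_lp) <= atol
    if base_id == comp_id:
        return abs(base_lp - comp_lp) <= atol
    # Tier 2: argmax drifted, so look the baseline token up in the
    # compiled model's top-k distribution.
    if base_id not in comp_topk:
        return False
    return abs(comp_topk[base_id] - base_lp) <= decode_atol
```

A decode step where the compiled model picks a different token still passes as long as the baseline token sits in the top-k with a close enough logprob.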
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
8a680463fa
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR optimizes the Qwen3.5 and Qwen3Next GDN prefill path on Ascend
by reducing host/device synchronization overhead.
The current implementation of the `chunk_gated_delta_rule` path for
variable-length sequences prepares chunk metadata during the forward
pass. This approach triggers frequent CPU intervention and host/device
round-trips. When running prefill-heavy workloads with asynchronous
scheduling enabled, these synchronizations result in execution "bubbles"
and prefill stalling (stuttering). **Note that this does not cause
asynchronous scheduling to fail; rather, it prevents the system from
reaching its theoretical throughput due to these unnecessary stalls.**
To resolve this, the patch moves metadata preparation out of the hot
path:
- **Prebuilt Metadata:** All non-speculative varlen chunk metadata for
GDN is now prebuilt on the CPU.
- **Asynchronous Transfer:** Staging buffers are kept in pinned memory
and transferred to the NPU asynchronously.
- **Integration:** The prebuilt bundle is attached to GDN attention
metadata via `patch_gdn_attn.py` and passed into Triton wrappers.
- **Backward Compatibility:** Triton wrappers fall back to the legacy
preparation path if no prebuilt metadata is provided.
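
The fallback behaviour in the last bullet can be sketched in plain Python. The names below (`prepare_chunk_indices`, `chunk_op`, `prebuilt_indices`) are hypothetical, and the metadata is reduced to (sequence index, chunk index) pairs as an assumed simplification of what the varlen path needs.

```python
def prepare_chunk_indices(cu_seqlens, chunk_size):
    """(sequence index, chunk index) pairs for variable-length sequences."""
    pairs = []
    for seq_idx, (start, end) in enumerate(zip(cu_seqlens, cu_seqlens[1:])):
        num_chunks = -(-(end - start) // chunk_size)  # ceiling division
        pairs.extend((seq_idx, c) for c in range(num_chunks))
    return pairs


def chunk_op(cu_seqlens, chunk_size, prebuilt_indices=None):
    """Use prebuilt metadata when supplied, else build it on the spot."""
    if prebuilt_indices is None:
        # Legacy path: metadata is prepared inside the forward call.
        prebuilt_indices = prepare_chunk_indices(cu_seqlens, chunk_size)
    return prebuilt_indices
```

Building the pairs ahead of time on the CPU keeps this loop out of the forward pass, which is the part that would otherwise force a host/device round-trip.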
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: maoxx241 <maomaoyu870@gmail.com>
### What this PR does / why we need it?
Remove the logic that moves the input of `get_rope_shape` from device to host.
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
Signed-off-by: LoganJane <loganJane73@hotmail.com>
## Summary
Flash Comm V1 (flashcomm1) was previously blocked for all VL models.
**Root cause:** For VL models, `inputs_embeds` at layer 0 originates
from the vision encoder as a full `[N, H]` tensor — it has **not** been
reduce-scattered across TP ranks. The original MLA forward path assumed
inputs were already scattered, producing wrong output shapes under TP >
1.
**Fix:**
- Detect at init time (statically, not via runtime shape checks) whether
a layer is the first layer of a VL model (`is_vl_first_layer`) so dynamo
treats the branch as a constant.
- In `AscendMultiHeadLatentAttention.forward`, when `flashcomm1 + TP > 1
+ is_vl_first_layer`, set `need_gather_q_kv=False` and pre-allocate
output as `[N//tp_size, H]`.
- Remove the platform-level assertion that prevented VL models from
enabling Flash Comm V1.
**Other improvements:**
- `is_vl_model()` now uses vllm's canonical detection (`hf_config is not
hf_text_config`) instead of fragile key-name checks, with the old checks
kept as fallback.
- Added `parse_layer_idx(prefix)` utility.
- Added `maybe_chunk_residual` call in `AscendRMSNorm` before the
add-rms-norm op.
- Removed unnecessary CPU/fp32 round-trip in
`AscendLearnable2DInterpPosEmbDivided_fixed.forward()`.
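
The static first-layer check can be sketched as follows. This is a hypothetical illustration: the real `parse_layer_idx` may behave differently, and the prefix format (`...layers.<idx>...`) and the `is_vl_first_layer` helper signature are assumptions.

```python
import re

# Assumed prefix format, e.g. "model.layers.0.self_attn".
_LAYER_RE = re.compile(r"(?:^|\.)layers\.(\d+)(?:\.|$)")


def parse_layer_idx(prefix):
    """Return the layer index embedded in a module prefix, or None."""
    match = _LAYER_RE.search(prefix)
    return int(match.group(1)) if match else None


def is_vl_first_layer(prefix, is_vl_model):
    """Evaluated once at init time, so the result is a plain constant."""
    return is_vl_model and parse_layer_idx(prefix) == 0
```

Because the flag is derived from the prefix string at construction time, the forward pass branches on a fixed Python bool rather than on a runtime tensor shape.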
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: LoganJane <loganJane73@hotmail.com>
### What this PR does / why we need it
This PR fixes a startup regression for Ascend hybrid attention + mamba
models after upgrading to vLLM `0.18.0`.
After the vLLM `0.18.0` upgrade, worker initialization still calls the
generic platform hook:
- `current_platform.update_block_size_for_backend(vllm_config)`
### How this PR fixes it
This PR keeps the fix strictly inside `vllm-ascend`.
It adds an Ascend override for
`NPUPlatform.update_block_size_for_backend()`:
- for hybrid models, do not run the generic upstream block-size fallback
- preserve the block size that was already computed by the hybrid
model-specific config logic
- for non-hybrid models, keep the original upstream behavior unchanged
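
The shape of the override can be sketched with stand-in classes. Everything below is illustrative: the `Platform` base class, the dict-based config, the `is_hybrid` key, and the fallback value of 128 are assumptions, not the upstream implementation.

```python
class Platform:
    """Stand-in for the upstream platform base class."""

    @classmethod
    def update_block_size_for_backend(cls, config):
        # Hypothetical generic fallback: force a backend-preferred size.
        config["block_size"] = 128


class NPUPlatform(Platform):
    @classmethod
    def update_block_size_for_backend(cls, config):
        if config.get("is_hybrid"):
            # Keep the block size computed by the hybrid-model config logic.
            return
        # Non-hybrid models keep the upstream behavior unchanged.
        super().update_block_size_for_backend(config)


hybrid = {"is_hybrid": True, "block_size": 1024}
dense = {"is_hybrid": False, "block_size": 64}
NPUPlatform.update_block_size_for_backend(hybrid)
NPUPlatform.update_block_size_for_backend(dense)
```

After both calls, the hybrid config still carries its own block size, and the non-hybrid config carries whatever the generic hook chose.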
- vLLM version: v0.18.0
- vLLM main:
8b6325758c
---------
Signed-off-by: maoxx241 <maomaoyu870@gmail.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
In fulldecodeonly mode, num_req_padded was set to an incorrect value,
causing accuracy degradation in Qwen3-Next. Therefore, we added a check
for compilation_config.cudagraph_mode to the conditional logic, ensuring
that padding is applied only in FULL mode.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
8a680463fa
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
This PR aims to fix padding logic in eagle proposer for kimi25. Main
changes involve:
1. modify the way to obtain draft model attention builder and backend
2. add block table padding & related tensor slicing in common metadata
when `draft_step>1` for solving fia verifying error
3. replace block table in `update_graph_params` for solving fia
verifying error
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
Upgrade vLLM to v0.18.0 in the Dockerfile.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
1. upgrade to 0.18.0
2. ensure kernel_block_sizes is int for Eagle drafter
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
This PR enables separate attention backend configuration for target and
draft models in speculative decoding, decoupling the previously bound
attention backend settings between the two models.
It solves the compatibility issue where some draft models do not support
the attention backend used by the target model, and allows users to
select the optimal attention backend for each model individually to
maximize inference performance. The change is fully backward compatible.
---------
Signed-off-by: SidaoY <1024863041@qq.com>
### What this PR does / why we need it?
Refactor `vllm_ascend/ops/fused_moe` to replace scattered MoE business
`**kwargs` with typed request objects and explicit stage boundaries.
- Prepare, dispatch, MLP, and quant stages now have clearer ownership.
- Main MoE path no longer depends on business `kwargs.get(...)` lookups.
- Comm and dispatcher interfaces are request-only on the main path.
- UTs can assert stage-level fields directly instead of inferring
behavior indirectly.
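
The difference between `kwargs.get(...)` lookups and a typed request can be sketched as below. The `DispatchRequest` fields and the `dispatch` function are hypothetical and do not mirror the actual classes in `vllm_ascend/ops/fused_moe`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DispatchRequest:
    """Hypothetical typed request for a MoE dispatch stage."""
    num_tokens: int
    top_k: int
    with_quant: bool = False
    expert_map: Optional[tuple] = None


def dispatch(request: DispatchRequest) -> dict:
    # Fields are read as attributes; a misspelled name fails loudly
    # instead of silently returning None as kwargs.get(...) would.
    return {
        "routed_slots": request.num_tokens * request.top_k,
        "quantized": request.with_quant,
    }


result = dispatch(DispatchRequest(num_tokens=4, top_k=2))
```

A unit test can then assert on `request.top_k` or `result["routed_slots"]` directly, and a call with an unknown field name raises `TypeError` at construction time.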
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed.
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
Replace the '_npu_flash_attention_unpad' operator with the
'npu_fusion_attention' operator to ensure that the Qwen VL model can run
in the A5 environment and remove the 'mrope' operator call restriction
for A5.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: 汪越 <wangyue361@h-partners.com>
### What this PR does / why we need it?
1. Extract duplicated code into a method:
define _get_input_parallel_ in the parent class
_CustomRowParallelOp_ and call this helper in the _apply_impl_ method of
its 5 child classes:
- MLPRowParallelOp
- OProjRowParallelOp
- Flashcomm2OProjRowParallelOp
- MatmulAllreduceRowParallelOp
- SequenceRowParallelOp
2. Fix a variable-name typo: use `split` instead of `splitted` for the
past tense.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: idouba <zhangchaomeng@huawei.com>
### What this PR does / why we need it?
This PR adapts the `w8a8_mxfp8` quantization method to support Qwen
Vision-Language (VL) models. Key changes include:
- Reshaping multi-dimensional input tensors to 2D before the quantized
matrix multiplication.
- Reshaping the 2D output back to its original multi-dimensional format.
- Adding specific output reshaping for the visual components of Qwen VL
models.
- Casting the bias tensor to `float32` to comply with the
`npu_quant_matmul` kernel requirements.
These changes are necessary to enable `w8a8_mxfp8` quantization for
models with multi-modal inputs like Qwen VL.
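
The reshape bookkeeping around the quantized matmul can be sketched with shape tuples alone. This is a minimal sketch with hypothetical helper names; it does not call `npu_quant_matmul` and only shows how the leading dimensions are collapsed and restored.

```python
from math import prod


def flatten_to_2d(shape):
    """Collapse all leading dims: (d0, ..., dk, H) -> (d0*...*dk, H)."""
    *leading, hidden = shape
    return (prod(leading), hidden)


def restore_output_shape(input_shape, out_features):
    """Shape after undoing the flatten: (d0, ..., dk, out_features)."""
    return (*input_shape[:-1], out_features)
```

For an input of shape `(2, 3, 8)` and 16 output features, the matmul would see `(6, 8)` and the result would be reshaped from `(6, 16)` back to `(2, 3, 16)`.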
### Does this PR introduce _any_ user-facing change?
No, this is a backend enhancement to extend quantization support to new
model architectures. There are no user-facing API or behavior changes.
### How was this patch tested?
CI is expected to pass. Manual testing should be performed with a Qwen
VL model using `w8a8_mxfp8` quantization to verify correctness and
performance.
- vLLM version: v0.17.0
- vLLM main:
4497431df6
---------
Signed-off-by: ksiyuan <ksiyuan@umich.edu>