1665 Commits

Author SHA1 Message Date
panchao-hub
d98a0727c8 [Feat] Add npugraph_ex enablement logging (#7574)
### What this PR does / why we need it?

- Replace local logging with vllm.logger for consistency
- Add info log when enable_npugraph_ex is enabled
- Add info log when enable_static_kernel is enabled
- Unify logging message format to use config switch names consistently
- This helps users understand which compilation optimizations are active
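
A minimal sketch of the pattern described above (the config object and the exact location of the switches are illustrative):

```python
from vllm.logger import init_logger

logger = init_logger(__name__)

def log_compilation_switches(config) -> None:
    # Log which optional compilation optimizations are active, using the
    # config switch names so users can map the message back to their settings.
    if getattr(config, "enable_npugraph_ex", False):
        logger.info("enable_npugraph_ex is enabled")
    if getattr(config, "enable_static_kernel", False):
        logger.info("enable_static_kernel is enabled")
```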

### Does this PR introduce _any_ user-facing change?

Yes. Users will now see informational log messages when
enable_npugraph_ex or enable_static_kernel features are enabled,
providing better visibility into the compilation optimization settings
being used.

### How was this patch tested?

- Code passes all pre-commit hooks (ruff check, ruff format, codespell,
typos)
- Follows project coding conventions and style guidelines
- Logger import matches the pattern used elsewhere in the codebase

Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
2026-03-24 17:04:48 +08:00
Angazenn
bdb65319a9 [UT] Align input arguments with Ascend(Yarn)RotaryEmbedding with vLLM and add ut (#7358)
### What this PR does / why we need it?
This PR adds missing arguments to `AscendRotaryEmbedding` and
`AscendYarnRotaryEmbedding` to conform with vLLM. In addition,
corresponding unit tests are introduced.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Angazenn <supperccell@163.com>
2026-03-24 16:02:56 +08:00
liziyu
73cadecfb4 [P/D] [Bugfix] fix mooncake layerconnector dead when update_decoder_info fail (#7514)
### What this PR does / why we need it?
Fix the Mooncake layerwise connector dying when `update_decoder_info` fails.
In the scenario where node D is dead, node P failing to `update_decoder_info`
should not cause node P to die as well.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
by CI

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-03-24 15:49:46 +08:00
zxr2333
67aad1fce8 [BugFix][P/D] fix padding error on FullGraph mode && fix layerwise connector mamba accuracy (#7506)
### What this PR does / why we need it?
1. When FullGraph mode is used, branches in the Triton operator are compiled
and fixed during graph capture. This invalidates the branch condition in the
`fused_recurrent_gated_delta_rule` operator that checks whether
`ssm_state_indices >= 0` before writing to the SSM cache, so the write is
performed regardless of the value. As a result, after the vLLM GDN backend
pads with -1, the operator computes address offsets and writes to the SSM
cache based on that -1 offset. Since the conv cache and SSM cache in the
vLLM Ascend implementation are actually a single contiguous tensor divided
into two parts, this leads to data overwriting and the generation of NaN
values. This PR addresses the two cases where -1 padding is required in the
GDN metadata builder and replaces that padding with 0 in both, which avoids
the memory overwriting because block 0 is a reserved block (see the sketch
after this list).
2. Fix a layerwise connector bug for mamba cache sending on heterogeneous
TP.
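
A minimal sketch of the padding replacement from point 1 (tensor names are illustrative; the real change lives in the GDN metadata builder):

```python
import torch

def pad_state_indices(ssm_state_indices: torch.Tensor) -> torch.Tensor:
    # Under FullGraph capture the `>= 0` guard inside the Triton kernel is
    # baked in, so -1 padding would be used as a real offset into the cache.
    # Padding with block 0 instead is safe because block 0 is a reserved block.
    return torch.where(
        ssm_state_indices < 0,
        torch.zeros_like(ssm_state_indices),
        ssm_state_indices,
    )
```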

- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-03-24 15:15:55 +08:00
LeeWenquan
475b4b0cea Revert "GMM custom operator optimization in small batch scenarios (vllm-project#7100)" (#7557)
### What this PR does / why we need it?
This reverts commit 42bcad7e9b. That commit causes an accuracy decrease for
Qwen3-Next: on 150 items of gsm8k, accuracy drops from 98 to 91.

- vLLM version: v0.18.0
- vLLM main:
6a9cceb219

Signed-off-by: Your Name <you@example.com>
Co-authored-by: Your Name <you@example.com>
2026-03-24 14:24:44 +08:00
Shaoxu Cheng
83bd77c983 [310p]: add rmsnorm gated fallback and unit test (#7424)
### What this PR does / why we need it?
RFC #7394
310P cannot use the fused `rmsnormgated` operator and must fall back to
the native implementation.

### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
ut
- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-24 09:00:11 +08:00
jiaojiao
1de805ce0a [Ops][Misc] Refactor and optimize CausalConv1d for Ascend (#7495)
### What this PR does / why we need it?
During the prefill phase of Qwen3-Next and Qwen3.5, the
`torch.ops._C_ascend.causal_conv1d_fn` operator exhibits significant
performance bottlenecks. To address this, we have re-implemented the
optimization using `torch.ops._C_ascend.npu_causal_conv1d_custom`.
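
For reference, a minimal native PyTorch sketch of a causal depthwise conv1d of the kind this operator computes (shapes and the SiLU activation are assumptions based on typical Mamba-style usage, not the custom kernel's exact contract):

```python
import torch
import torch.nn.functional as F

def causal_conv1d_ref(x: torch.Tensor, weight: torch.Tensor,
                      bias: torch.Tensor | None = None) -> torch.Tensor:
    # x: [batch, dim, seqlen], weight: [dim, width] (depthwise), bias: [dim]
    dim, width = weight.shape
    # Left-pad so each output position only sees current and past inputs.
    out = F.conv1d(F.pad(x, (width - 1, 0)), weight.unsqueeze(1),
                   bias=bias, groups=dim)
    return F.silu(out)
```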

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
1. Accuracy test
```
[2026-03-20 16:44:22,961] [ais_bench] [INFO] Start launch task state board ...
+-----------------------------+-----------+------------+-------------+----------+-------------------------------------------+---------------------+
| Task Name                   |   Process | Progress   | Time Cost   | Status   | Log Path                                  | Extend Parameters   |
+=============================+===========+============+=============+==========+===========================================+=====================+
| vllm-api-general-chat/gsm8k |   2918978 | NA         | 0:00:01     | finish   | logs/eval/vllm-api-general-chat/gsm8k.out | None                |
+-----------------------------+-----------+------------+-------------+----------+-------------------------------------------+---------------------+
[2026-03-20 16:44:34,284] [ais_bench] [INFO] Evaluation tasks completed.
[2026-03-20 16:44:34,287] [ais_bench] [INFO] Summarizing evaluation results...
dataset    version    metric    mode      vllm-api-general-chat
---------  ---------  --------  ------  -----------------------
gsm8k      271d0b     accuracy  gen                       96.21
```
2. Modified unit test
`pytest -sv
/home/c30006096/vllm-ascend/tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_causal_conv1d.py::test_ascend_causal_conv1d`

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

Signed-off-by: wenba0 <3054239545@qq.com>
Signed-off-by: jiaojiao <56385650+wenba0@users.noreply.github.com>
2026-03-24 00:07:12 +08:00
ZhuQi-seu
e942b62d74 [features]support split qkv rmsnorm rmope for qwen3.5 (#7368)
### What this PR does / why we need it?
Qwen3.5 full attention supports enabling the split_qkv_rmsnorm_mrope
fusion operator.

### How was this patch tested?
vLLM version: v0.16.0
vLLM-Ascend main: https://github.com/vllm-project/vllm-ascend/pull/6730
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: ZhuQi-seu <zhuqi12@huawei.com>
2026-03-23 23:58:12 +08:00
Nengjun Ma
fcba91a392 Main2main Upgrade vllm commit to 0320 17:00 (#7510)
### What this PR does / why we need it?
Main2main: upgrade the vLLM commit to 03-20 17:00.

1. Fix for vLLM refactoring `_moe_forward` to call
`runner.forward_impl_chunked()` when `runner.use_dp_chunking` is True.
vLLM PR: "[MoE Refactor] DefaultMoERunner simplification"
[#33049](https://github.com/vllm-project/vllm/pull/33049)

2. Fix for vLLM moving the call to `self._set_compile_ranges()` in
`VllmConfig.__post_init__` from **before** `check_and_update_config()`
to **after** it (to allow platforms to lower `max_num_batched_tokens`
first). vLLM PR: "fix(xpu): Re-compute compile ranges after
platform-specific config updates"
[#37523](https://github.com/vllm-project/vllm/pull/37523)


### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
2026-03-23 21:37:41 +08:00
weijinqian0
bdd90c0088 [model_runner_v2]optimize the performance of the post_update. (#7496)
### What this PR does / why we need it?
- This PR aims to enhance the operator performance in the `post_update`
phase of `model_runner_v2` on NPUs. By optimizing the relevant
operations, it is expected to improve the overall efficiency and speed
of the model running on NPU hardware, which is crucial for scenarios
where high-performance inference is required.
- When bs = 256, the time cost drops from 26 µs to 11 µs.

### Does this PR introduce _any_ user-facing change?
No, there are no changes to the API, interface, or other high-level
behaviors that would directly affect the user's code or interaction with
the system beyond the performance improvement.

### How was this patch tested?
CI passed with new added/existing tests. In addition to the regular CI
tests, specific benchmark tests were conducted on NPU hardware to
measure the performance improvement of the `post_update` operators.

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2026-03-23 20:29:55 +08:00
lijiahang226
170dcbda62 [Feature] Support DeepSeek for A5 (#7232)
### What this PR does / why we need it?

Add A5 mla operators to support running DeepSeek models on A5.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: Li Jiahang <216526138+lijiahang226@users.noreply.github.com>
2026-03-23 20:28:26 +08:00
Shaoxu Cheng
13397e9cb7 [310p] Add a PyTorch implementation of the GDN gating operator on 310P (#7430)
### What this PR does / why we need it?
RFC #7394
Add a PyTorch implementation of the GDN gating operator on 310P.
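
A minimal sketch of what such a native PyTorch fallback can look like, assuming the commonly used gated-delta-net gating form `g = -exp(A_log) * softplus(a + dt_bias)`; the actual 310P operator semantics and argument layout may differ:

```python
import torch
import torch.nn.functional as F

def gdn_gating_ref(a: torch.Tensor, A_log: torch.Tensor,
                   dt_bias: torch.Tensor) -> torch.Tensor:
    # a: [num_tokens, num_heads], A_log / dt_bias: [num_heads]
    # Computed in float32 for numerical stability, as fused kernels usually do.
    return -torch.exp(A_log.float()) * F.softplus(a.float() + dt_bias.float())
```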

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
UT

- vLLM version: v0.17.0
- vLLM main:
4497431df6

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-23 20:26:39 +08:00
drslark
41dadd4312 [main][bugfix] Solved the problem of the d node getting stuck in the pd-separation scenario (#7534)
### What this PR does / why we need it?
This PR solves a problem where the D node gets stuck in the PD-separation
scenario.

We found that it eventually crashes at `torch.nn.functional.linear(x, weight, bias)`
after being stuck for a long time. The root cause is that the shapes on each
DP node were not aligned.

- vLLM version: v0.18.0
- vLLM main:
4034c3d32e

Signed-off-by: drslark <slarksblood@qq.com>
2026-03-23 18:53:07 +08:00
Levi
9976e685b7 [Bugfix][eager][oom] fix rank0 load imbalance by no padding when multi dp (#7297)
### What this PR does / why we need it?
Fix the multi-DP padding logic for eager mode, because it causes a rank0
load imbalance in kimi-k2.5-w4a8, with all the padding tokens routed to
rank0. The fix can also apply to other models under multi-DP.
- before
hbm usage:
<img width="2229" height="733" alt="image"
src="https://github.com/user-attachments/assets/50479b6d-cfd0-4206-8e80-974024652997"
/>

performance:
```shell
Concurrency  NumPrompts   QPS          TTFT_Avg     TTFT_P50     TPOT_Avg     TPOT_P50     TPOT_P90    
============ ============ ============ ============ ============ ============ ============ ============
1            15           0.0179       1667.7803    1673.3437    35.2973      35.2775      35.3784     
32           480          0.4725       2764.8027    1905.2137    40.8030      40.6978      41.0179     
64           960          0.7820       4123.7096    3485.6153    48.0461      48.1598      48.2971     
100          1500         1.0852       6216.7988    5714.0082    52.9323      53.0613      54.6304     
108          1620         1.1040       6277.4892    5798.7425    56.3862      56.9224      57.2901     
116          1740         1.1680       6563.3293    6039.5659    56.9894      57.4027      57.5786     
128          1920         1.2555       7822.5551    7604.1662    57.7660      58.1768      58.2717     
192          2880         1.4314       9212.1953    9131.3461    58.9905      59.1683      59.2791     
256          3840         1.4480       9028.0812    8913.7937    59.0092      59.2385      59.3516  
```


- after
hbm usage:
<img width="2246" height="1005" alt="image"
src="https://github.com/user-attachments/assets/d0936481-5a58-4bc5-a6f1-b92735d47885"
/>


performance:
```shell
Concurrency  NumPrompts   QPS          TTFT_Avg     TTFT_P50     TPOT_Avg     TPOT_P50     TPOT_P90    
============ ============ ============ ============ ============ ============ ============ ============
1            15           0.0181       601.4171     600.9774     35.6270      35.6254      35.6480     
32           480          0.4455       720.8782     724.2889     45.4250      45.4755      45.6318     
64           960          0.8445       729.6209     728.2149     47.0464      47.0896      47.1985     
100          1500         1.2601       723.4834     724.6673     48.3108      48.3844      48.5355     
108          1620         1.3409       727.1509     720.6772     48.8962      48.9409      49.0489     
116          1740         1.4080       679.9799     677.6119     49.1253      49.1983      49.3087     
128          1920         1.4155       680.6284     674.9436     49.2193      49.2450      49.3763     
192          2880         1.4422       684.6577     676.7833     49.2059      49.2264      49.3229     
256          3840         1.4558       685.2462     678.1709     49.2191      49.2351      49.3419    
```
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: fny-coder <985619145@qq.com>
2026-03-23 17:05:02 +08:00
Nengjun Ma
8e2c59e1ee Main2main upgrade vllm commit to 03 19 17:00 (#7478)
### What this PR does / why we need it?
Upgrade vllm commit to 2026.03.19.

1. Fix for the socket removed from StatelessProcessGroup. Upstream vLLM PR
[#36330](https://github.com/vllm-project/vllm/pull/36330) ("elastic_ep:
Fix stateless group port races") refactored StatelessProcessGroup and
removed the `socket: socket.socket | None` field; socket ownership was
moved to a new `create_tcp_store()` helper instead of being stored as a
field on the dataclass.

2. Fix for the `virtual_engine` parameter removed from `set_forward_context()`.
Upstream PR: [V0 Deprecation] Deprecate virtual engine
[#37195](https://github.com/vllm-project/vllm/pull/37195)

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-03-23 16:25:57 +08:00
LICO67373
caa71e50ca [Perf] Simplify FIA prefill context merge path (#7293)
### What this PR does / why we need it?
This PR simplifies and hardens MLA prefill context merging in
`vllm_ascend/attention/mla_v1.py` after FIA migration by directly
building `out_list/lse_list` (without temporary chunk buffers or
`cat/stack/split`) and using `reshape` for safe flattening of
non-contiguous tensors.

### Does this PR introduce _any_ user-facing change?
No. This is an internal refactor/stability improvement only; no
API/interface behavior changes.

### How was this patch tested?
- Verified tensor shape/data flow for `npu_attention_update` inputs
(`out_list/lse_list`) after refactor.
- Confirmed no lint errors in the modified file.
- CI UT coverage on attention/MLA paths is used for validation.

vLLM version: `v0.17.0`  
vLLM main: `vllm-project/vllm@4034c3d`

---------

Signed-off-by: lico67373 <918688502@qq.com>
2026-03-23 07:47:42 +00:00
Qiu
71df17f4e6 bugfix(MC2): refactor the comm group of MC2 to be compatible with PP (#7291)
### What this PR does / why we need it?
This PR refactors the communication group of MC2 to keep it consistent
with vllm's EP group, making it compatible with PP.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-03-23 15:44:21 +08:00
Shaoxu Cheng
5b60b530d6 [Bugfix][310p] the new A5 mmencoder op donot support 310p (#7518)
### What this PR does / why we need it?

Because the new A5 MMEncoder operator was merged, the 310P can no longer
run any VL models. This PR fixes that issue; details in #7046.

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
e2e
- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-23 15:40:34 +08:00
Mengqing Cao
9e2878065a [Spec-Decode] Fix spec decode proposer in 0.18.0 (#7544)
### What this PR does / why we need it?
Since vllm-ascend main no longer maintains v0.17.0, we just apply the
single branch in the eagle proposer; otherwise it raises an error on
v0.18.0.

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

Signed-off-by: MengqingCao <cmq0113@163.com>
2026-03-23 15:39:24 +08:00
Shanshan Shen
6b7d9b76f1 [MM][Perf] Pre-compute seq_lens and put it on CPU before ViT vision blocks for better performance (#7104)
### What this PR does / why we need it?
**Background:**

PR https://github.com/vllm-project/vllm-ascend/pull/6448 has introduced
a `seq_lens` CPU cache mechanism, which will considerably benefit the
performance for VL models but may lead to accuracy issues. Thus, we have
reverted it.

**Proposed Change:**

In PR https://github.com/vllm-project/vllm/pull/36605, we have supported
custom processing logic for OOT MMEncoder kernels in vLLM. Thus, we can
pre-compute `seq_lens` (rather than `cu_seqlens`) and put it on CPU
before ViT vision blocks to avoid redundant computation.
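
A minimal sketch of the idea, assuming Qwen-VL-style `grid_thw` metadata (the exact tensors passed to the vision blocks are an assumption here): the per-window sequence lengths are computed once on the CPU instead of being rederived on the device inside the encoder.

```python
import torch

def precompute_vit_seq_lens(grid_thw: torch.Tensor) -> torch.Tensor:
    # grid_thw: [num_images, 3] = (t, h, w) per image, already on CPU.
    # Each temporal frame contributes h * w patches, so the per-frame
    # sequence length h * w is repeated t times for every image.
    seq_lens = torch.repeat_interleave(
        grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0]
    )
    return seq_lens.cpu()
```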

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

####  Functional Test

Run Qwen2.5-VL:

```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--limit-mm-per-prompt '{"image": 1}'
```

Output:

```bash
"The text in the illustration is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The font appears to be modern and clean, with \"TONGYI\" having a slightly bolder and more prominent appearance compared to \"Qwen.\" The overall design is simple and professional."
```

> [!NOTE]
> Since PR https://github.com/vllm-project/vllm/pull/36605 only modified
> `Qwen3-VL` modeling files, this PR has no effect on the `Qwen2.5-VL` model.

---
Run Qwen3-VL:

```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--limit-mm-per-prompt '{"image": 1}'
```

Output:

```bash
"The text in the illustration is **“TONGYI Qwen”**.\n\n### How it looks:\n- **“TONGYI”** is written in **uppercase letters** in a **bold, modern sans-serif font**, colored **blue**.\n- **“Qwen”** is written in **lowercase letters** in a **slightly thinner, elegant sans-serif font**, colored **dark gray**.\n- The two lines of text are stacked vertically, with TONG."
```

---
####  Benchmark

Launch the server:

```
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--dtype bfloat16 \
--limit-mm-per-prompt '{"image": 1}' \
--max-model-len 16384 \
--max-num-batched-tokens 16384
```

Run benchmark:

```
vllm bench serve \
--model /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--backend openai-chat \
--endpoint /v1/chat/completions \
--dataset-name hf \
--hf-split train \
--dataset-path lmarena-ai/vision-arena-bench-v0.1 \
--num-prompts 500 \
--request-rate 10 \
--burstiness 5 \
--no-stream
```

Before this PR:

```
============ Serving Benchmark Result ============
Successful requests:                     500       
Failed requests:                         0         
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  78.58     
Total input tokens:                      33418     
Total generated tokens:                  61431     
Request throughput (req/s):              6.36      
Output token throughput (tok/s):         781.78    
Peak output token throughput (tok/s):    2475.00   
Peak concurrent requests:                383.00    
Total token throughput (tok/s):          1207.07   
---------------Time to First Token----------------
Mean TTFT (ms):                          7116.24   
Median TTFT (ms):                        4295.84   
P99 TTFT (ms):                           18370.87  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          245.78    
Median TPOT (ms):                        264.03    
P99 TPOT (ms):                           334.38    
---------------Inter-token Latency----------------
Mean ITL (ms):                           246.99    
Median ITL (ms):                         117.71    
P99 ITL (ms):                            1327.55   
==================================================
```

After this PR:

```
============ Serving Benchmark Result ============
Successful requests:                     500       
Failed requests:                         0         
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  77.44     
Total input tokens:                      33418     
Total generated tokens:                  61522     
Request throughput (req/s):              6.46      
Output token throughput (tok/s):         794.40    
Peak output token throughput (tok/s):    2691.00   
Peak concurrent requests:                369.00    
Total token throughput (tok/s):          1225.91   
---------------Time to First Token----------------
Mean TTFT (ms):                          6888.64   
Median TTFT (ms):                        4128.82   
P99 TTFT (ms):                           17487.94  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          240.14    
Median TPOT (ms):                        259.18    
P99 TPOT (ms):                           313.15    
---------------Inter-token Latency----------------
Mean ITL (ms):                           241.84    
Median ITL (ms):                         121.08    
P99 ITL (ms):                            1470.33   
==================================================
```

**Performance Metrics:**

| Metric | Before this PR | After this PR | Comparison |
| :----- | :------------- | :------------ | :--------- |
| **Throughput** | | | |
| Request throughput (req/s) | 6.36 | 6.46 | +1.57% ↑ |
| Output token throughput (tok/s) | 781.78 | 794.40 | +1.61% ↑ |
| Total token throughput (tok/s) | 1,207.07 | 1,225.91 | +1.56% ↑ |
| Peak output token throughput (tok/s) | 2,475 | 2,691 | +8.73% ↑ |
| **Latency** | | | |
| Benchmark duration (s) | 78.58 | 77.44 | -1.45% ↓ |
| Mean TTFT (ms) | 7,116.24 | 6,888.64 | -3.20% ↓ |
| Median TTFT (ms) | 4,295.84 | 4,128.82 | -3.89% ↓ |
| P99 TTFT (ms) | 18,370.87 | 17,487.94 | -4.81% ↓ |
| Mean TPOT (ms) | 245.78 | 240.14 | -2.29% ↓ |
| Median TPOT (ms) | 264.03 | 259.18 | -1.84% ↓ |
| P99 TPOT (ms) | 334.38 | 313.15 | -6.35% ↓ |
| Mean ITL (ms) | 246.99 | 241.84 | -2.09% ↓ |
| Median ITL (ms) | 117.71 | 121.08 | +2.86% ↑ |
| P99 ITL (ms) | 1,327.55 | 1,470.33 | +10.76% ↑ |

**🤖 AI Summary:**

- The most notable improvement is in P99 TPOT, which dropped **-6.35%**
from 334.38ms → 313.15ms, indicating reduced tail latency for per-token
generation under heavy load.
- TTFT improved across all percentiles: mean dropped **-3.20%** (7,116ms
→ 6,889ms), median **-3.89%** (4,296ms → 4,129ms), and P99 **-4.81%**
(18,371ms → 17,488ms), reflecting faster time-to-first-token across the
board.
- TPOT also improved consistently, with mean down **-2.29%** (245.78ms →
240.14ms) and median down **-1.84%** (264.03ms → 259.18ms), showing a
modest but steady reduction in per-token generation time.
- Throughput saw a slight uplift of roughly **+1.6%** across request,
output token, and total token throughput. Peak output token throughput
jumped **+8.73%** (2,475 → 2,691 tok/s), suggesting better burst
handling capacity.
- P99 ITL increased **+10.76%** (1,328ms → 1,470ms), the largest
regression in the run. Median ITL also ticked up **+2.86%** (117.71ms →
121.08ms). These tail-latency spikes may reflect scheduling variability
under peak concurrency and could be within run-to-run noise, but are
worth monitoring.
- Overall, the PR delivers a consistent improvement in both throughput
and latency, with the caveat that P99 inter-token latency regressed —
likely a transient effect given that mean ITL still improved by
**-2.09%**.

---
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2026-03-23 15:24:26 +08:00
Shanshan Shen
5c0d02f689 [Bugfix] Fix multi-instance serving OOM on single card (#7427)
### What this PR does / why we need it?
Fix https://github.com/vllm-project/vllm-ascend/issues/7308.

Subtract `init_non_torch_memory` (which may have been used by the first
instance) from the total `non_torch_memory` when calculating
`available_kv_cache_memory`.

In other words, directly use `non_torch_memory_increase` (already contained
in `non_kv_cache_memory`) to calculate `available_kv_cache_memory`.
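
With the second-instance numbers from the logs in the test section below, the change works out to the following (a small arithmetic sketch, values in GiB):

```python
# Second-instance numbers from the logs in the test section below (GiB).
requested_memory = 18.287

# Before: the non-torch memory cleared by empty_cache (largely the first
# instance's usage) was also subtracted, driving the result negative.
kv_before = requested_memory - 1.234 - 18.372   # ~= -1.32 GiB

# After: only this instance's own non-torch increase (already included in
# non_kv_cache_memory) is accounted for; nothing extra is subtracted.
kv_after = requested_memory - 1.233 - 0.0       # ~= 17.05 GiB

print(f"before: {kv_before:.2f} GiB, after: {kv_after:.2f} GiB")
```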

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Launch two vllm-ascend instances sequentially on a single card.

```bash
# Launch first instance
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B \
--port 8100 \
--host 0.0.0.0 \
--additional-config='{"enable_cpu_binding":true}'  \
--gpu-memory-utilization 0.3 \
--max-num-seqs 1 \
--max-model-len 2048 \
--max-num-batched-tokens 2048 \
--no-enable-prefix-caching \
--enforce-eager

# Launch second instance
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B \
--port 8101 \
--host 0.0.0.0 \
--additional-config='{"enable_cpu_binding":true}'  \
--gpu-memory-utilization 0.3 \
--max-num-seqs 1 \
--max-model-len 2048 \
--max-num-batched-tokens 2048 \
--no-enable-prefix-caching \
--enforce-eager
```

**Before this PR:**

```bash
# First instance:
------------------------------------------------------------------
requested_memory: 18.287109375 GiB
non_kv_cache_memory: 1.2340388298034668 GiB
init_non_torch_memory: 0.3616676330566406 GiB
non_torch_memory_before_empty_cache: 0.3896217346191406 GiB
non_torch_memory_increase: 0.0279541015625 GiB
non_torch_memory_cleared_by_empty_cache: 0.3616676330566406 GiB
------------------------------------------------------------------

# Second instance:
------------------------------------------------------------------
requested_memory: 18.287109375 GiB
non_kv_cache_memory: 1.2336344718933105 GiB
init_non_torch_memory: 18.37220001220703 GiB
non_torch_memory_before_empty_cache: 18.399906158447266 GiB
non_torch_memory_increase: 0.02754974365234375 GiB
non_torch_memory_cleared_by_empty_cache: 18.372356414794922 GiB
------------------------------------------------------------------
# available_kv_cache_memory = requested_memory - non_kv_cache_memory - non_torch_memory_cleared_by_empty_cache
Available KV cache memory: -1.32 GiB
```

**After this PR:**

```bash
# First instance:
------------------------------------------------------------------
requested_memory: 18.287109375 GiB
non_kv_cache_memory: 1.2340540885925293 GiB
init_non_torch_memory: 0.36182403564453125 GiB
non_torch_memory_before_empty_cache: 0.38979339599609375 GiB
non_torch_memory_increase: 0.0279693603515625 GiB
non_torch_memory_cleared_by_empty_cache: 0.0 GiB
------------------------------------------------------------------

# Second instance:
------------------------------------------------------------------
requested_memory: 18.287109375 GiB
non_kv_cache_memory: 1.233344554901123 GiB
init_non_torch_memory: 18.74309539794922 GiB
non_torch_memory_before_empty_cache: 18.770355224609375 GiB
non_torch_memory_increase: 0.02725982666015625 GiB
non_torch_memory_cleared_by_empty_cache: 0.0 GiB
------------------------------------------------------------------
# available_kv_cache_memory = requested_memory - non_kv_cache_memory - non_torch_memory_cleared_by_empty_cache
Available KV cache memory: 17.05 GiB
```

- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
2026-03-23 14:22:59 +08:00
Canlin Guo
e68464a1d6 [Bugfix] Fix slow hasattr in ACLGraphWrapper.__getattr__ (#7442)
### What this PR does / why we need it?

Follow https://github.com/vllm-project/vllm/pull/37425,
https://github.com/vllm-project/vllm-omni/pull/1982

Copied from them:

Notice that `hasattr(self.model, "flush_pending_metadata")` costs ~6 ms per
decode step when profiling Qwen3 Omni.

The original `CUDAGraphWrapper.__getattr__` raises:
```python
  raise AttributeError(f"... cudagraph wrapper: {self.runnable}")
  ```
When hasattr() is called for a non-existent attribute, Python internally
calls __getattr__ which constructs this AttributeError. The
{self.runnable} triggers `__repr__()` on the underlying model (e.g.,
`Qwen3OmniMoeForConditionalGeneration`), which recursively traverses the
entire nn.Module tree to generate an 18,000+ character string. This
takes ~6-7ms per call.
Since `hasattr(self.model, "flush_pending_metadata") ` is called every
decode step in the Talker forward path, this adds ~6ms overhead per
step, severely impacting audio inter-chunk latency (ICL).

```Python
hasattr(self.model, "flush_pending_metadata")
  → getattr(self.model, "flush_pending_metadata")
    → not found in CUDAGraphWrapper.__dict__
    → not found in the CUDAGraphWrapper class hierarchy
    → triggers CUDAGraphWrapper.__getattr__("flush_pending_metadata")
      → hasattr(self.runnable, "flush_pending_metadata")  # runnable also doesn't have it
      → executes raise AttributeError(f"... {self.runnable}")
        → Python needs to construct the exception object
        → the f-string triggers self.runnable.__repr__()
        → Qwen3OmniMoeForConditionalGeneration.__repr__()
          → recursively traverses the entire nn.Module tree
          → generates a 18,000+ character string
          → takes ~6 ms
        → AttributeError object is created
    → hasattr catches the AttributeError and returns False
    → the 18,000-character string is immediately discarded (no one ever sees it)
```
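
One way to fix this (a sketch of the approach, not necessarily the exact change in this PR) is to keep the AttributeError but avoid interpolating `self.runnable` into the message, so its `__repr__` is never triggered:

```python
class CheapGetattrWrapper:
    def __init__(self, runnable):
        self.runnable = runnable

    def __getattr__(self, key: str):
        # Only reached when `key` is not found on the wrapper itself.
        if hasattr(self.runnable, key):
            return getattr(self.runnable, key)
        # Use only the type name in the message: interpolating self.runnable
        # would call its __repr__ and walk the entire nn.Module tree (~6 ms
        # per missing attribute, paid on every hasattr() miss).
        raise AttributeError(
            f"Attribute {key!r} not found in wrapper or in wrapped "
            f"{type(self.runnable).__name__}"
        )
```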

### Does this PR introduce _any_ user-facing change?

NO.

### How was this patch tested?

See https://github.com/vllm-project/vllm-omni/pull/1982


- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2026-03-23 09:26:24 +08:00
Qi Mao
9bf9b4b267 [Feature] Optimize Qwen3.5/Qwen3Next GDN prefill by prebuilding chunk metadata (#7487)
### What this PR does / why we need it?
This PR optimizes the Qwen3.5 and Qwen3Next GDN prefill path on Ascend
by reducing host/device synchronization overhead.

The current implementation of the `chunk_gated_delta_rule` path for
variable-length sequences prepares chunk metadata during the forward
pass. This approach triggers frequent CPU intervention and host/device
round-trips. When running prefill-heavy workloads with asynchronous
scheduling enabled, these synchronizations result in execution "bubbles"
and prefill stalling (stuttering). **Note that this does not cause
asynchronous scheduling to fail; rather, it prevents the system from
reaching its theoretical throughput due to these unnecessary stalls.**

To resolve this, the patch moves metadata preparation out of the hot
path:
- **Prebuilt Metadata:** All non-speculative varlen chunk metadata for
GDN is now prebuilt on the CPU.
- **Asynchronous Transfer:** Staging buffers are kept in pinned memory
and transferred to the NPU asynchronously.
- **Integration:** The prebuilt bundle is attached to GDN attention
metadata via `patch_gdn_attn.py` and passed into Triton wrappers.
- **Backward Compatibility:** Triton wrappers fall back to the legacy
preparation path if no prebuilt metadata is provided.
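
A minimal sketch of the pinned-memory staging pattern behind the "Asynchronous Transfer" point (buffer and function names are illustrative; the `"npu"` device string assumes torch_npu is loaded):

```python
import torch

def stage_and_upload(chunk_metadata_cpu: torch.Tensor,
                     device: str = "npu") -> torch.Tensor:
    # Keep the staging buffer in pinned (page-locked) host memory so the
    # host-to-device copy can run asynchronously with respect to the CPU.
    staging = torch.empty(chunk_metadata_cpu.shape,
                          dtype=chunk_metadata_cpu.dtype,
                          pin_memory=True)
    staging.copy_(chunk_metadata_cpu)
    # non_blocking=True lets the copy overlap with subsequent host work.
    return staging.to(device, non_blocking=True)
```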

- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: maoxx241 <maomaoyu870@gmail.com>
2026-03-22 23:09:23 +08:00
LoganJane
b2e71b7930 [Bugfix] Fix get_rope_shape for Kimi-K2.5 (#7521)
### What this PR does / why we need it?
Remove the logic that moves the input of `get_rope_shape` from device to host.

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

Signed-off-by: LoganJane <loganJane73@hotmail.com>
2026-03-22 21:06:31 +08:00
Cao Yi
9e2965bae2 [Feature] Support Flash Comm V1 for VL models (with MLA) (#7390)
## Summary

Flash Comm V1 (flashcomm1) was previously blocked for all VL models.

**Root cause:** For VL models, `inputs_embeds` at layer 0 originates
from the vision encoder as a full `[N, H]` tensor — it has **not** been
reduce-scattered across TP ranks. The original MLA forward path assumed
inputs were already scattered, producing wrong output shapes under TP >
1.

**Fix:**
- Detect at init time (statically, not via runtime shape checks) whether
a layer is the first layer of a VL model (`is_vl_first_layer`) so dynamo
treats the branch as a constant.
- In `AscendMultiHeadLatentAttention.forward`, when `flashcomm1 + TP > 1
+ is_vl_first_layer`, set `need_gather_q_kv=False` and pre-allocate
output as `[N//tp_size, H]`.
- Remove the platform-level assertion that prevented VL models from
enabling Flash Comm V1.

**Other improvements:**
- `is_vl_model()` now uses vllm's canonical detection (`hf_config is not
hf_text_config`) instead of fragile key-name checks, with the old checks
kept as fallback.
- Added `parse_layer_idx(prefix)` utility.
- Added `maybe_chunk_residual` call in `AscendRMSNorm` before the
add-rms-norm op.
- Removed unnecessary CPU/fp32 round-trip in
  `AscendLearnable2DInterpPosEmbDivided_fixed.forward()`.
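
A minimal sketch of the two helpers mentioned above; the canonical VL check comes from the description, while `parse_layer_idx` here assumes the usual `...layers.<idx>...` prefix convention:

```python
import re

def is_vl_model(model_config) -> bool:
    # vLLM's canonical multimodal detection: a text-only model reuses the
    # same object for hf_config and hf_text_config.
    return model_config.hf_config is not model_config.hf_text_config

def parse_layer_idx(prefix: str) -> int | None:
    # e.g. "model.layers.0.self_attn.o_proj" -> 0
    match = re.search(r"layers\.(\d+)", prefix)
    return int(match.group(1)) if match else None
```
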
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: LoganJane <loganJane73@hotmail.com>
2026-03-22 21:05:28 +08:00
Qi Mao
9d0b7c8e98 [Platform][BugFix] Preserve hybrid block size on Ascend (#7528)
### What this PR does / why we need it?
This PR fixes a startup regression for Ascend hybrid attention + mamba
models after upgrading to vLLM `0.18.0`. The hybrid model-specific config
logic already computes a suitable block size, but after the `0.18.0`
upgrade, worker initialization still calls the generic platform hook:
- `current_platform.update_block_size_for_backend(vllm_config)`

### How this PR fixes it

This PR keeps the fix strictly inside `vllm-ascend`.

It adds an Ascend override for
`NPUPlatform.update_block_size_for_backend()`:

- for hybrid models, do not run the generic upstream block-size fallback
- preserve the block size that was already computed by the hybrid
model-specific config logic
- for non-hybrid models, keep the original upstream behavior unchanged

- vLLM version: v0.18.0
- vLLM main:
8b6325758c
---------
Signed-off-by: maoxx241 <maomaoyu870@gmail.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2026-03-22 11:21:49 +08:00
XiaoxinWang
cbf46fad3c fixed graph mode bug. (#7460)
### What this PR does / why we need it?
In fulldecodeonly mode, num_req_padded was set to an incorrect value,
causing accuracy degradation in Qwen3-Next. Therefore, we added a check
for compilation_config.cudagraph_mode to the conditional logic, ensuring
that padding is applied only in FULL mode.


### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
8a680463fa

Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2026-03-22 10:09:37 +08:00
Zetong Li
84a74f0cb1 [Bugfix] Fix padding logic in eagle proposer for kimi25 (#7348)
### What this PR does / why we need it?
This PR aims to fix padding logic in eagle proposer for kimi25. Main
changes involve:
1. modify the way to obtain draft model attention builder and backend
2. add block table padding & related tensor slicing in common metadata
when `draft_step>1` for solving fia verifying error
3. replace block table in `update_graph_params` for solving fia
verifying error

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: Zetong Li <slippersss@126.com>
2026-03-21 16:57:22 +08:00
meihanc
bff4fbfca5 upgrade to 0.18.0 (#7502)
### What this PR does / why we need it?
1. upgrade to 0.18.0
2. ensure kernel_block_sizes is int for Eagle drafter
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2026-03-21 16:05:38 +08:00
HongtaoYang
80a4265717 [Feat] Support separate attention backend for target and draft model. (#7342)
### What this PR does / why we need it?
This PR enables separate attention backend configuration for target and
draft models in speculative decoding, decoupling the previously bound
attention backend settings between the two models.

It solves the compatibility issue where some draft models do not support
the attention backend used by the target model, and allows users to
select the optimal attention backend for each model individually to
maximize inference performance. The change is fully backward compatible.
---------
Signed-off-by: SidaoY <1024863041@qq.com>
2026-03-21 10:48:01 +08:00
linfeng-yuan
88d03a783f [refactor] replace scattered business kwargs with typed request objects and explicit stage boundaries (#7024)
### What this PR does / why we need it?
Refactor `vllm_ascend/ops/fused_moe` to replace scattered MoE business
`**kwargs` with typed request objects and explicit stage boundaries.

- Prepare, dispatch, MLP, and quant stages now have clearer ownership.
- Main MoE path no longer depends on business `kwargs.get(...)` lookups.
- Comm and dispatcher interfaces are request-only on the main path.
- UTs can assert stage-level fields directly instead of inferring
behavior indirectly.
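
A minimal sketch of what such a typed request object can look like (class and field names here are illustrative, not the actual ones in `vllm_ascend/ops/fused_moe`):

```python
from dataclasses import dataclass

import torch

@dataclass
class MoEDispatchRequest:
    # Explicit, typed fields replace the old scattered **kwargs lookups, so
    # each stage (prepare / dispatch / MLP / quant) owns a well-defined input.
    hidden_states: torch.Tensor
    router_logits: torch.Tensor
    top_k: int
    expert_map: torch.Tensor | None = None
```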

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed.

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-20 23:23:57 +08:00
yesyue-w
c860535246 【A5】【Qwen VL】Qwen VL adapt for A5 (#7046)
### What this PR does / why we need it?
Replace the '_npu_flash_attention_unpad' operator with the
'npu_fusion_attention' operator to ensure that the Qwen VL model can run
in the A5 environment and remove the 'mrope' operator call restriction
for A5.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: 汪越 <wangyue361@h-partners.com>
2026-03-20 16:56:12 +08:00
idouba
f39f566e22 Refactor duplicated code into a common method to reduce redundancy (#7210)
### What this PR does / why we need it?

1. Extract duplicated code into a method.

Define `_get_input_parallel` in the parent class `CustomRowParallelOp`, and
call this helper from the `_apply_impl` method of its 5 child classes:
- MLPRowParallelOp
- OProjRowParallelOp
- Flashcomm2OProjRowParallelOp
- MatmulAllreduceRowParallelOp
- SequenceRowParallelOp

2. Fix a variable typo: use `split` instead of `splitted` for the past tense.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: idouba <zhangchaomeng@huawei.com>
2026-03-20 16:49:02 +08:00
Siyuan Kong
a16c99141b Adapt w8a8mxfp8 quantization for Qwen VL models (#7417)
### What this PR does / why we need it?

This PR adapts the `w8a8_mxfp8` quantization method to support Qwen
Vision-Language (VL) models. Key changes include:
- Reshaping multi-dimensional input tensors to 2D before the quantized
matrix multiplication.
- Reshaping the 2D output back to its original multi-dimensional format.
- Adding specific output reshaping for the visual components of Qwen VL
models.
- Casting the bias tensor to `float32` to comply with the
`npu_quant_matmul` kernel requirements.

These changes are necessary to enable `w8a8_mxfp8` quantization for
models with multi-modal inputs like Qwen VL.
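
A minimal sketch of the reshape handling described above; `apply_quant_matmul` is a placeholder for the quantized matmul path (`npu_quant_matmul` in the real code):

```python
import torch

def quant_linear_nd(x: torch.Tensor, apply_quant_matmul) -> torch.Tensor:
    # The quantized matmul expects a 2D activation, so flatten all leading
    # (batch / sequence / vision patch) dims, run the kernel, and restore
    # the original leading shape on the output.
    orig_shape = x.shape
    x2d = x.reshape(-1, orig_shape[-1])
    out2d = apply_quant_matmul(x2d)  # [tokens, out_features]
    return out2d.reshape(*orig_shape[:-1], out2d.shape[-1])
```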

### Does this PR introduce _any_ user-facing change?

No, this is a backend enhancement to extend quantization support to new
model architectures. There are no user-facing API or behavior changes.

### How was this patch tested?

CI is expected to pass. Manual testing should be performed with a Qwen
VL model using `w8a8_mxfp8` quantization to verify correctness and
performance.

- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: ksiyuan <ksiyuan@umich.edu>
2026-03-20 16:18:58 +08:00
LI SHENGYONG
4e6dbe0956 [EPLB][Bugfix] Set parallel_config.enable_eplb to true to load redundant experts (#7470)
### What this PR does / why we need it?
PR https://github.com/vllm-project/vllm/pull/37136 breaks EPLB because
it filters out redundant experts.
PR https://github.com/vllm-project/vllm/pull/37322 fixes it by using
parallel_config.enable_eplb to determine whether to skip the
weight-loading filter.
But in vllm-ascend, parallel_config.enable_eplb is always false, so when
we use EPLB we temporarily set it to true.
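
A minimal sketch of the workaround (the Ascend-side switch name is illustrative):

```python
def maybe_enable_upstream_eplb(vllm_config, ascend_eplb_enabled: bool) -> None:
    # Upstream weight loading only keeps redundant experts when
    # parallel_config.enable_eplb is True, so mirror the Ascend switch there.
    if ascend_eplb_enabled:
        vllm_config.parallel_config.enable_eplb = True
```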

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

![Snipaste_2026-03-19_16-13-01](https://github.com/user-attachments/assets/b3a4911e-36b3-4c31-951c-7c091f416d00)
| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-03-20 15:22:55 +08:00
LI SHENGYONG
1e05c4908f [EPLB] Reduce the memory used for batch_isend_irecv (#7344)
### What this PR does / why we need it?

#6729 seemed to reduce the NPU memory usage of EPLB, but it actually just
moved the buffer allocation from dist.all_gather_into_tensor to
dist.batch_isend_irecv, so the overall NPU memory usage was not reduced.
This PR actually reduces the memory usage in this part.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Remaining memory of each rank before the repair.
<img width="649" height="99" alt="image"
src="https://github.com/user-attachments/assets/52a67592-e0e8-4f9a-b194-b84cb848c598"
/>

Remaining memory of each rank after the repair.
<img width="641" height="99" alt="image"
src="https://github.com/user-attachments/assets/0bc2e67c-f328-4dea-98af-d7a459fb4876"
/>

Close EPLB.
<img width="543" height="45" alt="image"
src="https://github.com/user-attachments/assets/6dcba19d-4401-44b8-a6d3-c7b35ee983c7"
/>

Memory of weights for each rank.
<img width="648" height="46" alt="image"
src="https://github.com/user-attachments/assets/4db2fd04-98a0-4d26-a026-2e8287102b99"
/>

Estimated memory for EPLB: 15.68  / 48 (layer_num) + 2 * 0.02 = 0.35 GB


- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-03-20 12:25:58 +08:00
SILONG ZENG
eb92e7d50e [Bugfix] Restore balance scheduling patch for v0.17.0 (#7479)
### What this PR does / why we need it?
Restore previously introduced patches:
- https://github.com/vllm-project/vllm-ascend/pull/5212

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-03-19 20:12:57 +08:00
ichaoren
9d1452c74d [OPS]add split_qkv_tp_rmsnorm_rope ops (#7376)
### What this PR does / why we need it?
This PR introduces a new fused Triton kernel,
`split_qkv_tp_rmsnorm_rope` for Minimax-m2.5.

The implementation includes two Triton kernels:
1. `_split_qkv_and_compute_local_qk_var_kernel`: Splits the QKV input
and computes the local variance for RMSNorm.
2. `_apply_global_rmsnorm_kernel`: Applies global RMSNorm (considering
TP all-reduce for variance) and NeoX-style RoPE.
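
A minimal PyTorch reference sketch of the "global RMSNorm under TP" idea behind kernel 2 (an assumed reference formulation, not the Triton kernel itself): each rank computes a local sum of squares over its shard of the normalized dimension, and the variance is completed with an all-reduce.

```python
import torch
import torch.distributed as dist

def tp_global_rmsnorm(x_local: torch.Tensor, weight_local: torch.Tensor,
                      global_dim: int, eps: float = 1e-6) -> torch.Tensor:
    # x_local / weight_local hold this rank's shard of the normalized dim.
    sq_sum = x_local.float().pow(2).sum(dim=-1, keepdim=True)
    # Finish the variance with an all-reduce over the TP group, mirroring
    # the "TP all-reduce for variance" step of the fused kernel.
    dist.all_reduce(sq_sum)
    variance = sq_sum / global_dim
    normed = x_local.float() * torch.rsqrt(variance + eps)
    return (normed * weight_local.float()).to(x_local.dtype)
```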

### Does this PR introduce _any_ user-facing change?
Does not.

### How was this patch tested?
```python
pytest tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_split_qkv_tp_rmsnorm_rope.py
```
### Test Data
A3 TP16
Baseline

| data       | TTFT(ms) | TPOT(ms) | TPS    |
|------------|---------:|---------:|-------:|
| 4k/1k@bs1  | 267.55   | 25.5     | 38.85  |
| 4k/1k@bs4  | 542.4    | 26.51    | 148.06 |

With this PR

| data       | TTFT(ms) | TPOT(ms) | TPS    |
|------------|---------:|---------:|-------:|
| 4k/1k@bs1  | 234.64   | 20.96    | 47.24  |
| 4k/1k@bs4  | 508.36   | 22.16    | 176.69 |


- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: xutianyi <xutianyi5@huawei.com>
Co-authored-by: xutianyi <xutianyi5@huawei.com>
2026-03-19 17:19:18 +08:00
Li Wang
83a4065b4b [CI] Add pre-commit check for patch logger (#7446)
### What this PR does / why we need it?
See https://github.com/vllm-project/vllm-ascend/pull/7402; the pre-commit
hook will forbid `init_logger(__name__)` in vllm_ascend patch modules.

- vLLM version: v0.17.0
- vLLM main:
8a680463fa

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-19 16:53:20 +08:00
Feng-xiaosuo
38e637eef5 Fix manual mapping registration and kimi_k2 layer name mapping (#7347)
### What this PR does / why we need it?

This PR fixes the layer name mapping logic in `AscendModelSlimConfig`
for quantization config loading.
1. **kimi_k2 model layer name mapping issue**: The `kimi_k2` model has a
unique layer naming convention that differs from the standard
`hf_to_vllm` mapping. One layer was defined in the mapper but was not
being correctly applied, causing quantization config lookup failures.
2. **Manual mapping registration timing issue**: The manual mapping
check in `apply_vllm_mapper` was executed before `vllm_config` was
initialized, causing `model_type` to be unavailable. This prevented some
models with manual mappings from being correctly registered.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Tested with `kimi_k2` model to verify the special layer name mapping
works correctly. Also tested with other models that have manual mappings
defined in `QUANT_MODEL_PREFIX_MAPPINGS` to ensure the registration
timing fix works properly.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Matrix_K <zhangke144@huawei.com>
Signed-off-by: Feng-xiaosuo <tengchang1@huawei.com>
Co-authored-by: Matrix_K <zhangke144@huawei.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
2026-03-19 16:46:41 +08:00
chenxi-hh
42bcad7e9b GMM custom operator optimization in small batch scenarios (#7100)
### What this PR does / why we need it?
GMM custom operator optimization in small batch scenarios

### How was this patch tested?

Qwen3-30B input: 4k, output: 1k

batch 1:
TPOT 7.9 ms -> 7.0 ms
Output Token Throughput 125.4651 token/s -> 140.6278 token/s

batch 2:
TPOT 9.4 ms -> 8.8 ms
Output Token Throughput 211.8187 token/s -> 225.2254 token/s

batch 16:
TPOT 13.6 ms -> 13.5 ms
Output Token Throughput 1159.8213 token/s -> 1165.0982 token/s

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: chenxi-hh <chen464822955@163.com>
2026-03-19 16:10:30 +08:00
wangxiyuan
8e0ebb470a [Misc] Drop Prefetch MLP Env (#7357)
### What this PR does / why we need it?
remove deprecated environment variables related to MLP prefetching

### Does this PR introduce _any_ user-facing change?
Yes, the deprecated env vars can no longer be used.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-03-19 14:27:27 +08:00
pu-zhe
e8f7b2e3f1 [Refactor] [310p] Support Mamba Cache and support attn_head_size larger than 128 (#7372)
### What this PR does / why we need it?
1. Mamba Cache Support on 310P: Implemented logic to correctly
initialize and allocate KV cache for Mamba models on the 310P platform,
including handling of state tensors and page size alignment.
2. Increased Attention Head Size Support: Modified the attention backend
to support attn_head_size larger than 128 by dynamically selecting
appropriate kernel block sizes based on hardware limitations (e.g.,
block_size * head_size <= 16384); see the sketch after this list.
3. Refactored KV Cache Allocation: Consolidated and improved the KV
cache allocation mechanism, moving from separate size calculation and
allocation steps to a unified _allocate_kv_cache_tensors method that
handles both Attention and Mamba specific cache structures.
4. Dynamic Mamba Config Patching: Introduced conditional loading of
Mamba configuration patches, specifically using patch_mamba_config_310
for the 310P platform to ensure platform-specific optimizations and
validations.
5. Reserve reasonable memory to allocate KV cache to avoid OOM issue
with default gpu_memory_utilization.
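
A minimal sketch of the block-size selection mentioned in item 2, using the limit quoted there (the helper name is illustrative):

```python
def select_kernel_block_size(requested_block_size: int, head_size: int,
                             limit: int = 16384) -> int:
    # The kernel requires block_size * head_size <= limit, so shrink the
    # block size for large attention heads instead of rejecting them.
    return min(requested_block_size, max(1, limit // head_size))
```
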
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Qwen3.5 E2E test
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: pu-zhe <zpuaa@outlook.com>
2026-03-19 09:16:22 +08:00
Nengjun Ma
8b79d4de52 Main2main upgrade to vllm 0317 afternoon (#7409)
### What this PR does / why we need it?

1.fix "TypeError: get_attn_backend() remove variable": [Refactor
`check_and_update_config`](https://github.com/vllm-project/vllm/pull/35122)

2.fix [Rename `compile_ranges_split_points` to
`compile_ranges_endpoints`](https://github.com/vllm-project/vllm/pull/36027)

3.fix "RuntimeError: device_allocator not a DeviceAllocator":[Replace
memory related torch.cuda
APIs"](https://github.com/vllm-project/vllm/pull/37031)

4.fix [Support multiple KV groups in OffloadingSpec
](https://github.com/vllm-project/vllm/pull/36610) removed
self.offloaded_block_size and changed self.gpu_block_size from a scalar
to a tuple of per-group block sizes, adding block_size_factor.

5.fix [Consolidate
SupportsEagle](https://github.com/vllm-project/vllm/pull/36063) renamed
get_eagle3_aux_hidden_state_layers() to
get_eagle3_default_aux_hidden_state_layers() and added a
supports_eagle3() guard before calling it.

### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
E2E


- vLLM version: v0.17.0
- vLLM main:
8a680463fa

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
2026-03-18 23:24:27 +08:00
jiangmengyu18
305820f1a9 [Bugfix] fix bug about model type of qwen3_vl_8b_instruct_w8a8 (#7383)
### What this PR does / why we need it?
Adapt to the model type of Qwen3-VL-8B-Instruct-W8A8

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
2026-03-18 20:30:03 +08:00
Angazenn
ec34bf0062 [Misc]fix logger which does not take effects in patches (#7402)
### What this PR does / why we need it?
This PR fixes the logger initialization in patches so that the log info
can be displayed as expected.

### Does this PR introduce _any_ user-facing change?
No.

- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: Angazenn <supperccell@163.com>
2026-03-18 17:13:12 +08:00
rjg-lyh
c1392a6ce6 [bugfix][accuracy] Fix ds indexer accuracy problem caused by k rope (#7341)
### What this PR does / why we need it?
The rotary algorithm in the DeepSeek indexer should be neox-style instead
of gptj-style. PR #4641 fixed this accuracy bug in the original PyTorch
version, but PR #5701 accidentally removed that fixed code line and
reverted the implementation back to the problematic version. This PR
restores the fix.
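
For reference, a minimal sketch of the two rotary styles involved (standard formulations, not the indexer code itself): neox-style rotates the two halves of the head dimension, while gptj-style rotates interleaved even/odd pairs; the cos/sin layout must match the chosen style.

```python
import torch

def rotate_half_neox(x: torch.Tensor) -> torch.Tensor:
    # NeoX style: split the head dim into two halves and rotate them.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rotate_half_gptj(x: torch.Tensor) -> torch.Tensor:
    # GPT-J style: rotate interleaved even/odd pairs of the head dim.
    x1 = x[..., 0::2]
    x2 = x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
               is_neox_style: bool) -> torch.Tensor:
    # cos/sin must be laid out to match the style: duplicated halves for
    # neox, interleaved pairs for gptj.
    rotate = rotate_half_neox if is_neox_style else rotate_half_gptj
    return x * cos + rotate(x) * sin
```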

Signed-off-by: rjg-lyh <1318825571@qq.com>
2026-03-18 14:20:21 +08:00
wangxiaoteng888
c7157af8f7 [P/D] LayerwiseConnector supports the virtual push functionality on node D. (#7361)
### What this PR does / why we need it?
The LayerwiseConnector supports virtual push functionality on node D. By
adding a do_virtual flag to request metadata, the system can now identify
and process certain requests virtually, bypassing the actual KV cache
transfer. This allows immediate completion of these requests from the
consumer's perspective, potentially enabling optimizations or specific
testing scenarios where physical data transfer is not required.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-03-18 10:50:02 +08:00
zhangyiming
1c954ff264 [main2main] upgrade vllm to 0308 (#7213)
### What this PR does / why we need it?
Update main2main to vllm 0308.
breaks:

* https://github.com/vllm-project/vllm/pull/30681
* https://github.com/vllm-project/vllm/pull/35552 remove
self.cudagraph_batch_sizes
* https://github.com/vllm-project/vllm/pull/35158 clear_metadata ->
defer_finalize
* https://github.com/vllm-project/vllm/pull/36006 remove
CacheConfig.cpu_offload_gb
* https://github.com/vllm-project/vllm/pull/35472
* https://github.com/vllm-project/vllm/pull/34552 attn_metadata_builder
* https://github.com/vllm-project/vllm/pull/30515 profile_seq_lens
* https://github.com/vllm-project/vllm/pull/28053 

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Co-authored-by: MrZ20 <2609716663@qq.com>
2026-03-18 09:24:43 +08:00
Chao Lei
d9ac7e8539 [Bugfix] Assertion error when decode prefix cache fully hits (#7236)
### What this PR does / why we need it?
#### Problem
When decode node enables prefix cache and the local prefix cache fully
hits, the following assertion error occurs:
```
(EngineCore_DP3 pid=34912)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 520, in step_with_batch_queue
(EngineCore_DP3 pid=34912)     engine_core_outputs = self.scheduler.update_from_output(
(EngineCore_DP3 pid=34912)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP3 pid=34912)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/core/sched/scheduler.py", line 1520, in update_from_output
(EngineCore_DP3 pid=34912)     self._update_from_kv_xfer_finished(kv_connector_output)
(EngineCore_DP3 pid=34912)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/core/sched/scheduler.py", line 2120, in _update_from_kv_xfer_finished
(EngineCore_DP3 pid=34912)     assert RequestStatus.is_finished(req.status)
(EngineCore_DP3 pid=34912)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP3 pid=34912) AssertionError
```

The error is triggered in scheduler.py at _update_from_kv_xfer_finished:
```
  if req.status == RequestStatus.WAITING_FOR_REMOTE_KVS:
      self.finished_recving_kv_req_ids.add(req_id)
  else:
      assert RequestStatus.is_finished(req.status)
```

#### Root Cause

When the decode node has prefix cache enabled and the local prefix cache fully hits:

1. get_num_new_matched_tokens returns ext_tokens=0, load_kv_async=False when the decode prefix cache fully hits
2. Request status becomes RUNNING (not WAITING_FOR_REMOTE_KVS)
3. However, update_state_after_alloc still adds the request to _reqs_need_recv because remote_block_ids exists in kv_transfer_params
4. Worker processes the request in _handle_request:
   - _transfer_kv_cache returns immediately (no actual transfer, local_block_ids is empty)
   - the finally block still calls update_done_task_count(request_id)
5. finished_recving contains this request
6. When _update_from_kv_xfer_finished processes finished_recving, the request status is RUNNING
7. The assertion fails

#### Solution

In _handle_request, only notify the scheduler (update_done_task_count) when
an actual KV transfer happened (local_block_ids is not empty). The signals
that notify Prefill to release its KVCache
(_send_done_signal_to_free_remote_port and _send_done_recv_signal) are
still sent regardless.
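
A minimal sketch of the guarded flow; the method names come from the description above, while the function signature and arguments are illustrative:

```python
def handle_request(connector, request_id: str, local_block_ids: list[int]) -> None:
    try:
        # No local blocks means the decode-side prefix cache fully hit and
        # there is nothing to actually transfer for this request.
        if local_block_ids:
            connector._transfer_kv_cache(request_id, local_block_ids)
    finally:
        # Always tell the prefill side it may release its KV cache.
        connector._send_done_signal_to_free_remote_port(request_id)
        connector._send_done_recv_signal(request_id)
        # Only report a finished recv to the scheduler when a transfer
        # actually happened, so a RUNNING request never ends up in
        # finished_recving and the assertion cannot trip.
        if local_block_ids:
            connector.update_done_task_count(request_id)
```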

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: LCAIZJ <leichao139636@163.com>
2026-03-17 15:17:45 +00:00