xc-llm-ascend

Author	SHA1	Message	Date
panchao-hub	d98a0727c8	[Feat] Add npugraph_ex enablement logging (#7574 ) ### What this PR does / why we need it? - Replace local logging with vllm.logger for consistency - Add info log when enable_npugraph_ex is enabled - Add info log when enable_static_kernel is enabled - Unify logging message format to use config switch names consistently - This helps users understand which compilation optimizations are active ### Does this PR introduce _any_ user-facing change? Yes. Users will now see informational log messages when enable_npugraph_ex or enable_static_kernel features are enabled, providing better visibility into the compilation optimization settings being used. ### How was this patch tested? - Code passes all pre-commit hooks (ruff check, ruff format, codespell, typos) - Follows project coding conventions and style guidelines - Logger import matches the pattern used elsewhere in the codebase Signed-off-by: p00465316 <panchao13@huawei.com> Co-authored-by: p00465316 <panchao13@huawei.com>	2026-03-24 17:04:48 +08:00
Angazenn	bdb65319a9	[UT] Align input arguments with Ascend(Yarn)RotaryEmbedding with vLLM and add ut (#7358 ) ### What this PR does / why we need it? This PR adds the arguments missing from `AscendRotaryEmbedding` and `AscendYarnRotaryEmbedding` so that their signatures conform to vLLM. Corresponding unit tests are also added. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-03-24 16:02:56 +08:00
liziyu	73cadecfb4	[P/D] [Bugfix] fix mooncake layerconnector dead when update_decoder_info fail (#7514 ) ### What this PR does / why we need it? Fix the Mooncake layer connector dying when `update_decoder_info` fails. When node D is dead, a failed `update_decoder_info` call on node P should not cause node P to die as well. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? By CI. - vLLM version: v0.17.0 - vLLM main: `8b6325758c` --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2026-03-24 15:49:46 +08:00
zxr2333	67aad1fce8	[BugFix][P/D] fix padding error on FullGraph mode && fix layerwise connector mamba accuracy (#7506 ) ### What this PR does / why we need it? 1. When the FullGraph mode is used, the branches in the Triton operator are compiled and fixed during the graph capture process, causing the branch condition in the `fused_recurrent_gated_delta_rule` operator, which checks whether `ssm_state_indices >= 0` before writing to the SSM cache, to become invalid. Now, the write operation is performed regardless of the value. This results in the operator performing address offset calculations and writing to the SSM cache based on the -1 offset after -1 is used for padding in vLLM GDN backend. Since the conv cache and SSM cache in vLLM Ascend implementation are actually a single continuous tensor divided into two parts, this leads to data overwriting and the generation of NaN values. This PR addresses two cases where padding -1 is required in the GDN metadata builder. The same logic is used to replace the padding with 0 to avoid the problem of memory overwriting, because block 0 is a reserved block. 2. Fix layerwise connector bug for mamba cache sending on heterogeneous TP. - vLLM version: v0.17.0 - vLLM main: `8b6325758c` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-03-24 15:15:55 +08:00
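The padding problem described in 67aad1fce8 can be illustrated with a toy model. This is only a sketch under stated assumptions: the buffer layout (conv region first, SSM region second), the block size, and the helper names `make_buffer`, `write_ssm_state`, and `pad_indices` are all hypothetical and are not taken from the vLLM Ascend code. It shows why an unguarded write with a `-1` padding index lands outside the SSM region, while padding with `0` only touches the reserved block.

```python
BLOCK = 4        # elements per cache block (toy value, assumption)
NUM_BLOCKS = 3   # blocks per region; block 0 is treated as reserved
SSM_BASE = BLOCK * NUM_BLOCKS  # SSM region starts right after the conv region


def make_buffer():
    """One contiguous buffer: conv region (1.0) followed by SSM region (2.0)."""
    conv = [1.0] * (BLOCK * NUM_BLOCKS)
    ssm = [2.0] * (BLOCK * NUM_BLOCKS)
    return conv + ssm


def write_ssm_state(buffer, state_indices, value):
    """Unconditional write, mimicking a kernel whose `idx >= 0` guard was lost."""
    for idx in state_indices:
        start = SSM_BASE + idx * BLOCK  # plain address arithmetic, no bounds check
        for off in range(BLOCK):
            buffer[start + off] = value


def pad_indices(indices, padded_len, pad_value):
    """Pad the index list to a fixed length, as a graph-captured batch requires."""
    return indices + [pad_value] * (padded_len - len(indices))


# Padding with -1: the write lands in the last conv block and corrupts it.
bad = make_buffer()
write_ssm_state(bad, pad_indices([1, 2], 4, -1), 9.0)
conv_corrupted = bad[:SSM_BASE] != [1.0] * SSM_BASE

# Padding with 0: only the reserved SSM block 0 is touched, conv data survives.
good = make_buffer()
write_ssm_state(good, pad_indices([1, 2], 4, 0), 9.0)
conv_intact = good[:SSM_BASE] == [1.0] * SSM_BASE
```

In this toy layout the `-1` index resolves to an address one block before the SSM region, which is the kind of overwrite the commit message attributes the NaN values to.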
LeeWenquan	475b4b0cea	Revert "GMM custom operator optimization in small batch scenarios (vllm-project#7100)" (#7557 ) ### What this PR does / why we need it? This reverts commit `42bcad7e9b`. That commit caused an accuracy drop on Qwen3-Next: on 150 items of gsm8k, the score fell from 98 to 91. - vLLM version: v0.18.0 - vLLM main: `6a9cceb219` Signed-off-by: Your Name <you@example.com> Co-authored-by: Your Name <you@example.com>	2026-03-24 14:24:44 +08:00
Shaoxu Cheng	83bd77c983	[310p]: add rmsnorm gated fallback and unit test (#7424 ) ### What this PR does / why we need it? RFC #7394 310P cannot use the fused `rmsnormgated` operator and must fall back to the native implementation. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? ut - vLLM version: v0.17.0 - vLLM main: `4497431df6` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-03-24 09:00:11 +08:00
jiaojiao	1de805ce0a	[Ops][Misc] Refactor and optimize CausalConv1d for Ascend (#7495 ) ### What this PR does / why we need it? During the prefill phase of Qwen3-Next and Qwen3.5, the `torch.ops._C_ascend.causal_conv1d_fn` operator exhibits significant performance bottlenecks. To address this, we have re-implemented the optimization using `torch.ops._C_ascend.npu_causal_conv1d_custom`. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? 1 accuracy test ``` [2026-03-20 16:44:22,961] [ais_bench] [INFO] Start launch task state board ... +-----------------------------+-----------+------------+-------------+----------+-------------------------------------------+---------------------+ \| Task Name \| Process \| Progress \| Time Cost \| Status \| Log Path \| Extend Parameters \| +=============================+===========+============+=============+==========+===========================================+=====================+ \| vllm-api-general-chat/gsm8k \| 2918978 \| NA \| 0:00:01 \| finish \| logs/eval/vllm-api-general-chat/gsm8k.out \| None \| +-----------------------------+-----------+------------+-------------+----------+-------------------------------------------+---------------------+ [2026-03-20 16:44:34,284] [ais_bench] [INFO] Evaluation tasks completed. [2026-03-20 16:44:34,287] [ais_bench] [INFO] Summarizing evaluation results... dataset version metric mode vllm-api-general-chat --------- --------- -------- ------ ----------------------- gsm8k 271d0b accuracy gen 96.21 ``` 2 ut modify test `pytest -sv /home/c30006096/vllm-ascend/tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_causal_conv1d.py::test_ascend_causal_conv1d` - vLLM version: v0.17.0 - vLLM main: `8b6325758c` Signed-off-by: wenba0 <3054239545@qq.com> Signed-off-by: jiaojiao <56385650+wenba0@users.noreply.github.com>	2026-03-24 00:07:12 +08:00
ZhuQi-seu	e942b62d74	[features]support split qkv rmsnorm rmope for qwen3.5 (#7368 ) ### What this PR does / why we need it? Qwen3.5 full attention supports enabling the split_qkv_rmsnorm_mrope fusion operator. ### How was this patch tested? vLLM version: v0.16.0 vLLM-Ascend main: https://github.com/vllm-project/vllm-ascend/pull/6730 - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: ZhuQi-seu <zhuqi12@huawei.com>	2026-03-23 23:58:12 +08:00
Nengjun Ma	fcba91a392	Main2main Upgrade vllm commit to 0320 17:00 (#7510 ) ### What this PR does / why we need it? Main2main Upgrade vllm commit to 0320 17:00 1. fix vllm refactored `_moe_forward` to call `runner.forward_impl_chunked()` when `runner.use_dp_chunking` is True. vllm PR:"[MoE Refactor] DefaultMoERunner simplification [#33049](https://github.com/vllm-project/vllm/pull/33049)" 2.fix vllm moved the call to `self._set_compile_ranges()` in `VllmConfig.__post_init__` from before `check_and_update_config()` to after it (to allow platforms to lower `max_num_batched_tokens` first). vllm PR: "fix(xpu): Re-compute compile ranges after platform-specific config updates" [#37523](https://github.com/vllm-project/vllm/pull/37523) ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? NA - vLLM version: v0.17.0 - vLLM main: `8b6325758c` --------- Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: Claude Code <noreply@anthropic.com>	2026-03-23 21:37:41 +08:00
weijinqian0	bdd90c0088	[model_runner_v2]optimize the performance of the post_update. (#7496 ) ### What this PR does / why we need it? - This PR improves operator performance in the `post_update` phase of `model_runner_v2` on NPUs, which matters for latency-sensitive inference. - When bs = 256, the time cost drops from 26 us to 11 us. ### Does this PR introduce _any_ user-facing change? No. There are no changes to the API, interface, or other high-level behavior; the only effect is the performance improvement. ### How was this patch tested? CI passed with newly added and existing tests. In addition to the regular CI tests, benchmark tests were run on NPU hardware to measure the performance improvement of the `post_update` operators. --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2026-03-23 20:29:55 +08:00
lijiahang226	170dcbda62	[Feature] Support DeepSeek for A5 (#7232 ) ### What this PR does / why we need it? Add A5 mla operators to support running DeepSeek models on A5. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: Li Jiahang <216526138+lijiahang226@users.noreply.github.com>	2026-03-23 20:28:26 +08:00
Shaoxu Cheng	13397e9cb7	[310p] Add a PyTorch implementation of the GDN gating operator on 310P (#7430 ) ### What this PR does / why we need it? RFC #7394 Add a PyTorch implementation of the GDN gating operator on 310P. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT - vLLM version: v0.17.0 - vLLM main: `4497431df6` Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-03-23 20:26:39 +08:00
drslark	41dadd4312	[main][bugfix] Solved the problem of the d node getting stuck in the pd-separation scenario (#7534 ) ### What this PR does / why we need it? This PR fixes the D node getting stuck in the PD-separation scenario. After being stuck for a long time, the node crashes at `torch.nn.functional.linear(x, weight, bias)`. The root cause is that the shapes on the DP nodes were not aligned. - vLLM version: v0.18.0 - vLLM main: `4034c3d32e` Signed-off-by: drslark <slarksblood@qq.com>	2026-03-23 18:53:07 +08:00
Levi	9976e685b7	[Bugfix][eager][oom] fix rank0 load imbalance by no padding when multi dp (#7297 ) ### What this PR does / why we need it? Fix the multi-DP padding logic for eager mode, because it causes a rank0 load imbalance in kimi-k2.5-w4a8: all the padding tokens are routed to rank0. The fix also applies to other models running with multiple DP ranks. - before hbm usage: <img width="2229" height="733" alt="image" src="https://github.com/user-attachments/assets/50479b6d-cfd0-4206-8e80-974024652997" /> performance: ```shell Concurrency NumPrompts QPS TTFT_Avg TTFT_P50 TPOT_Avg TPOT_P50 TPOT_P90 ============ ============ ============ ============ ============ ============ ============ ============ 1 15 0.0179 1667.7803 1673.3437 35.2973 35.2775 35.3784 32 480 0.4725 2764.8027 1905.2137 40.8030 40.6978 41.0179 64 960 0.7820 4123.7096 3485.6153 48.0461 48.1598 48.2971 100 1500 1.0852 6216.7988 5714.0082 52.9323 53.0613 54.6304 108 1620 1.1040 6277.4892 5798.7425 56.3862 56.9224 57.2901 116 1740 1.1680 6563.3293 6039.5659 56.9894 57.4027 57.5786 128 1920 1.2555 7822.5551 7604.1662 57.7660 58.1768 58.2717 192 2880 1.4314 9212.1953 9131.3461 58.9905 59.1683 59.2791 256 3840 1.4480 9028.0812 8913.7937 59.0092 59.2385 59.3516 ``` - after hbm usage: <img width="2246" height="1005" alt="image" src="https://github.com/user-attachments/assets/d0936481-5a58-4bc5-a6f1-b92735d47885" /> performance: ```shell Concurrency NumPrompts QPS TTFT_Avg TTFT_P50 TPOT_Avg TPOT_P50 TPOT_P90 ============ ============ ============ ============ ============ ============ ============ ============ 1 15 0.0181 601.4171 600.9774 35.6270 35.6254 35.6480 32 480 0.4455 720.8782 724.2889 45.4250 45.4755 45.6318 64 960 0.8445 729.6209 728.2149 47.0464 47.0896 47.1985 100 1500 1.2601 723.4834 724.6673 48.3108 48.3844 48.5355 108 1620 1.3409 727.1509 720.6772 48.8962 48.9409 49.0489 116 1740 1.4080 679.9799 677.6119 49.1253 49.1983 49.3087 128 1920 1.4155 680.6284 674.9436 49.2193 49.2450 49.3763 192 2880 1.4422 684.6577 676.7833 49.2059 49.2264 49.3229 256 3840 1.4558 685.2462 678.1709 49.2191 49.2351 49.3419 ``` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: fny-coder <985619145@qq.com>	2026-03-23 17:05:02 +08:00
Nengjun Ma	8e2c59e1ee	Main2main upgrade vllm commit to 03 19 17:00 (#7478 ) ### What this PR does / why we need it? Upgrade vllm commit to 2026.03.19. 1. Fix `socket` being removed from `StatelessProcessGroup`. Upstream vLLM PR [#36330](https://github.com/vllm-project/vllm/pull/36330) ("elastic_ep: Fix stateless group port races") refactored `StatelessProcessGroup` and removed the `socket: socket.socket | None` field. Socket ownership was moved to a new `create_tcp_store()` helper instead of being stored as a field on the dataclass. 2. Fix the `virtual_engine` parameter being removed from `set_forward_context()`. Upstream PR: "[V0 Deprecation] Deprecate virtual engine" [#37195](https://github.com/vllm-project/vllm/pull/37195) ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? NA - vLLM version: v0.17.0 - vLLM main: `8b6325758c` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-03-23 16:25:57 +08:00
LICO67373	caa71e50ca	[Perf] Simplify FIA prefill context merge path (#7293 ) ### What this PR does / why we need it? This PR simplifies and hardens MLA prefill context merging in `vllm_ascend/attention/mla_v1.py` after FIA migration by directly building `out_list/lse_list` (without temporary chunk buffers or `cat/stack/split`) and using `reshape` for safe flattening of non-contiguous tensors. ### Does this PR introduce _any_ user-facing change? No. This is an internal refactor/stability improvement only; no API/interface behavior changes. ### How was this patch tested? - Verified tensor shape/data flow for `npu_attention_update` inputs (`out_list/lse_list`) after refactor. - Confirmed no lint errors in the modified file. - CI UT coverage on attention/MLA paths is used for validation. vLLM version: `v0.17.0` vLLM main: `vllm-project/vllm@4034c3d` --------- Signed-off-by: lico67373 <918688502@qq.com>	2026-03-23 07:47:42 +00:00
Qiu	71df17f4e6	bugfix(MC2): refactor the comm group of MC2 to be compatible with PP (#7291 ) ### What this PR does / why we need it? This PR refactors the communication group of MC2 to keep it consistent with vllm's EP group, making it compatible with PP. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-03-23 15:44:21 +08:00
Shaoxu Cheng	5b60b530d6	[Bugfix][310p] the new A5 mmencoder op donot support 310p (#7518 ) ### What this PR does / why we need it? Because the new A5 MMEncoder operator was merged, the 310P can no longer run any VL models. This PR fixes that issue. details at #7046 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? e2e - vLLM version: v0.17.0 - vLLM main: `8b6325758c` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-03-23 15:40:34 +08:00
Mengqing Cao	9e2878065a	[Spec-Decode] Fix spec decode proposer in 0.18.0 (#7544 ) ### What this PR does / why we need it? As the vllm-ascend main doesn't maintain v0.17.0 now, we'd just apply the single branch in eagle proposer. Otherwise it will raise error in v0.18.0 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? CI passed with existing test. - vLLM version: v0.18.0 - vLLM main: `8b6325758c` Signed-off-by: MengqingCao <cmq0113@163.com>	2026-03-23 15:39:24 +08:00
Shanshan Shen	6b7d9b76f1	[MM][Perf] Pre-compute `seq_lens` and put it on CPU before ViT vision blocks for better performance (#7104 ) ### What this PR does / why we need it? Background: PR https://github.com/vllm-project/vllm-ascend/pull/6448 introduced a `seq_lens` CPU cache mechanism that considerably improves performance for VL models but may lead to accuracy issues, so it was reverted. Proposed Change: In PR https://github.com/vllm-project/vllm/pull/36605, we added support for custom processing logic for OOT MMEncoder kernels in vLLM. Thus, we can pre-compute `seq_lens` (rather than `cu_seqlens`) and put it on CPU before ViT vision blocks to avoid redundant computation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? #### ✅ Functional Test Run Qwen2.5-VL: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \ --max-model-len 16384 \ --max-num-batched-tokens 16384 \ --limit-mm-per-prompt '{"image": 1}' ``` Output: ```bash "The text in the illustration is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The font appears to be modern and clean, with \"TONGYI\" having a slightly bolder and more prominent appearance compared to \"Qwen.\" The overall design is simple and professional." ``` > [!NOTE] > Since PR https://github.com/vllm-project/vllm/pull/36605 only modified `Qwen3-VL` modeling files, this PR has no effect on the `Qwen2.5-VL` model. 
--- Run Qwen3-VL: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --max-model-len 16384 \ --max-num-batched-tokens 16384 \ --limit-mm-per-prompt '{"image": 1}' ``` Output: ```bash "The text in the illustration is “TONGYI Qwen”.\n\n### How it looks:\n- “TONGYI” is written in uppercase letters in a bold, modern sans-serif font, colored blue.\n- “Qwen” is written in lowercase letters in a slightly thinner, elegant sans-serif font, colored dark gray.\n- The two lines of text are stacked vertically, with TONG." ``` --- #### ✅ Benchmark Launch the server: ``` vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --dtype bfloat16 \ --limit-mm-per-prompt '{"image": 1}' \ --max-model-len 16384 \ --max-num-batched-tokens 16384 ``` Run benchmark: ``` vllm bench serve \ --model /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --backend openai-chat \ --endpoint /v1/chat/completions \ --dataset-name hf \ --hf-split train \ --dataset-path lmarena-ai/vision-arena-bench-v0.1 \ --num-prompts 500 \ --request-rate 10 \ --burstiness 5 \ --no-stream ``` Before this PR: ``` ============ Serving Benchmark Result ============ Successful requests: 500 Failed requests: 0 Request rate configured (RPS): 10.00 Benchmark duration (s): 78.58 Total input tokens: 33418 Total generated tokens: 61431 Request throughput (req/s): 6.36 Output token throughput (tok/s): 781.78 Peak output token throughput (tok/s): 2475.00 Peak concurrent requests: 383.00 Total token throughput (tok/s): 1207.07 ---------------Time to First Token---------------- Mean TTFT (ms): 7116.24 Median TTFT (ms): 4295.84 P99 TTFT (ms): 18370.87 -----Time per Output Token (excl. 
1st token)------ Mean TPOT (ms): 245.78 Median TPOT (ms): 264.03 P99 TPOT (ms): 334.38 ---------------Inter-token Latency---------------- Mean ITL (ms): 246.99 Median ITL (ms): 117.71 P99 ITL (ms): 1327.55 ================================================== ``` After this PR: ``` ============ Serving Benchmark Result ============ Successful requests: 500 Failed requests: 0 Request rate configured (RPS): 10.00 Benchmark duration (s): 77.44 Total input tokens: 33418 Total generated tokens: 61522 Request throughput (req/s): 6.46 Output token throughput (tok/s): 794.40 Peak output token throughput (tok/s): 2691.00 Peak concurrent requests: 369.00 Total token throughput (tok/s): 1225.91 ---------------Time to First Token---------------- Mean TTFT (ms): 6888.64 Median TTFT (ms): 4128.82 P99 TTFT (ms): 17487.94 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 240.14 Median TPOT (ms): 259.18 P99 TPOT (ms): 313.15 ---------------Inter-token Latency---------------- Mean ITL (ms): 241.84 Median ITL (ms): 121.08 P99 ITL (ms): 1470.33 ================================================== ``` Performance Metrics: \| Metric \| Before this PR \| After this PR \| Comparison \| \| :----- \| :------------- \| :------------ \| :--------- \| \| Throughput \| \| \| \| \| Request throughput (req/s) \| 6.36 \| 6.46 \| +1.57% ↑ \| \| Output token throughput (tok/s) \| 781.78 \| 794.40 \| +1.61% ↑ \| \| Total token throughput (tok/s) \| 1,207.07 \| 1,225.91 \| +1.56% ↑ \| \| Peak output token throughput (tok/s) \| 2,475 \| 2,691 \| +8.73% ↑ \| \| Latency \| \| \| \| \| Benchmark duration (s) \| 78.58 \| 77.44 \| -1.45% ↓ \| \| Mean TTFT (ms) \| 7,116.24 \| 6,888.64 \| -3.20% ↓ \| \| Median TTFT (ms) \| 4,295.84 \| 4,128.82 \| -3.89% ↓ \| \| P99 TTFT (ms) \| 18,370.87 \| 17,487.94 \| -4.81% ↓ \| \| Mean TPOT (ms) \| 245.78 \| 240.14 \| -2.29% ↓ \| \| Median TPOT (ms) \| 264.03 \| 259.18 \| -1.84% ↓ \| \| P99 TPOT (ms) \| 334.38 \| 313.15 \| -6.35% ↓ \| \| Mean ITL (ms) \| 
246.99 \| 241.84 \| -2.09% ↓ \| \| Median ITL (ms) \| 117.71 \| 121.08 \| +2.86% ↑ \| \| P99 ITL (ms) \| 1,327.55 \| 1,470.33 \| +10.76% ↑ \| 🤖 AI Summary: - The most notable improvement is in P99 TPOT, which dropped -6.35% from 334.38ms → 313.15ms, indicating reduced tail latency for per-token generation under heavy load. - TTFT improved across all percentiles: mean dropped -3.20% (7,116ms → 6,889ms), median -3.89% (4,296ms → 4,129ms), and P99 -4.81% (18,371ms → 17,488ms), reflecting faster time-to-first-token across the board. - TPOT also improved consistently, with mean down -2.29% (245.78ms → 240.14ms) and median down -1.84% (264.03ms → 259.18ms), showing a modest but steady reduction in per-token generation time. - Throughput saw a slight uplift of roughly +1.6% across request, output token, and total token throughput. Peak output token throughput jumped +8.73% (2,475 → 2,691 tok/s), suggesting better burst handling capacity. - P99 ITL increased +10.76% (1,328ms → 1,470ms), the largest regression in the run. Median ITL also ticked up +2.86% (117.71ms → 121.08ms). These tail-latency spikes may reflect scheduling variability under peak concurrency and could be within run-to-run noise, but are worth monitoring. - Overall, the PR delivers a consistent improvement in both throughput and latency, with the caveat that P99 inter-token latency regressed — likely a transient effect given that mean ITL still improved by -2.09%. --- - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2026-03-23 15:24:26 +08:00
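The "Comparison" column in the benchmark table of 6b7d9b76f1 is a plain relative change. As a minimal sketch, the helper below (the name `percent_change` is hypothetical, not part of any benchmark tool) recomputes a few of the reported figures from the before/after values in that table.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (metric, before, after) copied from the benchmark table above
rows = [
    ("Request throughput (req/s)", 6.36, 6.46),
    ("P99 TPOT (ms)", 334.38, 313.15),
    ("P99 ITL (ms)", 1327.55, 1470.33),
]

for name, before, after in rows:
    print(f"{name}: {percent_change(before, after):+.2f}%")
```

For throughput rows a positive change is an improvement, while for latency rows a positive change is a regression, which is why the P99 ITL row is flagged in the summary.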
Shanshan Shen	5c0d02f689	[Bugfix] Fix multi-instance serving OOM on single card (#7427 ) ### What this PR does / why we need it? Fix https://github.com/vllm-project/vllm-ascend/issues/7308. Subtract `init_non_torch_memory` (which may already be in use by the first instance) from the total `non_torch_memory` when calculating `available_kv_cache_memory`. Directly use `non_torch_memory_increase` (contained in `non_kv_cache_memory`) to calculate `available_kv_cache_memory`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Launch two vllm-ascend instances sequentially on a single card. ```bash # Launch first instance vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B \ --port 8100 \ --host 0.0.0.0 \ --additional-config='{"enable_cpu_binding":true}' \ --gpu-memory-utilization 0.3 \ --max-num-seqs 1 \ --max-model-len 2048 \ --max-num-batched-tokens 2048 \ --no-enable-prefix-caching \ --enforce-eager # Launch second instance vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B \ --port 8101 \ --host 0.0.0.0 \ --additional-config='{"enable_cpu_binding":true}' \ --gpu-memory-utilization 0.3 \ --max-num-seqs 1 \ --max-model-len 2048 \ --max-num-batched-tokens 2048 \ --no-enable-prefix-caching \ --enforce-eager ``` Before this PR: ```bash # First instance: ------------------------------------------------------------------ requested_memory: 18.287109375 GiB non_kv_cache_memory: 1.2340388298034668 GiB init_non_torch_memory: 0.3616676330566406 GiB non_torch_memory_before_empty_cache: 0.3896217346191406 GiB non_torch_memory_increase: 0.0279541015625 GiB non_torch_memory_cleared_by_empty_cache: 0.3616676330566406 GiB ------------------------------------------------------------------ # Second instance: ------------------------------------------------------------------ requested_memory: 18.287109375 GiB non_kv_cache_memory: 1.2336344718933105 GiB init_non_torch_memory: 18.37220001220703 GiB non_torch_memory_before_empty_cache: 
18.399906158447266 GiB non_torch_memory_increase: 0.02754974365234375 GiB non_torch_memory_cleared_by_empty_cache: 18.372356414794922 GiB ------------------------------------------------------------------ # available_kv_cache_memory = requested_memory - non_kv_cache_memory - non_torch_memory_cleared_by_empty_cache Available KV cache memory: -1.32 GiB ``` After this PR: ```bash # First instance: ------------------------------------------------------------------ requested_memory: 18.287109375 GiB non_kv_cache_memory: 1.2340540885925293 GiB init_non_torch_memory: 0.36182403564453125 GiB non_torch_memory_before_empty_cache: 0.38979339599609375 GiB non_torch_memory_increase: 0.0279693603515625 GiB non_torch_memory_cleared_by_empty_cache: 0.0 GiB ------------------------------------------------------------------ # Second instance: ------------------------------------------------------------------ requested_memory: 18.287109375 GiB non_kv_cache_memory: 1.233344554901123 GiB init_non_torch_memory: 18.74309539794922 GiB non_torch_memory_before_empty_cache: 18.770355224609375 GiB non_torch_memory_increase: 0.02725982666015625 GiB non_torch_memory_cleared_by_empty_cache: 0.0 GiB ------------------------------------------------------------------ # available_kv_cache_memory = requested_memory - non_kv_cache_memory - non_torch_memory_cleared_by_empty_cache Available KV cache memory: 17.05 GiB ``` - vLLM version: v0.17.0 - vLLM main: `4497431df6` --------- Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>	2026-03-23 14:22:59 +08:00
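The log excerpts in 5c0d02f689 state the formula `available_kv_cache_memory = requested_memory - non_kv_cache_memory - non_torch_memory_cleared_by_empty_cache`. The short check below plugs in the second-instance values from those logs to show how the result goes negative before the fix and becomes usable after it. The function name `available_kv_cache_memory` is only an illustrative helper, not the actual worker code.

```python
def available_kv_cache_memory(requested, non_kv_cache, cleared_by_empty_cache):
    """Formula quoted in the log comment; all values in GiB."""
    return requested - non_kv_cache - cleared_by_empty_cache


# Second instance, before the fix: memory held by the first instance is
# counted as "cleared by empty_cache" and subtracted from the budget.
before = available_kv_cache_memory(
    18.287109375, 1.2336344718933105, 18.372356414794922
)

# Second instance, after the fix: the cleared term is 0.0.
after = available_kv_cache_memory(18.287109375, 1.233344554901123, 0.0)

print(f"before: {before:.2f} GiB")  # before: -1.32 GiB
print(f"after:  {after:.2f} GiB")   # after:  17.05 GiB
```

Both results match the `Available KV cache memory` lines reported in the logs.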
Canlin Guo	e68464a1d6	[Bugfix] Fix slow hasattr in ACLGraphWrapper.__getattr__ (#7442 ) ### What this PR does / why we need it? Follow https://github.com/vllm-project/vllm/pull/37425, https://github.com/vllm-project/vllm-omni/pull/1982 Copied from them: Notice that `hasattr(self.model, "flush_pending_metadata")` costs 6ms per decode step when profiling Qwen3 Omni. The original `CUDAGraphWrapper.__getattr__` raises: ```python raise AttributeError(f"... cudagraph wrapper: {self.runnable}") ``` When `hasattr()` is called for a non-existent attribute, Python internally calls `__getattr__`, which constructs this `AttributeError`. The `{self.runnable}` triggers `__repr__()` on the underlying model (e.g., `Qwen3OmniMoeForConditionalGeneration`), which recursively traverses the entire nn.Module tree to generate an 18,000+ character string. This takes ~6-7ms per call. Since `hasattr(self.model, "flush_pending_metadata")` is called every decode step in the Talker forward path, this adds ~6ms overhead per step, severely impacting audio inter-chunk latency (ICL). ```Python hasattr(self.model, "flush_pending_metadata") → getattr(self.model, "flush_pending_metadata") → not found in CUDAGraphWrapper.__dict__ → not found in the CUDAGraphWrapper class hierarchy → triggers CUDAGraphWrapper.__getattr__("flush_pending_metadata") → hasattr(self.runnable, "flush_pending_metadata") # runnable also doesn't have it → executes raise AttributeError(f"... {self.runnable}") → Python needs to construct the exception object → the f-string triggers self.runnable.__repr__() → Qwen3OmniMoeForConditionalGeneration.__repr__() → recursively traverses the entire nn.Module tree → generates an 18,000+ character string → takes ~6 ms → AttributeError object is created → hasattr catches the AttributeError and returns False → the 18,000-character string is immediately discarded (no one ever sees it) ``` ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? See https://github.com/vllm-project/vllm-omni/pull/1982 - vLLM version: v0.17.0 - vLLM main: `4497431df6` --------- Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2026-03-23 09:26:24 +08:00
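The `hasattr` cost described in e68464a1d6 can be reproduced without any model code. The sketch below uses stand-in classes (`BigModel`, `SlowWrapper`, `FastWrapper`, and `repr_calls_for` are hypothetical names, not the real wrapper classes) and counts how often `__repr__` runs during a failed `hasattr`. Formatting only the type name in the error message is one possible way to avoid the cost; it is shown as an assumption and is not necessarily what the PR does.

```python
class BigModel:
    """Stand-in for a large nn.Module whose __repr__ is expensive."""

    repr_calls = 0

    def __repr__(self):
        BigModel.repr_calls += 1
        return "BigModel(" + "layer, " * 1000 + ")"


class SlowWrapper:
    """Error message embeds the wrapped object, so __repr__ runs on every miss."""

    def __init__(self, runnable):
        self.runnable = runnable

    def __getattr__(self, key):
        # Only reached when normal attribute lookup on the wrapper fails.
        if hasattr(self.runnable, key):
            return getattr(self.runnable, key)
        raise AttributeError(f"Attribute {key} not found in wrapper: {self.runnable}")


class FastWrapper(SlowWrapper):
    """Same lookup, but the message formats only the cheap type name."""

    def __getattr__(self, key):
        if hasattr(self.runnable, key):
            return getattr(self.runnable, key)
        raise AttributeError(
            f"Attribute {key} not found in wrapper of {type(self.runnable).__name__}"
        )


def repr_calls_for(wrapper_cls):
    """Run one failed hasattr() and report (result, number of __repr__ calls)."""
    BigModel.repr_calls = 0
    found = hasattr(wrapper_cls(BigModel()), "flush_pending_metadata")
    return found, BigModel.repr_calls


print(repr_calls_for(SlowWrapper))  # (False, 1)
print(repr_calls_for(FastWrapper))  # (False, 0)
```

Both wrappers return `False` from `hasattr`, but only the first one pays for building a string that is discarded immediately.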
Qi Mao	9bf9b4b267	[Feature] Optimize Qwen3.5/Qwen3Next GDN prefill by prebuilding chunk metadata (#7487 ) ### What this PR does / why we need it? This PR optimizes the Qwen3.5 and Qwen3Next GDN prefill path on Ascend by reducing host/device synchronization overhead. The current implementation of the `chunk_gated_delta_rule` path for variable-length sequences prepares chunk metadata during the forward pass. This approach triggers frequent CPU intervention and host/device round-trips. When running prefill-heavy workloads with asynchronous scheduling enabled, these synchronizations result in execution "bubbles" and prefill stalling (stuttering). Note that this does not cause asynchronous scheduling to fail; rather, it prevents the system from reaching its theoretical throughput due to these unnecessary stalls. To resolve this, the patch moves metadata preparation out of the hot path: - Prebuilt Metadata: All non-speculative varlen chunk metadata for GDN is now prebuilt on the CPU. - Asynchronous Transfer: Staging buffers are kept in pinned memory and transferred to the NPU asynchronously. - Integration: The prebuilt bundle is attached to GDN attention metadata via `patch_gdn_attn.py` and passed into Triton wrappers. - Backward Compatibility: Triton wrappers fall back to the legacy preparation path if no prebuilt metadata is provided. - vLLM version: v0.17.0 - vLLM main: `8b6325758c` --------- Signed-off-by: maoxx241 <maomaoyu870@gmail.com>	2026-03-22 23:09:23 +08:00
LoganJane	b2e71b7930	[Bugfix] Fix get_rope_shape for Kimi-K2.5 (#7521 ) ### What this PR does / why we need it? Remove the logic that moved the input of `get_rope_shape` from device to host. - vLLM version: v0.17.0 - vLLM main: `8b6325758c` Signed-off-by: LoganJane <loganJane73@hotmail.com>	2026-03-22 21:06:31 +08:00
Cao Yi	9e2965bae2	[Feature] Support Flash Comm V1 for VL models (with MLA) (#7390 ) ## Summary Flash Comm V1 (flashcomm1) was previously blocked for all VL models. Root cause: For VL models, `inputs_embeds` at layer 0 originates from the vision encoder as a full `[N, H]` tensor — it has not been reduce-scattered across TP ranks. The original MLA forward path assumed inputs were already scattered, producing wrong output shapes under TP > 1. Fix: - Detect at init time (statically, not via runtime shape checks) whether a layer is the first layer of a VL model (`is_vl_first_layer`) so dynamo treats the branch as a constant. - In `AscendMultiHeadLatentAttention.forward`, when `flashcomm1 + TP > 1 + is_vl_first_layer`, set `need_gather_q_kv=False` and pre-allocate output as `[N//tp_size, H]`. - Remove the platform-level assertion that prevented VL models from enabling Flash Comm V1. Other improvements: - `is_vl_model()` now uses vllm's canonical detection (`hf_config is not hf_text_config`) instead of fragile key-name checks, with the old checks kept as fallback. - Added `parse_layer_idx(prefix)` utility. - Added `maybe_chunk_residual` call in `AscendRMSNorm` before the add-rms-norm op. - Removed unnecessary CPU/fp32 round-trip in `AscendLearnable2DInterpPosEmbDivided_fixed.forward()`. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Co-authored-by: LoganJane <loganJane73@hotmail.com>	2026-03-22 21:05:28 +08:00
Qi Mao	9d0b7c8e98	[Platform][BugFix] Preserve hybrid block size on Ascend (#7528 ) ### What this PR does / why we need it This PR fixes a startup regression for Ascend hybrid attention + mamba models after upgrading to vLLM `0.18.0`. However, after the vLLM `0.18.0` upgrade, worker initialization still calls the generic platform hook: - `current_platform.update_block_size_for_backend(vllm_config)` ### How this PR fixes it This PR keeps the fix strictly inside `vllm-ascend`. It adds an Ascend override for `NPUPlatform.update_block_size_for_backend()`: - for hybrid models, do not run the generic upstream block-size fallback - preserve the block size that was already computed by the hybrid model-specific config logic - for non-hybrid models, keep the original upstream behavior unchanged - vLLM version: v0.18.0 - vLLM main: `8b6325758c` --------- Signed-off-by: maoxx241 <maomaoyu870@gmail.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2026-03-22 11:21:49 +08:00
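The override pattern described in 9d0b7c8e98 can be sketched as follows. This is a toy illustration only: `BasePlatform`, `AscendLikePlatform`, the dict-based config, the `is_hybrid` flag, and the fallback value 16 are all assumptions made for the example and do not reflect the real `NPUPlatform` or `VllmConfig` classes. Only the method name `update_block_size_for_backend` comes from the commit message.

```python
class BasePlatform:
    """Stand-in for the generic upstream platform hook."""

    @classmethod
    def update_block_size_for_backend(cls, config):
        # Generic fallback: overwrite the block size with a backend default.
        config["block_size"] = 16  # hypothetical default value


class AscendLikePlatform(BasePlatform):
    """Stand-in for a platform override that protects hybrid models."""

    @classmethod
    def update_block_size_for_backend(cls, config):
        if config.get("is_hybrid"):
            # Hybrid attention + mamba model: keep the block size already
            # computed by the model-specific config logic.
            return
        # Non-hybrid models keep the upstream behavior unchanged.
        super().update_block_size_for_backend(config)


hybrid_cfg = {"is_hybrid": True, "block_size": 1024}
dense_cfg = {"is_hybrid": False, "block_size": 128}

AscendLikePlatform.update_block_size_for_backend(hybrid_cfg)
AscendLikePlatform.update_block_size_for_backend(dense_cfg)

print(hybrid_cfg["block_size"], dense_cfg["block_size"])  # 1024 16
```

Keeping the change inside the platform subclass means the generic hook can still be called unconditionally during worker initialization.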
XiaoxinWang	cbf46fad3c	fixed graph mode bug. (#7460 ) ### What this PR does / why we need it? In fulldecodeonly mode, num_req_padded was set to an incorrect value, causing accuracy degradation in Qwen3-Next. Therefore, we added a check for compilation_config.cudagraph_mode to the conditional logic, ensuring that padding is applied only in FULL mode. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `8a680463fa` Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2026-03-22 10:09:37 +08:00
Zetong Li	84a74f0cb1	[Bugfix] Fix padding logic in eagle proposer for kimi25 (#7348 ) ### What this PR does / why we need it? This PR aims to fix padding logic in eagle proposer for kimi25. Main changes involve: 1. modify the way to obtain draft model attention builder and backend 2. add block table padding & related tensor slicing in common metadata when `draft_step>1` for solving fia verifying error 3. replace block table in `update_graph_params` for solving fia verifying error - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: Zetong Li <slippersss@126.com>	2026-03-21 16:57:22 +08:00
meihanc	bff4fbfca5	upgrade to 0.18.0 (#7502 ) ### What this PR does / why we need it? 1. upgrade to 0.18.0 2. ensure kernel_block_sizes is int for Eagle drafter ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `8b6325758c` --------- Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-03-21 16:05:38 +08:00
HongtaoYang	80a4265717	[Feat] Support separate attention backend for target and draft model. (#7342 ) ### What this PR does / why we need it? This PR enables separate attention backend configuration for target and draft models in speculative decoding, decoupling the previously bound attention backend settings between the two models. It solves the compatibility issue where some draft models do not support the attention backend used by the target model, and allows users to select the optimal attention backend for each model individually to maximize inference performance. The change is fully backward compatible. --------- Signed-off-by: SidaoY <1024863041@qq.com>	2026-03-21 10:48:01 +08:00
linfeng-yuan	88d03a783f	[refactor] replace scattered business kwargs with typed request objects and explicit stage boundaries (#7024 ) ### What this PR does / why we need it? Refactor `vllm_ascend/ops/fused_moe` to replace scattered MoE business `**kwargs` with typed request objects and explicit stage boundaries. - Prepare, dispatch, MLP, and quant stages now have clearer ownership. - Main MoE path no longer depends on business `kwargs.get(...)` lookups. - Comm and dispatcher interfaces are request-only on the main path. - UTs can assert stage-level fields directly instead of inferring behavior indirectly. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed. --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-03-20 23:23:57 +08:00
yesyue-w	c860535246	【A5】【Qwen VL】Qwen VL adapt for A5 (#7046 ) ### What this PR does / why we need it? Replace the '_npu_flash_attention_unpad' operator with the 'npu_fusion_attention' operator to ensure that the Qwen VL model can run in the A5 environment and remove the 'mrope' operator call restriction for A5. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: 汪越 <wangyue361@h-partners.com>	2026-03-20 16:56:12 +08:00
idouba	f39f566e22	Refactor duplicated code into a common method to reduce redundancy (#7210 ) ### What this PR does / why we need it? 1. Extracting duplicated code into a method. That is defining _get_input_parallel_ in parent class _CustomRowParallelOp_, and call the helper method in its 5 children classes : - MLPRowParallelOp - OProjRowParallelOp - Flashcomm2OProjRowParallelOp - MatmulAllreduceRowParallelOp - SequenceRowParallelOp 's _apply_impl_ method 2. Variable typo fixing: split instead of splitted for the past tense ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: idouba <zhangchaomeng@huawei.com>	2026-03-20 16:49:02 +08:00
Siyuan Kong	a16c99141b	Adapt w8a8mxfp8 quantization for Qwen VL models (#7417 ) ### What this PR does / why we need it? This PR adapts the `w8a8_mxfp8` quantization method to support Qwen Vision-Language (VL) models. Key changes include: - Reshaping multi-dimensional input tensors to 2D before the quantized matrix multiplication. - Reshaping the 2D output back to its original multi-dimensional format. - Adding specific output reshaping for the visual components of Qwen VL models. - Casting the bias tensor to `float32` to comply with the `npu_quant_matmul` kernel requirements. These changes are necessary to enable `w8a8_mxfp8` quantization for models with multi-modal inputs like Qwen VL. ### Does this PR introduce _any_ user-facing change? No, this is a backend enhancement to extend quantization support to new model architectures. There are no user-facing API or behavior changes. ### How was this patch tested? CI is expected to pass. Manual testing should be performed with a Qwen VL model using `w8a8_mxfp8` quantization to verify correctness and performance. - vLLM version: v0.17.0 - vLLM main: `4497431df6` --------- Signed-off-by: ksiyuan <ksiyuan@umich.edu>	2026-03-20 16:18:58 +08:00
LI SHENGYONG	4e6dbe0956	[EPLB][Bugfix] Set parallel_config.enable_eplb to true to load redundant experts (#7470 ) ### What this PR does / why we need it? PR https://github.com/vllm-project/vllm/pull/37136 broke EPLB because it filters out redundant experts. PR https://github.com/vllm-project/vllm/pull/37322 fixed this by using parallel_config.enable_eplb to determine whether to skip the weight loading filter. But in vllm-ascend, parallel_config.enable_eplb is always false, so when EPLB is in use we temporarily set it to true. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ![Snipaste_2026-03-19_16-13-01](https://github.com/user-attachments/assets/b3a4911e-36b3-4c31-951c-7c091f416d00) \| dataset \| version \| metric \| mode \| vllm-api-stream-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-03-20 15:22:55 +08:00
LI SHENGYONG	1e05c4908f	[EPLB] Reduce the memory used for batch_isend_irecv (#7344 ) ### What this PR does / why we need it? #6729 seems to reduce the NPU memory usage of eplb, but actually moves the buffer allocation of dist.all_gather_into_tensor to dist.batch_isend_irecv. Therefore, the overall NPU memory usage is not reduced. This PR completely reduces the memory usage in this part. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Remaining memory of each rank before the repair. <img width="649" height="99" alt="image" src="https://github.com/user-attachments/assets/52a67592-e0e8-4f9a-b194-b84cb848c598" /> Remaining memory of each rank after the repair. <img width="641" height="99" alt="image" src="https://github.com/user-attachments/assets/0bc2e67c-f328-4dea-98af-d7a459fb4876" /> Close EPLB. <img width="543" height="45" alt="image" src="https://github.com/user-attachments/assets/6dcba19d-4401-44b8-a6d3-c7b35ee983c7" /> Memory of weights for each rank. <img width="648" height="46" alt="image" src="https://github.com/user-attachments/assets/4db2fd04-98a0-4d26-a026-2e8287102b99" /> Estimated memory for EPLB: 15.68 / 48 (layer_num) + 2 * 0.02 = 0.35 GB - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-03-20 12:25:58 +08:00
SILONG ZENG	eb92e7d50e	[Bugfix] Restore balance scheduling patch for v0.17.0 (#7479 ) ### What this PR does / why we need it? Restore previously introduced patches： - https://github.com/vllm-project/vllm-ascend/pull/5212 - vLLM version: v0.17.0 - vLLM main: `8b6325758c` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-03-19 20:12:57 +08:00
ichaoren	9d1452c74d	[OPS]add split_qkv_tp_rmsnorm_rope ops (#7376 ) ### What this PR does / why we need it? This PR introduces a new fused Triton kernel, `split_qkv_tp_rmsnorm_rope` for Minimax-m2.5. The implementation includes two Triton kernels: 1. `_split_qkv_and_compute_local_qk_var_kernel`: Splits the QKV input and computes the local variance for RMSNorm. 2. `_apply_global_rmsnorm_kernel`: Applies global RMSNorm (considering TP all-reduce for variance) and Neox-style RoPE. ### Does this PR introduce _any_ user-facing change? Does not. ### How was this patch tested? ```python pytest tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_split_qkv_tp_rmsnorm_rope.py ``` ### Test Data A3 TP16 Baseline \| data \| TTFT(ms) \| TPOT(ms) \| TPS \| \|------------\|---------:\|---------:\|-------:\| \| 4k/1k@bs1 \| 267.55 \| 25.5 \| 38.85 \| \| 4k/1k@bs4 \| 542.4 \| 26.51 \| 148.06 \| Test \| data \| TTFT(ms) \| TPOT(ms) \| TPS \| \|------------\|---------:\|---------:\|-------:\| \| 4k/1k@bs1 \| 234.64 \| 20.96 \| 47.24 \| \| 4k/1k@bs4 \| 508.36 \| 22.16 \| 176.69 \| - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: xutianyi <xutianyi5@huawei.com> Co-authored-by: xutianyi <xutianyi5@huawei.com>	2026-03-19 17:19:18 +08:00
Li Wang	83a4065b4b	[CI] Add pre-commit check for patch logger (#7446 ) ### What this PR does / why we need it? See https://github.com/vllm-project/vllm-ascend/pull/7402, pre-commit hook will forbid init_logger(__name__) in vllm_ascend patch modules - vLLM version: v0.17.0 - vLLM main: `8a680463fa` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-03-19 16:53:20 +08:00
Feng-xiaosuo	38e637eef5	Fix manual mapping registration and kimi_k2 layer name mapping (#7347 ) ### What this PR does / why we need it? This PR fixes the layer name mapping logic in `AscendModelSlimConfig` for quantization config loading. 1. kimi_k2 model layer name mapping issue: The `kimi_k2` model has a unique layer naming convention that differs from the standard `hf_to_vllm` mapping. One layer was defined in the mapper but was not being correctly applied, causing quantization config lookup failures. 2. Manual mapping registration timing issue: The manual mapping check in `apply_vllm_mapper` was executed before `vllm_config` was initialized, causing `model_type` to be unavailable. This prevented some models with manual mappings from being correctly registered. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Tested with `kimi_k2` model to verify the special layer name mapping works correctly. Also tested with other models that have manual mappings defined in `QUANT_MODEL_PREFIX_MAPPINGS` to ensure the registration timing fix works properly. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Matrix_K <zhangke144@huawei.com> Signed-off-by: Feng-xiaosuo <tengchang1@huawei.com> Co-authored-by: Matrix_K <zhangke144@huawei.com> Co-authored-by: Wang Kunpeng <1289706727@qq.com>	2026-03-19 16:46:41 +08:00
chenxi-hh	42bcad7e9b	GMM custom operator optimization in small batch scenarios (#7100 ) ### What this PR does / why we need it? GMM custom operator optimization in small batch scenarios ### How was this patch tested? Qwen3-30B input: 4k, output: 1k batch 1： TPOT 7.9 ms -> 7.0 ms Output Token Throughput 125.4651 token/s -> 140.6278 token/s batch 2： TPOT 9.4 ms -> 8.8 ms Output Token Throughput 211.8187 token/s -> 225.2254 token/s batch 16： TPOT 13.6 ms -> 13.5 ms Output Token Throughput 1159.8213 token/s -> 1165.0982 token/s - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: chenxi-hh <chen464822955@163.com>	2026-03-19 16:10:30 +08:00
wangxiyuan	8e0ebb470a	[Misc] Drop Prefetch MLP Env (#7357 ) ### What this PR does / why we need it? remove deprecated environment variables related to MLP prefetching ### Does this PR introduce _any_ user-facing change? yes, the deprecated env vars can not be used then. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-19 14:27:27 +08:00
pu-zhe	e8f7b2e3f1	[Refactor] [310p] Support Mamba Cache and support attn_head_size larger than 128 (#7372 ) ### What this PR does / why we need it? 1. Mamba Cache Support on 310P: Implemented logic to correctly initialize and allocate KV cache for Mamba models on the 310P platform, including handling of state tensors and page size alignment. 2. Increased Attention Head Size Support: Modified the attention backend to support attn_head_size larger than 128 by dynamically selecting appropriate kernel block sizes based on hardware limitations (e.g., block_size * head_size <= 16384). 3. Refactored KV Cache Allocation: Consolidated and improved the KV cache allocation mechanism, moving from separate size calculation and allocation steps to a unified _allocate_kv_cache_tensors method that handles both Attention and Mamba specific cache structures. 4. Dynamic Mamba Config Patching: Introduced conditional loading of Mamba configuration patches, specifically using patch_mamba_config_310 for the 310P platform to ensure platform-specific optimizations and validations. 5. Reserve reasonable memory to allocate KV cache to avoid OOM issue with default gpu_memory_utilization. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Qwen3.5 E2E test - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-03-19 09:16:22 +08:00
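As a rough illustration of the block-size selection mentioned in #7372, the sketch below picks the largest candidate block size that satisfies the quoted limit `block_size * head_size <= 16384`. The function name `select_kernel_block_size` and the candidate list are assumptions made for illustration; the commit message does not state which sizes the real backend considers.

```python
# Limit quoted in the commit message: block_size * head_size <= 16384.
MAX_BLOCK_TIMES_HEAD = 16384


def select_kernel_block_size(head_size: int,
                             candidates=(128, 64, 32, 16)) -> int:
    """Return the largest candidate block size within the limit.

    The candidate list is a hypothetical choice, not taken from the source.
    """
    for block_size in sorted(candidates, reverse=True):
        if block_size * head_size <= MAX_BLOCK_TIMES_HEAD:
            return block_size
    raise ValueError(
        f"no candidate block size fits head_size={head_size}")


size_for_128 = select_kernel_block_size(128)
size_for_256 = select_kernel_block_size(256)
size_for_576 = select_kernel_block_size(576)
```

Under these assumed candidates, a head size of 128 still allows a block size of 128, while larger head sizes step down to smaller blocks.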
Nengjun Ma	8b79d4de52	Main2main upgrade to vllm 0317 afternoon (#7409 ) ### What this PR does / why we need it? 1.fix "TypeError: get_attn_backend() remove variable": [Refactor `check_and_update_config`](https://github.com/vllm-project/vllm/pull/35122) 2.fix [Rename `compile_ranges_split_points` to `compile_ranges_endpoints`](https://github.com/vllm-project/vllm/pull/36027) 3.fix "RuntimeError: device_allocator not a DeviceAllocator":[Replace memory related torch.cuda APIs"](https://github.com/vllm-project/vllm/pull/37031) 4.fix [Support multiple KV groups in OffloadingSpec ](https://github.com/vllm-project/vllm/pull/36610) removed self.offloaded_block_size and changed self.gpu_block_size from a scalar to a tuple of per-group block sizes, adding block_size_factor. 5.fix [Consolidate SupportsEagle](https://github.com/vllm-project/vllm/pull/36063) renamed get_eagle3_aux_hidden_state_layers() to get_eagle3_default_aux_hidden_state_layers() and added a supports_eagle3() guard before calling it. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? E2E - vLLM version: v0.17.0 - vLLM main: `8a680463fa` --------- Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: Claude Code <noreply@anthropic.com>	2026-03-18 23:24:27 +08:00
jiangmengyu18	305820f1a9	[Bugfix] fix bug about model type of qwen3_vl_8b_instruct_w8a8 (#7383 ) ### What this PR does / why we need it? Adapt to the model type of Qwen3-VL-8B-Instruct-W8A8 - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: betta18 <jiangmengyu1@huawei.com> Co-authored-by: betta18 <jiangmengyu1@huawei.com>	2026-03-18 20:30:03 +08:00
Angazenn	ec34bf0062	[Misc]fix logger which does not take effects in patches (#7402 ) ### What this PR does / why we need it? This PR fixes the logger initialization in patches so that the log info can be displayed as expected. ### Does this PR introduce _any_ user-facing change? No. - vLLM version: v0.17.0 - vLLM main: `4497431df6` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-03-18 17:13:12 +08:00
rjg-lyh	c1392a6ce6	[bugfix][accuracy] Fix ds indexer accuracy problem caused by k rope (#7341 ) ### What this PR does / why we need it? The rotary algorithm in deepseek indexer should be neox-style instead of gptj style. PR #4641 fix this accuracy bug in original pytorch version. But PR #5701 accidentally removed the fixed code line and reverted the implementation back to the problematic version. This PR fixes it. Signed-off-by: rjg-lyh <1318825571@qq.com>	2026-03-18 14:20:21 +08:00
wangxiaoteng888	c7157af8f7	[P/D] LayerwiseConnector supports the virtual push functionality on node D. (#7361 ) ### What this PR does / why we need it? LayerwiseConnector supports the virtual push functionality on node D.By adding a do_virtual flag to request metadata, the system can now identify and process certain requests virtually, bypassing the actual KV cache transfer process. This allows for immediate completion of these requests from the consumer's perspective, potentially enabling optimizations or specific testing scenarios where physical data transfer is not required. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-03-18 10:50:02 +08:00
zhangyiming	1c954ff264	[main2main] upgrade vllm to 0308 (#7213 ) ### What this PR does / why we need it? Update main2main to vllm 0308. breaks: * https://github.com/vllm-project/vllm/pull/30681 * https://github.com/vllm-project/vllm/pull/35552 remove self.cudagraph_batch_sizes * https://github.com/vllm-project/vllm/pull/35158 clear_metadata -> defer_finalize * https://github.com/vllm-project/vllm/pull/36006 remove CacheConfig.cpu_offload_gb * https://github.com/vllm-project/vllm/pull/35472 * https://github.com/vllm-project/vllm/pull/34552 attn_metadata_builder * https://github.com/vllm-project/vllm/pull/30515 profile_seq_lens * https://github.com/vllm-project/vllm/pull/28053 - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: menogrey <1299267905@qq.com> Co-authored-by: MrZ20 <2609716663@qq.com>	2026-03-18 09:24:43 +08:00
Chao Lei	d9ac7e8539	[Bugfix] Assertion error when decode prefix cache fully hits (#7236 ) ### What this PR does / why we need it? #### Problem When decode node enables prefix cache and the local prefix cache fully hits, the following assertion error occurs: ``` (EngineCore_DP3 pid=34912) File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 520, in step_with_batch_queue (EngineCore_DP3 pid=34912) engine_core_outputs = self.scheduler.update_from_output( (EngineCore_DP3 pid=34912) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP3 pid=34912) File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/core/sched/scheduler.py", line 1520, in update_from_output (EngineCore_DP3 pid=34912) self._update_from_kv_xfer_finished(kv_connector_output) (EngineCore_DP3 pid=34912) File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/core/sched/scheduler.py", line 2120, in _update_from_kv_xfer_finished (EngineCore_DP3 pid=34912) assert RequestStatus.is_finished(req.status) (EngineCore_DP3 pid=34912) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP3 pid=34912) AssertionError ``` The error is triggered in scheduler.py at _update_from_kv_xfer_finished: ``` if req.status == RequestStatus.WAITING_FOR_REMOTE_KVS: self.finished_recving_kv_req_ids.add(req_id) else: assert RequestStatus.is_finished(req.status) ``` #### Root Cause When decode node has prefix cache enabled and local prefix cache fully hits: 1. get_num_new_matched_tokens returns ext_tokens=0, load_kv_async=False when decode prefix cache fully hits 2. Request status becomes RUNNING (not WAITING_FOR_REMOTE_KVS) 3. However, update_state_after_alloc still adds the request to _reqs_need_recv because remote_block_ids exists in kv_transfer_params 4. Worker processes the request in _handle_request: - _transfer_kv_cache returns immediately (no actual transfer, local_block_ids is empty) - finally block still calls update_done_task_count(request_id) 5. finished_recving contains this request 6. When _update_from_kv_xfer_finished processes finished_recving, request status is RUNNING 7. Assertion fails #### Solution In _handle_request, only notify scheduler (update_done_task_count) when actual KV transfer happened (local_block_ids is not empty). The signals to notify Prefill to release KVCache (_send_done_signal_to_free_remote_port and _send_done_recv_signal) are still sent regardless. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: LCAIZJ <leichao139636@163.com>	2026-03-17 15:17:45 +00:00
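The fix described in #7236 can be sketched as follows. This is a simplified model under stated assumptions, not the connector's actual code: `TransferWorkerSketch`, `handle_request`, and the two recording lists are hypothetical, standing in for `_handle_request`, `update_done_task_count`, and the release signals named in the commit message.

```python
class TransferWorkerSketch:
    """Hypothetical stand-in for the decode-side KV transfer worker."""

    def __init__(self) -> None:
        self.done_task_ids = []    # models update_done_task_count calls
        self.release_signals = []  # models the signals sent to Prefill

    def _transfer_kv_cache(self, local_block_ids) -> None:
        # With no local blocks to fill (prefix cache fully hit),
        # there is nothing to transfer and the call returns at once.
        if not local_block_ids:
            return
        # The real KV transfer would happen here.

    def handle_request(self, request_id, local_block_ids) -> None:
        try:
            self._transfer_kv_cache(local_block_ids)
        finally:
            # Prefill is always told it may release its KV cache.
            self.release_signals.append(request_id)
            # The scheduler is notified only when a transfer happened,
            # so a RUNNING request never lands in finished_recving.
            if local_block_ids:
                self.done_task_ids.append(request_id)


worker = TransferWorkerSketch()
worker.handle_request("full-hit", [])
worker.handle_request("partial-hit", [3, 4])
```

Here the fully cached request triggers only the release signal, while the request with real blocks to receive is also reported to the scheduler.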


1665 Commits