Commit Graph

2716 Commits

Author SHA1 Message Date
LeeWenquan
9615bc33fd Fix Qwen3Next CI Config (#7561)
### What this PR does / why we need it?
This PR modifies the Qwen3Next nightly CI config.
(1) Add a nightly CI.
(2) Set a more precise accuracy standard.

- vLLM version: v0.18.0
- vLLM main:
6a9cceb219

Signed-off-by: Your Name <you@example.com>
Co-authored-by: Your Name <you@example.com>
2026-03-24 17:08:17 +08:00
panchao-hub
d98a0727c8 [Feat] Add npugraph_ex enablement logging (#7574)
### What this PR does / why we need it?

- Replace local logging with vllm.logger for consistency
- Add info log when enable_npugraph_ex is enabled
- Add info log when enable_static_kernel is enabled
- Unify logging message format to use config switch names consistently
- This helps users understand which compilation optimizations are active
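
A minimal sketch of the logging pattern described above, assuming a config object that exposes the two boolean switches (the function name and attribute access are illustrative, not the actual code):

```python
from vllm.logger import init_logger

logger = init_logger(__name__)

def log_compilation_switches(ascend_config) -> None:
    # Emit an info log for each compilation switch that is turned on, using
    # the config switch name so the message matches the documented option.
    if getattr(ascend_config, "enable_npugraph_ex", False):
        logger.info("enable_npugraph_ex is enabled")
    if getattr(ascend_config, "enable_static_kernel", False):
        logger.info("enable_static_kernel is enabled")
```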

### Does this PR introduce _any_ user-facing change?

Yes. Users will now see informational log messages when
enable_npugraph_ex or enable_static_kernel features are enabled,
providing better visibility into the compilation optimization settings
being used.

### How was this patch tested?

- Code passes all pre-commit hooks (ruff check, ruff format, codespell,
typos)
- Follows project coding conventions and style guidelines
- Logger import matches the pattern used elsewhere in the codebase

Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
2026-03-24 17:04:48 +08:00
Angazenn
bdb65319a9 [UT] Align input arguments with Ascend(Yarn)RotaryEmbedding with vLLM and add ut (#7358)
### What this PR does / why we need it?
This PR adds missing arguments in `AscendRotaryEmbedding` and
`AscendYarnRotaryEmbedding` to conform with vLLM. In addition, the
corresponding UTs are introduced.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Angazenn <supperccell@163.com>
2026-03-24 16:02:56 +08:00
liziyu
568b6d0601 [P/D] Check wildcard address for layerwise connector (#7389)
### What this PR does / why we need it?
Check wildcard address for layerwise connector.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-03-24 15:50:06 +08:00
liziyu
73cadecfb4 [P/D] [Bugfix] fix mooncake layerconnector dead when update_decoder_info fail (#7514)
### What this PR does / why we need it?
Fix the mooncake layerwise connector becoming dead when update_decoder_info
fails. For the scenario where node D is dead, node P failing to
update_decoder_info should not cause node P to become dead.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
by CI

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-03-24 15:49:46 +08:00
zxr2333
67aad1fce8 [BugFix][P/D] fix padding error on FullGraph mode && fix layerwise connector mamba accuracy (#7506)
### What this PR does / why we need it?
1. When FullGraph mode is used, the branches in the Triton operator are
compiled and fixed during graph capture, so the branch condition in the
`fused_recurrent_gated_delta_rule` operator, which checks whether
`ssm_state_indices >= 0` before writing to the SSM cache, becomes invalid:
the write is performed regardless of the value. Because the vLLM GDN backend
pads with -1, the operator then computes address offsets and writes to the
SSM cache based on a -1 offset. Since the conv cache and the SSM cache in
the vLLM Ascend implementation are actually a single contiguous tensor split
into two parts, this overwrites data and produces NaN values.
This PR addresses the two cases where -1 padding is required in the GDN
metadata builder and replaces the padding with 0 in both, avoiding the
memory overwrite because block 0 is a reserved block.
2. Fix a layerwise connector bug for mamba cache sending on heterogeneous
TP.
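
A minimal sketch of the padding replacement described above, with hypothetical tensor names (the actual GDN metadata builder fields differ):

```python
import torch

def pad_state_indices(ssm_state_indices: torch.Tensor) -> torch.Tensor:
    # Replace -1 padding with block 0 (a reserved block) so that a captured
    # graph writing unconditionally never offsets by -1 into the SSM cache.
    return torch.where(
        ssm_state_indices < 0,
        torch.zeros_like(ssm_state_indices),
        ssm_state_indices,
    )
```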

- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-03-24 15:15:55 +08:00
LeeWenquan
475b4b0cea Revert "GMM custom operator optimization in small batch scenarios (vllm-project#7100)" (#7557)
### What this PR does / why we need it?
This reverts commit 42bcad7e9b. That commit caused an accuracy decrease for
Qwen3Next: on 150 GSM8K items, the score dropped from 98 to 91.

- vLLM version: v0.18.0
- vLLM main:
6a9cceb219

Signed-off-by: Your Name <you@example.com>
Co-authored-by: Your Name <you@example.com>
2026-03-24 14:24:44 +08:00
Shaoxu Cheng
83bd77c983 [310p]: add rmsnorm gated fallback and unit test (#7424)
### What this PR does / why we need it?
RFC #7394
310P cannot use the fused `rmsnormgated` operator and must fall back to
the native implementation.

### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
ut
- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-24 09:00:11 +08:00
jiaojiao
1de805ce0a [Ops][Misc] Refactor and optimize CausalConv1d for Ascend (#7495)
### What this PR does / why we need it?
During the prefill phase of Qwen3-Next and Qwen3.5, the
`torch.ops._C_ascend.causal_conv1d_fn` operator exhibits significant
performance bottlenecks. To address this, we have re-implemented the
optimization using `torch.ops._C_ascend.npu_causal_conv1d_custom`.

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
1. Accuracy test
```
[2026-03-20 16:44:22,961] [ais_bench] [INFO] Start launch task state board ...
+-----------------------------+-----------+------------+-------------+----------+-------------------------------------------+---------------------+
| Task Name                   |   Process | Progress   | Time Cost   | Status   | Log Path                                  | Extend Parameters   |
+=============================+===========+============+=============+==========+===========================================+=====================+
| vllm-api-general-chat/gsm8k |   2918978 | NA         | 0:00:01     | finish   | logs/eval/vllm-api-general-chat/gsm8k.out | None                |
+-----------------------------+-----------+------------+-------------+----------+-------------------------------------------+---------------------+
[2026-03-20 16:44:34,284] [ais_bench] [INFO] Evaluation tasks completed.
[2026-03-20 16:44:34,287] [ais_bench] [INFO] Summarizing evaluation results...
dataset    version    metric    mode      vllm-api-general-chat
---------  ---------  --------  ------  -----------------------
gsm8k      271d0b     accuracy  gen                       96.21
```
2. Modified UT test
`pytest -sv
/home/c30006096/vllm-ascend/tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_causal_conv1d.py::test_ascend_causal_conv1d`

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

Signed-off-by: wenba0 <3054239545@qq.com>
Signed-off-by: jiaojiao <56385650+wenba0@users.noreply.github.com>
2026-03-24 00:07:12 +08:00
ZhuQi-seu
e942b62d74 [features]support split qkv rmsnorm rmope for qwen3.5 (#7368)
### What this PR does / why we need it?
Qwen3.5 full attention supports enabling the split_qkv_rmsnorm_mrope
fusion operator.

### How was this patch tested?
vLLM version: v0.16.0
vLLM-Ascend main: https://github.com/vllm-project/vllm-ascend/pull/6730
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: ZhuQi-seu <zhuqi12@huawei.com>
2026-03-23 23:58:12 +08:00
Nengjun Ma
8e0789bb36 [CI] Recover pd disaggregated encoder test case that been incorrectly skipped (#7505)
### What this PR does / why we need it?
[CI] Recover the pd disaggregated encoder test case that was incorrectly
skipped in PR: https://github.com/vllm-project/vllm-ascend/pull/7412

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-03-23 21:41:28 +08:00
Nengjun Ma
fcba91a392 Main2main Upgrade vllm commit to 0320 17:00 (#7510)
### What this PR does / why we need it?
Main2main: upgrade the vLLM commit to 0320 17:00.

1. Fix for vLLM refactoring `_moe_forward` to call
`runner.forward_impl_chunked()` when `runner.use_dp_chunking` is True.
vLLM PR: "[MoE Refactor] DefaultMoERunner simplification"
[#33049](https://github.com/vllm-project/vllm/pull/33049)

2. Fix for vLLM moving the call to `self._set_compile_ranges()` in
`VllmConfig.__post_init__` from **before** `check_and_update_config()`
to **after** it (to allow platforms to lower `max_num_batched_tokens`
first). vLLM PR: "fix(xpu): Re-compute compile ranges after
platform-specific config updates"
[#37523](https://github.com/vllm-project/vllm/pull/37523)


### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
2026-03-23 21:37:41 +08:00
weijinqian0
bdd90c0088 [model_runner_v2]optimize the performance of the post_update. (#7496)
### What this PR does / why we need it?
- This PR aims to enhance the operator performance in the `post_update`
phase of `model_runner_v2` on NPUs. By optimizing the relevant
operations, it is expected to improve the overall efficiency and speed
of the model running on NPU hardware, which is crucial for scenarios
where high-performance inference is required.
- With bs = 256, the time cost drops from 26 us to 11 us.

### Does this PR introduce _any_ user-facing change?
No, there are no changes to the API, interface, or other high-level
behaviors that would directly affect the user's code or interaction with
the system beyond the performance improvement.

### How was this patch tested?
CI passed with new added/existing tests. In addition to the regular CI
tests, specific benchmark tests were conducted on NPU hardware to
measure the performance improvement of the `post_update` operators.

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2026-03-23 20:29:55 +08:00
lijiahang226
170dcbda62 [Feature] Support DeepSeek for A5 (#7232)
### What this PR does / why we need it?

Add A5 mla operators to support running DeepSeek models on A5.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: Li Jiahang <216526138+lijiahang226@users.noreply.github.com>
2026-03-23 20:28:26 +08:00
Shaoxu Cheng
13397e9cb7 [310p] Add a PyTorch implementation of the GDN gating operator on 310P (#7430)
### What this PR does / why we need it?
RFC #7394
Add a PyTorch implementation of the GDN gating operator on 310P.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
UT

- vLLM version: v0.17.0
- vLLM main:
4497431df6

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-23 20:26:39 +08:00
meihanc
e344a53127 [bugfix][CI]Skip e2e log summary when the log file is missing or empty (#7552)
### What this PR does / why we need it?
Avoid failing `ci_log_summary.py` when the e2e log file is missing or
empty.

Tested in CI:
https://github.com/vllm-project/vllm-ascend/actions/runs/23428406256/job/68149271871
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

---------

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-03-23 20:25:59 +08:00
zhangxinyuehfad
886756aea0 [Bugfix][CI] Fix aisbench installation to avoid Gitee authentication (#7536)
### What this PR does / why we need it?
- Pass GITEE_USERNAME (var) and GITEE_TOKEN (secret) as Docker build
  args in nightly image build so Dockerfile can authenticate to Gitee
- In Dockerfile.nightly.a2/a3, embed credentials into clone URL to
  avoid auth failure during `git clone`
- In single-node and multi-node PR test workflows, backup the
  pre-installed benchmark from the nightly image before wiping
  vllm-ascend, then restore it instead of re-cloning from Gitee,
  which is inaccessible from fork PR contexts

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-23 20:16:51 +08:00
SILONG ZENG
ffd195b0fe [Bugfix]Remove conflicting triton after vllm-ascend install on x86 (#7497)
### What this PR does / why we need it?
This PR fixes the x86 image issue where both `triton` and
`triton-ascend` are installed in the final environment.
- https://github.com/vllm-project/vllm-ascend/issues/7359

We confirmed the root cause is not that `triton` fails to uninstall
after the upstream `vllm` installation. Instead, during the
`vllm-ascend` installation step, pip resolves and installs upstream
`triton` again alongside `triton-ascend` on x86 platforms. This leads to
module conflicts at runtime because both distributions provide the
`triton` Python package.

To fix this, this PR updates all Dockerfiles to remove upstream `triton`
immediately after installing `vllm-ascend`, while keeping the
`triton-ascend` version resolved by `vllm-ascend` itself.

Affected files:
- `Dockerfile`
- `Dockerfile.a3`
- `Dockerfile.310p`
- `Dockerfile.openEuler`
- `Dockerfile.a3.openEuler`
- `Dockerfile.310p.openEuler`

### Does this PR introduce _any_ user-facing change?
Yes.

For x86 container images, the final Python environment will no longer
keep upstream `triton` alongside `triton-ascend`. This avoids importing
the wrong Triton package and fixes related runtime failures.

### How was this patch tested?
Root cause validation was performed by reproducing the installation flow
locally and checking the package state after each step.

Observed during `vllm-ascend` installation on x86:
- `triton-ascend` was installed as expected
- upstream `triton` was also installed again in the same step
``` bash
export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip cache purge

Successfully installed aiofiles-25.1.0 arctic-inference-0.1.1 blinker-1.9.0 cmake-4.2.3 fastapi-0.123.10 
flask-3.1.3 h2-4.3.0 hpack-4.1.0 hypercorn-0.18.0 hyperframe-6.1.0 itsdangerous-2.2.0 numpy-1.26.4 
opencv-python-headless-4.11.0.86 pandas-3.0.1 pandas-stubs-3.0.0.260204 priority-2.0.0 pybind11-3.0.2 
python-dateutil-2.9.0.post0 quart-0.20.0 setuptools-scm-9.2.2 six-1.17.0 starlette-0.50.0 torch-2.9.0+cpu 
torch-npu-2.9.0 torchaudio-2.9.0+cpu torchvision-0.24.0+cpu triton-3.6.0 triton-ascend-3.2.0 
vllm_ascend-0.17.0rc2.dev51+geb92e7d50 werkzeug-3.1.6 wheel-0.46.3 wsproto-1.3.2 xgrammar-0.1.32
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with 
the system package manager, possibly rendering your system unusable. It is recommended to use a virtual 
environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what 
you are doing and want to suppress this warning.
Files removed: 423 (1025.9 MB)
Directories removed: 5
```

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-03-23 20:14:42 +08:00
liuhy1213-cell
fb283b5820 [CI] Add nightly CI test cases for the GLM-5 (#7429)
### What this PR does / why we need it?
Add nightly CI test cases for the GLM-5
Add model download for the GLM-5

https://github.com/vllm-project/vllm-ascend/actions/runs/23286178651/job/67710409642#logs
- vLLM version: v0.17.0
- vLLM main:
b31e9326a7
---------
Signed-off-by: liuhaiyang27 <liuhaiyang27@huawei.com>
Signed-off-by: liuhy1213-cell <liuhy1213@gmail.com>
Co-authored-by: liuhaiyang27 <liuhaiyang27@huawei.com>
2026-03-23 19:14:19 +08:00
drslark
41dadd4312 [main][bugfix] Solved the problem of the d node getting stuck in the pd-separation scenario (#7534)
### What this PR does / why we need it?
This PR solves a problem where the D node gets stuck in the PD-separation
scenario.

We found that it crashes at `torch.nn.functional.linear(x, weight, bias)`
after being stuck for a long time. The root cause is that the shapes on each
DP node were not aligned.

- vLLM version: v0.18.0
- vLLM main:
4034c3d32e

Signed-off-by: drslark <slarksblood@qq.com>
2026-03-23 18:53:07 +08:00
Zetong Li
a253235a59 [Doc] Add note for unsupported PCP + FULL (#7559)
### What this PR does / why we need it?
This PR adds a note to the docs that FULL mode is not supported in the PCP
scenario.

Signed-off-by: Zetong Li <slippersss@126.com>
2026-03-23 17:34:51 +08:00
Levi
9976e685b7 [Bugfix][eager][oom] fix rank0 load imbalance by no padding when multi dp (#7297)
### What this PR does / why we need it?
Fix the multi-DP padding logic for eager mode, because it causes a rank0
load imbalance in kimi-k2.5-w4a8 with all the padding tokens routed to
rank0. The fix also applies to other models in multi-DP.
- before
hbm usage:
<img width="2229" height="733" alt="image"
src="https://github.com/user-attachments/assets/50479b6d-cfd0-4206-8e80-974024652997"
/>

performance:
```shell
Concurrency  NumPrompts   QPS          TTFT_Avg     TTFT_P50     TPOT_Avg     TPOT_P50     TPOT_P90    
============ ============ ============ ============ ============ ============ ============ ============
1            15           0.0179       1667.7803    1673.3437    35.2973      35.2775      35.3784     
32           480          0.4725       2764.8027    1905.2137    40.8030      40.6978      41.0179     
64           960          0.7820       4123.7096    3485.6153    48.0461      48.1598      48.2971     
100          1500         1.0852       6216.7988    5714.0082    52.9323      53.0613      54.6304     
108          1620         1.1040       6277.4892    5798.7425    56.3862      56.9224      57.2901     
116          1740         1.1680       6563.3293    6039.5659    56.9894      57.4027      57.5786     
128          1920         1.2555       7822.5551    7604.1662    57.7660      58.1768      58.2717     
192          2880         1.4314       9212.1953    9131.3461    58.9905      59.1683      59.2791     
256          3840         1.4480       9028.0812    8913.7937    59.0092      59.2385      59.3516  
```


- after
hbm usage:
<img width="2246" height="1005" alt="image"
src="https://github.com/user-attachments/assets/d0936481-5a58-4bc5-a6f1-b92735d47885"
/>


performance:
```shell
Concurrency  NumPrompts   QPS          TTFT_Avg     TTFT_P50     TPOT_Avg     TPOT_P50     TPOT_P90    
============ ============ ============ ============ ============ ============ ============ ============
1            15           0.0181       601.4171     600.9774     35.6270      35.6254      35.6480     
32           480          0.4455       720.8782     724.2889     45.4250      45.4755      45.6318     
64           960          0.8445       729.6209     728.2149     47.0464      47.0896      47.1985     
100          1500         1.2601       723.4834     724.6673     48.3108      48.3844      48.5355     
108          1620         1.3409       727.1509     720.6772     48.8962      48.9409      49.0489     
116          1740         1.4080       679.9799     677.6119     49.1253      49.1983      49.3087     
128          1920         1.4155       680.6284     674.9436     49.2193      49.2450      49.3763     
192          2880         1.4422       684.6577     676.7833     49.2059      49.2264      49.3229     
256          3840         1.4558       685.2462     678.1709     49.2191      49.2351      49.3419    
```
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: fny-coder <985619145@qq.com>
2026-03-23 17:05:02 +08:00
Nengjun Ma
8e2c59e1ee Main2main upgrade vllm commit to 03 19 17:00 (#7478)
### What this PR does / why we need it?
Upgrade the vLLM commit to 2026.03.19.

1. Fix: the socket was removed from StatelessProcessGroup. Upstream vLLM PR
[#36330](https://github.com/vllm-project/vllm/pull/36330) ("elastic_ep:
Fix stateless group port races") refactored StatelessProcessGroup and
removed the `socket: socket.socket | None` field. The socket ownership was
moved to a new create_tcp_store() helper instead of being stored as a field
on the dataclass.

2. Fix: the `virtual_engine` parameter was removed from
`set_forward_context()`. Upstream "[V0 Deprecation] Deprecate virtual engine"
[#37195](https://github.com/vllm-project/vllm/pull/37195)

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-03-23 16:25:57 +08:00
LICO67373
caa71e50ca [Perf] Simplify FIA prefill context merge path (#7293)
### What this PR does / why we need it?
This PR simplifies and hardens MLA prefill context merging in
`vllm_ascend/attention/mla_v1.py` after FIA migration by directly
building `out_list/lse_list` (without temporary chunk buffers or
`cat/stack/split`) and using `reshape` for safe flattening of
non-contiguous tensors.
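
As a small, generic illustration of why `reshape` is the safer choice for flattening here (not code from the PR): `view` requires compatible contiguous strides, while `reshape` falls back to copying when needed.

```python
import torch

x = torch.randn(4, 8).transpose(0, 1)  # non-contiguous after the transpose
try:
    flat = x.view(-1)     # raises RuntimeError: incompatible strides
except RuntimeError:
    flat = x.reshape(-1)  # reshape copies when necessary, so it always works
```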

### Does this PR introduce _any_ user-facing change?
No. This is an internal refactor/stability improvement only; no
API/interface behavior changes.

### How was this patch tested?
- Verified tensor shape/data flow for `npu_attention_update` inputs
(`out_list/lse_list`) after refactor.
- Confirmed no lint errors in the modified file.
- CI UT coverage on attention/MLA paths is used for validation.

vLLM version: `v0.17.0`  
vLLM main: `vllm-project/vllm@4034c3d`

---------

Signed-off-by: lico67373 <918688502@qq.com>
2026-03-23 07:47:42 +00:00
dependabot[bot]
da866cc168 [CI] Bump docker/build-push-action from 6 to 7 (#7541)
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6 to 7.

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-23 15:46:12 +08:00
Qiu
71df17f4e6 bugfix(MC2): refactor the comm group of MC2 to be compatible with PP (#7291)
### What this PR does / why we need it?
This PR refactors the communication group of MC2 to keep it consistent
with vllm's EP group, making it compatible with PP.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-03-23 15:44:21 +08:00
dependabot[bot]
8527b49764 [CI] Bump docker/setup-buildx-action from 3 to 4 (#7542)
Bumps [docker/setup-buildx-action](https://github.com/docker/setup-buildx-action) from 3 to 4.

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-23 15:44:14 +08:00
Shaoxu Cheng
5b60b530d6 [Bugfix][310p] the new A5 mmencoder op donot support 310p (#7518)
### What this PR does / why we need it?

Because the new A5 MMEncoder operator was merged, the 310P can no longer
run any VL models. This PR fixes that issue. Details in #7046.

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
e2e
- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-23 15:40:34 +08:00
Mengqing Cao
9e2878065a [Spec-Decode] Fix spec decode proposer in 0.18.0 (#7544)
### What this PR does / why we need it?
Since vllm-ascend main no longer maintains v0.17.0, we keep only the single
(v0.18.0) branch in the eagle proposer; otherwise it raises an error on
v0.18.0.

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

Signed-off-by: MengqingCao <cmq0113@163.com>
2026-03-23 15:39:24 +08:00
Shanshan Shen
6b7d9b76f1 [MM][Perf] Pre-compute seq_lens and put it on CPU before ViT vision blocks for better performance (#7104)
### What this PR does / why we need it?
**Background:**

PR https://github.com/vllm-project/vllm-ascend/pull/6448 has introduced
a `seq_lens` CPU cache mechanism, which will considerably benefit the
performance for VL models but may lead to accuracy issues. Thus, we have
reverted it.

**Proposed Change:**

In PR https://github.com/vllm-project/vllm/pull/36605, we have supported
custom processing logic for OOT MMEncoder kernels in vLLM. Thus, we can
pre-compute `seq_lens` (rather than `cu_seqlens`) and put it on CPU
before ViT vision blocks to avoid redundant computation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

####  Functional Test

Run Qwen2.5-VL:

```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--limit-mm-per-prompt '{"image": 1}'
```

Output:

```bash
"The text in the illustration is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The font appears to be modern and clean, with \"TONGYI\" having a slightly bolder and more prominent appearance compared to \"Qwen.\" The overall design is simple and professional."
```

> [!NOTE]
> Since PR https://github.com/vllm-project/vllm/pull/36605 only modified
`Qwen3-VL` modeling files, this PR has no effect on the `Qwen2.5-VL` model.

---
Run Qwen3-VL:

```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--limit-mm-per-prompt '{"image": 1}'
```

Output:

```bash
"The text in the illustration is **“TONGYI Qwen”**.\n\n### How it looks:\n- **“TONGYI”** is written in **uppercase letters** in a **bold, modern sans-serif font**, colored **blue**.\n- **“Qwen”** is written in **lowercase letters** in a **slightly thinner, elegant sans-serif font**, colored **dark gray**.\n- The two lines of text are stacked vertically, with TONG."
```

---
####  Benchmark

Launch the server:

```
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--dtype bfloat16 \
--limit-mm-per-prompt '{"image": 1}' \
--max-model-len 16384 \
--max-num-batched-tokens 16384
```

Run benchmark:

```
vllm bench serve \
--model /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--backend openai-chat \
--endpoint /v1/chat/completions \
--dataset-name hf \
--hf-split train \
--dataset-path lmarena-ai/vision-arena-bench-v0.1 \
--num-prompts 500 \
--request-rate 10 \
--burstiness 5 \
--no-stream
```

Before this PR:

```
============ Serving Benchmark Result ============
Successful requests:                     500       
Failed requests:                         0         
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  78.58     
Total input tokens:                      33418     
Total generated tokens:                  61431     
Request throughput (req/s):              6.36      
Output token throughput (tok/s):         781.78    
Peak output token throughput (tok/s):    2475.00   
Peak concurrent requests:                383.00    
Total token throughput (tok/s):          1207.07   
---------------Time to First Token----------------
Mean TTFT (ms):                          7116.24   
Median TTFT (ms):                        4295.84   
P99 TTFT (ms):                           18370.87  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          245.78    
Median TPOT (ms):                        264.03    
P99 TPOT (ms):                           334.38    
---------------Inter-token Latency----------------
Mean ITL (ms):                           246.99    
Median ITL (ms):                         117.71    
P99 ITL (ms):                            1327.55   
==================================================
```

After this PR:

```
============ Serving Benchmark Result ============
Successful requests:                     500       
Failed requests:                         0         
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  77.44     
Total input tokens:                      33418     
Total generated tokens:                  61522     
Request throughput (req/s):              6.46      
Output token throughput (tok/s):         794.40    
Peak output token throughput (tok/s):    2691.00   
Peak concurrent requests:                369.00    
Total token throughput (tok/s):          1225.91   
---------------Time to First Token----------------
Mean TTFT (ms):                          6888.64   
Median TTFT (ms):                        4128.82   
P99 TTFT (ms):                           17487.94  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          240.14    
Median TPOT (ms):                        259.18    
P99 TPOT (ms):                           313.15    
---------------Inter-token Latency----------------
Mean ITL (ms):                           241.84    
Median ITL (ms):                         121.08    
P99 ITL (ms):                            1470.33   
==================================================
```

**Performance Metrics:**

| Metric | Before this PR | After this PR | Comparison |
| :----- | :------------- | :------------ | :--------- |
| **Throughput** | | | |
| Request throughput (req/s) | 6.36 | 6.46 | +1.57% ↑ |
| Output token throughput (tok/s) | 781.78 | 794.40 | +1.61% ↑ |
| Total token throughput (tok/s) | 1,207.07 | 1,225.91 | +1.56% ↑ |
| Peak output token throughput (tok/s) | 2,475 | 2,691 | +8.73% ↑ |
| **Latency** | | | |
| Benchmark duration (s) | 78.58 | 77.44 | -1.45% ↓ |
| Mean TTFT (ms) | 7,116.24 | 6,888.64 | -3.20% ↓ |
| Median TTFT (ms) | 4,295.84 | 4,128.82 | -3.89% ↓ |
| P99 TTFT (ms) | 18,370.87 | 17,487.94 | -4.81% ↓ |
| Mean TPOT (ms) | 245.78 | 240.14 | -2.29% ↓ |
| Median TPOT (ms) | 264.03 | 259.18 | -1.84% ↓ |
| P99 TPOT (ms) | 334.38 | 313.15 | -6.35% ↓ |
| Mean ITL (ms) | 246.99 | 241.84 | -2.09% ↓ |
| Median ITL (ms) | 117.71 | 121.08 | +2.86% ↑ |
| P99 ITL (ms) | 1,327.55 | 1,470.33 | +10.76% ↑ |

**🤖 AI Summary:**

- The most notable improvement is in P99 TPOT, which dropped **-6.35%**
from 334.38ms → 313.15ms, indicating reduced tail latency for per-token
generation under heavy load.
- TTFT improved across all percentiles: mean dropped **-3.20%** (7,116ms
→ 6,889ms), median **-3.89%** (4,296ms → 4,129ms), and P99 **-4.81%**
(18,371ms → 17,488ms), reflecting faster time-to-first-token across the
board.
- TPOT also improved consistently, with mean down **-2.29%** (245.78ms →
240.14ms) and median down **-1.84%** (264.03ms → 259.18ms), showing a
modest but steady reduction in per-token generation time.
- Throughput saw a slight uplift of roughly **+1.6%** across request,
output token, and total token throughput. Peak output token throughput
jumped **+8.73%** (2,475 → 2,691 tok/s), suggesting better burst
handling capacity.
- P99 ITL increased **+10.76%** (1,328ms → 1,470ms), the largest
regression in the run. Median ITL also ticked up **+2.86%** (117.71ms →
121.08ms). These tail-latency spikes may reflect scheduling variability
under peak concurrency and could be within run-to-run noise, but are
worth monitoring.
- Overall, the PR delivers a consistent improvement in both throughput
and latency, with the caveat that P99 inter-token latency regressed —
likely a transient effect given that mean ITL still improved by
**-2.09%**.

---
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2026-03-23 15:24:26 +08:00
Shanshan Shen
5c0d02f689 [Bugfix] Fix multi-instance serving OOM on single card (#7427)
### What this PR does / why we need it?
Fix https://github.com/vllm-project/vllm-ascend/issues/7308.

Subtract `init_non_torch_memory` (which may already be used by the first
instance) from the total `non_torch_memory` when calculating
`available_kv_cache_memory`; that is, directly use
`non_torch_memory_increase` (already contained in `non_kv_cache_memory`)
to calculate `available_kv_cache_memory`.
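
A rough sketch of the corrected accounting, using the quantity names from the logs below (illustrative, not the exact code):

```python
def available_kv_cache_memory(requested_memory: float,
                              non_kv_cache_memory: float,
                              non_torch_memory_cleared_by_empty_cache: float) -> float:
    # non_kv_cache_memory already contains non_torch_memory_increase, i.e.
    # only the non-torch memory this instance itself added, so memory held by
    # another instance on the same card is no longer counted against us.
    return (requested_memory
            - non_kv_cache_memory
            - non_torch_memory_cleared_by_empty_cache)

# Second-instance numbers from the "After this PR" log below:
# 18.287 - 1.233 - 0.0 ≈ 17.05 GiB of available KV cache memory.
```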

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Launch two vllm-ascend instances sequentially on a single card.

```bash
# Launch first instance
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B \
--port 8100 \
--host 0.0.0.0 \
--additional-config='{"enable_cpu_binding":true}'  \
--gpu-memory-utilization 0.3 \
--max-num-seqs 1 \
--max-model-len 2048 \
--max-num-batched-tokens 2048 \
--no-enable-prefix-caching \
--enforce-eager

# Launch second instance
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B \
--port 8101 \
--host 0.0.0.0 \
--additional-config='{"enable_cpu_binding":true}'  \
--gpu-memory-utilization 0.3 \
--max-num-seqs 1 \
--max-model-len 2048 \
--max-num-batched-tokens 2048 \
--no-enable-prefix-caching \
--enforce-eager
```

**Before this PR:**

```bash
# First instance:
------------------------------------------------------------------
requested_memory: 18.287109375 GiB
non_kv_cache_memory: 1.2340388298034668 GiB
init_non_torch_memory: 0.3616676330566406 GiB
non_torch_memory_before_empty_cache: 0.3896217346191406 GiB
non_torch_memory_increase: 0.0279541015625 GiB
non_torch_memory_cleared_by_empty_cache: 0.3616676330566406 GiB
------------------------------------------------------------------

# Second instance:
------------------------------------------------------------------
requested_memory: 18.287109375 GiB
non_kv_cache_memory: 1.2336344718933105 GiB
init_non_torch_memory: 18.37220001220703 GiB
non_torch_memory_before_empty_cache: 18.399906158447266 GiB
non_torch_memory_increase: 0.02754974365234375 GiB
non_torch_memory_cleared_by_empty_cache: 18.372356414794922 GiB
------------------------------------------------------------------
# available_kv_cache_memory = requested_memory - non_kv_cache_memory - non_torch_memory_cleared_by_empty_cache
Available KV cache memory: -1.32 GiB
```

**After this PR:**

```bash
# First instance:
------------------------------------------------------------------
requested_memory: 18.287109375 GiB
non_kv_cache_memory: 1.2340540885925293 GiB
init_non_torch_memory: 0.36182403564453125 GiB
non_torch_memory_before_empty_cache: 0.38979339599609375 GiB
non_torch_memory_increase: 0.0279693603515625 GiB
non_torch_memory_cleared_by_empty_cache: 0.0 GiB
------------------------------------------------------------------

# Second instance:
------------------------------------------------------------------
requested_memory: 18.287109375 GiB
non_kv_cache_memory: 1.233344554901123 GiB
init_non_torch_memory: 18.74309539794922 GiB
non_torch_memory_before_empty_cache: 18.770355224609375 GiB
non_torch_memory_increase: 0.02725982666015625 GiB
non_torch_memory_cleared_by_empty_cache: 0.0 GiB
------------------------------------------------------------------
# available_kv_cache_memory = requested_memory - non_kv_cache_memory - non_torch_memory_cleared_by_empty_cache
Available KV cache memory: 17.05 GiB
```

- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
2026-03-23 14:22:59 +08:00
guanguan0308
44ef9a36ac [fix]: fix precision issue in dispatch_ffn_combine_bf16 and remove redundant sync (#7198)
### What this PR does / why we need it?
Fix the precision issue in dispatch_ffn_combine_bf16 operator.
Remove redundant synchronization operations in dispatch_ffn_combine
operator.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: guanguan0308 <1546542263@qq.com>
2026-03-23 10:14:03 +08:00
Canlin Guo
e68464a1d6 [Bugfix] Fix slow hasattr in ACLGraphWrapper.__getattr__ (#7442)
### What this PR does / why we need it?

Follow https://github.com/vllm-project/vllm/pull/37425,
https://github.com/vllm-project/vllm-omni/pull/1982

Copied from them:

Notice that `hasattr(self.model, "flush_pending_metadata")` costs ~6 ms per
decode step when profiling Qwen3 Omni.

The original `CUDAGraphWrapper.__getattr__` raises:
```python
  raise AttributeError(f"... cudagraph wrapper: {self.runnable}")
  ```
When `hasattr()` is called for a non-existent attribute, Python internally
calls `__getattr__`, which constructs this AttributeError. The
`{self.runnable}` triggers `__repr__()` on the underlying model (e.g.,
`Qwen3OmniMoeForConditionalGeneration`), which recursively traverses the
entire nn.Module tree to generate an 18,000+ character string. This
takes ~6-7 ms per call.
Since `hasattr(self.model, "flush_pending_metadata")` is called every
decode step in the Talker forward path, this adds ~6 ms of overhead per
step, severely impacting audio inter-chunk latency (ICL).

```Python
hasattr(self.model, "flush_pending_metadata")
  → getattr(self.model, "flush_pending_metadata")
    → not found in CUDAGraphWrapper.__dict__
    → not found in the CUDAGraphWrapper class hierarchy
    → triggers CUDAGraphWrapper.__getattr__("flush_pending_metadata")
      → hasattr(self.runnable, "flush_pending_metadata")  # runnable also doesn't have it
      → executes raise AttributeError(f"... {self.runnable}")
        → Python needs to construct the exception object
        → the f-string triggers self.runnable.__repr__()
        → Qwen3OmniMoeForConditionalGeneration.__repr__()
          → recursively traverses the entire nn.Module tree
          → generates a 18,000+ character string
          → takes ~6 ms
        → AttributeError object is created
    → hasattr catches the AttributeError and returns False
    → the 18,000-character string is immediately discarded (no one ever sees it)
```
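
A minimal sketch of one way to avoid the expensive message construction (not necessarily the exact upstream fix): build the error message from the wrapped type's name only, instead of interpolating `self.runnable` into the f-string.

```python
class ACLGraphWrapper:
    def __init__(self, runnable):
        self.runnable = runnable

    def __getattr__(self, name: str):
        # Only reached when normal lookup fails; "runnable" itself is found
        # in __dict__, so this does not recurse.
        runnable = self.runnable
        if hasattr(runnable, name):
            return getattr(runnable, name)
        # Name only the wrapped type; repr(runnable) would walk the whole
        # nn.Module tree and cost milliseconds on every hasattr() miss.
        raise AttributeError(
            f"{type(runnable).__name__} wrapped by ACLGraphWrapper has no "
            f"attribute {name!r}")
```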

### Does this PR introduce _any_ user-facing change?

NO.

### How was this patch tested?

See https://github.com/vllm-project/vllm-omni/pull/1982


- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2026-03-23 09:26:24 +08:00
Li Wang
75fae619d5 [Misc] Refactor aclgraph accuracy test to use logprob-based comparison (#7455)
### What this PR does / why we need it?

Replace text-match assertions with a two-tier logprob accuracy check:

- Prefill (token 0): assert token ID is identical between eager baseline
and compiled mode, then verify logprob matches within `atol`.
- Decode (tokens 1-2): if chosen tokens match, compare logprobs
directly; if they differ, cross-lookup the baseline token in the
compiled model's top-20 distribution and assert the assigned logprob is
within `decode_atol` (defaults to 2x atol). This tolerates minor argmax
drift caused by floating-point differences while still catching
distribution divergence.
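
A condensed sketch of the two-tier check described above (data structures and parameter names are illustrative):

```python
def check_token(i, base_tok, base_lp, comp_tok, comp_top20, atol, decode_atol):
    # comp_top20[i]: dict mapping token id -> logprob from the compiled run.
    if i == 0 or comp_tok[i] == base_tok[i]:
        # Prefill token must match exactly; matching tokens compare directly.
        assert comp_tok[i] == base_tok[i]
        assert abs(comp_top20[i][comp_tok[i]] - base_lp[i]) <= atol
    else:
        # Argmax drifted: look up the baseline token in the compiled top-20
        # and allow the looser decode tolerance (defaults to 2x atol).
        assert base_tok[i] in comp_top20[i]
        assert abs(comp_top20[i][base_tok[i]] - base_lp[i]) <= decode_atol
```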

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
8a680463fa

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-23 09:08:21 +08:00
Qi Mao
9bf9b4b267 [Feature] Optimize Qwen3.5/Qwen3Next GDN prefill by prebuilding chunk metadata (#7487)
### What this PR does / why we need it?
This PR optimizes the Qwen3.5 and Qwen3Next GDN prefill path on Ascend
by reducing host/device synchronization overhead.

The current implementation of the `chunk_gated_delta_rule` path for
variable-length sequences prepares chunk metadata during the forward
pass. This approach triggers frequent CPU intervention and host/device
round-trips. When running prefill-heavy workloads with asynchronous
scheduling enabled, these synchronizations result in execution "bubbles"
and prefill stalling (stuttering). **Note that this does not cause
asynchronous scheduling to fail; rather, it prevents the system from
reaching its theoretical throughput due to these unnecessary stalls.**

To resolve this, the patch moves metadata preparation out of the hot
path:
- **Prebuilt Metadata:** All non-speculative varlen chunk metadata for
GDN is now prebuilt on the CPU.
- **Asynchronous Transfer:** Staging buffers are kept in pinned memory
and transferred to the NPU asynchronously.
- **Integration:** The prebuilt bundle is attached to GDN attention
metadata via `patch_gdn_attn.py` and passed into Triton wrappers.
- **Backward Compatibility:** Triton wrappers fall back to the legacy
preparation path if no prebuilt metadata is provided.
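
A tiny sketch of the pinned-memory staging pattern mentioned above, assuming an NPU environment with `torch_npu` installed (buffer size and contents are illustrative):

```python
import torch
import torch_npu  # noqa: F401  # registers the "npu" device with torch

# Prebuild the chunk metadata on the CPU into a pinned staging buffer...
staging = torch.empty(1024, dtype=torch.int32, pin_memory=True)
staging.copy_(torch.arange(1024, dtype=torch.int32))

# ...then hand it to the device asynchronously; pinned host memory lets the
# copy overlap with other host-side work instead of forcing a sync.
device_buf = staging.to("npu", non_blocking=True)
```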

- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: maoxx241 <maomaoyu870@gmail.com>
2026-03-22 23:09:23 +08:00
LoganJane
b2e71b7930 [Bugfix] Fix get_rope_shape for Kimi-K2.5 (#7521)
### What this PR does / why we need it?
Delete the logic that moves the input of get_rope_shape from device to host.

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

Signed-off-by: LoganJane <loganJane73@hotmail.com>
2026-03-22 21:06:31 +08:00
Cao Yi
9e2965bae2 [Feature] Support Flash Comm V1 for VL models (with MLA) (#7390)
## Summary

Flash Comm V1 (flashcomm1) was previously blocked for all VL models.

**Root cause:** For VL models, `inputs_embeds` at layer 0 originates
from the vision encoder as a full `[N, H]` tensor — it has **not** been
reduce-scattered across TP ranks. The original MLA forward path assumed
inputs were already scattered, producing wrong output shapes under TP >
1.

**Fix:**
- Detect at init time (statically, not via runtime shape checks) whether
a layer is the first layer of a VL model (`is_vl_first_layer`) so dynamo
treats the branch as a constant.
- In `AscendMultiHeadLatentAttention.forward`, when `flashcomm1 + TP > 1
+ is_vl_first_layer`, set `need_gather_q_kv=False` and pre-allocate
output as `[N//tp_size, H]`.
- Remove the platform-level assertion that prevented VL models from
enabling Flash Comm V1.
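
For background, the shape convention flashcomm1 assumes (a generic sketch using plain `torch.distributed`, not the code touched by this PR): a reduce-scatter across the TP group turns a full `[N, H]` activation into a per-rank `[N // tp_size, H]` shard.

```python
import torch
import torch.distributed as dist

def scatter_hidden_states(full: torch.Tensor, tp_size: int) -> torch.Tensor:
    # full: [N, H] hidden states, e.g. the vision-encoder output at layer 0.
    # Requires torch.distributed to be initialized with tp_size ranks.
    n, h = full.shape
    shard = torch.empty(n // tp_size, h, dtype=full.dtype, device=full.device)
    dist.reduce_scatter_tensor(shard, full)  # each rank keeps its own slice
    return shard
```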

**Other improvements:**
- `is_vl_model()` now uses vllm's canonical detection (`hf_config is not
hf_text_config`) instead of fragile key-name checks, with the old checks
kept as fallback.
- Added `parse_layer_idx(prefix)` utility.
- Added `maybe_chunk_residual` call in `AscendRMSNorm` before the
add-rms-norm op.
- Removed unnecessary CPU/fp32 round-trip in
  `AscendLearnable2DInterpPosEmbDivided_fixed.forward()`.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: LoganJane <loganJane73@hotmail.com>
2026-03-22 21:05:28 +08:00
Qi Mao
9d0b7c8e98 [Platform][BugFix] Preserve hybrid block size on Ascend (#7528)
### What this PR does / why we need it
This PR fixes a startup regression for Ascend hybrid attention + mamba
models after upgrading to vLLM `0.18.0`. After the upgrade, worker
initialization calls the generic platform hook
- `current_platform.update_block_size_for_backend(vllm_config)`

which would otherwise override the block size already computed by the
hybrid model-specific config logic.

### How this PR fixes it

This PR keeps the fix strictly inside `vllm-ascend`.

It adds an Ascend override for
`NPUPlatform.update_block_size_for_backend()`:

- for hybrid models, do not run the generic upstream block-size fallback
- preserve the block size that was already computed by the hybrid
model-specific config logic
- for non-hybrid models, keep the original upstream behavior unchanged
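
A skeletal sketch of the override described above (`_is_hybrid_model` is a hypothetical stand-in for the real Ascend hybrid-model check):

```python
from vllm.platforms.interface import Platform

def _is_hybrid_model(vllm_config) -> bool:
    # Hypothetical placeholder; the real check lives in the Ascend
    # hybrid-model config logic.
    return getattr(vllm_config.model_config, "is_hybrid", False)

class NPUPlatform(Platform):
    @classmethod
    def update_block_size_for_backend(cls, vllm_config) -> None:
        if _is_hybrid_model(vllm_config):
            # Preserve the block size already computed by the hybrid
            # model-specific config logic; skip the generic fallback.
            return
        # Non-hybrid models keep the original upstream behavior.
        super().update_block_size_for_backend(vllm_config)
```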

- vLLM version: v0.18.0
- vLLM main:
8b6325758c
---------
Signed-off-by: maoxx241 <maomaoyu870@gmail.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2026-03-22 11:21:49 +08:00
XiaoxinWang
cbf46fad3c fixed graph mode bug. (#7460)
### What this PR does / why we need it?
In fulldecodeonly mode, num_req_padded was set to an incorrect value,
causing accuracy degradation in Qwen3-Next. Therefore, we added a check
for compilation_config.cudagraph_mode to the conditional logic, ensuring
that padding is applied only in FULL mode.


### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
8a680463fa

Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2026-03-22 10:09:37 +08:00
Zetong Li
84a74f0cb1 [Bugfix] Fix padding logic in eagle proposer for kimi25 (#7348)
### What this PR does / why we need it?
This PR aims to fix padding logic in eagle proposer for kimi25. Main
changes involve:
1. modify the way to obtain draft model attention builder and backend
2. add block table padding & related tensor slicing in common metadata
when `draft_step>1` for solving fia verifying error
3. replace block table in `update_graph_params` for solving fia
verifying error

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: Zetong Li <slippersss@126.com>
2026-03-21 16:57:22 +08:00
zhangxinyuehfad
f482c314cf Upgrade vllm v0.18.0 in dockerfile (#7523)
### What this PR does / why we need it?
Upgrade vllm v0.18.0 in dockerfile

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-21 16:19:41 +08:00
meihanc
bff4fbfca5 upgrade to 0.18.0 (#7502)
### What this PR does / why we need it?
1. upgrade to 0.18.0
2. ensure kernel_block_sizes is int for Eagle drafter
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2026-03-21 16:05:38 +08:00
HongtaoYang
80a4265717 [Feat] Support separate attention backend for target and draft model. (#7342)
### What this PR does / why we need it?
This PR enables separate attention backend configuration for target and
draft models in speculative decoding, decoupling the previously bound
attention backend settings between the two models.

It solves the compatibility issue where some draft models do not support
the attention backend used by the target model, and allows users to
select the optimal attention backend for each model individually to
maximize inference performance. The change is fully backward compatible.
---------
Signed-off-by: SidaoY <1024863041@qq.com>
2026-03-21 10:48:01 +08:00
linfeng-yuan
88d03a783f [refactor] replace scattered business kwargs with typed request objects and explicit stage boundaries (#7024)
### What this PR does / why we need it?
Refactor `vllm_ascend/ops/fused_moe` to replace scattered MoE business
`**kwargs` with typed request objects and explicit stage boundaries.

- Prepare, dispatch, MLP, and quant stages now have clearer ownership.
- Main MoE path no longer depends on business `kwargs.get(...)` lookups.
- Comm and dispatcher interfaces are request-only on the main path.
- UTs can assert stage-level fields directly instead of inferring
behavior indirectly.
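
A toy sketch of what typed request objects with explicit stage boundaries look like (field names are invented for illustration; they are not the actual classes):

```python
from dataclasses import dataclass
import torch

@dataclass
class MoEDispatchRequest:
    hidden_states: torch.Tensor
    topk_ids: torch.Tensor
    topk_weights: torch.Tensor

@dataclass
class MoEMLPRequest:
    permuted_hidden_states: torch.Tensor
    expert_token_counts: torch.Tensor

# Each stage consumes one typed request and produces the next, so the main
# MoE path no longer relies on kwargs.get(...) to discover what it was given.
```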

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed.

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-20 23:23:57 +08:00
yesyue-w
c860535246 【A5】【Qwen VL】Qwen VL adapt for A5 (#7046)
### What this PR does / why we need it?
Replace the '_npu_flash_attention_unpad' operator with the
'npu_fusion_attention' operator to ensure that the Qwen VL model can run
in the A5 environment and remove the 'mrope' operator call restriction
for A5.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: 汪越 <wangyue361@h-partners.com>
2026-03-20 16:56:12 +08:00
idouba
f39f566e22 Refactor duplicated code into a common method to reduce redundancy (#7210)
### What this PR does / why we need it?

1. Extract duplicated code into a common method.

That is, define _get_input_parallel_ in the parent class
_CustomRowParallelOp_ and call the helper from the _apply_impl_ method of
its 5 child classes:
- MLPRowParallelOp
- OProjRowParallelOp
- Flashcomm2OProjRowParallelOp
- MatmulAllreduceRowParallelOp
- SequenceRowParallelOp

2. Fix a variable typo: use split instead of splitted for the past tense.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: idouba <zhangchaomeng@huawei.com>
2026-03-20 16:49:02 +08:00
Li Wang
6ad74e8c80 [CI] Add git safe repo (#7501)
### What this PR does / why we need it?
Add the repo to git's safe.directory list to avoid the dubious ownership error.

- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-20 16:40:24 +08:00
Siyuan Kong
a16c99141b Adapt w8a8mxfp8 quantization for Qwen VL models (#7417)
### What this PR does / why we need it?

This PR adapts the `w8a8_mxfp8` quantization method to support Qwen
Vision-Language (VL) models. Key changes include:
- Reshaping multi-dimensional input tensors to 2D before the quantized
matrix multiplication.
- Reshaping the 2D output back to its original multi-dimensional format.
- Adding specific output reshaping for the visual components of Qwen VL
models.
- Casting the bias tensor to `float32` to comply with the
`npu_quant_matmul` kernel requirements.

These changes are necessary to enable `w8a8_mxfp8` quantization for
models with multi-modal inputs like Qwen VL.
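
A small sketch of the reshape-around-matmul pattern listed above (`quant_matmul_2d` is a stand-in callable, not a real kernel name):

```python
import torch

def apply_quant_linear(x: torch.Tensor, quant_matmul_2d) -> torch.Tensor:
    # Flatten leading dims (e.g. [batch, seq, hidden] from a VL model) to 2D,
    # run the 2D-only quantized matmul, then restore the original layout.
    orig_shape = x.shape
    out_2d = quant_matmul_2d(x.reshape(-1, orig_shape[-1]))
    return out_2d.reshape(*orig_shape[:-1], out_2d.shape[-1])
```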

### Does this PR introduce _any_ user-facing change?

No, this is a backend enhancement to extend quantization support to new
model architectures. There are no user-facing API or behavior changes.

### How was this patch tested?

CI is expected to pass. Manual testing should be performed with a Qwen
VL model using `w8a8_mxfp8` quantization to verify correctness and
performance.

- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: ksiyuan <ksiyuan@umich.edu>
2026-03-20 16:18:58 +08:00
LI SHENGYONG
4e6dbe0956 [EPLB][Bugfix] Set parallel_config.enable_eplb to true to load redundant experts (#7470)
### What this PR does / why we need it?
PR https://github.com/vllm-project/vllm/pull/37136 breaks EPLB because it
filters out redundant experts.
PR https://github.com/vllm-project/vllm/pull/37322 fixes it by using
parallel_config.enable_eplb to determine whether to skip the weight-loading
filter.
But in vllm-ascend, parallel_config.enable_eplb is always false, so when we
use EPLB we temporarily set it to true.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

![Snipaste_2026-03-19_16-13-01](https://github.com/user-attachments/assets/b3a4911e-36b3-4c31-951c-7c091f416d00)
| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-03-20 15:22:55 +08:00
LI SHENGYONG
1e05c4908f [EPLB] Reduce the memory used for batch_isend_irecv (#7344)
### What this PR does / why we need it?

#6729 appears to reduce the NPU memory usage of EPLB, but it actually just
moves the buffer allocation from dist.all_gather_into_tensor to
dist.batch_isend_irecv, so the overall NPU memory usage is not reduced. This
PR actually reduces the memory usage in this part.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Remaining memory of each rank before the repair.
<img width="649" height="99" alt="image"
src="https://github.com/user-attachments/assets/52a67592-e0e8-4f9a-b194-b84cb848c598"
/>

Remaining memory of each rank after the repair.
<img width="641" height="99" alt="image"
src="https://github.com/user-attachments/assets/0bc2e67c-f328-4dea-98af-d7a459fb4876"
/>

Close EPLB.
<img width="543" height="45" alt="image"
src="https://github.com/user-attachments/assets/6dcba19d-4401-44b8-a6d3-c7b35ee983c7"
/>

Memory of weights for each rank.
<img width="648" height="46" alt="image"
src="https://github.com/user-attachments/assets/4db2fd04-98a0-4d26-a026-2e8287102b99"
/>

Estimated memory for EPLB: 15.68  / 48 (layer_num) + 2 * 0.02 = 0.35 GB


- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-03-20 12:25:58 +08:00