xc-llm-ascend

Author	SHA1	Message	Date
zhangxinyuehfad	c1cefd26de	[v0.18.0][CI] Add nightly- prefix to branch/PR image tags (#7765 ) ### What this PR does / why we need it? Add nightly- prefix to branch/PR image tags Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-03-28 11:31:16 +08:00
jack	f83cb0e6dc	[Bugfix][Platform] Fix GLM47 tool-call finish backfill (#7710 ) ### What this PR does / why we need it? This rebases the GLM47 tool-call parser fix onto `releases/v0.18.0` after the MiniMax usage-accounting patch merged upstream on March 27, 2026. It fixes OpenAI chat tool-call streaming for GLM47 by: - draining terminal parser chunks that contain both the final argument text and the closing `</tool_call>` suffix - computing finish backfill from the tool argument bytes actually emitted to the client, instead of trusting parser-internal buffered state - adding focused regression tests for finish backfill and terminal chunk handling ### Does this PR introduce _any_ user-facing change? Yes. GLM47 OpenAI-compatible streaming tool-call responses now emit correct final chunks and argument payloads on `releases/v0.18.0`. ### How was this patch tested? - `pytest -q tests/ut/patch/platform/test_patch_glm_tool_call_parser.py tests/ut/patch/platform/test_patch_minimax_usage_accounting.py` - `python -m pre_commit run --files vllm_ascend/patch/platform/patch_glm_tool_call_parser.py tests/ut/patch/platform/test_patch_glm_tool_call_parser.py vllm_ascend/patch/platform/__init__.py vllm_ascend/patch/__init__.py` --------- Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>	2026-03-28 09:15:04 +08:00
SparrowMu	6fbd0049df	[v0.18.0] Apply Eagle3 to MiniMax-M2.5 (#7619 ) (#7714 ) ### What this PR does / why we need it? Apply Eagle3 to MiniMax-M2.5 to increase model performance This will be discard after Eagle3 weight for MiniMax-M2.5 releases and code change accepted by official repo https://github.com/vllm-project/vllm/pull/37512/changes backport: #7619 - vLLM version: v0.18.0 - vLLM main: `ed359c497a` Signed-off-by: limuyuan <limuyuan3@huawei.com> Co-authored-by: limuyuan <limuyuan3@huawei.com>	2026-03-27 18:33:29 +08:00
Feng-xiaosuo	60e88d9541	[v0.18.0][Refactor] Use forward mapping instead of reverse mapping in AscendMo… (#7716 ) …delSlimConfig (#7596) ### What this PR does / why we need it? This PR refactors the `AscendModelSlimConfig` class to use forward mapping instead of reverse mapping for quantization config key transformation. Changes: 1. Modified `apply_vllm_mapper()` to directly apply `hf_to_vllm_mapper.apply_dict()` to transform `quant_description` keys from HF format to vLLM format 2. Simplified `quant_prefix_mapper()` to return the prefix directly (no longer needs mapping since keys are already in vLLM format) 3. Removed `QUANT_MODEL_PREFIX_MAPPINGS` dictionary (~50 lines) - no longer needed 4. Removed `get_prefix_mapping()` function - no longer needed 5. Removed `vllm_to_hf_mapper` attribute - no longer needed Why this change is needed: The previous implementation used reverse mapping (vLLM → HF) which had several issues: - Some keys might not be used in the forward direction but would be incorrectly used in reverse - Empty values in the mapping would cause issues when reversed - Required maintaining a separate `QUANT_MODEL_PREFIX_MAPPINGS` dict that duplicated information already available in vLLM's model-specific `WeightsMapper` The new approach: - Uses the forward mapping (HF → vLLM) directly from vLLM's `WeightsMapper` - Eliminates the need for duplicate mapping definitions - Avoids issues with reverse mapping (unused keys, empty values) - Aligns with how `compressed_tensors_config.py` handles the same scenario - vLLM version: v0.18.0 - vLLM main: `ed359c497a` --------- <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> Signed-off-by: Matrix_K <zhangke144@huawei.com> Signed-off-by: Feng-xiaosuo <tengchang1@huawei.com> Co-authored-by: Matrix_K <zhangke144@huawei.com> Co-authored-by: Wang Kunpeng <1289706727@qq.com>	2026-03-27 18:25:42 +08:00
Angazenn	7cca7e6990	[v0.18.0][Misc] Recompute scheduler upgrade to vLLM 0.18.0 (#7720 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? cherry-pick from #7675 . The current RecomputeScheduler is aligned to Scheduler in vLLM v0.16.0. Since upstream vLLM has upgraded to v0.18.0, we also need to upgrade RecomputeScheduler to pick up missing updates. ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-03-27 18:24:53 +08:00
linfeng-yuan	ab619e1c53	[0.18.0][profiler] profile AICore and MTE time with torch profiler (#7730 ) ### What this PR does / why we need it? This pull request updates the NPU profiler configuration in the Ascend worker by changing the `aic_metrics` parameter from `AiCoreNone` to `PipeUtilization`. This change enables the collection of pipe utilization metrics during profiling sessions. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed. Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-03-27 16:37:54 +08:00
zhangxinyuehfad	2c175f5ed8	[v0.18.0][Bugfix] Fix pr triggers on branches for nightly test workflows (#7695 ) ### What this PR does / why we need it? 1. Allow PR triggers on `-dev` and `releases/v` branches for nightly test workflows. 2. fix image-tag in doc --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-03-27 15:17:06 +08:00
weiguihua2	bc8e87f3db	[v0.18.0][Bugfix] fix ds3.2 dcp mtp (#7681 ) ### What this PR does / why we need it? Fixed the issue where the DCP overlaps the MTP scenario in the ds3.2 scenario. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? cherry-pick from: https://github.com/vllm-project/vllm-ascend/pull/7617 Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-03-27 14:24:53 +08:00
bowenli	048c8d1afe	[v0.18.0][Bugfix] Fix the bug of MTP1 crashing in multiple concurrent scenarios. (#7699 ) ### What this PR does / why we need it? The triton operator does not perform boundary checks on the global position within the loop, leading to the memory overflow in scenarios with multiple concurrency + 1-step MTP launch. Solution: Add a check that global_pos < vec_len, and strictly limit the boundaries of all memory accesses to avoid out-of-bounds writes. backport：#7459 Signed-off-by: Bowen-Leee <caoshankuangren@gmail.com>	2026-03-27 14:13:12 +08:00
Debonet	6ce1dc162a	[v0.18.0] fix(attention): reuse weight address in graph + RL scenario (#7715 ) ### What this PR does / why we need it? In graph + RL scenario, we only capture the graph once, and the weight address is expected to be the same across iterations. However, when calling .contiguous() on weight tensors, a new memory address may be allocated, causing the graph to capture incorrect weight addresses. This PR modifies the weight update logic in AscendMLAImpl and AscendSFAImpl to use copy_() instead of reassignment, ensuring the weight addresses remain consistent across iterations. detailed in #7473 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: Debonex <719893090@qq.com>	2026-03-27 14:11:20 +08:00
Mengqing Cao	29308ac3a9	[v0.18.0][Bugfix] Fixed wrong class attribute assignment (#7586 ) (#7655 ) ### What this PR does / why we need it? Fixed incorrect class attribute assignment and corrected it to instance attribute assignment. Ensured reorder_batch_threshold only applies to the current instance to avoid global pollution and multi-instance conflicts. Backport of #7586 Signed-off-by: LookAround <lixushi@huawei.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: LookAround0301 <lixushi@huawei.com>	2026-03-27 11:20:59 +08:00
Li Wang	2c2d8bb015	[cherry-pick][CI] Enforce torchaudio and torchvison compatible with pta (#7688 ) ### What this PR does / why we need it? This patch cherry-pick from #7648 Signed-off-by: wangli <wangli858794774@gmail.com>	2026-03-27 11:06:13 +08:00
jack	53cc225cac	[v0.18.0][Bugfix][Platform] Fix MiniMax M2 reasoning token usage accounting (#7700 ) ### What this PR does / why we need it? This backports the MiniMax M2 reasoning-token usage accounting fix onto `releases/v0.18.0` for vllm-ascend. The release branch does not include the other local GLM patch commit, so this PR keeps the MiniMax change self-contained by: - registering `patch_minimax_usage_accounting` on the release branch - backporting `completion_tokens_details.reasoning_tokens` into chat usage generation - fixing MiniMax reasoning token counting for `</think>`-delimited outputs without depending on the GLM suffix patch ### Does this PR introduce _any_ user-facing change? Yes. OpenAI-compatible chat usage accounting for MiniMax M2 responses now reports corrected reasoning token counts on the release branch. ### How was this patch tested? - `python -m compileall vllm_ascend/patch/platform/patch_minimax_usage_accounting.py` - `python - <<'PY'` import check for `vllm_ascend.patch.platform.patch_minimax_usage_accounting` on top of `releases/v0.18.0` No targeted automated regression test exists for this release-branch backport yet, so I validated syntax and module import compatibility on the release branch. --------- Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>	2026-03-27 10:45:28 +08:00
zzzzwwjj	a40eee2ba1	[feat] support dispatch_v2/combine_v2 hierarchy communication (#7698 ) ### What this PR does / why we need it? This PR adds support for hierarchical communication for `dispatch_v2` and `combine_v2` MoE operations. This is achieved by introducing a new configuration `enable_mc2_hierarchy_comm`. When enabled, the communication algorithm is set to "hierarchy", which support mc2 op comm between two super pod. The changes include: - Adding `enable_mc2_hierarchy_comm` to `AscendConfig`. - Modifying `TokenDispatcherWithMC2` to pass `comm_alg: "hierarchy"` to the underlying `torch_npu` ops when the new config is enabled. - Adding validation to ensure that this feature is only used with compatible PTA/CANN versions and is not used with the conflicting `fused_mc2` op. - Updating `is_hierarchical_communication_enabled` to respect the new configuration flag. ### Does this PR introduce _any_ user-facing change? Yes, this PR introduces a new user-facing configuration option `enable_mc2_hierarchy_comm` in `additional_config` to enable hierarchical communication for MoE. ### How was this patch tested? - vLLM version: v0.18.0 Signed-off-by: zzzzwwjj <1183291235@qq.com>	2026-03-27 09:20:16 +08:00
Wang Kunpeng	0bab629f90	[v0.18.0][bugfix]fixed block_size incorrect setting issue in dsv3.2 (#7630 ) (#7652 ) ### What this PR does / why we need it? https://github.com/vllm-project/vllm/pull/35122 This PR in the vllm community refactors the update mode of block_size. As a result, when the user does not specify `--block-size`, dsv3.2 obtains an incorrect block_size. The root cause of the problem is analyzed from the block_size update process as follows: 1. In NPUPlatform, `check_and_update_config` calls `refresh_block_size` to set block_size to 128. 2. During Modelrunner initialization, the `self.block_size` parameter is generated. At this time, block_size is still 128. This parameter will be used for operations such as kvcache initialization. 3. `update_block_size_for_backend` updates block_size to the size set in attn_backend. The reason why the DSV3.2 is faulty is that it has an additional attn_backend `DeepseekV32IndexerBackend`, and this backend is not rewritten. The block_size obtained from attn_backend is 64. In this case, only `vllm_config.cache_config.block_size` is updated, and other parts are not modified. As a result, the block_size on the entire network is inconsistent. Modification solution: Skip `update_block_size_for_backend` and modify block_size only in the `check_and_update_config` method. In the future, the block_size update logic can be migrated to the `update_block_size_for_backend` method. Ensure that all block_size values on the entire network are updated. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: `ed359c497a` --------- <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2026-03-26 22:38:28 +08:00
HarpsealCC	d6661c09b6	[v0.18.0][kernel] Recompilation optimization triggered by triton function parameter optimization (#7647 ) ### What this PR does / why we need it? Some parameters of Triton operators are unnecessarily modified with the "constexpr" modifier. When these parameters change, recompilation is triggered, which significantly affects the model performance. Therefore, these parameters need to be rectified. - vLLM version: v0.17.0 - vLLM main: `8b6325758c` Signed-off-by: HarpSealCC [844291270@qq.com](mailto:844291270@qq.com) Signed-off-by: l30072083 <liuchengzhuo1@h-partners.com> Co-authored-by: l30072083 <liuchengzhuo1@h-partners.com>	2026-03-26 19:10:45 +08:00
zhangxinyuehfad	d781902ce9	[v0.18.0][CI] Fix releases/v0.18.0 ci test only support vllm v0.18.0 (#7686 ) ### What this PR does / why we need it? Fix releases/v0.18.0 ci test only support vllm v0.18.0 Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-03-26 18:36:04 +08:00
zhangxinyuehfad	124bb00158	[CI][v0.18.0] Build nightly image for releases/v0.18.0 per pr (#7662 ) ### What this PR does / why we need it? This patch add per pr image build for branch `releases/v0.18.0`, Due to the limitations of the quay naming convention, we should not name the image tag the same as branch name, we name the image tag`releases-v0.18.0` for daily build. Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-03-26 16:48:51 +08:00
cvSoldier	2db33868a4	[kernel] Recompilation optimization triggered by triton function parameter optimization (#7645 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? - Please clarify why the changes are needed. For instance, the use case and bug description. Some parameters of Triton operators are unnecessarily modified with the "constexpr" modifier. When these parameters change, recompilation is triggered, which significantly affects the model performance. Therefore, these parameters need to be rectified. main branch:https://github.com/vllm-project/vllm-ascend/pull/7483 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> --------- Signed-off-by: cvSoldier <610496306@qq.com>	2026-03-26 16:31:34 +08:00
Mr.WXS	dba34d4915	[v0.18.0][Triton][Qwen3.5] delete expr for kernels args (#7646 ) ### What this PR does / why we need it? Some parameters of Triton operators are unnecessarily modified with the "constexpr" modifier. When these parameters change, recompilation is triggered, which significantly affects the model performance. Therefore, these parameters need to be rectified. backport: https://github.com/vllm-project/vllm-ascend/pull/7482 Signed-off-by: w30012745 <wangxiaoshuai2@h-partners.com> Co-authored-by: w30012745 <wangxiaoshuai2@h-partners.com>	2026-03-25 23:31:27 +08:00
Wangbei25	dd55736ee4	fix uncompatible between fc1 and non-sp-padding (#7643 ) cherry pick https://github.com/vllm-project/vllm-ascend/pull/7614 ### What this PR does / why we need it? fix uncompatible between fc1 and non-sp-padding After PR [non-sp-padding](https://github.com/vllm-project/vllm-ascend/pull/7297), kimi2.5 open flashcomm1 will raise an error : The expanded size of the tensor do not match the existing size at non-singleton dimension 0. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.18.0 - vLLM-Ascend main: `9976e685b7` Signed-off-by: Wangbei25 <wangbei41@huawie.com> Co-authored-by: Wangbei25 <wangbei41@huawie.com>	2026-03-25 23:23:37 +08:00
wangbj127	2ad0ca52a6	Qwen3.5 MoE supports flashcomm v1 (#7644 ) cherry pick from https://github.com/vllm-project/vllm-ascend/pull/7486 <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> Multimodal models like Qwen3.5 MoE does embedding in model_runner, so when flash comm is enabled, the first AllGather operation should be skipped. ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> No. ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> - vLLM version: v0.18.0 - vLLM main: `8b6325758c` --------- Signed-off-by: Wangbingjie <wangbj1207@126.com> Signed-off-by: wangbj127 <256472688+wangbj127@users.noreply.github.com>	2026-03-25 23:09:33 +08:00
Wang Kunpeng	ff1860bd81	[CI]fix lint (#7641 ) ### What this PR does / why we need it? This pull request addresses a linting issue by reordering a specific configuration assignment within the `apply_config_platform_defaults` method in `vllm_ascend/platform.py`. This change ensures compliance with code style guidelines without altering the functional behavior of the system. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: `ed359c497a` Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2026-03-25 18:48:10 +08:00
linfeng-yuan	05a561129e	[Graph][Bugfix] Set default cudagraph max capture size via platform defaults (#7572 ) ### What this PR does / why we need it? This PR lets NPU platform provide its own default `max_cudagraph_capture_size` via `NPUPlatform.apply_config_platform_defaults()`. Previously, when cudagraph sizing was left unset, Ascend inherited vLLM's upstream default heuristic in `_set_cudagraph_sizes()`, which uses `max_num_seqs * decode_query_len * 2`. This PR changes Ascend's default to `min(max_num_seqs * decode_query_len, 512)` while keeping the rest of vLLM's cudagraph sizing logic unchanged. ### Does this PR introduce _any_ user-facing change? Yes, but only for Ascend when users do not explicitly configure cudagraph sizing. If `max_cudagraph_capture_size` and `cudagraph_capture_sizes` are both unset, we now uses `max_num_seqs * decode_query_len` (capped at `512`) instead of the upstream `* 2` default. Explicit user settings are unchanged. ### How was this patch tested? Add unit tests to cover: - default max injection via `apply_config_platform_defaults()` - explicit `max_cudagraph_capture_size` is preserved - explicit `cudagraph_capture_sizes` are preserved - Ascend default max no longer uses the upstream `* 2` - late `_set_cudagraph_sizes()` recomputation reuses the current max input - vLLM version: v0.18.0 - vLLM main: `ed359c497a` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-03-25 17:57:19 +08:00
linfeng-yuan	d452d04656	[A5][bugfix] Fix fused MoE A5 MXFP8 scale normalization, load-balance routing and gating_topk ops (#7573 ) ### What this PR does / why we need it? This PR fixes A5 MXFP8 MoE scale handling in the fused MoE path. - It normalizes MXFP8 activation scales to the packed 3D layout expected by A5 kernels, including both precomputed dynamic_scale inputs and gmm1 output scales before they are consumed by downstream grouped matmul ops. - It also refines the MXFP8 force load-balancing path in profiling runs. - This PR also enables npu_gating_top_k from torch_npu instead of custom op when running ascend950 chip. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI and E2E serving tests on Ascend950DT passed. --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-03-25 17:20:28 +08:00
Shaoxu Cheng	e0e585a109	[310P]: add torch chunk gated delta rule and 910b parity ut (#7594 ) ### What this PR does / why we need it? RFC https://github.com/vllm-project/vllm-ascend/issues/7394 Add a PyTorch implementation of the chunk gated delta rule on 310P. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-03-25 16:46:43 +08:00
Marck	17da96658f	[ModelLoader][Feature] Add rfork support for fast model loading (#7392 ) ### What this PR does / why we need it? Support an new load format: RFORK For implementation details of this feature, please refer to #7441 ### Does this PR introduce _any_ user-facing change? add an new options for load-format: rfork e.g. ```bash vllm serve /workspace/models/Qwen3-8B --load-format rfork ``` ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: Marck <1412354149@qq.com>	2026-03-25 16:40:30 +08:00
pichangping	6ddfc41312	[bugfix] Fixed the error issue when overlaying MTP and full decode on DSV3.1 C8. (#7571 ) …DSV3.1 C8. ### What this PR does / why we need it? DeepSeek v3.1 C8 had a hanging issue when overlaying MTP and full graph modes; this pull request resolves that issue. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: `ed359c497a` --------- Signed-off-by: pichangping <1337510399@qq.com>	2026-03-25 14:36:26 +08:00
lilinsiman	95d33f05c2	[eagle3][pcp] fix acceptance rate for eagle3 and pcp enabled (#7549 ) ### What this PR does / why we need it? fix the position 3 acceptance rate for eagle3 and pcp enabled detail: In the merged graph of eagle_proposer, the code logic was changed from updating the code once before the forward pass of the draft model to updating all three positions of common_attn_metadata in the merged graph before performing the forward pass of the model. As a result, the update of position 2 and position 3 affected the update of position 1. For example, in the following field: common_attn_metadata.block_table_tensor[:batch_size] = common_attn_metadata.block_table_tensor[block_indices] When updating the block_table_tensor at position 2, the modification of this field occurred at the original address of common_attn_metadata. As a result, the parameter at position 1 was also modified, but the forward pass at position 1 had not been performed. Therefore, a copy of the address of block_table_tensor needs to be made, and the modification needs to be performed on the new address to ensure complete isolation between positions. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? tests and ut - vLLM version: v0.18.0 - vLLM main: `8b6325758c` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2026-03-25 11:52:04 +08:00
meihanc	114ec75a06	[bugfix][CI] fix '_OpNamespace' 'vllm' object has no attribute 'qkv_rmsnorm_rope' (#7620 ) ### What this PR does / why we need it? fix '_OpNamespace' 'vllm' object has no attribute 'qkv_rmsnorm_rope' by uinstall triton - vLLM version: v0.18.0 - vLLM main: `ed359c497a` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-03-25 11:05:34 +08:00
Li Wang	8e3f8bab57	[Nightly] Nightly pre-build image (#7388 ) ### What this PR does / why we need it? This pull request refactor nightly image build and simplify the logic of multi workflows. 1. Nightly image build become the prerequisite when the test are triggered by `schedule` or `workflow_dispatch` 2. Simplify the pull request select case logic 3. Next step: Implement replaceable nightly tests. Specifically, if nightly tests are manually triggered, they can accept any optional docker image to meet the needs of different commits(Which means the image is customizable). ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-03-25 09:24:01 +08:00
Yaphets24	8977be1df3	[Bugfix]Fix deepseek 3.2 C8 precision by rotary tensor (#7537 ) ### What this PR does / why we need it? During the attention quantization process of DeepSeek V3.2, it is necessary to retrieve the Hadamard matrix from the weights to facilitate the computation. ### Does this PR introduce _any_ user-facing change? No. But there will be two new tensor in quant weight. ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: `8b6325758c` --------- Signed-off-by: mayumeng <m30059191@china.huawei.com> Co-authored-by: mayumeng <m30059191@china.huawei.com>	2026-03-25 09:18:00 +08:00
Ronald	d96440924a	adapt to main2main for model runner v2 (#7578 ) ### What this PR does / why we need it? This PR aims to adapt to newest commit of vllm main branch for model runner v2. please refer to https://github.com/vllm-project/vllm-ascend/issues/5208 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: `ed359c497a` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-03-25 09:08:44 +08:00
Zhu Yi Lin	fc3ec100bc	[Patch] Fix balance scheduling (#7611 ) ### What this PR does / why we need it? This PR introduces a "balance scheduling" feature, enabled by the `VLLM_ASCEND_BALANCE_SCHEDULING` environment variable. This feature adjusts the scheduling logic to better balance the load across data-parallel workers, preventing a single worker from blocking scheduling for others. This can improve overall throughput. Additionally, this PR includes a number of other updates and fixes to the scheduler, syncing it with a more recent version of the upstream vLLM scheduler. These changes include: - Handling for paused scheduler state. - Support for Mamba block-aligned splits. - Handling for streaming requests. - Refinements in preemption logic and resource management (KV cache, encoder cache). - General code refactoring for clarity and correctness. Fixes # ### Does this PR introduce _any_ user-facing change? Yes, this PR introduces a new feature controlled by the `VLLM_ASCEND_BALANCE_SCHEDULING` environment variable. When enabled, the scheduling behavior changes, which could affect performance and request throughput. ### How was this patch tested? CI passed. Further testing should be done to validate the performance and correctness of the new scheduling logic under various workloads, with and without the feature flag enabled. Signed-off-by: GDzhu01 <809721801@qq.com>	2026-03-25 08:57:06 +08:00
Shaoxu Cheng	3f4087a8f0	[310P]fused recurrent gated delta rule pytorch core and ut (#7398 ) ### What this PR does / why we need it? RFC https://github.com/vllm-project/vllm-ascend/issues/7394 Add a PyTorch implementation of the fused recurrent gated delta ruler on 310P. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT - vLLM version: v0.17.0 - vLLM main: `4497431df6` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-25 08:53:14 +08:00
drizzlezyk	54879467c4	[CI] refine issue triage rules, wan regex and update stale setting (#7531 ) - Update issue labeler regex for wan to match numeric suffix only, including both standalone wan label and multi-modality-generate aggregate rule. - Add title-based gate conditions in issue triage workflow so auto-labeling runs only for expected issue templates ( [Bug]: , [Installation]: , [Usage]: , [Doc]: ). - Adjust scheduled stale workflow configuration for the awaiting-feedback processing block. ### What this PR does / why we need it? - Update issue labeler regex for wan to match numeric suffixes only, in both: - standalone wan label rule - multi-modality-generate aggregate rule - Add title-based gate conditions in issue triage workflow so auto-labeling runs only for expected templates: [Bug]:/ [Installation]:/ [Usage]:/ [Doc]: - Adjust the scheduled stale workflow configuration for the awaiting-feedback processing block. ### Does this PR introduce _any_ user-facing change? - No runtime/API user-facing change. - This PR only updates repository automation behavior in GitHub workflows and issue labeling rules. ### How was this patch tested? - Performed config-level validation by reviewing diffs and final YAML content for: - .github/issue-labeler.yml - .github/workflows/bot_issue_manage.yaml - .github/workflows/schedule_stale_manage.yaml - Verified wan regex now requires numeric suffix (e.g., wan2 , wan2.1 ) and no longer matches alphabetic suffix forms (e.g., wana ). - Verified triage workflow includes title-based if conditions for expected issue templates. - Verified stale workflow’s awaiting-feedback block reflects the intended configuration adjustment. - No unit/e2e tests were added because this PR changes GitHub Actions and labeling configuration only. - vLLM version: v0.18.0 - vLLM main: `8b6325758c` --------- Signed-off-by: drizzlezyk <drizzlezyk@163.com>	2026-03-24 20:11:31 +08:00
SILONG ZENG	1e3c1e76bf	[Lint]Add lint hooks for clang-format, shellcheck, forbidden imports, and boolean context manager checks (#7511 ) ### What this PR does / why we need it? This PR introduces several upstream `vllm`-aligned lint hooks into `vllm-ascend` and makes them part of the actual `pre-commit` flow. Main changes in this PR: - add `check-boolean-context-manager` to catch boolean expressions in `with` statements - add `check-forbidden-imports` to forbid direct `re` imports and disallowed direct `triton` imports - enable shell script linting through `tools/shellcheck.sh` - add root `.clang-format` aligned with upstream `vllm`, enable `clang-format` in `pre-commit`, temporarily exclude all `csrc/` from `clang-format` to avoid bringing a large native code reformat into this PR This PR focuses on landing the smaller and immediately useful lint alignment first, without mixing in the larger requirements-management migration. ### Does this PR introduce _any_ user-facing change? No. This PR only updates repository lint configuration, static checks, and internal import/style enforcement. It does not change runtime behavior or public interfaces. ### How was this patch tested? Tested locally in the project virtual environment. Commands used: ```bash bash format.sh ``` Verified checks passed: ``` bash ruff check...............................................................Passed ruff format..............................................................Passed codespell................................................................Passed typos....................................................................Passed clang-format.............................................................Passed Lint GitHub Actions workflow files.......................................Passed Lint shell scripts.......................................................Passed Lint PNG exports from excalidraw.........................................Passed Check for spaces in all filenames........................................Passed Enforce __init__.py in Python packages...................................Passed Check for forbidden imports..............................................Passed Check for boolean ops in with-statements.................................Passed Suggestion...............................................................Passed - hook id: suggestion - duration: 0s To bypass pre-commit hooks, add --no-verify to git commit. ``` note: clang-format is enabled but currently excludes all csrc/ - vLLM version: v0.17.0 - vLLM main: `8b6325758c` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-03-24 20:03:01 +08:00
rjg-lyh	d1a83a72f7	[doc] add enable_sparse_c8 option in configuration options (#7600 ) ### What this PR does / why we need it? This PR adds enable_sparse_c8 option in configuration options - vLLM version: v0.18.0 - vLLM main: `ed359c497a` Signed-off-by: rjg-lyh <1318825571@qq.com>	2026-03-24 19:36:34 +08:00
zouyida2052	0210cc0b07	lower log level in PD Disaggregation (#7589 ) ### What this PR does / why we need it? This log is printed too frequently and unecessary, Thus lowering its level from INFO to DEBUG. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: `ed359c497a` --------- Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2026-03-24 18:03:17 +08:00
lhp-deep	0e3186f07c	[model_runner_v2]:optimize the performance of the _compute_slot_mappings_kernel (#7575 ) ### What this PR does / why we need it? This PR optimizes the `_compute_slot_mappings_kernel` for Ascend NPUs to improve performance. The key changes include: - A new Triton kernel implementation (`_compute_slot_mappings_kernel`) with NPU-specific optimizations, such as using `tl.gather` to handle non-contiguous memory access and replacing modulo operations. - A new method `compute_slot_mappings` in `AscendBlockTables` to use this new kernel. - An end-to-end test to verify the correctness of the new kernel against the reference GPU implementation. The optimization is needed to avoid performance degradation from scalar computation on Ascend devices. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: `ed359c497a` --------- Signed-off-by: lhp-deep <liuhaopeng1@huawei.com>	2026-03-24 17:29:14 +08:00
realliujiaxu	5d12446573	[Feat][SP] Suport SP for VL MoE models (#7044 ) ### What this PR does / why we need it? 2nd PR for https://github.com/vllm-project/vllm-ascend/issues/5712, extend SP to VL MoE models. ### Does this PR introduce _any_ user-facing change? remove `sp_threshold` in additional config and reuse `sp_min_token_num` from vLLM. ### How was this patch tested? - Model: Qwen3-VL-30B-A3B, - TP4 DP2 - 100 reqs - max concurrency 1 \| Seq length \| Mean TTFT (ms) main \| Mean TTFT (ms) this PR \| \|------------\|---------------------\|------------------------\| \| 4k \| 429.40 \| 323.3 \| \| 16k \| 1297.01 \| 911.74 \| - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2026-03-24 17:16:00 +08:00
LeeWenquan	9615bc33fd	Fix Qwen3Next CI Config (#7561 ) ### What this PR does / why we need it? This pr modifies qwen3Next nightly CI config. (1) Add a nightly CI . (2) Set a more precise accuracy standard - vLLM version: v0.18.0 - vLLM main: `6a9cceb219` Signed-off-by: Your Name <you@example.com> Co-authored-by: Your Name <you@example.com>	2026-03-24 17:08:17 +08:00
panchao-hub	d98a0727c8	[Feat] Add npugraph_ex enablement logging (#7574 ) ### What this PR does / why we need it? - Replace local logging with vllm.logger for consistency - Add info log when enable_npugraph_ex is enabled - Add info log when enable_static_kernel is enabled - Unify logging message format to use config switch names consistently - This helps users understand which compilation optimizations are active ### Does this PR introduce _any_ user-facing change? Yes. Users will now see informational log messages when enable_npugraph_ex or enable_static_kernel features are enabled, providing better visibility into the compilation optimization settings being used. ### How was this patch tested? - Code passes all pre-commit hooks (ruff check, ruff format, codespell, typos) - Follows project coding conventions and style guidelines - Logger import matches the pattern used elsewhere in the codebase Signed-off-by: p00465316 <panchao13@huawei.com> Co-authored-by: p00465316 <panchao13@huawei.com>	2026-03-24 17:04:48 +08:00
Angazenn	bdb65319a9	[UT] Align input arguments with Ascend(Yarn)RotaryEmbedding with vLLM and add ut (#7358 ) ### What this PR does / why we need it? This PR adds missing arguments in `AscendRotaryEmbedding`, `AscendYarnRotaryEmbedding` to conform with vLLM. Besides, corresponding ut is introduced. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-03-24 16:02:56 +08:00
liziyu	568b6d0601	[P/D] Check wildcard address for layerwise connector (#7389 ) ### What this PR does / why we need it? Check wildcard address address for layerwise connector - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2026-03-24 15:50:06 +08:00
liziyu	73cadecfb4	[P/D] [Bugfix] fix mooncake layerconnector dead when update_decoder_info fail (#7514 ) ### What this PR does / why we need it? Fix mooncake layerconnector dead when update_decoder_info fail. For the scenario where node D is dead, node P failing to update_decoder_info should not cause node P to become dead. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? by CI - vLLM version: v0.17.0 - vLLM main: `8b6325758c` --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2026-03-24 15:49:46 +08:00
zxr2333	67aad1fce8	[BugFix][P/D] fix padding error on FullGraph mode && fix layerwise connector mamba accuracy (#7506 ) ### What this PR does / why we need it? 1. When the FullGraph mode is used, the branches in the Triton operator are compiled and fixed during the graph capture process, causing the branch condition in the `fused_recurrent_gated_delta_rule` operator, which checks whether `ssm_state_indices >= 0` before writing to the SSM cache, to become invalid. Now, the write operation is performed regardless of the value. This results in the operator performing address offset calculations and writing to the SSM cache based on the -1 offset after -1 is used for padding in vLLM GDN backend. Since the conv cache and SSM cache in vLLM Ascend implementation are actually a single continuous tensor divided into two parts, this leads to data overwriting and the generation of NaN values. This PR addresses two cases where padding -1 is required in the GDN metadata builder. The same logic is used to replace the padding with 0 to avoid the problem of memory overwriting, because block 0 is a reserved block. 2. Fix layerwise connector bug for mamba cache sending on heterogeneous TP. - vLLM version: v0.17.0 - vLLM main: `8b6325758c` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-03-24 15:15:55 +08:00
LeeWenquan	475b4b0cea	Revert "GMM custom operator optimization in small batch scenarios (vllm-project#7100)" (#7557 ) ### What this PR does / why we need it? This reverts commit `42bcad7e9b`. The commit cause accuracy decrease of qwen3Next, 150 items of gsm8k, 98 -> 91. - vLLM version: v0.18.0 - vLLM main: `6a9cceb219` Signed-off-by: Your Name <you@example.com> Co-authored-by: Your Name <you@example.com>	2026-03-24 14:24:44 +08:00
Shaoxu Cheng	83bd77c983	[310p]: add rmsnorm gated fallback and unit test (#7424 ) ### What this PR does / why we need it? RFC #7394 310P cannot use the fused `rmsnormgated` operator and must fall back to the native implementation. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? ut - vLLM version: v0.17.0 - vLLM main: `4497431df6` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-03-24 09:00:11 +08:00
jiaojiao	1de805ce0a	[Ops][Misc] Refactor and optimize CausalConv1d for Ascend (#7495 ) ### What this PR does / why we need it? During the prefill phase of Qwen3-Next and Qwen3.5, the `torch.ops._C_ascend.causal_conv1d_fn` operator exhibits significant performance bottlenecks. To address this, we have re-implemented the optimization using `torch.ops._C_ascend.npu_causal_conv1d_custom`. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? 1 accuracy test ``` [2026-03-20 16:44:22,961] [ais_bench] [INFO] Start launch task state board ... +-----------------------------+-----------+------------+-------------+----------+-------------------------------------------+---------------------+ \| Task Name \| Process \| Progress \| Time Cost \| Status \| Log Path \| Extend Parameters \| +=============================+===========+============+=============+==========+===========================================+=====================+ \| vllm-api-general-chat/gsm8k \| 2918978 \| NA \| 0:00:01 \| finish \| logs/eval/vllm-api-general-chat/gsm8k.out \| None \| +-----------------------------+-----------+------------+-------------+----------+-------------------------------------------+---------------------+ [2026-03-20 16:44:34,284] [ais_bench] [INFO] Evaluation tasks completed. [2026-03-20 16:44:34,287] [ais_bench] [INFO] Summarizing evaluation results... dataset version metric mode vllm-api-general-chat --------- --------- -------- ------ ----------------------- gsm8k 271d0b accuracy gen 96.21 ``` 2 ut modify test `pytest -sv /home/c30006096/vllm-ascend/tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_causal_conv1d.py::test_ascend_causal_conv1d` - vLLM version: v0.17.0 - vLLM main: `8b6325758c` Signed-off-by: wenba0 <3054239545@qq.com> Signed-off-by: jiaojiao <56385650+wenba0@users.noreply.github.com>	2026-03-24 00:07:12 +08:00

1 2 3 4 5 ...

2757 Commits