xc-llm-ascend

Author	SHA1	Message	Date
zyz111222	dd7e08c6db	[Performance] Use forward_native for Conv3dLayer and add UT (#8375 ) What this PR does / why we need it? switch Ascend conv3d forward_oot to use forward_native and add ut Does this PR introduce any user-facing change? No How was this patch tested? by CI --------- Signed-off-by: zouyizhou <zouyizhou@huawei.com>	2026-04-20 17:20:40 +08:00
LQLlulu	c124e8df07	Revert "[BugFix] dispatch_ffn_combine kernal rollback combinev2 part … (#8439 ) …(#8405)" This reverts commit `b992b11545`. <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> Signed-off-by: l00893928 <liuquanlu@huawei.com> Co-authored-by: l00893928 <liuquanlu@huawei.com>	2026-04-20 14:26:55 +08:00
zhangxinyuehfad	7a706fb197	[v0.18.0][CI] fix report_template.md (#8429 ) ### What this PR does / why we need it? fix report_template.md the error caused by https://github.com/vllm-project/vllm-ascend/pull/8340 Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-04-20 11:13:14 +08:00
wangbj127	e6ba5a88f7	[v0.18.0][BugFix] Fix Qwen3.5 MoE FC1 error under high concurrency when dp>1 (#8395 ) ### What this PR does / why we need it? GDN Attention uses FIA's query_start_loc (padded), which may cause conv1d update errors under high concurrency when dp > 1, and this PR is to make GDN use its own query_start_loc (unpadded). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.18.0 Signed-off-by: Wangbingjie <wangbj1207@126.com>	2026-04-20 10:26:19 +08:00
LQLlulu	b992b11545	[BugFix] dispatch_ffn_combine kernal rollback combinev2 part (#8405 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> Signed-off-by: l00893928 <liuquanlu@huawei.com> Co-authored-by: l00893928 <liuquanlu@huawei.com>	2026-04-18 22:45:08 +08:00
wangbj127	6bdc72949b	Revert "[v0.18.0][BugFix] Fix dimension mismatch error when SP padding causes num_tokens_padded != num_tokens_unpadded" (#8413 ) Reverts vllm-project/vllm-ascend#8133 - Reversion of Logic: This pull request reverts the changes introduced in a previous commit that attempted to handle dimension mismatches during SP padding. Signed-off-by: Wangbingjie <wangbj1207@126.com>	2026-04-18 20:43:42 +08:00
wangxiaoteng888	363febb6cb	[BugFix][v0.18.0] Gate recompute/balance/fused_mc2 by PD mode (#8374 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? - Enforce recompute scheduler only in PD-disaggregated mode. - Enforce balance scheduling only in PD-mixed mode. - Enforce fused MC2 only on PD-disaggregated D-side (kv_consumer). <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> ### Does this PR introduce _any_ user-facing change? No <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? By ci <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-04-18 18:06:42 +08:00
1kzk	c995a959e6	[BugFix] fix hang in async scheduling while open ENPU (#8354 ) ### What this PR does / why we need it? 1. there is no synchronization between steps. However, in async scheduling with aclgraph, it is possible that the CPU's record event for the current iteration completes before the previous iteration's graph execution has finished. If cpu is fast enough, device will hang on event_wait in interation i+1 (assume that event_record is executed immediately on update stream of device). 2. Under ENPU, eagle proposers also need to follow event.record first, and then event.Wait. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? --------- Signed-off-by: 1zzk <785396250@qq.com>	2026-04-18 00:07:15 +08:00
shaopeng-666	f81f9a3c89	[Doc] Add Qwen3.5 fused MC2 known issue for 0.18.0 release (#8378 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # -->show known issues for Qwen3.5-397B ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. -->NO ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. -->NA --------- Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>	2026-04-17 22:54:21 +08:00
wangbj127	f2956ce944	[v0.18.0][BugFix] Fix dimension mismatch error when SP padding causes num_tokens_padded != num_tokens_unpadded (#8133 ) Cherry-picked from https://github.com/vllm-project/vllm-ascend/pull/7858 ### What this PR does / why we need it? This PR fixes a `RuntimeError` (dimension mismatch) that occurs when Sequence Parallelism (SP) is enabled and the padding added for SP causes `num_tokens_padded` to differ from `num_tokens_unpadded`. In such cases, `_pad_query_start_loc_for_fia` adds a dummy request, increasing `num_reqs_padded`. This mismatch between the actual number of requests and the padded number of requests leads to errors in downstream token count computations (e.g., `compute_num_computed_tokens`). The fix modifies the restrictive condition `num_tokens_padded == num_tokens_unpadded` when reverting the dummy request padding if SP is enabled, as SP padding is handled by stripping it after communication and should not be treated as an additional request in the attention metadata. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? vLLM version: v0.18.0 vLLM-Ascend version: releases/v0.18.0 Signed-off-by: Wangbj127 <wangbj1207@126.com>	2026-04-17 22:50:22 +08:00
aipaes	0954fd0912	[BugFix][0.18.0] Fix quant_bias missing in w8a8_static when flashcomm1 is enabled for GLM-5 (#8304 ) ### What this PR does / why we need it? PR #8220 in v0.18.0 In a previous PR #7843 , the o_proj layer of GLM-5 was reverted to TP (Tensor Parallel) splitting when flashcomm1 was enabled. However, this was a temporary workaround and did not address the root cause of the precision issues observed in the o_proj layer under flashcomm1. I am working on a definitive fix for this issue. Currently, a clear bug has been identified in `880e20fdde/vllm_ascend/quantization/methods/w8a8_static.py (L124)`: during quantized matrix multiplication, quant_bias is not added if tp_rank > 0. In the flashcomm1 scenario, all ranks actually require the addition of quant_bias, meaning tp_rank=0 should be passed to ensure the bias is applied correctly. This PR aims to resolve this logic error and fix the underlying precision issue. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? glm5 e2e test --------- Signed-off-by: zjks98 <zhangjiakang4@huawei.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: triomino <15924998+triomino@users.noreply.github.com> Co-authored-by: zjks98 <zhangjiakang4@huawei.com>	2026-04-17 22:46:36 +08:00
Zetong Li	b72ade9acd	[0.18.0][BugFix] Update capture sizes after rounding operations (#8380 ) ### What this PR does / why we need it? This PR is partially cherry-picked from #8172. This PR aims to fix mismatched capture sizes after rounding operations when using sp or speculative. The reason is that original `self.cudagraph_capture_sizes` is no longer updated and remains as the initial sizes. Now we use `self.cudagraph_dispatcher.get_capture_descs` to the get up-to-date sizes. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci Signed-off-by: Zetong Li <slippersss@126.com>	2026-04-17 22:46:16 +08:00
herizhen	76cc2204bd	[Doc] Sensitive word modification (#8303 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? This PR updates the documentation to replace specific hardware terms (e.g., HBM, 910B, 310P) with more generic or branded terms (e.g., on-chip memory, Atlas inference products) to comply with sensitive word requirements. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>	2026-04-17 16:30:00 +08:00
vllm-ascend-ci	9c1d58f4d2	[v0.18.0][Doc] Translated Doc files 2026-04-15 (#8309 ) ## Auto-Translation Summary Translated 19 file(s): - <code>docs/source/locale/zh_CN/LC_MESSAGES/community/contributors.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/community/versioning_policy.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/KV_Cache_Pool_Guide.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/ModelRunner_prepare_inputs.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/cpu_binding.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/long_sequence_context_parallel_multi_node.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/long_sequence_context_parallel_single_node.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_disaggregation_mooncake_multi_node.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_disaggregation_mooncake_single_node.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Kimi-K2.5.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen2.5-Omni.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-Dense.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3.5-397B-A17B.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/Fine_grained_TP.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/epd_disaggregation.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/external_dp.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/large_scale_ep.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/release_notes.po</code> --- [Workflow run](https://github.com/vllm-project/vllm-ascend/actions/runs/24447109402) Signed-off-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com> Co-authored-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>	2026-04-17 16:29:30 +08:00
pz1116	ceb1e49661	[BugFix][v0.18.0] fix remote KV waiting promotion in balance scheduler (#8280 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? ## Problem In PD-disaggregated serving with `mooncake_connector` and `VLLM_ASCEND_BALANCE_SCHEDULING=1`, requests may enter `WAITING_FOR_REMOTE_KVS` and never be promoted back to runnable state after remote KV transfer finishes. The issue is in `BalanceScheduler`'s handling of `WAITING_FOR_REMOTE_KVS` requests. The current code treats `_update_waiting_for_remote_kv()` as if it returns a boolean readiness flag: ```python is_ready = self._update_waiting_for_remote_kv(request) if is_ready: ... else: ... ``` ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>	2026-04-17 10:06:36 +08:00
Frank Chen	f85144cc57	[BugFix][DSv32] Fix DSA-CP PD role gating for deepseek v3.2 (v0.18.0) (#8291 ) ### What this PR does / why we need it? This PR backports the DSA-CP PD role gating fix to `releases/v0.18.0`. The existing helper logic on the release branch does not handle the PD mixed-role case correctly when deciding whether layer sharding or TP `o_proj` handling should be enabled. Layer sharding should only run on the P-side instance, while TP `o_proj` handling should stay enabled for normal non-PD deployments and for the PD mixed-role (`kv_both`) instance. This patch makes those conditions explicit and adds unit coverage for the allowed and disallowed combinations, including the DSA-CP-disabled path. Such wrong condition lead to vllm serve failures in case: FC1 + PD-colocated KV pooling + no layer_sharding, specifically causing: 1. insufficient Available KV cache memory 2. o_proj shape error in sfa_v1 attention module ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E test with dsv32 + FC1 + FULL_DECODE_ONLY + kv_transfer_config(kv_both) + no layer_sharding --------- Signed-off-by: chenchuw886 <chenchuw@huawei.com> Co-authored-by: chenchuw886 <chenchuw@huawei.com>	2026-04-17 10:05:40 +08:00
sunshine202600	1dd1de8153	[Doc][Misc] Improve readability and fix typos in documentation (#8340 ) ### What this PR does / why we need it? This PR improves the readability of the documentation by fixing typos, correcting command extensions, and fixing broken links in the Chinese README. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Documentation changes only. --------- Signed-off-by: sunshine202600 <sunshine202600@163.com>	2026-04-17 08:54:38 +08:00
csoulnd	8952fddc7e	[BugFix][310p][Cherry-pick] Handle null quantization config in ShardedStateLoader310&[Feature][310P] Support W8A8 dynamic linear method (#8296 ) ### What this PR does / why we need it? This PR implements the `AscendW8A8DynamicLinearMethod310` quantization scheme specifically for 310P hardware. It includes the logic for weight retrieval, per-channel parameter generation, and the application of dynamic quantization using NPU-specific kernels. Additionally, it updates `ShardedStateLoader310` to handle quantization configurations more robustly when generating parameter type maps. Feedback from the review identified two critical issues in the implementation: 1. The tensor squeezing logic in the `apply` method incorrectly handles 2D inputs, which may lead to shape mismatches in subsequent layers. 2. The weight tensor in `process_weights_after_loading` is transposed after being converted to the private NZ format; the transpose operation should be performed on the ND tensor before conversion to ensure correct physical layout. cherry-pick from : #7546 #7725 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests were added in `tests/ut/_310p/quantization/test_w8a8_dynamic_310.py` to verify the quantization method, and `tests/ut/_310p/test_sharded_state_loader_310p.py` was updated to test the state loader changes. --------- Signed-off-by: csoulnd <daidaicurry@foxmail.com>	2026-04-16 16:53:39 +08:00
1kzk	52f0f9b5e4	[0.18.0][BugFix]: order acl graph updates before model forward for ENPU (#8317 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> For the ENPU scenario, it is required that device events follow the principle of "record first, wait later", otherwise the inference process may become stuck. However, in the current model_forward function, event.wait precedes event.record. Therefore, for the ENPU scenario, graph parameter updates should be performed before model execution. ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> N/A ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> --------- Signed-off-by: 1zzk <785396250@qq.com> Signed-off-by: 1kzk <785396250@qq.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-04-16 16:26:59 +08:00
Frank Chen	2ac8bfb4cb	[BugFix] Enforce C locale for CPU binding subprocess parsing (#8261 ) ### What this PR does / why we need it? This PR backports the CPU binding locale normalization fix from #7274 to `releases/v0.18.0`, including the follow-up review fixes already applied on `main`. The change forces `LC_ALL`, `LANG`, and `LC_MESSAGES` to `C` before spawning subprocesses in `vllm_ascend.cpu_binding.execute_command()`, so parser-dependent command output stays stable on localized systems. It also handles `subprocess.TimeoutExpired` by killing the child process before collecting output, and updates the existing unit tests to keep command-argument coverage while adding timeout-path coverage. Fixes #6992 ### Does this PR introduce _any_ user-facing change? Yes. Users running CPU binding on non-English OS environments should now get consistent English subprocess output for parser-dependent commands, avoiding failures caused by inherited locale settings. ### How was this patch tested? - Updated the existing unit tests in `tests/ut/device_allocator/test_cpu_binding.py` to assert the locale environment, retain command argument coverage, and cover the timeout cleanup path. - Attempted to run targeted pytest cases locally, but the pytest invocation did not complete normally in this environment, so I could not record a clean passing run here. Attribution: - Co-authored-by: stdjhs <1601599324@qq.com> - Signed-off-by: chenchuw886 <chenchuw@huawei.com> Signed-off-by: chenchuw886 <chenchuw@huawei.com> Co-authored-by: chenchuw886 <chenchuw@huawei.com> Co-authored-by: stdjhs <1601599324@qq.com>	2026-04-16 16:17:10 +08:00
pz1116	8bc72a807a	[BugFix][v0.18.0] require piecewise cudagraph for layerwise AscendSto… (#8282 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? ref:https://github.com/vllm-project/vllm-ascend/issues/8184 following https://github.com/vllm-project/vllm/pull/31057, add `requires_piecewise_for_cudagraph` for `AscendStoreConnector` ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>	2026-04-16 10:40:14 +08:00
chenweiqiang11	028b8cabc4	[BugFix][Platform] Fix extra function name in final chunk of streaming tool calls (#8178 ) ### What this PR does / why we need it? Fix a bug in the GLM tool call parser where the `function.name` field was incorrectly included in the final (non-first) chunks of streaming tool calls. Per OpenAI streaming semantics, `id`, `type`, and `function.name` must only appear in the first chunk for a given tool call index. When `_create_remaining_args_delta` was called for continuing/finishing chunks, it was incorrectly reading the function name from `delta_message.tool_calls` and re-emitting it, causing clients to see a duplicate/extra function name in the final chunk. Root cause: The original code always looked up the tool call in `delta_message.tool_calls` to get the name, id, and type — even when this was not the first chunk being streamed. This caused the function name to appear again in the final argument-completion chunk. Fix: - Track whether arguments have already been streamed (`already_streamed_args`) for each tool call index. - Only populate `fallback_tool_call_id`, `fallback_tool_call_type`, and `fallback_tool_call_name` when `already_streamed_args` is empty (i.e., this is genuinely the first chunk). - Refactored `_create_remaining_args_delta` to omit header fields entirely when all fallback values are `None`, which is the correct behavior for continuing/finishing chunks. ### Does this PR introduce _any_ user-facing change? Yes. Clients consuming the streaming tool call response will no longer receive a duplicate `function.name` in the final chunk. This fixes incorrect behavior visible in the OpenAI-compatible streaming API output for GLM models using tool calls. ### How was this patch tested? - Code review and logic analysis of the streaming tool call path in `patch_glm_tool_call_parser.py`. - Existing unit tests in `tests/ut/platform/test_patch_glm_tool_call_parser.py`. --------- Signed-off-by: chen-weipeng12 <chen-weipeng12@noreply.gitcode.com> Signed-off-by: chenweiqiang11 <chenweiqiang11@noreply.github.com> Co-authored-by: chen-weipeng12 <chen-weipeng12@noreply.gitcode.com>	2026-04-15 17:50:10 +08:00
zhangxinyuehfad	808d00406f	[v0.18.0][CI]Add rank0 process count check for DeepSeek-R1-W8A8-HBM test (#8072 ) ### What this PR does / why we need it? Adds a `check_rank0_process_count` validation step to the DeepSeek-R1-W8A8-HBM nightly single-node test. The check verifies that after the server starts, there is exactly 1 `vllm serve` process running on rank0. This guards against the regression fixed in #8041 (extra NPU context leaking on device 0), ensuring it does not silently reappear in future releases. #### Changes - `tests/e2e/nightly/single_node/models/scripts/test_single_node.py`: Add `run_check_rank0_process_count` async handler. It calls `npu-smi info` for diagnostics, then uses `psutil` to assert exactly one `vllm serve` process exists on rank0. - `tests/e2e/nightly/single_node/models/configs/DeepSeek-R1-W8A8-HBM.yaml`: Register `check_rank0_process_count` in the `test_content` list for the HBM test case. Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-04-15 17:16:27 +08:00
herizhen	95726d20eb	[Doc][Misc] Correcting the document and uploading the model deployment template (#8287 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? Correcting the document and uploading the model deployment template ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>	2026-04-15 16:03:11 +08:00
vllm-ascend-ci	147b589f62	[v0.18.0][Doc] Translated Doc files 2026-04-14 (#8257 ) ## Auto-Translation Summary Translated 102 file(s): - <code>docs/source/locale/zh_CN/LC_MESSAGES/community/contributors.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/community/governance.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/community/user_stories/index.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/community/user_stories/llamafactory.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/community/versioning_policy.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/patch.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/contribution/index.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/contribution/testing.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_evalscope.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_lm_eval.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_opencompass.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/performance_and_debug/msprobe_guide.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/performance_and_debug/performance_benchmark.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/performance_and_debug/service_profiling_guide.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/faqs.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/index.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/installation.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/quick_start.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/configuration/additional_config.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/graph_mode.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/lora.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/quantization.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/sleep_mode.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/structured_output.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/release_notes.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/support_matrix/index.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/support_matrix/supported_features.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/support_matrix/supported_models.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/ACL_Graph.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/KV_Cache_Pool_Guide.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/ModelRunner_prepare_inputs.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/add_custom_aclnn_op.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/context_parallel.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/cpu_binding.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/disaggregated_prefill.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/eplb_swift_balancer.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/npugraph_ex.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/quantization.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/contribution/multi_node_test.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_ais_bench.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/performance_and_debug/optimization_and_tuning.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/index.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/long_sequence_context_parallel_multi_node.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/long_sequence_context_parallel_single_node.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_colocated_mooncake_multi_instance.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_disaggregation_mooncake_multi_node.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_disaggregation_mooncake_single_node.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/ray.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/suffix_speculative_decoding.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/hardwares/310p.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/hardwares/index.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/DeepSeek-R1.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/DeepSeek-V3.1.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/DeepSeek-V3.2.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/GLM4.x.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/GLM5.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Kimi-K2-Thinking.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Kimi-K2.5.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/MiniMax-M2.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/PaddleOCR-VL.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen-VL-Dense.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen2.5-7B.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen2.5-Omni.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-235B-A22B.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-30B-A3B.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-32B-W4A4.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-8B-W4A8.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-Coder-30B-A3B.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-Dense.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-Next.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-VL-235B-A22B-Instruct.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-VL-30B-A3B-Instruct.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-VL-Embedding.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-VL-Reranker.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3.5-27B.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3.5-397B-A17B.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3_embedding.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3_reranker.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/index.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/deployment_guide/index.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/deployment_guide/using_volcano_kthena.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/Fine_grained_TP.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/Multi_Token_Prediction.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/batch_invariance.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/context_parallel.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/cpu_binding.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/dynamic_batch.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/epd_disaggregation.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/eplb_swift_balancer.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/external_dp.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/kv_pool.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/large_scale_ep.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/layer_sharding.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/lmcache_ascend_deployment.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/netloader.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/npugraph_ex.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/rfork.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/sequence_parallelism.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/speculative_decoding.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/ucm_deployment.po</code> - <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/weight_prefetch.po</code> --- [Workflow run](https://github.com/vllm-project/vllm-ascend/actions/runs/24390263284) Signed-off-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com> Co-authored-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>	2026-04-15 15:27:09 +08:00
Zetong Li	b6aa5bbdbf	[0.18.0][BugFix] Add PrefillNoCache state in mla _forward_decode for short prompt (#8264 ) ### What this PR does / why we need it? This PR is cherry-pick from #8263. This PR aims to fix short prompt problem. The root cause can be found in #8029. Since the previous pr may miss mixed long and short prompt batch, after discussion, we decide to add PrefillNoCache state in mla _forward_decode now instead. Signed-off-by: Zetong Li <slippersss@126.com>	2026-04-15 09:23:52 +08:00
Qiu	70713c3fd4	[cherry-pick][BugFix] Improve max_cudagraph_capture_size validation (#8252 ) ### What this PR does / why we need it? This PR improves the validation of `max_cudagraph_capture_size` by comparing it against the potential maximum tokens required for decoding, derived from the scheduler configuration. It introduces a warning to alert users when the capture size might be insufficient for the workload, which could lead to suboptimal performance. ref: #8227 ### Does this PR introduce _any_ user-facing change? Yes, a warning log is added when the `max_cudagraph_capture_size` is smaller than the potential decode workload. --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-04-14 22:00:10 +08:00
Yaphets24	13c7392416	[BugFix] fix dsv3.1 service failed to start (#8207 ) ### What this PR does / why we need it? This PR fixes a service startup failure for DeepSeek-V3.1 models by removing a strict type assertion for `MLAAttentionSpec` in `NPUModelRunner.get_kv_cache_spec`. The assertion was failing due to class identity mismatches caused by the runtime patching of `MLAAttentionSpec` with `AscendMLAAttentionSpec`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Verified that the service starts correctly for DSV3.1 models. Signed-off-by: mayumeng <m30059191@china.huawei.com> Co-authored-by: mayumeng <m30059191@china.huawei.com>	2026-04-14 17:52:55 +08:00
wangbj127	d94b1dc2d0	[v0.18.0][BugFix] Fix Qwen3.5 MoE flash comm v1 shared expert shape error of mtp layer on A2 (#8004 ) ### What this PR does / why we need it? Fix Qwen3.5 MoE MTP layer shared expert shape error when flash comm v1 is enabled. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: `35141a7eed` Signed-off-by: Wangbingjie <wangbj1207@126.com>	2026-04-13 17:36:09 +08:00
wangxiaoteng888	39c071a0f5	[BugFix][P/D][0.18.0]Add a retry mechanism to prevent packet loss (#8167 ) ### What this PR does / why we need it? Add a retry mechanism to prevent packet loss Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-04-12 23:30:45 +08:00
wangxiaoteng888	4adc6a68f5	[BugFix][P/D][0.18.0]bugfix short squence has no respone (#8142 ) ### What this PR does / why we need it? bugfix short squence has no respone. This pull request refactors the event handling for KV cache reshaping in mla_v1.py by centralizing the reshape_cache_event creation and recording within the _mla_preprocess function, ensuring it covers both decode and prefill operations. Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-04-12 23:25:01 +08:00
Frank Chen	31186a3a9d	[BugFix] Add async communication check for capturing mode (#8149 ) ### What this PR does / why we need it? Introduce a check to not using asynchronous communication under `enable_dsa_cp_with_layer_shard` branch on capturing mode. This change prevents potential stream and event issues when operating in graph/capturing mode, ensuring safer communication practices. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E test with dsv32 + FC1 + FULL_DECODE_ONLY + kv_transfer_config(kv_both) --------- Signed-off-by: chenchuw886 <chenchuw@huawei.com> Co-authored-by: chenchuw886 <chenchuw@huawei.com>	2026-04-12 21:52:54 +08:00
SparrowMu	c1f323ee46	[Doc] Add new intro to MiniMax-M2.5/M2.7 (#8169 ) ### What this PR does / why we need it? 1. This PR cherry pick commit that contains current best performance at 3.5k/1.5k and 128k/1k from main to 0.18.0 branch. 2. This PR introduce MiniMax-M2.7 0day information to users. 3. To finish previous step we also changes MiniMax doc name from MiniMax-M2.5.md to MiniMax-M2.md --------- Signed-off-by: limuyuan <limuyuan3@huawei.com> Co-authored-by: limuyuan <limuyuan3@huawei.com>	2026-04-12 21:45:07 +08:00
Angazenn	9e8da00f95	[V0.18.0][Doc] add preemption in FAQs (#8136 ) ### What this PR does / why we need it? This PR adds description of preemption into FAQs in vLLM-Ascend. This FAQ stats: - how preemption affects the performance of a vLLM server. - how reduce the negative impacts of preemption. The reason why we add this FAQ is that we find that the origin description of preemption in vLLM is not very straightforward. If preemption causes performance drop, users might not be aware that this is caused by Preemption. ### Does this PR introduce _any_ user-facing change? No. Signed-off-by: Angazenn <supperccell@163.com>	2026-04-10 17:36:45 +08:00
Nengjun Ma	99cea6c1b5	[CI] Fix the nightly pip binary install doc test fail. (#8129 ) ### What this PR does / why we need it? Fix the nightly pip binary install doc test fail. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? Nightly doc test Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-04-10 17:34:18 +08:00
linfeng-yuan	bd9927d5a9	[releases/v0.18.0][Build][BugFix] support ascend950 npu-smi info interface changes and make SOC_VERSION actually take effect (#8061 ) ### What this PR does / why we need it? Cherry-picked from #8062 This PR adds support for the Ascend950 NPU by updating the `npu-smi info` parsing logic to handle interface changes. It also improves robustness by ensuring that `SOC_VERSION` actually takes effect by disabling `get_chip_type` given this environment variable. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed. Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-04-10 16:44:38 +08:00
ZYang6263	34386c8896	[v0.18.0][CI] Fix and simplify the CI for Qwen3 32B (#8093 ) ### What this PR does / why we need it? This PR fixes and simplifies the CI configuration for Qwen3 32B. The main changes are: - Remove the redundant `Qwen3-32B-Int8-A3-Feature-Stack3.yaml` config and consolidate the CI setup into `Qwen3-32B-Int8.yaml`. - Improve runtime stability by adding `PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` and setting `--max-num-seqs 80`. - Update the accuracy benchmark from `aime2024` to `gsm8k-lite`, and adjust the related dataset config, output length, baseline, and threshold accordingly. These changes make the Qwen3 32B CI easier to maintain and more stable in nightly validation. --------- Signed-off-by: ZYang6263 <zy626375@gmail.com>	2026-04-10 14:22:24 +08:00
DreamerLeader	531d0e6fff	[v0.18.0][BugFix][KV Pool]Fix the conflict between pooling scenarios … (#8101 ) …and PCP across machines <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> Signed-off-by: DreamLeader <2270923832@qq.com>	2026-04-09 21:55:56 +08:00
Zetong Li	054fde7b72	[0.18.0][BugFix] Fix attention state of short prompt for correct forwarding (#8088 ) ### What this PR does / why we need it? This PR is cherry-pick from #8029. This PR aims to fix attention state of short prompt for correct forwarding. Since a batch of short prompts (prefill tokens less than or equal to num_spec_tokens + 1) will be treated as decode requests (by split_decodes_and_prefills), its original PrefillNoCache attention state contradicts. Thus these short prompts will be passed into a mismatched branch and incur errors. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci Signed-off-by: Zetong Li <slippersss@126.com>	2026-04-09 21:21:24 +08:00
weijinqian0	f668ff9ef0	[v0.18.0][BugFix]Revert the code: Replace npu_ring_mla wit FIA with MLA prefill. (#7961 ) This pull request reverts previous changes to switch to FIA and instead implements npu_ring_mla for MLA prefill operations(#5704 ). The change streamlines the attention mechanism by removing unnecessary metadata tracking and updating the underlying NPU operations to use the ring-based MLA kernel. This adjustment ensures better compatibility and performance for MLA prefill tasks within the vLLM Ascend backend. Highlights - Migration to npu_ring_mla: Replaced the usage of npu_fused_infer_attention_score (FIA) with npu_ring_mla for MLA prefill operations across the codebase to improve performance and alignment with the intended architecture. - Cleanup of redundant metadata: Removed chunk_actual_seq_lengths_kv_list and actual_seq_lengths_q from various metadata structures as they are no longer required for the updated attention implementation. - Test suite updates: Updated unit tests in test_mla_cp.py and test_mla_v1.py to mock npu_ring_mla instead of the deprecated FIA functions and adjusted test assertions to reflect the new implementation details. Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2026-04-09 17:00:25 +08:00
linfeng-yuan	7c9aa498d6	[releases/v0.18.0][BugFix] Restore global_bs=0 and mc2_mask for uniform-token dispatching and support inter-node roce hierarchical MC2 communication (#8040 ) ### What this PR does / why we need it? Cherry-picked from #8039 Restore the setting of MC2 `global_bs` and `mc2_mask` handling when `all_reduce` across DP group cannot be skipped. Ascend MC2 ops require `global_bs=0` + `mc2_mask` while enabling inter-node roce hierarchical communication. PR #4983 always passed non-zero `global_bs` without `mc2_mask`, which is incompatible with hierarchy comm raised in PR #7583 Changes: - Add `should_skip_allreduce_across_dp_group()` to `utils.py` with hierarchy constraint - Set `global_bs=0` when allreduce is not skipped; pass `mc2_mask` accordingly - Add `mc2_mask` field to `MoEMC2CombineMetadata` for dispatch→combine propagation ### Does this PR introduce _any_ user-facing change? No. But this PR fixes cross-super-node communication function on A3 with `enable_mc2_hierarchy_comm=True` in `additional_config` and `export HCCL_INTRA_ROCE_ENABLE=1`. ### How was this patch tested? E2E serving succeeded and CI pssed. - vLLM version: v0.18.0 - vLLM main: `14acf429ac` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-04-09 16:51:17 +08:00
Shaoxu Cheng	82e17f693a	[BugFix][0.18.0][310p] fix post-sampling not working in graph mode on 310p (#8077 ) ### What this PR does / why we need it? Enabling temperature in post-processing on 310P devices can cause the service to stall and eventually hang. We first traced the issue to a timeout where the temperature-related `div` operator was waiting for results from a sub-stream. After investigating the preceding operators, we finally identified the root cause as the `q.exponential_()` operator, which is not well supported on 310P and triggers an internal issue in the `add` kernel. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? This patch was thoroughly tested locally（accuracy-dataset test and stress test）. It is not easy to design a proper unit test for this case, and I appreciate your understanding. Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-04-09 16:31:38 +08:00
herizhen	0d1424d81a	[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073 ) What this PR does / why we need it? This pull request performs a comprehensive cleanup of the vLLM Ascend documentation. It fixes numerous typos, grammatical errors, and phrasing issues across community guidelines, developer documents, hardware tutorials, and feature guides. Key improvements include correcting hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code examples (removing duplicate flags and trailing commas), and improving the clarity of technical explanations. These changes are necessary to ensure the documentation is professional, accurate, and easy for users to follow. Does this PR introduce any user-facing change? No, this PR contains documentation-only updates. How was this patch tested? The changes were manually reviewed for accuracy and grammatical correctness. No functional code changes were introduced. --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>	2026-04-09 15:37:57 +08:00
zouyida2052	c40a387f63	[bugfix]fix extra npu context in device 0 (#8041 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? When we launch a PD-disaggregated process and send requests, an additional processes appear on NPU 0, becasue when a thread has a primary cuda context, the child thread it creates automatically doesn't inherit the cuda context. See https://forums.developer.nvidia.com/t/when-a-thread-has-a-primary-cuda-context-does-the-child-thread-it-creates-automatically-inherit-the-cuda-context/362810. vLLM has fixed this issue in [pr-37449 ](https://github.com/vllm-project/vllm/pull/37449), but version 0.18.0 does not include the fix. Therefore, we need to patch it. <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> ### Does this PR introduce _any_ user-facing change? no <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> --------- Signed-off-by: zouyida <zouyida@huawei.com> Co-authored-by: zouyida <zouyida@huawei.com>	2026-04-08 23:35:52 +08:00
hucong	4a628f1042	[UT][v0.18.0] Fix APC nightly UT and TTFT ratio (cherry-pick #7468 ) (#8053 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> Cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/7468 - Fix TTFT ratio threshold from 0.8 to 0.4 for prefix cache benchmarks - Fix max_out_len values for warm_up and benchmark configs - Applied to both DeepSeek-R1-0528-W8A8 and Qwen3-32B-Int8 configs ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> Signed-off-by: underfituu <hzhucong@163.com>	2026-04-08 21:08:26 +08:00
Mengqing Cao	044d4c3974	[v0.18.0]feat(quant): add C8 INT8 KV cache support for GQA attention models (#7474 ) (#8007 ) backport of #7474 This PR adds C8 (INT8) KV cache quantization support for standard GQA attention models (e.g., Qwen3-32B W8A8C8). C8 uses static per-channel quantization scales to store KV cache in INT8, reducing KV cache memory by ~50% compared to BF16, enabling higher batch concurrency and longer context lengths on the same hardware. Key changes: 1. `attention_v1.py` — New `AscendC8AttentionBackendImpl` subclass of `AscendAttentionBackendImpl`: - `_prepare_c8_scales`: Shards per-channel scales/offsets to the current TP rank and pre-computes BF16 BNSD-shaped antiquant tensors (one-time per layer). - `_quantize_kv_to_int8`: Quantizes BF16 K/V to INT8 before `reshape_and_cache`, using pre-cached inverse scales. - `_forward_c8_decode`: FIA V1 BNSD paged attention with native INT8 KV and `perchannel` antiquant mode. - `_forward_c8_chunked_prefill`: Splits decode (FIA V1 BNSD paged INT8) and prefill (FIA V1 TND float) into two kernel calls. - `_forward_c8_fused_infer_attention`: Handles `PrefillNoCache` and `PrefillCacheHit` states. 2. `quantization/methods/kv_c8.py` — New `AscendC8KVCacheAttentionMethod` scheme: - Creates `k/v_cache_scale/offset` parameters via `_c8_kv_scale_weight_loader`, which handles per-channel scale shapes and lazy resizing. - Sets `layer.kv_cache_torch_dtype = torch.int8` so `get_kv_cache_spec()` returns INT8 dtype automatically. - Upgrades `layer.impl` to `AscendC8AttentionBackendImpl` via class surgery. 3. `quantization/modelslim_config.py` — C8 branch in `get_quant_method()` activates when `kv_cache_type == "C8"` in `quant_model_description.json`. 4. `patch/worker/patch_qwen3_c8.py` — Intercepts per-channel C8 scale/offset weights before `AutoWeightsLoader` discards them, routing them to the parameters created by `AscendC8KVCacheAttentionMethod`. 5. `tests/ut/quantization/test_kv_c8.py` — Unit tests covering `_c8_kv_scale_weight_loader`, `AscendC8KVCacheAttentionMethod`, and `AscendC8AttentionBackendImpl` scale helpers. Yes. Users can now serve Qwen3-32B W8A8C8 quantized models with INT8 KV cache on Ascend NPU. The model checkpoint must contain a `quant_model_description.json` with `"kv_cache_type": "C8"` and per-channel scale/offset tensors in safetensors. No changes to the serving CLI — the feature activates automatically when the quantization config is detected. Benchmarked with `vllm serve` (TP=8, `max_num_seqs=256`, `max_model_len=131072`, `enable_chunked_prefill=true`) + `random_bench` (input_len=10240, output_len=2048, 960 prompts, max_concurrency=192): ``` ============ Serving Benchmark Result ============ Successful requests: 960 Failed requests: 0 Maximum request concurrency: 192 Benchmark duration (s): 1359.81 Total input tokens: 9830400 Total generated tokens: 1966080 Request throughput (req/s): 0.71 Output token throughput (tok/s): 1445.85 Peak output token throughput (tok/s): 2304.00 Total token throughput (tok/s): 8675.12 ---------------Time to First Token---------------- Mean TTFT (ms): 24598.51 Median TTFT (ms): 23167.02 P50 TTFT (ms): 23167.02 P90 TTFT (ms): 47717.08 P99 TTFT (ms): 84402.61 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 120.76 Median TPOT (ms): 121.50 P50 TPOT (ms): 121.50 P90 TPOT (ms): 127.05 P99 TPOT (ms): 130.13 ---------------Inter-token Latency---------------- Mean ITL (ms): 120.70 Median ITL (ms): 90.34 P50 ITL (ms): 90.34 P90 ITL (ms): 93.79 P99 ITL (ms): 101.80 ================================================== ``` All attention states verified: `PrefillNoCache`, `PrefillCacheHit`, `ChunkedPrefill`, `DecodeOnly`. - vLLM version: v0.17.0 - vLLM main: `8b6325758c` Signed-off-by: lico67373 <918688502@qq.com> Co-authored-by: LICO67373 <110013619+LICO1314@users.noreply.github.com>	2026-04-08 10:51:58 +08:00
Nagisa125	fbd5d0fd55	[Doc][Misc][v0.18.0] Updated the document configuration for DeepSeek-V3.2 (#7970 ) ### What this PR does / why we need it? To avoid misleading users, the unmaintained DSV32 models, such as the floating-point model, are deleted from the document.This PR removes the BF16 version entries for DeepSeek-V3.2 from the documentation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Documentation update only. Signed-off-by: wyh145 <1987244901@qq.com>	2026-04-07 16:17:28 +08:00
cvSoldier	6c19270498	[BugFix] fix qwen3-next compilation error (#7977 ) ### What this PR does / why we need it? fix qwen3-next compilation error - vLLM version: v0.18.0 - vLLM release0.18.0: `445dc7196f` --------- Signed-off-by: cvSoldier <610496306@qq.com>	2026-04-03 20:03:39 +08:00
guxin108	81c6f51a45	【CI】add nightly cases: MiniMax-M2.5-W8A8 Qwen3.5-27B-w8a8 Qwen3.5-397B-A1… (#7968 ) ### What this PR does / why we need it? This PR Qwen3.5-27B ;MiniMax-M2.5-w8a8 ;Qwen3.5-397B-w8a8-mtp acc/perf 3 cases on A3, we need test them daily. - vLLM version: v0.18.0 - vLLM main: `35141a7eed` Signed-off-by: guxin108 <1252896542@qq.com>	2026-04-03 17:50:59 +08:00
jiangmengyu18	3f462d251e	[v0.18.0][CI] fix acc baseline of qwen3vl 235b (#7981 ) ### What this PR does / why we need it? fix acc baseline of qwen3vl 235b --------- Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>	2026-04-03 17:38:17 +08:00

1 2 3 4 5 ...

2843 Commits