Commit Graph

1696 Commits

Author SHA1 Message Date
wangbj127
6bdc72949b Revert "[v0.18.0][BugFix] Fix dimension mismatch error when SP padding causes num_tokens_padded != num_tokens_unpadded" (#8413)
Reverts vllm-project/vllm-ascend#8133
- Reversion of Logic: This pull request reverts the changes introduced
in a previous commit that attempted to handle dimension mismatches
during SP padding.

Signed-off-by: Wangbingjie <wangbj1207@126.com>
2026-04-18 20:43:42 +08:00
wangxiaoteng888
363febb6cb [BugFix][v0.18.0] Gate recompute/balance/fused_mc2 by PD mode (#8374)
### What this PR does / why we need it?
- Enforce recompute scheduler only in PD-disaggregated mode.
- Enforce balance scheduling only in PD-mixed mode.
- Enforce fused MC2 only on PD-disaggregated D-side (kv_consumer).
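
A minimal sketch of this gating, assuming vLLM's usual `kv_role` values (`kv_producer`, `kv_consumer`, `kv_both`); the flag names are illustrative, not the actual vllm-ascend code:

```python
def gate_features(kv_role, enable_recompute, enable_balance, enable_fused_mc2):
    """kv_role is None (non-PD), 'kv_producer', 'kv_consumer', or 'kv_both'."""
    is_pd_disaggregated = kv_role in ("kv_producer", "kv_consumer")
    is_pd_mixed = kv_role == "kv_both"
    return (
        enable_recompute and is_pd_disaggregated,       # recompute scheduler: PD-disaggregated only
        enable_balance and is_pd_mixed,                 # balance scheduling: PD-mixed only
        enable_fused_mc2 and kv_role == "kv_consumer",  # fused MC2: D-side (kv_consumer) only
    )
```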

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By ci

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-04-18 18:06:42 +08:00
1kzk
c995a959e6 [BugFix] fix hang in async scheduling while open ENPU (#8354)
### What this PR does / why we need it?
1. There is no synchronization between steps. However, in async
scheduling with aclgraph, it is possible that the CPU's record event for
the current iteration completes before the previous iteration's graph
execution has finished. If the CPU is fast enough, the device will hang on
event_wait in iteration i+1 (assuming that event_record is executed
immediately on the device's update stream).
2. Under ENPU, eagle proposers also need to follow the rule of
event.record first, then event.wait.
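
A minimal sketch of the "record first, wait later" rule, using `torch.cuda` streams and events as a stand-in for the NPU event API; `update_graph_params` is a hypothetical placeholder:

```python
import torch

def update_graph_params():
    pass  # placeholder: copy this iteration's inputs into the captured graph buffers

update_stream = torch.cuda.Stream()
exec_stream = torch.cuda.Stream()
params_ready = torch.cuda.Event()

with torch.cuda.stream(update_stream):
    update_graph_params()
    params_ready.record()                  # record FIRST, on the update stream

with torch.cuda.stream(exec_stream):
    exec_stream.wait_event(params_ready)   # then wait; never issue the wait before the record
    # graph.replay() would run here
```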

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?

---------

Signed-off-by: 1zzk <785396250@qq.com>
2026-04-18 00:07:15 +08:00
wangbj127
f2956ce944 [v0.18.0][BugFix] Fix dimension mismatch error when SP padding causes num_tokens_padded != num_tokens_unpadded (#8133)
Cherry-picked from https://github.com/vllm-project/vllm-ascend/pull/7858

### What this PR does / why we need it?
This PR fixes a `RuntimeError` (dimension mismatch) that occurs when
Sequence Parallelism (SP) is enabled and the padding added for SP causes
`num_tokens_padded` to differ from `num_tokens_unpadded`. In such cases,
`_pad_query_start_loc_for_fia` adds a dummy request, increasing
`num_reqs_padded`. This mismatch between the actual number of requests
and the padded number of requests leads to errors in downstream token
count computations (e.g., `compute_num_computed_tokens`).

The fix relaxes the restrictive condition `num_tokens_padded ==
num_tokens_unpadded` used when reverting the dummy request padding, so the
padding is also reverted when SP is enabled: SP padding is handled by
stripping it after communication and should not be treated as an additional
request in the attention metadata.
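
A minimal sketch of the relaxed condition, with illustrative names rather than the actual vllm-ascend variables:

```python
def should_revert_dummy_request(num_tokens_padded: int,
                                num_tokens_unpadded: int,
                                sp_enabled: bool) -> bool:
    # Hedged sketch: revert the dummy-request padding either when no extra tokens
    # were added, or when SP is enabled (SP padding is stripped after communication).
    return num_tokens_padded == num_tokens_unpadded or sp_enabled
```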

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
vLLM version: v0.18.0
vLLM-Ascend version: releases/v0.18.0

Signed-off-by: Wangbj127 <wangbj1207@126.com>
2026-04-17 22:50:22 +08:00
aipaes
0954fd0912 [BugFix][0.18.0] Fix quant_bias missing in w8a8_static when flashcomm1 is enabled for GLM-5 (#8304)
### What this PR does / why we need it?
PR #8220 in v0.18.0

In a previous PR #7843 , the o_proj layer of GLM-5 was reverted to TP
(Tensor Parallel) splitting when flashcomm1 was enabled. However, this
was a temporary workaround and did not address the root cause of the
precision issues observed in the o_proj layer under flashcomm1.

I am working on a definitive fix for this issue. Currently, a clear bug
has been identified in
880e20fdde/vllm_ascend/quantization/methods/w8a8_static.py (L124):
during quantized matrix multiplication, quant_bias is not added if
tp_rank > 0. In the flashcomm1 scenario, all ranks actually require the
addition of quant_bias, meaning tp_rank=0 should be passed to ensure the
bias is applied correctly.

This PR aims to resolve this logic error and fix the underlying
precision issue.
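
A minimal sketch of the intended behavior, assuming a hypothetical bias helper; under flashcomm1 every rank must apply `quant_bias`, which is what passing `tp_rank=0` achieves in the real code:

```python
def effective_bias(quant_bias, tp_rank: int, flashcomm1_enabled: bool):
    # Hedged sketch: normally only rank 0 adds quant_bias, but with flashcomm1
    # each rank holds a sequence-sharded full output and must add it as well.
    bias_rank = 0 if flashcomm1_enabled else tp_rank
    return quant_bias if bias_rank == 0 else None
```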

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
glm5 e2e test

---------

Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: triomino <15924998+triomino@users.noreply.github.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
2026-04-17 22:46:36 +08:00
Zetong Li
b72ade9acd [0.18.0][BugFix] Update capture sizes after rounding operations (#8380)
### What this PR does / why we need it?
This PR is partially cherry-picked from #8172.

This PR aims to fix mismatched capture sizes after rounding operations
when using SP or speculative decoding. The reason is that the original
`self.cudagraph_capture_sizes` is no longer updated and remains at the
initial sizes. Now we use `self.cudagraph_dispatcher.get_capture_descs`
to get the up-to-date sizes.
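
A minimal sketch of reading the up-to-date sizes from the dispatcher, assuming each capture descriptor exposes a `num_tokens` field (an assumption, not the confirmed attribute name):

```python
def current_capture_sizes(dispatcher):
    # Hedged sketch: ask the dispatcher instead of reading a stale cached list.
    return sorted(desc.num_tokens for desc in dispatcher.get_capture_descs())
```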

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
by ci

Signed-off-by: Zetong Li <slippersss@126.com>
2026-04-17 22:46:16 +08:00
pz1116
ceb1e49661 [BugFix][v0.18.0] fix remote KV waiting promotion in balance scheduler (#8280)
### What this PR does / why we need it?
## Problem

In PD-disaggregated serving with `mooncake_connector` and
`VLLM_ASCEND_BALANCE_SCHEDULING=1`, requests may enter
`WAITING_FOR_REMOTE_KVS` and never be promoted back to runnable state
after remote KV transfer finishes.

The issue is in `BalanceScheduler`'s handling of
`WAITING_FOR_REMOTE_KVS` requests. The current code treats
`_update_waiting_for_remote_kv()` as if it returns a boolean readiness
flag:

```python
is_ready = self._update_waiting_for_remote_kv(request)
if is_ready:
    ...
else:
    ...
```
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
2026-04-17 10:06:36 +08:00
Frank Chen
f85144cc57 [BugFix][DSv32] Fix DSA-CP PD role gating for deepseek v3.2 (v0.18.0) (#8291)
### What this PR does / why we need it?
This PR backports the DSA-CP PD role gating fix to `releases/v0.18.0`.

The existing helper logic on the release branch does not handle the PD
mixed-role case correctly when deciding whether layer sharding or TP
`o_proj` handling should be enabled. Layer sharding should only run on
the P-side instance, while TP `o_proj` handling should stay enabled for
normal non-PD deployments and for the PD mixed-role (`kv_both`)
instance. This patch makes those conditions explicit and adds unit
coverage for the allowed and disallowed combinations, including the
DSA-CP-disabled path.

Such a wrong condition led to **vllm serve failures** in the case **FC1 +
PD-colocated KV pooling + no layer_sharding**, specifically causing:
1. insufficient Available KV cache memory
2. o_proj shape error in the sfa_v1 attention module
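
A minimal sketch of the explicit conditions described above; role names follow vLLM's kv_transfer_config convention and the flags are illustrative:

```python
def dsa_cp_gating(kv_role, dsa_cp_enabled: bool):
    """kv_role: None (non-PD), 'kv_producer' (P-side), 'kv_consumer' (D-side), 'kv_both' (mixed)."""
    enable_layer_sharding = dsa_cp_enabled and kv_role == "kv_producer"   # P-side only
    enable_tp_o_proj = dsa_cp_enabled and kv_role in (None, "kv_both")    # non-PD or PD-mixed
    return enable_layer_sharding, enable_tp_o_proj
```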

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
E2E test with dsv32 + FC1 + FULL_DECODE_ONLY +
kv_transfer_config(kv_both) + no layer_sharding

---------

Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
2026-04-17 10:05:40 +08:00
csoulnd
8952fddc7e [BugFix][310p][Cherry-pick] Handle null quantization config in ShardedStateLoader310&[Feature][310P] Support W8A8 dynamic linear method (#8296)
### What this PR does / why we need it?
This PR implements the `AscendW8A8DynamicLinearMethod310` quantization
scheme specifically for 310P hardware. It includes the logic for weight
retrieval, per-channel parameter generation, and the application of
dynamic quantization using NPU-specific kernels. Additionally, it
updates `ShardedStateLoader310` to handle quantization configurations
more robustly when generating parameter type maps.

Feedback from the review identified two critical issues in the
implementation:
1. The tensor squeezing logic in the `apply` method incorrectly handles
2D inputs, which may lead to shape mismatches in subsequent layers.
2. The weight tensor in `process_weights_after_loading` is transposed
after being converted to the private NZ format; the transpose operation
should be performed on the ND tensor before conversion to ensure correct
physical layout.

cherry-pick from : #7546 #7725
### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New unit tests were added in
`tests/ut/_310p/quantization/test_w8a8_dynamic_310.py` to verify the
quantization method, and
`tests/ut/_310p/test_sharded_state_loader_310p.py` was updated to test
the state loader changes.

---------

Signed-off-by: csoulnd <daidaicurry@foxmail.com>
2026-04-16 16:53:39 +08:00
1kzk
52f0f9b5e4 [0.18.0][BugFix]: order acl graph updates before model forward for ENPU (#8317)
### What this PR does / why we need it?
For the ENPU scenario, it is required that device events follow the
principle of "record first, wait later", otherwise the inference process
may become stuck. However, in the current model_forward function,
event.wait precedes event.record. Therefore, for the ENPU scenario,
graph parameter updates should be performed before model execution.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?

---------

Signed-off-by: 1zzk <785396250@qq.com>
Signed-off-by: 1kzk <785396250@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-16 16:26:59 +08:00
Frank Chen
2ac8bfb4cb [BugFix] Enforce C locale for CPU binding subprocess parsing (#8261)
### What this PR does / why we need it?
This PR backports the CPU binding locale normalization fix from #7274 to
`releases/v0.18.0`, including the follow-up review fixes already applied
on `main`.

The change forces `LC_ALL`, `LANG`, and `LC_MESSAGES` to `C` before
spawning subprocesses in `vllm_ascend.cpu_binding.execute_command()`, so
parser-dependent command output stays stable on localized systems. It
also handles `subprocess.TimeoutExpired` by killing the child process
before collecting output, and updates the existing unit tests to keep
command-argument coverage while adding timeout-path coverage.

Fixes #6992
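
A minimal sketch of the behavior described above (not the exact `execute_command` implementation):

```python
import os
import subprocess

def execute_command(cmd, timeout=30):
    # Force a stable English locale so parser-dependent output never changes with the OS language.
    env = dict(os.environ, LC_ALL="C", LANG="C", LC_MESSAGES="C")
    proc = subprocess.Popen(cmd, env=env, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    try:
        out, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()                 # kill the child before collecting whatever it produced
        out, _ = proc.communicate()
    return out.decode()
```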

### Does this PR introduce _any_ user-facing change?
Yes.

Users running CPU binding on non-English OS environments should now get
consistent English subprocess output for parser-dependent commands,
avoiding failures caused by inherited locale settings.

### How was this patch tested?
- Updated the existing unit tests in
`tests/ut/device_allocator/test_cpu_binding.py` to assert the locale
environment, retain command argument coverage, and cover the timeout
cleanup path.
- Attempted to run targeted pytest cases locally, but the pytest
invocation did not complete normally in this environment, so I could not
record a clean passing run here.

Attribution:
- Co-authored-by: stdjhs <1601599324@qq.com>
- Signed-off-by: chenchuw886 <chenchuw@huawei.com>

Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: stdjhs <1601599324@qq.com>
2026-04-16 16:17:10 +08:00
pz1116
8bc72a807a [BugFix][v0.18.0] require piecewise cudagraph for layerwise AscendSto… (#8282)
### What this PR does / why we need it?
ref: https://github.com/vllm-project/vllm-ascend/issues/8184
Following https://github.com/vllm-project/vllm/pull/31057, add
`requires_piecewise_for_cudagraph` for `AscendStoreConnector`.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
2026-04-16 10:40:14 +08:00
chenweiqiang11
028b8cabc4 [BugFix][Platform] Fix extra function name in final chunk of streaming tool calls (#8178)
### What this PR does / why we need it?

Fix a bug in the GLM tool call parser where the `function.name` field
was incorrectly included in the final (non-first) chunks of streaming
tool calls.

Per OpenAI streaming semantics, `id`, `type`, and `function.name` must
only appear in the **first** chunk for a given tool call index. When
`_create_remaining_args_delta` was called for continuing/finishing
chunks, it was incorrectly reading the function name from
`delta_message.tool_calls` and re-emitting it, causing clients to see a
duplicate/extra function name in the final chunk.

**Root cause**: The original code always looked up the tool call in
`delta_message.tool_calls` to get the name, id, and type — even when
this was not the first chunk being streamed. This caused the function
name to appear again in the final argument-completion chunk.

**Fix**:
- Track whether arguments have already been streamed
(`already_streamed_args`) for each tool call index.
- Only populate `fallback_tool_call_id`, `fallback_tool_call_type`, and
`fallback_tool_call_name` when `already_streamed_args` is empty (i.e.,
this is genuinely the first chunk).
- Refactored `_create_remaining_args_delta` to omit header fields
entirely when all fallback values are `None`, which is the correct
behavior for continuing/finishing chunks.
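
A minimal sketch of the first-chunk rule, with illustrative names (`already_streamed_args` mirrors the tracking state described above):

```python
def build_tool_call_delta(index, args_delta, tool_call_id, function_name,
                          already_streamed_args):
    # Hedged sketch: id/type/function.name may only appear in the FIRST chunk for a
    # given tool call index; later chunks carry only the arguments delta.
    first_chunk = not already_streamed_args.get(index)
    delta = {"index": index, "function": {"arguments": args_delta}}
    if first_chunk:
        delta["id"] = tool_call_id
        delta["type"] = "function"
        delta["function"]["name"] = function_name
    already_streamed_args[index] = True
    return delta
```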

### Does this PR introduce _any_ user-facing change?

Yes. Clients consuming the streaming tool call response will no longer
receive a duplicate `function.name` in the final chunk. This fixes
incorrect behavior visible in the OpenAI-compatible streaming API output
for GLM models using tool calls.

### How was this patch tested?

- Code review and logic analysis of the streaming tool call path in
`patch_glm_tool_call_parser.py`.
- Existing unit tests in
`tests/ut/platform/test_patch_glm_tool_call_parser.py`.

---------

Signed-off-by: chen-weipeng12 <chen-weipeng12@noreply.gitcode.com>
Signed-off-by: chenweiqiang11 <chenweiqiang11@noreply.github.com>
Co-authored-by: chen-weipeng12 <chen-weipeng12@noreply.gitcode.com>
2026-04-15 17:50:10 +08:00
Zetong Li
b6aa5bbdbf [0.18.0][BugFix] Add PrefillNoCache state in mla _forward_decode for short prompt (#8264)
### What this PR does / why we need it?
This PR is cherry-picked from #8263.

This PR aims to fix the short prompt problem. The root cause can be found in
#8029. Since the previous PR may miss batches that mix long and short prompts,
after discussion we decided to add the PrefillNoCache state in mla
_forward_decode instead.

Signed-off-by: Zetong Li <slippersss@126.com>
2026-04-15 09:23:52 +08:00
Qiu
70713c3fd4 [cherry-pick][BugFix] Improve max_cudagraph_capture_size validation (#8252)
### What this PR does / why we need it?

This PR improves the validation of `max_cudagraph_capture_size` by
comparing it against the potential maximum tokens required for decoding,
derived from the scheduler configuration. It introduces a warning to
alert users when the capture size might be insufficient for the
workload, which could lead to suboptimal performance.
ref: #8227

### Does this PR introduce _any_ user-facing change?
Yes, a warning log is added when the `max_cudagraph_capture_size` is
smaller than the potential decode workload.

---------

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-04-14 22:00:10 +08:00
Yaphets24
13c7392416 [BugFix] fix dsv3.1 service failed to start (#8207)
### What this PR does / why we need it?

This PR fixes a service startup failure for DeepSeek-V3.1 models by
removing a strict type assertion for `MLAAttentionSpec` in
`NPUModelRunner.get_kv_cache_spec`. The assertion was failing due to
class identity mismatches caused by the runtime patching of
`MLAAttentionSpec` with `AscendMLAAttentionSpec`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Verified that the service starts correctly for DSV3.1 models.

Signed-off-by: mayumeng <m30059191@china.huawei.com>
Co-authored-by: mayumeng <m30059191@china.huawei.com>
2026-04-14 17:52:55 +08:00
wangbj127
d94b1dc2d0 [v0.18.0][BugFix] Fix Qwen3.5 MoE flash comm v1 shared expert shape error of mtp layer on A2 (#8004)
### What this PR does / why we need it?
Fix Qwen3.5 MoE MTP layer shared expert shape error when flash comm v1
is enabled.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
35141a7eed

Signed-off-by: Wangbingjie <wangbj1207@126.com>
2026-04-13 17:36:09 +08:00
wangxiaoteng888
39c071a0f5 [BugFix][P/D][0.18.0]Add a retry mechanism to prevent packet loss (#8167)
### What this PR does / why we need it?
Add a retry mechanism to prevent packet loss

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-04-12 23:30:45 +08:00
wangxiaoteng888
4adc6a68f5 [BugFix][P/D][0.18.0]bugfix short squence has no respone (#8142)
### What this PR does / why we need it?
Fix a bug where short sequences receive no response. This pull request
refactors the event handling for KV cache reshaping in mla_v1.py by
centralizing the reshape_cache_event creation and recording within the
_mla_preprocess function, ensuring it covers both decode and prefill
operations.

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-04-12 23:25:01 +08:00
Frank Chen
31186a3a9d [BugFix] Add async communication check for capturing mode (#8149)
### What this PR does / why we need it?
Introduce a check that avoids asynchronous communication under the
`enable_dsa_cp_with_layer_shard` branch in capturing mode. This change
prevents potential stream and event issues when operating in
graph/capturing mode, ensuring safer communication practices.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
E2E test with dsv32 + FC1 + FULL_DECODE_ONLY +
kv_transfer_config(kv_both)

---------

Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
2026-04-12 21:52:54 +08:00
DreamerLeader
531d0e6fff [v0.18.0][BugFix][KV Pool]Fix the conflict between pooling scenarios … (#8101)
…and PCP across machines


Signed-off-by: DreamLeader <2270923832@qq.com>
2026-04-09 21:55:56 +08:00
Zetong Li
054fde7b72 [0.18.0][BugFix] Fix attention state of short prompt for correct forwarding (#8088)
### What this PR does / why we need it?
This PR is cherry-picked from #8029.

This PR aims to fix the attention state of short prompts for correct
forwarding. A batch of short prompts (prefill tokens less than or
equal to num_spec_tokens + 1) is treated as decode requests (by
split_decodes_and_prefills), which contradicts its original PrefillNoCache
attention state. Thus these short prompts are passed into a mismatched
branch and incur errors.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
by ci

Signed-off-by: Zetong Li <slippersss@126.com>
2026-04-09 21:21:24 +08:00
weijinqian0
f668ff9ef0 [v0.18.0][BugFix]Revert the code: Replace npu_ring_mla wit FIA with MLA prefill. (#7961)
This pull request reverts previous changes to switch to FIA and instead
implements npu_ring_mla for MLA prefill operations (#5704). The change
streamlines the attention mechanism by removing unnecessary metadata
tracking and updating the underlying NPU operations to use the
ring-based MLA kernel. This adjustment ensures better compatibility and
performance for MLA prefill tasks within the vLLM Ascend backend.

Highlights

- Migration to npu_ring_mla: Replaced the usage of
npu_fused_infer_attention_score (FIA) with npu_ring_mla for MLA prefill
operations across the codebase to improve performance and alignment with
the intended architecture.
- Cleanup of redundant metadata: Removed
chunk_actual_seq_lengths_kv_list and actual_seq_lengths_q from various
metadata structures as they are no longer required for the updated
attention implementation.
- Test suite updates: Updated unit tests in test_mla_cp.py and
test_mla_v1.py to mock npu_ring_mla instead of the deprecated FIA
functions and adjusted test assertions to reflect the new implementation
details.

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2026-04-09 17:00:25 +08:00
linfeng-yuan
7c9aa498d6 [releases/v0.18.0][BugFix] Restore global_bs=0 and mc2_mask for uniform-token dispatching and support inter-node roce hierarchical MC2 communication (#8040)
### What this PR does / why we need it? 
Cherry-picked from #8039 
Restore the setting of MC2 `global_bs` and the `mc2_mask` handling when the
`all_reduce` across the DP group cannot be skipped. Ascend MC2 ops require
`global_bs=0` + `mc2_mask` when inter-node RoCE hierarchical communication
is enabled. PR #4983 always passed a non-zero `global_bs` without
`mc2_mask`, which is incompatible with the hierarchical comm introduced in PR #7583.

**Changes:**
- Add `should_skip_allreduce_across_dp_group()` to `utils.py` with
hierarchy constraint
- Set `global_bs=0` when allreduce is not skipped; pass `mc2_mask`
accordingly
- Add `mc2_mask` field to `MoEMC2CombineMetadata` for dispatch→combine
propagation
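
A minimal sketch of the restored dispatch-side behavior; `should_skip_allreduce_across_dp_group` comes from the PR, the remaining names are illustrative:

```python
def mc2_dispatch_args(num_global_tokens, mc2_mask, skip_allreduce: bool):
    # Hedged sketch: when the DP all_reduce cannot be skipped, Ascend MC2 ops need
    # global_bs=0 together with mc2_mask to work with hierarchical RoCE communication.
    if skip_allreduce:
        return {"global_bs": num_global_tokens, "mc2_mask": None}
    return {"global_bs": 0, "mc2_mask": mc2_mask}
```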
### Does this PR introduce _any_ user-facing change?
No. But this PR fixes cross-super-node communication function on A3 with
`enable_mc2_hierarchy_comm=True` in `additional_config` and `export
HCCL_INTRA_ROCE_ENABLE=1`.

### How was this patch tested?
E2E serving succeeded and CI passed.

- vLLM version: v0.18.0
- vLLM main:
14acf429ac

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-04-09 16:51:17 +08:00
Shaoxu Cheng
82e17f693a [BugFix][0.18.0][310p] fix post-sampling not working in graph mode on 310p (#8077)
### What this PR does / why we need it?

Enabling temperature in post-processing on 310P devices can cause the
service to stall and eventually hang. We first traced the issue to a
timeout where the temperature-related `div` operator was waiting for
results from a sub-stream. After investigating the preceding operators,
we finally identified the root cause as the `q.exponential_()` operator,
which is not well supported on 310P and triggers an internal issue in
the `add` kernel.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
This patch was thoroughly tested locally (accuracy-dataset test and
stress test). It is not easy to design a proper unit test for this case,
and I appreciate your understanding.

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-04-09 16:31:38 +08:00
zouyida2052
c40a387f63 [bugfix]fix extra npu context in device 0 (#8041)
### What this PR does / why we need it?
When we launch a PD-disaggregated process and send requests, an
additional process appears on NPU 0, because when a thread has a
primary cuda context, the child thread it creates does not automatically
inherit the cuda context. See
https://forums.developer.nvidia.com/t/when-a-thread-has-a-primary-cuda-context-does-the-child-thread-it-creates-automatically-inherit-the-cuda-context/362810.
vLLM has fixed this issue in [pr-37449
](https://github.com/vllm-project/vllm/pull/37449), but version 0.18.0
does not include the fix. Therefore, we need to patch it.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

---------

Signed-off-by: zouyida <zouyida@huawei.com>
Co-authored-by: zouyida <zouyida@huawei.com>
2026-04-08 23:35:52 +08:00
Mengqing Cao
044d4c3974 [v0.18.0]feat(quant): add C8 INT8 KV cache support for GQA attention models (#7474) (#8007)
backport of #7474

This PR adds C8 (INT8) KV cache quantization support for standard GQA
attention models (e.g., Qwen3-32B W8A8C8). C8 uses static per-channel
quantization scales to store KV cache in INT8, reducing KV cache memory
by ~50% compared to BF16, enabling higher batch concurrency and longer
context lengths on the same hardware.

**Key changes:**

1. **`attention_v1.py`** — New `AscendC8AttentionBackendImpl` subclass
of `AscendAttentionBackendImpl`:
- `_prepare_c8_scales`: Shards per-channel scales/offsets to the current
TP rank and pre-computes BF16 BNSD-shaped antiquant tensors (one-time
per layer).
- `_quantize_kv_to_int8`: Quantizes BF16 K/V to INT8 before
`reshape_and_cache`, using pre-cached inverse scales.
- `_forward_c8_decode`: FIA V1 BNSD paged attention with native INT8 KV
and `perchannel` antiquant mode.
- `_forward_c8_chunked_prefill`: Splits decode (FIA V1 BNSD paged INT8)
and prefill (FIA V1 TND float) into two kernel calls.
- `_forward_c8_fused_infer_attention`: Handles `PrefillNoCache` and
`PrefillCacheHit` states.

2. **`quantization/methods/kv_c8.py`** — New
`AscendC8KVCacheAttentionMethod` scheme:
- Creates `k/v_cache_scale/offset` parameters via
`_c8_kv_scale_weight_loader`, which handles per-channel scale shapes and
lazy resizing.
- Sets `layer.kv_cache_torch_dtype = torch.int8` so
`get_kv_cache_spec()` returns INT8 dtype automatically.
- Upgrades `layer.impl` to `AscendC8AttentionBackendImpl` via class
surgery.

3. **`quantization/modelslim_config.py`** — C8 branch in
`get_quant_method()` activates when `kv_cache_type == "C8"` in
`quant_model_description.json`.

4. **`patch/worker/patch_qwen3_c8.py`** — Intercepts per-channel C8
scale/offset weights before `AutoWeightsLoader` discards them, routing
them to the parameters created by `AscendC8KVCacheAttentionMethod`.

5. **`tests/ut/quantization/test_kv_c8.py`** — Unit tests covering
`_c8_kv_scale_weight_loader`, `AscendC8KVCacheAttentionMethod`, and
`AscendC8AttentionBackendImpl` scale helpers.
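
A minimal sketch of the per-channel INT8 quantization step from item 1 (illustrative, not the actual `_quantize_kv_to_int8` code):

```python
import torch

def quantize_kv_to_int8(kv: torch.Tensor, inv_scale: torch.Tensor,
                        offset: torch.Tensor) -> torch.Tensor:
    # kv: BF16 [num_tokens, num_kv_heads, head_dim]; inv_scale / offset are
    # per-channel tensors broadcastable over the token dimension.
    q = torch.round(kv.float() * inv_scale + offset)
    return q.clamp_(-128, 127).to(torch.int8)
```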

Yes. Users can now serve Qwen3-32B W8A8C8 quantized models with INT8 KV
cache on Ascend NPU. The model checkpoint must contain a
`quant_model_description.json` with `"kv_cache_type": "C8"` and
per-channel scale/offset tensors in safetensors.

No changes to the serving CLI — the feature activates automatically when
the quantization config is detected.

Benchmarked with `vllm serve` (TP=8, `max_num_seqs=256`,
`max_model_len=131072`, `enable_chunked_prefill=true`) + `random_bench`
(input_len=10240, output_len=2048, 960 prompts, max_concurrency=192):

```
============ Serving Benchmark Result ============
Successful requests:                     960
Failed requests:                         0
Maximum request concurrency:             192
Benchmark duration (s):                  1359.81
Total input tokens:                      9830400
Total generated tokens:                  1966080
Request throughput (req/s):              0.71
Output token throughput (tok/s):         1445.85
Peak output token throughput (tok/s):    2304.00
Total token throughput (tok/s):          8675.12
---------------Time to First Token----------------
Mean TTFT (ms):                          24598.51
Median TTFT (ms):                        23167.02
P50 TTFT (ms):                           23167.02
P90 TTFT (ms):                           47717.08
P99 TTFT (ms):                           84402.61
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          120.76
Median TPOT (ms):                        121.50
P50 TPOT (ms):                           121.50
P90 TPOT (ms):                           127.05
P99 TPOT (ms):                           130.13
---------------Inter-token Latency----------------
Mean ITL (ms):                           120.70
Median ITL (ms):                         90.34
P50 ITL (ms):                            90.34
P90 ITL (ms):                            93.79
P99 ITL (ms):                            101.80
==================================================
```

All attention states verified: `PrefillNoCache`, `PrefillCacheHit`,
`ChunkedPrefill`, `DecodeOnly`.

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: LICO67373 <110013619+LICO1314@users.noreply.github.com>
2026-04-08 10:51:58 +08:00
cvSoldier
6c19270498 [BugFix] fix qwen3-next compilation error (#7977)
### What this PR does / why we need it?
fix qwen3-next compilation error

- vLLM version: v0.18.0
- vLLM release0.18.0:
445dc7196f
---------
Signed-off-by: cvSoldier <610496306@qq.com>
2026-04-03 20:03:39 +08:00
jiangmengyu18
3cbd6acc89 [v0.18.0][Feature] Support Flash Comm V1 for Qwen3-VL models (#7893)
### What this PR does / why we need it?
Enable Flash Comm V1 (sequence parallelism) for Qwen3-VL models (both
dense and MoE variants).

Root cause: Qwen3-VL's deepstack embeddings remain full-size [N, H]
while hidden states become [N/tp_size, H] after reduce-scatter, causing
shape mismatch on add.
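
A minimal sketch of the shape alignment implied by the root cause above (illustrative names, not the actual model code):

```python
def add_deepstack_embeds(hidden_states, deepstack_embeds, tp_rank: int, tp_size: int):
    # Hedged sketch: after reduce-scatter each rank holds [N // tp_size, H], so the
    # full-size [N, H] deepstack embeddings are sliced to the matching rows first.
    chunk = deepstack_embeds.shape[0] // tp_size
    local = deepstack_embeds[tp_rank * chunk:(tp_rank + 1) * chunk]
    return hidden_states + local
```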
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- [x] Run Qwen3-VL dense model with FC1 enabled (TP > 1), verify correct
output
- [x] Run Qwen3-VL MoE model with FC1 enabled (TP > 1), verify correct
output

---------

Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-03 11:38:41 +08:00
jiangmengyu18
85234d096d [v0.18.0][Feature] support qkv_rmsnorm_mrope for qwen3vl (#7852)
### What this PR does / why we need it?
Qwen3vl full attention supports enabling the split_qkv_rmsnorm_mrope
fusion operator.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- [x] Run Qwen3-VL dense model with the fusion operator, verify correct
output
- [x] Run Qwen3-VL MoE model with the fusion operator, verify correct
output

---------

Signed-off-by: jiangmengyu18 <451528648@qq.com>
Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
2026-04-02 17:46:50 +08:00
jiangmengyu18
74699877c9 [v0.18.0][BugFix] fix the weightsmapper bug of qwen3-vl (#7868)
### What this PR does / why we need it?
This PR fixes a weight loading error in the Qwen3-VL model.
The bug was introduced by vLLM. In vLLM's `qwen3-vl.py`, the prefix of
the `lm_head` layer is hardcoded as `"lm_head"`. However,
`hf_to_vllm_mapper` remaps the weight name of `lm_head` from `lm_head`
to `language_model.lm_head`.
This causes a mismatch between the keys in the weight file and the
prefix of the lm_head layer, resulting in an error.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- [x] Run Qwen3-VL dense model with the fusion operator, verify correct
output

Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
2026-04-02 12:56:08 +08:00
pz1116
1225c613fb [BugFix][0.18.0][KV Pool] Fix KV Pool not putting kv cache for vllm v0.18.0 (#7874)
### What this PR does / why we need it?
vLLM v0.18 defers KV connector finalization during the target-model forward
when speculative decoding is enabled, leading to the KV Pool not doing the Put
operation. This change was missed when we bumped up the version for
vllm-ascend. Fix by adding finalize_kv_connector for spec decode.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: DreamerLeader <2270923832@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
2026-04-02 10:57:09 +08:00
LI SHENGYONG
4b2f0130bc [V0.18.0][EPLB][BugFix] Fix moe_load precision in allgather (#7890)
### What this PR does / why we need it?
Fixed a bug caused by incorrect reshape usage.
For example:
ori_tensor: [[1, 2, 3], [4, 5, 6]]
after reshape:
[[1, 2], [3, 4], [5, 6]]
after permute:
[[1, 4], [2, 5], [3, 6]]
Now we directly use squeeze instead, which is more intuitive.
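
The same example in code, for clarity (plain torch, illustrative only):

```python
import torch

t = torch.tensor([[1, 2, 3], [4, 5, 6]])  # shape [2, 3]
t.reshape(3, 2)   # [[1, 2], [3, 4], [5, 6]] -- reshape only regroups values in memory order
t.permute(1, 0)   # [[1, 4], [2, 5], [3, 6]] -- permute gives the intended transpose
```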
pr for main:
#7887 

### Does this PR introduce _any_ user-facing change?
The actual peak-to-average ratio has successfully decreased.

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-04-02 09:20:31 +08:00
hucong
d3de7333dc [BugFix][v0.18.0][cherry-pick] Fix embedding prefix caching for APC (#7894)
## What this PR does / why we need it?
pick-from:https://github.com/vllm-project/vllm-ascend/pull/7452
### Problem
Embedding models produce inconsistent outputs when prefix caching is
enabled vs disabled.

### Root Cause
The attention router condition was too broad:
- All `model_runner_type == "pooling"` → `_forward_encoder_attention()`
→ uses `npu_fusion_attention`
- **But `npu_fusion_attention` does NOT support prefix caching**
- Result: Numerical mismatch when KV cache is managed by prefix caching

### Solution
Refine the router condition to check causality:

**Before**: 
```
if attn_metadata.model_runner_type == "pooling":
    → npu_fusion_attention (no prefix caching support)
```

**After**: 
```
if attn_metadata.model_runner_type == "pooling" and not attn_metadata.causal:
    → npu_fusion_attention (for true encoders)
else:
    → npu_fused_infer_attention_score (prefix caching support)
```
### Changes Made

1. **Fixed router condition** (`vllm_ascend/attention/attention_v1.py`
L968)
   - Added `and not attn_metadata.causal` check
   - Effect: Non-causal embeddings now use correct operator

2. **Simplified encoder attention**
(`vllm_ascend/attention/attention_v1.py` L864-877)
   - Removed redundant causal branch (encoders never use causal mask)
   - Reduced from 34 lines to 14 lines

3. **Added test** (`tests/e2e/singlecard/pooling/test_embedding.py`)
- Validates embedding outputs with/without prefix caching are consistent
  
## Does this PR introduce _any_ user-facing change?

### Functional Changes
 **Yes** - Bug fix: Embedding models now produce consistent outputs
with prefix caching

### API Changes
 **No** - All public APIs unchanged

### Configuration Changes
 **No** - No new configuration required

### Backward Compatibility
 **Fully compatible** - Only fixes incorrect behavior

## How was this patch tested?
### New Test
Added `test_embed_models_using_prefix_caching_correctness()`:
- Tests: `Qwen3-Embedding-0.6B`
- Validates numerical consistency between runs with/without prefix
caching
- Uses long sequences to activate prefix caching
- Tolerance: 1e-2
- vLLM version: v0.18.0

Signed-off-by: underfituu <hzhucong@163.com>
2026-04-01 16:57:33 +08:00
zxr2333
ef9964389f [v0.18.0][BugFix][P/D]Fix layerwise connector out of memory during large buffer transfer (#7752)
### What this PR does / why we need it?
Fix layerwise connector out of memory during large buffer transfer.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By nightly.

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2026-03-31 22:16:53 +08:00
yydyzr
b1cc6ef6ae [v0.18.0][BugFix] Fix bug of precision when DSA-CP is enabled on GLM5 (#7843)
### What this PR does / why we need it?
This PR fixes an accuracy bug in some cases with additional communication
methods.
This PR is a specific fix for version 0.18.0.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.18.0
- vLLM main:
35141a7eed

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
Signed-off-by: yydyzr <liuyuncong1@huawei.com>
Co-authored-by: rjg-lyh <1318825571@qq.com>
2026-03-31 21:51:10 +08:00
pz1116
0b48ddbc8b [Bugfix][0.18.0][KV Pool]Fix KV transfer put logic (#7718)
### What this PR does / why we need it?
Previously, when doing a put for the KV Pool, we found the first non-existing
key and put all the blocks starting from that index. However, if the prefix
cache blocks come from another request and some of the blocks have been
evicted due to LRU, we end up putting blocks that still exist in the pool,
causing MooncakeStore to print unnecessary logs in the master service.

What this PR does:

- Look up all the keys and only put the ones that are missing.
- Fix lookup_scheduler in pool_worker so it handles GQA correctly.
- Fix a few existing typos.
- Add UT, written by codex.
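
A minimal sketch of the new put path, assuming hypothetical batch `lookup`/`put` helpers on the pool:

```python
def put_missing_blocks(pool, block_keys, kv_blocks):
    # Hedged sketch: look up every key first and only put the blocks the pool
    # does not already hold, instead of putting everything after the first miss.
    hits = pool.lookup(block_keys)             # hypothetical batch lookup
    for key, hit in zip(block_keys, hits):
        if not hit:
            pool.put(key, kv_blocks[key])      # hypothetical put
```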

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

---------

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: DreamerLeader <2270923832@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
2026-03-31 20:21:23 +08:00
linfeng-yuan
ed4ef1f4e7 [releases/v0.18.0][Triton][Sampler] Add penalty-related Triton kernel for better performance of penalties (#7794)
### What this PR does / why we need it?
Implement get_token_bin_counts_and_mask and apply_penalties with
Triton-Ascend kernels. This significantly reduces latency of the
sampling process when repetition/frequency/presence penalties are
enabled.

Cherry-pick from main PR #7569 
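
A plain-torch reference for what the kernels compute; the Triton-Ascend version is a fused equivalent, and this sketch follows the usual penalty definitions (illustrative only, not the kernel code):

```python
import torch

def apply_penalties(logits, output_tokens, presence_p, frequency_p, repetition_p):
    # logits: [batch, vocab]; output_tokens: [batch, max_out_len] of generated token ids.
    counts = torch.zeros_like(logits).scatter_add_(
        1, output_tokens, torch.ones_like(output_tokens, dtype=logits.dtype))
    seen = counts > 0
    rep = repetition_p.unsqueeze(1)
    logits = torch.where(seen & (logits > 0), logits / rep, logits)
    logits = torch.where(seen & (logits <= 0), logits * rep, logits)
    logits -= frequency_p.unsqueeze(1) * counts
    logits -= presence_p.unsqueeze(1) * seen.to(logits.dtype)
    return logits
```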
### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: realliujiaxu <realliujiaxu@163.com>
2026-03-31 19:01:51 +08:00
wangxiaoteng888
82e26b5a6e [BugFix][v0.18.0]Adjust request map pop time (#7857)
### What this PR does / why we need it?
Adjust the request map pop time. This pull request optimizes the KV cache
transfer mechanism by streamlining how requests are tracked and cleaned
up. By removing unnecessary mapping structures and adjusting the timing
of request removal, the system achieves more efficient state management
during the transfer process.
pick-from: https://github.com/vllm-project/vllm-ascend/pull/7855


### How was this patch tested?
By ci

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-03-31 18:55:36 +08:00
jack
7314bbe2df fix(platform): reimplement MiniMax usage accounting patch (#7835)
## Summary
- replace the MiniMax usage accounting monkey patch with a runtime
wrapper implementation instead of source-text rewriting
- preserve MiniMax reasoning-token semantics when `</think>` is missing
by counting the emitted output as reasoning tokens
- add unit coverage for usage tracking helpers and MiniMax
reasoning-token counting

## Why
The previous implementation rewrote `OpenAIServingChat` by matching
exact source blocks. That was brittle against `vllm` source drift and
could crash during early plugin initialization with:
`RuntimeError: Failed to locate expected block while patching
OpenAIServingChat usage accounting.`

This change keeps the usage-accounting backport, but applies it by
wrapping the original stream/full generators and tracking output token
ids at runtime.

For MiniMax reasoning counting, a missing `</think>` should not be
treated as zero reasoning tokens. It can mean the whole output is still
in thinking mode, or that generation stopped before the closing token
was produced. In that case, the emitted output should still be counted
as reasoning.
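
A minimal sketch of the wrapping approach (illustrative; the real patch wraps `OpenAIServingChat`'s generators):

```python
def wrap_with_usage_tracking(original_generator, track_chunk):
    # Hedged sketch: instead of rewriting vllm source text, wrap the original async
    # generator and observe each emitted chunk to count output token ids at runtime.
    async def wrapped(*args, **kwargs):
        async for chunk in original_generator(*args, **kwargs):
            track_chunk(chunk)
            yield chunk
    return wrapped
```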

## Validation
- `pytest -q
tests/ut/patch/platform/test_patch_minimax_usage_accounting.py`
- `vllm serve --help`

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
2026-03-31 16:27:00 +08:00
Wangbei25
4f259d4fd8 [Performance]Optimize DeepSeekOCR2 RelPosAttention and CustomQwen2Decoder (#7737)
### What this PR does / why we need it?
Optimize DeepSeekOCR2 RelPosAttention and CustomQwen2Decoder and add doc
for DeepSeekOCR2.md

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vllm 0.18.0
- vllm-ascend main

1. `_create_custom_4d_mask` duration 141ms49us620ns -->
`_create_npu_optimized_mask` duration 1ms227us780ns
2. conv2d: 27ms --> matmul: <1ms
3. relposattention: sdpa --> prompt_flash_attention

---------

Signed-off-by: Wangbei25 <wangbei41@huawie.com>
Signed-off-by: Wangbei25 <wangbei41@huawei.com>
Co-authored-by: Wangbei25 <wangbei41@huawie.com>
2026-03-31 14:49:29 +08:00
liuchenbing2026
2a0a588311 [0.18.0][BugFix] Disable block verify to avoid incorrect verification on NPU … (#7839)
…(#7603)

### What this PR does / why we need it?
Block verify uses cumprod(target_probs / draft_probs) for joint
acceptance. Suffix/ngram methods have draft_probs=None, and the fallback
draft_token_probs=1.0 combined with cumprod is not equivalent to per-token
verification, causing incorrect accept/reject results. Fix:
`using_block_verify = max_spec_len >= 3 and draft_probs is not None`.
MTP/Eagle3 are unaffected.

- vLLM version: v0.18.0
- vLLM main:
ed359c497a


Signed-off-by: liuchenbing <chenliumail@163.com>
Co-authored-by: liuchenbing <chenliumail@163.com>
2026-03-31 09:36:48 +08:00
zxr2333
ab928ed586 [v0.18.0][P/D][Feature]Layerwise connector supports Mamba prefill prefix caching (#7796)
### What this PR does / why we need it?
Mooncake layerwise connector supports Mamba prefix caching on prefiller
nodes.

### Does this PR introduce _any_ user-facing change?
Yes. Use `--enable-prefix-caching` and `--mamba-cache-mode align` to
enable mamba align-mode prefix caching on P/D prefill nodes. This
function is not supported on decode nodes yet.

### How was this patch tested?
By P/D E2E test.

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-03-31 09:25:22 +08:00
linfeng-yuan
cab5d73633 [releases/v0.18.0][BugFix] Fix server init error when set max_num_seqs not a multiple of tp while FLASHCOMM is on (#7832)
### What this PR does / why we need it?
The current version runs into an init error when the user sets max_num_seqs
to a number that is not a multiple of the tp size. The reason is that we
first find the valid sizes for sequence parallelism, and then remove numbers
that are not a multiple of the tp size. This can fail when max_num_seqs sits
above a multiple of 8 but below the next multiple of the tp size, say when
the tp size is 16 and max_num_seqs is 90: the system drops the calculated
max graph capture size 88 from the valid size list but does not reset
max_cudagraph_capture_size to the next valid number. Thus, we need to add a
line to match them up.
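
A minimal sketch of the added matching step (illustrative names):

```python
def align_max_capture_size(valid_sizes, max_cudagraph_capture_size):
    # Hedged sketch: if the previously calculated maximum (e.g. 88 for tp=16,
    # max_num_seqs=90) was dropped from the valid list, fall back to the next
    # valid size instead of keeping a value that is no longer capturable.
    if max_cudagraph_capture_size in valid_sizes:
        return max_cudagraph_capture_size
    return max(s for s in valid_sizes if s <= max_cudagraph_capture_size)
```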

Cherry-pick from main PR #7801

### Does this PR introduce _any_ user-facing change?
No. 

### How was this patch tested?
Full CI passed with this PR.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
2026-03-30 20:24:52 +08:00
linfeng-yuan
deceefd305 [releases/v0.18.0][bugfix][eplb] remove unnecessary weight_scale wrap behaviour (#7732)
### What this PR does / why we need it?
This PR simplifies the apply method in w8a8_dynamic.py by removing the
conditional logic that used fused_w1_scale and fused_w2_scale based on
fused_scale_flag. This redundant wrap behavior breaks EPLB in int8
quantization scenarios.

Cherry-picked from #7188. Note that only bugfix lines in that PR are
picked.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-30 16:16:03 +08:00
Yang Yuxi
e776d5c0f1 [Bugfix]v0.18.0 support FlashComm1 & DCP for Qwen (#7726)
### What this PR does / why we need it?
This PR backports the changes from #7673 ([Bugfix] support FlashComm1 &
DCP for Qwen) to the releases/v0.18.0 branch.

--------
Signed-off-by: Yang Yuxi <907276627@qq.com>
2026-03-29 15:59:19 +08:00
wangbj127
9cc41c9457 [v0.18.0][Bugfix][EAGLE] Fix FIA pad bug under max concurrency (#7754)
cherry picked from https://github.com/vllm-project/vllm-ascend/pull/7740
Fixes padding problems of FIA op under max concurrency.

- vLLM version: v0.18.0
- vLLM main:
35141a7eed

Signed-off-by: Wangbingjie <wangbj1207@126.com>
2026-03-29 12:23:44 +08:00
Wang Kunpeng
5df2ddd8db [v0.18.0][Bugfix]Fix Error "AttributeError: 'AscendCompressedTensorsConfig' obiect has no attribute 'enabling_fa_quant'" (#7748)
### What this PR does / why we need it?
cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/7736

**Error information**
When the quantized weights in CompressedTensors format of the kimi-k2
model are used, the following error is reported:
`AttributeError: 'AscendCompressedTensorsConfig' object has no attribute
'enabling_fa_quant'`

**Error Cause**
Currently, FA3 quantization supports only the weights of modelslim
quantization. The added methods are not defined in
AscendCompressedTensorsConfig.

**Solution**
Before invoking related methods, check whether the FA3 feature is
enabled.
Additionally, the unused `get_scaled_act_names` method and its
corresponding unit test have been removed.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing unit tests were updated by removing a deprecated test case, and
the refactored logic was reviewed for correctness.

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2026-03-28 17:03:56 +08:00
jack
f83cb0e6dc [Bugfix][Platform] Fix GLM47 tool-call finish backfill (#7710)
### What this PR does / why we need it?
This rebases the GLM47 tool-call parser fix onto `releases/v0.18.0`
after the MiniMax usage-accounting patch merged upstream on March 27,
2026.

It fixes OpenAI chat tool-call streaming for GLM47 by:
- draining terminal parser chunks that contain both the final argument
text and the closing `</tool_call>` suffix
- computing finish backfill from the tool argument bytes actually
emitted to the client, instead of trusting parser-internal buffered
state
- adding focused regression tests for finish backfill and terminal chunk
handling

### Does this PR introduce _any_ user-facing change?
Yes. GLM47 OpenAI-compatible streaming tool-call responses now emit
correct final chunks and argument payloads on `releases/v0.18.0`.

### How was this patch tested?
- `pytest -q tests/ut/patch/platform/test_patch_glm_tool_call_parser.py
tests/ut/patch/platform/test_patch_minimax_usage_accounting.py`
- `python -m pre_commit run --files
vllm_ascend/patch/platform/patch_glm_tool_call_parser.py
tests/ut/patch/platform/test_patch_glm_tool_call_parser.py
vllm_ascend/patch/platform/__init__.py vllm_ascend/patch/__init__.py`

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
2026-03-28 09:15:04 +08:00
SparrowMu
6fbd0049df [v0.18.0] Apply Eagle3 to MiniMax-M2.5 (#7619) (#7714)
### What this PR does / why we need it?
Apply Eagle3 to MiniMax-M2.5 to improve model performance. This will be
discarded after the Eagle3 weights for MiniMax-M2.5 are released and the
code change is accepted by the official repo:
https://github.com/vllm-project/vllm/pull/37512/changes
backport: #7619

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

Signed-off-by: limuyuan <limuyuan3@huawei.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
2026-03-27 18:33:29 +08:00