xc-llm-ascend/vllm_ascend/patch/worker/__init__.py
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
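"""Worker-side patches for running vLLM on Ascend NPUs.

Importing each ``patch_*`` submodule below applies its patch to the
corresponding vLLM worker component as an import side effect.
"""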
from vllm.triton_utils import HAS_TRITON

from vllm_ascend.utils import vllm_version_is

if HAS_TRITON:
    import vllm_ascend.patch.worker.patch_triton  # noqa

if not vllm_version_is("v0.16.0"):
    import vllm_ascend.patch.worker.patch_qwen3_5  # noqa
# isort: off
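# isort is disabled above so that the patch modules below are imported (and
# therefore applied) in exactly the order listed.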
import vllm_ascend.patch.platform.patch_sched_yield # noqa
import vllm_ascend.patch.worker.patch_unquantized_gemm # noqa
import vllm_ascend.patch.worker.patch_bert # noqa
import vllm_ascend.patch.worker.patch_distributed # noqa
import vllm_ascend.patch.worker.patch_minimax_m2 # noqa
import vllm_ascend.patch.worker.patch_minimax_m2_linear_attn # noqa
import vllm_ascend.patch.worker.patch_multimodal_merge # noqa
import vllm_ascend.patch.worker.patch_qwen3_next # noqa
import vllm_ascend.patch.worker.patch_qwen3_next_mtp # noqa
import vllm_ascend.patch.worker.patch_rejection_sampler # noqa
import vllm_ascend.patch.worker.patch_v2_eagle # noqa
import vllm_ascend.patch.worker.patch_v2_uva # noqa
import vllm_ascend.patch.worker.patch_huanyuan_vl # noqa
import vllm_ascend.patch.worker.patch_routed_experts_capturer # noqa
import vllm_ascend.patch.worker.patch_npugraph_ex_triton # noqa
import vllm_ascend.patch.worker.patch_kimi_k25 # noqa
import vllm_ascend.patch.worker.patch_draft_quarot # noqa
import vllm_ascend.patch.worker.patch_cudagraph # noqa