xc-llm-ascend/vllm_ascend/patch/__init__.py

#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# ----------------------------------------------------------------------------------
# This module manages the patches for vLLM. There are two folders in this module:
# - platform: contains the patches applied before the worker starts. It's called by
#             `vllm_ascend.utils.adapt_patch(is_global_patch=True)` in the
#             `vllm_ascend.platform.NPUPlatform.pre_register_and_update()` function.
# - worker: contains the patches applied when the worker starts. It's called by
#           `vllm_ascend.utils.adapt_patch(is_global_patch=False)` in
#           each worker's `__init__` function.
#
# Once a new patch is added to vllm-ascend, please add the patch description to this file as well.
# ----------------------------------------------------------------------------------
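
# A minimal sketch of the two-phase patching described above, assuming plain
# attribute rebinding (monkey patching). `UpstreamConfig`, `UpstreamSampler`
# and `adapt_patch_sketch` are hypothetical stand-ins, not the real vLLM
# classes or the real `adapt_patch` implementation.

```python
class UpstreamConfig:
    """Stand-in for an upstream class that is patched before workers start."""

    def next_port(self):
        return 29500


class UpstreamSampler:
    """Stand-in for an upstream class that is patched inside each worker."""

    def gather(self, values):
        return sorted(values)


def _patched_next_port(self):
    # Illustrative replacement applied in the "platform" phase.
    return 40000


def _patched_gather(self, values):
    # Illustrative replacement applied in the "worker" phase.
    return sorted(values, reverse=True)


def adapt_patch_sketch(is_global_patch=False):
    """Rebind upstream attributes; which set depends on the phase."""
    if is_global_patch:
        UpstreamConfig.next_port = _patched_next_port
    else:
        UpstreamSampler.gather = _patched_gather


# Platform phase: only the config class changes.
adapt_patch_sketch(is_global_patch=True)
port_after_platform = UpstreamConfig().next_port()
order_after_platform = UpstreamSampler().gather([1, 3, 2])

# Worker phase: the sampler class changes as well.
adapt_patch_sketch(is_global_patch=False)
order_after_worker = UpstreamSampler().gather([1, 3, 2])
```

# Rebinding on the class (not on an instance) means every object created
# later, anywhere in the process, picks up the replacement.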

# What's Patched and how it works:
# --------------------------------
# * Platform Patch:
# =================
# ** File: platform/patch_distributed.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.config.ParallelConfig.get_next_dp_init_port`
#    Why:
#       vLLM doesn't support getting the port from the environment.
#    How:
#       Add the logic to get the port from the environment.
#    Related PR (if no, explain why):
#       Need a PR to vLLM to support getting the port from the environment.
#    Future Plan:
#       Remove this patch when vLLM merges the change.
#   2. `torch.distributed.all_reduce`, `torch.distributed.broadcast`
#    Why:
#       Tensor alignment for 310p.
#    How:
#       Rewrite all_reduce and broadcast in torch.distributed.
#    Related PR (if no, explain why):
#       No, not ready yet.
#    Future Plan:
#       Find a better way to support tensor alignment for 310p without this patch.
#
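# A minimal sketch of the idea behind the `get_next_dp_init_port` patch above:
# prefer a port taken from the environment and fall back to an internal
# counter. The variable name `EXAMPLE_DP_MASTER_PORT`, the default port and
# the class `ParallelConfigSketch` are assumptions made for illustration; they
# are not taken from vLLM or vllm-ascend.

```python
import os

PORT_ENV_VAR = "EXAMPLE_DP_MASTER_PORT"  # hypothetical variable name
DEFAULT_DP_PORT = 29500                  # hypothetical default


class ParallelConfigSketch:
    """Hands out ports, letting the environment override the counter."""

    def __init__(self, base_port=DEFAULT_DP_PORT):
        self._next_port = base_port

    def get_next_dp_init_port(self):
        env_value = os.environ.get(PORT_ENV_VAR)
        if env_value is not None:
            # The environment wins; the internal counter is left untouched.
            return int(env_value)
        port = self._next_port
        self._next_port += 1
        return port
```

# With the variable unset, successive calls return increasing ports; once it
# is set, every call returns the configured value.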
# ** File: worker/patch_multimodal_merge.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.model_executor.models.utils._merge_multimodal_embeddings`
#    Why:
#       The `_merge_multimodal_embeddings` function of vLLM is incompatible with Ascend.
#    How:
#       Replace it with a CPU operation that can be executed asynchronously.
#    Related PR (if no, explain why):
#       This is an Ascend-only bug. It can't be fixed in vLLM.
#    Future Plan:
#       Identify this pattern in torch-npu and remove this patch.
#
# * Worker Patch:
# ===============
# ** File: worker/patch_minicpm.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.model_executor.models.minicpm.MiniCPMAttention.forward`
#    Why:
#       The forward function of MiniCPMAttention in vLLM converts the datatype
#       (original datatype --> float32) to ensure precision on CUDA.
#       However, float32 is not supported by the CANN rope op, so we keep this patch.
#    How:
#       Remove the dtype conversion operations in forward.
#    Related PR (if no, explain why):
#       No, this is NPU-only because of the rope op.
#    Future Plan:
#       Keep this patch in vllm-ascend.
#
# ** File: worker/patch_distributed.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.distributed.parallel_state.GroupCoordinator`
#   (1) __init__()
#    Why:
#       The original GroupCoordinator initialization lacks pg_options for creating a new
#       process group with customized options.
#    How:
#       Inject HCCL options during process group initialization.
#    Related PR (if no, explain why):
#       Need a PR to vLLM to support a dictionary as input while initializing the distributed
#       environment (e.g., Dict[str, torch.distributed.ProcessGroupHCCL.Options])
#       https://github.com/vllm-project/vllm/pull/25417
#    Future Plan:
#       Remove this patch when vLLM merges this PR.
#   (2) all_to_all()
#    Why:
#       vLLM doesn't support all_to_all for GroupCoordinator.
#    How:
#       Add an all_to_all implementation for GroupCoordinator.
#    Related PR (if no, explain why):
#       Need a PR to vLLM to support all_to_all for GroupCoordinator.
#    Future Plan:
#       Remove this patch when vLLM merges the change.
#
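# For readers unfamiliar with the collective named above, here is a
# single-process simulation of all_to_all semantics. It only illustrates the
# data movement (each rank sends one chunk to every rank); it is not the
# GroupCoordinator implementation, and `all_to_all_sim` is a hypothetical name.

```python
def all_to_all_sim(send_buffers):
    """Simulate all_to_all across `len(send_buffers)` ranks.

    send_buffers[src][dst] is the chunk that rank `src` sends to rank `dst`.
    The result satisfies recv[dst][src] == send_buffers[src][dst].
    """
    world_size = len(send_buffers)
    for chunks in send_buffers:
        if len(chunks) != world_size:
            raise ValueError("each rank must provide one chunk per rank")
    return [
        [send_buffers[src][dst] for src in range(world_size)]
        for dst in range(world_size)
    ]
```

# In effect the exchange transposes the (source, destination) grid of chunks.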
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.v1.sample.sampler.Sampler.gather_logprobs`
#    Why:
#       We need to patch gather_logprobs to make sure it calls batched_count_greater_than
#       with backend=current_platform.simple_compile_backend.
#    How:
#       Patch gather_logprobs to call the new batched_count_greater_than.
#    Related PR (if no, explain why):
#       - https://github.com/vllm-project/vllm/pull/21591
#    Future Plan:
#       Revert it when vLLM merges #21591 and releases a new version.
# ** File: worker/patch_logits.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm._custom_ops.apply_repetition_penalties`
#    Why:
#       apply_repetition_penalties in vLLM uses tensor.is_cuda to check whether the tensor is on CUDA,
#       but the value is always True on Ascend, so we need to patch apply_repetition_penalties.
#    How:
#       Remove the related CUDA check in apply_repetition_penalties.
#    Related PR (if no, explain why):
#       - This is an Ascend-only bug. It can't be fixed in vLLM.
#    Future Plan:
#       Fix this bug in torch-npu, bump the torch-npu version and remove this patch.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.model_executor.models.roberta.RobertaEmbedding.forward`
#    Why:
#       The shift operation in `_encode_token_type_ids` and `_decode_token_type_ids` cannot run in Ascend aclgraph mode.
#    How:
#       Replace the shift operation with multiplication and division.
#    Related PR (if no, explain why):
#       No, this needs CANN to add an aclnn shift operation.
#    Future Plan:
#       Revert this when CANN supports an aclnn shift operation.
#   2. `vllm.model_executor.models.roberta.RobertaForSequenceClassification.forward`
#    Why:
#       The shift operation in `_encode_token_type_ids` and `_decode_token_type_ids` cannot run in Ascend aclgraph mode.
#    How:
#       Replace the shift operation with multiplication and division.
#    Related PR (if no, explain why):
#       No, this needs CANN to add an aclnn shift operation.
#    Future Plan:
#       Revert this when CANN supports an aclnn shift operation.
#
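# The shift replacement described above rests on a simple identity for
# non-negative integers: `x << k == x * 2**k` and `x >> k == x // 2**k`.
# The sketch below packs a token type id into the high bits of an input id
# both ways. It uses plain Python ints, and the shift width of 30 and all
# function names are assumptions for illustration, not the patched code.

```python
TOKEN_TYPE_SHIFT = 30          # assumed shift width
SCALE = 2 ** TOKEN_TYPE_SHIFT  # multiplier equivalent to the shift


def encode_with_shift(input_id, token_type_id):
    """Reference packing that relies on bit shifts."""
    return input_id | (token_type_id << TOKEN_TYPE_SHIFT)


def encode_without_shift(input_id, token_type_id):
    """Same packing with multiplication; valid when 0 <= input_id < SCALE."""
    return input_id + token_type_id * SCALE


def decode_without_shift(packed):
    """Recover (input_id, token_type_id) using floor division only."""
    token_type_id = packed // SCALE
    input_id = packed - token_type_id * SCALE
    return input_id, token_type_id
```

# The addition can stand in for the bitwise OR only because the two fields
# occupy disjoint bit ranges, which is why `input_id` must stay below `SCALE`.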
# ** File: worker/patch_deepseek_mtp.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.model_executor.models.deepseek_mtp.DeepSeekMultiTokenPredictorLayer.__init__`
#    Why:
#       The `__init__` function of DeepSeekMultiTokenPredictorLayer doesn't pass prefix to SharedHead.
#    How:
#       Replace it with a new __init__.
#       Use a new SharedHead which passes prefix to ParallelLMHead.
#    Related PR (if no, explain why):
#       https://github.com/vllm-project/vllm/pull/25805
#    Future Plan:
#       Remove this patch when the adapted vLLM version contains the above PR.
#
# ** File: worker/patch_attention_layer.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.attention.layer.Attention.forward`
#    Why:
#       There is a zeros_like operator before the attention operation in each decoding stage.
#    How:
#       Replace this zeros_like operator with torch.empty.
#    Related PR (if no, explain why):
#       - https://github.com/vllm-project/vllm/pull/26680
#    Future Plan:
#       Remove this patch once the adapted vLLM version includes this optimization.
#
-												port deepseekv2 and mtp to main branch (#429)

### What this PR does / why we need it?
This PR ports all the deepseek graph mode code and mtp code from v0.7.3
to the main branch
---------

Signed-off-by: SidaoY <1024863041@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com>
Signed-off-by: mengwei805 <mengwei25@huawei.com>
Signed-off-by: libaokui <libaokui@huawei.com>
Signed-off-by: q00832892 <qiaoyang19@huawei.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Co-authored-by: SidaoY <1024863041@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: Yizhou Liu <liuyizhou5@h-partners.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
Co-authored-by: libaokui <libaokui@huawei.com>
											
										
										
											2025-04-19 17:38:18 +08:00
+								#
 								# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
 								# This file is a part of the vllm-ascend project.
 								#
 								# Licensed under the Apache License, Version 2.0 (the "License");
 								# you may not use this file except in compliance with the License.
 								# You may obtain a copy of the License at
 								#
 								#     http://www.apache.org/licenses/LICENSE-2.0
 								#
 								# Unless required by applicable law or agreed to in writing, software
 								# distributed under the License is distributed on an "AS IS" BASIS,
 								# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 								# See the License for the specific language governing permissions and
 								# limitations under the License.
-												[Patch] format patch module to make it more clear (#601)

Format patch module to make it more clear. 
Add the patch doc description, the new patch must follow this guide.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-04-22 14:13:00 +08:00
 								# ----------------------------------------------------------------------------------
 								# This module manage the patch for vllm. There are two folders in this module:
 								# - platform: contains the patches applied before worker starts. It's called by
 								#             `vllm_ascend.utils.adapt_patch(is_global_patch=True)` in
 								#             `vllm_ascend.platform.NPUPlatform.pre_register_and_update()` function.
 								# - worker: contains the patches applied when worker starts. It's called by
 								#           `vllm_ascend.utils.adapt_patch(is_global_patch=False)` in
 								#           each worker's `__init__` function.
 								#
 								# Once a new patch is added in vllm-ascend, please add the patch description into this file as well.
 								# ----------------------------------------------------------------------------------
 								# What's Patched and how it works:
 								# --------------------------------
 								# * Platform Patch:
 								# =================
-												[Refactor] refactor patch module (#3555)

### What this PR does / why we need it?
we notice that `patch_main` is never used. Usually the patch is for all
version. And if it's for specified version, we can use `vllm_version_is`
instead. So let's remove the useless sub folder in patch module to make
it clear.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-10-21 20:19:46 +08:00
+								# ** File: platform/patch_distributed.py**
-												[Patch] format patch module to make it more clear (#601)

Format patch module to make it more clear. 
Add the patch doc description, the new patch must follow this guide.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-04-22 14:13:00 +08:00
+								# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-												[Dist][EP] Remove ETP/EP maintained in vllm-ascend (#1681)

### What this PR does / why we need it?
Remove ETP/EP maintained in branch main. We drop this as there is no
relevant scenarios to use ETP now, and we may subsequently advocate
implementing expert tensor parallelism in vLLM to support scenarios
where the expert is needed to be sliced

This is a part of #1422 backport.

Fixes https://github.com/vllm-project/vllm-ascend/issues/1396
https://github.com/vllm-project/vllm-ascend/issues/1154

### Does this PR introduce _any_ user-facing change?
We'll not maintain etp/ep in vllm-ascend anymore, and use the tp/ep in
vllm instead.

### How was this patch tested?
CI passed with new added and existing test.


- vLLM version: v0.9.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/fe8a2c544ad97119f4dafd316e5d9664521b73f9

Signed-off-by: MengqingCao <cmq0113@163.com>
											
										
										
											2025-07-21 09:08:04 +08:00
+								#   1. `vllm.config.ParallelConfig.get_next_dp_init_port`
-												[Patch] format patch module to make it more clear (#601)

Format patch module to make it more clear. 
Add the patch doc description, the new patch must follow this guide.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-04-22 14:13:00 +08:00
+								#    Why:
-												[CI] Run e2e after pre check pass (#1132)

Make sure the lint test passed before start the e2e test to save compute
resource.

Updated the patch doc to make sure the CI works as expect.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-06-10 17:18:09 +08:00
+								#       vllm doesn't support get port from environment.
-												[Patch] format patch module to make it more clear (#601)

Format patch module to make it more clear. 
Add the patch doc description, the new patch must follow this guide.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-04-22 14:13:00 +08:00
+								#    How：
-												[CI] Run e2e after pre check pass (#1132)

Make sure the lint test passed before start the e2e test to save compute
resource.

Updated the patch doc to make sure the CI works as expect.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-06-10 17:18:09 +08:00
+								#       Add the logic to get port from environment.
 								#    Related PR (if no, explain why):
 								#       Need a PR to vllm to support get port from environment.
-												[Patch] format patch module to make it more clear (#601)

Format patch module to make it more clear. 
Add the patch doc description, the new patch must follow this guide.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-04-22 14:13:00 +08:00
+								#    Future Plan:
-												[CI] Run e2e after pre check pass (#1132)

Make sure the lint test passed before start the e2e test to save compute
resource.

Updated the patch doc to make sure the CI works as expect.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-06-10 17:18:09 +08:00
+								#       Remove those patch when vllm merged them
-												[Misc] Move lora patch file into lora module (#2797)

Cleanup useless file in patch module. Update the lora support list is OK
in vLLM Ascend, no need to patch vLLM


- vLLM version: v0.10.1.1
- vLLM main:
https://github.com/vllm-project/vllm/commit/f4962a6d55a340ebb569d377c842deff7611d8f7

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-09-08 21:42:12 +08:00
+								#   2. `torch.distributed.all_reduce`, `torch.distributed.broadcast`
 								#    Why:
 								#       tensor alignment for 310p
 								#    How：
 								#       rewrite all_reduce and broadcast in torch.distributed
 								#    Related PR (if no, explain why):
 								#       No, not ready yet.
 								#    Future Plan:
 								#       Find a better way to support tensor alignment for 310p without this patch.
-												[Model][MiniCPM] support MiniCPM (#645)

### What this PR does / why we need it?
This pr support minicpm in branch main. see
https://github.com/vllm-project/vllm-ascend/pull/164


### How was this patch tested?
test locally with minicpm

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
											
										
										
											2025-04-27 11:27:24 +08:00
+								#
-												[Refactor] refactor patch module (#3555)

### What this PR does / why we need it?
we notice that `patch_main` is never used. Usually the patch is for all
version. And if it's for specified version, we can use `vllm_version_is`
instead. So let's remove the useless sub folder in patch module to make
it clear.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-10-21 20:19:46 +08:00
+								# ** File: worker/patch_multimodal_merge.py**
-												[Model][VLM][Patch]Modify ascend affinity _merge_multimodal_embeddings (#3071)

### What this PR does / why we need it?

This PR aims to address the incompatibility of the `.masked_scatter_`
operation in the current `_merge_multimodal_embeddings` function on
Ascend. For now, it reverts to the previous version of the CPU
operation, which can be executed asynchronously on the device side to
enhance performance.

- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/f225ea7dd98e9f29752e5c032cd4a8ee1d712f16

---------

Signed-off-by: booker123456 <945658361@qq.com>
											
										
										
											2025-09-24 10:25:28 +08:00
+								# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 								#   1. `vllm.model_executor.models.utils._merge_multimodal_embeddings`
 								#    Why:
 								#       '_merge_multimodal_embeddings' func of vllm is incompatible with Ascend.
 								#    How：
 								#       Replace with CPU operation that can be executed asynchronously.
 								#    Related PR (if no, explain why):
 								#       This is a bug by Ascend only. It can' be fixed in vLLM.
 								#    Future Plan:
 								#       Identify this pattern in torch-npu and remove this patch.
 								#
-												[Patch] format patch module to make it more clear (#601)

Format patch module to make it more clear. 
Add the patch doc description, the new patch must follow this guide.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-04-22 14:13:00 +08:00
+								# * Worker Patch:
 								# ===============
-												[Refactor] refactor patch module (#3555)

### What this PR does / why we need it?
we notice that `patch_main` is never used. Usually the patch is for all
version. And if it's for specified version, we can use `vllm_version_is`
instead. So let's remove the useless sub folder in patch module to make
it clear.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-10-21 20:19:46 +08:00
+								# ** File: worker/patch_minicpm.py **
-												[Model][MiniCPM] support MiniCPM (#645)

### What this PR does / why we need it?
This pr support minicpm in branch main. see
https://github.com/vllm-project/vllm-ascend/pull/164


### How was this patch tested?
test locally with minicpm

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
											
										
										
											2025-04-27 11:27:24 +08:00
+								# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 								#   1. `vllm.model_executor.models.minicpm.MiniCPMAttention.forward`
 								#    Why:
 								#       The forward func of MiniCPMAttention in vllm do a datatype convert
 								#       (original datatype --> float32) to ensure the precision on cuda.
 								#       However float32 is not supported in cann rope op, thus we keep this patch
 								#    How：
 								#       Removed the dtype convert operations in forward
-												[CI] Run e2e after pre check pass (#1132)

Make sure the lint test passed before start the e2e test to save compute
resource.

Updated the patch doc to make sure the CI works as expect.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-06-10 17:18:09 +08:00
+								#    Related PR (if no, explain why):
-												[Model][MiniCPM] support MiniCPM (#645)

### What this PR does / why we need it?
This pr support minicpm in branch main. see
https://github.com/vllm-project/vllm-ascend/pull/164


### How was this patch tested?
test locally with minicpm

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
											
										
										
											2025-04-27 11:27:24 +08:00
+								#       NO, only for npu due to rope op.
 								#    Future Plan:
 								#       Keep this patch in vllm-ascend.
 								#
-												[Refactor] refactor patch module (#3555)

### What this PR does / why we need it?
we notice that `patch_main` is never used. Usually the patch is for all
version. And if it's for specified version, we can use `vllm_version_is`
instead. So let's remove the useless sub folder in patch module to make
it clear.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-10-21 20:19:46 +08:00
+								# ** File: worker/patch_distributed.py **
-												[CI] Run e2e after pre check pass (#1132)

Make sure the lint test passed before start the e2e test to save compute
resource.

Updated the patch doc to make sure the CI works as expect.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-06-10 17:18:09 +08:00
+								# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 								#   1. `vllm.distributed.parallel_state.GroupCoordinator`
-												[feat] support customized and separated hccl_buffer_size for process group initialization (#3073)

### What this PR does / why we need it?
Currently, users have to set `HCCL_BUFFSIZE` to 512~1024 to perform mc2
operators (dispatch and combine) while running moe models with large
`ep_size` and `batch_size`. This environmental variable not only affects
allocated VRAM for mc2 group, but also increases VRAM allocation for dp,
tp & ep groups, leading to significant kvcache and free_memory drops.
This PR supports to automatically calculate and set `hccl_buffer_size`
for each process group **(except mc2 group)** separately when users set
`HCCL_BUFFSIZE` for mc2 group. This can significantly reduce wasted
buffer_size set for dp, tp & ep groups.

Note that current mc2 operators can only perform communication space
partitioning based on `HCCL_BUFFSIZE` configuration. Once they support
`hccl_buffer_size` configuration with `pg_options` while initializing
process group, we'll caculate the required buffer size and users would
avoid set `HCCL_BUFFSIZE` themselves.

### Does this PR introduce _any_ user-facing change?
No. 

### How was this patch tested?
We performed E2E serving with deepseek_r1 initializing DP/TP/EP/MC2
process group and observed significant kv_cache and free_memory
increase!


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
											
										
										
											2025-10-11 15:55:22 +08:00
+								#   (1) __init__()
 								#    Why:
 								#       The original GroupCoordinator initialization lacks pg_options to generate new
 								#       process group with customized options.
 								#    How:
 								#       Inject HCCL options during process group initialization.
 								#    Related PR (if no, explain why):
 								#       Need a PR to vllm to support a dictionary as input while initializing distributed
 								#       environment (e.g., Dict[str, torch.distributed.ProcessGroupHCCL.Options])
 								#       https://github.com/vllm-project/vllm/pull/25417
 								#    Future Plan:
 								#       Remove this patch when vllm merges this PR.
 								#   (2) all_to_all()
-												[CI] Run e2e after pre check pass (#1132)

Make sure the lint test passed before start the e2e test to save compute
resource.

Updated the patch doc to make sure the CI works as expect.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-06-10 17:18:09 +08:00
+								#    Why:
 								#       vllm doesn't support all_to_all for GroupCoordinator.
 								#    How：
 								#       Add all_to_all implementation for GroupCoordinator.
 								#    Related PR (if no, explain why):
 								#       Need a PR to vllm to support all_to_all for GroupCoordinator.
 								#    Future Plan:
 								#       Remove this patch when vllm merged them.
 								#
-												Upgrade vLLM to v0.10.0 (#1927)

### What this PR does / why we need it?
- Upgrade to v0.10.0
- Drop v0.9.2 version compatibility
- Add patch for
`vllm_ascend/patch/worker/patch_common/patch_sampler_gather_logprobs.py`
as workaround of
https://github.com/vllm-project/vllm/commit/f3a683b7c9df8b251092e48e53d58220bb920f2c
for v0.10.0 and also add e2e test `test_models_prompt_logprobs`
- Pin transformers<4.54.0 as workaround of
https://github.com/vllm-project/vllm-ascend/issues/2034

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Test locally:
`VLLM_USE_MODELSCOPE=true pytest -sv
tests/e2e/singlecard/test_offline_inference.py::test_models_prompt_logprobs`
- CI passed

- vLLM version: v0.9.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/7728dd77bb802e1876012eb264df4d2fa2fc6f3c

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
											
										
										
											2025-07-26 15:43:29 +08:00
+								# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 								#   1. `vllm.v1.sample.sampler.Sampler.gather_logprobs`
 								#    Why:
 								#       We need to patch gather_logprobs to make sure call batched_count_greater_than
 								#       with backend=current_platform.simple_compile_backend
 								#    How：
 								#       Patch gather_logprobs call new batched_count_greater_than
 								#    Related PR (if no, explain why):
 								#       - https://github.com/vllm-project/vllm/pull/21591
 								#    Future Plan:
 								#       Revert it when vLLM merge #21591 and release new version
-												[Refactor] refactor patch module (#3555)

### What this PR does / why we need it?
we notice that `patch_main` is never used. Usually the patch is for all
version. And if it's for specified version, we can use `vllm_version_is`
instead. So let's remove the useless sub folder in patch module to make
it clear.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-10-21 20:19:46 +08:00
+								# ** File: worker/patch_logits.py **
-												[Feature]: implement the fusion of allreduce and matmul in prefill phase when tp is enabled (#1926)

### What this PR does / why we need it?
it'll execute allreduce and malmul seperately in vllm RowParallelLinear
forward funcion, this function use torch_npu.npu_mm_all_reduce_base to
execute allreduce and matmul in a fused kernel way. this will gain a 20%
performance
promotion in eager mode.
### Does this PR introduce _any_ user-facing change?
this PR introduce a new env `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE` to
control whether enable the feature or not.

### How was this patch tested?
the patch is tested by adding a new test file `test_patch_linear.py` to
guard the ut


- vLLM version: v0.10.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/7728dd77bb802e1876012eb264df4d2fa2fc6f3c

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
											
										
										
											2025-07-28 15:13:37 +08:00
+								# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-												[Misc] Move lora patch file into lora module (#2797)

Cleanup useless file in patch module. Update the lora support list is OK
in vLLM Ascend, no need to patch vLLM


- vLLM version: v0.10.1.1
- vLLM main:
https://github.com/vllm-project/vllm/commit/f4962a6d55a340ebb569d377c842deff7611d8f7

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-09-08 21:42:12 +08:00
+								#   1. `vllm._custom_ops.apply_repetition_penalties`
-												[Feature]: implement the fusion of allreduce and matmul in prefill phase when tp is enabled (#1926)

### What this PR does / why we need it?
it'll execute allreduce and malmul seperately in vllm RowParallelLinear
forward funcion, this function use torch_npu.npu_mm_all_reduce_base to
execute allreduce and matmul in a fused kernel way. this will gain a 20%
performance
promotion in eager mode.
### Does this PR introduce _any_ user-facing change?
this PR introduce a new env `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE` to
control whether enable the feature or not.

### How was this patch tested?
the patch is tested by adding a new test file `test_patch_linear.py` to
guard the ut


- vLLM version: v0.10.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/7728dd77bb802e1876012eb264df4d2fa2fc6f3c

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
											
										
										
											2025-07-28 15:13:37 +08:00
+								#    Why:
-												[Misc] Move lora patch file into lora module (#2797)

Cleanup useless file in patch module. Update the lora support list is OK
in vLLM Ascend, no need to patch vLLM


- vLLM version: v0.10.1.1
- vLLM main:
https://github.com/vllm-project/vllm/commit/f4962a6d55a340ebb569d377c842deff7611d8f7

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-09-08 21:42:12 +08:00
+								#       apply_repetition_penalties in vLLM use tensor.is_cuda to check if tensor is on cuda. But the value is always True
 								#       on ascend, thus we need to patch apply_repetition_penalties.
-												[Feature]: implement the fusion of allreduce and matmul in prefill phase when tp is enabled (#1926)

### What this PR does / why we need it?
it'll execute allreduce and malmul seperately in vllm RowParallelLinear
forward funcion, this function use torch_npu.npu_mm_all_reduce_base to
execute allreduce and matmul in a fused kernel way. this will gain a 20%
performance
promotion in eager mode.
### Does this PR introduce _any_ user-facing change?
this PR introduce a new env `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE` to
control whether enable the feature or not.

### How was this patch tested?
the patch is tested by adding a new test file `test_patch_linear.py` to
guard the ut


- vLLM version: v0.10.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/7728dd77bb802e1876012eb264df4d2fa2fc6f3c

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
											
										
										
											2025-07-28 15:13:37 +08:00
+								#    How：
-												[Misc] Move lora patch file into lora module (#2797)

Cleanup useless file in patch module. Update the lora support list is OK
in vLLM Ascend, no need to patch vLLM


- vLLM version: v0.10.1.1
- vLLM main:
https://github.com/vllm-project/vllm/commit/f4962a6d55a340ebb569d377c842deff7611d8f7

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-09-08 21:42:12 +08:00
+								#       Remove the related cuda check in apply_repetition_penalties.
+								#    Related PR (if no, explain why):
+								#       - This bug is specific to Ascend, so it can't be fixed in vLLM.
+								#    Future Plan:
+								#       Fix this bug in torch-npu, bump torch-npu version and remove this patch.
-												[Feat] Supports Aclgraph for bge-m3 (#3171)

### What this PR does / why we need it?
[Feat] Supports Aclgraph for bge-m3

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```
pytest -s tests/e2e/singlecard/test_embedding.py
pytest -s tests/e2e/singlecard/test_embedding_aclgraph.py
```
To start an online server with batch size 10 and a sequence length of 8192
per batch, we set `--max-num-batched-tokens=8192*10` so that the encoder
input is not chunked:
```
vllm serve /home/data/bge-m3 --max_model_len 1024 --served-model-name "bge-m3" --task embed --host 0.0.0.0 --port 9095 --max-num-batched-tokens 81920 --compilation-config '{"cudagraph_capture_sizes":[8192, 10240, 20480, 40960, 81920]}'
```
For batch size 10 with a sequence length of 8192 per batch, QPS improves
from 85 to 104, a 22% improvement, because much of the host-bound overhead
is removed.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
Co-authored-by: wangyongjun <1104133197@qq.com>
											
										
										
											2025-10-14 23:07:45 +08:00
+								# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 								#   1. `vllm.model_executor.models.roberta.RobertaEmbedding.forward`
 								#    Why:
 								#       shift operation in `_encode_token_type_ids` and `_decode_token_type_ids` cannot run in ascend aclgraph mode
 								#    How：
 								#       Replace shift operation with multiplication and division.
 								#    Related PR (if no, explain why):
 								#       No, this needs CANN to add an aclnn shift operation.
 								#    Future Plan:
 								#       Revert this when CANN supports the aclnn shift operation.
 								#   2. `vllm.model_executor.models.roberta.RobertaForSequenceClassification.forward`
 								#    Why:
 								#       shift operation in `_encode_token_type_ids` and `_decode_token_type_ids` cannot run in ascend aclgraph mode
 								#    How：
 								#       Replace shift operation with multiplication and division.
 								#    Related PR (if no, explain why):
 								#       No, this needs CANN to add an aclnn shift operation.
 								#    Future Plan:
 								#       Revert this when CANN supports the aclnn shift operation.
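The shift-free rewrite described for the roberta patch can be illustrated with plain Python integers. This is only a sketch of the arithmetic, not the patch itself: the shift width `TOKEN_TYPE_SHIFT = 30` and all function names below are assumptions made for illustration, and the equivalence holds only for non-negative values where the token id fits below the shifted bits.

```python
# Hypothetical shift width, chosen for illustration only.
TOKEN_TYPE_SHIFT = 30
SCALE = 2 ** TOKEN_TYPE_SHIFT  # multiplying by SCALE equals shifting left


def encode_with_shift(input_id: int, token_type_id: int) -> int:
    """Pack the token type into the high bits using a left shift."""
    return input_id | (token_type_id << TOKEN_TYPE_SHIFT)


def encode_without_shift(input_id: int, token_type_id: int) -> int:
    """Same packing, using multiplication instead of a shift.

    Valid when 0 <= input_id < SCALE, so addition cannot carry into
    the bits that hold the token type.
    """
    return input_id + token_type_id * SCALE


def decode_with_shift(packed: int) -> tuple[int, int]:
    """Unpack (input_id, token_type_id) using a mask and a right shift."""
    return packed & (SCALE - 1), packed >> TOKEN_TYPE_SHIFT


def decode_without_shift(packed: int) -> tuple[int, int]:
    """Same unpacking, using floor division instead of a shift."""
    token_type_id = packed // SCALE
    input_id = packed - token_type_id * SCALE
    return input_id, token_type_id
```

For non-negative inputs both variants produce identical results, which is why the replacement is safe as long as the packed values stay in that range.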
-												[BugFix][v0.11.0] Fix quantization related mtp bug with patch (#3619)

vLLM 0.11.0 does not include PR
(https://github.com/vllm-project/vllm/pull/25805), so the prefix of MTP's
SharedHead is missing. This PR fixes the bug with a patch to vLLM's
deepseek_mtp.

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
											
										
										
											2025-10-22 23:06:09 +08:00
+								#
 								# ** File: worker/patch_deepseek_mtp.py**
 								# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 								#   1. `vllm.model_executor.models.deepseek_mtp.DeepSeekMultiTokenPredictorLayer.__init__`
 								#    Why:
 								#       The `__init__` function of DeepSeekMultiTokenPredictorLayer doesn't pass the prefix to SharedHead.
 								#    How：
 								#       Replace with a new __init__.
 								#       Use a new SharedHead which passes prefix to ParallelLMHead.
 								#    Related PR (if no, explain why):
 								#       https://github.com/vllm-project/vllm/pull/25805
 								#    Future Plan:
 								#       Remove this patch when the adapted vLLM version contains the above PR.
 								#
-												[v0.11.0][Perf] Eliminating the zerolike operator through patch (#3632)

### What this PR does / why we need it?
There is a zero-like operator before the attention operation in each
decoding stage. After analysis, this operator can be eliminated. The
purpose of this PR is to remove this operator and improve performance.

---------

Signed-off-by: ZYang6263 <zy626375@gmail.com>
											
										
										
											2025-10-23 14:49:28 +08:00
+								# ** File: worker/patch_attention_layer.py **
 								# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 								#   1. `vllm.attention.layer.Attention.forward`
 								#    Why:
 								#       There is a zerolike operator before the attention operation in each decoding stage.
 								#    How：
 								#       Replace this zerolike operator with torch.empty.
 								#    Related PR (if no, explain why):
 								#       - https://github.com/vllm-project/vllm/pull/26680
 								#    Future Plan:
 								#       Remove this patch when the adapted vLLM version contains the above PR.
 								#