xc-llm-ascend/vllm_ascend/patch/__init__.py

#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# ----------------------------------------------------------------------------------
# This module manages the patches for vLLM. It contains two folders:
# - platform: contains the patches applied before worker starts. It's called by
#             `vllm_ascend.utils.adapt_patch(is_global_patch=True)` in
#             `vllm_ascend.platform.NPUPlatform.pre_register_and_update()` function.
# - worker: contains the patches applied when worker starts. It's called by
#           `vllm_ascend.utils.adapt_patch(is_global_patch=False)` in
#           each worker's `__init__` function.
#
# Each kind of patch has three folders:
# - patch_0_10_0: contains the patches applied when the vLLM version is 0.10.0.
# - patch_main: contains the patches applied when vLLM is on the main branch.
# - patch_common: contains the patches applied for both 0.10.0 and the main branch.
#
# When you add a new patch to vllm-ascend, add its description to this file as well.
# ----------------------------------------------------------------------------------
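# Illustration only (not part of this module): a minimal sketch of how the folder
# layout described above could map to the packages that get imported. The helper
# name `select_patch_packages`, its parameters, and the import order are
# assumptions made for this example; the real dispatch lives in
# `vllm_ascend.utils.adapt_patch` and may differ.

```python
def select_patch_packages(vllm_version: str, is_global_patch: bool) -> list[str]:
    """Return the patch packages to import, in order (hypothetical helper)."""
    # Platform patches are applied before the worker starts, worker patches
    # when each worker is initialised.
    kind = "platform" if is_global_patch else "worker"
    base = f"vllm_ascend.patch.{kind}"
    # Anything that is not the pinned release is treated as the main branch
    # in this sketch.
    versioned = "patch_0_10_0" if vllm_version == "0.10.0" else "patch_main"
    # Assumption: common patches are loaded first, then the versioned ones.
    return [f"{base}.patch_common", f"{base}.{versioned}"]
```

# Each returned name could then be passed to `importlib.import_module`, so that
# importing the package applies its patches as a side effect.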

# What's Patched and how it works:
# --------------------------------
# * Platform Patch:
# =================
# ** File: platform/patch_common/patch_distributed.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.config.ParallelConfig.get_next_dp_init_port`
#    Why:
#       vLLM doesn't support reading the port from the environment.
#    How:
#       Add the logic to read the port from the environment.
#    Related PR (if no, explain why):
#       A PR to vLLM is needed to support reading the port from the environment.
#    Future Plan:
#       Remove this patch once vLLM merges that support.
#   2. `torch.distributed.all_reduce`, `torch.distributed.broadcast`
#    Why:
#       Tensor alignment is needed for 310P.
#    How:
#       Rewrite all_reduce and broadcast in torch.distributed.
#    Related PR (if no, explain why):
#       No, not ready yet.
#    Future Plan:
#       Find a better way to support tensor alignment for 310p without this patch.
#
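# Illustration only: a minimal sketch of the monkey-patching pattern behind
# patch 1 above, using a stand-in class rather than vLLM's real `ParallelConfig`.
# The environment variable name `EXAMPLE_DP_INIT_PORT` and the fallback behaviour
# are assumptions made for this example, not the names used by vllm-ascend.

```python
import os


class ParallelConfig:
    """Stand-in for the real config class; not vLLM's implementation."""

    def __init__(self, master_port: int = 29500) -> None:
        self.data_parallel_master_port = master_port

    def get_next_dp_init_port(self) -> int:
        # Hand out the current port and advance the counter.
        port = self.data_parallel_master_port
        self.data_parallel_master_port += 1
        return port


# Keep a reference to the original method so the patch can fall back to it.
_original_get_next_dp_init_port = ParallelConfig.get_next_dp_init_port


def get_next_dp_init_port_from_env(self: ParallelConfig) -> int:
    # Hypothetical variable name; prefer the environment when it is set.
    env_port = os.environ.get("EXAMPLE_DP_INIT_PORT")
    if env_port is not None:
        return int(env_port)
    return _original_get_next_dp_init_port(self)


# Apply the patch by rebinding the attribute on the class.
ParallelConfig.get_next_dp_init_port = get_next_dp_init_port_from_env
```

# Keeping the original method around makes the patch easy to drop once the
# upstream behaviour is available.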
# ** File: platform/patch_common/patch_multimodal_merge.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.model_executor.models.utils._merge_multimodal_embeddings`
#    Why:
#       The `_merge_multimodal_embeddings` function in vLLM is incompatible with Ascend.
#    How:
#       Replace it with a CPU operation that can be executed asynchronously.
#    Related PR (if no, explain why):
#       This is an Ascend-only bug. It can't be fixed in vLLM.
#    Future Plan:
#       Identify this pattern in torch-npu and remove this patch.
#
# * Worker Patch:
# ===============
# ** File: worker/patch_common/patch_minicpm.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.model_executor.models.minicpm.MiniCPMAttention.forward`
#    Why:
#       The forward function of MiniCPMAttention in vLLM converts the dtype
#       (original dtype --> float32) to preserve precision on CUDA.
#       However, float32 is not supported by the CANN rope op, so we keep this patch.
#    How:
#       Remove the dtype conversions in forward.
#    Related PR (if no, explain why):
#       No, this is NPU-only because of the rope op.
#    Future Plan:
#       Keep this patch in vllm-ascend.
#
# ** File: worker/patch_common/patch_distributed.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.distributed.parallel_state.GroupCoordinator`
#    Why:
#       vLLM doesn't support all_to_all for GroupCoordinator.
#    How:
#       Add an all_to_all implementation for GroupCoordinator.
#    Related PR (if no, explain why):
#       A PR to vLLM is needed to support all_to_all for GroupCoordinator.
#    Future Plan:
#       Remove this patch once vLLM merges that support.
#
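# Illustration only: a minimal sketch of adding an `all_to_all` method to an
# existing class from the outside. The `GroupCoordinator` below is a stand-in
# that only stores a world size, and the method simulates the exchange in a
# single process on nested lists; the real patch works on tensors through a
# distributed backend, which this example does not attempt to reproduce.

```python
class GroupCoordinator:
    """Stand-in for the real class; holds only a world size."""

    def __init__(self, world_size: int) -> None:
        self.world_size = world_size


def all_to_all(self: GroupCoordinator, send_chunks: list[list]) -> list[list]:
    """Simulate all_to_all in one process.

    send_chunks[src][dst] is the chunk that rank `src` sends to rank `dst`.
    The result is indexed as result[dst][src]: what rank `dst` receives
    from rank `src`.
    """
    n = self.world_size
    if len(send_chunks) != n or any(len(row) != n for row in send_chunks):
        raise ValueError("expected a world_size x world_size grid of chunks")
    # The exchange is a transpose of the send grid.
    return [[send_chunks[src][dst] for src in range(n)] for dst in range(n)]


# Attach the new method to the class, as a patch module would do at import time.
GroupCoordinator.all_to_all = all_to_all
```

# For two ranks, rank 0 ends up with the first chunk from every rank and rank 1
# with the second chunk from every rank.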
# ** File: worker/patch_common/patch_sampler_gather_logprobs.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.v1.sample.sampler.Sampler.gather_logprobs`
#    Why:
#       gather_logprobs must be patched so that it calls batched_count_greater_than
#       with backend=current_platform.simple_compile_backend.
#    How:
#       Patch gather_logprobs to call the new batched_count_greater_than.
#    Related PR (if no, explain why):
#       - https://github.com/vllm-project/vllm/pull/21591
#    Future Plan:
#       Revert this patch once vLLM merges #21591 and releases a new version.
#
# ** File: worker/patch_common/patch_logits.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm._custom_ops.apply_repetition_penalties`
#    Why:
#       apply_repetition_penalties in vLLM uses tensor.is_cuda to check whether the tensor is on CUDA,
#       but the value is always True on Ascend, so we need to patch apply_repetition_penalties.
#    How:
#       Remove the related CUDA check in apply_repetition_penalties.
#    Related PR (if no, explain why):
#       - This is an Ascend-only bug. It can't be fixed in vLLM.
#    Future Plan:
#       Fix this bug in torch-npu, bump the torch-npu version, and remove this patch.
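#
# Illustration only: a minimal sketch of a penalty routine with no device check,
# written on plain Python lists instead of tensors. It assumes the common
# repetition-penalty rule (positive logits of already-seen tokens are divided by
# the penalty, non-positive ones are multiplied); the parameter names are
# hypothetical and this is not the signature of the patched vLLM function.

```python
def apply_repetition_penalties(
    logits: list[float], seen_mask: list[bool], penalty: float
) -> list[float]:
    """Penalise already-seen tokens in place, with no device-specific branch."""
    for i, seen in enumerate(seen_mask):
        if not seen:
            continue
        # Dividing a positive logit or multiplying a non-positive one both
        # make the token less likely when penalty > 1.
        if logits[i] > 0:
            logits[i] = logits[i] / penalty
        else:
            logits[i] = logits[i] * penalty
    return logits
```

# Because there is a single code path, the result does not depend on what a
# device flag such as `is_cuda` reports.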