xc-llm-ascend/vllm_ascend/patch/__init__.py

#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# ----------------------------------------------------------------------------------
# This module manages the patches for vllm. There are two folders in this module:
# - platform: contains the patches applied before the worker starts. It's called by
#             `vllm_ascend.utils.adapt_patch(is_global_patch=True)` in the
#             `vllm_ascend.platform.NPUPlatform.pre_register_and_update()` function.
# - worker: contains the patches applied when the worker starts. It's called by
#           `vllm_ascend.utils.adapt_patch(is_global_patch=False)` in
#           each worker's `__init__` function.
#
# Within each kind of patch, there are three folders:
# - patch_0_10_0: contains the patches applied when the vllm version is 0.10.0.
# - patch_main: contains the patches applied when vllm is on the main branch.
# - patch_common: contains the patches applied for both 0.10.0 and the main branch.
#
# Whenever a new patch is added to vllm-ascend, please add its description to this file as well.
# ----------------------------------------------------------------------------------
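#
# A minimal sketch of the folder selection described above. The helper name
# `select_patch_packages`, the import order, and the rule that any version other
# than "0.10.0" counts as the main branch are assumptions made for illustration;
# the actual dispatch is done by `vllm_ascend.utils.adapt_patch`.

```python
def select_patch_packages(vllm_version: str, is_global_patch: bool) -> list[str]:
    """Return the patch packages to import, in order (illustrative only)."""
    # Platform patches are the "global" ones; worker patches are per worker.
    kind = "platform" if is_global_patch else "worker"
    # Assumption: anything that is not 0.10.0 is treated as the main branch.
    versioned = "patch_0_10_0" if vllm_version == "0.10.0" else "patch_main"
    base = f"vllm_ascend.patch.{kind}"
    # The version-specific folder plus the folder shared by both versions.
    return [f"{base}.{versioned}", f"{base}.patch_common"]


print(select_patch_packages("0.10.0", is_global_patch=True))
```

# Each returned name is a package that would then be imported for its side effects,
# since importing a patch module is what applies the patch.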

# What's Patched and how it works:
# --------------------------------
# * Platform Patch:
# =================
# ** File: platform/patch_common/patch_distributed.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.config.ParallelConfig.get_next_dp_init_port`
#    Why:
#       vllm doesn't support getting the port from the environment.
#    How:
#       Add the logic to get the port from the environment.
#    Related PR (if no, explain why):
#       A PR to vllm is needed to support getting the port from the environment.
#    Future Plan:
#       Remove this patch once vllm merges the change.
#
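# A minimal sketch of the monkey-patching technique behind the entry above. The
# `ParallelConfig` class here is a stand-in, not the real vllm class, and the
# variable name `EXAMPLE_DP_PORT` is hypothetical; the real patch may read a
# different variable and fall back differently.

```python
import os


class ParallelConfig:
    """Stand-in for the upstream config class (not the real vllm class)."""

    def __init__(self, data_parallel_master_port: int = 29500) -> None:
        self.data_parallel_master_port = data_parallel_master_port

    def get_next_dp_init_port(self) -> int:
        # Upstream-style behavior in this sketch: hand out increasing ports.
        port = self.data_parallel_master_port
        self.data_parallel_master_port += 1
        return port


# Keep a reference to the original so the patch can fall back to it.
_original_get_next_dp_init_port = ParallelConfig.get_next_dp_init_port


def patched_get_next_dp_init_port(self: ParallelConfig) -> int:
    # EXAMPLE_DP_PORT is a hypothetical name chosen for this sketch.
    env_port = os.environ.get("EXAMPLE_DP_PORT")
    if env_port is not None:
        return int(env_port)
    return _original_get_next_dp_init_port(self)


# Replace the method on the class so that every instance picks it up.
ParallelConfig.get_next_dp_init_port = patched_get_next_dp_init_port
```

# Because the method is replaced on the class, the patch has to be applied before
# any caller asks for a port, which is why it sits in the platform patches.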
# * Worker Patch:
# ===============
# ** File: worker/patch_common/patch_minicpm.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.model_executor.models.minicpm.MiniCPMAttention.forward`
#    Why:
#       The forward function of MiniCPMAttention in vllm converts the datatype
#       (original datatype --> float32) to ensure precision on CUDA.
#       However, float32 is not supported by the CANN rope op, so we keep this patch.
#    How:
#       Remove the dtype conversion operations in forward.
#    Related PR (if no, explain why):
#       No, this is only needed on NPU because of the rope op.
#    Future Plan:
#       Keep this patch in vllm-ascend.
#
# ** File: worker/patch_common/patch_distributed.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.distributed.parallel_state.GroupCoordinator`
#    Why:
#       vllm doesn't support all_to_all for GroupCoordinator.
#    How:
#       Add an all_to_all implementation for GroupCoordinator.
#    Related PR (if no, explain why):
#       A PR to vllm is needed to support all_to_all for GroupCoordinator.
#    Future Plan:
#       Remove this patch once vllm merges the change.
#
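# For reference, a single-process sketch of what an all_to_all exchange does. It
# only illustrates the semantics of the collective with plain lists; the function
# name `simulate_all_to_all` is hypothetical, and this is not the implementation
# added to GroupCoordinator.

```python
def simulate_all_to_all(send_buffers: list[list[str]]) -> list[list[str]]:
    """send_buffers[src][dst] is the chunk that rank `src` sends to rank `dst`."""
    world_size = len(send_buffers)
    # After the exchange, rank `dst` holds one chunk from every source rank,
    # ordered by source rank. This is a transpose of the send matrix.
    return [[send_buffers[src][dst] for src in range(world_size)]
            for dst in range(world_size)]


sent = [["a0", "a1"], ["b0", "b1"]]
print(simulate_all_to_all(sent))  # [['a0', 'b0'], ['a1', 'b1']]
```
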
# ** File: worker/patch_common/patch_sampler_gather_logprobs.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.v1.sample.sampler.Sampler.gather_logprobs`
#    Why:
#       We need to patch gather_logprobs so that it calls batched_count_greater_than
#       with backend=current_platform.simple_compile_backend.
#    How:
#       Patch gather_logprobs to call the new batched_count_greater_than.
#    Related PR (if no, explain why):
#       - https://github.com/vllm-project/vllm/pull/21591
#    Future Plan:
#       Revert it when vLLM merges #21591 and releases a new version.
#
# ** File: worker/patch_common/patch_linear.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.model_executor.layers.linear.RowParallelLinear`
#    Why:
#       We need to fuse matmul and allreduce in `RowParallelLinear`
#       to improve performance.
#    How:
#       Create a new class `AscendRowParallelLinear` that inherits from `RowParallelLinear`.
#       In this class, we override the `forward` method to use
#       torch_npu.npu_mm_all_reduce_base in place of the separate matmul and allreduce.
#    Related PR (if no, explain why):
#       - https://github.com/vllm-project/vllm-ascend/pull/1926
#    Future Plan:
#       Validate more models across a range of scenarios. If performance consistently
#       improves, we can enable this patch by default and remove the environment
#       variable `VLLM_ASCEND_ENABLE_FUSE_MATMUL_ALLREDUCE` in the future.
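#
# A minimal sketch of the subclass-and-override pattern described above, gated by
# the environment variable. Both classes are stand-ins that return strings instead
# of tensors, the helper `pick_row_parallel_linear` is hypothetical, and treating
# the value "1" as "enabled" is an assumption about how the variable is parsed.

```python
import os
from typing import Mapping, Optional


class RowParallelLinear:
    """Stand-in for the upstream layer; the real class operates on tensors."""

    def forward(self, x: object) -> str:
        return "matmul, then allreduce"


class AscendRowParallelLinear(RowParallelLinear):
    def forward(self, x: object) -> str:
        # The real override would call torch_npu.npu_mm_all_reduce_base here.
        return "fused matmul + allreduce"


def pick_row_parallel_linear(env: Optional[Mapping[str, str]] = None) -> type:
    """Choose the layer class based on the feature switch (illustrative only)."""
    env = os.environ if env is None else env
    # Assumption: the feature is on only when the variable is exactly "1".
    enabled = env.get("VLLM_ASCEND_ENABLE_FUSE_MATMUL_ALLREDUCE", "0") == "1"
    return AscendRowParallelLinear if enabled else RowParallelLinear


layer_cls = pick_row_parallel_linear({"VLLM_ASCEND_ENABLE_FUSE_MATMUL_ALLREDUCE": "1"})
print(layer_cls().forward(None))  # fused matmul + allreduce
```

# Keeping the fused path behind a switch means the default behavior is unchanged
# until the validation mentioned in the future plan is complete.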