#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ----------------------------------------------------------------------------------
# This module manages the patches applied to vLLM. There are two folders in
# this module:
# - platform: contains the patches applied before the worker starts. They are
#   loaded by `vllm_ascend.utils.adapt_patch(is_global_patch=True)` in the
#   `vllm_ascend.platform.NPUPlatform.pre_register_and_update()` function.
# - worker: contains the patches applied when the worker starts. They are
#   loaded by `vllm_ascend.utils.adapt_patch(is_global_patch=False)` in each
#   worker's `__init__` function.
#
# Once a new patch is added to vllm-ascend, please add its description to this
# file as well.
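#
# For orientation, a minimal sketch of the two-phase dispatch described above
# (the real `vllm_ascend.utils.adapt_patch` may differ in detail; the module
# names follow the folder layout of this package):
#
#     def adapt_patch(is_global_patch: bool = False) -> None:
#         if is_global_patch:
#             # Platform patches: applied once, before any worker starts.
#             from vllm_ascend.patch import platform  # noqa: F401
#         else:
#             # Worker patches: applied in each worker's __init__.
#             from vllm_ascend.patch import worker  # noqa: F401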
# ----------------------------------------------------------------------------------
# What's Patched and how it works:
# --------------------------------
# * Platform Patch:
# =================
# ** 1. File: platform/patch_distributed.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `torch.distributed.all_reduce`, `torch.distributed.broadcast`
#      Why:
#          The 310P device requires tensor alignment that the upstream
#          collectives do not guarantee.
#      How:
#          Rewrite `all_reduce` and `broadcast` in `torch.distributed` to
#          align the tensors before communication.
#      Related PR (if no, explain why):
#          No, not ready yet.
#      Future Plan:
#          Find a better way to support tensor alignment on 310P without
#          this patch.
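#      Example of the wrapping pattern (a sketch only: ALIGN_ELEMS is an
#      illustrative placeholder, and the real patch also handles `broadcast`
#      and async collectives):
#
#          import torch
#          import torch.distributed as dist
#
#          _orig_all_reduce = dist.all_reduce
#
#          def _aligned_all_reduce(tensor, *args, **kwargs):
#              ALIGN_ELEMS = 16  # illustrative; the real value is device-specific
#              pad = (-tensor.numel()) % ALIGN_ELEMS
#              if pad == 0:
#                  return _orig_all_reduce(tensor, *args, **kwargs)
#              # Pad to the alignment boundary, reduce, then copy back.
#              flat = torch.cat([tensor.view(-1), tensor.new_zeros(pad)])
#              _orig_all_reduce(flat, *args, **kwargs)
#              tensor.copy_(flat[: tensor.numel()].view_as(tensor))
#
#          dist.all_reduce = _aligned_all_reduce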
#
# ** 2. File: platform/patch_ec_connector.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.distributed.ec_transfer.ec_connector.shared_storage_connector.ECSharedStorageConnector.start_load_caches`
#      Why:
#          The device is hard-coded to cuda.
#      How:
#          Replace cuda with npu.
#      Related PR (if no, explain why):
#          https://github.com/vllm-project/vllm/pull/30225
#      Future Plan:
#          Remove this patch once vLLM merges the PR.
#
# ** 3. File: platform/patch_mamba_config.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.model_executor.models.config.HybridAttentionMambaModelConfig.verify_and_update_config`
#      Why:
#          vLLM sets the block size to 16, which is not supported by Ascend.
#      How:
#          Set the block size to 128 on NPU.
#      Related PR (if no, explain why):
#          No, we'll fix this in vLLM soon.
#      Future Plan:
#          Remove this patch once vLLM merges the PR.
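#      A sketch of the override pattern (assuming the block size lives at
#      `vllm_config.cache_config.block_size`, as in recent vLLM versions):
#
#          from vllm.model_executor.models import config as model_config
#
#          _Base = model_config.HybridAttentionMambaModelConfig
#
#          class AscendHybridAttentionMambaModelConfig(_Base):
#              @classmethod
#              def verify_and_update_config(cls, vllm_config) -> None:
#                  super().verify_and_update_config(vllm_config)
#                  # Ascend kernels require 128-token KV-cache blocks.
#                  vllm_config.cache_config.block_size = 128
#
#          model_config.HybridAttentionMambaModelConfig = (
#              AscendHybridAttentionMambaModelConfig)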
#
# ** 4. File: platform/patch_multiproc_executor.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.v1.executor.multiproc_executor.MultiprocExecutor`
#      Why:
#          vLLM creates child processes with daemon=True, which breaks the
#          EPLB case: EPLB needs to spawn a process of its own, and daemonic
#          processes are not allowed to have children.
#      How:
#          Set daemon=False in MultiprocExecutor.
#      Related PR (if no, explain why):
#          No, we need to find a way to support daemon=False in vLLM first.
#      Future Plan:
#          Remove this patch once vLLM fixes the issue.
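#      A runnable demonstration of the failure mode (plain multiprocessing,
#      no vLLM involved):
#
#          import multiprocessing as mp
#
#          def eplb_like_worker():
#              # Inside a daemonic parent this start() raises
#              # "AssertionError: daemonic processes are not allowed
#              # to have children".
#              child = mp.Process(target=print, args=("rebalancing",))
#              child.start()
#              child.join()
#
#          if __name__ == "__main__":
#              p = mp.Process(target=eplb_like_worker,
#                             daemon=True)  # daemon=False is the fix
#              p.start()
#              p.join()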
#
# ** 5. File: platform/patch_sched_yield.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.distributed.utils.USE_SCHED_YIELD`
#      Why:
#          os.sched_yield() doesn't work on Arm systems.
#      How:
#          Avoid using os.sched_yield() on Arm systems.
#      Related PR (if no, explain why):
#          https://github.com/vllm-project/vllm/pull/30228
#      Future Plan:
#          Remove this patch once vLLM merges the PR.
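#      A sketch of the flag flip (the `platform.machine()` detection here is
#      an assumption, not necessarily what the patch file does):
#
#          import platform
#          from vllm.distributed import utils as dist_utils
#
#          # Fall back to vLLM's sleep-based wait path on Arm hosts.
#          if platform.machine().lower() in ("aarch64", "arm64"):
#              dist_utils.USE_SCHED_YIELD = False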
#
#
# * Worker Patch:
# ===============
#
# ** 1. File: worker/patch_deepseek.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `DeepseekV2Model.forward`
#      Why:
#          getattr(self.config, "llama_4_scaling", None) raises
#          AttributeError on NPU in graph mode.
#      How:
#          Catch the AttributeError and set llama_4_scaling to None.
#      Related PR (if no, explain why):
#          No, this is a bug in vLLM Ascend.
#      Future Plan:
#          Find the root cause of this bug and fix it in vLLM Ascend.
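#      The guard reduces to a pattern like this (`_get_llama_4_scaling` is a
#      hypothetical helper name used for illustration):
#
#          def _get_llama_4_scaling(config):
#              # getattr with a default normally cannot raise
#              # AttributeError, but the proxy objects used in NPU graph
#              # mode can; guard explicitly.
#              try:
#                  return getattr(config, "llama_4_scaling", None)
#              except AttributeError:
#                  return None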
#
# ** 2. File: worker/patch_distributed.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.distributed.parallel_state.GroupCoordinator`
#      Why:
#          vLLM doesn't support all_to_all on GroupCoordinator.
#      How:
#          Add an all_to_all implementation to GroupCoordinator.
#      Related PR (if no, explain why):
#          No, we should use vLLM's all2all manager to support all_to_all on
#          NPU instead.
#      Future Plan:
#          Remove this patch once the all2all manager refactor is done.
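#      A sketch of the added method (assuming even splits along dim 0 and
#      that the coordinator's ProcessGroup is `device_group`; the real patch
#      handles uneven splits and more signatures):
#
#          import torch
#          import torch.distributed as dist
#          from vllm.distributed.parallel_state import GroupCoordinator
#
#          def all_to_all(self, input_: torch.Tensor) -> torch.Tensor:
#              # Exchange equal-sized chunks of dim 0 with every rank.
#              output = torch.empty_like(input_)
#              dist.all_to_all_single(output, input_,
#                                     group=self.device_group)
#              return output
#
#          GroupCoordinator.all_to_all = all_to_all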
#
# ** 3. File: worker/patch_minicpm.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.model_executor.models.minicpm.MiniCPMAttention.forward`
#      Why:
#          The forward function of MiniCPMAttention in vLLM converts q and k
#          to float32 to preserve precision on CUDA. However, float32 is not
#          supported by the CANN rope op, so we keep this patch.
#      How:
#          Remove the dtype conversions from forward.
#      Related PR (if no, explain why):
#          No, this is NPU-specific due to the rope op.
#      Future Plan:
#          Keep this patch in vllm-ascend.
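#      Simplified shape of the patched forward (projection and attention
#      details elided; the point is the missing float32 round-trip):
#
#          def forward(self, positions, hidden_states):
#              qkv, _ = self.qkv_proj(hidden_states)
#              q, k, v = qkv.chunk(chunks=3, dim=-1)
#              # Upstream does: q, k = q.float(), k.float() before rotary
#              # embedding and casts back afterwards. The patch applies the
#              # rotary embedding in the original dtype instead.
#              q, k = self.rotary_emb(positions, q, k)
#              attn_output = self.attn(q, k, v)
#              output, _ = self.o_proj(attn_output)
#              return output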
#
# ** 4. File: worker/patch_multimodal_merge.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.model_executor.models.utils._merge_multimodal_embeddings`
#      Why:
#          The '_merge_multimodal_embeddings' func of vLLM is incompatible
#          with Ascend.
#      How:
#          Replace it with a CPU operation that can be executed
#          asynchronously.
#      Related PR (if no, explain why):
#          No, this is an Ascend-only bug. It can't be fixed in vLLM.
#      Future Plan:
#          Identify this pattern in torch-npu and remove this patch.
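#      One way to realize the CPU-side replacement described above (a sketch
#      only; the real patch's signature and index handling may differ):
#
#          import torch
#
#          def _merge_multimodal_embeddings(inputs_embeds, is_multimodal,
#                                           mm_embeds):
#              # Derive integer positions from the mask on the CPU so the
#              # device-side write becomes a plain index_put that can be
#              # issued asynchronously, instead of boolean masking.
#              positions = torch.nonzero(is_multimodal.cpu(),
#                                        as_tuple=True)[0]
#              inputs_embeds[positions.to(inputs_embeds.device)] = mm_embeds
#              return inputs_embeds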
#
# ** 5. File: worker/patch_qwen2_5_omni.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.model_executor.models.qwen2_5_omni_thinker.Qwen2_5OmniThinkerForConditionalGeneration`
#      Why:
#          We have an Ascend-specific forward context that doesn't work with
#          the upstream one.
#      How:
#          Override forward_context in the model file.
#      Related PR (if no, explain why):
#          No, this is an Ascend-only bug. We should drop
#          set_ascend_forward_context.
#      Future Plan:
#          Remove this patch once forward_context is refactored.
#
# ** 6. File: worker/patch_qwen2_5_vl.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.model_executor.models.qwen2_5_vl.Qwen2_5_VLForConditionalGeneration`
#      Why:
#          We have an Ascend-specific forward context that doesn't work with
#          the upstream one.
#      How:
#          Override forward_context in the model file.
#      Related PR (if no, explain why):
#          No, this is an Ascend-only bug. We should drop
#          set_ascend_forward_context.
#      Future Plan:
#          Remove this patch once forward_context is refactored.
#
#   2. `vllm.model_executor.models.qwen2_vl.Qwen2VisionAttention.forward`
#      Why:
#          The attention is not a custom op.
#      How:
#          Make it a custom op so it is pluggable (see the sketch below).
#      Related PR (if no, explain why):
#          https://github.com/vllm-project/vllm/pull/30125
#      Future Plan:
#          Remove this patch once the PR is merged into vLLM.
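#      A sketch of the pluggable direction using vLLM's CustomOp dispatch
#      (the class, op name, and method bodies here are illustrative;
#      `forward_oot` is the out-of-tree platform hook):
#
#          from vllm.model_executor.custom_op import CustomOp
#
#          @CustomOp.register("vision_attention")  # illustrative op name
#          class VisionAttention(CustomOp):
#
#              def forward_native(self, q, k, v):
#                  ...  # reference PyTorch implementation
#
#              def forward_oot(self, q, k, v):
#                  ...  # out-of-tree platforms (e.g. NPU) plug in here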
#
# ** 7. File: worker/patch_qwen3_vl.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.model_executor.models.qwen3_vl.Qwen3_VisionTransformer.forward`
#      Why:
#          The attention is not a custom op.
#      How:
#          Make it a custom op so it is pluggable (same pattern as the
#          sketch above).
#      Related PR (if no, explain why):
#          https://github.com/vllm-project/vllm/pull/30125
#      Future Plan:
#          Remove this patch once the PR is merged into vLLM.
#
# ** 8. File: worker/patch_roberta.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.model_executor.models.bert`
#      Why:
#          The shift operations in `_encode_token_type_ids` and
#          `_decode_token_type_ids` cannot run in Ascend aclgraph mode.
#      How:
#          Replace the shift operations with multiplication and division.
#      Related PR (if no, explain why):
#          No, this needs CANN to add an aclnn shift operation.
#      Future Plan:
#          Revert this once CANN supports an aclnn shift operation.
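#      The replacement relies on the identities x << k == x * 2**k and
#      x >> k == x // 2**k. A sketch (the shift width and packing scheme are
#      illustrative, not the exact ones in bert.py):
#
#          import torch
#
#          SHIFT = 30           # illustrative shift width
#          SCALE = 1 << SHIFT   # computed on the host: no tensor shift op
#
#          def encode_token_type_ids(input_ids, token_type_ids):
#              # was: input_ids | (token_type_ids << SHIFT)
#              return input_ids + token_type_ids * SCALE
#
#          def decode_token_type_ids(packed):
#              # was: packed >> SHIFT plus a mask for the low bits
#              token_type_ids = packed // SCALE
#              input_ids = packed - token_type_ids * SCALE
#              return input_ids, token_type_ids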
#
# ** 9. File: worker/patch_triton.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.model_executor.layers.mamba.ops`, `vllm.model_executor.layers.fla.ops`
#      Why:
#          The Triton ops in vLLM perform poorly on NPU, and vLLM has no
#          dispatch mechanism for Triton ops.
#      How:
#          Override the Triton ops in vLLM with Ascend implementations.
#      Related PR (if no, explain why):
#          No, vLLM needs to support Triton op dispatch first.
#      Future Plan:
#          Remove this patch once vLLM supports the dispatch mechanism.
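#      The override is a plain module-attribute swap; a sketch
#      (`selective_scan_fn` and the Ascend replacement are hypothetical
#      names used for illustration):
#
#          from vllm.model_executor.layers.mamba import ops as mamba_ops
#
#          def ascend_selective_scan(*args, **kwargs):
#              ...  # NPU-friendly implementation goes here
#
#          # Rebind the symbol vLLM calls; callers that imported the module
#          # (rather than the function) pick up the replacement.
#          mamba_ops.selective_scan_fn = ascend_selective_scan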
#
# ** 10. File: worker/patch_weight_loader.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.model_executor.layers.linear.UnquantizedLinearMethod`
#      Why:
#          vLLM Ascend doesn't work with weight loader v2.
#      How:
#          Patch it to fix the bug.
#      Related PR (if no, explain why):
#          No, this is an Ascend-only bug. We should fix it soon.
#      Future Plan:
#          Remove this patch once the bug is fixed.
#
# ** 11. File: worker/patch_qwen3_next_mtp.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.v1.worker.utils.bind_kv_cache`
#      Why:
#          The 'bind_kv_cache' func raises an exception when
#          current_platform is npu.
#      How:
#          Replace it with a new bind_kv_cache that skips the raise.
#      Related PR (if no, explain why):
#          https://github.com/vllm-project/vllm/pull/4770
#      Future Plan:
#          Remove this patch after discussing with the vLLM community and
#          adapting bind_kv_cache to NPU.
#
# ** 12. File: worker/patch_module.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.v1.attention.backends.gdn_attn.torch.argsort`
#      Why:
#          The NPU implementation of 'torch.argsort' does not support bool
#          inputs.
#      How:
#          Replace it with a wrapper that casts bool inputs to torch.int32.
#      Related PR (if no, explain why):
#          https://github.com/vllm-project/vllm/pull/4770
#      Future Plan:
#          Remove this patch once bool is supported by 'torch.argsort' on
#          NPU.
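#      The wrapper reduces to the following sketch (shown patching
#      torch.argsort globally for brevity; the real patch_module.py may
#      scope the override to the gdn_attn module):
#
#          import torch
#
#          _orig_argsort = torch.argsort
#
#          def _argsort_int32(input, *args, **kwargs):
#              # bool has no NPU argsort kernel; the cast preserves the
#              # ordering (False -> 0, True -> 1).
#              if input.dtype == torch.bool:
#                  input = input.to(torch.int32)
#              return _orig_argsort(input, *args, **kwargs)
#
#          torch.argsort = _argsort_int32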
#