xc-llm-ascend/vllm_ascend/patch/__init__.py

#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ----------------------------------------------------------------------------------
# This module manages the patches for vLLM. There are two folders in this module:
# - platform: contains the patches applied before worker starts. It's called by
# `vllm_ascend.utils.adapt_patch(is_global_patch=True)` in
# `vllm_ascend.platform.NPUPlatform.pre_register_and_update()` function.
# - worker: contains the patches applied when worker starts. It's called by
# `vllm_ascend.utils.adapt_patch(is_global_patch=False)` in
# each worker's `__init__` function.
#
# Once a new patch is added in vllm-ascend, please add the patch description into this file as well.
# ----------------------------------------------------------------------------------
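# Illustration only (not part of vllm-ascend): a minimal sketch of the
# two-phase patching described above. `apply_patches`, `PLATFORM_PATCHES`,
# `WORKER_PATCHES`, and the `upstream` stand-in are hypothetical names; the
# real entry point is `vllm_ascend.utils.adapt_patch`, whose implementation
# is not shown in this file.

```python
from types import SimpleNamespace

# Hypothetical stand-in for an upstream module whose attributes get patched.
upstream = SimpleNamespace(platform_setting="original", worker_setting="original")


def _patch_platform_setting() -> None:
    # A "platform" patch: applied once, before any worker starts.
    upstream.platform_setting = "patched"


def _patch_worker_setting() -> None:
    # A "worker" patch: applied when each worker is initialized.
    upstream.worker_setting = "patched"


PLATFORM_PATCHES = [_patch_platform_setting]
WORKER_PATCHES = [_patch_worker_setting]


def apply_patches(is_global_patch: bool) -> int:
    """Run the platform patches if is_global_patch, else the worker patches."""
    patches = PLATFORM_PATCHES if is_global_patch else WORKER_PATCHES
    for patch in patches:
        patch()
    return len(patches)
```

# Calling `apply_patches(is_global_patch=True)` touches only the platform
# group, so worker-side replacements stay untouched until a worker starts.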
# What's Patched and how it works:
# --------------------------------
# * Platform Patch:
# =================
# ** 1. File: platform/patch_distributed.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `torch.distributed.all_reduce`, `torch.distributed.broadcast`
# Why:
# tensor alignment for 310p
# How
# rewrite all_reduce and broadcast in torch.distributed
# Related PR (if no, explain why):
# No, not ready yet.
# Future Plan:
# Find a better way to support tensor alignment for 310p without this patch.
#
# ** 2. File: platform/patch_mamba_config.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.config.HybridAttentionMambaModelConfig.verify_and_update_config`
# Why:
# block size is set to 16 in vLLM which is not supported by Ascend.
# How
# Set block size to 128 on npu.
# Related PR (if no, explain why):
# we'll fix this in vLLM soon.
# Future Plan:
# Remove this patch when vLLM merges the PR.
#
# ** 3. File: platform/patch_multiproc_executor.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.executor.multiproc_executor.MultiprocExecutor`
# Why:
# vLLM creates child processes with daemon=True, which doesn't work in the EPLB case, since EPLB creates
# a new process, and daemonic processes are not allowed to have children.
# How
# Set daemon=False in MultiprocExecutor.
# Related PR (if no, explain why):
# Find a way to support daemon=False in vLLM
# Future Plan:
# Remove this patch when vLLM fixes the issue.
#
# ** 4. File: platform/patch_sched_yield.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.distributed.utils.USE_SCHED_YIELD`
# Why:
# os.sched_yield() doesn't work on Arm systems.
# How
# avoid using os.sched_yield() on Arm systems.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/30228
# Future Plan:
# Remove this patch when vLLM merges the PR.
#
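# Illustration only: a minimal sketch of gating `os.sched_yield()` on the
# host architecture. `is_arm` and `yield_cpu` are hypothetical helpers, and
# falling back to `time.sleep(0)` is an assumption of this sketch, not a
# statement about what the patch file does.

```python
import os
import platform
import time
from typing import Optional


def is_arm(machine: Optional[str] = None) -> bool:
    """Return True for common Arm machine strings such as aarch64 or arm64."""
    name = (machine or platform.machine()).lower()
    return name in ("aarch64", "arm64") or name.startswith("arm")


# Only use os.sched_yield() where it exists and the host is not Arm.
USE_SCHED_YIELD = hasattr(os, "sched_yield") and not is_arm()


def yield_cpu() -> None:
    """Give up the CPU briefly, using whichever mechanism was selected."""
    if USE_SCHED_YIELD:
        os.sched_yield()
    else:
        time.sleep(0)
```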
# ** 5. File: platform/patch_balance_schedule.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.engine.core.EngineCoreProc.run_engine_core`
# `vllm.v1.core.sched.scheduler.Scheduler`
# Why:
# vLLM v1 scheduling currently enables chunked prefill by default, which processes prefill and decode
# requests simultaneously in a single scheduling session. This can impact the overall system throughput
# and performance in some scenarios.
# How
# Set the environment variable VLLM_ASCEND_BALANCE_SCHEDULING=1 in the startup script.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/29721
# Future Plan:
# Remove this patch when vLLM merges the PR.
#
# ** 6. File: platform/patch_fusion_matcher_compat_ops.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `torch.ops._C.rms_norm`, `torch.ops._C.fused_add_rms_norm`,
# Why:
# upstream vLLM initializes fusion matcher global operators at import time.
# On Ascend environment these symbols may be absent and cause import failure.
# How
# inject placeholders only when the symbols are missing so import can continue.
# Related PR (if no, explain why):
# temporary compatibility patch before upstream adjustment is merged.
# Future Plan:
# remove this patch once upstream no longer requires these global symbols or
# provides a backend-safe initialization path.
#
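# Illustration only: a minimal sketch of "inject placeholders only when the
# symbols are missing". `ensure_op` is a hypothetical helper and `ops` is a
# plain stand-in object, not the real `torch.ops._C` namespace.

```python
from types import SimpleNamespace


def ensure_op(namespace, name: str) -> bool:
    """Install a placeholder for `name` only if the namespace lacks it.

    Returns True when a placeholder was injected, False when the real
    symbol was already present and left untouched.
    """
    if hasattr(namespace, name):
        return False

    def _placeholder(*args, **kwargs):
        # The placeholder exists so imports succeed; it must never be run.
        raise NotImplementedError(f"{name} is a compatibility placeholder")

    setattr(namespace, name, _placeholder)
    return True


# Stand-in namespace that already provides one op and lacks another.
ops = SimpleNamespace(rms_norm=lambda x: x)
```

# With this shape, an existing `rms_norm` is never overwritten, while a
# missing `fused_add_rms_norm` becomes resolvable at import time.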
# ** 7. File: platform/patch_minimax_m2_config.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.config.model.ModelConfig._verify_quantization`
# Why:
# MiniMax-M2 fp8 checkpoints on NPU may fail upstream quantization validation.
# vllm-ascend needs to disable fp8 quantization and load bf16 dequantized
# weights in worker-side patches instead.
# How
# Monkey-patch `_verify_quantization` and intercept platform quantization
# verification to force `cfg.quantization=None` for MiniMax-M2 fp8 on NPU.
# Related PR (if no, explain why):
# No, upstream behavior differs across versions and needs discussion.
# Future Plan:
# Remove this patch once upstream supports MiniMax-M2 fp8 on NPU or provides
# a backend-safe validation / override mechanism.
#
# 2. `vllm.config.model.ModelConfig._verify_cuda_graph`
# Why:
# For MiniMax-M2 on NPU with ACL graph capture enabled, HCCL op expansion
# mode affects graph shape coverage. Users may forget to set it.
# How
# If user doesn't set it, set `HCCL_OP_EXPANSION_MODE=AIV` for this model
# and log a warning when a different value is detected.
# Related PR (if no, explain why):
# No, this is an environment-specific tuning knob.
# Future Plan:
# Remove this patch if upstream provides an official NPU graph-capture
# guidance / auto-configuration path for HCCL.
#
# ** 8. File: platform/patch_kv_cache_interface.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.kv_cache_interface.MLAAttentionSpec`
# Why:
# The default `MLAAttentionSpec` is mainly built around `kv_lora_rank`
# and `qk_rope_head_dim`. On NPU, we also use this class to describe DSA
# models. Unlike the GPU path, where cache management is handled by an
# additional indexer module, extending this class directly simplifies the
# corresponding `model_runner` implementation on NPU.
#
# This patch also adds Sparse C8 support for DSA models on NPU. As part
# of that support, members such as `page_size_bytes` need to be adapted,
# so they are overridden here as well to preserve overall readability.
# How:
# This patch subclasses the original implementation, overrides selected
# methods, and adds DSA-specific attributes and helpers with default
# values where needed.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/25896
# Future Plan:
# Remove this patch after the upcoming KV cache spec refactor.
#
# * Worker Patch:
# ===============
#
# ** 1. File: worker/patch_distributed.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.distributed.parallel_state.GroupCoordinator`
# Why:
# vllm doesn't support all_to_all for GroupCoordinator.
# all_reduce in vLLM is not a custom op, which makes the MatmulAllReduceAddRMSNorm fusion fail.
# How
# Add all_to_all implementation for GroupCoordinator.
# Make all_reduce a custom op.
# Related PR (if no, explain why):
# No, we should use the vLLM all2all manager to support all_to_all for npu.
# Future Plan:
# Remove this patch when the refactor of all2all manager is done.
# Remove this patch when vLLM supports all_reduce as a custom op.
#
# ** 2. File: worker/patch_multimodal_merge.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.utils._merge_multimodal_embeddings`
# Why:
# '_merge_multimodal_embeddings' func of vllm is incompatible with Ascend.
# How
# Replace with CPU operation that can be executed asynchronously.
# Related PR (if no, explain why):
# This is an Ascend-only bug. It can't be fixed in vLLM.
# Future Plan:
# Identify this pattern in torch-npu and remove this patch.
#
# ** 3. File: worker/patch_bert.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.bert._encode_token_type_ids`
# `vllm.model_executor.models.bert._decode_token_type_ids`
# Why:
# The shift operation in `_encode_token_type_ids` and `_decode_token_type_ids` cannot run in Ascend aclgraph mode.
# How
# Replace shift operation with multiplication and division.
# Related PR (if no, explain why):
# No, this needs CANN to add an aclnn shift operation.
# Future Plan:
# Revert this when CANN supports an aclnn shift operation.
#
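# Illustration only: a minimal sketch of replacing bit shifts with
# multiplication and division. The shift width of 30 and the helper names
# are assumptions of this sketch; it uses plain non-negative Python ints
# rather than tensors.

```python
TOKEN_TYPE_SHIFT = 30  # assumed bit position of the token type field
TOKEN_TYPE_MULTIPLIER = 1 << TOKEN_TYPE_SHIFT  # computed once, on the host


def encode_token_type(input_id: int, token_type_id: int) -> int:
    """Pack token_type_id above input_id without using a shift.

    Equal to input_id | (token_type_id << TOKEN_TYPE_SHIFT) as long as
    0 <= input_id < 2**TOKEN_TYPE_SHIFT.
    """
    return input_id + token_type_id * TOKEN_TYPE_MULTIPLIER


def decode_token_type(encoded: int) -> tuple:
    """Unpack (input_id, token_type_id) using floor division only."""
    token_type_id = encoded // TOKEN_TYPE_MULTIPLIER
    input_id = encoded - token_type_id * TOKEN_TYPE_MULTIPLIER
    return input_id, token_type_id
```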
# ** 4. File: worker/patch_triton.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.layers.mamba.ops`, `vllm.model_executor.layers.fla.ops`,
# `vllm.v1.worker.gpu.sample.gumbel.gumbel_sample`
# Why:
# triton ops in vLLM perform poorly on NPU, and there is no dispatch mechanism for triton ops.
# How
# Override triton ops in vLLM with the Ascend implementation.
# Related PR (if no, explain why):
# Let vLLM support triton ops dispatch.
# Future Plan:
# Remove this patch when vLLM supports the dispatch function.
#
# ** 5. File: worker/patch_qwen3_next_mtp.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.worker.utils.bind_kv_cache`
# Why:
# 'bind_kv_cache' func will raise an exception when current_platform is npu.
# How
# Replace with a new bind_kv_cache.
# Skip the raise.
# Related PR (if no, explain why):
# It needs discussion.
# Future Plan:
# Remove this patch after discussing with vllm community and adapting bind_kv_cache to npu.
#
# ** 6. File: worker/patch_rejection_sampler.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.sample.rejection_sampler`
# Why:
# - some functions from `rejection_sampler` are not supported or slow on npu.
# How
# - add npu_top_k_top_p to 'apply_sampling_constraints' func
# - add custom triton kernel to `expand_batch_to_tokens` and `rejection_sample`
# Related PR (if no, explain why):
# Let vLLM support triton ops dispatch.
# Future Plan:
# 1. make these functions class methods of RejectionSampler, create AscendRejectionSampler
# to override them, then delete the patch file `worker/patch_rejection_sampler.py`.
# 2. make these functions custom ops, then remove AscendRejectionSampler
#
# ** 7. File: worker/patch_module.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.attention.backends.gdn_attn.torch.argsort`
# Why:
# 1. 'torch.argsort' func of npu does not support bool.
# 2. Without `stable=True`, the output will have a lot of redundant tokens.
# How
# Replace with a new torch.argsort that will cast the input to torch.int32
# and do stable sort.
# Related PR (if no, explain why):
# 1. It depends on torch_npu.
# 2. https://github.com/vllm-project/vllm/pull/30632
# Future Plan:
# Remove this patch when bool is supported in 'torch.argsort' func of npu.
# Make 'torch.argsort' in `vllm.v1.attention.backends.gdn_attn` be stable.
#
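# Illustration only: why the cast plus stable sort matters. The patch
# itself wraps `torch.argsort`; this pure-Python stand-in
# (`stable_argsort_bool` is a hypothetical name) shows the same idea of
# converting bools to integers and sorting stably so equal keys keep their
# original relative order.

```python
def stable_argsort_bool(mask):
    """Return indices that sort a bool sequence, False first, stably."""
    # Cast bool -> int, mirroring the cast to an integer dtype.
    keys = [int(flag) for flag in mask]
    # Python's sorted() is guaranteed stable, so ties keep input order.
    return sorted(range(len(keys)), key=keys.__getitem__)
```

# For the mask [True, False, True, False] the False positions (1, 3) come
# first in their original order, followed by the True positions (0, 2).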
# ** 8. File: worker/patch_qwen3_next.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.qwen3_next.Qwen3NextGatedDeltaNet.forward`
# Why:
# The Qwen3Next GatedDeltaNet forward cannot directly add custom operators.
# How
# Add a branch in Qwen3NextGatedDeltaNet.forward to adapt to fused_qkvzba_split_reshape_cat.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/30863
# Future Plan:
# Remove this patch when vLLM supports these operators.
#
# 2. `vllm.model_executor.models.qwen3_next.Qwen3NextGatedDeltaNet._forward_core`
# Why:
# The triton ops fused_recurrent_gated_delta_rule and fused_gdn_gating in vLLM perform poorly on NPU.
# How
# Add new fused triton ops with an Ascend implementation.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/30860
# Future Plan:
# Remove this patch when vLLM supports these operators.
#
# 3. `vllm.model_executor.models.qwen3_next.Qwen3NextGatedDeltaNet._forward_core`
# Why:
# The Qwen3Next GatedDeltaNet _forward_core cannot directly add custom operators.
# How
# Add a branch in Qwen3NextGatedDeltaNet._forward_core to adapt to fused_gdn_gating_patch.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/31002
# Future Plan:
# Remove this patch when vLLM supports these operators.
#
# ** 9. File: worker/patch_huanyuan_vl.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.transformers_utils.processors.hunyuan_vl.HunYuanVLProcessor.__call__`
# Why:
# The `add_special_tokens` parameter is not supported by default in the processor.
# How
# Remove the `add_special_tokens` parameter from kwargs before calling the original method.
# Future Plan:
# Remove this patch when vLLM aligns with the latest processor implementation.
#
# ** 10. File: worker/patch_v2_eagle.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.worker.gpu.spec_decode.eagle.EagleSpeculator.propose`
# Why:
# The `propose` method uses torch.gather, but the gather operator
# pollutes the arguments passed to it. The bug has been reported to the Huawei
# CANN team, but is not fixed yet.
# How
# Clone the out attribute ahead of gather to avoid the bug.
# Future Plan:
# Remove this patch when CANN fixes the gather bug.
#
# ** 11. File: worker/patch_unquantized_gemm.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.layers.utils.default_unquantized_gemm`
# Why:
# unquantized_gemm in vLLM is not a custom op, which makes the MatmulAllReduceAddRMSNorm fusion fail.
# How
# Make unquantized_gemm a custom op.
# Future Plan:
# Remove this patch when vLLM supports the operator as a custom op.
#
# ** 12. File: worker/patch_npugraph_ex_triton.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `torchair.core._concrete_graph.ValuePack`,
# `torchair.npu_fx_compiler._unpack_meta`,
# `torchair.npu_fx_compiler._NpuGraphConverter._unpack_npu`
# Why:
# In the Triton scenario, npugraph_ex backend needs to process the value pack of the input parameters.
# How
# Supplement the relevant processing logic through patches.
# Related PR (if no, explain why):
# https://gitcode.com/Ascend/torchair/pull/2575
# Future Plan:
# Remove this patch when the PTA version used by vllm-ascend has been upgraded.
#
# ** 13. File: worker/patch_v2_uva.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.worker.gpu.states.UvaBuffer`
# Why:
# ASCEND NPUs do not support UVA yet, so we need to wrap it in vLLM.
# How
# make UvaBuffer a dummy class, mimic the interface of vllm UvaBuffer.
# Future Plan:
# Remove this patch when NPU supports UVA.
#
# ** 14. File: worker/patch_kimi_k25.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.kimi_k25_vit.Learnable2DInterpPosEmbDivided_fixed.forward`
# Why:
# The forward method uses interpolate with ops not supported on NPU.
# How
# Replace with a new forward that uses CPU for interpolate when shape mismatch,
# and use get_rope_shape to handle the rope shape interpolation.
# Future Plan:
# Remove this patch when vLLM aligns with the latest main.
#
# ** 15. File: worker/patch_routed_experts_capturer.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.layers.fused_moe.routed_experts_capturer.RoutedExpertsCapturer.init_buffer`
# Why:
# The `_device_buffer` initialization in vLLM uses `device="cuda"` hardcoded,
# which doesn't work on NPU.
# How
# Replace `device="cuda"` with `device=current_platform.device_name` to support NPU.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/34336
# Future Plan:
# Remove this patch when vLLM merges the PR.
#
# ** 16. File: worker/patch_draft_quarot.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.llama_eagle3.Eagle3LlamaForCausalLM.load_weights`
# Why:
# vllm-ascend reuses vLLM's drafter model loading logic,
# but that logic does not handle Ascend quantization.
# How
# Dynamically replace the `load_weights` function at runtime,
# and fix `target_config` into the new implementation with a closure.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/36225
# Future Plan:
# Remove this patch when vLLM merges the PR.
#
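# Illustration only: a minimal sketch of replacing a method at runtime while
# fixing a config value with a closure. `Drafter`, `make_load_weights`,
# `patch_load_weights`, and the "quant" key are hypothetical; the real
# `load_weights` logic is not reproduced here.

```python
import types


class Drafter:
    """Hypothetical stand-in for a drafter model class."""

    def load_weights(self, weights):
        return None, [name for name, _ in weights]


def make_load_weights(target_config: dict):
    """Build a replacement load_weights that remembers target_config."""

    def load_weights(self, weights):
        # target_config is captured by the closure, so callers keep using
        # the original one-argument signature.
        return target_config["quant"], [name for name, _ in weights]

    return load_weights


def patch_load_weights(model, target_config: dict) -> None:
    # Bind the new function to this instance only, at runtime.
    model.load_weights = types.MethodType(make_load_weights(target_config), model)
```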
# ** 17. File: worker/patch_minimax_m2.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.minimax_m2.MiniMaxM2MoE.forward`
# Why:
# In TP mode, MiniMax-M2 MoE needs a backend-aware reduction path to avoid
# unnecessary communication / maintain correctness on NPU.
# How
# Replace the forward to call `experts.maybe_all_reduce_tensor_model_parallel`
# when `tp_size > 1`.
# Related PR (if no, explain why):
# No, model-specific behavior.
# Future Plan:
# Move this behavior upstream once a generic MoE reduce hook exists.
#
# 2. `vllm.model_executor.models.minimax_m2.MiniMaxM2Attention.__init__`
# Why:
# When total kv heads < TP world size, kv head replication happens and k_norm
# weights should be sharded to match the replication layout.
# How
# Add `num_kv_head_replicas` and create sharded `k_norm` via
# `MiniMaxText01RMSNormTP(..., weight_shard_world_size=total_num_kv_heads, ...)`.
# Related PR (if no, explain why):
# No, depends on Ascend kernel behavior and TP layout.
# Future Plan:
# Remove this patch if upstream implements kv-head-aware norm sharding.
#
# 3. `vllm.model_executor.models.minimax_m2.MiniMaxM2Model.load_weights`
# Why:
# MiniMax-M2 fp8 checkpoints may store fp8 weights with per-block inverse
# scales. On NPU we load bf16 weights by dequantizing at load time.
# How
# Inject fp8 dequant helpers and wrap `load_weights` to convert fp8 weight +
# `weight_scale_inv` pairs into bf16 blocks before delegating to upstream.
# Related PR (if no, explain why):
# No, fp8 load format and backend constraints are model/backend specific.
# Future Plan:
# Remove this patch when upstream supports MiniMax-M2 fp8 loading on NPU.
#
# ** 18. File: worker/patch_minimax_m2_linear_attn.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.layers.mamba.linear_attn.MiniMaxText01RMSNormTP.__init__`
# `vllm.model_executor.layers.mamba.linear_attn.MiniMaxText01RMSNormTP.weight_loader`
# Why:
# MiniMax-M2 linear attention RMSNorm needs weight sharding that can follow
# TP layout (and sometimes kv-head replication) on NPU.
# How
# Override `__init__` to parameterize weight shard world/rank and install a
# sharded `weight_loader` implementation.
# Related PR (if no, explain why):
# No, upstream API surface differs across versions.
# Future Plan:
# Remove this patch when upstream exposes stable sharding hooks for this layer.
#
# 2. `vllm.model_executor.layers.mamba.linear_attn.MiniMaxText01RMSNormTP.forward_qk`
# (or older `_normalize_qk`)
# Why:
# q/k norm for linear attention is performance-sensitive. On NPU, a fused
# rms_norm kernel is faster and TP needs a global rstd correction.
# How
# Replace q/k normalization with NPU rms_norm fast path and TP-global rstd
# correction; fall back to upstream implementation on non-NPU.
# Related PR (if no, explain why):
# No, backend-specific optimization.
# Future Plan:
# Remove this patch when upstream adds a backend dispatch path for q/k norm.
#
# ** 19. File: worker/patch_qwen3_5.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.qwen3_5.Qwen3_5GatedDeltaNet._forward_core`
# Why:
# The class Qwen3_5GatedDeltaNet reuses the `_forward_core` method of Qwen3NextGatedDeltaNet,
# but the ascendC ops of Qwen3NextGatedDeltaNet do not support ssm_state with float32 format.
# How
# patch Qwen3_5GatedDeltaNet._forward_core to use triton ops like `fused_recurrent_gated_delta_rule`.
# Future Plan:
# Remove this patch when all ops in _forward_core support both Qwen3_5 and Qwen3Next.
#
# ** 20. File: worker/patch_cudagraph.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.cudagraph_dispatcher.CudagraphDispatcher._create_padded_batch_descriptor`
# Why:
# vLLM's FULL cudagraph mode raises an error here; this patch avoids it
# so that FULL mode can be enabled.
# How
# Dynamically replace the `_create_padded_batch_descriptor` function at runtime,
# and change the condition of its `if` check.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/34880
# Future Plan:
# Remove this patch when vLLM merges the PR.
#
# ** 21. File: worker/patch_deepseek_mtp.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.deepseek_v2.get_spec_layer_idx_from_weight_name` and
# `vllm.model_executor.models.deepseek_mtp.get_spec_layer_idx_from_weight_name`
# Why:
# When GLM5 uses rotary quant in vllm-ascend, the MTP layer needs to load an extra weight
# named `rot.weight`.
# How
# If weight name starts with `rot`, return `layer_id + i` like other tensors in MTP layer.
# Related PR (if no, explain why):
# Rotary quant is a unique feature of vllm-ascend.
# Future Plan:
# Remove this patch when vllm supports rotary quant or pluggable `MultiTokenPredictorLayer`.
#
# 2. `vllm.model_executor.models.deepseek_mtp.DeepSeekMultiTokenPredictorLayer`
# Why:
# When GLM5 uses rotary quant in vllm-ascend, the `previous_hidden_states` does not .
# How
# If the target model uses rotary quant, a new linear operation is added before `ehnorm`.
# Related PR (if no, explain why):
# Rotary quant is a unique feature of vllm-ascend.
# Future Plan:
# Remove this patch when vllm supports rotary quant or pluggable `MultiTokenPredictorLayer`.
#
# 3. `vllm.model_executor.models.deepseek_mtp.DeepSeekMTP._rewrite_spec_layer_name`
# Why:
# Rename `rot.weight` to match the format of weights in `DeepSeekMTP`.
# How
# If the weight name is `rot`, rename it to `model.layers.{spec_layer}.rot.weight`.
# Related PR (if no, explain why):
# Rotary quant is a unique feature of vllm-ascend.
# Future Plan:
# Remove this patch when vllm supports rotary quant or pluggable `MultiTokenPredictorLayer`.
# ** 22. File: worker/patch_mamba_utils.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.worker.mamba_utils.batch_memcpy_kernel = batch_memcpy_kernel`
# Why:
# The original batch_memcpy_kernel implemented in vLLM might encounter bugs when running on
# Ascend hardware.
# How
# Patch to fix related bugs.
# Future Plan:
# Remove this patch when:
# (1) the original batch_memcpy_kernel can run on Ascend hardware.
# or
# (2) a dispatch mechanism is designed for batch_memcpy_kernel.
#
# 2. `vllm.v1.worker.mamba_utils.batch_memcpy = batch_memcpy`
# Why:
# vLLM uses BLOCK_SIZE 1024 for batch_memcpy_kernel. This results in suboptimal performance
# on Ascend hardware.
# How
# patch to change BLOCK_SIZE to 8192.
# Future Plan:
# Remove this patch when:
# design a dispatch mechanism for batch_memcpy_kernel.
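#
# Illustration only: the arithmetic behind the BLOCK_SIZE change, assuming
# the kernel moves each buffer in BLOCK_SIZE-sized chunks (an assumption of
# this sketch, not a description of the kernel's internals).
# `chunks_per_copy` is a hypothetical helper.

```python
def chunks_per_copy(num_bytes: int, block_size: int) -> int:
    """Number of block_size chunks needed to cover num_bytes (ceil division)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return -(-num_bytes // block_size)
```

# Under that assumption, a 1 MiB copy needs 1024 chunks at BLOCK_SIZE 1024
# and 128 chunks at BLOCK_SIZE 8192, i.e. 8x fewer iterations per buffer.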